
A little surprise in speed using pandas: going back to the roots helps.

Jose Ferro
3 min read · Nov 10, 2021


TL;DR: If you use pandas only for simple lookups in a relatively small table, going back to Python's built-in methods may pay off in terms of speed.

The situation: I have a data table that I read into a pandas DataFrame and use purely as a lookup table (via pandas .loc). By a lookup I mean finding the row whose column value equals a particular value (those values are unique).

To be clear, I am only interested in speed; for this particular task I don't need any of pandas' fancier features. I need to keep around 500 to 1000 lookup calls well below one second, i.e. the user must not feel a delay when clicking a button.

The raw data is a table, in whichever form, so a pythonista will naturally think of pandas. The raw data can of course also be treated as a list of dictionaries, as the examples below show.

The example data:

Here I create a mock list of dicts with 1000 entries that will be the rows of my pandas DataFrame.
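The original code snippet is not reproduced here; a minimal sketch of such mock data (the column names "name" and "value" are my assumptions) could look like:

```python
import pandas as pd

# Mock data: 1000 rows, each with a unique name and a payload value.
rows = [{"name": f"name_{i}", "value": i} for i in range(1000)]

# The same data as a pandas DataFrame, for the .loc-style lookups.
df = pd.DataFrame(rows)
```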

The ways to perform a lookup:

I evaluate three ways of performing a lookup:

a) generator

b) list comprehension

c) pandas loc

The goal is to find the element of the list of dictionaries with a particular name. I looked for one element towards the start of the list and another towards the end.
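The three variants can be sketched as follows (again, the data layout and the target name are assumptions, not the article's original code):

```python
import pandas as pd

rows = [{"name": f"name_{i}", "value": i} for i in range(1000)]
df = pd.DataFrame(rows)

target = "name_990"  # a value towards the end of the list

# a) generator: linear scan that stops at the first match
hit_gen = next(r for r in rows if r["name"] == target)

# b) list comprehension: scans the whole list, then take the first match
hit_lc = [r for r in rows if r["name"] == target][0]

# c) pandas .loc with a boolean mask
hit_loc = df.loc[df["name"] == target].iloc[0].to_dict()

assert hit_gen == hit_lc == hit_loc
```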

With either of the first two methods, a lookup takes between 50 microseconds and 1.3 milliseconds. That is around 25 times faster than using pandas. It is therefore worth asking yourself whether pandas always pays off.

Nevertheless, notice that when looking for a value towards the end of the list, the time increases for both the generator and the list comprehension, whereas the pandas DataFrame is barely affected.

This made me wonder whether a much bigger list would let pandas' .loc outperform the built-in methods for searching a list of data. Here are the results.

THE POWER OF PANDAS

When dealing with huge data, the power of pandas surfaces. I performed the same measurements on a table with 1 million rows, with the following results:
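A sketch of this kind of timing comparison, using the standard-library timeit module (exact numbers depend on the machine; the target is placed at the end of the list, the worst case for a linear scan):

```python
import timeit

import pandas as pd

N = 1_000_000
rows = [{"name": f"name_{i}", "value": i} for i in range(N)]
df = pd.DataFrame(rows)

target = f"name_{N - 1}"  # worst case for the linear-scan methods

# Time each lookup variant over a few repetitions.
t_gen = timeit.timeit(
    lambda: next(r for r in rows if r["name"] == target), number=10)
t_lc = timeit.timeit(
    lambda: [r for r in rows if r["name"] == target][0], number=10)
t_loc = timeit.timeit(
    lambda: df.loc[df["name"] == target], number=10)

print(f"generator: {t_gen:.3f}s  list comp: {t_lc:.3f}s  .loc: {t_loc:.3f}s")
```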

The time consumed by pandas remains more or less constant (a small increase), whereas the other two methods get substantially slower, to the point that pandas outperforms both the generator and the list comprehension by being around 10 times faster.

CONCLUSION

Pandas is a library well suited to large amounts of data, but when you only need lookup functionality, and up to a certain number of rows (here I only checked at 1000), the built-in methods outperform pandas by quite a bit (~25×) and are just as concise as .loc.


Jose Ferro

Python coder. NLP. Pandas. Currently heavily involved with ipywidgets (ipysheet, ipyvuetify, ipycytoscape)