Comparing groupby, pivot_table, and crosstab on Covid Smell Loss Data

We routinely use groupby, pivot_table, and crosstab along with stack and unstack to pivot data by variables in order to uncover relationships and outliers. I was curious how they actually compared. I came across this comparison of the three operations on varying dataframe sizes by Ramiro Gomez. Today, we’ll demonstrate groupby, pivot_table, and crosstab in achieving the same dataframe and compare their ease of use and runtime.

Groupby, pivot_table, and crosstab can be used interchangeably as a matter of choice and data types, though they have subtle differences. As we’ll see below, groupby is generally fastest, but it’s not as intuitive as pivot_table where the arguments are explicitly stated and requires multiple operations. Pivot_table and crosstab have margin arguments if you want to include aggregates like grand totals. Keep in mind that pivot_table generates a multiindex dataframe and also automatically sorts (groupby has an optional sort argument). Crosstab works similarly to pivot_table, but can take an array or series as input, which may hinder its runtime as it converts to a dataframe at the end.

We’ll work with data from the Global Consortium for Chemosensory Research (GCCR), a group of scientists, physicians, and patient advocates founded in response to the covid pandemic. I came across their work from a recent New York Times longform piece on the mysteries of smell.

While the data seems fascinating to explore, let’s take a rain check and focus on comparing groupby, pivot_table, and crosstab. As an example task, we’ll create a dataframe that breaks out average loss of smell during covid illness by age group and self-reporting of loss of smell as a symptom. Age groups were created using pd.cut and arbitrary cutoffs. As an FYI, researchers quantified change in smell using a 0-100 visual analog scale. If you’re curious, check out GCCR’s study and methodology on using smell loss as a logistic predictor for Covid-19. We’ll finally calculate and plot average runtime for each operation using timeit().

Code below and in more readable form on colab and Git.

Leave a Reply

Discover more from crawstat.

Subscribe now to keep reading and get access to the full archive.

Continue reading