Comparing Filtering Methods on MoMA Artworks

We regularly filter dataframes using boolean masking, loc, query, and numpy.where, and I was curious how they actually compare. I came across a similar comparison of three such operations by Ramiro Gomez. Today, we’ll use all four filtering methods to produce the same filtered dataframe and compare their ease of use and runtime.

Boolean masking, loc, query, and numpy.where can be used interchangeably, and the choice is largely a matter of preference and data types. I find query to be the most elegant because it doesn’t require you to repeatedly reference the dataframe (it can also filter in place if you pass inplace=True). We’ll see below that, for the size of data set we’re using (>100,000 records), query is also the fastest, followed by loc, boolean masking, and np.where. np.where is the slowest because it must first create an array of the index positions that meet our conditions.
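To make the interchangeability concrete, here is a minimal sketch of the four approaches returning the same rows. The toy dataframe and its column names (Title, Classification, YearAcquired) are made up for illustration and are not the actual MoMA columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the MoMA data; columns and values are illustrative only.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Classification": ["Sculpture", "Painting", "Sculpture", "Sculpture"],
    "YearAcquired": [1985, 1995, 2001, 1992],
})

# One boolean Series describing the rows we want to keep.
mask = (df["Classification"] == "Sculpture") & (df["YearAcquired"] >= 1990)

by_mask = df[mask]                         # boolean masking
by_loc = df.loc[mask]                      # label-based indexing with a mask
by_query = df.query(                       # expression string, no repeated df[...]
    "Classification == 'Sculpture' and YearAcquired >= 1990"
)
positions = np.where(mask.to_numpy())[0]   # integer positions where mask is True
by_where = df.iloc[positions]              # select those positions

print(by_query)
```

Note the extra step in the np.where version: the mask is first converted to an array of integer positions, which is then passed to iloc.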

We’ll work with data from the Museum of Modern Art (MoMA). It covers >130,000 artworks acquired by MoMA, with fields including title, artist, date made, medium, dimensions, and date acquired. MoMA also has a data set of artists. If you’re interested in other art data sets, check out these from The Tate and Carnegie Museum of Art.

As an example task for comparing the performance of each filtering method, let’s create a filtered dataframe of sculptures made with bronze by American or German women whose first names start with the letter K and that were acquired by MoMA from 1990 onward. Let’s then calculate and plot the average runtime of each operation using timeit.
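The task above might be sketched roughly as follows. This is not the post's original code: the dataframe is synthetic, and the column names (Artist, Nationality, Gender, Classification, Medium, YearAcquired), the artist names, and the helper build_mask are all assumptions made for illustration. The query call passes engine="python" so that the string methods inside the expression are evaluated by pandas itself. Plotting is left out to keep the sketch short.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the MoMA artworks table (all values are hypothetical).
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "Artist": rng.choice(["Kara Example", "Karl Example", "Anna Example"], n),
    "Nationality": rng.choice(["American", "German", "French"], n),
    "Gender": rng.choice(["Female", "Male"], n),
    "Classification": rng.choice(["Sculpture", "Painting", "Print"], n),
    "Medium": rng.choice(["bronze", "painted bronze", "oil on canvas"], n),
    "YearAcquired": rng.integers(1930, 2021, n),
})


def build_mask(d):
    """Boolean Series for: bronze sculptures by American or German women
    whose names start with K, acquired from 1990 onward."""
    return (
        (d["Classification"] == "Sculpture")
        & d["Medium"].str.contains("bronze")
        & d["Nationality"].isin(["American", "German"])
        & (d["Gender"] == "Female")
        & d["Artist"].str.startswith("K")
        & (d["YearAcquired"] >= 1990)
    )


# The same conditions written as a query expression.
QUERY = (
    "Classification == 'Sculpture' and Medium.str.contains('bronze') "
    "and Nationality in ['American', 'German'] and Gender == 'Female' "
    "and Artist.str.startswith('K') and YearAcquired >= 1990"
)

methods = {
    "boolean mask": lambda: df[build_mask(df)],
    "loc": lambda: df.loc[build_mask(df)],
    "query": lambda: df.query(QUERY, engine="python"),
    "np.where": lambda: df.iloc[np.where(build_mask(df).to_numpy())[0]],
}

# Check that every method yields the same rows, then time each one.
results = {name: fn() for name, fn in methods.items()}
runs = 5
timings = {name: timeit.timeit(fn, number=runs) / runs for name, fn in methods.items()}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>12}: {seconds * 1e3:.2f} ms per run")
```

The ordering printed by this sketch depends on the machine, the pandas version, and the synthetic data, so it need not match the ranking reported for the real MoMA data.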

The code is below and, in more readable form, on Colab and GitHub.
