crawstat.

data. analytics. models.

Tag: Data

GMV-Optimizing My Wife’s Portfolio

My wife caught the investing bug. She recently put together a portfolio of healthcare, consumer (cyclical and defensive), technology services,…
March 22, 2021 hpoola
Product Focus & Recs Using Pareto, Apriori, & Association

Process Data Let’s clean and process the UCI Machine Learning Online Retail data set so it’s ready to use later…
March 16, 2021 hpoola
Detecting & Treating Price Outliers

A few outliers can significantly skew our understanding of variables or influence a predictive model. Depending on the source and…
March 11, 2021 hpoola
Cohort Metrics for a Fictitious Online Wine Shop

Today, let’s construct customer time cohorts, specifically first purchase dates, and look at metrics like retention and revenue by cohort.…
March 5, 2021 hpoola
Customer RFMT Metrics & Segmentation

Today, we’ll construct recency, frequency, monetary value, and tenure (RFMT) metrics and segments using Brazilian ecommerce marketplace Olist’s sales transactions…
March 3, 2021 hpoola
Comparing Filtering Methods on MoMA Artworks

We regularly filter dataframes using boolean masking, loc, query, and numpy.where. I was curious how they actually compared. I came…
February 24, 2021 hpoola
Peloton’s Implied Volatility & Delta Hedge

Today, let’s dig into Peloton (ticker: PTON). The purpose of this post is to: Forecast PTON’s volatility at a point…
February 22, 2021 hpoola
Ebay Kleinanzeigen Peppa Pig Scrape

My daughter Danae’s birthday is on the horizon. Since she adores Peppa Pig, I scraped Ebay Kleinanzeigen (Germany’s eBay classifieds)…
February 16, 2021 hpoola
Comparing groupby, pivot_table, and crosstab on Covid Smell Loss Data

We routinely use groupby, pivot_table, and crosstab along with stack and unstack to pivot data by variables in order to…
February 8, 2021 (updated February 9, 2021) hpoola
Moderna & AbbVie: Dynamic Covariance

Today, let’s look at dynamic covariance for biotech diversification and weighting, building on our previous post on GARCH models for…
February 3, 2021 hpoola
Spotify: Dynamic Conditional Beta

Spotify (ticker: SPOT), which went public cash-flow positive via a direct listing on April 3, 2018, is one of…
February 1, 2021 hpoola
Cleaner, Clearer Code with Pipe

I first came across pandas pipe last month at a Berlin Time Series Analysis meetup and in a TDS post.…
January 25, 2021 hpoola
Gun Sales: Identifying Time Series Structural Change

In a previous post, we explored and visualized population-adjusted gun sales, looking at many dimensions of time series data. We…
January 13, 2021 (updated January 21, 2021) hpoola
Cartoonize Your LinkedIn Photo

In our post Fun With Numpy from a few months ago, we loaded a color photo, converted it to a…
January 4, 2021 hpoola
NBA Team Dashboard

It’s hard to believe the NBA 2021 season has started! We’ve created an NBA Team Dashboard that lets fans explore…
December 29, 2020 hpoola
Imputing Time Series Missing Values

Today, let’s see how different missing value impute methods stack up for various types of time series. It was inspired…
December 18, 2020 (updated December 19, 2020) hpoola
Berlin Covid Dashboard

We’ve created an interactive dashboard that explores Berlin’s covid profile and evolution by district. It was born out of the…
December 16, 2020 hpoola
Guns: Time Series Analysis and Forecast

Today, we’ll do a deep-dive time series exploration and SARIMAX forecast using data on gun background checks in the U.S.…
December 15, 2020 hpoola
Ode to Tegel

On November 8, 2020, Tegel (TXL) saw its last flight take off, closing its doors after 47 years in operation…
November 17, 2020 hpoola
FiveThirtyEight’s Pollster Ratings & Goodreads: Reshaping Wide to Long

Data is commonly in wide format, with rows being unique individual observations (no repeated records) and columns being features for…
November 5, 2020 hpoola
Astronaut Flight Records: Reshaping List-Like Data

We often encounter messy list-like or string data that needs to be reshaped into separate rows for further analysis, including…
November 4, 2020 hpoola
Poll Closing & Key Races by State

It’s hard to believe that it’s finally Election Day. To curb results-watching anxiety, it would be great to have poll…
November 3, 2020 hpoola
SpaceX & Wine: Reshaping Nested Data

We often encounter nested data, including in json files or dataframes with nested data columns. Today, we’ll walk through a…
November 2, 2020 hpoola
Creating an IPO Spider

350+ companies and counting have IPO’d on U.S. exchanges in 2020, the most since 2000, including big names like Palantir,…
October 29, 2020 hpoola
’96 Sonics: 3 Regex Methods to Split Names

Today, we’ll walk through 3 simple regex methods to split names into first and last names in a dataframe. We’ll…
October 2, 2020 hpoola
Berlin Temperatures 2009-2019

Extreme temperature is an increasingly important topic to understand. As Berlin is my home, it would be great to visualize…
September 23, 2020 (updated September 29, 2020) hpoola
Going Bananas

Today, we’ll predict (in-sample) and forecast (out-of-sample) banana prices using time series data from January 1, 1990 through July 1,…
September 3, 2020 hpoola
Building a Value-Weighted Telehealth Index

Telehealth – remote care delivery, patient monitoring, and education / engagement via a broad range of digital technologies such as…
August 27, 2020 hpoola
Simulating Netflix Equity Price and Returns

In the coming weeks, we’ll focus on some super-fun time-series data, including accessing and working with time-indexed data and building…
August 19, 2020 (updated August 24, 2020) hpoola
Berlin’s Airbnb Market & Venue Clusters by Neighborhood

As one of Europe’s fastest-growing economies, Berlin has quickly grown into a tourism and residential real estate magnet. Imagine a…
August 3, 2020 (updated August 11, 2020) hpoola
Mapping, Segmenting, and Clustering Brooklyn’s Neighborhoods

Imagine you’re a retailer, restaurant, bar, or salon looking for the best spot for a new location in Brooklyn. Or,…
July 20, 2020 (updated August 11, 2020) hpoola
Cancer Cell Samples: 4 Classification Models

What is the likelihood that a patient’s cancer is malignant? Today, we’ll look into this question. We’ll optimize, train, make…
July 15, 2020 (updated August 11, 2020) hpoola
KNN, Decision Tree, SVM, and Logistic Regression Classifiers to Predict Loan Status

Hi Crawbears, Today, we’ll look into the question: Will a new bank customer default on his or her loan? We’ll…
July 6, 2020 (updated August 11, 2020) hpoola
San Francisco Crime Interactive Choropleth, Marker, and Cluster Maps

Last time, we created FiveThirtyEight-style visuals and choropleth maps after cleaning and wrangling data from an official UN data set…
June 29, 2020 (updated August 11, 2020) hpoola
U.S. Immigration FiveThirtyEight-Style Visuals and Choropleth Maps

Hi Crawbears, Today, we’ll look into the question: How have U.S. immigration trends and distribution changed by country over the…
June 26, 2020 (updated August 11, 2020) hpoola
Multivariate Linear & Polynomial Models to Predict Home Price

Hi Crawbears, Are you getting a fair value for your home? Let’s think about this question by working with a…
June 23, 2020 (updated August 11, 2020) hpoola
Predicting Whether You’ve Smoked 100 Cigarettes: a Marginal Logistic Model

Hi Crawbears, Last time, we constructed linear models, including OLS, marginal, and multilevel, with the NHANES national health and nutrition…
June 19, 2020 (updated August 11, 2020) hpoola
Marginal & Multilevel Linear Models

Hi Crawbears, In a previous post, we dug into the NHANES national health and nutrition data set to answer some…
June 17, 2020 (updated August 11, 2020) hpoola
Bayesian Simulation

Hi Crawbears, In a previous post, we applied chained Bayesian logic to interpret a positive COVID-19 test result. We’ve also…
June 12, 2020 (updated August 11, 2020) hpoola
76 Cereal Brands, a Head-to-Head Comparison

Hi Crawbears, A reader sent in a fun ask: Can you do an analysis comparing cereal brands? It’s hard not…
June 11, 2020 (updated August 12, 2020) hpoola
Uninsured Racial Disparity, Blood Pressure Gender & Smoker Diffs

Hi Crawbears, In a previous post, I examined baby circadian clocks, specifically the difference in night bedtime and sleep duration…
June 9, 2020 (updated August 11, 2020) hpoola
Chained Bayesian: Interpreting a Positive COVID-19 Test Result

Hi Crawbears, Bayesian logic can be used to compute probabilities in your daily life. It makes your subjective belief (prior…
June 5, 2020 (updated August 12, 2020) hpoola
Baby Circadian Clocks

Parents of toddlers carefully scrutinize, guard, and manage when and for how long their babies nap during the day and…
June 4, 2020 (updated August 12, 2020) hpoola
Exploring & Visualizing Gapminder

Hi Crawbears, In my previous analysis of the NBA 3 Pointer’s dramatic evolution, I walked through data subsetting, time series…
June 2, 2020 (updated August 12, 2020) hpoola
Simulating the Effects of Non-Representative Sampling and Sample Size

What I find exhilarating about constructing Python simulations is that they can bring to life concepts and frameworks of how…
May 29, 2020 (updated August 12, 2020) hpoola
Detlef Schrempf in the 3 Point Era

Following my analysis The NBA 3 Pointer’s Staggering Rise, a reader asked how Detlef Schrempf stacks up. As Schrempf is one…
May 28, 2020 (updated August 12, 2020) hpoola
The NBA 3-Pointer’s Staggering Rise

Today’s NBA game is, in many ways, unrecognizable from its mid-90s self. One of the most prominent shifts has been…
May 27, 2020 (updated August 12, 2020) hpoola
Danae & Shaelo on Strawberry Summit

Hi Crawbears, For Crawstat’s inaugural post, I built a fun, basic simulation with Python that you could try yourself. A…
May 22, 2020 (updated August 12, 2020) hpoola
Welcome to Crawstat!

Hi Crawbears,  Does data make your heart beat faster? Do you get a thrill out of building statistical models and…
May 22, 2020 (updated March 28, 2021) hpoola