Creating an IPO Spider

More than 350 companies have IPO'd on U.S. exchanges in 2020, the most since 2000, including big names like Palantir, Asana, Snowflake, and McAfee. Others, like Airbnb, Doordash, Instacart, and Robinhood, are in the bullpen. It would be great to see what these companies are all about, learn their key products, and have that info in one place.

To source the info, let's look at the public website Stock Analysis. We'll build a simple web spider that scrapes the links to each company on the site, follows each company link, scrapes each company's description, and outputs a dictionary with company names & tickers as keys and company descriptions as values. We'll then convert the dictionary of 353 companies that the spider generates into a styled dataframe.

We'll create the web spider using Scrapy and use both XPath and CSS methods to access HTML elements. The heart of the spider is its two parsing methods, both using CSS selectors: the first extracts the links the spider will follow, and the second scrapes each company's name & ticker and description from the linked page. Constructing the parsing methods requires some time upfront inspecting the site's HTML to find the right CSS selectors for the target elements. The purpose is to demonstrate how to build a web spider and to create a single table of high-level company descriptions for prospective investors. Potential future areas of scraping, for anyone interested, include financial details such as revenue, EBITDA, and free cash flow.

The gist with the code wasn't loading here properly, so I've included images of the code and output below.

Snapshot of dataframe of 353 companies:
