Peanut Gallery

IMDb review web scraper that compiles all reviews for a movie as .html or .txt.

Landing page
React.js HTML CSS JavaScript Express Node.js Python Cheerio Clojure

Description

Peanut Gallery is a web scraper that I created to compile IMDb reviews for any given movie. After scraping all of the reviews for an IMDb title, the site makes them available for download as a single .txt or .html file.

Results page with scraped .txt file

The Problem

I wanted to learn more about how natural language corpora are built. Before making this site, I noticed that movie reviews have become a popular source of data for sentiment analysis because they are often opinionated and already labeled with ratings (usually on a scale from 1-5 or 1-10). Since I couldn’t find an API for user-generated movie reviews, I decided to scrape them myself.

Sample page from IMDb

Planning

My first step was to convert movie titles to IMDb ID’s. I decided to use the Open Movie Database API. After that, I had a lot of fun figuring out the best way to scrape reviews from IMDb.

At first, I parsed the HTML from each review page using a library called Cheerio. However, while I was in the process of building my scraper, IMDb went through a total site redesign that broke it. After I discovered that I could access the AJAX for the reviews by changing my request URL, I changed my approach. Using AJAX instead of scraping their HTML, which was prone to change, made my scraper resilient to any future IMDb redesigns.

Sample AJAX url from IMDb

Outcomes

Scraping all the reviews for a single movie title was my first step towards building a corpus. At one point, I re-wrote the scraper in the programming language Clojure, which gave me a better understanding of my code and of functional programming principles. Eventually, I re-worked the code to build a corpus of 1.5 million reviews across five movie genres (animation, comedy, documentary, horror, and romance).

I also wrote a thesis at Yale University on sentiment analysis biases for my degree in Linguistics with a Computational Depth Focus.

Next Steps

  • Support JSON downloads to make the reviews more structured
  • Allow users to upload a list of movie titles and download a single file with all the reviews labeled by movie