Peanut Gallery
IMDb review web scraper that compiles all reviews for a movie as .html or .txt.

Description
Peanut Gallery is a web scraper that I created to compile IMDb reviews for any given movie. After scraping all of the reviews for an IMDb title, the site makes them available for download as a single .txt or .html file.

The Problem
I wanted to learn more about how natural language corpora are built. Before making this site, I noticed that movie reviews have become a popular source of data for sentiment analysis because they are often opinionated and already labeled with ratings (usually on a scale from 1 to 5 or 1 to 10). Since I couldn’t find an API for user-generated movie reviews, I decided to scrape them myself.

Planning
My first step was to convert movie titles to IMDb IDs, which I did with the Open Movie Database (OMDb) API. After that, I had a lot of fun figuring out the best way to scrape reviews from IMDb.
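The title-to-ID step can be sketched roughly as follows. This is an illustrative Python sketch rather than the project's own code: the function names are hypothetical, and it assumes OMDb's documented `t` (title), `type`, and `apikey` query parameters and the `Response` and `imdbID` fields in its JSON reply.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_ENDPOINT = "https://www.omdbapi.com/"


def build_lookup_url(title, api_key):
    """Build an OMDb request URL that looks a movie up by title."""
    query = urlencode({"t": title, "type": "movie", "apikey": api_key})
    return f"{OMDB_ENDPOINT}?{query}"


def extract_imdb_id(payload):
    """Return the IMDb ID from a decoded OMDb response, or None if no match."""
    # OMDb reports success as the string "True", not a JSON boolean.
    if payload.get("Response") != "True":
        return None
    return payload.get("imdbID")


def title_to_imdb_id(title, api_key):
    """Fetch the OMDb record for a title and return its IMDb ID (network call)."""
    with urlopen(build_lookup_url(title, api_key)) as response:
        return extract_imdb_id(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline. For example, `extract_imdb_id({"Response": "False", "Error": "Movie not found!"})` returns `None` instead of raising.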
At first, I parsed the HTML of each review page using a library called Cheerio. However, while I was still building the scraper, IMDb went through a complete site redesign that broke it. I then discovered that, by changing my request URL, I could request the reviews through IMDb's AJAX endpoint, so I changed my approach. Fetching the reviews this way instead of parsing the page HTML, which was prone to change, made the scraper far less likely to break in future IMDb redesigns.
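The general shape of an AJAX-based scraper is a loop that keeps requesting batches until the endpoint stops handing back a key for the next one. The Python sketch below shows only that loop. It assumes key-based pagination, and `fetch_page` is a hypothetical callable standing in for the real HTTP request; the actual IMDb URL and response format are not reproduced here.

```python
def collect_reviews(fetch_page, imdb_id, max_pages=1000):
    """Follow pagination keys until the endpoint stops returning one.

    fetch_page is a hypothetical callable:
        (imdb_id, page_key) -> (list_of_reviews, next_page_key_or_None)
    """
    reviews = []
    page_key = None  # None requests the first batch
    for _ in range(max_pages):  # hard cap so a bad key cannot loop forever
        batch, page_key = fetch_page(imdb_id, page_key)
        reviews.extend(batch)
        if not page_key:
            break
    return reviews


def fake_fetch_page(imdb_id, page_key):
    """Stand-in for the real request, serving three canned batches."""
    pages = {
        None: (["review 1", "review 2"], "k1"),
        "k1": (["review 3"], "k2"),
        "k2": (["review 4"], None),
    }
    return pages[page_key]


all_reviews = collect_reviews(fake_fetch_page, "tt0133093")
print(len(all_reviews))  # 4
```

Passing the fetch function in as an argument keeps the pagination logic independent of any particular URL or markup, which is the same property that made the AJAX approach sturdier than parsing full pages.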

Outcomes
Scraping all the reviews for a single movie title was my first step towards building a corpus. At one point, I rewrote the scraper in Clojure, which gave me a better understanding of my code and of functional programming principles. Eventually, I reworked the code to build a corpus of 1.5 million reviews across five movie genres (animation, comedy, documentary, horror, and romance).
I also wrote a thesis at Yale University on sentiment analysis biases for my degree in Linguistics with a Computational Depth Focus.

Next Steps
- Support JSON downloads to make the reviews more structured
- Allow users to upload a list of movie titles and download a single file with all the reviews labeled by movie
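
One possible shape for such a JSON download is sketched below in Python. The field names (`imdb_id`, `review_count`, `reviews`, `rating`, `text`) and the function name are assumptions made for illustration, not an existing format of the site.

```python
import json


def reviews_to_json(reviews_by_movie):
    """Serialize {imdb_id: [review dicts]} into one JSON document labeled by movie."""
    document = [
        {"imdb_id": imdb_id, "review_count": len(reviews), "reviews": reviews}
        for imdb_id, reviews in reviews_by_movie.items()
    ]
    # ensure_ascii=False keeps non-English review text readable in the file.
    return json.dumps(document, indent=2, ensure_ascii=False)


sample = {"tt0133093": [{"rating": 9, "text": "Great."}]}
print(reviews_to_json(sample))
```

Grouping reviews under their movie's ID would let a single file cover a whole uploaded list of titles while keeping each review attributable to its movie.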