Explore the IMDb Dataset with Jupyter Notebook and Pandas

Sept. 25, 2022 • Python, Data Science

In this tutorial, we will be using Jupyter Notebook, Pandas and Plotly to explore and visualize the official IMDb datasets. You will learn how to apply common data science techniques such as data cleaning, filtering and visualization.

Introduction

Jupyter Notebook and Pandas are great tools for data analysis. For a quick introduction to Pandas, check out their 10 minutes to pandas guide. Another great tutorial is the Pandas Cookbook by Julia Evans. It starts from the very basics and digs into some great real-world data examples.

Once you're familiar with the fundamentals, you might wonder what other real-world datasets you could explore to hone your new skills. In that case, you should take a look at the official IMDb datasets! They are available for personal and non-commercial use and contain millions of rows of real data from the website.

Apart from being interesting to explore, another great thing about these datasets is that there are some flaws in the data format. This means that before you can really get started, you will have to do some data cleaning. As most real-world datasets are not perfect, this is a very useful skill to learn and practice! We will spend a bit of time going through this process step by step.

In this tutorial, I will be using the command line to set up the environment. If that's something you are not familiar with, check out the Django Girls Command Line Introduction.

As always, you can find the corresponding Jupyter notebook file on GitHub.

WARNING: The IMDb dataset is quite large, so make sure you have enough free memory on your machine. On my setup, running the notebook takes about 3-4 GB of RAM.

Setup

Create a new project folder and virtual environment. Activate the virtual environment and install Jupyter Notebook and Pandas using pip.

$ mkdir imdbexplore && cd imdbexplore
$ python -m venv .venv
$ source .venv/bin/activate   # macOS / Linux
$ .venv\Scripts\Activate.ps1  # Windows (PowerShell)
(.venv) $ pip install notebook pandas

Download the IMDb datasets from their website. You can either do it from your browser or directly from the command line, as shown below.

There are a couple of different datasets available. We will be using title.basics, which contains basic information about each title, like type, title, year, runtime and genres. The title.ratings dataset contains the average rating and the number of votes for each title.

# Linux
wget https://datasets.imdbws.com/title.basics.tsv.gz
wget https://datasets.imdbws.com/title.ratings.tsv.gz
# macOS
curl -o title.basics.tsv.gz https://datasets.imdbws.com/title.basics.tsv.gz
curl -o title.ratings.tsv.gz https://datasets.imdbws.com/title.ratings.tsv.gz

Start the Jupyter Notebook server from your command line:

(.venv) $ jupyter notebook

The Jupyter Notebook interface should appear in a new tab in your browser. If nothing happens, copy the URL from the terminal and paste it into your browser.

Create a new notebook by clicking on New and then selecting Python 3.

Create new notebook

Well done. Let's start exploring!

Load the datasets

In order to load the datasets, we need a few Python modules. In a new cell in your notebook, type the following lines.

Hit SHIFT+ENTER to run the cell or click the Run button.

import gzip, shutil
import pandas as pd

The datasets are compressed, which is why we will use gzip to extract the archive. Then we create a new file to write the extracted data to and use shutil to copy the file object. We'll do that with both files.

with gzip.open('title.basics.tsv.gz', 'rb') as f_in:
    with open('title.basics.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
with gzip.open('title.ratings.tsv.gz', 'rb') as f_in:
    with open('title.ratings.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

Now we can load the datasets. We'll use the read_csv function provided by Pandas, which loads the data into a Pandas dataframe. The values in the files are separated by tabs, which is why we specify the separator sep='\t'. As the datasets are quite large, we also set low_memory to False. Try loading the title.basics dataset without this argument and you will see that Pandas complains about columns having mixed types. We also specify the string that we want to be identified as 'not available' with na_values=['\\N']. Don't worry if it takes a few seconds to load, as it is quite a large dataset.

basics = pd.read_csv('title.basics.tsv', sep='\t',
    low_memory=False, na_values=['\\N'])
ratings = pd.read_csv('title.ratings.tsv', sep='\t',
    low_memory=False, na_values=['\\N'])
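
As a side note, read_csv can also decompress the .gz archives on the fly (the compression format is inferred from the file extension), so the manual gzip step above is optional. And if memory is tight on your machine, the usecols argument lets you load only the columns you actually need. A quick sketch:

# optional: read the compressed file directly and load only selected columns
basics_slim = pd.read_csv('title.basics.tsv.gz', sep='\t',
    low_memory=False, na_values=['\\N'],
    usecols=['tconst', 'titleType', 'primaryTitle', 'startYear',
             'runtimeMinutes', 'genres'])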

Before we look at the data itself, let's see how many rows the datasets contain.

print(len(basics), len(ratings))
9226400 1260984

The basics dataframe has roughly 9.2 million rows, but there are only about 1.3 million rows in ratings. It seems that not all titles have ratings data.
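
If you're curious how many titles are missing from ratings, you can check with the .isin() method, for example:

# count the titles in basics that have no matching entry in ratings
len(basics[~basics['tconst'].isin(ratings['tconst'])])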

To view the data, you can use the .head() method to look at the first couple of rows.

basics.head()

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
0 tt0000001 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short
1 tt0000002 short Le clown et ses chiens Le clown et ses chiens 0.0 1892.0 NaN 5 Animation,Short
2 tt0000003 short Pauvre Pierrot Pauvre Pierrot 0.0 1892.0 NaN 4 Animation,Comedy,Romance
3 tt0000004 short Un bon bock Un bon bock 0.0 1892.0 NaN 12 Animation,Short
4 tt0000005 short Blacksmith Scene Blacksmith Scene 0.0 1893.0 NaN 1 Comedy,Short

ratings.head()

tconst averageRating numVotes
0 tt0000001 5.7 1910
1 tt0000002 5.8 256
2 tt0000003 6.5 1714
3 tt0000004 5.6 169
4 tt0000005 6.2 2528

The first dataset, title.basics, contains the type, title, year, runtime and genres as columns for each title. The second one contains the average rating and the number of votes for each title. The column tconst, which appears in both datasets, is a unique identifier for each title.

Looking at the ratings without seeing the human-readable title is actually not very convenient. So why not use the tconst value to find the corresponding rows in both datasets and combine them?

Luckily, Pandas already provides a convenient function to do exactly that: pd.merge(). The on argument specifies which column should be used to match the rows.

data = pd.merge(basics, ratings, on="tconst")
data.head()

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes
0 tt0000001 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short 5.7 1910
1 tt0000002 short Le clown et ses chiens Le clown et ses chiens 0.0 1892.0 NaN 5 Animation,Short 5.8 256
2 tt0000003 short Pauvre Pierrot Pauvre Pierrot 0.0 1892.0 NaN 4 Animation,Comedy,Romance 6.5 1714
3 tt0000004 short Un bon bock Un bon bock 0.0 1892.0 NaN 12 Animation,Short 5.6 169
4 tt0000005 short Blacksmith Scene Blacksmith Scene 0.0 1893.0 NaN 1 Comedy,Short 6.2 2528

That looks great! Now the basics and the ratings data are combined into one dataframe.

Before we continue, let's have a look at how many rows the new dataframe data has.

len(data)
1260984

The merged dataframe data has the same number of rows as ratings, so it seems that rows from title.basics that didn't have ratings data got dropped. In database talk, this method is called an 'inner join', as opposed to an 'outer join', which would keep all existing rows and fill the gaps with blanks. More on that can be found in the official pandas documentation.
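
If you wanted to keep all titles instead, even those without ratings, you could pass the how argument to pd.merge(). A minimal sketch, using how='left' to keep every row of the left dataframe:

# keep every row from basics; titles without ratings get NaN values
data_all = pd.merge(basics, ratings, on="tconst", how="left")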

We can get a summary of our new data dataframe with .info().

data.info()

Int64Index: 1072813 entries, 0 to 1072812
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype
---  ------          --------------    -----
 0   tconst          1072813 non-null  object
 1   titleType       1072813 non-null  object
 2   primaryTitle    1072812 non-null  object
 3   originalTitle   1072812 non-null  object
 4   isAdult         1072813 non-null  int64
 5   startYear       1072660 non-null  float64
 6   endYear         25537 non-null    float64
 7   runtimeMinutes  780476 non-null   object
 8   genres          1051377 non-null  object
 9   averageRating   1072813 non-null  float64
 10  numVotes        1072813 non-null  int64
dtypes: float64(3), int64(2), object(6)
memory usage: 98.2+ MB

You can see some key properties of the dataframe and a list of the columns and the datatype of each column. Another way to find out the datatypes of the columns would be to access the .dtypes property of a dataframe.

data.dtypes
tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult           float64
startYear         float64
endYear           float64
runtimeMinutes     object
genres             object
averageRating     float64
numVotes            int64
dtype: object

Looking at the datatypes, we can see that there are three different ones: object, int64 and float64. Pandas uses object for anything that is not a number, such as strings. int64 and float64 refer to integers and floating-point numbers, respectively.

We can see that Pandas is using the object datatype for the column runtimeMinutes, while we would expect this to be a number. To see what's going on here, we need to do some sanity checking. We will use a technique called boolean indexing, where we select and show only the rows that match a certain condition. In this case, we only want to display the rows that have a non-numeric value in runtimeMinutes.

is_numeric = data['runtimeMinutes'].str.isnumeric()
is_numeric
0          True
1          True
2          True
3          True
4          True
           ...
1260979    True
1260980    True
1260981    True
1260982    True
1260983     NaN
Name: runtimeMinutes, Length: 1260984, dtype: object

Wait a minute, the pandas series is_numeric is of dtype object, not bool. This is because the column runtimeMinutes contains a few NaN values. In order to use is_numeric for boolean indexing, it needs to be boolean, so let's convert it to dtype: bool.

is_numeric = is_numeric.astype('bool')
is_numeric
0          True
1          True
2          True
3          True
4          True
           ...
1260979    True
1260980    True
1260981    True
1260982    True
1260983    True
Name: runtimeMinutes, Length: 1260984, dtype: bool
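
A small subtlety here: .astype('bool') converts NaN to True, so rows where runtimeMinutes is missing entirely will pass our filter. That is fine for now, because we will take care of the missing values later. If you wanted to make that behaviour explicit, you could fill the missing values yourself before converting, e.g.:

# make the handling of missing runtimes explicit: keep them (True)
is_numeric = data['runtimeMinutes'].str.isnumeric().fillna(True).astype('bool')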

Now the series is of dtype: bool. We'll use the invert operator ~ to invert the is_numeric series and use it for our boolean indexing. This way, we will see only the rows that are not numeric. Smart, eh?

data[~is_numeric]

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes
560561 tt12149332 tvEpisode Jeopardy! College Championship Semifinal Game ... 0 2020.0 NaN NaN Game-Show NaN 6.9 8
974363 tt3984412 tvEpisode I'm Not Going to Come Last, I'm Just Going to ... 0 2014.0 NaN NaN Game-Show,Reality-TV NaN 8.0 5

It seems there are two rows where runtimeMinutes contains data that actually belongs to genres. Looking at the other columns, we can see that there is something wrong with the data in the primaryTitle column. The title string seems to be duplicated because of some missing quotation marks and the other columns are shifted to the left. We could try to clean the original data, but instead let's pretend we don't care about these rows and just drop them.

We do that by assigning the filtered data to a new dataframe. Note that we are using the .copy() method explicitly to make sure cleaned_data is actually a copy and not just a view of the original dataframe. This is to avoid getting a SettingWithCopyWarning later on. More on that topic can be found in the pandas documentation.

cleaned_data = data[is_numeric].copy()

Let's just double-check briefly that the corrupt rows are gone.

is_numeric_cleaned = cleaned_data['runtimeMinutes'].str.isnumeric().astype('bool')
cleaned_data[~is_numeric_cleaned]

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes

Perfect. However, the datatype of the runtimeMinutes column is still object.

cleaned_data.dtypes
tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult           float64
startYear         float64
endYear           float64
runtimeMinutes     object
genres             object
averageRating     float64
numVotes            int64
dtype: object

To convert the datatype to numeric, we can use the pandas.to_numeric() function. In order to correctly replace the data in the existing dataframe, we use the .loc indexer to select a subset of the dataframe: .loc[:, 'runtimeMinutes'] means we want to replace all rows of the runtimeMinutes column with the converted data. The errors='coerce' argument turns any value that can't be parsed into NaN instead of raising an error.

cleaned_data.loc[:, 'runtimeMinutes'] = pd.to_numeric(
    cleaned_data.loc[:, 'runtimeMinutes'], errors='coerce'
)
cleaned_data.dtypes
tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult           float64
startYear         float64
endYear           float64
runtimeMinutes    float64
genres             object
averageRating     float64
numVotes            int64
dtype: object

Great, now the runtimeMinutes column is numeric.
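
As a quick sanity check, you could now look at the summary statistics of the converted column:

cleaned_data['runtimeMinutes'].describe()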

Now that we've cleaned the data, let's do some analysis! We can start by looking at the title types and seeing how many items there are for each type. We will use the .value_counts() method for that.

titles_by_type = cleaned_data['titleType'].value_counts()
titles_by_type
tvEpisode       611935
movie           282331
short           143507
tvSeries         83396
video            50367
tvMovie          48909
tvMiniSeries     13877
videoGame        13634
tvSpecial        10556
tvShort           2470
Name: titleType, dtype: int64

Interesting. Let's visualize that data in a pie chart!

There are several plotting libraries that you can use within Python and Jupyter notebooks, e.g. matplotlib. One library I like in particular is Plotly, because it looks very nice and has some cool features. You will need to install it from the command line:

(.venv) $ pip install plotly

Now we can import plotly and initialize the notebook mode, so that the graphs show up nicely in our Jupyter notebook.

import plotly.graph_objects as go
import plotly.offline as pyo
# set notebook mode to work in offline
pyo.init_notebook_mode()

Now we can create a new figure from plotly.graph_objects and pass a go.Pie object to the data argument. Apart from providing labels and values, we use textinfo and insidetextorientation to customize the style a bit. Try it without the textinfo and insidetextorientation arguments to see the difference.

fig = go.Figure(data=[go.Pie(
    labels=titles_by_type.index,
    values=titles_by_type.values,
    textinfo='label+percent',
    insidetextorientation='radial')])
fig.show()

Piechart

Neat, I like that! Did you notice that Plotly calculates the percentages automatically from the total numbers we provided? Note also that when you hover over the sections of the pie chart, you get a little tooltip with the label, percentage and total number next to your cursor.

Let's try to create our own version of the famous IMDb Top 250 movie list! For that list, IMDb only takes movies with a minimum number of votes into consideration; we will use 250,000 votes as our cutoff. So let's filter our data using these criteria. We use the & (AND) operator to combine the two filter conditions. Before displaying, we also sort the rows by the average rating, so that movies with a high rating are at the top of the list.

filtered_data = cleaned_data[
    (cleaned_data.titleType == "movie") & (cleaned_data.numVotes > 250000)
].copy()
filtered_data = filtered_data.sort_values(by="averageRating", ascending=False)
filtered_data[:10]

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes
82427 tt0111161 movie The Shawshank Redemption The Shawshank Redemption 0.0 1994.0 NaN 142.0 Drama 9.3 2639333
46140 tt0068646 movie The Godfather The Godfather 0.0 1972.0 NaN 175.0 Crime,Drama 9.2 1829481
79844 tt0108052 movie Schindler's List Schindler's List 0.0 1993.0 NaN 195.0 Biography,Drama,History 9.0 1338297
29999 tt0050083 movie 12 Angry Men 12 Angry Men 0.0 1957.0 NaN 96.0 Crime,Drama 9.0 779249
48686 tt0071562 movie The Godfather Part II The Godfather Part II 0.0 1974.0 NaN 202.0 Crime,Drama 9.0 1254338
251144 tt0468569 movie The Dark Knight The Dark Knight 0.0 2008.0 NaN 152.0 Action,Crime,Drama 9.0 2610926
114473 tt0167260 movie The Lord of the Rings: The Return of the King The Lord of the Rings: The Return of the King 0.0 2003.0 NaN 201.0 Action,Adventure,Drama 9.0 1814909
82211 tt0110912 movie Pulp Fiction Pulp Fiction 0.0 1994.0 NaN 154.0 Crime,Drama 8.9 2019908
38862 tt0060196 movie The Good, the Bad and the Ugly Il buono, il brutto, il cattivo 0.0 1966.0 NaN 161.0 Adventure,Western 8.8 753855
634176 tt1375666 movie Inception Inception 0.0 2010.0 NaN 148.0 Action,Adventure,Sci-Fi 8.8 2314403

If you want, you can now compare your list with the official one!
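
By the way, Pandas also has a shortcut for 'top N rows by column', the .nlargest() method, which produces a similar top 10 in a single line:

# shortcut for sorting by averageRating and taking the first 10 rows
filtered_data.nlargest(10, 'averageRating')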

Let's create another graph with the filtered data. How about a histogram of the average movie rating, so we can see what the most common average ratings are?

fig = go.Figure(data=[go.Histogram(x=filtered_data['averageRating'])])
fig.show()

Ratings Histogram

To generate the histogram, Plotly automatically creates equally spaced 'bins' and counts the titles that fall into each interval. Pretty neat. Check out their documentation to find out how you can control the intervals.
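
For example, you could redraw the plot with an explicit bin width via the xbins argument (the boundary values below are just an illustration):

fig = go.Figure(data=[go.Histogram(
    x=filtered_data['averageRating'],
    xbins=dict(start=5.0, end=9.5, size=0.1))])  # bins of width 0.1
fig.show()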

The most common average rating is 7.6 to 7.7 (you can see it when you hover over the highest bar). Another thing to note is that hardly any movies with more than 250,000 votes have an average rating below 5.5.

Note that we're still looking at the filtered dataset of movies that received more than 250,000 votes. This filtered data contains fewer than 800 titles.

len(filtered_data)
789

That brings us to another interesting question: how is the number of votes distributed across the movies? Maybe you have come across the Pareto principle (also known as the 80/20 rule), which states that, often, 80% of the effects come from 20% of the causes (see Wikipedia). In the business world, for example, this could mean that 80% of the turnover comes from 20% of the clients.

Let's investigate whether this rule also applies to our movie data, i.e. whether 20% of the movies get 80% of the votes.

To do that, we filter our data to get only movies with at least one vote. Also, in order to make a nice plot, we need to sort the movies by the number of votes in descending order and then add a column with the row number, which we can use as the x-axis for our plot.

import numpy as np

movie_data = (
    data[(data["titleType"] == "movie") & (data["numVotes"] > 0)]
    .sort_values(by="numVotes", ascending=False)
    .copy()
)
movie_data["row_num"] = np.arange(len(movie_data))
movie_data

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes row_num
82427 tt0111161 movie The Shawshank Redemption The Shawshank Redemption 0.0 1994.0 NaN 142 Drama 9.3 2639333 0
251144 tt0468569 movie The Dark Knight The Dark Knight 0.0 2008.0 NaN 152 Action,Crime,Drama 9.0 2610926 1
634176 tt1375666 movie Inception Inception 0.0 2010.0 NaN 148 Action,Adventure,Sci-Fi 8.8 2314403 2
99042 tt0137523 movie Fight Club Fight Club 0.0 1999.0 NaN 139 Drama 8.8 2083152 3
81333 tt0109830 movie Forrest Gump Forrest Gump 0.0 1994.0 NaN 142 Drama,Romance 8.8 2043215 4
... ... ... ... ... ... ... ... ... ... ... ... ...
1126913 tt6555682 movie Janata Express Janata Express 0.0 1981.0 NaN NaN Drama 6.4 5 282326
552959 tt11991560 movie Parimala Lodge Parimala Lodge 0.0 2020.0 NaN NaN Drama 8.8 5 282327
552848 tt11989824 movie Jeeva (2009 film) Jeeva (2009 film) 0.0 2009.0 NaN NaN Romance 5.4 5 282328
552567 tt11984548 movie The Old House O Casarão 0.0 2021.0 NaN 72 Documentary 8.4 5 282329
1180256 tt7738732 movie La settima onda La settima onda 0.0 2015.0 NaN NaN Drama 6.4 5 282330

282331 rows × 12 columns

Let's create a scatterplot with movies on the x-axis and number of votes on the y-axis.

fig = go.Figure(
    data=[go.Scatter(x=movie_data["row_num"], y=movie_data["numVotes"])]
)
fig.show()

Votes Distribution

In fact, this looks very much like the probability density function of the Pareto distribution (see the Wikipedia article). We can already see from this graph that there are very few movies with very high numbers of votes, and loads of movies with hardly any votes at all.

To make it easier to interpret the data, we can calculate the cumulative distribution by cumulatively adding up the number of votes, which we can do with the .cumsum() method from Pandas.

movie_data['cumsum'] = movie_data['numVotes'].cumsum()
movie_data

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes row_num cumsum
82427 tt0111161 movie The Shawshank Redemption The Shawshank Redemption 0.0 1994.0 NaN 142 Drama 9.3 2639333 0 2639333
251144 tt0468569 movie The Dark Knight The Dark Knight 0.0 2008.0 NaN 152 Action,Crime,Drama 9.0 2610926 1 5250259
634176 tt1375666 movie Inception Inception 0.0 2010.0 NaN 148 Action,Adventure,Sci-Fi 8.8 2314403 2 7564662
99042 tt0137523 movie Fight Club Fight Club 0.0 1999.0 NaN 139 Drama 8.8 2083152 3 9647814
81333 tt0109830 movie Forrest Gump Forrest Gump 0.0 1994.0 NaN 142 Drama,Romance 8.8 2043215 4 11691029
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1126913 tt6555682 movie Janata Express Janata Express 0.0 1981.0 NaN NaN Drama 6.4 5 282326 996061058
552959 tt11991560 movie Parimala Lodge Parimala Lodge 0.0 2020.0 NaN NaN Drama 8.8 5 282327 996061063
552848 tt11989824 movie Jeeva (2009 film) Jeeva (2009 film) 0.0 2009.0 NaN NaN Romance 5.4 5 282328 996061068
552567 tt11984548 movie The Old House O Casarão 0.0 2021.0 NaN 72 Documentary 8.4 5 282329 996061073
1180256 tt7738732 movie La settima onda La settima onda 0.0 2015.0 NaN NaN Drama 6.4 5 282330 996061078

282331 rows × 13 columns

Let's plot this.

fig = go.Figure(data=[go.Scatter(x=movie_data['row_num'], y=movie_data['cumsum'])])
fig.show()

Votes Cumulated

That looks good. We can now use that data to calculate what percentage of the total votes the first 20% of movies receive.

First, we get the total number of movies and votes.

rows, cols = movie_data.shape
print(rows, cols)
282331 13
total_movies = rows
total_votes = movie_data.iloc[rows-1, cols-1]  # 'cumsum' of last row
print(total_movies, total_votes)
282331 996061078

Let's get the row number that marks the first 20% of movies. We use the Python function math.floor() to get an integer value.

from math import floor

num_movies_20perc = floor(total_movies * .2)
num_movies_20perc
56466

From this row of the movie dataframe, we can now extract the cumulative number of votes of the first 20% of movies.

cum_votes_20perc = movie_data.iloc[num_movies_20perc, cols-1]
cum_votes_20perc
977776286

Now we can calculate the percentage of votes that the first 20% of movies receive.

perc = cum_votes_20perc / total_votes
perc
0.9816429008181765

That means that 98.2% of all votes on IMDb go to only 20% of the movies! The remaining 80% of movies receive only 1.8% of all votes!
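
If you want to experiment with other cutoffs, you can wrap the calculation in a small helper function. A sketch that reuses total_movies, total_votes and the movie_data dataframe from above:

def vote_share(frac):
    # share of all votes received by the top `frac` of movies
    top_n = floor(total_movies * frac)
    return movie_data['cumsum'].iloc[top_n] / total_votes

vote_share(0.2)  # ~0.98, as calculated above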

With that remarkable insight (feel free to use it as the fun fact of the day and tell your colleagues), this tutorial has reached its

-- END --

Thanks for following along; I hope you found it interesting. If you have any questions or feedback, please leave a comment below!

As a bonus, here are a couple of ideas for further explorations:

