Explore the IMDb Dataset with Jupyter Notebook and Pandas

Sept. 25, 2022 • Python, Data Science

In this tutorial, we will be using Jupyter Notebook, Pandas and Plotly to explore and visualize the official IMDb datasets. You will learn how to apply common data science techniques such as data cleaning, filtering and visualization.

Introduction

Jupyter Notebook and Pandas are great tools for data analysis. For a quick introduction to Pandas, check out their 10 minutes to pandas guide. Another great tutorial is the Pandas Cookbook by Julia Evans. It starts from the very basics and digs into some great real-world data examples.

Once you're familiar with the fundamentals, you might wonder what other real-world datasets you could explore to hone your new skills. In that case, you should take a look at the official IMDb datasets! They are available for personal and non-commercial use and contain millions of rows of real data from the website.

Apart from being interesting to explore, another great thing about these datasets is that there are some flaws in the data format. This means that before you can really get started, you will have to do some data cleaning. As most real-world datasets are not perfect, this is a very useful skill to learn and practice! We will spend a bit of time going through this process step by step.

In this tutorial, I will be using the command line to set up the environment. If that's something you are not familiar with, check out the Django Girls Command Line Introduction.

As always, you can find the corresponding Jupyter notebook file on GitHub.

WARNING: The IMDb dataset is quite large, so make sure you have enough free memory on your machine. On my setup, running the notebook takes about 3-4 GB of RAM.

Setup

Create a new project folder and virtual environment. Activate the virtual environment and install Jupyter Notebook and Pandas using pip.

$ mkdir imdbexplore && cd imdbexplore
$ python -m venv .venv
$ source .venv/bin/activate   # macOS / Linux
$ .venv\Scripts\Activate.ps1  # Windows (PowerShell)
(.venv) $ pip install notebook pandas

Download the IMDb datasets from their website. You can either do it from your browser or directly from the command line, as shown below.

There are a couple of different datasets available. We will be using title.basics, which contains basic information about each title, like type, title, year, runtime and genres. The title.ratings dataset contains the average rating and the number of votes for each title.

# Linux
wget https://datasets.imdbws.com/title.basics.tsv.gz
wget https://datasets.imdbws.com/title.ratings.tsv.gz
# macOS
curl -o title.basics.tsv.gz https://datasets.imdbws.com/title.basics.tsv.gz
curl -o title.ratings.tsv.gz https://datasets.imdbws.com/title.ratings.tsv.gz

Start the Jupyter Notebook server from your command line:

(.venv) $ jupyter notebook

The Jupyter Notebook interface should appear in a new tab in your browser. If nothing happens, copy the URL from the terminal and paste it into your browser.

Create a new notebook by clicking on New and then selecting Python 3.

Create new notebook

Well done. Let's start exploring!

Load the datasets

In order to load the datasets, we need a few Python modules. In a new cell in your notebook, type the following lines.

Hit SHIFT+ENTER to run the cell or click the Run button.

import gzip, shutil
import pandas as pd

The datasets are compressed, which is why we will use gzip to extract the archive. Then we create a new file to write the extracted data to and use shutil to copy the file object. We'll do that with both files.

with gzip.open('title.basics.tsv.gz', 'rb') as f_in:
    with open('title.basics.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
with gzip.open('title.ratings.tsv.gz', 'rb') as f_in:
    with open('title.ratings.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

Now we can load the datasets. We'll use the read_csv function provided by Pandas, which loads the data into a Pandas dataframe. The values in the files are separated by tabs, which is why we specify the separator sep='\t'. As the datasets are quite large, we also set low_memory to False. Try loading the title.basics dataset without this argument and you will see that Pandas complains about columns having mixed types. We also specify the string that we want to be identified as 'not available' with na_values=['\\N']. Don't worry if it takes a few seconds to load, as it is quite a large dataset.

basics = pd.read_csv('title.basics.tsv', sep='\t',
    low_memory=False, na_values=['\\N'])
ratings = pd.read_csv('title.ratings.tsv', sep='\t',
    low_memory=False, na_values=['\\N'])
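
As a side note, read_csv can also decompress the .gz archives on the fly (the compression format is inferred from the file extension), so the manual gzip step above is optional. And if memory is tight on your machine, the usecols argument lets you load only the columns you actually need. A quick sketch:

# optional: read the compressed file directly and load only selected columns
basics_slim = pd.read_csv('title.basics.tsv.gz', sep='\t',
    low_memory=False, na_values=['\\N'],
    usecols=['tconst', 'titleType', 'primaryTitle', 'startYear',
             'runtimeMinutes', 'genres'])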

Before we look at the data itself, let's see how many rows the datasets contain.

print(len(basics), len(ratings))
9226400 1260984

The basics dataframe has roughly 9.2 million rows, but there are only about 1.3 million rows in ratings. It seems that not all titles have ratings data.
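
If you're curious how many titles are missing from ratings, you can check with the .isin() method, for example:

# count the titles in basics that have no matching entry in ratings
len(basics[~basics['tconst'].isin(ratings['tconst'])])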

To view the data, you can use the .head() method to look at the first couple of rows.

basics.head()

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
0 tt0000001 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short
1 tt0000002 short Le clown et ses chiens Le clown et ses chiens 0.0 1892.0 NaN 5 Animation,Short
2 tt0000003 short Pauvre Pierrot Pauvre Pierrot 0.0 1892.0 NaN 4 Animation,Comedy,Romance
3 tt0000004 short Un bon bock Un bon bock 0.0 1892.0 NaN 12 Animation,Short
4 tt0000005 short Blacksmith Scene Blacksmith Scene 0.0 1893.0 NaN 1 Comedy,Short

ratings.head()

tconst averageRating numVotes
0 tt0000001 5.7 1910
1 tt0000002 5.8 256
2 tt0000003 6.5 1714
3 tt0000004 5.6 169
4 tt0000005 6.2 2528

The first dataset, title.basics, contains the type, title, year, runtime and genres as columns for each title. The second one contains the average rating and the number of votes for each title. The column tconst, which appears in both datasets, is a unique identifier for each title.

Looking at the ratings without seeing the human-readable title is actually not very convenient. So why not use the tconst value to find the corresponding rows in both datasets and combine them?

Luckily, Pandas already provides a convenient function to do exactly that: pd.merge(). The on argument specifies which column should be used to match the rows.

data = pd.merge(basics, ratings, on="tconst")
data.head()

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes
0 tt0000001 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short 5.7 1910
1 tt0000002 short Le clown et ses chiens Le clown et ses chiens 0.0 1892.0 NaN 5 Animation,Short 5.8 256
2 tt0000003 short Pauvre Pierrot Pauvre Pierrot 0.0 1892.0 NaN 4 Animation,Comedy,Romance 6.5 1714
3 tt0000004 short Un bon bock Un bon bock 0.0 1892.0 NaN 12 Animation,Short 5.6 169
4 tt0000005 short Blacksmith Scene Blacksmith Scene 0.0 1893.0 NaN 1 Comedy,Short 6.2 2528

That looks great! Now the basics and the ratings data are combined into one dataframe.

Before we continue, let's have a look at how many rows the new dataframe data has.

len(data)
1260984

The merged dataframe data has the same number of rows as ratings, so it seems that rows from title.basics that didn't have ratings data got dropped. In database talk, this method is called an 'inner join', as opposed to an 'outer join', which would keep all existing rows and fill the gaps with blanks. More on that can be found in the official pandas documentation.
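
If you wanted to keep all titles instead, even those without ratings, you could pass the how argument to pd.merge(). A minimal sketch, using how='left' to keep every row of the left dataframe:

# keep every row from basics; titles without ratings get NaN values
data_all = pd.merge(basics, ratings, on="tconst", how="left")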

We can get a summary of our new data dataframe with .info().

data.info()

Int64Index: 1072813 entries, 0 to 1072812
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype
---  ------          --------------    -----
 0   tconst          1072813 non-null  object
 1   titleType       1072813 non-null  object
 2   primaryTitle    1072812 non-null  object
 3   originalTitle   1072812 non-null  object
 4   isAdult         1072813 non-null  int64
 5   startYear       1072660 non-null  float64
 6   endYear         25537 non-null    float64
 7   runtimeMinutes  780476 non-null   object
 8   genres          1051377 non-null  object
 9   averageRating   1072813 non-null  float64
 10  numVotes        1072813 non-null  int64
dtypes: float64(3), int64(2), object(6)
memory usage: 98.2+ MB

You can see some key properties of the dataframe and a list of the columns and the datatype of each column. Another way to find out the datatypes of the columns would be to access the .dtypes property of a dataframe.

data.dtypes
tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult           float64
startYear         float64
endYear           float64
runtimeMinutes     object
genres             object
averageRating     float64
numVotes            int64
dtype: object

Looking at the datatypes, we can see that there are three different ones: object, int64 and float64. Pandas uses object for anything that is not a number, such as strings. int64 and float64 refer to integers and floating-point numbers, respectively.

We can see that Pandas is using the object datatype for the column runtimeMinutes, while we would expect this to be a number. To see what's going on here, we need to do some sanity checking. We will use a technique called boolean indexing, where we select and show only the rows that match a certain condition. In this case, we only want to display the rows that have a non-numeric value in runtimeMinutes.

is_numeric = data['runtimeMinutes'].str.isnumeric()
is_numeric
0          True
1          True
2          True
3          True
4          True
           ...
1260979    True
1260980    True
1260981    True
1260982    True
1260983     NaN
Name: runtimeMinutes, Length: 1260984, dtype: object

Wait a minute, the pandas series is_numeric is of dtype object, not bool. This is because the column runtimeMinutes contains a few NaN values. In order to use is_numeric for boolean indexing, it needs to be boolean, so let's convert it to dtype: bool.

is_numeric = is_numeric.astype('bool')
is_numeric
0          True
1          True
2          True
3          True
4          True
           ...
1260979    True
1260980    True
1260981    True
1260982    True
1260983    True
Name: runtimeMinutes, Length: 1260984, dtype: bool
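
A small subtlety here: .astype('bool') converts NaN to True, so rows where runtimeMinutes is missing entirely will pass our filter. That is fine for now, because we will take care of the missing values later. If you wanted to make that behaviour explicit, you could fill the missing values yourself before converting, e.g.:

# make the handling of missing runtimes explicit: keep them (True)
is_numeric = data['runtimeMinutes'].str.isnumeric().fillna(True).astype('bool')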

Now the series is of dtype: bool. We'll use the invert operator ~ to invert the is_numeric series and use it for our boolean indexing. This way, we will see only the rows that are not numeric. Smart, eh?

data[~is_numeric]

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes
560561 tt12149332 tvEpisode Jeopardy! College Championship Semifinal Game ... 0 2020.0 NaN NaN Game-Show NaN 6.9 8
974363 tt3984412 tvEpisode I'm Not Going to Come Last, I'm Just Going to ... 0 2014.0 NaN NaN Game-Show,Reality-TV NaN 8.0 5

It seems there are two rows where runtimeMinutes contains data that actually belongs to genres. Looking at the other columns, we can see that there is something wrong with the data in the primaryTitle column. The title string seems to be duplicated because of some missing quotation marks and the other columns are shifted to the left. We could try to clean the original data, but instead let's pretend we don't care about these rows and just drop them.

We do that by assigning the filtered data to a new dataframe. Note that we are using the .copy() method explicitly to make sure cleaned_data is actually a copy and not just a view of the original dataframe. This is to avoid getting a SettingWithCopyWarning later on. More on that topic can be found in the pandas documentation.

cleaned_data = data[is_numeric].copy()

Let's just double-check briefly that the corrupt rows are gone.

is_numeric_cleaned = cleaned_data['runtimeMinutes'].str.isnumeric().astype('bool')
cleaned_data[~is_numeric_cleaned]

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes

Perfect. However, the datatype of the runtimeMinutes column is still object.

cleaned_data.dtypes
tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult           float64
startYear         float64
endYear           float64
runtimeMinutes     object
genres             object
averageRating     float64
numVotes            int64
dtype: object

To convert the datatype to numeric, we can use the pandas.to_numeric() function. In order to correctly replace the data in the existing dataframe, we use the .loc indexer to select a subset of the dataframe: .loc[:, 'runtimeMinutes'] means we want to replace all rows of the runtimeMinutes column with the converted data. The errors='coerce' argument turns any value that can't be parsed into NaN instead of raising an error.

cleaned_data.loc[:, 'runtimeMinutes'] = pd.to_numeric(
    cleaned_data.loc[:, 'runtimeMinutes'], errors='coerce'
)
cleaned_data.dtypes
tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult           float64
startYear         float64
endYear           float64
runtimeMinutes    float64
genres             object
averageRating     float64
numVotes            int64
dtype: object

Great, now the runtimeMinutes column is numeric.
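
As a quick sanity check, you could now look at the summary statistics of the converted column:

cleaned_data['runtimeMinutes'].describe()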

Now that we've cleaned the data, let's do some analysis! We can start by looking at the title types and seeing how many items there are for each type. We will use the .value_counts() method for that.

titles_by_type = cleaned_data['titleType'].value_counts()
titles_by_type
tvEpisode       611935
movie           282331
short           143507
tvSeries         83396
video            50367
tvMovie          48909
tvMiniSeries     13877
videoGame        13634
tvSpecial        10556
tvShort           2470
Name: titleType, dtype: int64

Interesting. Let's visualize that data in a pie chart!

There are several plotting libraries that you can use within Python and Jupyter notebooks, e.g. matplotlib. One library I like in particular is Plotly, because it looks very nice and has some cool features. You will need to install it from the command line:

(.venv) $ pip install plotly

Now we can import plotly and initialize the notebook mode, so that the graphs show up nicely in our Jupyter notebook.

import plotly.graph_objects as go
import plotly.offline as pyo
# set notebook mode to work in offline
pyo.init_notebook_mode()

Now we can create a new figure from plotly.graph_objects and pass a go.Pie object to the data argument. Apart from providing labels and values, we use textinfo and insidetextorientation to customize the style a bit. Try it without the textinfo and insidetextorientation arguments to see the difference.

fig = go.Figure(data=[go.Pie(
    labels=titles_by_type.index,
    values=titles_by_type.values,
    textinfo='label+percent',
    insidetextorientation='radial')])
fig.show()

Piechart

Neat, I like that! Did you notice that Plotly calculates the percentages automatically from the total numbers we provided? Note also that when you hover over the sections of the pie chart, you get a little tooltip with the label, percentage and total number next to your cursor.

Let's try to create our own version of the famous IMDb Top 250 movie list! For that list, IMDb only takes movies with a minimum number of votes into consideration; we will use 250,000 votes as our cutoff. So let's filter our data using these criteria. We use the & (AND) operator to combine the two filter conditions. Before displaying, we also sort the rows by the average rating, so that movies with a high rating are at the top of the list.

filtered_data = cleaned_data[
    (cleaned_data.titleType == "movie") & (cleaned_data.numVotes > 250000)
].copy()
filtered_data = filtered_data.sort_values(by="averageRating", ascending=False)
filtered_data[:10]

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes
82427 tt0111161 movie The Shawshank Redemption The Shawshank Redemption 0.0 1994.0 NaN 142.0 Drama 9.3 2639333
46140 tt0068646 movie The Godfather The Godfather 0.0 1972.0 NaN 175.0 Crime,Drama 9.2 1829481
79844 tt0108052 movie Schindler's List Schindler's List 0.0 1993.0 NaN 195.0 Biography,Drama,History 9.0 1338297
29999 tt0050083 movie 12 Angry Men 12 Angry Men 0.0 1957.0 NaN 96.0 Crime,Drama 9.0 779249
48686 tt0071562 movie The Godfather Part II The Godfather Part II 0.0 1974.0 NaN 202.0 Crime,Drama 9.0 1254338
251144 tt0468569 movie The Dark Knight The Dark Knight 0.0 2008.0 NaN 152.0 Action,Crime,Drama 9.0 2610926
114473 tt0167260 movie The Lord of the Rings: The Return of the King The Lord of the Rings: The Return of the King 0.0 2003.0 NaN 201.0 Action,Adventure,Drama 9.0 1814909
82211 tt0110912 movie Pulp Fiction Pulp Fiction 0.0 1994.0 NaN 154.0 Crime,Drama 8.9 2019908
38862 tt0060196 movie The Good, the Bad and the Ugly Il buono, il brutto, il cattivo 0.0 1966.0 NaN 161.0 Adventure,Western 8.8 753855
634176 tt1375666 movie Inception Inception 0.0 2010.0 NaN 148.0 Action,Adventure,Sci-Fi 8.8 2314403

If you want, you can now compare your list with the official one!
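
By the way, Pandas also has a shortcut for 'top N rows by column', the .nlargest() method, which produces a similar top 10 in a single line:

# shortcut for sorting by averageRating and taking the first 10 rows
filtered_data.nlargest(10, 'averageRating')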

Let's create another graph with the filtered data. How about a histogram of the average movie rating, so we can see what the most common average ratings are?

fig = go.Figure(data=[go.Histogram(x=filtered_data['averageRating'])])
fig.show()

Ratings Histogram

To generate the histogram, Plotly automatically creates equally spaced 'bins' and counts the titles that fall into each interval. Pretty neat. Check out their documentation to find out how you can control the intervals.
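
For example, you could redraw the plot with an explicit bin width via the xbins argument (the boundary values below are just an illustration):

fig = go.Figure(data=[go.Histogram(
    x=filtered_data['averageRating'],
    xbins=dict(start=5.0, end=9.5, size=0.1))])  # bins of width 0.1
fig.show()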

The most common average rating is 7.6 to 7.7 (you can see it when you hover over the highest bar). Another thing to note is that hardly any movies with more than 250,000 votes have an average rating below 5.5.

Note that we're still looking at the filtered dataset of movies that received more than 250,000 votes. This filtered data contains fewer than 800 titles.

len(filtered_data)
789

That brings us to another interesting question: how is the number of votes distributed across the movies? Maybe you have come across the Pareto principle (also known as the 80/20 rule), which states that, often, 80% of the effects come from 20% of the causes (see Wikipedia). In the business world, for example, this could mean that 80% of the turnover comes from 20% of the clients.

Let's investigate whether this rule also applies to our movie data, i.e. whether 20% of the movies get 80% of the votes.

To do that, we filter our data to get only movies with at least one vote. Also, in order to make a nice plot, we need to sort the movies by the number of votes in descending order and then add a column with the row number, which we can use as the x-axis for our plot.

import numpy as np

movie_data = (
    data[(data["titleType"] == "movie") & (data["numVotes"] > 0)]
    .sort_values(by="numVotes", ascending=False)
    .copy()
)
movie_data["row_num"] = np.arange(len(movie_data))
movie_data

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes row_num
82427 tt0111161 movie The Shawshank Redemption The Shawshank Redemption 0.0 1994.0 NaN 142 Drama 9.3 2639333 0
251144 tt0468569 movie The Dark Knight The Dark Knight 0.0 2008.0 NaN 152 Action,Crime,Drama 9.0 2610926 1
634176 tt1375666 movie Inception Inception 0.0 2010.0 NaN 148 Action,Adventure,Sci-Fi 8.8 2314403 2
99042 tt0137523 movie Fight Club Fight Club 0.0 1999.0 NaN 139 Drama 8.8 2083152 3
81333 tt0109830 movie Forrest Gump Forrest Gump 0.0 1994.0 NaN 142 Drama,Romance 8.8 2043215 4
... ... ... ... ... ... ... ... ... ... ... ... ...
1126913 tt6555682 movie Janata Express Janata Express 0.0 1981.0 NaN NaN Drama 6.4 5 282326
552959 tt11991560 movie Parimala Lodge Parimala Lodge 0.0 2020.0 NaN NaN Drama 8.8 5 282327
552848 tt11989824 movie Jeeva (2009 film) Jeeva (2009 film) 0.0 2009.0 NaN NaN Romance 5.4 5 282328
552567 tt11984548 movie The Old House O Casarão 0.0 2021.0 NaN 72 Documentary 8.4 5 282329
1180256 tt7738732 movie La settima onda La settima onda 0.0 2015.0 NaN NaN Drama 6.4 5 282330

282331 rows × 12 columns

Let's create a scatterplot with movies on the x-axis and number of votes on the y-axis.

fig = go.Figure(
    data=[go.Scatter(x=movie_data["row_num"], y=movie_data["numVotes"])]
)
fig.show()

Votes Distribution

In fact, this looks very much like the probability density function of the Pareto distribution (see the Wikipedia article). We can already see from this graph that there are very few movies with very high numbers of votes, and loads of movies with hardly any votes at all.

To make it easier to interpret the data, we can calculate the cumulative distribution by cumulatively adding up the number of votes, which we can do with the .cumsum() method from Pandas.

movie_data['cumsum'] = movie_data['numVotes'].cumsum()
movie_data

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes row_num cumsum
82427 tt0111161 movie The Shawshank Redemption The Shawshank Redemption 0.0 1994.0 NaN 142 Drama 9.3 2639333 0 2639333
251144 tt0468569 movie The Dark Knight The Dark Knight 0.0 2008.0 NaN 152 Action,Crime,Drama 9.0 2610926 1 5250259
634176 tt1375666 movie Inception Inception 0.0 2010.0 NaN 148 Action,Adventure,Sci-Fi 8.8 2314403 2 7564662
99042 tt0137523 movie Fight Club Fight Club 0.0 1999.0 NaN 139 Drama 8.8 2083152 3 9647814
81333 tt0109830 movie Forrest Gump Forrest Gump 0.0 1994.0 NaN 142 Drama,Romance 8.8 2043215 4 11691029
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1126913 tt6555682 movie Janata Express Janata Express 0.0 1981.0 NaN NaN Drama 6.4 5 282326 996061058
552959 tt11991560 movie Parimala Lodge Parimala Lodge 0.0 2020.0 NaN NaN Drama 8.8 5 282327 996061063
552848 tt11989824 movie Jeeva (2009 film) Jeeva (2009 film) 0.0 2009.0 NaN NaN Romance 5.4 5 282328 996061068
552567 tt11984548 movie The Old House O Casarão 0.0 2021.0 NaN 72 Documentary 8.4 5 282329 996061073
1180256 tt7738732 movie La settima onda La settima onda 0.0 2015.0 NaN NaN Drama 6.4 5 282330 996061078

282331 rows × 13 columns

Let's plot this.

fig = go.Figure(data=[go.Scatter(x=movie_data['row_num'], y=movie_data['cumsum'])])
fig.show()

Votes Cumulated

That looks good. We can now use that data to calculate what percentage of the total votes the first 20% of movies receive.

First, we get the total number of movies and votes.

rows, cols = movie_data.shape
print(rows, cols)
282331 13
total_movies = rows
total_votes = movie_data.iloc[rows-1, cols-1]  # 'cumsum' of last row
print(total_movies, total_votes)
282331 996061078

Let's get the row number that marks the first 20% of movies. We use the Python function math.floor() to get an integer value.

from math import floor

num_movies_20perc = floor(total_movies * .2)
num_movies_20perc
56466

From this row of the movie dataframe, we can now extract the cumulative number of votes of the first 20% of movies.

cum_votes_20perc = movie_data.iloc[num_movies_20perc, cols-1]
cum_votes_20perc
977776286

Now we can calculate the percentage of votes that the first 20% of movies receive.

perc = cum_votes_20perc / total_votes
perc
0.9816429008181765

That means that 98.2% of all votes on IMDb go to only 20% of the movies! The remaining 80% of movies receive only 1.8% of all votes!
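
If you want to experiment with other cutoffs, you can wrap the calculation in a small helper function. A sketch that reuses total_movies, total_votes and the movie_data dataframe from above:

def vote_share(frac):
    # share of all votes received by the top `frac` of movies
    top_n = floor(total_movies * frac)
    return movie_data['cumsum'].iloc[top_n] / total_votes

vote_share(0.2)  # ~0.98, as calculated above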

With that remarkable insight (feel free to use it as the fun fact of the day and tell your colleagues), this tutorial has reached its

-- END --

Thanks for following along; I hope you found it interesting. If you have any questions or feedback, please leave a comment below!

As a bonus, here are a couple of ideas for further explorations:

