In this tutorial we look at how data can be grouped (for example by year) and how the resulting grouped data can be viewed and analysed.
The data source for this tutorial is the Top 1000 Movies by IMDB Rating.
The data will be grouped by year, genre and actor in this tutorial to provide a variety of insights into the data.
Reading the data
import pandas as pd
We begin by reading data from IMDB into the movies_df
dataframe. We then show the first 5 rows in this dataframe.
= pd.read_csv("data/IMDB-Movie-Data.csv", index_col="Rank")
movies_df 5) movies_df.head(
Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Rank | |||||||||||
1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa… | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag… | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar… | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea… | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma… | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th… | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D… | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
Next, for convenience, we rename all of the column titles so that they are all lowercase.
= [col.lower() for col in movies_df] movies_df.columns
Grouping by year
The data is grouped by year. This creates a grouped dataframe – this is essentially a collection of dataframes that are index by the year.
=movies_df.groupby(["year"]) grouped
Inspecting a particular year
If wanted to view the information for a particular group (in this case for a particular year) we can use the get_group
function. The code below will get the dataframe for 2009 and show the first 5 entries.
2009).head(5) grouped.get_group(
title | genre | description | director | actors | year | runtime (minutes) | rating | votes | revenue (millions) | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Rank | |||||||||||
78 | Inglourious Basterds | Adventure,Drama,War | In Nazi-occupied France during World War II, a… | Quentin Tarantino | Brad Pitt, Diane Kruger, Eli Roth,Mélanie Laurent | 2009 | 153 | 8.3 | 959065 | 120.52 | 69.0 |
88 | Avatar | Action,Adventure,Fantasy | A paraplegic marine dispatched to the moon Pan… | James Cameron | Sam Worthington, Zoe Saldana, Sigourney Weaver… | 2009 | 162 | 7.8 | 935408 | 760.51 | 83.0 |
141 | Star Trek | Action,Adventure,Sci-Fi | The brash James T. Kirk tries to live up to hi… | J.J. Abrams | Chris Pine, Zachary Quinto, Simon Pegg, Leonar… | 2009 | 127 | 8.0 | 526324 | 257.70 | 82.0 |
148 | Watchmen | Action,Drama,Mystery | In 1985 where former superheroes exist, the mu… | Zack Snyder | Jackie Earle Haley, Patrick Wilson, Carla Gugi… | 2009 | 162 | 7.6 | 410249 | 107.50 | 56.0 |
252 | Kynodontas | Drama,Thriller | Three teenagers live isolated, without leaving… | Yorgos Lanthimos | Christos Stergioglou, Michele Valley, Angeliki… | 2009 | 94 | 7.3 | 50946 | 0.11 | 73.0 |
Sorting values
In the following code we use the sort_values
function to sort the movies released in 2009 into from best to worst (using ascending=False
) and from worst to best (using ascending=True
or omitting this option).
The first three entries from each of these sorted dataframes are displayed.
2009).sort_values(by="rating", ascending=False).head(3) grouped.get_group(
title | genre | description | director | actors | year | runtime (minutes) | rating | votes | revenue (millions) | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Rank | |||||||||||
431 | 3 Idiots | Comedy,Drama | Two friends are searching for their long lost … | Rajkumar Hirani | Aamir Khan, Madhavan, Mona Singh, Sharman Joshi | 2009 | 170 | 8.4 | 238789 | 6.52 | 67.0 |
78 | Inglourious Basterds | Adventure,Drama,War | In Nazi-occupied France during World War II, a… | Quentin Tarantino | Brad Pitt, Diane Kruger, Eli Roth,Mélanie Laurent | 2009 | 153 | 8.3 | 959065 | 120.52 | 69.0 |
500 | Up | Animation,Adventure,Comedy | Seventy-eight year old Carl Fredricksen travel… | Pete Docter | Edward Asner, Jordan Nagai, John Ratzenberger,… | 2009 | 96 | 8.3 | 722203 | 292.98 | 88.0 |
2009).sort_values(by="rating").head(3) grouped.get_group(
title | genre | description | director | actors | year | runtime (minutes) | rating | votes | revenue (millions) | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Rank | |||||||||||
872 | Dragonball Evolution | Action,Adventure,Fantasy | The young warrior Son Goku sets out on a quest… | James Wong | Justin Chatwin, James Marsters, Yun-Fat Chow, … | 2009 | 85 | 2.7 | 59512 | 9.35 | 45.0 |
937 | The Human Centipede (First Sequence) | Horror | A mad scientist kidnaps and mutilates a trio o… | Tom Six | Dieter Laser, Ashley C. Williams, Ashlynn Yenn… | 2009 | 92 | 4.4 | 60655 | 0.18 | 33.0 |
349 | Jennifer’s Body | Comedy,Horror | A newly possessed high school cheerleader turn… | Karyn Kusama | Megan Fox, Amanda Seyfried, Adam Brody, Johnny… | 2009 | 102 | 5.1 | 96617 | 16.20 | 47.0 |
2009).sort_values(by="metascore", ascending=False).head(3) grouped.get_group(
title | genre | description | director | actors | year | runtime (minutes) | rating | votes | revenue (millions) | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Rank | |||||||||||
500 | Up | Animation,Adventure,Comedy | Seventy-eight year old Carl Fredricksen travel… | Pete Docter | Edward Asner, Jordan Nagai, John Ratzenberger,… | 2009 | 96 | 8.3 | 722203 | 292.98 | 88.0 |
815 | Fantastic Mr. Fox | Animation,Adventure,Comedy | An urbane fox cannot resist returning to his f… | Wes Anderson | George Clooney, Meryl Streep, Bill Murray, Jas… | 2009 | 87 | 7.8 | 149779 | 21.00 | 83.0 |
88 | Avatar | Action,Adventure,Fantasy | A paraplegic marine dispatched to the moon Pan… | James Cameron | Sam Worthington, Zoe Saldana, Sigourney Weaver… | 2009 | 162 | 7.8 | 935408 | 760.51 | 83.0 |
2009).sort_values(by="metascore", ascending=True).head(3) grouped.get_group(
title | genre | description | director | actors | year | runtime (minutes) | rating | votes | revenue (millions) | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Rank | |||||||||||
398 | Couples Retreat | Comedy | A comedy centered around four couples who sett… | Peter Billingsley | Vince Vaughn, Malin Akerman, Jon Favreau, Jaso… | 2009 | 113 | 5.5 | 86417 | 109.18 | 23.0 |
900 | Bride Wars | Comedy,Romance | Two best friends become rivals when they sched… | Gary Winick | Kate Hudson, Anne Hathaway, Candice Bergen, Br… | 2009 | 89 | 5.5 | 83976 | 58.72 | 24.0 |
453 | Pandorum | Action,Horror,Mystery | A pair of crew members aboard a spaceship wake… | Christian Alvart | Dennis Quaid, Ben Foster, Cam Gigandet, Antje … | 2009 | 108 | 6.8 | 126656 | 10.33 | 28.0 |
Summary information
Summary calculations can be done on the grouped data. For example we can calculate the mean rating for each year using the mean
function.
"rating"]].mean() grouped[[
rating | |
---|---|
year | |
2006 | 7.125000 |
2007 | 7.133962 |
2008 | 6.784615 |
2009 | 6.960784 |
2010 | 6.826667 |
2011 | 6.838095 |
2012 | 6.925000 |
2013 | 6.812088 |
2014 | 6.837755 |
2015 | 6.602362 |
2016 | 6.436700 |
Similarly, the maximum metascore value for each year can be calculated using the max
function. Whilst this tables shows us what the top metascore was for each year, it does not show the associated movie title (or any other associated information). Other aggregate functions also have a similar limitation. We will show in the next section how to create a more detailed summary of the best (or worst) movies.
"metascore"]].max() grouped[[
metascore | |
---|---|
year | |
2006 | 98.0 |
2007 | 96.0 |
2008 | 94.0 |
2009 | 88.0 |
2010 | 95.0 |
2011 | 94.0 |
2012 | 95.0 |
2013 | 96.0 |
2014 | 100.0 |
2015 | 95.0 |
2016 | 99.0 |
Top rated movie for each year
In the following code we create a new dataframe showing the top rated movie from each year from 2006 up until 2016. For each entry we show the year of release, movie title, rating and (for comparison) the metascore.
To create this dataframe we begin by creating a new dictionary and add entries for the four columns of interest noted above. Then for each group in the grouped data frame we get the row with the highest rating. This is done by sorting the group on the rating in descending order, then getting the first row in the sorted group.
= {}
data "year"]=[]
data["title"]=[]
data["rating"]=[]
data["metascore"]=[]
data[for year, group in grouped:
=group.sort_values(by="rating", ascending=False).iloc[0]
best"year"].append(best['year'])
data["title"].append(best['title'])
data["rating"].append(best['rating'])
data["metascore"].append(best['metascore'])
data[=pd.DataFrame(data)
best_df best_df
year | title | rating | metascore | |
---|---|---|---|---|
0 | 2006 | The Prestige | 8.5 | 66.0 |
1 | 2007 | Taare Zameen Par | 8.5 | 42.0 |
2 | 2008 | The Dark Knight | 9.0 | 82.0 |
3 | 2009 | 3 Idiots | 8.4 | 67.0 |
4 | 2010 | Inception | 8.8 | 74.0 |
5 | 2011 | The Intouchables | 8.6 | 57.0 |
6 | 2012 | The Dark Knight Rises | 8.5 | 78.0 |
7 | 2013 | The Wolf of Wall Street | 8.2 | 75.0 |
8 | 2014 | Interstellar | 8.6 | 74.0 |
9 | 2015 | Bahubali: The Beginning | 8.3 | NaN |
10 | 2016 | Dangal | 8.8 | NaN |
Best movies by metascore
For comparison we also create a dataframe to store the top movie (ranked by metascore) for each year.
= {}
data "year"]=[]
data["title"]=[]
data["rating"]=[]
data["metascore"]=[]
data[for year, group in grouped:
=group.sort_values(by="metascore", ascending=False).iloc[0]
best"year"].append(best['year'])
data["title"].append(best['title'])
data["rating"].append(best['rating'])
data["metascore"].append(best['metascore'])
data[=pd.DataFrame(data)
best_df best_df
year | title | rating | metascore | |
---|---|---|---|---|
0 | 2006 | Pan’s Labyrinth | 8.2 | 98.0 |
1 | 2007 | Ratatouille | 8.0 | 96.0 |
2 | 2008 | The Hurt Locker | 7.6 | 94.0 |
3 | 2009 | Up | 8.3 | 88.0 |
4 | 2010 | The Social Network | 7.7 | 95.0 |
5 | 2011 | Megan Is Missing | 4.9 | 94.0 |
6 | 2012 | Zero Dark Thirty | 7.4 | 95.0 |
7 | 2013 | Gravity | 7.8 | 96.0 |
8 | 2014 | Boyhood | 7.9 | 100.0 |
9 | 2015 | Carol | 7.2 | 95.0 |
10 | 2016 | Moonlight | 7.5 | 99.0 |
Summarising the distributions of ratings using a boxplot
In the following code we show how boxplots can be created for each year of release and shown on the same graph for easy comparison of movie ratings across each year.
import matplotlib.pyplot as plt
= []
data = []
years for year, group in grouped:
"rating"])
data.append(group[
years.append(year)=years, vert=False)
plt.boxplot(data, labels"Rating (0..10)")
plt.xlabel("Year") plt.ylabel(
Text(0, 0.5, 'Year')
Grouping by genre
Next we group the movies by genre. However we note that in the original datafile many of the movies had multiple comma separated genres. The following code creates a new data frame with movies that had multiple genres split across multiple rows with one genre list per row.
={}
datafor c in movies_df.columns:
=[]
datafor i in range(0, movies_df.shape[0]):
=movies_df.iloc[i]['genre'].split(',')
genre_list
for g in genre_list:
"genre"].append(g)
data[for c in movies_df.columns:
if c != "genre":
data.append(movies_df.iloc[i])= pd.DataFrame(data) movies_genre_df
movies_genre_df
title | genre | description | director | actors | year | runtime (minutes) | rating | votes | revenue (millions) | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Guardians of the Galaxy | Action | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
1 | Guardians of the Galaxy | Adventure | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
2 | Guardians of the Galaxy | Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
3 | Prometheus | Adventure | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa… | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
4 | Prometheus | Mystery | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa… | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
… | … | … | … | … | … | … | … | … | … | … | … |
2550 | Search Party | Adventure | A pair of friends embark on a mission to reuni… | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh… | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
2551 | Search Party | Comedy | A pair of friends embark on a mission to reuni… | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh… | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
2552 | Nine Lives | Comedy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch… | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
2553 | Nine Lives | Family | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch… | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
2554 | Nine Lives | Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch… | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
2555 rows × 11 columns
"genre")["rating"].mean().sort_values(ascending=False) movies_genre_df.groupby(
genre
War 7.353846
Animation 7.324490
Biography 7.290123
History 7.127586
Music 7.075000
Sport 7.011111
Drama 6.953801
Musical 6.940000
Mystery 6.886792
Crime 6.786667
Adventure 6.772201
Western 6.771429
Sci-Fi 6.716667
Romance 6.685816
Family 6.684314
Comedy 6.647670
Action 6.614521
Thriller 6.593333
Fantasy 6.548515
Horror 6.089916
Name: rating, dtype: float64
"genre")["rating"].count().sort_values(ascending=False) movies_genre_df.groupby(
genre
Drama 513
Action 303
Comedy 279
Adventure 259
Thriller 195
Crime 150
Romance 141
Sci-Fi 120
Horror 119
Mystery 106
Fantasy 101
Biography 81
Family 51
Animation 49
History 29
Sport 18
Music 16
War 13
Western 7
Musical 5
Name: rating, dtype: int64
Grouping by actor
In the following code a new dataframe is created containing one actor per row. Before the individual actor names are added to the dataframe any leading spaces at the beginning of the line are removed using the lstrip
function.
={}
datafor c in movies_df.columns:
=[]
datafor i in range(0, movies_df.shape[0]):
=movies_df.iloc[i]['actors'].split(',')
actor_list
for a in actor_list:
"actors"].append(a.lstrip())
data[for c in movies_df.columns:
if c != "actors":
data.append(movies_df.iloc[i])= pd.DataFrame(data) movies_actor_df
movies_actor_df
title | genre | description | director | actors | year | runtime (minutes) | rating | votes | revenue (millions) | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Vin Diesel | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
2 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Bradley Cooper | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
3 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Zoe Saldana | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
4 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
… | … | … | … | … | … | … | … | … | … | … | … |
3994 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni… | Scot Armstrong | Shannon Woodward | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
3995 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Kevin Spacey | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
3996 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Jennifer Garner | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
3997 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Robbie Amell | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
3998 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins… | Barry Sonnenfeld | Cheryl Hines | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
3999 rows × 11 columns
Sorting by number of movie appearances
The count
method is used to determine how many movies each actor has appeared in. The list is sorted from highest to lowest.
"actors")["title"].count().sort_values(ascending=False) movies_actor_df.groupby(
actors
Mark Wahlberg 15
Hugh Jackman 14
Brad Pitt 13
Christian Bale 13
Scarlett Johansson 12
..
Jackie Earle Haley 1
Jackie Chan 1
Jacki Weaver 1
Jack Taylor 1
Óscar Jaenada 1
Name: title, Length: 1985, dtype: int64
Ranking actors
Next we rank actors based on the average metascore of all movies they have appeared in.
In this case the list is skewed by actors who have only appeared in a single movie.
=movies_actor_df.groupby("actors")
grouped_by_actors "metascore"].mean().sort_values(ascending=False) grouped_by_actors[
actors
Ellar Coltrane 100.0
Elijah Smith 100.0
Patricia Arquette 100.0
Shariff Earp 99.0
Mahershala Ali 99.0
...
Tom Hughes NaN
Val Kilmer NaN
Vanessa Ferlito NaN
Zoë Bell NaN
Émilie Leclerc NaN
Name: metascore, Length: 1985, dtype: float64
Adding a minimum number of movies filter
To ensure that the ranking of actors is not skewed by those who have only appeared in a small number of movies, we filter the data to only include actors who have appeared in more than 6 movies.
The high flyers
= grouped_by_actors.filter(lambda x: x['rating'].count() > 6).groupby("actors")
filtered_grouped_by_actors"metascore"].mean().sort_values(ascending=False).head(10) filtered_grouped_by_actors[
actors
Jeremy Renner 75.625000
George Clooney 73.142857
Amy Adams 72.625000
Brad Pitt 72.076923
Joel Edgerton 71.625000
Rooney Mara 70.428571
Ryan Gosling 70.400000
Joseph Gordon-Levitt 70.250000
Matt Damon 70.250000
Jonah Hill 70.111111
Name: metascore, dtype: float64
The stinkers
"metascore"].mean().sort_values(ascending=True).head(10) filtered_grouped_by_actors[
actors
Adam Sandler 30.888889
Jennifer Aniston 40.625000
Ryan Reynolds 44.714286
Zac Efron 45.500000
Will Smith 47.111111
Teresa Palmer 47.333333
Liam Neeson 47.600000
Jason Sudeikis 48.833333
Gerard Butler 49.600000
Robert De Niro 50.125000
Name: metascore, dtype: float64