===================
== Thomas Pinder ==
===================
Bayesian ML, Causal Inference, and JAX


Lollipop Plot of The Office US IMDb Reviews

Five months ago I started an internship with Amazon, and I’ve learned a huge amount from my colleagues. Perhaps the most useful lesson has been how to use plots to debug models.

Inspired by my colleagues, I’d like to improve my plotting skills. So, from time to time, I’ll try creating a figure that I would not usually make in order to broaden my knowledge. To focus on the skill of plotting, I’ll remove any need for creativity by replicating plots I’ve seen and liked. Cédric Scherer is someone I’ve followed on Twitter for a while, and I’ve always admired the plots that he creates. In this first post I’ll try to replicate the lollipop chart Cédric made of the IMDb reviews of The Office US.

The Office by Cédric Scherer

Unlike Cédric’s plot, I’ll be creating my plot in Python so we’ll have to do a bit of data wrangling first.

Data Wrangling

import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
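# plotting_utils is a local helper module; adjust_lightness is not used in the
# code shown in this post, so this import is safe to remove if you don't have it
from plotting_utils import adjust_lightness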
from tempfile import NamedTemporaryFile
import urllib.request  # the request submodule must be imported explicitly
import matplotlib.font_manager as fm

We’ll read our data in from the R for Data Science GitHub repository and store it in a Pandas dataframe, the first five rows of which look as follows:

office_data = pd.read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-17/office_ratings.csv"
)
office_data.head()

   season  episode          title  imdb_rating  total_votes    air_date
0       1        1          Pilot          7.6         3706  2005-03-24
1       1        2  Diversity Day          8.3         3566  2005-03-29
2       1        3    Health Care          7.9         2983  2005-04-05
3       1        4   The Alliance          8.1         2886  2005-04-12
4       1        5     Basketball          8.4         3179  2005-04-19
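
As a quick offline check of the column layout, here is a small sketch that parses the five rows shown above from an in-memory string instead of the URL. The `parse_dates` argument is my own addition rather than part of the code above; it assumes you would like `air_date` stored as a datetime column instead of plain strings.

```python
import io

import pandas as pd

# The five rows shown above, inlined so the example runs without network access.
csv_text = """season,episode,title,imdb_rating,total_votes,air_date
1,1,Pilot,7.6,3706,2005-03-24
1,2,Diversity Day,8.3,3566,2005-03-29
1,3,Health Care,7.9,2983,2005-04-05
1,4,The Alliance,8.1,2886,2005-04-12
1,5,Basketball,8.4,3179,2005-04-19
"""

# parse_dates is an assumption: it converts air_date from strings to datetimes.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["air_date"])
print(sample.dtypes)
```

The rest of the post never touches `air_date`, so this is optional; it only matters if you later want to plot against the broadcast date.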

We’ll now process the data in four steps:

  • Create a continuous episode index that runs across all seasons
  • Recast the season number as a categorical variable
  • Calculate the mean IMDb rating at the season level
  • Scale the total number of votes per episode into the range $[10, 30]$
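# Sort the data and add an episode index column.
# sort_values returns a new dataframe, so the result must be assigned back.
office_data = office_data.sort_values(by=["season", "episode"]).reset_index(drop=True)
office_data["episode_idx"] = np.arange(1, len(office_data) + 1)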

# Recast the season variable as a category
office_data["season_category"] = office_data["season"].astype("category")

# Calculate the average IMDb rating per season
office_data["avg_season_rating"] = (
    office_data["imdb_rating"].groupby(office_data["season"]).transform("mean")
)

# Rescale the total number of votes
office_data["scaled_total_votes"] = np.interp(
    office_data["total_votes"],
    (office_data["total_votes"].min(), office_data["total_votes"].max()),
    (10, 30),
)
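
To see what the last two steps do, here is a small sketch on made-up numbers (the `toy` dataframe and its values are invented purely for illustration). `transform("mean")` returns one value per row rather than one per group, which is why it can be assigned straight back as a column, and `np.interp` maps the smallest vote count to 10 and the largest to 30, with everything else placed linearly in between.

```python
import numpy as np
import pandas as pd

# Toy data: two seasons with made-up ratings and vote counts.
toy = pd.DataFrame(
    {
        "season": [1, 1, 2, 2],
        "imdb_rating": [7.0, 9.0, 8.0, 8.5],
        "total_votes": [1000, 2000, 3000, 5000],
    }
)

# transform("mean") broadcasts each season's mean back onto its rows.
toy["avg_season_rating"] = toy.groupby("season")["imdb_rating"].transform("mean")

# np.interp maps [min, max] linearly onto [10, 30].
votes = toy["total_votes"]
toy["scaled_total_votes"] = np.interp(votes, (votes.min(), votes.max()), (10, 30))
print(toy)
```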

The data is now in a form that we can work with. Before we start making the plot, though, we should define a couple of variables up front. The only two we’ll need are a list of hex colours that I’ve lifted from Cédric’s original plotting code, and a unique list of season numbers. I’ll also map the colours into our main dataframe as an additional column; this will come in handy later on when we scatter the observations as points.

seasons = office_data["season_category"].drop_duplicates()
colours = (
    "#486090",
    "#D7BFA6",
    "#6078A8",
    "#9CCCCC",
    "#7890A8",
    "#C7B0C1",
    "#B5C9C9",
    "#90A8C0",
    "#A8A890",
)
# Map each season number (starting at 1) to its colour
col_map = {i: c for i, c in enumerate(colours, start=1)}
office_data["colour"] = office_data["season"].map(col_map)

The final piece of wrangling is a bit of a hack: we must introduce some pseudo-spacing on the x-axis wherever the data jumps from the last episode of one season to the first episode of the next. This is purely aesthetic and ensures that there is a visible gap between seasons. To do this, we’ll loop over the episode index created earlier and shift it by a further 3 units each time the season number changes.

xpoints = []
first_season = 1
addition = 0
xinc = 3
for idx, row in office_data.iterrows():
    season = row["season"]
    if season == first_season:
        label = row["episode_idx"]
    else:
        first_season += 1
        label = row["episode_idx"]
        addition += xinc
    xpoints.append(label + addition)
office_data["x_points"] = xpoints
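
If you’d rather avoid the explicit loop, the same offsets can be computed in a vectorised way. The sketch below is an alternative I’ve written on a made-up `toy` dataframe, not the code used for the final plot. It counts how many season boundaries precede each row and multiplies that count by the gap size; for sorted data with consecutive season numbers, as we have here, it should give the same result as the loop.

```python
import pandas as pd

# Toy frame: three seasons with 2, 3, and 1 episodes, already sorted.
toy = pd.DataFrame({"season": [1, 1, 2, 2, 2, 3]})
toy["episode_idx"] = range(1, len(toy) + 1)

xinc = 3  # gap added at each season boundary, as in the loop above

# Number of season boundaries before each row: 0 for the first season,
# 1 for the second, and so on.
season_jumps = toy["season"].ne(toy["season"].shift()).cumsum() - 1
toy["x_points"] = toy["episode_idx"] + xinc * season_jumps
print(toy["x_points"].tolist())  # [1, 2, 6, 7, 8, 12]
```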

Custom Font

The original plot uses the Special Elite font. I had no idea that one could load custom .ttf files for use in a figure, but it turns out to be incredibly simple: the following snippet downloads the font’s .ttf file, writes it to a temporary file, and loads it into a Matplotlib FontProperties object.

font_url = "https://github.com/jenskutilek/free-fonts/blob/master/Special%20Elite/TTF/SpecialElite.ttf?raw=true"
response = urllib.request.urlopen(font_url)
f = NamedTemporaryFile(delete=False, suffix=".ttf")
f.write(response.read())
f.close()

font_prop = fm.FontProperties(fname=f.name)
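
To check that a font loaded this way has actually been picked up, you can ask the FontProperties object for its name. The sketch below wraps the write-to-temp-file step in a helper, `font_from_bytes`, which is a hypothetical name of my own and not a Matplotlib function. So that it runs offline, it uses the DejaVu Sans font that ships with Matplotlib as a stand-in for the downloaded bytes.

```python
from pathlib import Path
from tempfile import NamedTemporaryFile

import matplotlib.font_manager as fm


def font_from_bytes(data: bytes) -> fm.FontProperties:
    """Write raw .ttf bytes to a temporary file and wrap it in FontProperties."""
    # delete=False keeps the file on disk after closing so Matplotlib can read it.
    with NamedTemporaryFile(delete=False, suffix=".ttf") as f:
        f.write(data)
    return fm.FontProperties(fname=f.name)


# Stand-in for the downloaded bytes: DejaVu Sans is bundled with Matplotlib.
bundled_path = fm.findfont("DejaVu Sans")
font_prop_check = font_from_bytes(Path(bundled_path).read_bytes())
print(font_prop_check.get_name())
```

With the real download, you would pass `response.read()` to the helper instead of the bundled font’s bytes.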

Making the Plot

We can now make the plot! The plot has several layers, so rather than interrupting the code with commentary, I’ll list the steps here first:

  • Define our figure’s canvas and the box properties that will be used for labelling each season.
  • Create some lightly coloured horizontal lines at 0.5 increments.
  • Loop over each season’s subset of data and do the following.
    • Create a horizontal line to represent the season’s mean rating score.
    • Softly round the line’s edges using a point placed at each end of the line.
    • Add a centred label above the horizontal line to indicate the corresponding season number.
    • Each of these elements uses that season’s unique colour.
  • Plot a point per episode for the corresponding IMDb review score.
  • Create a vertical line that connects each episode’s score to its season’s mean score. These stems are why the plot is called a lollipop chart!
  • Remove ticks from the x-axis, as they only correspond to an arbitrary index and so aren’t particularly useful here.
  • Set the plot’s background colour.
  • Label the plot’s y-axis.
  • Despine all but the left-hand spine.
  • Add a caption to the plot.
fig, ax = plt.subplots(figsize=(16, 7))
props = dict(boxstyle="round,pad=0.5", facecolor="none")

# Create horizontal rule lines at 0.5 increments from 6.5 to 10
for y in np.linspace(6.5, 10, num=8):
    ax.axhline(y=y, color="lightgray", alpha=0.4)

for season, colour in zip(seasons, colours):
    season_df = office_data[office_data["season_category"] == season]
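    # Plot mean line. The mean is constant within a season, so we take the
    # first value rather than drawing one overlapping line per episode.
    ax.hlines(
        y=season_df["avg_season_rating"].iloc[0],
        xmin=season_df["x_points"].min() - 1,
        xmax=season_df["x_points"].max() + 1,
        color=colour,
        linewidth=5,
        alpha=1.0,
    )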

    # Slight hack to "round" the corners of this line
    ax.plot(
        season_df["x_points"].min() - 0.95,
        season_df["avg_season_rating"].min(),
        "o",
        color=colour,
        markersize=4,
    )
    ax.plot(
        season_df["x_points"].max() + 0.95,
        season_df["avg_season_rating"].min(),
        "o",
        color=colour,
        markersize=4,
    )

    # Add season label above point selection
    props["edgecolor"] = colour
    ax.text(
        season_df["x_points"].mean() - 6,
        10.24,
        f"Season {season}",
        fontsize=10,
        verticalalignment="top",
        bbox=props,
        color=colour,
        fontproperties=font_prop,
        alpha=1.0,
    )

ax.scatter(
    office_data["x_points"],
    office_data["imdb_rating"],
    office_data["scaled_total_votes"],
    c=office_data["colour"],
)

for idx, row in office_data.iterrows():
    mean, sample = row["avg_season_rating"], row["imdb_rating"]
    points = np.sort((mean, sample))
    ax.vlines(
        x=row["x_points"],
        ymin=points[0],
        ymax=points[1],
        color=colours[row["season"] - 1],
        linewidth=1.0,
    )

# Turn off x-axis labels
ax.set_xticks([])

# Set plot background
ax.set_facecolor("#fafaf5")
fig.patch.set_facecolor("#fafaf5")

# Create labels
ax.set_ylabel("IMDb Rating", font_properties=font_prop)
ax.set_ylim(6.4, 10.4)


# Despine the plot
ax.spines["bottom"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)


# Add caption
fig.text(
    0.5,
    0.08,
    "Visualisation inspired by Cédric Scherer  •  Data by IMDb via data.world",
    ha="center",
    font_properties=font_prop,
)

[Figure: the finished lollipop plot of The Office US IMDb reviews]
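
Stripped of the styling, the core of a lollipop chart is just three calls: `vlines` for the stems, `scatter` for the points, and `hlines` for the mean. Here is a minimal sketch for a single season with made-up ratings (the `_toy` names and the numbers are invented for illustration). Note that it draws all of the stems in one vectorised `vlines` call rather than looping over rows as I did above; that is a simplification on my part and I have not checked it against the full figure.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up ratings for a single season.
x_toy = np.arange(1, 7)
ratings_toy = np.array([7.6, 8.3, 7.9, 8.1, 8.4, 7.8])
mean_toy = ratings_toy.mean()

fig_toy, ax_toy = plt.subplots(figsize=(6, 3))

# The stems: one vertical line from the season mean to each episode's rating.
ax_toy.vlines(
    x_toy,
    np.minimum(ratings_toy, mean_toy),
    np.maximum(ratings_toy, mean_toy),
    color="#486090",
    linewidth=1.0,
)
# The points: one marker per episode.
ax_toy.scatter(x_toy, ratings_toy, s=30, color="#486090")
# The season mean as a thick horizontal line.
ax_toy.hlines(mean_toy, x_toy.min() - 1, x_toy.max() + 1, color="#486090", linewidth=5)

ax_toy.set_xticks([])
ax_toy.set_ylabel("IMDb Rating")
fig_toy.canvas.draw()
```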

Conclusion

That’s it, we’re done! I’m pretty happy with how this turned out, and it’s been an interesting plot to make because it’s simply a collection of carefully placed circles and lines. This is certainly a refreshing way to think about creating plots, as it removes any mental blockers.

%watermark -n -u -v -iv -w -a 'Thomas Pinder'
Author: Thomas Pinder

Last updated: Sat Nov 06 2021

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 7.29.0

pandas    : 1.3.4
numpy     : 1.21.4
matplotlib: 3.4.3

Watermark: 2.2.0