How to Visualize Simply with Python

title

In this blog post, we will be visualizing data from the Palmer Penguins Dataset, which provides us details about penguins across three unique species.

Getting the Data

We will obtain the data from an online github repository with read.csv().

# importing the data 
import pandas as pd
import numpy as np
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguins.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339 PAL0910 120 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N38A2 No 12/1/09 NaN NaN NaN NaN NaN NaN NaN NaN
340 PAL0910 121 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N39A1 Yes 11/22/09 46.8 14.3 215.0 4850.0 FEMALE 8.41151 -26.13832 NaN
341 PAL0910 122 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N39A2 Yes 11/22/09 50.4 15.7 222.0 5750.0 MALE 8.30166 -26.04117 NaN
342 PAL0910 123 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N43A1 Yes 11/22/09 45.2 14.8 212.0 5200.0 FEMALE 8.24246 -26.11969 NaN
343 PAL0910 124 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N43A2 Yes 11/22/09 49.9 16.1 213.0 5400.0 MALE 8.36390 -26.15531 NaN

344 rows × 17 columns

Cleaning the Data

Before me move on to perform advanced visualizations functions, we need to make sure our data is nice and clean to avoid errors and unintended outputs.

We are going to create a function clean_data that will:

  • Relabel Species column to enhance readability to the viewer (For example: “Adelie Penguin (Pygoscelis adelie)” is relabeled to “Adelie”)

  • Relabel Sex column (“MALE” is relabeled to “M” and “FEMALE” is relabeled to “F”)

  • Return a clean dataset

def clean_data(df):
    
    """
    This function does the following:

    1. Relabel Species column to enhance readability to the viewer (For example: "Adelie Penguin (Pygoscelis adelie)" is relabeled to "Adelie")
    2. Relabel Sex column ("MALE" is relabeled to "M" and "FEMALE" is relabeled to "F")
    3. Return a clean dataset 

    """
    
    # create a copy 
    df_copy = df.copy()
    
    # making the labels using dictionaries 
    species_labels = { 'Adelie Penguin (Pygoscelis adeliae)': 'Adelie',
                       'Chinstrap penguin (Pygoscelis antarctica)': 'Chinstrap',
                       'Gentoo penguin (Pygoscelis papua)': 'Gentoo' }
    
    sex_labels = { "MALE": 'M',
                   "FEMALE": 'F',
                   ".": np.nan }
    
    # relabeling the columns 
    df_copy["Species"] =  df_copy["Species"].map(species_labels)
    df_copy["Sex"] = df_copy["Sex"].map(sex_labels)
    
    return df_copy
# the new clean and shiny dataframe
df = clean_data(penguins)
df.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 M NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 F 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 F 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 F 8.76651 -25.32426 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339 PAL0910 120 Gentoo Anvers Biscoe Adult, 1 Egg Stage N38A2 No 12/1/09 NaN NaN NaN NaN NaN NaN NaN NaN
340 PAL0910 121 Gentoo Anvers Biscoe Adult, 1 Egg Stage N39A1 Yes 11/22/09 46.8 14.3 215.0 4850.0 F 8.41151 -26.13832 NaN
341 PAL0910 122 Gentoo Anvers Biscoe Adult, 1 Egg Stage N39A2 Yes 11/22/09 50.4 15.7 222.0 5750.0 M 8.30166 -26.04117 NaN
342 PAL0910 123 Gentoo Anvers Biscoe Adult, 1 Egg Stage N43A1 Yes 11/22/09 45.2 14.8 212.0 5200.0 F 8.24246 -26.11969 NaN
343 PAL0910 124 Gentoo Anvers Biscoe Adult, 1 Egg Stage N43A2 Yes 11/22/09 49.9 16.1 213.0 5400.0 M 8.36390 -26.15531 NaN

344 rows × 17 columns

But What to Visualize?

There are many visualizations we can attempt with this data in order to facilitate our understanding of these penguins. But for now, let’s assess how big each species is relative to each other. We can get valuable insights to answer this question by analyzing the Body Mass (g) feature of our dataset.

We can also analyze if being on certain islands affects the size of our penguins.

The Task: We will plot 3 figures representing each island and visualize penguin mass with histograms for each species within the figure.

Let’s Start to Visualize!

Let’s import our powerful visualization packages first, Seaborn & Matplotlib, and start visualizing penguins!

import seaborn as sns
from matplotlib import pyplot as plt

1. Make a figure.

We make a figure in which are plots will be embedded in.

This is done by the command:

#Fill in the figsize parameter to set the (width,height) of the figure.
plt.figure(figsize = (10,20)). 

Now, we have a space ready to fill in with our data visualizations.

2. Make sub plots if you want multiple plots.

Our aim is visualize 3 histograms together so we will use a structure with 1 row and 3 columns; the first plot is represented by:

plt.subplots(1,3,1) # (rows, columns, the plot)

3. Start Plotting!

Before we move on to plot multi-histograms in one figure, let’s try and learn the basics of plotting a simple histogram first.

Let’s plot a simple histogram visualizing culmen lengths of penguins using our seaborn package.

# make the figure
plt.figure(figsize = (10,5))

# make the plot
plot = sns.histplot(data = df, 
                    x = "Culmen Length (mm)", 
                    color = "yellow" 
                    )
    
# set the title of the plot
plot.set_title("Culmen Length of Penguins") 

sns.histplot calls the histplot function of the seaborn package.

We have given this function the following parameters:

  • data: specify the dataframe
  • x: the column to plot
  • color: the color of the histogram

title

  • Preferably, we would want to see how the distribution of Culmen Length (mm) is different for different species. Let’s add color to differentiate among different species.

# make the figure
plt.figure(figsize = (10,5)) 

plot = sns.histplot(data = df, 
                    x = "Culmen Length (mm)", 
                    hue = "Species", 
                    palette = dict( Adelie = "yellow", Gentoo = "blue", Chinstrap = "red")
                    )
# set the title of the plot
plot.set_title("Culmen Length of Penguins") 

Explanation of the new parameters:

  • hue: groups Species by colour
  • palette: choose your favorite colors!

title

  • The plot is still a bit hard to read…let’s change the boxy shape of the bins to better visualize the variation between species better.

# make the figure
plt.figure(figsize = (10,5)) 

plot = sns.histplot(data = df,
                     x = "Culmen Length (mm)", 
                     hue = "Species", 
                     element = "poly", 
                     palette = dict( Adelie = "yellow", Gentoo = "blue", Chinstrap = "green") 
                     )
# set the title of the plot
plot.set_title("Culmen Length of Penguins") 

title

Now that is much better!

We can now explore some more advanced seaborn features and combine them to produce the visualization below to answer key question we posed earlier in the blog:

The Task: Plotting Body Mass(g) of Penguins Across Islands

plt.figure(figsize = (20,5))

# a list containing the unique island names
islands = list(set(df["Island"])) 

# create the visualization

#iterate over the subplots
for i in range(1, len(islands) + 1): 
    
    plt.subplot(1,3,i) # the subplot in the grid
    
    # get the subset of data according to each island
    sub_df = df[df["Island"] == islands[i-1]] 
    
    plot = sns.histplot(data = sub_df, 
                        x = "Body Mass (g)", 
                        hue = "Species", 
                        element = "poly", 
                        palette = dict( Adelie = "yellow", Gentoo = "blue", Chinstrap = "green")) 
    
    # we want the every x-axis to have the same scale to compare the penguins 
    plot.set(xlim =(2500,6500)) 

    # set title of subplot to the island name 
    plt.title(islands[i-1])    
    
    # Plotting the vertical lines for mean body mass of Adelie Penguins

    # mean of Body Mass of Adelie penguins
    group = sub_df.groupby("Species")["Body Mass (g)"].mean() 

    # add a vertical line
    plt.axvline(group[0], ls = '--', color = 'black', label = "Adelie mean") 
    
    # adding text
    if i == 1:
        plt.text(4750, 12, "The vertical line is the mean \nbody mass for Adelie Penguins", size = 10, style = 'italic')

Explanation of the new arguments and functions:

  • element: smoothen the boxy histogram

  • plt.axvline: add a vertical line at the given coords

  • plt.text: add text to the plot

title

Written on April 5, 2021