How to Visualize Simply with Python

title

In this blog post, we will be visualizing data from the Palmer Penguins Dataset, which provides us details about penguins across three unique species.

Getting the Data

We will obtain the data from an online github repository with read.csv().

# importing the data 
import pandas as pd
import numpy as np
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguins.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/07	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/07	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	11/16/07	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	11/16/07	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	11/16/07	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
339	PAL0910	120	Gentoo penguin (Pygoscelis papua)	Anvers	Biscoe	Adult, 1 Egg Stage	N38A2	No	12/1/09	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
340	PAL0910	121	Gentoo penguin (Pygoscelis papua)	Anvers	Biscoe	Adult, 1 Egg Stage	N39A1	Yes	11/22/09	46.8	14.3	215.0	4850.0	FEMALE	8.41151	-26.13832	NaN
341	PAL0910	122	Gentoo penguin (Pygoscelis papua)	Anvers	Biscoe	Adult, 1 Egg Stage	N39A2	Yes	11/22/09	50.4	15.7	222.0	5750.0	MALE	8.30166	-26.04117	NaN
342	PAL0910	123	Gentoo penguin (Pygoscelis papua)	Anvers	Biscoe	Adult, 1 Egg Stage	N43A1	Yes	11/22/09	45.2	14.8	212.0	5200.0	FEMALE	8.24246	-26.11969	NaN
343	PAL0910	124	Gentoo penguin (Pygoscelis papua)	Anvers	Biscoe	Adult, 1 Egg Stage	N43A2	Yes	11/22/09	49.9	16.1	213.0	5400.0	MALE	8.36390	-26.15531	NaN

344 rows × 17 columns

Cleaning the Data

Before me move on to perform advanced visualizations functions, we need to make sure our data is nice and clean to avoid errors and unintended outputs.

We are going to create a function clean_data that will:

Relabel Species column to enhance readability to the viewer (For example: “Adelie Penguin (Pygoscelis adelie)” is relabeled to “Adelie”)
Relabel Sex column (“MALE” is relabeled to “M” and “FEMALE” is relabeled to “F”)
Return a clean dataset

def clean_data(df):
    
    """
    This function does the following:

    1. Relabel Species column to enhance readability to the viewer (For example: "Adelie Penguin (Pygoscelis adelie)" is relabeled to "Adelie")
    2. Relabel Sex column ("MALE" is relabeled to "M" and "FEMALE" is relabeled to "F")
    3. Return a clean dataset 

    """
    
    # create a copy 
    df_copy = df.copy()
    
    # making the labels using dictionaries 
    species_labels = { 'Adelie Penguin (Pygoscelis adeliae)': 'Adelie',
                       'Chinstrap penguin (Pygoscelis antarctica)': 'Chinstrap',
                       'Gentoo penguin (Pygoscelis papua)': 'Gentoo' }
    
    sex_labels = { "MALE": 'M',
                   "FEMALE": 'F',
                   ".": np.nan }
    
    # relabeling the columns 
    df_copy["Species"] =  df_copy["Species"].map(species_labels)
    df_copy["Sex"] = df_copy["Sex"].map(sex_labels)
    
    return df_copy

# the new clean and shiny dataframe
df = clean_data(penguins)
df.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	11/11/07	39.1	18.7	181.0	3750.0	M	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	11/11/07	39.5	17.4	186.0	3800.0	F	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	11/16/07	40.3	18.0	195.0	3250.0	F	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	11/16/07	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	11/16/07	36.7	19.3	193.0	3450.0	F	8.76651	-25.32426	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
339	PAL0910	120	Gentoo	Anvers	Biscoe	Adult, 1 Egg Stage	N38A2	No	12/1/09	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
340	PAL0910	121	Gentoo	Anvers	Biscoe	Adult, 1 Egg Stage	N39A1	Yes	11/22/09	46.8	14.3	215.0	4850.0	F	8.41151	-26.13832	NaN
341	PAL0910	122	Gentoo	Anvers	Biscoe	Adult, 1 Egg Stage	N39A2	Yes	11/22/09	50.4	15.7	222.0	5750.0	M	8.30166	-26.04117	NaN
342	PAL0910	123	Gentoo	Anvers	Biscoe	Adult, 1 Egg Stage	N43A1	Yes	11/22/09	45.2	14.8	212.0	5200.0	F	8.24246	-26.11969	NaN
343	PAL0910	124	Gentoo	Anvers	Biscoe	Adult, 1 Egg Stage	N43A2	Yes	11/22/09	49.9	16.1	213.0	5400.0	M	8.36390	-26.15531	NaN

344 rows × 17 columns

But What to Visualize?

There are many visualizations we can attempt with this data in order to facilitate our understanding of these penguins. But for now, let’s assess how big each species is relative to each other. We can get valuable insights to answer this question by analyzing the Body Mass (g) feature of our dataset.

We can also analyze if being on certain islands affects the size of our penguins.

The Task: We will plot 3 figures representing each island and visualize penguin mass with histograms for each species within the figure.

Let’s Start to Visualize!

Let’s import our powerful visualization packages first, Seaborn & Matplotlib, and start visualizing penguins!

import seaborn as sns
from matplotlib import pyplot as plt

1. Make a figure.

We make a figure in which are plots will be embedded in.

This is done by the command:

#Fill in the figsize parameter to set the (width,height) of the figure.
plt.figure(figsize = (10,20)). 

Now, we have a space ready to fill in with our data visualizations.

2. Make sub plots if you want multiple plots.

Our aim is visualize 3 histograms together so we will use a structure with 1 row and 3 columns; the first plot is represented by:

plt.subplots(1,3,1) # (rows, columns, the plot)

3. Start Plotting!

Before we move on to plot multi-histograms in one figure, let’s try and learn the basics of plotting a simple histogram first.

Let’s plot a simple histogram visualizing culmen lengths of penguins using our seaborn package.

# make the figure
plt.figure(figsize = (10,5))

# make the plot
plot = sns.histplot(data = df, 
                    x = "Culmen Length (mm)", 
                    color = "yellow" 
                    )
    
# set the title of the plot
plot.set_title("Culmen Length of Penguins") 

sns.histplot calls the histplot function of the seaborn package.

We have given this function the following parameters:

data: specify the dataframe
x: the column to plot
color: the color of the histogram

title

Preferably, we would want to see how the distribution of Culmen Length (mm) is different for different species. Let’s add color to differentiate among different species.

# make the figure
plt.figure(figsize = (10,5)) 

plot = sns.histplot(data = df, 
                    x = "Culmen Length (mm)", 
                    hue = "Species", 
                    palette = dict( Adelie = "yellow", Gentoo = "blue", Chinstrap = "red")
                    )
# set the title of the plot
plot.set_title("Culmen Length of Penguins") 

Explanation of the new parameters:

hue: groups Species by colour
palette: choose your favorite colors!

title

The plot is still a bit hard to read…let’s change the boxy shape of the bins to better visualize the variation between species better.

# make the figure
plt.figure(figsize = (10,5)) 

plot = sns.histplot(data = df,
                     x = "Culmen Length (mm)", 
                     hue = "Species", 
                     element = "poly", 
                     palette = dict( Adelie = "yellow", Gentoo = "blue", Chinstrap = "green") 
                     )
# set the title of the plot
plot.set_title("Culmen Length of Penguins") 

title

Now that is much better!

We can now explore some more advanced seaborn features and combine them to produce the visualization below to answer key question we posed earlier in the blog:

The Task: Plotting `Body Mass(g)` of Penguins Across Islands

plt.figure(figsize = (20,5))

# a list containing the unique island names
islands = list(set(df["Island"])) 

# create the visualization

#iterate over the subplots
for i in range(1, len(islands) + 1): 
    
    plt.subplot(1,3,i) # the subplot in the grid
    
    # get the subset of data according to each island
    sub_df = df[df["Island"] == islands[i-1]] 
    
    plot = sns.histplot(data = sub_df, 
                        x = "Body Mass (g)", 
                        hue = "Species", 
                        element = "poly", 
                        palette = dict( Adelie = "yellow", Gentoo = "blue", Chinstrap = "green")) 
    
    # we want the every x-axis to have the same scale to compare the penguins 
    plot.set(xlim =(2500,6500)) 

    # set title of subplot to the island name 
    plt.title(islands[i-1])    
    
    # Plotting the vertical lines for mean body mass of Adelie Penguins

    # mean of Body Mass of Adelie penguins
    group = sub_df.groupby("Species")["Body Mass (g)"].mean() 

    # add a vertical line
    plt.axvline(group[0], ls = '--', color = 'black', label = "Adelie mean") 
    
    # adding text
    if i == 1:
        plt.text(4750, 12, "The vertical line is the mean \nbody mass for Adelie Penguins", size = 10, style = 'italic')

Explanation of the new arguments and functions:

element: smoothen the boxy histogram
plt.axvline: add a vertical line at the given coords
plt.text: add text to the plot

title

Written on April 5, 2021

Data + Code

How to Visualize Simply with Python

Getting the Data

Cleaning the Data

But What to Visualize?

Let’s Start to Visualize!

1. Make a figure.

2. Make sub plots if you want multiple plots.

3. Start Plotting!

Preferably, we would want to see how the distribution of `Culmen Length (mm)` is different for different species. Let’s add color to differentiate among different species.

The plot is still a bit hard to read…let’s change the boxy shape of the bins to better visualize the variation between species better.

The Task: Plotting `Body Mass(g)` of Penguins Across Islands

How to Visualize Simply with Python

Getting the Data

Cleaning the Data

But What to Visualize?

Let’s Start to Visualize!

1. Make a figure.

2. Make sub plots if you want multiple plots.

3. Start Plotting!

Preferably, we would want to see how the distribution of Culmen Length (mm) is different for different species. Let’s add color to differentiate among different species.

The plot is still a bit hard to read…let’s change the boxy shape of the bins to better visualize the variation between species better.

The Task: Plotting Body Mass(g) of Penguins Across Islands

Preferably, we would want to see how the distribution of `Culmen Length (mm)` is different for different species. Let’s add color to differentiate among different species.

The Task: Plotting `Body Mass(g)` of Penguins Across Islands