workshops

Building Plots with Matplotlib

Matplotlib in Python is based on a procedural method of building plots – you sequentially add elements to a plot. To construct a graphic using Matplotlib, follow these general steps:

Levels of Measurement & Data Types in Python with Matplotlib

When visualizing data with Matplotlib in Python, it’s important to choose the appropriate plot type based on the level of measurement of your data.

Level Order Description Example Visualization Type Python Data Type
Nominal No Classifies data into distinct categories. Marital Status Bar chart, Pie chart object, category
Ordinal Yes Categorizes data with an inherent order. Education Level Ordered Bar chart, Line plot category (ordered)
Interval Yes Measures differences, not ratios. Zero is not true zero. Temperature (°C) Scatter plot, Line plot int64, float64
Ratio Yes True zero, allowing for meaningful ratios. Height, Weight Scatter plot, Histogram, Line plot int64, float64

Visualization Considerations

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
gapminder = pd.read_csv('https://raw.githubusercontent.com/csc-ubc-okanagan/workshops/a091bc6eae8b9045866c28dbd1848c7e072db5b1/data/gapminder.csv')
gapminder.to_csv('gapminder.csv', index=False)
plt.figure(figsize=(10, 6))
<Figure size 1000x600 with 0 Axes>




<Figure size 1000x600 with 0 Axes>
# Filter the data for the year 2007
gm_2007 = gapminder[gapminder['year'] == 2007]
gm_2007.head()
country continent year lifeExp pop gdpPercap
11 Afghanistan Asia 2007 43.828 31889923 974.580338
23 Albania Europe 2007 76.423 3600523 5937.029526
35 Algeria Africa 2007 72.301 33333216 6223.367465
47 Angola Africa 2007 42.731 12420476 4797.231267
59 Argentina Americas 2007 75.320 40301927 12779.379640

Creating Scatter plots

sns.scatterplot(
    x=None,
    y=None,
    data=None,
    s=None,
    alpha=None,
    hue=None,
    palette=None,
    style=None
)



- `x`: The data for the x-axis.
- `y`: The data for the y-axis.
- `data`: The DataFrame or data source containing the variables.
- `s`: Specifies the marker size.
- `alpha`: Sets the transparency (opacity) of markers.
- `hue`: Groups data points by a categorical variable and assigns different colors to each group.
- `palette`: Defines the color palette to use for `hue` groups.
- `style`: Groups data points by a categorical variable and assigns different marker styles to each group.



```python
sns.scatterplot(data=gm_2007, x='gdpPercap', y='lifeExp')
<Axes: xlabel='gdpPercap', ylabel='lifeExp'>

png

# Basic scatter plot
sns.scatterplot(data=gm_2007, x='gdpPercap', y='lifeExp')
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007: Life Expectancy vs GDP per Capita')
plt.show()

png

Creating Bar Charts with matplotlib and seaborn

In matplotlib and seaborn, you can create bar charts similar to ggplot2’s geom_bar() and geom_col(), respectively:

Creating a Bar Chart with a Count of Categories

To create a bar chart when you have a count of categories (like geom_bar() in ggplot2):

# Create a bar chart using seaborn (count of categories)
plt.figure(figsize=(8, 6))
sns.countplot(data=gm_2007, x='continent')
plt.xlabel('Continent')
plt.ylabel('Number of Countries')
plt.title('Number of Countries per Continent in 2007')
plt.show()

png

Creating a Bar Chart with a Tally per Category

To create a bar chart when you have a tally per category (similar to geom_col() in ggplot2):

filtered_data = gapminder[gapminder['year'] == 2007]['continent']
country_freqtable = filtered_data.value_counts().reset_index()
country_freqtable.columns = ['continent', 'freq']
country_freqtable
continent freq
0 Africa 52
1 Asia 33
2 Europe 30
3 Americas 25
4 Oceania 2
# Basic bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='continent', y='freq', data=country_freqtable)
plt.xlabel('Continent')
plt.ylabel('Frequency')
plt.title('Frequency of Countries per Continent')
plt.show()

png

sns.scatterplot(
    x=None,
    y=None,
    data=None,
    s=None,
    alpha=None,
    hue=None,
    palette=None,
    style=None
)

### Changing Hue, Size and Alpha parameters in scatterplot
- `hue`: Assigns different colors to data points based on a categorical variable, aiding in group distinction.

- `size`: Controls marker size, allowing emphasis or representation of a numeric variable.

- `alpha`: Adjusts marker transparency, managing overlapping points and enhancing visibility.



```python
# Create a scatter plot with color and size mappings
plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop', data=gm_2007)
plt.title('Scatter Plot with Color and Size Mappings')
plt.show()

png

plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='continent', data=gm_2007, sizes=(10, 200))
plt.title('Scatter Plot with Size Mapping to a Categorical Variable')
plt.show()

png

plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', data=gm_2007, s=100)  # Set 's' for point size
plt.title('Color Mapped to Continent and Uniform Size')
plt.show()

png

Scales and Labels in Seaborn

Adjusting Scales

In seaborn, you can adjust scales for variables like color, size, and axes. Unlike R’s ggplot2, seaborn doesn’t use RColorBrewer directly, but it offers various color palettee x-axis plt.show()

sns.color_palette()
# List the names of available color palettes in Seaborn
palette_names = sns.color_palette().as_hex()

# Print the list of palette names
print(palette_names)
['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']

Adjusting Scales

In the following code snippet, we apply a logarithmic transformation to the x-axis variable:

# Create a scatter plot with adjusted scales
plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop', data=gm_2007, palette='Dark2', sizes=(10, 200))
plt.xscale('log')  # Apply log transformation to the x-axis
plt.show()

png

Creating a scatter plot with customized labels, title, and subtitle

# Create a scatter plot with customized labels, title, and subtitle
plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop', data=gm_2007,palette='Dark2', sizes=(10, 200))
plt.xscale('log')

# Customize labels and title
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007')

# Add the subtitle at the bottom of the plot with smaller font size
plt.figtext(0.5, 0.02, '2007 life expectancy plotted against per capita GDP.', fontsize=10, ha='center')

# Display the plot
plt.legend()
plt.show()

png

Formatting Axis

from matplotlib.ticker import StrMethodFormatter
# Create a scatter plot with Seaborn
plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop', data=gm_2007, palette='Dark2', sizes=(10, 200))
plt.xscale('log')

# Customize labels and title
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007')

# Add the subtitle at the bottom of the plot
plt.figtext(0.5, 0.02, '2007 life expectancy plotted against per capita GDP.', fontsize=10, ha='center')

# Automatically format x-axis ticks as currency (e.g., $10,000)
formatter = StrMethodFormatter('${x:,.0f}')
plt.gca().xaxis.set_major_formatter(formatter)

# Automatically determine x-axis tick intervals
plt.gca().xaxis.set_major_locator(plt.AutoLocator())

# Display the plot
plt.legend()
plt.show()

png

The line sns.set_theme(style="whitegrid") in Seaborn is used to set the visual theme for plots. “whitegrid” is one of the built-in themes that configures the plot with a white background and grid lines, creating a clean and minimalistic appearance that helps emphasize the data points.

# Create a scatter plot with Seaborn
plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop', data=gm_2007, palette='Dark2', sizes=(10, 200))
plt.xscale('log')

# Customize labels and title
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007')

# Add the subtitle at the bottom of the plot
plt.figtext(0.5, 0.02, '2007 life expectancy plotted against per capita GDP.', fontsize=10, ha='center')

# Automatically format x-axis ticks as currency (e.g., $10,000)
formatter = plt.FuncFormatter(lambda x, _: '${:,.0f}'.format(x))
plt.gca().xaxis.set_major_formatter(formatter)

# Automatically determine x-axis tick intervals
plt.gca().xaxis.set_major_locator(plt.AutoLocator())

# Display the plot
plt.legend()

sns.set_theme(style="whitegrid")

plt.show()

png

Customizing the legend position and axis text rotation

Customizing the legend position and axis text rotation, as well as removing minor grid lines, are done at the level of the individual plot functions and not within the theme settings. These customizations are specific to the current plot and affect its appearance and behavior.

# Set Seaborn theme to "whitegrid" (you can choose other themes as well)
#sns.set_theme(style="whitegrid")

# Create a scatter plot with Seaborn
plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop', data=gm_2007, palette='Dark2', sizes=(10, 200))
plt.xscale('log')

# Customize labels and title
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007')

# Add the subtitle at the bottom of the plot
plt.figtext(0.5, 0.02, '2007 life expectancy plotted against per capita GDP.', fontsize=10, ha='center')

# Automatically format x-axis ticks as currency (e.g., $10,000)
formatter = plt.FuncFormatter(lambda x, _: '${:,.0f}'.format(x))
plt.gca().xaxis.set_major_formatter(formatter)

# Automatically determine x-axis tick intervals
plt.gca().xaxis.set_major_locator(plt.AutoLocator())

# Customize the legend position and axis text rotation
plt.legend(loc='upper left')
plt.xticks(rotation=45)  # Rotate x-axis labels by 45 degrees


# Display the plot
plt.show()

png

Statisical overlays

plt.figure(figsize=(10, 8))
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop',
                data=gm_2007, palette='Dark2', sizes=(10, 200))
plt.xscale('log')

# Add a linear regression line (overlay)
# Using logx=True to account for the log-scaled x-axis
sns.regplot(x='gdpPercap', y='lifeExp', data=gm_2007, scatter=False,
            logx=True, color='black', truncate=False)

# Customize labels and title
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007')

# Add the subtitle at the bottom of the plot
plt.figtext(0.5, 0.02, '2007 life expectancy plotted against per capita GDP.', fontsize=10, ha='center')

# Automatically format x-axis ticks as currency (e.g., $10,000)
formatter = plt.FuncFormatter(lambda x, _: f'${x:,.0f}')
plt.gca().xaxis.set_major_formatter(formatter)

# Automatically determine x-axis tick intervals
plt.gca().xaxis.set_major_locator(plt.AutoLocator())

# Customize the legend position and axis text rotation
plt.legend(title='Continent', loc='upper left')
plt.xticks(rotation=45)  # Rotate x-axis labels by 45 degrees

# Remove minor grid lines
plt.grid(axis='y', which='minor', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

png

gm_2007.info()
<class 'pandas.core.frame.DataFrame'>
Index: 142 entries, 11 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    142 non-null    object
 1   continent  142 non-null    object
 2   year       142 non-null    int64  
 3   lifeExp    142 non-null    float64
 4   pop        142 non-null    int64  
 5   gdpPercap  142 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 7.8+ KB

More than one plot: FacetGrid


# Set the aesthetic style of the plots
sns.set_theme(style="whitegrid")

# Create a FacetGrid for different continents
g = sns.FacetGrid(gm_2007, col="continent", hue="continent", col_wrap=3, height=4, palette="Dark2")

# Adding scatter plots for each facet
g.map_dataframe(sns.scatterplot, x="gdpPercap", y="lifeExp")

# Adding regression lines for each facet
g.map(sns.regplot, "gdpPercap", "lifeExp", scatter=False, logx=True, truncate=False, ci=False)

# Customizing each facet
g.set(xscale="log")
g.set_titles("{col_name}")
g.set_axis_labels("GDP per Capita (2007)", "Life Expectancy")

# Customizing x-axis ticks
formatter = plt.FuncFormatter(lambda x, _: f'${x:,.0f}')
for ax in g.axes.flat:
    ax.xaxis.set_major_formatter(formatter)
    ax.xaxis.set_major_locator(plt.FixedLocator([1000, 10000, 100000]))
    ax.tick_params(axis='x', rotation=45)

# Customizing legend for each subplot
for ax in g.axes.flat:
    ax.legend(loc='upper left')

# Adjusting the subplot layout and adding title
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Gapminder 2007: Life Expectancy vs GDP per Capita by Continent', fontsize=16)

# Show the plot
plt.show()


/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/seaborn/regression.py:315: RuntimeWarning: invalid value encountered in log
  grid = np.c_[np.ones(len(grid)), np.log(grid)]
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/numpy/lib/nanfunctions.py:1577: RuntimeWarning: All-NaN slice encountered
  result = np.apply_along_axis(_nanquantile_1d, axis, a, q,
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/seaborn/regression.py:315: RuntimeWarning: invalid value encountered in log
  grid = np.c_[np.ones(len(grid)), np.log(grid)]
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/numpy/lib/nanfunctions.py:1577: RuntimeWarning: All-NaN slice encountered
  result = np.apply_along_axis(_nanquantile_1d, axis, a, q,
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/seaborn/regression.py:315: RuntimeWarning: invalid value encountered in log
  grid = np.c_[np.ones(len(grid)), np.log(grid)]
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/numpy/lib/nanfunctions.py:1577: RuntimeWarning: All-NaN slice encountered
  result = np.apply_along_axis(_nanquantile_1d, axis, a, q,
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/seaborn/regression.py:315: RuntimeWarning: invalid value encountered in log
  grid = np.c_[np.ones(len(grid)), np.log(grid)]
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/numpy/lib/nanfunctions.py:1577: RuntimeWarning: All-NaN slice encountered
  result = np.apply_along_axis(_nanquantile_1d, axis, a, q,
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/seaborn/regression.py:315: RuntimeWarning: invalid value encountered in log
  grid = np.c_[np.ones(len(grid)), np.log(grid)]
/Users/nijiatiabulizi/anaconda3/lib/python3.11/site-packages/numpy/lib/nanfunctions.py:1577: RuntimeWarning: All-NaN slice encountered
  result = np.apply_along_axis(_nanquantile_1d, axis, a, q,

png

# Set the aesthetic style of the plots
sns.set_theme(style="whitegrid")

# Create the scatter plot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)  # 1 row, 2 columns, 1st subplot
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop',
                data=gm_2007, palette='Dark2', sizes=(10, 200))
plt.xscale('log')

# Add a linear regression line (overlay)
# Using logx=True to account for the log-scaled x-axis
sns.regplot(x='gdpPercap', y='lifeExp', data=gm_2007, scatter=False,
            logx=True, color='black', truncate=False)

# Customize labels and title
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007')

# Add the subtitle at the bottom of the plot
plt.figtext(0.5, 0.02, '2007 life expectancy plotted against per capita GDP.', fontsize=10, ha='center')

# Automatically format x-axis ticks as currency (e.g., $10,000)
formatter = plt.FuncFormatter(lambda x, _: f'${x:,.0f}')
plt.gca().xaxis.set_major_formatter(formatter)

# Automatically determine x-axis tick intervals
plt.gca().xaxis.set_major_locator(plt.AutoLocator())

# Customize the legend position and axis text rotation
plt.legend(title='Continent', loc='upper left')
plt.xticks(rotation=45)  # Rotate x-axis labels by 45 degrees

# Create the box plot
plt.subplot(1, 2, 2)  # 1 row, 2 columns, 2nd subplot
sns.boxplot(data=gm_2007, x='continent', y='lifeExp', palette='Dark2')
plt.xlabel('Continent')
plt.ylabel('Life Expectancy')
plt.title('Life Expectancy by Continent')
plt.xticks(rotation=45)

# Adjust layout and display the plot
plt.tight_layout()
plt.show()


/var/folders/pk/263cmy6n21j3y3cqybw1dwq40000gn/T/ipykernel_39282/3466973353.py:37: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=gm_2007, x='continent', y='lifeExp', palette='Dark2')

png

# Set the aesthetic style of the plots
sns.set_theme(style="whitegrid")

# Adjust the overall figure size if necessary
plt.figure(figsize=(6, 12))

# Create the scatter plot
plt.subplot(2, 1, 1)  # 2 rows, 1 column, 1st subplot
sns.scatterplot(x='gdpPercap', y='lifeExp', hue='continent', size='pop',
                data=gm_2007, palette='Dark2', sizes=(10, 200))
plt.xscale('log')

# Add a linear regression line (overlay)
# Using logx=True to account for the log-scaled x-axis
sns.regplot(x='gdpPercap', y='lifeExp', data=gm_2007, scatter=False,
            logx=True, color='black', truncate=False)

# Customize labels and title
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007')

# Add the subtitle at the bottom of the plot
plt.figtext(0.5, 0.02, '2007 life expectancy plotted against per capita GDP.', fontsize=10, ha='center')

# Automatically format x-axis ticks as currency (e.g., $10,000)
formatter = plt.FuncFormatter(lambda x, _: f'${x:,.0f}')
plt.gca().xaxis.set_major_formatter(formatter)

# Automatically determine x-axis tick intervals
plt.gca().xaxis.set_major_locator(plt.AutoLocator())

# Customize the legend position and axis text rotation
plt.legend(title='Continent', loc='upper left')
plt.xticks(rotation=45)  # Rotate x-axis labels by 45 degrees

# Create the box plot
plt.subplot(2, 1, 2)  # 2 rows, 1 column, 2nd subplot
sns.boxplot(data=gm_2007, x='continent', y='lifeExp', palette='Dark2')
plt.xlabel('Continent')
plt.ylabel('Life Expectancy')
plt.title('Life Expectancy by Continent')
plt.xticks(rotation=45)

# Adjust layout and display the plot
plt.tight_layout()
plt.show()


/var/folders/pk/263cmy6n21j3y3cqybw1dwq40000gn/T/ipykernel_39282/1593930546.py:39: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=gm_2007, x='continent', y='lifeExp', palette='Dark2')

png

scatter = sns.scatterplot(data=gm_2007, x='gdpPercap', y='lifeExp', hue='continent', size='pop', sizes=(10, 200), palette='Dark2')
plt.xscale('log')
plt.xlabel('GDP per Capita (2007)')
plt.ylabel('Life Expectancy')
plt.title('Gapminder 2007: Life Expectancy vs GDP per Capita')
plt.xticks(rotation=45)
plt.legend(loc='upper left')

# Define the palette for continents
palette = sns.color_palette("Dark2", n_colors=len(gm_2007['continent'].unique()))

# Add linear regression lines to the scatter plot
for idx, continent in enumerate(gm_2007['continent'].unique()):
    subset = gm_2007[gm_2007['continent'] == continent]
    sns.regplot(x='gdpPercap', y='lifeExp', data=subset, scatter=False, logx=True, color=palette[idx])

Reference