workshops

Subsetting and Filtering Data in Python

Let’s import out libraries in Python.

#import all libraries required for this data. 
import pandas as pd
import numpy as np

Now, just as before, let’s import data files we will use. Simply run the following commands to load them in your notebook.

# Reading CSV files from GitHub
gapminder = pd.read_csv('https://raw.githubusercontent.com/csc-ubc-okanagan/workshops/a091bc6eae8b9045866c28dbd1848c7e072db5b1/data/gapminder.csv')
gapminder.to_csv('gapminder.csv', index=False)

Select a Single Variable of a Dataframe

Task: Extract the ‘lifeExp’ column from the ‘gapminder’ dataframe and display its first five entries along with its data type.

# Select a single variable of a dataframe
lifeExp = gapminder['lifeExp']
lifeExp.head()
0    28.801
1    30.332
2    31.997
3    34.020
4    36.088
Name: lifeExp, dtype: float64
print(type(lifeExp))
<class 'pandas.core.series.Series'>

But why do we need a DataFrame instead?

We might want to return a DataFrame because:

pop_preserved = gapminder[['pop']]
type(pop_preserved)
pandas.core.frame.DataFrame
pop_simple = gapminder['pop']
type(pop_simple)
pandas.core.series.Series

The .loc equivalent of the code above

pop_simple_loc = gapminder.loc[:, 'pop']
#[:,'pop'] retuns all rows in the gapminder dataset
type(pop_simple_loc)

pandas.core.series.Series

Selecting Specific Ranges with .iloc

.iloc is a pandas method used for integer-location based indexing.

gapminder.head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
gapminder.iloc[0, 0]
'Afghanistan'

This is the same as the following:

gapminder['country'][0]
'Afghanistan'
gapminder.iloc[1]
country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap      820.85303
Name: 1, dtype: object
gapminder.iloc[0:5, 0:3]
country continent year
0 Afghanistan Asia 1952
1 Afghanistan Asia 1957
2 Afghanistan Asia 1962
3 Afghanistan Asia 1967
4 Afghanistan Asia 1972

Differences between .loc and .iloc

Filtering Based on Conditions

Filtering Based on Conditions in Pandas

When working with a Pandas DataFrame, you can filter your data based on specific conditions. This involves creating boolean masks using comparison operators and then applying these masks to your DataFrame using logical operators.

Comparison Operators

Operator Description
!= not equal to
== equal to
< less than
> greater than
<= less than or equal to
>= equal to or greater than

Logical Operators

Operator Description
~ not (negation)
| or (element-wise OR)
& and (element-wise AND)

Remember to use parentheses around each condition when combining them with the logical operators to ensure correct order of operations. For instance:

filtered_df = df[(df['column1'] > 10) & (df['column2'] != 'value')]


#### Task: Filter the `lifeExp` column to select values greater than 70.


```python
# Filtering based on conditions
lifeExp_gt_70 = lifeExp[lifeExp > 70]
lifeExp_gt_70.head()
18    70.420
19    72.000
20    71.581
21    72.950
22    75.651
Name: lifeExp, dtype: float64
len(lifeExp_gt_70)
493

How it Works: - lifeExp > 70 creates a Boolean mask where each value of lifeExp is checked against the condition (> 70). It returns True for values meeting the condition and False otherwise. - lifeExp[lifeExp > 70] uses this Boolean mask to select and keep only the True values from lifeExp.

Filter the gapminder DataFrame to select rows where the value in the ‘year’ column is 1952.

gapminder_1952 = gapminder[gapminder['year'] == 1952]
gapminder_1952.head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
12 Albania Europe 1952 55.230 1282697 1601.056136
24 Algeria Africa 1952 43.077 9279525 2449.008185
36 Angola Africa 1952 30.015 4232095 3520.610273
48 Argentina Americas 1952 62.485 17876956 5911.315053

Task: Extract the coloumns ‘country’ and ‘gdpPercap’ for the Year 1952

gdp_1952 = gapminder_1952[['country', 'gdpPercap']]
gdp_1952.head()
country gdpPercap
0 Afghanistan 779.445314
12 Albania 1601.056136
24 Algeria 2449.008185
36 Angola 3520.610273
48 Argentina 5911.315053

Subsetting Rows Using loc

Task: use the .loc method to subset rows in the gapminder DataFrame based on a specific condition: the year being 1952

gapminder.loc[gapminder['year'] == 1952].head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
12 Albania Europe 1952 55.230 1282697 1601.056136
24 Algeria Africa 1952 43.077 9279525 2449.008185
36 Angola Africa 1952 30.015 4232095 3520.610273
48 Argentina Americas 1952 62.485 17876956 5911.315053

gapminder[gapminder['year'] == 1952] is the same as gapminder.loc[gapminder['year'] == 1952]

Why Do We Use .loc ?

gapminder.loc[gapminder['year'] == 1952, ['country', 'pop']].head()
country pop
0 Afghanistan 8425333
12 Albania 1282697
24 Algeria 9279525
36 Angola 4232095
48 Argentina 17876956

Subsetting Data Using Multiple Conditions

Task: Filter rows from the gapminder DataFrame based on two conditions: the year being 1952 and the continent being ‘Americas’.

# Multiple conditions
am_52 = gapminder[(gapminder['year'] == 1952) & (gapminder['continent'] == 'Americas')]
print(am_52.head())
       country continent  year  lifeExp       pop     gdpPercap
48   Argentina  Americas  1952   62.485  17876956   5911.315053
132    Bolivia  Americas  1952   40.414   2883315   2677.326347
168     Brazil  Americas  1952   50.917  56602560   2108.944355
240     Canada  Americas  1952   68.750  14785584  11367.161120
276      Chile  Americas  1952   54.745   6377619   3939.978789

In Python, we typically use pandas methods directly which is the equivalent of tidyverse in R

# select 1 column
selected_columns = gapminder[['country']]
print(selected_columns.head())
       country
0  Afghanistan
1  Afghanistan
2  Afghanistan
3  Afghanistan
4  Afghanistan
# select multiple columns
selected_range = gapminder[['continent', 'year', 'lifeExp', 'pop']]
print(selected_range.head())
  continent  year  lifeExp       pop
0      Asia  1952   28.801   8425333
1      Asia  1957   30.332   9240934
2      Asia  1962   31.997  10267083
3      Asia  1967   34.020  11537966
4      Asia  1972   36.088  13079460
# select only numeric columns
numeric_cols = gapminder.select_dtypes(include=['number'])
print(numeric_cols.head())
   year  lifeExp       pop   gdpPercap
0  1952   28.801   8425333  779.445314
1  1957   30.332   9240934  820.853030
2  1962   31.997  10267083  853.100710
3  1967   34.020  11537966  836.197138
4  1972   36.088  13079460  739.981106
negation_selection = gapminder.loc[:, gapminder.columns != 'country']
print(negation_selection.head())

  continent  year  lifeExp       pop   gdpPercap
0      Asia  1952   28.801   8425333  779.445314
1      Asia  1957   30.332   9240934  820.853030
2      Asia  1962   31.997  10267083  853.100710
3      Asia  1967   34.020  11537966  836.197138
4      Asia  1972   36.088  13079460  739.981106
negation_selection = gapminder.loc[:, ~(gapminder.columns == 'country')]
print(negation_selection.head())
  continent  year  lifeExp       pop   gdpPercap
0      Asia  1952   28.801   8425333  779.445314
1      Asia  1957   30.332   9240934  820.853030
2      Asia  1962   31.997  10267083  853.100710
3      Asia  1967   34.020  11537966  836.197138
4      Asia  1972   36.088  13079460  739.981106

References

The provided content and techniques are based on documentation and resources from official Python, Pandas, and NumPy websites: