Let’s import out libraries in Python.
#import all libraries required for this data.
import pandas as pd
import numpy as np
Now, just as before, let’s import data files we will use. Simply run the following commands to load them in your notebook.
# Reading CSV files from GitHub
gapminder = pd.read_csv('https://raw.githubusercontent.com/csc-ubc-okanagan/workshops/a091bc6eae8b9045866c28dbd1848c7e072db5b1/data/gapminder.csv')
gapminder.to_csv('gapminder.csv', index=False)
# Select a single variable of a dataframe
lifeExp = gapminder['lifeExp']
lifeExp.head()
0 28.801
1 30.332
2 31.997
3 34.020
4 36.088
Name: lifeExp, dtype: float64
print(type(lifeExp))
<class 'pandas.core.series.Series'>
[[]]
to return DataFrame[]
or loc to return SeriesWe might want to return a DataFrame because:
it offers the flexibility to easily extend the structure with new columns as needed
[[]]
with a dataframe
pop_preserved
is a DataFrame containing the ‘pop’ column.pop_preserved = gapminder[['pop']]
type(pop_preserved)
displays that pop_preserved
is of type DataFrame.type(pop_preserved)
pandas.core.frame.DataFrame
.loc
or a single bracket []
pop_simple
is a Series.pop_simple = gapminder['pop']
type(pop_simple)
pandas.core.series.Series
type(pop_simple)
shows that pop_simple
is of type Series..loc
equivalent of the code abovepop_simple_loc = gapminder.loc[:, 'pop']
#[:,'pop'] retuns all rows in the gapminder dataset
type(pop_simple_loc)
pandas.core.series.Series
.iloc
.iloc
is a pandas method used for integer-location based indexing.
gapminder.head()
country | continent | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
gapminder.iloc[0, 0]
'Afghanistan'
gapminder['country'][0]
'Afghanistan'
print(gapminder.iloc[0, 0])
gapminder
DataFrame.gapminder.iloc[1]
country Afghanistan
continent Asia
year 1957
lifeExp 30.332
pop 9240934
gdpPercap 820.85303
Name: 1, dtype: object
print(gapminder.iloc[1])
gapminder
DataFrame.Series
.gapminder.iloc[0:5, 0:3]
country | continent | year | |
---|---|---|---|
0 | Afghanistan | Asia | 1952 |
1 | Afghanistan | Asia | 1957 |
2 | Afghanistan | Asia | 1962 |
3 | Afghanistan | Asia | 1967 |
4 | Afghanistan | Asia | 1972 |
print(gapminder.iloc[0:5, 0:3])
DataFrame
with 5 rows and 3 columns..loc
and .iloc
.loc
:
df.loc['row_name', 'column_name']
.iloc
:
df.iloc[2, 3]
(third row, fourth column)When working with a Pandas DataFrame, you can filter your data based on specific conditions. This involves creating boolean masks using comparison operators and then applying these masks to your DataFrame using logical operators.
Operator | Description |
---|---|
!= | not equal to |
== | equal to |
< | less than |
> | greater than |
<= | less than or equal to |
>= | equal to or greater than |
Operator | Description |
---|---|
~ | not (negation) |
| | or (element-wise OR) |
& | and (element-wise AND) |
Remember to use parentheses around each condition when combining them with the logical operators to ensure correct order of operations. For instance:
filtered_df = df[(df['column1'] > 10) & (df['column2'] != 'value')]
#### Task: Filter the `lifeExp` column to select values greater than 70.
```python
# Filtering based on conditions
lifeExp_gt_70 = lifeExp[lifeExp > 70]
lifeExp_gt_70.head()
18 70.420
19 72.000
20 71.581
21 72.950
22 75.651
Name: lifeExp, dtype: float64
len(lifeExp_gt_70)
493
How it Works:
- lifeExp > 70
creates a Boolean mask where each value of lifeExp
is checked against the condition (> 70). It returns True
for values meeting the condition and False
otherwise.
- lifeExp[lifeExp > 70]
uses this Boolean mask to select and keep only the True
values from lifeExp
.
gapminder
DataFrame to select rows where the value in the ‘year’ column is 1952.gapminder_1952 = gapminder[gapminder['year'] == 1952]
gapminder_1952.head()
country | continent | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
12 | Albania | Europe | 1952 | 55.230 | 1282697 | 1601.056136 |
24 | Algeria | Africa | 1952 | 43.077 | 9279525 | 2449.008185 |
36 | Angola | Africa | 1952 | 30.015 | 4232095 | 3520.610273 |
48 | Argentina | Americas | 1952 | 62.485 | 17876956 | 5911.315053 |
gapminder['year'] == 1952
creates a Boolean mask. Each row’s ‘year’ value is checked against the condition (== 1952). This returns True
for rows where the condition is met and False
otherwise.gapminder[gapminder['year'] == 1952]
applies this Boolean mask to the gapminder
DataFrame. This results in a new DataFrame with only the rows from the year 1952.gdp_1952 = gapminder_1952[['country', 'gdpPercap']]
gdp_1952.head()
country | gdpPercap | |
---|---|---|
0 | Afghanistan | 779.445314 |
12 | Albania | 1601.056136 |
24 | Algeria | 2449.008185 |
36 | Angola | 3520.610273 |
48 | Argentina | 5911.315053 |
[[]]
) are used to select multiple columns from a DataFrame. This ensures the result is still a DataFrame.gapminder_1952[['country', 'gdpPercap']]
specifies the desired columns and extracts them from gapminder_1952
, resulting in a new DataFrame gdp_1952
containing only these two columns.loc
.loc
method to subset rows in the gapminder
DataFrame based on a specific condition: the year being 1952gapminder.loc[gapminder['year'] == 1952].head()
country | continent | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
12 | Albania | Europe | 1952 | 55.230 | 1282697 | 1601.056136 |
24 | Algeria | Africa | 1952 | 43.077 | 9279525 | 2449.008185 |
36 | Angola | Africa | 1952 | 30.015 | 4232095 | 3520.610273 |
48 | Argentina | Americas | 1952 | 62.485 | 17876956 | 5911.315053 |
gapminder[gapminder['year'] == 1952]
is the same as gapminder.loc[gapminder['year'] == 1952]
.loc
?gapminder.loc[gapminder['year'] == 1952, ['country', 'pop']].head()
country | pop | |
---|---|---|
0 | Afghanistan | 8425333 |
12 | Albania | 1282697 |
24 | Algeria | 9279525 |
36 | Angola | 4232095 |
48 | Argentina | 17876956 |
gapminder['year'] == 1952
creates a Boolean mask. Each row’s ‘year’ value is checked against the condition (== 1952). This returns True
for rows meeting the condition and False
otherwise..loc[]
is a label-based indexing method, but here it’s used with a Boolean mask to select the rows that satisfy the condition.gapminder
DataFrame based on two conditions: the year being 1952 and the continent being ‘Americas’.# Multiple conditions
am_52 = gapminder[(gapminder['year'] == 1952) & (gapminder['continent'] == 'Americas')]
print(am_52.head())
country continent year lifeExp pop gdpPercap
48 Argentina Americas 1952 62.485 17876956 5911.315053
132 Bolivia Americas 1952 40.414 2883315 2677.326347
168 Brazil Americas 1952 50.917 56602560 2108.944355
240 Canada Americas 1952 68.750 14785584 11367.161120
276 Chile Americas 1952 54.745 6377619 3939.978789
gapminder['year'] == 1952
: Creates a Boolean mask based on whether the ‘year’ column value is 1952.gapminder['continent'] == 'Americas'
: Creates another Boolean mask based on whether the ‘continent’ column value is ‘Americas’.&
operator combines the two Boolean masks element-wise, returning True
only where both conditions are met, and False
otherwise.gapminder
DataFrame, resulting in a new DataFrame am_52
that satisfies both conditions.# select 1 column
selected_columns = gapminder[['country']]
print(selected_columns.head())
country
0 Afghanistan
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
# select multiple columns
selected_range = gapminder[['continent', 'year', 'lifeExp', 'pop']]
print(selected_range.head())
continent year lifeExp pop
0 Asia 1952 28.801 8425333
1 Asia 1957 30.332 9240934
2 Asia 1962 31.997 10267083
3 Asia 1967 34.020 11537966
4 Asia 1972 36.088 13079460
# select only numeric columns
numeric_cols = gapminder.select_dtypes(include=['number'])
print(numeric_cols.head())
year lifeExp pop gdpPercap
0 1952 28.801 8425333 779.445314
1 1957 30.332 9240934 820.853030
2 1962 31.997 10267083 853.100710
3 1967 34.020 11537966 836.197138
4 1972 36.088 13079460 739.981106
select_dtypes
is a method available on pandas DataFrames. It allows selection of columns based on their datatype.include
parameter specifies which data types to select. By passing ['number']
, we are indicating that we want to select all numeric types (both integer and float columns).numeric_cols
variable, which will be a DataFrame containing only the numeric columns from gapminder
.negation_selection = gapminder.loc[:, gapminder.columns != 'country']
print(negation_selection.head())
continent year lifeExp pop gdpPercap
0 Asia 1952 28.801 8425333 779.445314
1 Asia 1957 30.332 9240934 820.853030
2 Asia 1962 31.997 10267083 853.100710
3 Asia 1967 34.020 11537966 836.197138
4 Asia 1972 36.088 13079460 739.981106
negation_selection = gapminder.loc[:, ~(gapminder.columns == 'country')]
print(negation_selection.head())
continent year lifeExp pop gdpPercap
0 Asia 1952 28.801 8425333 779.445314
1 Asia 1957 30.332 9240934 820.853030
2 Asia 1962 31.997 10267083 853.100710
3 Asia 1967 34.020 11537966 836.197138
4 Asia 1972 36.088 13079460 739.981106
The provided content and techniques are based on documentation and resources from official Python, Pandas, and NumPy websites:
Python: Python Official Documentation
.loc[]
and other selection methods.