Let’s import the libraries we need.
#import all libraries required for this data.
import pandas as pd
import numpy as np
Now, just as before, let’s import data files we will use. Simply run the following commands to load them in your notebook.
# Reading CSV files from GitHub
gapminder = pd.read_csv('https://raw.githubusercontent.com/csc-ubc-okanagan/workshops/a091bc6eae8b9045866c28dbd1848c7e072db5b1/data/gapminder.csv')
gapminder.to_csv('gapminder.csv', index=False)
gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 continent 1704 non-null object
2 year 1704 non-null int64
3 lifeExp 1704 non-null float64
4 pop 1704 non-null int64
5 gdpPercap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
From our last workshop, we discussed methods to retrieve data from DataFrames:
[]
[[]]
Remember:
Series
is essentially a column in a DataFrame.DataFrame
can represent both rows and columns of data.apply
and map
functions to iterate over and transform data.apply()
: Used with both Series and DataFrames.map()
: Specifically for Series, useful for element-wise operations.applymap()
: For DataFrames, applying a function to each element.Note: When working with Pandas, it’s generally recommended to use vectorization whenever possible because of its efficiency advantages.
some_numbers = pd.Series(range(1, 11)) # remember that rnage function excludes the last number
some_numbers
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
dtype: int64
np.sqrt(some_numbers + 2)
0 1.732051
1 2.000000
2 2.236068
3 2.449490
4 2.645751
5 2.828427
6 3.000000
7 3.162278
8 3.316625
9 3.464102
dtype: float64
For this, let’s first select the numeric columns from the gapminder dataframe using the .select_dtypes() function.
# We can select only the numeric columns from the DataFrame
gapminder_num = gapminder.select_dtypes(include=[np.number])
gapminder_num
year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|
0 | 1952 | 28.801 | 8425333 | 779.445314 |
1 | 1957 | 30.332 | 9240934 | 820.853030 |
2 | 1962 | 31.997 | 10267083 | 853.100710 |
3 | 1967 | 34.020 | 11537966 | 836.197138 |
4 | 1972 | 36.088 | 13079460 | 739.981106 |
... | ... | ... | ... | ... |
1699 | 1987 | 62.351 | 9216418 | 706.157306 |
1700 | 1992 | 60.377 | 10704340 | 693.420786 |
1701 | 1997 | 46.809 | 11404948 | 792.449960 |
1702 | 2002 | 39.989 | 11926563 | 672.038623 |
1703 | 2007 | 43.487 | 12311143 | 469.709298 |
1704 rows × 4 columns
Now let’s clacualte the mean of each of these columns:
gapminder_num.mean()
year 1.979500e+03
lifeExp 5.947444e+01
pop 2.960121e+07
gdpPercap 7.215327e+03
dtype: float64
GDP is often calculated as gdpPercap (GDP per capita) multiplied by pop (population).
gapminder['gdp'] = gapminder['gdpPercap'] * gapminder['pop']
gapminder[['country', 'year', 'gdp']]
country | year | gdp | |
---|---|---|---|
0 | Afghanistan | 1952 | 6.567086e+09 |
1 | Afghanistan | 1957 | 7.585449e+09 |
2 | Afghanistan | 1962 | 8.758856e+09 |
3 | Afghanistan | 1967 | 9.648014e+09 |
4 | Afghanistan | 1972 | 9.678553e+09 |
... | ... | ... | ... |
1699 | Zimbabwe | 1987 | 6.508241e+09 |
1700 | Zimbabwe | 1992 | 7.422612e+09 |
1701 | Zimbabwe | 1997 | 9.037851e+09 |
1702 | Zimbabwe | 2002 | 8.015111e+09 |
1703 | Zimbabwe | 2007 | 5.782658e+09 |
1704 rows × 3 columns
apply
, map
and applymap
in Pandass.# 1.Sum across columns
gapminder_num.apply(sum, axis=0) # the apply() function takes a function and axis. axis=0 means rows.
year 3.373068e+06
lifeExp 1.013444e+05
pop 5.044047e+10
gdpPercap 1.229492e+07
dtype: float64
It might seem a bit counter-intuitive initially, but a helpful way to remember this is:
# 2.Sort values of columns for rows where 'year' is 1952 and display top values
gapminder_num[gapminder_num['year'] == 1952].apply(sorted, axis=0).head()
year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|
0 | 1952 | 28.801 | 60011 | 298.846212 |
12 | 1952 | 30.000 | 63149 | 299.850319 |
24 | 1952 | 30.015 | 120447 | 328.940557 |
36 | 1952 | 30.331 | 147962 | 331.000000 |
48 | 1952 | 31.286 | 153936 | 339.296459 |
# 3.Get maximum values across columns
gapminder_num.apply(max, axis=0)
year 2.007000e+03
lifeExp 8.260300e+01
pop 1.318683e+09
gdpPercap 1.135231e+05
dtype: float64
# Extract the `pop` column as Series and then use map() to take the log of all values in the column.
year_series = gapminder_num['pop']
year_series.head()
0 8425333
1 9240934
2 10267083
3 11537966
4 13079460
Name: pop, dtype: int64
year_series.map(np.log).head()
0 15.946754
1 16.039154
2 16.144454
3 16.261154
4 16.386554
Name: pop, dtype: float64
# If we want to take the log value of all values in our gamp minder numeric columns, the we can use applymap()
gapminder_num.applymap(np.log).head()
year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|
0 | 7.576610 | 3.360410 | 15.946754 | 6.658583 |
1 | 7.579168 | 3.412203 | 16.039154 | 6.710344 |
2 | 7.581720 | 3.465642 | 16.144454 | 6.748878 |
3 | 7.584265 | 3.526949 | 16.261154 | 6.728864 |
4 | 7.586804 | 3.585960 | 16.386554 | 6.606625 |
In Python, just like R, vectorization often provides the fastest way to perform operations on data. When specific, custom operations are required, loops might be needed.gs:
To iterate over any iterable in Python, you can use the following syntax:
for variable in iterable:
# do something
For example:
```python
for int in range(1, 11): # for each int in the range 1 through 10
print(int) # print that int
1
2
3
4
5
6
7
8
9
10
To demonstrate simple usage of the for loops, let’s sample 20 rows of data from our gapminder dataset
# Sample the gapminder data (random 20 rows)
gapminder_sample = gapminder_num.sample(20)
gapminder_sample
year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|
1303 | 1987 | 61.728 | 110812 | 1516.525457 |
1124 | 1992 | 47.391 | 8392818 | 581.182725 |
632 | 1992 | 43.266 | 1050938 | 745.539871 |
682 | 2002 | 72.590 | 10083313 | 14843.935560 |
36 | 1952 | 30.015 | 4232095 | 3520.610273 |
1413 | 1997 | 60.236 | 42835005 | 7479.188244 |
428 | 1992 | 51.604 | 384156 | 2377.156192 |
1634 | 1962 | 60.770 | 8143375 | 8422.974165 |
1677 | 1997 | 58.020 | 15826497 | 2117.484526 |
762 | 1982 | 74.450 | 3858421 | 15367.029200 |
352 | 1972 | 67.849 | 1834796 | 5118.146939 |
419 | 2007 | 78.332 | 5468120 | 35278.418740 |
717 | 1997 | 66.041 | 199278000 | 3119.335603 |
144 | 1952 | 53.820 | 2791000 | 973.533195 |
1612 | 1972 | 71.340 | 209896000 | 21806.035940 |
1007 | 2007 | 66.803 | 2874127 | 3095.772271 |
1038 | 1982 | 42.795 | 12587223 | 462.211415 |
999 | 1967 | 51.253 | 1149500 | 1226.041130 |
1138 | 2002 | 46.608 | 119901274 | 1615.286395 |
463 | 1987 | 59.797 | 52799062 | 3885.460710 |
# Print the log base 10 of the first value in the column 'gdpPercap'
print(np.log10(gapminder_sample['gdpPercap'].iloc[0]))
3.180849704947923
for index, value in enumerate(gapminder_sample['gdpPercap']):
print(index + 1) # +1 because Python uses 0-based indexing
print(np.log10(value))
1
3.180849704947923
2
2.7643126969049336
3
2.872470873970842
4
4.1715490603360585
5
3.546617951893976
6
3.8738544641368757
7
3.376057718158622
8
3.92546546862278
9
3.325820245308592
10
4.186589916507445
11
3.7091127500702807
12
4.547509110816849
13
3.4940621021345812
14
2.988350764353931
15
4.338576723540454
16
3.490769005923762
17
2.6648406667312905
18
3.0885050397031306
19
3.2082495351755087
20
3.589442521707857
We first need to create an empty dictionary
# Create an empty dictionary to store the log values:
gdp_percapita_log_dict = {}
# Update the values of the dictionary with log10 values:
for index, value in enumerate(gapminder_sample['gdpPercap']):
gdp_percapita_log_dict[index] = np.log10(value)
gdp_percapita_log_dict
{0: 3.180849704947923,
1: 2.7643126969049336,
2: 2.872470873970842,
3: 4.1715490603360585,
4: 3.546617951893976,
5: 3.8738544641368757,
6: 3.376057718158622,
7: 3.92546546862278,
8: 3.325820245308592,
9: 4.186589916507445,
10: 3.7091127500702807,
11: 4.547509110816849,
12: 3.4940621021345812,
13: 2.988350764353931,
14: 4.338576723540454,
15: 3.490769005923762,
16: 2.6648406667312905,
17: 3.0885050397031306,
18: 3.2082495351755087,
19: 3.589442521707857}