workshops

Let’s import the libraries we need.

#import all libraries required for this data.
import pandas as pd
import numpy as np

Now, just as before, let’s import data files we will use. Simply run the following commands to load them in your notebook.

# Reading CSV files from GitHub
gapminder = pd.read_csv('https://raw.githubusercontent.com/csc-ubc-okanagan/workshops/a091bc6eae8b9045866c28dbd1848c7e072db5b1/data/gapminder.csv')
gapminder.to_csv('gapminder.csv', index=False)
gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object
 1   continent  1704 non-null   object
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

In the previous workshop:

From our last workshop, we discussed methods to retrieve data from DataFrames:


Remember:

Data Iteration Methods in Pandas

1. Vectorization

2. Apply/Map Functions

3. Loops


Note: When working with Pandas, it’s generally recommended to use vectorization whenever possible because of its efficiency advantages.

1. Vectorization

some_numbers = pd.Series(range(1, 11)) # remember that rnage function excludes the last number
some_numbers
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64
np.sqrt(some_numbers + 2)
0    1.732051
1    2.000000
2    2.236068
3    2.449490
4    2.645751
5    2.828427
6    3.000000
7    3.162278
8    3.316625
9    3.464102
dtype: float64

And similarly with DataFrames:

For this, let’s first select the numeric columns from the gapminder dataframe using the .select_dtypes() function.

# We can select only the numeric columns from the DataFrame
gapminder_num = gapminder.select_dtypes(include=[np.number])
gapminder_num
year lifeExp pop gdpPercap
0 1952 28.801 8425333 779.445314
1 1957 30.332 9240934 820.853030
2 1962 31.997 10267083 853.100710
3 1967 34.020 11537966 836.197138
4 1972 36.088 13079460 739.981106
... ... ... ... ...
1699 1987 62.351 9216418 706.157306
1700 1992 60.377 10704340 693.420786
1701 1997 46.809 11404948 792.449960
1702 2002 39.989 11926563 672.038623
1703 2007 43.487 12311143 469.709298

1704 rows × 4 columns

Now let’s clacualte the mean of each of these columns:

gapminder_num.mean()
year         1.979500e+03
lifeExp      5.947444e+01
pop          2.960121e+07
gdpPercap    7.215327e+03
dtype: float64

Task: Calculate the GDP (Gross Domestic Product) for each country-year combination:

GDP is often calculated as gdpPercap (GDP per capita) multiplied by pop (population).

gapminder['gdp'] = gapminder['gdpPercap'] * gapminder['pop']
gapminder[['country', 'year', 'gdp']]
country year gdp
0 Afghanistan 1952 6.567086e+09
1 Afghanistan 1957 7.585449e+09
2 Afghanistan 1962 8.758856e+09
3 Afghanistan 1967 9.648014e+09
4 Afghanistan 1972 9.678553e+09
... ... ... ...
1699 Zimbabwe 1987 6.508241e+09
1700 Zimbabwe 1992 7.422612e+09
1701 Zimbabwe 1997 9.037851e+09
1702 Zimbabwe 2002 8.015111e+09
1703 Zimbabwe 2007 5.782658e+09

1704 rows × 3 columns

2. Using apply, map and applymap in Pandass.

1) apply()

# 1.Sum across columns
gapminder_num.apply(sum, axis=0) # the apply() function takes a function and axis. axis=0 means rows.
year         3.373068e+06
lifeExp      1.013444e+05
pop          5.044047e+10
gdpPercap    1.229492e+07
dtype: float64

It might seem a bit counter-intuitive initially, but a helpful way to remember this is:

# 2.Sort values of columns for rows where 'year' is 1952 and display top values
gapminder_num[gapminder_num['year'] == 1952].apply(sorted, axis=0).head()
year lifeExp pop gdpPercap
0 1952 28.801 60011 298.846212
12 1952 30.000 63149 299.850319
24 1952 30.015 120447 328.940557
36 1952 30.331 147962 331.000000
48 1952 31.286 153936 339.296459
# 3.Get maximum values across columns
gapminder_num.apply(max, axis=0)
year         2.007000e+03
lifeExp      8.260300e+01
pop          1.318683e+09
gdpPercap    1.135231e+05
dtype: float64

2) map()

# Extract the `pop` column as Series and then use map() to take the log of all values in the column.
year_series = gapminder_num['pop']
year_series.head()
0     8425333
1     9240934
2    10267083
3    11537966
4    13079460
Name: pop, dtype: int64
year_series.map(np.log).head()
0    15.946754
1    16.039154
2    16.144454
3    16.261154
4    16.386554
Name: pop, dtype: float64

3) applymap()

# If we want to take the log value of all values in our gamp minder numeric columns, the we can use applymap()
gapminder_num.applymap(np.log).head()
year lifeExp pop gdpPercap
0 7.576610 3.360410 15.946754 6.658583
1 7.579168 3.412203 16.039154 6.710344
2 7.581720 3.465642 16.144454 6.748878
3 7.584265 3.526949 16.261154 6.728864
4 7.586804 3.585960 16.386554 6.606625

3. Loops

In Python, just like R, vectorization often provides the fastest way to perform operations on data. When specific, custom operations are required, loops might be needed.gs:

To iterate over any iterable in Python, you can use the following syntax:

for variable in iterable:
    # do something


For example:


```python
for int in range(1, 11):  # for each int in the range 1 through 10
    print(int)  # print that int
1
2
3
4
5
6
7
8
9
10

To demonstrate simple usage of the for loops, let’s sample 20 rows of data from our gapminder dataset

# Sample the gapminder data (random 20 rows)
gapminder_sample = gapminder_num.sample(20)
gapminder_sample
year lifeExp pop gdpPercap
1303 1987 61.728 110812 1516.525457
1124 1992 47.391 8392818 581.182725
632 1992 43.266 1050938 745.539871
682 2002 72.590 10083313 14843.935560
36 1952 30.015 4232095 3520.610273
1413 1997 60.236 42835005 7479.188244
428 1992 51.604 384156 2377.156192
1634 1962 60.770 8143375 8422.974165
1677 1997 58.020 15826497 2117.484526
762 1982 74.450 3858421 15367.029200
352 1972 67.849 1834796 5118.146939
419 2007 78.332 5468120 35278.418740
717 1997 66.041 199278000 3119.335603
144 1952 53.820 2791000 973.533195
1612 1972 71.340 209896000 21806.035940
1007 2007 66.803 2874127 3095.772271
1038 1982 42.795 12587223 462.211415
999 1967 51.253 1149500 1226.041130
1138 2002 46.608 119901274 1615.286395
463 1987 59.797 52799062 3885.460710

Task1: Print the log base 10 of the first value in the column ‘gdpPercap’

# Print the log base 10 of the first value in the column 'gdpPercap'
print(np.log10(gapminder_sample['gdpPercap'].iloc[0]))
3.180849704947923

Task2: using the for loop, loop over the gdpPercap column, printing index and log10 of the value:

for index, value in enumerate(gapminder_sample['gdpPercap']):
    print(index + 1)  # +1 because Python uses 0-based indexing
    print(np.log10(value))
1
3.180849704947923
2
2.7643126969049336
3
2.872470873970842
4
4.1715490603360585
5
3.546617951893976
6
3.8738544641368757
7
3.376057718158622
8
3.92546546862278
9
3.325820245308592
10
4.186589916507445
11
3.7091127500702807
12
4.547509110816849
13
3.4940621021345812
14
2.988350764353931
15
4.338576723540454
16
3.490769005923762
17
2.6648406667312905
18
3.0885050397031306
19
3.2082495351755087
20
3.589442521707857

Task3. How do we store these index and computed values in a new variable?

We first need to create an empty dictionary

# Create an empty dictionary to store the log values:
gdp_percapita_log_dict = {}
# Update the values of the dictionary with log10 values:
for index, value in enumerate(gapminder_sample['gdpPercap']):
    gdp_percapita_log_dict[index] = np.log10(value)
gdp_percapita_log_dict
{0: 3.180849704947923,
 1: 2.7643126969049336,
 2: 2.872470873970842,
 3: 4.1715490603360585,
 4: 3.546617951893976,
 5: 3.8738544641368757,
 6: 3.376057718158622,
 7: 3.92546546862278,
 8: 3.325820245308592,
 9: 4.186589916507445,
 10: 3.7091127500702807,
 11: 4.547509110816849,
 12: 3.4940621021345812,
 13: 2.988350764353931,
 14: 4.338576723540454,
 15: 3.490769005923762,
 16: 2.6648406667312905,
 17: 3.0885050397031306,
 18: 3.2082495351755087,
 19: 3.589442521707857}