workshops

Let’s import the libraries we need.

# Import the libraries required for this workshop.
import pandas as pd
import numpy as np

Now, just as before, let’s import the data file we will use. Simply run the following commands to load it in your notebook.

# Reading CSV files from GitHub
gapminder = pd.read_csv('https://raw.githubusercontent.com/csc-ubc-okanagan/workshops/a091bc6eae8b9045866c28dbd1848c7e072db5b1/data/gapminder.csv')
gapminder.to_csv('gapminder.csv', index=False)

gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object
 1   continent  1704 non-null   object
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

In the previous workshop:

In our last workshop, we discussed methods to retrieve data from DataFrames:

Single Square Brackets []
- Returns a Series.
- Used for selecting a single column.
Double Square Brackets [[]]
- Returns a DataFrame.
- Can be used to select one or more columns.

Remember:

A Series is essentially a column in a DataFrame.
A DataFrame can represent both rows and columns of data.
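
As a quick refresher, the difference between the two bracket styles can be sketched on a small, made-up DataFrame. The `toy` data and its column values below are assumptions for illustration only and are not part of the gapminder file.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the bracket styles
toy = pd.DataFrame({"country": ["A", "B"], "pop": [10, 20], "year": [2000, 2001]})

single = toy["pop"]                 # single brackets -> Series
double = toy[["pop"]]               # double brackets, one column -> DataFrame
several = toy[["country", "pop"]]   # double brackets, several columns -> DataFrame

print(type(single).__name__)
print(type(double).__name__)
print(several.shape)
```

Checking `type()` like this is a simple way to confirm which kind of object a selection returned.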

Data Iteration Methods in Pandas

1. Vectorization

Definition: Applying operations on entire arrays, Series, or DataFrames simultaneously without explicit loops.
Advantages: Uses optimized low-level languages like C under the hood, providing a significant performance boost.

2. Apply/Map Functions

Definition: Leveraging apply and map functions to iterate over and transform data.
Advantages: Can be more flexible than vectorization for complex operations.
Methods:
- apply(): Used with both Series and DataFrames.
- map(): Specifically for Series, useful for element-wise operations.
- applymap(): For DataFrames, applying a function to each element.

3. Loops

Definition: Traditional Python method to iterate over data.
Advantages/Disadvantages: Offers finer control but can be slower than vectorized operations.

Note: When working with Pandas, it’s generally recommended to use vectorization whenever possible because of its efficiency advantages.
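
To make the comparison concrete, here is a minimal sketch that computes the same square roots with each of the three approaches. The `values` Series is a made-up example; on data this small the speed difference is not visible, but all three should agree on the result.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 4.0, 9.0, 16.0])  # hypothetical toy values

# 1. Vectorization: one call operates on the whole Series
vectorized = np.sqrt(values)

# 2. map: the function is called once per element
mapped = values.map(lambda x: x ** 0.5)

# 3. Loop: explicit Python iteration, collecting the results by hand
looped = pd.Series([x ** 0.5 for x in values])

print(vectorized.tolist())
print(mapped.tolist())
print(looped.tolist())
```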

1. Vectorization

some_numbers = pd.Series(range(1, 11))  # remember that the range function excludes the last number
some_numbers

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64

np.sqrt(some_numbers + 2)

0    1.732051
1    2.000000
2    2.236068
3    2.449490
4    2.645751
5    2.828427
6    3.000000
7    3.162278
8    3.316625
9    3.464102
dtype: float64

And similarly with DataFrames:

For this, let’s first select the numeric columns from the gapminder dataframe using the .select_dtypes() function.

# We can select only the numeric columns from the DataFrame
gapminder_num = gapminder.select_dtypes(include=[np.number])
gapminder_num

	year	lifeExp	pop	gdpPercap
0	1952	28.801	8425333	779.445314
1	1957	30.332	9240934	820.853030
2	1962	31.997	10267083	853.100710
3	1967	34.020	11537966	836.197138
4	1972	36.088	13079460	739.981106
...	...	...	...	...
1699	1987	62.351	9216418	706.157306
1700	1992	60.377	10704340	693.420786
1701	1997	46.809	11404948	792.449960
1702	2002	39.989	11926563	672.038623
1703	2007	43.487	12311143	469.709298

1704 rows × 4 columns
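
The behaviour of `.select_dtypes()` can also be checked on a small, made-up DataFrame. The `toy` data below is an assumption for illustration; `include` keeps the matching columns and `exclude` drops them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [1952, 1957],
    "lifeExp": [28.8, 30.3],
})

numeric = toy.select_dtypes(include=[np.number])      # keeps year and lifeExp
non_numeric = toy.select_dtypes(exclude=[np.number])  # keeps country

print(list(numeric.columns))
print(list(non_numeric.columns))
```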

Now let’s calculate the mean of each of these columns:

gapminder_num.mean()

year         1.979500e+03
lifeExp      5.947444e+01
pop          2.960121e+07
gdpPercap    7.215327e+03
dtype: float64

Task: Calculate the GDP (Gross Domestic Product) for each country-year combination:

GDP is often calculated as gdpPercap (GDP per capita) multiplied by pop (population).

gapminder['gdp'] = gapminder['gdpPercap'] * gapminder['pop']
gapminder[['country', 'year', 'gdp']]

	country	year	gdp
0	Afghanistan	1952	6.567086e+09
1	Afghanistan	1957	7.585449e+09
2	Afghanistan	1962	8.758856e+09
3	Afghanistan	1967	9.648014e+09
4	Afghanistan	1972	9.678553e+09
...	...	...	...
1699	Zimbabwe	1987	6.508241e+09
1700	Zimbabwe	1992	7.422612e+09
1701	Zimbabwe	1997	9.037851e+09
1702	Zimbabwe	2002	8.015111e+09
1703	Zimbabwe	2007	5.782658e+09

1704 rows × 3 columns

2. Using `apply`, `map` and `applymap` in Pandas

1) apply()

In Pandas:
- Used with both Series and DataFrames.
- Operates along the rows or columns of a DataFrame, or on the values of a Series.
- Can specify an axis along which to apply a function (for example, sum or max).

# 1. Sum each column
gapminder_num.apply(sum, axis=0)  # apply() takes a function and an axis; axis=0 applies the function to each column (down the rows)

year         3.373068e+06
lifeExp      1.013444e+05
pop          5.044047e+10
gdpPercap    1.229492e+07
dtype: float64

It might seem a bit counter-intuitive initially, but a helpful way to remember this is:

axis=0: The function gets applied vertically (down the rows, so column-wise).
axis=1: The function gets applied horizontally (across the columns, so row-wise).
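
The two axis values can be compared side by side on a tiny DataFrame. This is a minimal sketch; the `toy` data is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # hypothetical toy data

per_column = toy.apply(sum, axis=0)  # one result per column: a -> 6, b -> 60
per_row = toy.apply(sum, axis=1)     # one result per row: 11, 22, 33

print(per_column.to_dict())
print(per_row.tolist())
```

With `axis=0` the result has one entry per column, and with `axis=1` it has one entry per row.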

# 2. Sort the values of each column for rows where 'year' is 1952 and display the first few rows
gapminder_num[gapminder_num['year'] == 1952].apply(sorted, axis=0).head()

	year	lifeExp	pop	gdpPercap
0	1952	28.801	60011	298.846212
12	1952	30.000	63149	299.850319
24	1952	30.015	120447	328.940557
36	1952	30.331	147962	331.000000
48	1952	31.286	153936	339.296459

# 3. Get the maximum value of each column
gapminder_num.apply(max, axis=0)

year         2.007000e+03
lifeExp      8.260300e+01
pop          1.318683e+09
gdpPercap    1.135231e+05
dtype: float64
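
For common reductions such as the sum and maximum, pandas also provides built-in vectorized methods that give the same answers without `apply()`. The sketch below uses a made-up `toy` DataFrame to show that the two routes agree.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # hypothetical toy data

applied_max = toy.apply(max, axis=0)  # Python's max, called once per column
builtin_max = toy.max()               # vectorized DataFrame method

applied_sum = toy.apply(sum, axis=0)
builtin_sum = toy.sum()

print(applied_max.tolist(), builtin_max.tolist())
print(applied_sum.tolist(), builtin_sum.tolist())
```

When a built-in method exists, it is usually the simpler and faster choice; `apply()` is most useful for functions that have no vectorized equivalent.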

2) map()

In Pandas:
- Specifically for Series.
- Useful for element-wise operations on a series.

# Extract the `pop` column as a Series, then use map() to take the natural log of every value in the column.
pop_series = gapminder_num['pop']
pop_series.head()

0     8425333
1     9240934
2    10267083
3    11537966
4    13079460
Name: pop, dtype: int64

pop_series.map(np.log).head()

0    15.946754
1    16.039154
2    16.144454
3    16.261154
4    16.386554
Name: pop, dtype: float64
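
Besides a function, `map()` on a Series also accepts a dictionary, which is handy for recoding values. The sketch below is an illustration only: the `codes` lookup is a made-up, deliberately incomplete set of abbreviations, and values missing from the dictionary become NaN.

```python
import pandas as pd

continents = pd.Series(["Asia", "Europe", "Asia", "Africa"])
codes = {"Asia": "AS", "Europe": "EU"}  # hypothetical abbreviations; "Africa" is left out on purpose

abbreviated = continents.map(codes)  # look up each element in the dictionary
print(abbreviated.tolist())          # the unmatched "Africa" entry becomes NaN
```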

3) applymap()

In Pandas:
- Used for DataFrames.
- Applies a function to each element of the DataFrame.

# If we want to take the log of every value in the numeric gapminder columns, we can use applymap()
gapminder_num.applymap(np.log).head()

	year	lifeExp	pop	gdpPercap
0	7.576610	3.360410	15.946754	6.658583
1	7.579168	3.412203	16.039154	6.710344
2	7.581720	3.465642	16.144454	6.748878
3	7.584265	3.526949	16.261154	6.728864
4	7.586804	3.585960	16.386554	6.606625
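
Note that `DataFrame.applymap()` was deprecated in pandas 2.1, where the same element-wise behaviour is available as `DataFrame.map()`. The sketch below, on a made-up `toy` DataFrame, picks whichever method the installed version offers and compares the result with the vectorized call, which should agree for a NumPy function such as `np.log10`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 10.0], "b": [100.0, 1000.0]})  # hypothetical toy data

# DataFrame.map exists from pandas 2.1 onwards; older versions only have applymap
if hasattr(toy, "map"):
    elementwise = toy.map(np.log10)
else:
    elementwise = toy.applymap(np.log10)

# NumPy functions can also be applied to the whole DataFrame at once
vectorized = np.log10(toy)

print(elementwise)
print(vectorized)
```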

3. Loops

In Python, just like R, vectorization often provides the fastest way to perform operations on data. When specific, custom operations are required, loops might be needed.

To iterate over any iterable in Python, you can use the following syntax:

for variable in iterable:
    # do something


For example:


```python
for number in range(1, 11):  # for each number in the range 1 through 10
    print(number)  # print that number
```

To demonstrate simple usage of for loops, let’s sample 20 rows of data from our gapminder dataset:

# Sample the gapminder data (random 20 rows)
gapminder_sample = gapminder_num.sample(20)
gapminder_sample

	year	lifeExp	pop	gdpPercap
1303	1987	61.728	110812	1516.525457
1124	1992	47.391	8392818	581.182725
632	1992	43.266	1050938	745.539871
682	2002	72.590	10083313	14843.935560
36	1952	30.015	4232095	3520.610273
1413	1997	60.236	42835005	7479.188244
428	1992	51.604	384156	2377.156192
1634	1962	60.770	8143375	8422.974165
1677	1997	58.020	15826497	2117.484526
762	1982	74.450	3858421	15367.029200
352	1972	67.849	1834796	5118.146939
419	2007	78.332	5468120	35278.418740
717	1997	66.041	199278000	3119.335603
144	1952	53.820	2791000	973.533195
1612	1972	71.340	209896000	21806.035940
1007	2007	66.803	2874127	3095.772271
1038	1982	42.795	12587223	462.211415
999	1967	51.253	1149500	1226.041130
1138	2002	46.608	119901274	1615.286395
463	1987	59.797	52799062	3885.460710

Task 1: Print the log base 10 of the first value in the column ‘gdpPercap’

# Print the log base 10 of the first value in the column 'gdpPercap'
print(np.log10(gapminder_sample['gdpPercap'].iloc[0]))

3.180849704947923

Task 2: Using a for loop, iterate over the gdpPercap column, printing the position and the log10 of each value:

for index, value in enumerate(gapminder_sample['gdpPercap']):
    print(index + 1)  # +1 because Python uses 0-based indexing
    print(np.log10(value))

1
3.180849704947923
2
2.7643126969049336
3
2.872470873970842
4
4.1715490603360585
5
3.546617951893976
6
3.8738544641368757
7
3.376057718158622
8
3.92546546862278
9
3.325820245308592
10
4.186589916507445
11
3.7091127500702807
12
4.547509110816849
13
3.4940621021345812
14
2.988350764353931
15
4.338576723540454
16
3.490769005923762
17
2.6648406667312905
18
3.0885050397031306
19
3.2082495351755087
20
3.589442521707857
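
Note that `enumerate()` counts positions (0, 1, 2, ...) and does not return the original row labels of the sample. If the labels are needed, `Series.items()` yields each label together with its value. The sketch below uses a small made-up Series, `gdp_sample`, with non-consecutive labels similar to what `.sample()` produces.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with non-consecutive index labels
gdp_sample = pd.Series([1516.5, 581.2, 745.5], index=[1303, 1124, 632], name="gdpPercap")

labels = []
for label, value in gdp_sample.items():  # label is the index label, not the position
    labels.append(label)
    print(label, np.log10(value))
```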

Task 3: How do we store these positions and computed values in a new variable?

We first need to create an empty dictionary:

# Create an empty dictionary to store the log values:
gdp_percapita_log_dict = {}

# Update the values of the dictionary with log10 values:
for index, value in enumerate(gapminder_sample['gdpPercap']):
    gdp_percapita_log_dict[index] = np.log10(value)

gdp_percapita_log_dict

{0: 3.180849704947923,
 1: 2.7643126969049336,
 2: 2.872470873970842,
 3: 4.1715490603360585,
 4: 3.546617951893976,
 5: 3.8738544641368757,
 6: 3.376057718158622,
 7: 3.92546546862278,
 8: 3.325820245308592,
 9: 4.186589916507445,
 10: 3.7091127500702807,
 11: 4.547509110816849,
 12: 3.4940621021345812,
 13: 2.988350764353931,
 14: 4.338576723540454,
 15: 3.490769005923762,
 16: 2.6648406667312905,
 17: 3.0885050397031306,
 18: 3.2082495351755087,
 19: 3.589442521707857}
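
The same dictionary can be built more compactly. The sketch below shows two possible alternatives on a small made-up Series, `gdp_sample`: a dict comprehension that mirrors the loop, and a vectorized computation followed by `.to_dict()`, which keys the result by the original row labels instead of by position.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with non-consecutive index labels
gdp_sample = pd.Series([1516.5, 581.2, 745.5], index=[1303, 1124, 632])

# Same pattern as the loop, written as a dict comprehension (keys are positions)
log_by_position = {i: np.log10(v) for i, v in enumerate(gdp_sample)}

# Vectorized alternative: compute all logs at once, then convert (keys are index labels)
log_by_label = np.log10(gdp_sample).to_dict()

print(log_by_position)
print(log_by_label)
```

Which keys are more useful depends on whether later steps need to look values up by position or by the original row label.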
