import pandas as pd
import numpy as np
import math
# Reading CSV files from GitHub
gapminder = pd.read_csv('https://raw.githubusercontent.com/csc-ubc-okanagan/workshops/a091bc6eae8b9045866c28dbd1848c7e072db5b1/data/gapminder.csv')
gapminder.to_csv('gapminder.csv', index=False)
x = 5
if x < 6:
    print("x is less than 6")
if x == 5:
    print("x is equal to 5")
if x != 2:
    print("x is some other number than 2")
x is less than 6
x is equal to 5
x is some other number than 2
This code checks the variable x against three different conditions.
Variable initialization: x is assigned a value of 5.
First if statement: checks whether x is less than 6. Since x is 5, this condition is true, so "x is less than 6" is printed.
Second if statement: checks whether x is exactly equal to 5. Since x is indeed 5, this condition is true, so "x is equal to 5" is printed.
Third if statement: checks whether x is not equal to 2. Since x is 5, not 2, this condition is true, so "x is some other number than 2" is printed.
Because the three if statements are independent of one another, each is evaluated regardless of the outcome of the others.

if x == 2:
    print("x is equal to 2")
else:
    print("x is not equal to 2")
x is not equal to 2
This code checks the value of x and executes code based on it.
Condition: the if statement checks whether x is equal to the number 2.
Equality operator (==): == is the comparison operator for equality in Python, whereas a single = is assignment.
If the condition (x == 2) evaluates to True, the print function within the if block is called and outputs "x is equal to 2" to the console.
If x is anything other than 2, the condition evaluates to False, so the print function within the else block is executed and outputs "x is not equal to 2" to the console.
Only one of the two print statements will execute, depending on the value of x.
Note the indentation of the if and else blocks, which is critical in Python syntax.
This conditional structure ensures that the program responds appropriately to whether x is equal to 2 or not.
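
The if/else behaviour described above can be sketched as a small function so that several values of x can be tried in one go. The function name describe_x is hypothetical and is not part of the original lesson.

```python
def describe_x(x):
    """Return the message the if/else block would print for a given x."""
    if x == 2:
        return "x is equal to 2"
    else:
        return "x is not equal to 2"


# Try an int that matches, an int that does not, and a float that compares equal to 2
for value in (2, 5, 2.0):
    print(value, "->", describe_x(value))
```

Returning the message instead of printing it inside the function makes the branch that was taken easy to inspect or reuse.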
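if condition1:
    # code that runs when condition1 is True
elif condition2:
    # code that runs when condition1 is False and condition2 is True
else:
    # code that runs when none of the conditions is True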
if x < 5:
    print("x is less than 5")
elif x > 5:
    print("x is greater than 5")
else:
    print("x is equal to 5")
x is equal to 5
This code evaluates the variable x and compares it against the number 5.
First condition (if): the if statement checks whether x is less than 5. If x is indeed less than 5, the condition evaluates to True and "x is less than 5" is printed to the console.
Second condition (elif): the elif (short for "else if") statement checks whether x is greater than 5. It is evaluated only if the preceding if condition was False. If x is greater than 5, it prints "x is greater than 5".
Final condition (else): the else statement covers the scenario where x is neither less than nor greater than 5, which means x is equal to 5. It prints "x is equal to 5" in this case.
Only the block belonging to the first condition that evaluates to True is executed: if x is less than 5, the first block runs; if x is greater than 5, the second block runs; if x is neither (which means it must be equal to 5), the else block runs.
This structure is an efficient way to handle multiple related conditions by checking them in sequence until one of them is met.
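
To make the difference between a chained if/elif/else and a series of separate if statements concrete, here is a minimal sketch. The function names compare_to_five and independent_checks are hypothetical helpers introduced only for illustration.

```python
def compare_to_five(x):
    """Chained conditions: exactly one branch runs."""
    if x < 5:
        return "x is less than 5"
    elif x > 5:
        return "x is greater than 5"
    else:
        return "x is equal to 5"


def independent_checks(x):
    """Separate if statements: every true condition contributes a message."""
    messages = []
    if x < 6:
        messages.append("x is less than 6")
    if x == 5:
        messages.append("x is equal to 5")
    return messages


for value in (3, 5, 8):
    print(value, "|", compare_to_five(value), "|", independent_checks(value))
```

With x set to 5, the chain yields a single message while the independent checks yield two, which matches the outputs shown earlier in the lesson.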
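# Work on a copy so the original gapminder DataFrame is left unchanged
# (plain assignment would only create a second name for the same object)
gapminder_cond = gapminder.copy()
# Create an empty column to hold our ordered categorical data
# Initially, all values are set to missing (pd.NA)
gapminder_cond['income_level'] = pd.NA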
# Define the categories and their order
income_levels = pd.CategoricalDtype(categories=["low-income", "middle-income", "high-income"], ordered=True)
# Start the loop to add values to income_level based on GDP values
for i in gapminder_cond.index:  # Iterating over the DataFrame index
    if gapminder_cond.loc[i, 'gdpPercap'] <= 10000:
        gapminder_cond.loc[i, 'income_level'] = 'low-income'
    elif gapminder_cond.loc[i, 'gdpPercap'] <= 75000:
        gapminder_cond.loc[i, 'income_level'] = 'middle-income'
    else:
        gapminder_cond.loc[i, 'income_level'] = 'high-income'
# Convert the 'income_level' column to ordered categorical type
gapminder_cond['income_level'] = gapminder_cond['income_level'].astype(income_levels)
# Summary of the 'income_level' column
print(gapminder_cond['income_level'].describe())
count 1704
unique 3
top low-income
freq 1312
Name: income_level, dtype: object
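A new column, income_level, is added to gapminder_cond with missing values.
An ordered categorical type with the levels low-income, middle-income, and high-income is defined.
The loop iterates over the DataFrame index; using .loc[], it accesses and evaluates the 'gdpPercap' value for each row and writes the matching label to income_level.
The column is then converted to the ordered categorical type.
Finally, the describe() function is called on the 'income_level' column, printing a summary that includes count, unique, top, and frequency of the categories.

gapminder_cond['income_level'] = pd.Categorical(
    np.where(gapminder_cond['gdpPercap'] <= 10000, 'low-income',
             np.where(gapminder_cond['gdpPercap'] <= 75000, 'middle-income', 'high-income')),
    categories=['low-income', 'middle-income', 'high-income'],
    ordered=True
)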
print(gapminder_cond['income_level'].describe())
count 1704
unique 3
top low-income
freq 1312
Name: income_level, dtype: object
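
As a point of comparison with the nested np.where call above, the same binning can be sketched with pd.cut, which is a different technique from the one the lesson uses. The toy DataFrame and its values are made up for illustration; only the thresholds (10,000 and 75,000) and the labels come from the lesson.

```python
import numpy as np
import pandas as pd

# Toy data standing in for gapminder_cond['gdpPercap'] (values are made up)
toy = pd.DataFrame({'gdpPercap': [800.0, 10000.0, 10000.01, 75000.0, 113523.13]})

# pd.cut assigns each value to an interval; by default the right edge is inclusive,
# so the bins are (-inf, 10000], (10000, 75000], and (75000, inf)
toy['income_level'] = pd.cut(
    toy['gdpPercap'],
    bins=[-np.inf, 10000, 75000, np.inf],
    labels=['low-income', 'middle-income', 'high-income'],
)

print(toy)
print(toy['income_level'].cat.ordered)
```

When labels are supplied, pd.cut returns an ordered categorical by default, so no separate conversion step is needed in this sketch.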
Objective: assign a categorical variable income_level based on gdpPercap values in the gapminder_cond DataFrame.
Tools: pd.Categorical from pandas creates the categorical column, and np.where from numpy performs the conditional assignments.
Conditions and labels:
low-income: gdpPercap <= 10,000.
middle-income: gdpPercap above 10,000 and up to 75,000.
high-income: gdpPercap > 75,000.
Categories: the order is defined as ['low-income', 'middle-income', 'high-income'], and the variable is marked as ordered (ordered=True).
Result: an income_level column is added to gapminder_cond with the appropriate labels.
Output: print(gapminder_cond['income_level'].describe()) displays the descriptive statistics of the income_level column.

# Assign values of 'below-average' if lifeExp is equal to or below 59.47
# and 'above-average' if above 59.47
lifeExp_cat = []
# Loop through each life expectancy value in the 'lifeExp' column
for x in gapminder_cond['lifeExp']:
    # Check if the life expectancy is below or equal to 59.47
    if x <= 59.47:
        # If it is, append 'below-average' to the list
        lifeExp_cat.append('below-average')
    else:
        # If it is not, append 'above-average' to the list
        lifeExp_cat.append('above-average')
# Assign the categorized list to the 'lifeExp_cat' column in the DataFrame
gapminder_cond['lifeExp_cat'] = lifeExp_cat
# Assign appropriate data type
# Define the categorical type with the specific order
life_exp_cat_type = pd.CategoricalDtype(categories=["below-average", "above-average"], ordered=True)
# Convert 'lifeExp_cat' to ordered categorical type
gapminder_cond['lifeExp_cat'] = gapminder_cond['lifeExp_cat'].astype(life_exp_cat_type)
# Summary of the 'lifeExp_cat' column
print(gapminder_cond['lifeExp_cat'].describe())
count 1704
unique 2
top above-average
freq 895
Name: lifeExp_cat, dtype: object
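The loop builds a list of labels, one per lifeExp value, and the list is stored as the 'lifeExp_cat' column of the gapminder_cond DataFrame.
The column is converted to an ordered categorical type, and the describe() function is applied to the 'lifeExp_cat' column to get a summary.
This assumes that gapminder_cond is a pandas DataFrame that has already been defined and contains a column named 'lifeExp'.

gapminder_cond['lifeExp_cat'] = pd.Categorical(
    np.where(gapminder_cond['lifeExp'] <= 59.47, 'below-average', 'above-average'),
    categories=['below-average', 'above-average'],
    ordered=True
)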
print(gapminder_cond['lifeExp_cat'].describe())
count 1704
unique 2
top above-average
freq 895
Name: lifeExp_cat, dtype: object
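
The cut-off of 59.47 is hard-coded in the code above. Assuming it was obtained as the mean of the lifeExp column (the lesson does not state this explicitly), a minimal sketch that computes the threshold from the data instead could look like this. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data standing in for gapminder_cond['lifeExp'] (values are made up)
toy = pd.DataFrame({'lifeExp': [43.0, 59.0, 61.5, 78.5]})

# Compute the threshold from the data rather than typing it in by hand
threshold = toy['lifeExp'].mean()

toy['lifeExp_cat'] = pd.Categorical(
    np.where(toy['lifeExp'] <= threshold, 'below-average', 'above-average'),
    categories=['below-average', 'above-average'],
    ordered=True,
)

# Because the categories are ordered, they can be compared against a category label
n_above = int((toy['lifeExp_cat'] > 'below-average').sum())
print(threshold, n_above)
```

The comparison in the last step only works because ordered=True was set; an unordered categorical would not support it.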
Purpose: to categorize each observation into a 'below-average' or 'above-average' life expectancy group within the gapminder_cond DataFrame.
pd.Categorical: constructs a categorical variable in pandas.
np.where: applies vectorized conditional logic from numpy.
Condition: rows with lifeExp less than or equal to 59.47 years are categorized as 'below-average'; rows with lifeExp greater than 59.47 years are labeled 'above-average'.
categories: defines the two categories, 'below-average' and 'above-average'.
ordered: indicates that there is a meaningful order to the categories (True).
Result: creates or replaces the column lifeExp_cat in the gapminder_cond DataFrame with the assigned categories.
describe(): provides a summary of the lifeExp_cat column, including the count, the number of unique categories, and the most frequent category with its frequency.
print: displays the result of gapminder_cond['lifeExp_cat'].describe() on the console, showing statistics of the new categorical column.

# create a numeric list, equivalent to R's c(6:-4)
some_numbers = list(range(6, -5, -1))
some_numbers
[6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4]
# use numpy to take the square root, numpy will automatically generate NaN for negative numbers
sqrt_numbers = np.sqrt(some_numbers)
# print the result
print(sqrt_numbers)
[2.44948974 2.23606798 2. 1.73205081 1.41421356 1.
0. nan nan nan nan]
/tmp/ipykernel_141/2103455644.py:2: RuntimeWarning: invalid value encountered in sqrt
sqrt_numbers = np.sqrt(some_numbers)
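
The RuntimeWarning above appears because np.sqrt is evaluated on the negative entries as well. One way to avoid it while staying in NumPy is to take the square root only where the input is non-negative, sketched below with a boolean mask. The variable names sqrt_safe and mask are illustrative.

```python
import numpy as np

some_numbers = np.arange(6, -5, -1)  # 6, 5, ..., -4

# Start from an array filled with NaN, then fill in square roots
# only at the positions where the input is non-negative
sqrt_safe = np.full(some_numbers.shape, np.nan)
mask = some_numbers >= 0
sqrt_safe[mask] = np.sqrt(some_numbers[mask])

print(sqrt_safe)
```

Because np.sqrt never sees a negative number here, no warning is raised, and the negative positions keep their NaN placeholder.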
# using a list comprehension with a conditional expression to emulate R's ifelse()
# negative numbers yield NaN instead of a square root, so no warning is raised
sqrt_numbers_ifelse = [x**0.5 if x >= 0 else np.nan for x in some_numbers]
# print the result
print(sqrt_numbers_ifelse)
[2.449489742783178, 2.23606797749979, 2.0, 1.7320508075688772, 1.4142135623730951, 1.0, 0.0, nan, nan, nan, nan]
sqrt_numbers_ifelse is a list that is created by iterating over each element in the list some_numbers.
For each x in some_numbers, the expression x**0.5 is evaluated if x is greater than or equal to 0.
If x is less than 0, np.nan is used instead of calculating the square root. np.nan stands for "Not a Number" and is part of the numpy library, representing undefined or unrepresentable numerical results.
As a result, sqrt_numbers_ifelse contains the square roots of all non-negative numbers from some_numbers, and np.nan for all negative numbers, where the square root is not defined in the real number system.
The print(sqrt_numbers_ifelse) statement outputs the content of sqrt_numbers_ifelse to the console, showing the computed square roots and np.nan values for negative inputs.

# Iterate over the DataFrame using the iterrows() function
for i, row in gapminder_cond.iterrows():
    # Check the 'gdpPercap' column to determine the income level
    if row['gdpPercap'] <= 10000:
        gapminder_cond.at[i, 'income_level'] = 'low-income'
    elif row['gdpPercap'] <= 75000:
        gapminder_cond.at[i, 'income_level'] = 'middle-income'
    else:
        gapminder_cond.at[i, 'income_level'] = 'high-income'
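
A quick way to check that the row-by-row iterrows() loop and the vectorized np.where approach agree is to run both on the same data and compare the labels. The sketch below does this on a made-up toy DataFrame rather than on the gapminder data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for gapminder_cond (values are made up)
toy = pd.DataFrame({'gdpPercap': [500.0, 10000.0, 20000.0, 75000.0, 90000.0]})

# Row-by-row labelling with iterrows(), collected in a list
loop_labels = []
for _, row in toy.iterrows():
    if row['gdpPercap'] <= 10000:
        loop_labels.append('low-income')
    elif row['gdpPercap'] <= 75000:
        loop_labels.append('middle-income')
    else:
        loop_labels.append('high-income')

# Vectorized labelling with nested np.where
vector_labels = np.where(
    toy['gdpPercap'] <= 10000, 'low-income',
    np.where(toy['gdpPercap'] <= 75000, 'middle-income', 'high-income'),
)

same = loop_labels == vector_labels.tolist()
print(same)
```

Both approaches apply the same thresholds in the same order, so they should agree; the vectorized form is generally preferred on large DataFrames because it avoids a Python-level loop.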
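# Using multiple conditions with numpy's select()
# Conditions are checked in order, and each row takes the first one that is True
conditions = [
    gapminder_cond['pop'] <= 1000000,
    gapminder_cond['pop'] <= 100000000,
    gapminder_cond['pop'] > 100000000
]
choices = ['small', 'medium', 'large']
# The default is a string placeholder rather than np.nan: recent NumPy versions can raise
# a TypeError when string choices are combined with a float default. The placeholder is
# not a listed category, so pd.Categorical turns any such row into a missing value.
gapminder_cond['pop_size'] = pd.Categorical(
    np.select(conditions, choices, default='unclassified'),
    categories=['small', 'medium', 'large'],
    ordered=True
)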
print(gapminder_cond['pop_size'].describe())
count 1704
unique 3
top medium
freq 1447
Name: pop_size, dtype: object
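
Because np.select assigns each row the choice of the first condition that is True, the order of the conditions matters. The sketch below illustrates this on made-up population values; the placeholder default 'unclassified' is an illustrative choice, not something prescribed by NumPy.

```python
import numpy as np
import pandas as pd

# Toy population values (made up)
pop = pd.Series([500_000, 1_000_000, 50_000_000, 100_000_000, 1_300_000_000])

# Narrowest condition first: each row gets the first label whose condition holds
labels = np.select(
    [pop <= 1_000_000, pop <= 100_000_000, pop > 100_000_000],
    ['small', 'medium', 'large'],
    default='unclassified',
)

# Same conditions with the first two swapped: 'small' can no longer be reached,
# because every small population also satisfies pop <= 100_000_000
swapped = np.select(
    [pop <= 100_000_000, pop <= 1_000_000, pop > 100_000_000],
    ['medium', 'small', 'large'],
    default='unclassified',
)

print(labels.tolist())
print(swapped.tolist())
```

Listing the narrowest condition first, as the lesson's code does, is what allows overlapping conditions such as pop <= 1000000 and pop <= 100000000 to coexist.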
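Conditions: three boolean conditions on the 'pop' column define the population size classes: up to 1,000,000; up to 100,000,000; and above 100,000,000.
Choices: the corresponding labels are ['small', 'medium', 'large'].
np.select(): the np.select() function applies the conditions in order and maps each to its corresponding choice, so a row receives the label of the first condition it satisfies.
Default: a default value is supplied for rows that match none of the conditions; because it is not one of the listed categories, such rows become missing values in the categorical column.
Result: the output of np.select() is used to create a new column 'pop_size' in the gapminder_cond DataFrame.
Summary: the describe() method is used to generate a summary of the 'pop_size' column.
The content and techniques here are based on the official Python, pandas, and NumPy documentation, including the pandas documentation on .loc[] and other selection methods.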