import pandas as pd
import numpy as np
import math
# Reading CSV files from GitHub
gapminder = pd.read_csv('https://raw.githubusercontent.com/csc-ubc-okanagan/workshops/a091bc6eae8b9045866c28dbd1848c7e072db5b1/data/gapminder.csv')
gapminder.to_csv('gapminder.csv', index=False)
x = 5
if x < 6:
    print("x is less than 6")
if x == 5:
    print("x is equal to 5")
if x != 2:
    print("x is some other number than 2")
x is less than 6
x is equal to 5
x is some other number than 2
This code checks the variable x against different conditions. x is assigned a value of 5. Each if statement is evaluated independently, so all three checks run.
First if statement: checks whether x is less than 6 and, if so, prints “x is less than 6”. Since x is 5, this condition is true and the message is printed.
Second if statement: checks whether x is exactly equal to 5 and, if so, prints “x is equal to 5”. Since x is indeed 5, this condition is true and the message is printed.
Third if statement: checks whether x is not equal to 2 and, if so, prints “x is some other number than 2”. Since x is 5, not 2, this condition is true and the message is printed.

if x == 2:
    print("x is equal to 2")
else:
    print("x is not equal to 2")
x is not equal to 2
This code uses an if/else statement to check the variable x and execute code based on its value.
It starts with an if statement to check if x is equal to the number 2.
Comparison (==): == is the comparison operator for equality in Python.
If the condition (x == 2) evaluates to True: the print function within the if block is called, which prints "x is equal to 2" to the console.
If x is anything other than 2, the condition evaluates to False: the print function within the else block is executed, which prints "x is not equal to 2" to the console. Since x is 5 here, this is the branch that runs.
Only one of the two print statements will execute based on the value of x.
Indentation defines the if and else blocks, which is critical in Python syntax.
This conditional structure ensures that the program can appropriately respond to the specific condition of x being equal to 2 or not.
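To see both branches in action, the same check can be wrapped in a small function and called with several values. This is a minimal sketch; the function name describe_x is a hypothetical helper introduced here for illustration and is not part of the original lesson.

```python
def describe_x(x):
    """Return a message saying whether x equals 2 (hypothetical helper)."""
    if x == 2:
        return "x is equal to 2"
    else:
        return "x is not equal to 2"


# Exactly one branch runs for each value
for value in [2, 5, -1]:
    print(value, "->", describe_x(value))
```

Only the call with the value 2 takes the if branch; every other value falls through to else.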
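if condition1:
    ...  # runs when condition1 is True
elif condition2:
    ...  # runs when condition1 is False and condition2 is True
else:
    ...  # runs when neither condition is True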
if x < 5:
    print("x is less than 5")
elif x > 5:
    print("x is greater than 5")
else:
    print("x is equal to 5")
x is equal to 5
This code takes the variable x and compares it against the number 5.
First condition (if): the if statement checks if x is less than 5. If x is indeed less than 5, the condition evaluates to True and "x is less than 5" will be printed to the console.
Second condition (elif): the elif (short for ‘else if’) statement checks if x is greater than 5. It is only evaluated if the first if condition was False. If x is greater than 5, it prints "x is greater than 5".
Final condition (else): the else statement covers the scenario where x is neither less than nor greater than 5, that is, when x is equal to 5. It prints "x is equal to 5" in this case.
Only one block runs: the first one whose condition is True. If x is less than 5, the first block runs. If x is greater than 5, the second block runs. If x is neither (which means it must be equal to 5), the else block runs.
This structure is an efficient way to handle multiple related conditions by checking them in sequence until one of the conditions is met.
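The difference between an if/elif/else chain and a series of independent if statements can be sketched as follows. The function names chained and independent are hypothetical, chosen only for this illustration.

```python
def chained(x):
    """Collect messages from an if/elif/else chain: at most one branch runs."""
    messages = []
    if x < 10:
        messages.append("less than 10")
    elif x < 100:
        messages.append("less than 100")
    else:
        messages.append("100 or more")
    return messages


def independent(x):
    """Collect messages from separate if statements: every true condition runs."""
    messages = []
    if x < 10:
        messages.append("less than 10")
    if x < 100:
        messages.append("less than 100")
    return messages


print(chained(5))      # ['less than 10']
print(independent(5))  # ['less than 10', 'less than 100']
```

The value 5 satisfies both conditions, but the chain stops at the first match while the independent statements each fire.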
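# Work on a copy so the original gapminder DataFrame is left unchanged
# (a plain assignment would only create a second name for the same object)
gapminder_cond = gapminder.copy()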
# Create an empty column to hold our ordered categorical data
# Initially, all values are set to missing (pd.NA)
gapminder_cond['income_level'] = pd.NA
# Define the categories and their order
income_levels = pd.CategoricalDtype(categories=["low-income", "middle-income", "high-income"], ordered=True)
# Start the loop to add values to income_level based on GDP values
for i in gapminder_cond.index:  # Iterating over the DataFrame index
    if gapminder_cond.loc[i, 'gdpPercap'] <= 10000:
        gapminder_cond.loc[i, 'income_level'] = 'low-income'
    elif gapminder_cond.loc[i, 'gdpPercap'] <= 75000:
        gapminder_cond.loc[i, 'income_level'] = 'middle-income'
    else:
        gapminder_cond.loc[i, 'income_level'] = 'high-income'
# Convert the 'income_level' column to ordered categorical type
gapminder_cond['income_level'] = gapminder_cond['income_level'].astype(income_levels)
# Summary of the 'income_level' column
print(gapminder_cond['income_level'].describe())
count 1704
unique 3
top low-income
freq 1312
Name: income_level, dtype: object
Step by step: a new column, income_level, is added to gapminder_cond with missing values (NaN). The loop then visits each row; using .loc[], it accesses and evaluates the ‘gdpPercap’ value for each row and assigns the matching label. Finally, the describe() function is called on the ‘income_level’ column, printing a summary that includes count, unique, top, and frequency of the categories.

gapminder_cond['income_level'] = pd.Categorical(
    np.where(gapminder_cond['gdpPercap'] <= 10000, 'low-income',
             np.where(gapminder_cond['gdpPercap'] <= 75000, 'middle-income', 'high-income')),
    categories=['low-income', 'middle-income', 'high-income'],
    ordered=True
)
print(gapminder_cond['income_level'].describe())
count 1704
unique 3
top low-income
freq 1312
Name: income_level, dtype: object
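One way to gain confidence that the loop and the np.where version assign the same labels is to run both on a small DataFrame and compare the results. The sketch below uses made-up gdpPercap values (not taken from the gapminder data) and the hypothetical names toy, loop_labels, and vector_labels.

```python
import numpy as np
import pandas as pd

# Made-up values chosen to hit each branch, including the boundaries
toy = pd.DataFrame({"gdpPercap": [500.0, 10000.0, 10000.01, 75000.0, 120000.0]})

# Row-by-row version, mirroring the if/elif/else loop
loop_labels = []
for value in toy["gdpPercap"]:
    if value <= 10000:
        loop_labels.append("low-income")
    elif value <= 75000:
        loop_labels.append("middle-income")
    else:
        loop_labels.append("high-income")

# Vectorized version with nested np.where
vector_labels = np.where(
    toy["gdpPercap"] <= 10000, "low-income",
    np.where(toy["gdpPercap"] <= 75000, "middle-income", "high-income"),
)

# Both approaches should agree on every row
print(loop_labels == vector_labels.tolist())
```

Including the boundary values 10000 and 75000 in the toy data checks that both versions treat the thresholds as inclusive.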
Objective: Assign a categorical variable income_level based on gdpPercap values in the gapminder_cond DataFrame.
Functions used: pd.Categorical from pandas to create a categorical column, and np.where from numpy for conditional assignments.
Conditions: low-income for gdpPercap <= 10,000; middle-income for gdpPercap above 10,000 and up to 75,000; high-income for gdpPercap > 75,000.
Categories: ['low-income', 'middle-income', 'high-income'], declared as ordered (ordered=True).
Result: adds the income_level column to gapminder_cond with the appropriate labels, then uses print(gapminder_cond['income_level'].describe()) to display the descriptive statistics of the income_level column.

# Assign values of 'below-average' if lifeExp is equal to or below 59.47
# and 'above-average' if above 59.47
lifeExp_cat = []
# Loop through each life expectancy value in the 'lifeExp' column
for x in gapminder_cond['lifeExp']:
    # Check if the life expectancy is below or equal to 59.47
    if x <= 59.47:
        # If it is, append 'below-average' to the list
        lifeExp_cat.append('below-average')
    else:
        # If it is not, append 'above-average' to the list
        lifeExp_cat.append('above-average')
# Assign the categorized list to the 'lifeExp_cat' column in the DataFrame
gapminder_cond['lifeExp_cat'] = lifeExp_cat
# Assign appropriate data type
# Define the categorical type with the specific order
life_exp_cat_type = pd.CategoricalDtype(categories=["below-average", "above-average"], ordered=True)
# Convert 'lifeExp_cat' to ordered categorical type
gapminder_cond['lifeExp_cat'] = gapminder_cond['lifeExp_cat'].astype(life_exp_cat_type)
# Summary of the 'lifeExp_cat' column
print(gapminder_cond['lifeExp_cat'].describe())
count 1704
unique 2
top above-average
freq 895
Name: lifeExp_cat, dtype: object
The loop builds a list of labels and assigns it as the lifeExp_cat column of the gapminder_cond DataFrame. It then applies the describe() function to the ‘lifeExp_cat’ column to get a summary. This assumes gapminder_cond is a pandas DataFrame that has already been defined and contains a column named ‘lifeExp’.

gapminder_cond['lifeExp_cat'] = pd.Categorical(
    np.where(gapminder_cond['lifeExp'] <= 59.47, 'below-average', 'above-average'),
    categories=['below-average', 'above-average'],
    ordered=True
)
print(gapminder_cond['lifeExp_cat'].describe())
count 1704
unique 2
top above-average
freq 895
Name: lifeExp_cat, dtype: object
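Declaring the categories as ordered is what allows comparisons such as "greater than" between labels. A minimal sketch on made-up life expectancy values follows; the names toy_life, labels, and is_above are hypothetical and the numbers are not taken from the gapminder data.

```python
import numpy as np
import pandas as pd

# Made-up life expectancy values, not taken from the gapminder data
toy_life = pd.Series([45.2, 59.47, 61.0, 78.3])

labels = pd.Series(pd.Categorical(
    np.where(toy_life <= 59.47, "below-average", "above-average"),
    categories=["below-average", "above-average"],
    ordered=True,
))

# With ordered=True, labels can be compared against a category
is_above = labels > "below-average"
print(is_above.tolist())

# min() and max() follow the category order, not alphabetical order
print(labels.min(), labels.max())
```

Alphabetically, ‘above-average’ would sort before ‘below-average’; the ordered categorical uses the declared order instead.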
Purpose: To categorize observations (country-year rows) into ‘below-average’ or ‘above-average’ life expectancy groups within the gapminder_cond DataFrame.
pd.Categorical: Constructs a categorical variable in pandas.
np.where: Applies vectorized conditional logic from numpy. Rows with lifeExp less than or equal to 59.47 years are categorized as ‘below-average’; rows with lifeExp greater than 59.47 years are labeled ‘above-average’.
categories: Defines the two categories, ‘below-average’ and ‘above-average’.
ordered: Indicates that there is a meaningful order to the categories (True).
Result: Creates the column lifeExp_cat in the gapminder_cond DataFrame with the assigned categories.
describe(): Provides a summary of the lifeExp_cat column: the count, the number of unique categories, and the most frequent category with its frequency.
print: Displays the result of gapminder_cond['lifeExp_cat'].describe() to the console, showing statistics of the new categorical column.

# create a numeric list, equivalent to R's c(6:-4)
some_numbers = list(range(6, -5, -1))
some_numbers
[6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4]
# use numpy to take the square root, numpy will automatically generate NaN for negative numbers
sqrt_numbers = np.sqrt(some_numbers)
# print the result
print(sqrt_numbers)
[2.44948974 2.23606798 2. 1.73205081 1.41421356 1.
0. nan nan nan nan]
/tmp/ipykernel_141/2103455644.py:2: RuntimeWarning: invalid value encountered in sqrt
sqrt_numbers = np.sqrt(some_numbers)
# using a list comprehension with conditional to emulate R's ifelse()
# this will replace negative numbers with NaN before taking the square root
sqrt_numbers_ifelse = [x**0.5 if x >= 0 else np.nan for x in some_numbers]
# print the result
print(sqrt_numbers_ifelse)
[2.449489742783178, 2.23606797749979, 2.0, 1.7320508075688772, 1.4142135623730951, 1.0, 0.0, nan, nan, nan, nan]
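A vectorized counterpart to the list comprehension, which also avoids the RuntimeWarning shown earlier, is to let np.sqrt skip the negative entries. This is a sketch, assuming the values are first converted to a float array; the names values and safe_sqrt are hypothetical.

```python
import numpy as np

some_numbers = list(range(6, -5, -1))
values = np.array(some_numbers, dtype=float)

# Start from an array of NaN, then compute the square root only where
# the input is non-negative; the other positions keep their NaN
safe_sqrt = np.full_like(values, np.nan)
np.sqrt(values, out=safe_sqrt, where=values >= 0)

print(safe_sqrt)
```

Because the negative entries are never passed to the square root, no invalid-value warning is triggered.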
sqrt_numbers_ifelse is a list that is created by iterating over each element in the list some_numbers.
For each x in some_numbers, the expression x**0.5 is evaluated if x is greater than or equal to 0.
If x is less than 0, np.nan is used instead of calculating the square root. np.nan stands for “Not a Number” and is part of the numpy library, representing undefined or unrepresentable numerical results.
As a result, sqrt_numbers_ifelse contains the square roots of all non-negative numbers from some_numbers, and np.nan for all negative numbers, where the square root is not defined in the real number system.
The print(sqrt_numbers_ifelse) statement outputs the content of sqrt_numbers_ifelse to the console, showing the computed square roots and np.nan values for negative inputs.

# Iterate over the DataFrame using the iterrows() function
for i, row in gapminder_cond.iterrows():
    # Check the 'gdpPercap' column to determine the income level
    if row['gdpPercap'] <= 10000:
        gapminder_cond.at[i, 'income_level'] = 'low-income'
    elif row['gdpPercap'] <= 75000:
        gapminder_cond.at[i, 'income_level'] = 'middle-income'
    else:
        gapminder_cond.at[i, 'income_level'] = 'high-income'
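# Using multiple conditions with numpy's select()
conditions = [
    gapminder_cond['pop'] <= 1000000,
    gapminder_cond['pop'] <= 100000000,
    gapminder_cond['pop'] > 100000000
]
choices = ['small', 'medium', 'large']
# The default is a string, like the choices: recent NumPy versions can raise a
# TypeError when string choices are mixed with default=np.nan. A label that is
# not in the categories list becomes missing (NaN) in the resulting Categorical.
gapminder_cond['pop_size'] = pd.Categorical(
    np.select(conditions, choices, default='unknown'),
    categories=['small', 'medium', 'large'],
    ordered=True
)
print(gapminder_cond['pop_size'].describe())
count 1704
unique 3
top medium
freq 1447
Name: pop_size, dtype: object
Conditions: three boolean Series test the ‘pop’ column: at most 1,000,000; at most 100,000,000; and above 100,000,000.
Choices: the labels paired with the conditions, in order: ['small', 'medium', 'large'].
np.select(): the np.select() function is used to apply the conditions and map each to its corresponding choice. Conditions are checked in order and the first one that is true wins, so a population of 500,000 is labeled ‘small’ even though it also satisfies the second condition. A placeholder label is used as the default value for rows that match no condition; because it is not one of the listed categories, pd.Categorical stores it as a missing value.
The result of np.select() is used to create a new column ‘pop_size’ in the gapminder_cond DataFrame.
The describe() method is used to generate a summary of the ‘pop_size’ column.

The provided content and techniques are based on documentation and resources from official Python, Pandas, and NumPy websites: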
Python: Python Official Documentation
Pandas: documentation on .loc[] and other selection methods.