Last Updates: 2023-11-08
Conditionally iterating over data can be a powerful tool, for such things as catching errors, testing a user input, or, perhaps more commonly in data analysis tasks, creating categorical values from numeric interval or ratio data.
if else statements apply a test to a value, and if it returns TRUE
, one thing is done, and if it returns FALSE
an alternative thing is done.
At its simplest, we can do just an if statement, which follows the syntax of
if (test-condition) {
do this
}
x <- 5
if (x < 6) {
print("x is less than 6")
}
## [1] "x is less than 6"
if (x == 5) {
print("x is equal to 5")
}
## [1] "x is equal to 5"
if (x != 2) {
print("x not 2")
}
## [1] "x not 2"
The if else construct allows us to string multiple test conditions together, and follows the syntax of
if (test-condition) {
do this
} else {
do this
}
if (x == 2) {
print("x is equal to 2")
} else {
print("x is not equal to 2")
}
## [1] "x is not equal to 2"
We can begin to string things together with still more test conditions using else if
if (test-condition) {
do this
} else if (new test-condition) {
do this
} else if (new test-condition) {
do this
} else {
do this
}
if (x < 5) {
print("x is less than 5")
} else if (x > 5) {
print("x is greater than 5")
} else {
print("x is equal to 5")
}
## [1] "x is equal to 5"
A more practical implementation of an if else statement would be creating a new variable with values based off of those in another variable. This process can get a little convoluted as it combines an if else
statement with a for loop
.
As an example, lets assume that we want to categorize countries based on their GDP as either high, middle, or low income.
# load the data
library(gapminder)
gapminder_cond <- gapminder
# create an empty variable to hold our ordered categorical data
gapminder_cond$income_level <- ordered(NA, levels = c("low-income",
"middle-income",
"high-income"))
# start the for loop to add values to income-level based on GDP values
for (i in seq_along(gapminder_cond$income_level)) { # for each item along the dataframe
if (gapminder_cond$gdpPercap[i] <= 10000) { # if gdpPercap is LT = to 10000
gapminder_cond$income_level[i] <- 'low-income' # add 'low-income' to that observation's income level variable
} else if (gapminder_cond$gdpPercap[i] <= 75000) { # else, if gdpPercap is LT = 75000
gapminder_cond$income_level[i] <- "middle-income" # add 'middle-income' to that observation's income level variable
} else { # otherwise
gapminder_cond$income_level[i] <- "high-income" # add 'high-income' to that observation's income level variable
}
}
summary(gapminder_cond$income_level)
## low-income middle-income high-income
## 1312 387 5
This can be cumbersome. Luckily R offers some simpler options for iterating over data with an if else construct.
R has a built in function, ifelse()
that allows us to provide a condition, and, if that condition returns TRUE
to do one thing, and if it returns FALSE
to do something else. We’ll split life expectancy into ‘above average’ and below average’ categories.
# Assign values
# of 'below-average' if lifeExp is equal to or below 59.47
# and 'above-average' if above 59.47
gapminder_cond$lifeExp_cat <- ifelse(gapminder_cond$lifeExp <= 59.47, "below-average", "above-average")
# assign appropriate data type
gapminder_cond$lifeExp_cat <- ordered(gapminder_cond$lifeExp_cat, levels = c("below-average", "above-average"))
summary(gapminder_cond$lifeExp_cat)
## below-average above-average
## 809 895
ifelse()
works well when we have a two condition set to work through. One more example, pulled from its documentation, to help avoid an error / warning being thrown or unwanted values being produced…
# create a numeric vector
some_numbers <- c(6:-4)
# take the square root of each value, which creates a bit of an issue
sqrt(some_numbers)
## Warning in sqrt(some_numbers): NaNs produced
## [1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 NaN
## [9] NaN NaN NaN
# using if else to test for numbers greater than or equal to 0
# if true, calculate the square root
# if false, produce NA
sqrt(ifelse(some_numbers >= 0, some_numbers, NA))
## [1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 NA
## [9] NA NA NA
Tidyverse’s dplyr package offers an extension on ifelse()
, allowing multiple if else
statements.
Whereas earlier we ran
for (i in seq_along(gapminder_cond$income_level)) { # for each item along the dataframe
if (gapminder_cond$gdpPercap[i] <= 10000) { # if gdpPercap is LT = 10000
gapminder_cond$income_level[i] = 'low-income' # add 'low-income' to that observation's income level variable
} else if (gapminder_cond$gdpPercap[i] <= 75000) { # else, if gdpPercap is LT = 75000
gapminder_cond$income_level[i] = "middle-income" # add 'middle-income' to that observation's income level variable
} else { # otherwise
gapminder_cond$income_level[i] = "high-income" # add 'high-income' to that observation's income level variable
}
}
With case_when()
, we can simply write
library(dplyr)
gapminder_cond$pop_size <- dplyr::case_when(
gapminder_cond$pop <= 1000000 ~ "small",
gapminder_cond$pop <= 100000000 ~ "medium",
gapminder_cond$pop > 100000000 ~ "large"
)
gapminder_cond$pop_size <- ordered(gapminder_cond$pop_size, levels = c("small", "medium", "large"))
summary(gapminder_cond$pop_size)
## small medium large
## 180 1447 77