Last Updates: 2023-10-17
In general, R takes care of a lot of iteration for you; this is perhaps why R is based around atomic vectors as a fundamental data structure. When processing your data, R looks at the data type of the vector – not every value in it – and frequently passes the task off to code written in a much faster language, like C or Fortran. This helps to keep R code as fast as possible. You can see this by looking at a function’s internals - these ‘vectorized’ functions are often denoted by something like Primitive
or Internal
.
# look inside the sum() and sqrt() functions
sum
## function (..., na.rm = FALSE) .Primitive("sum")
sqrt
## function (x) .Primitive("sqrt")
# look inside the colSums() function that does some tests in R first before passing along
colSums
## function (x, na.rm = FALSE, dims = 1L)
## {
## if (is.data.frame(x))
## x <- as.matrix(x)
## if (!is.array(x) || length(dn <- dim(x)) < 2L)
## stop("'x' must be an array of at least two dimensions")
## if (dims < 1L || dims > length(dn) - 1L)
## stop("invalid 'dims'")
## n <- prod(dn[id <- seq_len(dims)])
## dn <- dn[-id]
## z <- if (is.complex(x))
## .Internal(colSums(Re(x), n, prod(dn), na.rm)) + (0+1i) *
## .Internal(colSums(Im(x), n, prod(dn), na.rm))
## else .Internal(colSums(x, n, prod(dn), na.rm))
## if (length(dn) > 1L) {
## dim(z) <- dn
## dimnames(z) <- dimnames(x)[-id]
## }
## else names(z) <- dimnames(x)[[dims + 1L]]
## z
## }
## <bytecode: 0x00000239cebae130>
## <environment: namespace:base>
What this means in practice, is that we can pass a vector to one of these functions, and the function operates on the entire vector.
some_numbers <- c(1:10)
sqrt(some_numbers + 2)
## [1] 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 3.000000 3.162278
## [9] 3.316625 3.464102
Similarly, we can pass an object made of vectors and the whole ‘package’ is passed along.
library(gapminder)
library(dplyr)
data_gapminder_num <- select(gapminder, is.numeric)
colSums(data_gapminder_num)
## year lifeExp pop gdpPercap
## 3.373068e+06 1.013444e+05 5.044047e+10 1.229492e+07
colMeans(data_gapminder_num)
## year lifeExp pop gdpPercap
## 1.979500e+03 5.947444e+01 2.960121e+07 7.215327e+03
If you hear people talking about ‘vectorizing your code’, generally they mean taking advantage of this construct.
R also has several several built in functions that hide the need to build a loop for you; these are generally simpler and faster than writing your own loops. Perhaps the most common are the apply()
family of functions. Generally, these functions work such that you provide a data set and function to apply across that data set, but each is a little unique in terms of what it expects, how it expects it, and what it returns. Each will cater to slightly different use cases.
apply()
operates on a matrix or array (it will convert an eligible data frame to a matrix) and takes three arguments, the data set, a dimension (1 = row, 2 = column) on which to apply a function, and a function.
apply(data_gapminder_num, 2, sum)
## year lifeExp pop gdpPercap
## 3.373068e+06 1.013444e+05 5.044047e+10 1.229492e+07
head(apply(data_gapminder_num[data_gapminder_num$year == 1952, ], 2, sort))
## year lifeExp pop gdpPercap
## [1,] 1952 28.801 60011 298.8462
## [2,] 1952 30.000 63149 299.8503
## [3,] 1952 30.015 120447 328.9406
## [4,] 1952 30.331 147962 331.0000
## [5,] 1952 31.286 153936 339.2965
## [6,] 1952 31.975 160000 362.1463
apply(data_gapminder_num, 2, max)
## year lifeExp pop gdpPercap
## 2.007000e+03 8.260300e+01 1.318683e+09 1.135231e+05
lapply()
and sapply()
are similar to apply()
, but a little simpler in their construction; both require only a data set (vector, data frame, or list) and a function as an argument.
lapply()
always returns a list and sapply()
‘simplifies’ the output, returning a vector or matrix.
lapply(data_gapminder_num, sum)
## $year
## [1] 3373068
##
## $lifeExp
## [1] 101344.4
##
## $pop
## [1] 50440465801
##
## $gdpPercap
## [1] 12294917
sapply(data_gapminder_num, sum)
## year lifeExp pop gdpPercap
## 3.373068e+06 1.013444e+05 5.044047e+10 1.229492e+07
When using lapply()
it can be handy to be familiar with unlist()
, which simplifies a list to a vector:
unlist(lapply(data_gapminder_num, sum))
## year lifeExp pop gdpPercap
## 3.373068e+06 1.013444e+05 5.044047e+10 1.229492e+07
apply()
, sapply()
, etc all take a single object (vector, data frame, etc), as an argument and perform a function on that object. mapply()
applies a function across more than one object. To do this, you start with the function and then a list of data objects:
# three vectors
a <- c(1:10)
b <- c(11:20)
c <- c(21:30)
# sum each element together in order (first value of each object, second value, etc)
mapply(sum, a, b, c)
## [1] 33 36 39 42 45 48 51 54 57 60
This is functionally equivalent to first merging the objects into one object and then using one of the other members of the apply()
family
d <- cbind(a, b, c)
d
## a b c
## [1,] 1 11 21
## [2,] 2 12 22
## [3,] 3 13 23
## [4,] 4 14 24
## [5,] 5 15 25
## [6,] 6 16 26
## [7,] 7 17 27
## [8,] 8 18 28
## [9,] 9 19 29
## [10,] 10 20 30
apply(d, 1, sum)
## [1] 33 36 39 42 45 48 51 54 57 60
The purrr package from Tidyverse provides an equivalent to the apply()
family of functions, called map()
and map2()
. These are constructed to be consistent in how they take their inputs and specific in what they return. map()
expects one data object, map2()
expects two data objects.
# install purr if not already installed
# install.packages("purrr")
library(purrr)
map(data_gapminder_num, sum)
## $year
## [1] 3373068
##
## $lifeExp
## [1] 101344.4
##
## $pop
## [1] 50440465801
##
## $gdpPercap
## [1] 12294917
map2(a, b, sum)
## [[1]]
## [1] 12
##
## [[2]]
## [1] 14
##
## [[3]]
## [1] 16
##
## [[4]]
## [1] 18
##
## [[5]]
## [1] 20
##
## [[6]]
## [1] 22
##
## [[7]]
## [1] 24
##
## [[8]]
## [1] 26
##
## [[9]]
## [1] 28
##
## [[10]]
## [1] 30
There are several variations on map()
and map2()
- it’s worth consulting the documentation if you plan to use the Tidyverse approach:
map()
: https://purrr.tidyverse.org/reference/map.htmlmap2()
: https://purrr.tidyverse.org/reference/map2.htmlVecotrization and pre-existing looping functions will be the fastest and likely simplest way to iterate over your data. When your task is unique, however, you may need to write your own loop. Two common loop constructs are ‘for loops’ and ‘while loops’. Here we cover the former.
For loops iterate for a pre-determined number of iterations and take the following basic approach:
for (each value in a list of values using an indexed location) {
do something;
}
for example
for (int in c(1:10)) { # for each int in the range 1 through 10
print(int) # print that int
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
We can similarly walk over a data frame
# sample our data to keep things small
data_gapminder_sample <- data_gapminder_num[sample(nrow(data_gapminder_num), 20), ]
data_gapminder_sample
## # A tibble: 20 × 4
## year lifeExp pop gdpPercap
## <int> <dbl> <int> <dbl>
## 1 1972 72.9 22284500 18971.
## 2 1997 68.0 63327987 8264.
## 3 1967 46.6 420690 2613.
## 4 1972 73.5 4991596 18866.
## 5 1962 44.2 1146757 1056.
## 6 1977 46.5 228694 3082.
## 7 1962 49.1 10516500 4957.
## 8 1952 38.5 863308 576.
## 9 1967 71.1 52667100 10022.
## 10 1987 54.9 395114 1316.
## 11 1997 43.8 6633514 931.
## 12 1962 69.4 2310904 7106.
## 13 2002 50.7 67946797 530.
## 14 1992 71.6 3326498 2497.
## 15 2007 78.2 301139947 42952.
## 16 1977 73.5 56059245 14256.
## 17 1972 40.3 7450606 422.
## 18 1962 38.4 3628608 428.
## 19 1962 53.5 3453434 1662.
## 20 1997 69.5 5783439 5155.
#example of what we want to print, the log base 10 of the first value in the column gdpPercap
print(log10(data_gapminder_sample$gdpPercap[1]))
## [1] 4.27808
# for demonstration purposes
# for each indexed position in our sampled data
# print the index position and
# the log base 10 of the per capita gdp for that index point
for (index in seq_along(data_gapminder_sample$gdpPercap)) {
print(index)
print(log10(data_gapminder_sample$gdpPercap[index]))
}
## [1] 1
## [1] 4.27808
## [1] 2
## [1] 3.917169
## [1] 3
## [1] 3.417156
## [1] 4
## [1] 4.275685
## [1] 5
## [1] 3.023621
## [1] 6
## [1] 3.488799
## [1] 7
## [1] 3.695222
## [1] 8
## [1] 2.7601
## [1] 9
## [1] 4.000972
## [1] 10
## [1] 3.11925
## [1] 11
## [1] 2.968761
## [1] 12
## [1] 3.851603
## [1] 13
## [1] 2.72432
## [1] 14
## [1] 3.397495
## [1] 15
## [1] 4.63298
## [1] 16
## [1] 4.153997
## [1] 17
## [1] 2.624925
## [1] 18
## [1] 2.631343
## [1] 19
## [1] 3.220667
## [1] 20
## [1] 3.712214
We can store values within a for loop, but they are confined to that for loop and are not accessible outside of it
for (index in seq_along(data_gapminder_sample$gdpPercap)) {
index_number <- index
log <- log10(data_gapminder_sample$gdpPercap[index])
cat(paste0("The log base 10 of the index point ", index, " is: ", log, "\n"))
}
## The log base 10 of the index point 1 is: 4.27808039981948
## The log base 10 of the index point 2 is: 3.91716877724031
## The log base 10 of the index point 3 is: 3.41715630666173
## The log base 10 of the index point 4 is: 4.2756846000415
## The log base 10 of the index point 5 is: 3.02362115947219
## The log base 10 of the index point 6 is: 3.48879895792208
## The log base 10 of the index point 7 is: 3.69522224658516
## The log base 10 of the index point 8 is: 2.76010041014319
## The log base 10 of the index point 9 is: 4.00097178847123
## The log base 10 of the index point 10 is: 3.11924955698062
## The log base 10 of the index point 11 is: 2.9687613813918
## The log base 10 of the index point 12 is: 3.85160263257717
## The log base 10 of the index point 13 is: 2.72431973268524
## The log base 10 of the index point 14 is: 3.39749469826406
## The log base 10 of the index point 15 is: 4.632979883279
## The log base 10 of the index point 16 is: 4.15399722211687
## The log base 10 of the index point 17 is: 2.62492535061919
## The log base 10 of the index point 18 is: 2.6313433883056
## The log base 10 of the index point 19 is: 3.22066691101895
## The log base 10 of the index point 20 is: 3.71221396789169
In order to store the output of a for loop, we first need to create an object to hold the data.
# create an empty vector of a known length and data type
gdp_percapita_log <- vector(mode = "double", length = nrow(data_gapminder_sample))
# update the values of the vector as we iterate over gdpPercap
for (i in seq_along(data_gapminder_sample$gdpPercap)){
gdp_percapita_log[i] <- log10(data_gapminder_sample$gdpPercap[i])
}
# print the vector
gdp_percapita_log
## [1] 4.278080 3.917169 3.417156 4.275685 3.023621 3.488799 3.695222 2.760100
## [9] 4.000972 3.119250 2.968761 3.851603 2.724320 3.397495 4.632980 4.153997
## [17] 2.624925 2.631343 3.220667 3.712214