Last Updated: 2023-10-25
$
We’ve already seen that we can select a single variable of a data frame with $.
data_gapminder <- read.csv("../data/gapminder.csv")
lifeExp <- data_gapminder$lifeExp
head(lifeExp)
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
Note that R generally simplifies data structures when possible, so lifeExp has lost most of its attributes and is now simply a vector…
class(lifeExp)
## [1] "numeric"
$ can be used with both data frames and lists.
basic_list <- list(item_1 = letters,
item_2 = 1:10,
item_3 = c(TRUE, FALSE))
item_1 <- basic_list$item_1
item_1
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
# again, it defaults to creating a vector
class(item_1)
## [1] "character"
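As a small sketch of the same behaviour on made-up data (toy_df and toy_list are illustrative names, not part of the gapminder data), $ returns the element itself, and asking for a name that does not exist gives NULL rather than an error:

```r
# Made-up objects for illustration only
toy_df <- data.frame(id = 1:3, score = c(90.5, 82.0, 77.25))
toy_list <- list(item_1 = c("a", "b"), item_2 = 1:3)

# $ pulls one named column out of a data frame as a plain vector
scores <- toy_df$score
class(scores)        # "numeric"

# $ on a list returns the element itself
ids <- toy_list$item_2
class(ids)           # "integer"

# a name that is not present returns NULL instead of raising an error
missing_item <- toy_list$item_9
is.null(missing_item)   # TRUE
```

Because a mistyped name fails silently in this way, it can be worth checking names() when a result unexpectedly comes back as NULL.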
[ and [[
You can also subset using single or double square brackets. These give you more flexibility than $, though they can also be more verbose. The primary difference between the two is that [ preserves the original data structure, while [[ discards it and simplifies.
pop_preserved <- data_gapminder["pop"]
pop_simple <- data_gapminder[["pop"]]
class(pop_preserved)
## [1] "data.frame"
class(pop_simple)
## [1] "integer"
[ allows you to specify ranges of rows and columns to extract, while [[ extracts a single element. The arguments are passed as [row range, column range]. Leaving an argument blank returns all rows or columns.
# value in the first row and first column
data_gapminder[1,1]
## [1] "Afghanistan"
# values in the second row
data_gapminder[2,]
## country continent year lifeExp pop gdpPercap
## 2 Afghanistan Asia 1957 30.332 9240934 820.853
# values in the first 5 rows and first 3 columns
data_gapminder[1:5, 1:3]
## country continent year
## 1 Afghanistan Asia 1952
## 2 Afghanistan Asia 1957
## 3 Afghanistan Asia 1962
## 4 Afghanistan Asia 1967
## 5 Afghanistan Asia 1972
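One detail of the [row, column] form is worth checking for yourself: when the result is a single column, a data frame simplifies it to a vector by default. The sketch below uses a made-up data frame (toy_df is an illustrative name) and the drop argument of [ to keep the data frame structure:

```r
# Made-up data frame for illustration only
toy_df <- data.frame(country = c("A", "B", "C"),
                     year = c(1952, 1957, 1962),
                     pop = c(10L, 20L, 30L))

# one column in the [row, column] form is simplified to a vector
one_col_vec <- toy_df[, "pop"]
class(one_col_vec)   # "integer"

# drop = FALSE keeps the result as a one-column data frame
one_col_df <- toy_df[, "pop", drop = FALSE]
class(one_col_df)    # "data.frame"

# two or more columns always come back as a data frame
two_cols <- toy_df[1:2, c("country", "year")]
dim(two_cols)        # 2 2
```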
This works on lists as well…
# return the first element of a list
basic_list[1]
## $item_1
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
However, if you want an item (value) from within a list, you need to simplify that object first with [[ and then apply a single [:
# extract the first value from the first list item
basic_list[[1]][1]
## [1] "a"
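The same two-step extraction also works with element names. A minimal sketch on a made-up list (toy_list is an illustrative name) contrasts what [ and [[ return:

```r
# Made-up list for illustration only
toy_list <- list(item_1 = letters,
                 item_2 = 1:10,
                 item_3 = c(TRUE, FALSE))

# [ keeps the list wrapper: the result is a list of length 1
kept_as_list <- toy_list["item_2"]
class(kept_as_list)   # "list"

# [[ pulls the element out of the list
pulled_out <- toy_list[["item_2"]]
class(pulled_out)     # "integer"

# [[ followed by [ reaches a value inside the element
third_value <- toy_list[["item_2"]][3]
third_value           # 3
```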
You can specify the row and column to subset either by name, wrapped in quotation marks, or by index number.
You can also filter your data to specific cases based on conditions. This is done by placing the condition within the subsetting operators.
Conditions include:
!= | not equal to |
== | equal to |
< | less than |
> | greater than |
<= | less than or equal to |
>= | greater than or equal to |
And conditions can be strung together with the following operators:
! | not |
| | or |
& | and |
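As a quick sketch of how these operators combine, the example below filters a made-up numeric vector (ages is an illustrative name). Each condition produces one TRUE or FALSE per element, and only the TRUE positions are kept:

```r
# Made-up vector for illustration only
ages <- c(12, 25, 37, 41, 58, 63)

# and: both conditions must hold
working_age <- ages[ages >= 25 & ages < 60]
working_age     # 25 37 41 58

# or: either condition may hold
young_or_old <- ages[ages < 20 | ages > 60]
young_or_old    # 12 63

# not: flips the condition, equivalent to ages != 37 here
not_37 <- ages[!(ages == 37)]
not_37          # 12 25 41 58 63
```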
# using the lifeExp vector created earlier
lifeExp_gt_70 <- lifeExp[lifeExp > 70]
head(lifeExp_gt_70, n = 20)
## [1] 70.420 72.000 71.581 72.950 75.651 76.423 70.994 72.301 70.774 71.868
## [11] 73.275 74.340 75.320 70.330 70.930 71.100 71.930 73.490 74.740 76.320
length(lifeExp_gt_70)
## [1] 493
For data frames, put the condition on a column in the row position; this keeps every row where the condition is true. The column position, after the comma, then lists the columns of interest (leave it blank to keep them all).
# all columns for 1952
gapminder_1952 <- data_gapminder[data_gapminder$year == 1952, ]
head(gapminder_1952)
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 13 Albania Europe 1952 55.230 1282697 1601.0561
## 25 Algeria Africa 1952 43.077 9279525 2449.0082
## 37 Angola Africa 1952 30.015 4232095 3520.6103
## 49 Argentina Americas 1952 62.485 17876956 5911.3151
## 61 Australia Oceania 1952 69.120 8691212 10039.5956
# country and gdp for 1952
gdp_1952 <- data_gapminder[data_gapminder$year == 1952, c(1,6)]
head(gdp_1952)
## country gdpPercap
## 1 Afghanistan 779.4453
## 13 Albania 1601.0561
## 25 Algeria 2449.0082
## 37 Angola 3520.6103
## 49 Argentina 5911.3151
## 61 Australia 10039.5956
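It can help to look at the condition on its own: it is just a logical vector with one value per row. The sketch below uses a made-up data frame (toy_df and its gdp column are illustrative) and also shows that columns can be picked by name instead of by position, which keeps working if the column order changes:

```r
# Made-up data frame for illustration only
toy_df <- data.frame(country = c("A", "B", "C", "D"),
                     year = c(1952, 1952, 1957, 1957),
                     gdp = c(100, 200, 150, 250))

# the condition evaluates to one TRUE/FALSE per row
is_1952 <- toy_df$year == 1952
is_1952   # TRUE TRUE FALSE FALSE

# columns chosen by name or by position give the same result
by_name <- toy_df[is_1952, c("country", "gdp")]
by_position <- toy_df[is_1952, c(1, 3)]
identical(by_name, by_position)   # TRUE
```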
subset()
subset() allows you to achieve similar results to those above. At its most basic, it takes a data set and a condition on which to subset.
head(subset(data_gapminder, year == 1952))
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 13 Albania Europe 1952 55.230 1282697 1601.0561
## 25 Algeria Africa 1952 43.077 9279525 2449.0082
## 37 Angola Africa 1952 30.015 4232095 3520.6103
## 49 Argentina Americas 1952 62.485 17876956 5911.3151
## 61 Australia Oceania 1952 69.120 8691212 10039.5956
You can also list the columns you’d like to keep with the select argument.
head(subset(data_gapminder, year == 1952,
select = c("country", "gdpPercap")))
## country gdpPercap
## 1 Afghanistan 779.4453
## 13 Albania 1601.0561
## 25 Algeria 2449.0082
## 37 Angola 3520.6103
## 49 Argentina 5911.3151
## 61 Australia 10039.5956
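One difference between the two approaches shows up when the column being tested contains missing values. In the sketch below, on a made-up data frame (toy_df is an illustrative name), [ returns an extra all-NA row for the missing value, while subset() drops it; wrapping the condition in which() is one way to get the same behaviour from [:

```r
# Made-up data frame with one missing value, for illustration only
toy_df <- data.frame(id = 1:4, value = c(5, NA, 12, 8))

# [ keeps a row of NAs where the condition itself is NA
with_bracket <- toy_df[toy_df$value > 6, ]
nrow(with_bracket)   # 3

# subset() treats an NA condition as FALSE
with_subset <- subset(toy_df, value > 6)
nrow(with_subset)    # 2

# which() keeps only the positions where the condition is TRUE
with_which <- toy_df[which(toy_df$value > 6), ]
nrow(with_which)     # 2
```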
Multiple conditions can be passed when subsetting, whether using [ or subset():
# 1952 and Americas
am_52 <- data_gapminder[data_gapminder$year == 1952 & data_gapminder$continent == "Americas", ]
head(am_52)
## country continent year lifeExp pop gdpPercap
## 49 Argentina Americas 1952 62.485 17876956 5911.315
## 133 Bolivia Americas 1952 40.414 2883315 2677.326
## 169 Brazil Americas 1952 50.917 56602560 2108.944
## 241 Canada Americas 1952 68.750 14785584 11367.161
## 277 Chile Americas 1952 54.745 6377619 3939.979
## 301 Colombia Americas 1952 50.643 12350771 2144.115
am_52_ss <- subset(data_gapminder, year == 1952 & continent == "Americas")
head(am_52_ss)
## country continent year lifeExp pop gdpPercap
## 49 Argentina Americas 1952 62.485 17876956 5911.315
## 133 Bolivia Americas 1952 40.414 2883315 2677.326
## 169 Brazil Americas 1952 50.917 56602560 2108.944
## 241 Canada Americas 1952 68.750 14785584 11367.161
## 277 Chile Americas 1952 54.745 6377619 3939.979
## 301 Colombia Americas 1952 50.643 12350771 2144.115
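The other operators combine in the same way. As a sketch on a made-up data frame (toy_df is an illustrative name), the example below uses or and not, plus base R's %in% operator, which is a compact way to test a column against several values at once:

```r
# Made-up data frame for illustration only
toy_df <- data.frame(country = c("Argentina", "Canada", "France", "Japan"),
                     continent = c("Americas", "Americas", "Europe", "Asia"),
                     year = c(1952, 1957, 1952, 1957))

# or: rows in the Americas, or from 1952, or both
either <- subset(toy_df, continent == "Americas" | year == 1952)
nrow(either)         # 3

# not: everything outside the Americas
not_americas <- toy_df[!(toy_df$continent == "Americas"), ]
nrow(not_americas)   # 2

# %in%: continent matches any value in the set
europe_or_asia <- subset(toy_df, continent %in% c("Europe", "Asia"))
nrow(europe_or_asia) # 2
```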
The dplyr package from the tidyverse draws a good conceptual break between selecting variables and filtering for cases based on values with its select() and filter() functions.
select()
select() allows you to pick columns based on their names. It takes a data set followed by one or more column specifications. A specification can be a single name, a span of names separated with a :, or some other condition, such as names starting with, ending with, or containing specific characters, or even the data type held in the column.
# load the library
library(dplyr)
# select 1 column
head(select(data_gapminder, country))
## country
## 1 Afghanistan
## 2 Afghanistan
## 3 Afghanistan
## 4 Afghanistan
## 5 Afghanistan
## 6 Afghanistan
# select multiple columns
head(select(data_gapminder, year, lifeExp))
## year lifeExp
## 1 1952 28.801
## 2 1957 30.332
## 3 1962 31.997
## 4 1967 34.020
## 5 1972 36.088
## 6 1977 38.438
# select range of columns
head(select(data_gapminder, continent:pop))
## continent year lifeExp pop
## 1 Asia 1952 28.801 8425333
## 2 Asia 1957 30.332 9240934
## 3 Asia 1962 31.997 10267083
## 4 Asia 1967 34.020 11537966
## 5 Asia 1972 36.088 13079460
## 6 Asia 1977 38.438 14880372
# select numeric columns
head(select(data_gapminder, where(is.numeric)))
## year lifeExp pop gdpPercap
## 1 1952 28.801 8425333 779.4453
## 2 1957 30.332 9240934 820.8530
## 3 1962 31.997 10267083 853.1007
## 4 1967 34.020 11537966 836.1971
## 5 1972 36.088 13079460 739.9811
## 6 1977 38.438 14880372 786.1134
# select by negation
head(select(data_gapminder, !country))
## continent year lifeExp pop gdpPercap
## 1 Asia 1952 28.801 8425333 779.4453
## 2 Asia 1957 30.332 9240934 820.8530
## 3 Asia 1962 31.997 10267083 853.1007
## 4 Asia 1967 34.020 11537966 836.1971
## 5 Asia 1972 36.088 13079460 739.9811
## 6 Asia 1977 38.438 14880372 786.1134
More options can be found at https://dplyr.tidyverse.org/reference/select.html.
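The name-based conditions mentioned earlier (starting with, ending with, containing) are available through the helper functions starts_with(), ends_with() and contains(). A minimal sketch, assuming dplyr is installed and using a made-up data frame whose column names are purely illustrative:

```r
library(dplyr)

# Made-up data frame for illustration only
toy_df <- data.frame(country = c("A", "B"),
                     pop_1952 = c(1, 2),
                     pop_1957 = c(3, 4),
                     gdp_1952 = c(10, 20))

# columns whose names start with "pop"
pop_cols <- names(select(toy_df, starts_with("pop")))
pop_cols    # "pop_1952" "pop_1957"

# columns whose names end with "1952"
cols_1952 <- names(select(toy_df, ends_with("1952")))
cols_1952   # "pop_1952" "gdp_1952"

# columns whose names contain "dp"
dp_cols <- names(select(toy_df, contains("dp")))
dp_cols     # "gdp_1952"
```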
filter()
filter() is very similar to subset() for working on data frames (it doesn’t work on vectors), and follows the same basic approach: it takes a data set and a condition on which to filter.
head(filter(data_gapminder, continent == "Americas"))
## country continent year lifeExp pop gdpPercap
## 1 Argentina Americas 1952 62.485 17876956 5911.315
## 2 Argentina Americas 1957 64.399 19610538 6856.856
## 3 Argentina Americas 1962 65.142 21283783 7133.166
## 4 Argentina Americas 1967 65.634 22934225 8052.953
## 5 Argentina Americas 1972 67.065 24779799 9443.039
## 6 Argentina Americas 1977 68.481 26983828 10079.027
filter() works well with other offerings from the tidyverse. More options can be found at https://dplyr.tidyverse.org/reference/filter.html.
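As one sketch of how filter() combines with other dplyr functions, the example below chains it with select() using the %>% pipe that dplyr makes available. It assumes dplyr is installed and uses a made-up data frame (toy_df is an illustrative name). Conditions separated by commas inside filter() must all be true:

```r
library(dplyr)

# Made-up data frame for illustration only
toy_df <- data.frame(country = c("Argentina", "Canada", "France", "Canada"),
                     continent = c("Americas", "Americas", "Europe", "Americas"),
                     year = c(1952, 1952, 1952, 1957),
                     lifeExp = c(62.5, 68.8, 67.4, 70.0))

# keep the 1952 rows for the Americas, then keep two columns
americas_1952 <- toy_df %>%
  filter(continent == "Americas", year == 1952) %>%
  select(country, lifeExp)

americas_1952$country   # "Argentina" "Canada"
```

Reading the pipeline top to bottom gives the order of operations: start with the data, filter the cases, then select the variables.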