There are three subsetting operators in R – [[, [, $ – each of which will work with a data frame.
We’ve already seen how we can select a named column in a data frame with the dollar sign selector.
sq_data$Age
## [1] <NA> <NA> <NA> Adult Adult Adult
## Levels: Adult Juvenile
We can also use indexed positions with square brackets,
sq_data[8]
## Age
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 Adult
## 5 Adult
## 6 Adult
If we supply only a single index, R treats it as a column selection.
We can select both rows and columns, however, using a comma, where the first number, or set of numbers, selects the rows, and the second, the columns.
sq_data[1, 1] # show the first row and first column of the data frame
## [1] -73.95613
If we wanted to see the first 6 rows
sq_data[1:6, 1] # show the first six rows and first column of the data frame
## [1] -73.95613 -73.96886 -73.97428 -73.95964 -73.97027 -73.96836
Leaving a selection blank is the equivalent of selecting everything; either all rows, or all columns,
sq_data[ , 1] # all rows, first column, ie one variable
sq_data[1, ] # all columns, first row, ie one observation
We can also mix and match names and indexes,
sq_data[1:6, "Running"]
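The indexing patterns above can be tried out on a small data frame. The sketch below uses a hypothetical toy data frame (the values are made up, not taken from the squirrel data) so that each result is easy to check by eye.

```r
# Toy data frame with made-up values (an assumption for illustration only)
toy <- data.frame(
  X       = c(-73.95, -73.96, -73.97),
  Age     = c("Adult", NA, "Juvenile"),
  Running = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

cell      <- toy[1, 1]             # one row, one column: a single value
first_two <- toy[1:2, ]            # rows 1 and 2, all columns
by_name   <- toy[2:3, "Running"]   # rows by index, column by name
two_cols  <- toy[ , c("X", "Age")] # all rows, two named columns

print(cell)
print(by_name)
print(dim(first_two))
print(names(two_cols))
```

Selecting columns by name tends to be more robust than selecting by position, since it keeps working if the column order changes.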
There are two ways to select a single column in a data frame: using [] or [ , ]. Extract a single variable, and all observations, from sq_data using each of these two methods, saving each to its own variable. Then, using str(), investigate the difference in the outputs.
matrix_ss <- sq_data[6]
str(matrix_ss)
## 'data.frame': 3023 obs. of 1 variable:
## $ Date: chr "10142018" "10192018" "10142018" "10172018" ...
list_ss <- sq_data[ , 6]
str(list_ss)
## chr [1:3023] "10142018" "10192018" "10142018" "10172018" "10172018" ...
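The difference seen in the two str() outputs can be reproduced on a small scale. The sketch below uses a hypothetical two-row data frame and also shows [[, the third operator mentioned at the start, along with the drop = FALSE argument, which asks the comma form to keep the data frame structure.

```r
# Toy data frame with made-up values (an assumption for illustration only)
toy <- data.frame(
  Date = c("10142018", "10192018"),
  Age  = c("Adult", "Juvenile"),
  stringsAsFactors = FALSE
)

as_df   <- toy["Date"]                  # single bracket: a one-column data frame
as_vec  <- toy[ , "Date"]               # comma form: simplified to a vector
as_vec2 <- toy[["Date"]]                # double bracket: also a vector
kept_df <- toy[ , "Date", drop = FALSE] # comma form, data frame retained

str(as_df)
str(as_vec)
```

Which form is preferable depends on what comes next: functions that expect a vector need the vector forms, while code that expects a data frame is safer with [] or drop = FALSE.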
We’ve encountered sample(). We can use sample() and index subsetting to take a random sample from our data set. To do this, we’ll need the nrow() function, which, when fed a data set, reports the number of rows,
nrow(sq_data)
## [1] 3023
Using this information, take a random sample of 10 observations from sq_data for the first 6 columns.
sq_data[sample(nrow(sq_data), 10), 1:6]
## X Y Unique.Squirrel.ID Hectare Shift Date
## 1387 -73.96972 40.77114 9H-AM-1006-02 09H AM 10062018
## 481 -73.97900 40.76978 4B-AM-1010-03 04B AM 10102018
## 1747 -73.96035 40.79060 32E-PM-1017-15 32E PM 10172018
## 2812 -73.96321 40.79261 32A-PM-1013-07 32A PM 10132018
## 2785 -73.97291 40.76728 4H-PM-1006-02 04H PM 10062018
## 2807 -73.96182 40.79276 33B-AM-1010-02 33B AM 10102018
## 1797 -73.96972 40.76970 7I-PM-1013-07 07I PM 10132018
## 2044 -73.95642 40.79916 42C-PM-1013-02 42C PM 10132018
## 995 -73.96988 40.78038 18C-PM-1018-02 18C PM 10182018
## 2908 -73.96543 40.78160 21E-AM-1017-02 21E AM 10172018
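Because sample() draws different rows each time it runs, the output above will not match on a rerun. A minimal sketch of one way to make the draw repeatable is to call set.seed() first; the seed value and the toy data frame below are arbitrary assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, chosen only for illustration

# Toy data frame with made-up values
toy <- data.frame(id = 1:20, value = (1:20)^2)

rows       <- sample(nrow(toy), 5) # 5 distinct row numbers between 1 and 20
toy_sample <- toy[rows, 1:2]       # those rows, first two columns

print(toy_sample)
```

By default sample() draws without replacement, so no row appears twice in the result.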
Lastly, we might want only observations that are complete, that is, records with no missing values
(sq_data_complete <- sq_data[complete.cases(sq_data), ]) # not terribly useful with this particular data set
## X Y Unique.Squirrel.ID Hectare Shift Date
## 2254 -73.971 40.77258 10F-PM-1019-03 10F PM 10192018
## Hectare.Squirrel.Number Age Primary.Fur.Color Highlight.Fur.Color
## 2254 3 Juvenile Gray Cinnamon
## Combination.of.Primary.and.Highlight.Color Color.notes Location
## 2254 Gray+Cinnamon Cinnamon head Ground Plane
## agsm.f Specific.Location Running Chasing Climbing Eating Foraging
## 2254 0 Behind fence, grass FALSE FALSE FALSE TRUE FALSE
## Other.Activities Kuks Quaas Moans Tail.flags Tail.twitches Approaches
## 2254 eating (nuts) FALSE FALSE FALSE FALSE FALSE TRUE
## Indifferent Runs.from Other.Interactions
## 2254 FALSE FALSE approaches (bad tourists w/ nuts)
## Lat.Long agsm.m
## 2254 POINT (-73.9709991016317 40.772575670774806) 0
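Only one record survives here because complete.cases() checks every column. When a data set has free-text columns that are mostly empty, it may be more useful to check only the columns that matter. The sketch below shows both approaches on a hypothetical toy data frame with made-up values.

```r
# Toy data frame with made-up values (an assumption for illustration only)
toy <- data.frame(
  Age   = c("Adult", NA, "Juvenile", "Adult"),
  Shift = c("AM", "PM", "PM", "AM"),
  Notes = c(NA, NA, "eating", "climbing"),
  stringsAsFactors = FALSE
)

# Rows with no missing values in any column
all_complete <- toy[complete.cases(toy), ]

# Rows with no missing values in Age and Shift only; Notes is ignored
key_complete <- toy[complete.cases(toy[ , c("Age", "Shift")]), ]

print(nrow(all_complete))
print(nrow(key_complete))
```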
So far, we haven’t applied any conditions to our subsets, but we can. In logical subsetting, the subset defaults to returning the results where the condition is TRUE.
R allows us to specify several conditions
>
<
<=
>=
==
!=
As well as boolean operators
|
&
Note that = is equivalent to <-, setting a value, while == tests whether or not two things are the same.
(x = 2)
## [1] 2
(x == 2)
## [1] TRUE
sq_data[sq_data$Age == "Adult", ] # all variables for all Adult observations
sq_data[sq_data$Age == "Adult" & sq_data$Shift == "PM", ] # all variables for all Adult, PM observations
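One caveat with bracket-based logical subsetting deserves a sketch: where the condition evaluates to NA, [ returns a row filled with NA instead of dropping it. Since Age contains missing values, this can matter. The toy data frame below is hypothetical, and wrapping the condition in which() is one common way, not the only one, to keep just the TRUE rows.

```r
# Toy data frame with made-up values (an assumption for illustration only)
toy <- data.frame(
  Age   = c("Adult", NA, "Juvenile", "Adult"),
  Shift = c("AM", "PM", "PM", "PM"),
  stringsAsFactors = FALSE
)

is_adult <- toy$Age == "Adult"     # TRUE, NA, FALSE, TRUE

with_na  <- toy[is_adult, ]        # the NA yields an extra all-NA row
no_na    <- toy[which(is_adult), ] # which() keeps only the TRUE positions

# The same idea with two conditions combined
adult_pm <- toy[which(toy$Age == "Adult" & toy$Shift == "PM"), ]

print(nrow(with_na))
print(nrow(no_na))
```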
A lot of this functionality has been built into a subset function, subset().
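We’ll provide subset() with up to three arguments: the data frame to subset, a logical condition that selects rows, and, optionally, select, which names the columns to keep.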
Let’s say we only want to see the records for adult squirrels.
sq_data_adult <- subset(sq_data, Age == "Adult") # from sq_data, select only rows where Age equals Adult
Or all records with cinnamon coloured adult squirrels.
sq_data_c_adult <- subset(sq_data, Age == "Adult" & Primary.Fur.Color == "Cinnamon")
Or all records where an age has been recorded. This can be done in two ways,
sq_data_age_1 <- subset(sq_data, !is.na(Age))
sq_data_age_2 <- subset(sq_data, Age == "Adult" | Age == "Juvenile")
Remember, NA is a special value, indicating that a value is missing. This is different from the character string "NA". is.na() tests for NA. It’s for this reason that running subset(sq_data, sq_data$Age == 'NA') will give you an empty data frame.
Or maybe we want to investigate patterns of only those squirrels for which we don’t have age recorded!
sq_data_age_na <- subset(sq_data, is.na(Age))
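The behaviour of subset() around missing values can be checked on a small scale. In the sketch below, the toy data frame is hypothetical; it shows that the two ways of keeping recorded ages agree, that comparing against the text "NA" matches nothing, and that is.na() finds the missing record. Unlike bracket subsetting, subset() treats a condition that evaluates to NA as FALSE.

```r
# Toy data frame with made-up values (an assumption for illustration only)
toy <- data.frame(
  id  = 1:4,
  Age = c("Adult", NA, "Juvenile", "Adult"),
  stringsAsFactors = FALSE
)

known_1 <- subset(toy, !is.na(Age))                       # age recorded
known_2 <- subset(toy, Age == "Adult" | Age == "Juvenile") # same rows
as_text <- subset(toy, Age == "NA")                        # no rows match
missing <- subset(toy, is.na(Age))                         # the one NA row

print(nrow(known_1))
print(nrow(as_text))
print(missing$id)
```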
All data was collected between October 6 and October 20, 2018. Create a subset of the data that includes only data collected on or before October 14, and only retains the variables Date and Age. Verify your date range output with unique() and, if you get that far, order this output from unique() using sort() for easier reading.
Remember that we have not yet converted our Date variable to a date class.
sq_data$Date <- as.Date(sq_data$Date, "%m%d%Y")
sq_data_date <- subset(sq_data, Date <= "2018-10-14", select = c(Date, Age))
sort(unique(sq_data_date$Date))
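The same steps can be traced on a few rows to see what each one does. The sketch below uses a hypothetical toy data frame with dates stored as text in the same month-day-year layout, and makes the cutoff an explicit Date object; that is a stylistic choice for illustration, not a requirement.

```r
# Toy data frame with made-up values (an assumption for illustration only)
toy <- data.frame(
  Date  = c("10062018", "10142018", "10192018", "10142018"),
  Age   = c("Adult", NA, "Juvenile", "Adult"),
  Shift = c("AM", "PM", "PM", "AM"),
  stringsAsFactors = FALSE
)

# Convert the text to Date class: %m = month, %d = day, %Y = four-digit year
toy$Date <- as.Date(toy$Date, format = "%m%d%Y")

cutoff <- as.Date("2018-10-14")
early  <- subset(toy, Date <= cutoff, select = c(Date, Age))

early_dates <- sort(unique(early$Date))
print(early_dates)
```

Comparing before converting would compare text rather than dates, which is why the conversion comes first.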
Subsetting can be achieved in many ways. We can use indexing to get specific observations (rows) and / or variables (columns), we can use conditions on which to subset, we can use other functions like is.na() and complete.cases() to work around missing values, and we can use subset(), among other functions, to subset.
Function | Description |
---|---|
subset | conditionally subset a data frame |
is.na | conditionally test if values in a vector are NA. |
complete.cases | conditionally test if cases (rows) are complete. |