Subsetting

There are three subsetting operators in R – [[, [, $ – each of which will work with a data frame.
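As a quick illustration of how the three operators differ, here is a minimal sketch on a small made-up data frame (toy_df and its columns are hypothetical and not part of the squirrel data). The single bracket keeps the data frame structure, while the double bracket and the dollar sign pull out the column itself.

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(
  id     = 1:3,
  colour = c("Gray", "Black", "Cinnamon"),
  stringsAsFactors = FALSE
)

by_single <- toy_df["colour"]    # [ returns a one-column data frame
by_double <- toy_df[["colour"]]  # [[ returns the column as a vector
by_dollar <- toy_df$colour       # $ also returns the column as a vector

class(by_single)
class(by_double)
identical(by_double, by_dollar)
```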

We’ve already seen how we can select a named column in a data frame with the dollar sign selector.

sq_data$Age

## [1] <NA>  <NA>  <NA>  Adult Adult Adult
## Levels: Adult Juvenile

We can also select by position, using an index inside square brackets.

sq_data[8]

##     Age
## 1  <NA>
## 2  <NA>
## 3  <NA>
## 4 Adult
## 5 Adult
## 6 Adult

If we supply only a single index, R treats it as a column selection.

We can select both rows and columns, however, using a comma, where the first number, or set of numbers, selects the rows, and the second, the columns.

sq_data[1,1] # show the first row and first column of the data frame

## [1] -73.95613

To see the first six rows of the first column, we can supply a range of row numbers:

sq_data[1:6, 1] # show the first six rows and first column of the data frame

## [1] -73.95613 -73.96886 -73.97428 -73.95964 -73.97027 -73.96836

Leaving a selection blank is equivalent to selecting everything: either all rows or all columns.

sq_data[ ,1] # all rows, first column, ie one variable
sq_data[1, ] # all columns, first row, ie one observation

We can also mix and match names and indexes,

sq_data[1:6, "Running"]
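The same row-and-column pattern accepts vectors of positions or names on either side of the comma. A minimal sketch, assuming a hypothetical toy_df rather than the squirrel data:

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(
  id      = 1:5,
  shift   = c("AM", "PM", "PM", "AM", "PM"),
  running = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

rows_by_range <- toy_df[2:3, c("id", "running")]  # rows 2 and 3, two columns chosen by name
rows_by_pick  <- toy_df[c(1, 5), ]                # rows 1 and 5, all columns

rows_by_range
rows_by_pick
```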

There are two ways to select a single column in a data frame: using [] or [ , ]. Extract a single variable, and all observations, from sq_data using each of these two methods, saving each result to its own variable. Then, using str(), investigate the difference in the outputs.

list_ss <- sq_data[6] # list-style subsetting: returns a one-column data frame
str(list_ss)

## 'data.frame':    3023 obs. of  1 variable:
##  $ Date: chr  "10142018" "10192018" "10142018" "10172018" ...

matrix_ss <- sq_data[ ,6] # matrix-style subsetting: returns the column as a vector
str(matrix_ss)

##  chr [1:3023] "10142018" "10192018" "10142018" "10172018" "10172018" ...
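The difference arises because [ , ] simplifies a one-column result to a vector by default. A minimal sketch on a hypothetical toy data frame, showing that the drop = FALSE argument keeps the data frame structure when that is what we want:

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(
  id   = 1:3,
  date = c("10142018", "10192018", "10172018"),
  stringsAsFactors = FALSE
)

as_vector <- toy_df[ , 2]                # simplified to a character vector
as_frame  <- toy_df[ , 2, drop = FALSE]  # stays a one-column data frame

str(as_vector)
str(as_frame)
```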

We’ve encountered sample(). We can combine sample() with index subsetting to take a random sample from our data set. To do this, we’ll need the nrow() function, which, when given a data set, reports its number of rows.

nrow(sq_data)

## [1] 3023

Using this information, take a random sample of 10 observations from sq_data, keeping only the first 6 columns.

sq_data[sample(nrow(sq_data), 10), 1:6]

##              X        Y Unique.Squirrel.ID Hectare Shift     Date
## 1387 -73.96972 40.77114      9H-AM-1006-02     09H    AM 10062018
## 481  -73.97900 40.76978      4B-AM-1010-03     04B    AM 10102018
## 1747 -73.96035 40.79060     32E-PM-1017-15     32E    PM 10172018
## 2812 -73.96321 40.79261     32A-PM-1013-07     32A    PM 10132018
## 2785 -73.97291 40.76728      4H-PM-1006-02     04H    PM 10062018
## 2807 -73.96182 40.79276     33B-AM-1010-02     33B    AM 10102018
## 1797 -73.96972 40.76970      7I-PM-1013-07     07I    PM 10132018
## 2044 -73.95642 40.79916     42C-PM-1013-02     42C    PM 10132018
## 995  -73.96988 40.78038     18C-PM-1018-02     18C    PM 10182018
## 2908 -73.96543 40.78160     21E-AM-1017-02     21E    AM 10172018
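Because sample() draws at random, the rows selected will differ on every run. A minimal sketch, on a hypothetical toy data frame, of fixing the seed with set.seed() so that the same rows are drawn each time (the seed value 42 is arbitrary):

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(id = 1:20, value = (1:20)^2)

set.seed(42)                       # arbitrary seed
rows_a <- sample(nrow(toy_df), 5)  # 5 row numbers, drawn without replacement

set.seed(42)                       # same seed again
rows_b <- sample(nrow(toy_df), 5)  # repeats the same draw

identical(rows_a, rows_b)
toy_df[rows_a, ]
```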

Lastly, we might want only the observations that are complete, that is, records with no missing values.

(sq_data_complete <- sq_data[complete.cases(sq_data), ]) # not terribly useful with this particular data set

##            X        Y Unique.Squirrel.ID Hectare Shift     Date
## 2254 -73.971 40.77258     10F-PM-1019-03     10F    PM 10192018
##      Hectare.Squirrel.Number      Age Primary.Fur.Color Highlight.Fur.Color
## 2254                       3 Juvenile              Gray            Cinnamon
##      Combination.of.Primary.and.Highlight.Color   Color.notes     Location
## 2254                              Gray+Cinnamon Cinnamon head Ground Plane
##      agsm.f   Specific.Location Running Chasing Climbing Eating Foraging
## 2254      0 Behind fence, grass   FALSE   FALSE    FALSE   TRUE    FALSE
##      Other.Activities  Kuks Quaas Moans Tail.flags Tail.twitches Approaches
## 2254    eating (nuts) FALSE FALSE FALSE      FALSE         FALSE       TRUE
##      Indifferent Runs.from                Other.Interactions
## 2254       FALSE     FALSE approaches (bad tourists w/ nuts)
##                                          Lat.Long agsm.m
## 2254 POINT (-73.9709991016317 40.772575670774806)      0
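When a data set has many columns, requiring every value to be present can discard almost everything, as it does here. One possible alternative, sketched on a hypothetical toy data frame, is to apply complete.cases() only to the columns of interest:

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(
  id    = 1:4,
  age   = c("Adult", NA, "Juvenile", "Adult"),
  notes = c(NA, NA, "near tree", NA),
  stringsAsFactors = FALSE
)

# Rows with no NA in any column
all_cols <- toy_df[complete.cases(toy_df), ]

# Rows with no NA in id or age; notes is ignored
key_cols <- toy_df[complete.cases(toy_df[ , c("id", "age")]), ]

nrow(all_cols)
nrow(key_cols)
```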

Logical subsetting

So far, we haven’t applied any conditions to our subsets, but we can. In logical subsetting, the subset defaults to returning the results where the condition is TRUE.

R allows us to specify several conditions

less than <
greater than >
less than or equal to <=
greater than or equal to >=
equivalent to ==
not equivalent to !=

As well as boolean operators

or |
and &

Note that = is equivalent to <-, setting a value, while == tests whether or not two things are the same.

(x = 2)

## [1] 2

(x == 2)

## [1] TRUE

sq_data[sq_data$Age == "Adult", ] # all variables for all Adult observations

sq_data[sq_data$Age == "Adult" & sq_data$Shift == "PM", ] # all variables for all Adult, PM observations
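One caveat applies when the tested column contains NA: a comparison with NA yields NA rather than TRUE or FALSE, and [ returns a row filled with NA for each such element. A minimal sketch on a hypothetical toy data frame, with which() as one way to keep only the rows where the condition is TRUE:

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(
  id  = 1:4,
  age = c("Adult", NA, "Juvenile", "Adult"),
  stringsAsFactors = FALSE
)

is_adult <- toy_df$age == "Adult"     # TRUE, NA, FALSE, TRUE

with_na <- toy_df[is_adult, ]         # the NA in the condition produces an all-NA row
no_na   <- toy_df[which(is_adult), ]  # which() keeps only the TRUE positions

nrow(with_na)
nrow(no_na)
```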

subset()

A lot of this functionality has been built into a single function, subset().

We’ll provide subset() with three arguments:

A data set to subset from
The condition on which to subset
The variables from the original data set to keep – leave this empty to default to all original variables
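The three arguments can be seen together in a minimal sketch on a hypothetical toy data frame (toy_df and its columns are made up for illustration). Note that column names are written bare, without quotes or a toy_df$ prefix:

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(
  id    = 1:4,
  age   = c("Adult", NA, "Juvenile", "Adult"),
  shift = c("AM", "PM", "PM", "PM"),
  stringsAsFactors = FALSE
)

# 1. the data set, 2. the condition, 3. the variables to keep
adult_pm <- subset(toy_df, age == "Adult" & shift == "PM", select = c(id, shift))
adult_pm
```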

Let’s say we only want to see the records for adult squirrels.

sq_data_adult <- subset(sq_data, Age == "Adult") # from sq_data, select only rows where Age equals Adult

Or all records with cinnamon coloured adult squirrels.

sq_data_c_adult <- subset(sq_data, Age == "Adult" & Primary.Fur.Color == "Cinnamon")

Or all records where an age has been recorded. This can be done in two ways:

sq_data_age_1 <- subset(sq_data, !is.na(Age))

sq_data_age_2 <- subset(sq_data, Age == "Adult" | Age == "Juvenile")

Remember, NA values are a special value, indicating that a value is missing. This is different from writing the characters NA. is.na() tests for NA. It’s for this reason that running subset(sq_data, sq_data$Age == 'NA') will give you an empty data frame.
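The distinction can be checked on a small example. A minimal sketch, assuming a hypothetical toy data frame with one missing age:

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(
  id  = 1:3,
  age = c("Adult", NA, "Juvenile"),
  stringsAsFactors = FALSE
)

as_text  <- subset(toy_df, age == "NA")  # compares against the characters "NA"
as_value <- subset(toy_df, is.na(age))   # tests for missing values

nrow(as_text)
nrow(as_value)
```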

Or maybe we want to investigate patterns of only those squirrels for which we don’t have age recorded!

sq_data_age_na <- subset(sq_data, is.na(Age))

All data was collected between October 6 and October 20, 2018. Create a subset of the data that includes only data collected on or before October 14, and only retains the variables Date and Age. Verify your date range output with unique() and if you get that far, order this output from unique() using sort() for easier reading.

Remember that we have not yet converted our Date variable to a date class.

sq_data$Date <- as.Date(sq_data$Date, format = "%m%d%Y") # convert mmddyyyy strings to Date class
sq_data_date <- subset(sq_data, Date <= as.Date("2018-10-14"), select = c(Date, Age)) # on or before October 14
sort(unique(sq_data_date$Date))
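The same steps can be tried on a few made-up rows to see what the conversion does. This is a minimal sketch under the assumption that dates are stored as mmddyyyy character strings, as in sq_data; toy_df and cutoff are hypothetical names.

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(
  date = c("10062018", "10142018", "10202018"),
  age  = c("Adult", NA, "Juvenile"),
  stringsAsFactors = FALSE
)

class(toy_df$date)                                      # character before conversion
toy_df$date <- as.Date(toy_df$date, format = "%m%d%Y")  # character to Date
class(toy_df$date)                                      # Date after conversion

cutoff <- as.Date("2018-10-14")                         # explicit Date for the comparison
early  <- subset(toy_df, date <= cutoff, select = c(date, age))
sort(unique(early$date))
```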

Subsetting can be achieved in many ways. We can use indexing to get specific observations (rows) and/or variables (columns), we can subset on conditions, we can use functions like is.na() and complete.cases() to work around missing values, and we can use subset(), among other functions.

Function	Description
`subset`	subset a data frame by condition
`is.na`	test whether each value in a vector is `NA`
`complete.cases`	test whether each case (row) is complete