We’re going to work with a (hopefully) fun data set today, which we’ll download and then import.

First things first, we’ll use R to set up a place for us to work. We’ll begin by creating a directory on our desktop:

dir.create("~/Desktop/r-for-beginners") # create a project directory

In R, the tilde ~ is shorthand for your home directory. On macOS and Linux this is normally your user directory, which contains your Desktop. On Windows, R may resolve ~ to a different folder, such as Documents. If ~ does not point to the folder that contains your Desktop, the command above will not work.
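If you are unsure where ~ points on your machine, a quick check along these lines may help. This is a sketch rather than part of the original walkthrough; the variable names home_dir and desktop_dir are illustrative.

```r
# Show what "~" expands to in this R session.
home_dir <- path.expand("~")
print(home_dir)

# Build the Desktop path without hard-coding a path separator.
desktop_dir <- file.path(home_dir, "Desktop")

# TRUE means a Desktop folder exists at that location.
# FALSE suggests choosing a different parent folder for the project.
desktop_found <- dir.exists(desktop_dir)
print(desktop_found)
```

If the check prints FALSE, any folder you can write to will work as the parent of r-for-beginners.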

Next, we’ll point our current session at this folder, verify where we are, and create a directory to hold our data.

setwd("~/Desktop/r-for-beginners") # set the working directory to the directory we just created
getwd() # print current working directory
## [1] "/Users/vdunbar/Desktop/r-for-beginners"
dir.create("data") # create a directory in our working directory called data
## Warning in dir.create("data"): 'data' already exists
list.dirs() # list the directories to verify things.
## [1] "."      "./data"
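The warning above appears because dir.create() was run when the data directory already existed. One way to avoid it, sketched here under the assumption that you want the step to be safely repeatable, is to check with dir.exists() first. The example works inside R's temporary directory so that it does not touch your real project; project_dir and data_dir are hypothetical names.

```r
# Work in a throwaway location so the demo has no side effects on your project.
project_dir <- file.path(tempdir(), "r-for-beginners-demo")
data_dir <- file.path(project_dir, "data")

# Create the directory (and its parent) only if it is missing.
if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}

# A second attempt fails quietly: dir.create() returns FALSE for an
# existing directory, and showWarnings = FALSE suppresses the warning.
second_try <- dir.create(data_dir, showWarnings = FALSE)
print(second_try)

# List the directories below the demo project to verify.
print(list.dirs(project_dir, full.names = FALSE))
```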

And then we’ll download our data:

data_src <- "https://tinyurl.com/mu8y9n29" # define the source of our data, in this case, a url

download.file(url = data_src, destfile = 'data/2018-cp_squirrel_census.csv') # download the file, and save it with a specified name
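Re-running the script downloads the file again each time. A minimal sketch of one way to skip the download when a local copy already exists follows; fetch_once is a hypothetical helper, not a base R function, and the example.org URL is a placeholder that is never contacted because the destination file is created first.

```r
# Hypothetical helper: download only when the destination file is missing.
fetch_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("Already downloaded: ", destfile)
    return(invisible(FALSE))
  }
  # Make sure the target folder exists before downloading into it.
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  download.file(url = url, destfile = destfile)
  invisible(TRUE)
}

# No-network demonstration: the destination already exists,
# so fetch_once() returns FALSE without contacting the URL.
demo_file <- tempfile(fileext = ".csv")
writeLines("a,b", demo_file)
was_downloaded <- fetch_once("https://example.org/placeholder.csv", demo_file)
print(was_downloaded)
```

In the walkthrough, the equivalent call would pass data_src and 'data/2018-cp_squirrel_census.csv'.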

Now that our data is downloaded and we have a local copy, we’ll pull it into R. We can simply import the data and display it in our console:

read.csv('data/2018-cp_squirrel_census.csv', header = TRUE) # read the data

But this isn’t terribly useful: the data is printed and then discarded. We want to keep it in memory, so we’ll assign it to a variable:

sq_data <- read.csv('data/2018-cp_squirrel_census.csv', header = TRUE) # read the data into a variable called sq_data

We can now explore the data, looking at the data types and data structures R defaults to when importing this csv file:

str(sq_data) # explore the structure of the data set.
## 'data.frame':    3023 obs. of  31 variables:
##  $ X                                         : num  -74 -74 -74 -74 -74 ...
##  $ Y                                         : num  40.8 40.8 40.8 40.8 40.8 ...
##  $ Unique.Squirrel.ID                        : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
##  $ Hectare                                   : chr  "37F" "21B" "11B" "32E" ...
##  $ Shift                                     : chr  "PM" "AM" "PM" "PM" ...
##  $ Date                                      : int  10142018 10192018 10142018 10172018 10172018 10102018 10102018 10082018 10062018 10102018 ...
##  $ Hectare.Squirrel.Number                   : int  3 4 8 14 5 3 2 2 1 3 ...
##  $ Age                                       : chr  "" "" "" "Adult" ...
##  $ Primary.Fur.Color                         : chr  "" "" "Gray" "Gray" ...
##  $ Highlight.Fur.Color                       : chr  "" "" "" "" ...
##  $ Combination.of.Primary.and.Highlight.Color: chr  "+" "+" "Gray+" "Gray+" ...
##  $ Color.notes                               : chr  "" "" "" "Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments." ...
##  $ Location                                  : chr  "" "" "Above Ground" "" ...
##  $ Above.Ground.Sighter.Measurement          : int  NA NA 10 NA NA NA 0 0 0 30 ...
##  $ Specific.Location                         : chr  "" "" "" "" ...
##  $ Running                                   : chr  "false" "false" "false" "false" ...
##  $ Chasing                                   : chr  "false" "false" "true" "false" ...
##  $ Climbing                                  : chr  "false" "false" "false" "false" ...
##  $ Eating                                    : chr  "false" "false" "false" "true" ...
##  $ Foraging                                  : chr  "false" "false" "false" "true" ...
##  $ Other.Activities                          : chr  "" "" "" "" ...
##  $ Kuks                                      : chr  "false" "false" "false" "false" ...
##  $ Quaas                                     : chr  "false" "false" "false" "false" ...
##  $ Moans                                     : chr  "false" "false" "false" "false" ...
##  $ Tail.flags                                : chr  "false" "false" "false" "false" ...
##  $ Tail.twitches                             : chr  "false" "false" "false" "false" ...
##  $ Approaches                                : chr  "false" "false" "false" "false" ...
##  $ Indifferent                               : chr  "false" "false" "false" "false" ...
##  $ Runs.from                                 : chr  "false" "false" "false" "true" ...
##  $ Other.Interactions                        : chr  "" "" "" "" ...
##  $ Lat.Long                                  : chr  "POINT (-73.9561344937861 40.7940823884086)" "POINT (-73.9688574691102 40.7837825208444)" "POINT (-73.97428114848522 40.775533619083)" "POINT (-73.9596413903948 40.7903128889029)" ...

This output tells us that we’re working with a data frame with 3023 rows (observations) and 31 columns (variables). It then lists each variable, tells us what data type it was interpreted as on import, and shows us the first few values of that variable.

We can get slightly better access to the data itself with View():

View(sq_data) # look at the data in a spreadsheet-like format. Note the capital V.
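Alongside str() and View(), a few other base R functions give quick summaries of a data frame. The sketch below uses a small hand-made data frame, toy, as a stand-in for sq_data so that it runs without the download; its values are made up for illustration.

```r
# A tiny stand-in for sq_data (values are illustrative only).
toy <- data.frame(
  Shift = c("PM", "AM", "PM"),
  Hectare.Squirrel.Number = c(3L, 4L, 8L),
  Running = c("false", "false", "true"),
  stringsAsFactors = FALSE
)

print(dim(toy))            # number of rows, then number of columns
print(names(toy))          # column names
print(head(toy, n = 2))    # first two rows
print(sapply(toy, class))  # class of each column
```

The same calls work unchanged on sq_data.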

Missing Values

Before digging much deeper into the data set, one of the first things we’ll note is that there are a lot of missing values. Missing values need to be properly encoded to be programmatically useful. An application, or a user, may represent missing values in many ways. Sometimes an out-of-range value is used: when a variable is binary and its values are 0 or 1, 99 may indicate a missing value. Files exported from SPSS might encode missing values as a period, ".". In Excel, it’s extremely easy for a user to accidentally leave a space, " ", in an otherwise empty cell, or they may choose to write the characters NA. A database export might include either NA or NULL.

When exported, especially using a format like csv, all of these notations for missing values are converted to character strings or numbers. Ideally, when importing into a piece of software, like R, we would have a way to provide a list of possible ways of encoding missing values and standardize how these are presented.

In R, NA is the dedicated marker for a missing value, and read.csv() has an argument for converting specified strings to NA on import.
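To see why proper encoding matters, consider the out-of-range convention mentioned above. In this sketch, the vector responded is made-up data in which 99 stands for "missing"; recoding it to NA lets R's NA-aware functions handle it correctly.

```r
# Made-up binary variable in which 99 is used as a missing-value code.
responded <- c(0, 1, 99, 1, 0, 99)

# Left as-is, the 99s are treated as real numbers and distort the mean.
print(mean(responded))

# Recode the out-of-range value to NA.
responded[responded %in% 99] <- NA

# is.na() identifies missing values; na.rm = TRUE skips them.
n_missing <- sum(is.na(responded))
clean_mean <- mean(responded, na.rm = TRUE)
print(n_missing)
print(clean_mean)
```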

Using the help documentation – ?read.csv – see if you can figure out how to update your data import – sq_data <- read.csv('data/2018-cp_squirrel_census.csv', header = TRUE) – to convert a list of strings to NA values.

sq_data <- read.csv('data/2018-cp_squirrel_census.csv', header = TRUE, na.strings = c("", " ", "NA", "NULL", ".", "+")) # treat each of these strings as NA on import
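The effect of na.strings is easier to see on a small example. The sketch below reads a made-up four-row csv from a string, once with the defaults and once with a list of missing-value strings, and counts the NA values in each column. The column names and values are illustrative and not taken from the squirrel data.

```r
# A made-up csv held in a string, so the example needs no file.
csv_text <- paste(
  "id,age,color",
  "1,Adult,Gray",
  "2,,NULL",
  "3,.,Cinnamon",
  "4,NA,Gray",
  sep = "\n"
)

# Default import: only the string "NA" is treated as missing.
toy_default <- read.csv(text = csv_text, header = TRUE)

# Import with a list of strings that should all become NA.
toy_clean <- read.csv(text = csv_text, header = TRUE,
                      na.strings = c("", "NA", "NULL", "."))

# Count missing values per column in each version.
print(colSums(is.na(toy_default)))
print(colSums(is.na(toy_clean)))
```

Running colSums(is.na(sq_data)) gives the same kind of per-column count for the full data set.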

Review the data structure again…

str(sq_data)
## 'data.frame':    3023 obs. of  31 variables:
##  $ X                                         : num  -74 -74 -74 -74 -74 ...
##  $ Y                                         : num  40.8 40.8 40.8 40.8 40.8 ...
##  $ Unique.Squirrel.ID                        : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
##  $ Hectare                                   : chr  "37F" "21B" "11B" "32E" ...
##  $ Shift                                     : chr  "PM" "AM" "PM" "PM" ...
##  $ Date                                      : int  10142018 10192018 10142018 10172018 10172018 10102018 10102018 10082018 10062018 10102018 ...
##  $ Hectare.Squirrel.Number                   : int  3 4 8 14 5 3 2 2 1 3 ...
##  $ Age                                       : chr  NA NA NA "Adult" ...
##  $ Primary.Fur.Color                         : chr  NA NA "Gray" "Gray" ...
##  $ Highlight.Fur.Color                       : chr  NA NA NA NA ...
##  $ Combination.of.Primary.and.Highlight.Color: chr  NA NA "Gray+" "Gray+" ...
##  $ Color.notes                               : chr  NA NA NA "Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments." ...
##  $ Location                                  : chr  NA NA "Above Ground" NA ...
##  $ Above.Ground.Sighter.Measurement          : int  NA NA 10 NA NA NA 0 0 0 30 ...
##  $ Specific.Location                         : chr  NA NA NA NA ...
##  $ Running                                   : chr  "false" "false" "false" "false" ...
##  $ Chasing                                   : chr  "false" "false" "true" "false" ...
##  $ Climbing                                  : chr  "false" "false" "false" "false" ...
##  $ Eating                                    : chr  "false" "false" "false" "true" ...
##  $ Foraging                                  : chr  "false" "false" "false" "true" ...
##  $ Other.Activities                          : chr  NA NA NA NA ...
##  $ Kuks                                      : chr  "false" "false" "false" "false" ...
##  $ Quaas                                     : chr  "false" "false" "false" "false" ...
##  $ Moans                                     : chr  "false" "false" "false" "false" ...
##  $ Tail.flags                                : chr  "false" "false" "false" "false" ...
##  $ Tail.twitches                             : chr  "false" "false" "false" "false" ...
##  $ Approaches                                : chr  "false" "false" "false" "false" ...
##  $ Indifferent                               : chr  "false" "false" "false" "false" ...
##  $ Runs.from                                 : chr  "false" "false" "false" "true" ...
##  $ Other.Interactions                        : chr  NA NA NA NA ...
##  $ Lat.Long                                  : chr  "POINT (-73.9561344937861 40.7940823884086)" "POINT (-73.9688574691102 40.7837825208444)" "POINT (-73.97428114848522 40.775533619083)" "POINT (-73.9596413903948 40.7903128889029)" ...

We’ll see a number of other things that we need to sort out before we can do much with our data. For example, many of our variables have not been assigned the appropriate data type: the logical variables are stored as character strings, and the date variable is stored as an integer.

Changing Data Types

We can isolate or target a single variable in our data set using the dollar sign $.

sq_data$Running
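The dollar sign is one of several ways to pull a column out of a data frame. As a brief sketch on a made-up data frame (toy and col_name are illustrative names), double square brackets return the same vector and also accept a column name stored in a variable.

```r
# A tiny made-up data frame standing in for sq_data.
toy <- data.frame(
  Running = c("false", "true", "false"),
  Shift = c("PM", "AM", "PM"),
  stringsAsFactors = FALSE
)

by_dollar <- toy$Running       # column by name with $
by_name <- toy[["Running"]]    # the same column with [[ ]]

# [[ ]] also works when the column name is held in a variable.
col_name <- "Shift"
by_variable <- toy[[col_name]]

print(identical(by_dollar, by_name))
print(by_variable)
```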

Now that we can target a variable, we can change its properties. R has a family of as. functions that convert between data types and classes. To list all of the as. functions available to you:

apropos("^as\\.")
## [1] "as.array"               "as.array.default"       "as.call"               
## [4] "as.character"           "as.character.condition" "as.character.Date"
## ...

The list includes several as. functions that are useful for our current data set, including as.Date and as.logical. We’ll start with the latter.

sq_data$Running <- as.logical(sq_data$Running) # re-assign the variable Running from character to logical

str(sq_data$Running) # view the results
##  logi [1:3023] FALSE FALSE FALSE FALSE FALSE FALSE ...
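Converting one column at a time gets repetitive when many columns need the same treatment. A minimal sketch of converting several columns after import, using lapply() over a vector of column names, is shown below on a made-up data frame; toy and flag_cols are illustrative names.

```r
# Made-up data frame with three true/false columns stored as text.
toy <- data.frame(
  Running = c("false", "true"),
  Chasing = c("true", "false"),
  Climbing = c("false", "false"),
  Shift = c("PM", "AM"),
  stringsAsFactors = FALSE
)

# Columns that should be logical.
flag_cols <- c("Running", "Chasing", "Climbing")

# Apply as.logical() to each listed column and assign the results back.
toy[flag_cols] <- lapply(toy[flag_cols], as.logical)

str(toy)  # the three flag columns are now logical; Shift is unchanged
```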

read.csv allows us to assign variable types on import. Revisit the help documentation for read.csv and see if you can update your import line – sq_data <- read.csv('data/2018-cp_squirrel_census.csv', na.strings = c("", " ", "NA", "NULL", ".", "+")) – to also convert the columns ‘Running’, ‘Chasing’, and ‘Climbing’ to logical on import.

sq_data <- read.csv('data/2018-cp_squirrel_census.csv',
                    header = TRUE,
                    na.strings = c("", " ", "NA", "NULL", ".", "+"), # strings to treat as NA
                    colClasses = c("Chasing" = "logical",            # read these columns as logical
                                   "Running" = "logical",
                                   "Climbing" = "logical"))
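The Date column was imported as an integer such as 10142018, which appears to be month, day, then year. The sketch below shows one way to convert such values with as.Date(), assuming that MMDDYYYY layout holds for every row. The vector date_int is a stand-in built from values shown in the str() output above.

```r
# Stand-in for sq_data$Date (values copied from the str() output).
date_int <- c(10142018L, 10192018L, 10062018L)

# Pad to eight digits first: a month below 10 would lose its
# leading zero when the value is stored as an integer.
date_chr <- sprintf("%08d", date_int)

# Parse as month, day, four-digit year.
dates <- as.Date(date_chr, format = "%m%d%Y")

print(dates)
print(class(dates))
```

On the full data set, the same conversion would be assigned back with sq_data$Date <- as.Date(sprintf("%08d", sq_data$Date), format = "%m%d%Y").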

Function      Description
dir.create    Create a directory on your file system.
setwd         Set the working directory. See also getwd to get the current working directory.
read.csv      Read a csv file into R.
str           Display information about the data, including structure, types, and a few values.
View          Open the data set in a spreadsheet-like viewer.
apropos       Search for functions and variables.
as.           A family of functions for converting data types and structures.