If it’s not already loaded, read in the data set:

sq_data <- read.csv(file = 'data/2018-cp_squirrel_census.csv',
                    header = TRUE,
                    na.strings = c("", " ", "NA", "NULL", ".", "+")
                    )

Dates & Factors

In addition to logical variables, our data set also has categorical data and date data. Categorical data in R is called factor data. We’ll convert a single variable to a factor to see how it behaves.

sq_data$Age <- as.factor(sq_data$Age)

str(sq_data$Age)
##  Factor w/ 3 levels "?","Adult","Juvenile": NA NA NA 2 2 2 2 2 2 2 ...
levels(sq_data$Age)
## [1] "?"        "Adult"    "Juvenile"

This process reveals one more string, ?, that should be converted to NA on import. Calling str tells us that we have a factor with three levels, or categories. Calling levels lists those categories.
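As a minimal sketch of the same behavior on made-up values (the vector and the small CSV string below are illustrative stand-ins, not the census data), the following shows how a stray "?" becomes its own level, and how listing it in na.strings on import removes it:

```r
# Toy stand-in for the Age column (values are made up for illustration)
age_raw <- c(NA, "Adult", "Juvenile", "?", "Adult")

age <- as.factor(age_raw)    # levels are sorted alphabetically by default
levels(age)                  # "?" "Adult" "Juvenile"
table(age, useNA = "ifany")  # counts per level, including NA

# Reading similar values with "?" listed in na.strings drops the stray level
toy_csv <- "Age\nAdult\nJuvenile\n?\nAdult"
toy <- read.csv(text = toy_csv,
                na.strings = c("", "NA", "?"),
                stringsAsFactors = TRUE)
levels(toy$Age)              # "Adult" "Juvenile"
```

Handling the "?" at import time means every later step sees a proper NA instead of a third category.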

Sometimes the levels are not in the order we want for ordinal categorical data. We can reassign them.

sq_data$Age <- factor(sq_data$Age, levels = c("Juvenile", "Adult", "?"))

levels(sq_data$Age)
## [1] "Juvenile" "Adult"    "?"
str(sq_data$Age)
##  Factor w/ 3 levels "Juvenile","Adult",..: NA NA NA 2 2 2 2 2 2 2 ...
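The effect of re-declaring levels can be sketched on a small made-up vector. The values below are illustrative, and the ordered = TRUE variant at the end is an optional extra for ordinal data rather than something the steps above use:

```r
age <- factor(c("Adult", "Juvenile", "Adult", "?"))  # made-up values
levels(age)       # alphabetical: "?" "Adult" "Juvenile"

# Re-declare the level order; the labels on each observation do not change
age <- factor(age, levels = c("Juvenile", "Adult", "?"))
levels(age)       # "Juvenile" "Adult" "?"
as.integer(age)   # integer codes follow the new order: 2 1 2 3

# Any value not listed in `levels` becomes NA without a warning
dropped <- factor(c("Adult", "Juvenile"), levels = c("Adult"))
dropped           # Adult <NA>

# For truly ordinal data, ordered = TRUE enables comparisons between levels
age_ord <- factor(c("Adult", "Juvenile"),
                  levels = c("Juvenile", "Adult"),
                  ordered = TRUE)
age_ord < "Adult" # FALSE TRUE
```

Because unlisted values turn into NA silently, it is worth checking levels() before and after reassigning.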

Using the help page for as.Date, convert the Date variable, currently structured as int, to a Date class variable.

To parse a date with a format string, as.Date needs character input, so we must first convert our variable to character format.

sq_data$Date <- as.character(sq_data$Date)

Next, we need to tell as.Date how our data is structured: the order in which the month, day, and year appear, whether the year has 4 digits or 2, and whether months and days are abbreviated.

sq_data$Date <- as.Date(sq_data$Date, format = "%m%d%Y")
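To make the format string concrete, here is a small sketch on two made-up integer values shaped like the Date column. The last part, padding with sprintf, is an assumption about a situation that could arise with single-digit months, not something the sample values shown in this lesson require:

```r
date_int <- c(10142018L, 10192018L)  # illustrative values, stored as integers
date_chr <- as.character(date_int)

# %m = two-digit month, %d = two-digit day, %Y = four-digit year
dates <- as.Date(date_chr, format = "%m%d%Y")
dates         # "2018-10-14" "2018-10-19"
class(dates)  # "Date"

# A format that does not match the input returns NA rather than an error
bad <- as.Date("10142018", format = "%Y-%m-%d")
bad           # NA

# An integer column drops leading zeros (09142018 is stored as 9142018);
# padding back to 8 digits restores a parseable string
padded <- sprintf("%08d", 9142018L)
as.Date(padded, format = "%m%d%Y")  # "2018-09-14"
```

Checking for unexpected NA values after the conversion, for example with sum(is.na(dates)), is a quick way to catch a mismatched format.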

Importing with a Data Dictionary

Having a data dictionary in advance goes a long way toward facilitating this process.

A data dictionary generally describes key attributes of your data: it lists your variables, provides a description of each, and indicates the base data type of each.
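As a rough illustration of what such a dictionary might look like, the sketch below builds a four-row example by hand. The column names variable and description, and the description text, are hypothetical; only data.import.class is a column name used later in this lesson:

```r
# A hand-built, illustrative data dictionary (not the project's actual file)
toy_dict <- data.frame(
  variable = c("Unique.Squirrel.ID", "Date", "Age", "Running"),
  description = c("Identifier for each sighting",
                  "Sighting date, written as MMDDYYYY",
                  "Age category of the squirrel",
                  "Whether the squirrel was seen running"),
  data.import.class = c("character", "character", "factor", "logical"),
  stringsAsFactors = FALSE
)

toy_dict                    # one row per variable in the data set
toy_dict$data.import.class  # the column of classes, in variable order
```

Keeping one row per variable, in the same order as the columns of the data file, lets the class column be passed straight to an import function.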

I have a very simple data dictionary for this project. Let’s load it:

dict_src <- "https://tinyurl.com/4kxt9n8e" # source of the file

download.file(url = dict_src, destfile = 'data/_datadictionary.csv') # download and name the file

data_dict <- read.csv("data/_datadictionary.csv") # read in the file

We’ll now import our data in a slightly more efficient way:

na_values <- c("" , " ", ".", "NA", "NULL", "?", "+") # a list of values to use for na.strings

data_types <- data_dict$data.import.class # a vector of data types to feed to colClasses

sq_data <- read.csv(file = "data/2018-cp_squirrel_census.csv",
                    na.strings = na_values,
                    colClasses = data_types
                    )

str(sq_data)
## 'data.frame':    3023 obs. of  31 variables:
##  $ X                                         : num  -74 -74 -74 -74 -74 ...
##  $ Y                                         : num  40.8 40.8 40.8 40.8 40.8 ...
##  $ Unique.Squirrel.ID                        : chr  "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
##  $ Hectare                                   : chr  "37F" "21B" "11B" "32E" ...
##  $ Shift                                     : chr  "PM" "AM" "PM" "PM" ...
##  $ Date                                      : chr  "10142018" "10192018" "10142018" "10172018" ...
##  $ Hectare.Squirrel.Number                   : num  3 4 8 14 5 3 2 2 1 3 ...
##  $ Age                                       : Factor w/ 2 levels "Adult","Juvenile": NA NA NA 1 1 1 1 1 1 1 ...
##  $ Primary.Fur.Color                         : Factor w/ 3 levels "Black","Cinnamon",..: NA NA 3 3 3 2 3 3 3 3 ...
##  $ Highlight.Fur.Color                       : Factor w/ 10 levels "Black","Black, Cinnamon",..: NA NA NA NA 5 10 NA NA NA 5 ...
##  $ Combination.of.Primary.and.Highlight.Color: Factor w/ 21 levels "Black+","Black+Cinnamon",..: NA NA 14 14 19 13 14 14 14 19 ...
##  $ Color.notes                               : chr  NA NA NA "Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments." ...
##  $ Location                                  : Factor w/ 2 levels "Above Ground",..: NA NA 1 NA 1 NA 2 2 2 1 ...
##  $ Above.Ground.Sighter.Measurement          : num  NA NA 10 NA NA NA 0 0 0 30 ...
##  $ Specific.Location                         : chr  NA NA NA NA ...
##  $ Running                                   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Chasing                                   : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ Climbing                                  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Eating                                    : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ Foraging                                  : logi  FALSE FALSE FALSE TRUE TRUE TRUE ...
##  $ Other.Activities                          : chr  NA NA NA NA ...
##  $ Kuks                                      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Quaas                                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Moans                                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Tail.flags                                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Tail.twitches                             : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ Approaches                                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Indifferent                               : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ Runs.from                                 : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ Other.Interactions                        : chr  NA NA NA NA ...
##  $ Lat.Long                                  : chr  "POINT (-73.9561344937861 40.7940823884086)" "POINT (-73.9688574691102 40.7837825208444)" "POINT (-73.97428114848522 40.775533619083)" "POINT (-73.9596413903948 40.7903128889029)" ...
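The same import pattern can be tried on a tiny inline CSV, which makes it easier to see what na.strings and colClasses each contribute. The column names and values below are made up for illustration:

```r
# A small made-up CSV, passed as text instead of a file path
toy_csv <- paste(
  "ID,Date,Age,Running",
  "A1,10142018,Adult,FALSE",
  "A2,10192018,?,TRUE",
  "A3,10142018,Juvenile,FALSE",
  sep = "\n"
)

# One class per column, in column order (an unnamed vector is matched by position)
toy_types <- c("character", "character", "factor", "logical")

toy <- read.csv(text = toy_csv,
                na.strings = c("", "NA", "?"),
                colClasses = toy_types)

str(toy)
levels(toy$Age)  # "Adult" "Juvenile"; the "?" was read as NA
toy$Date         # kept as character, so the digits are stored exactly as written
```

Reading Date as character rather than letting read.csv guess an integer keeps the original digits intact for a later as.Date conversion.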

At this point, we’ve got a pretty clean imported data set. It’s likely we’ll still find some issues with it, but for now let’s save it as a cleaned version 0:

write.csv(sq_data, "data/2018-cp_squirrel_census_cleaned_v0.csv", row.names = FALSE)
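One thing to keep in mind is that a CSV file stores only text, so column classes such as factor are not saved with it. The sketch below, which uses a made-up data frame and a temporary file, suggests that the same colClasses would need to be supplied again when the cleaned file is read back in:

```r
# Made-up data frame standing in for the cleaned data
toy <- data.frame(Age = factor(c("Adult", NA, "Juvenile")),
                  Running = c(TRUE, FALSE, TRUE))

tmp <- tempfile(fileext = ".csv")  # temporary path, for illustration only
write.csv(toy, tmp, row.names = FALSE)

# Without colClasses, read.csv guesses the types again
reread_default <- read.csv(tmp)
class(reread_default$Age)  # "character" under R 4.0 or later defaults

# Supplying the classes restores the intended structure
reread_typed <- read.csv(tmp, colClasses = c("factor", "logical"))
class(reread_typed$Age)    # "factor"
```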

On import, we can control many aspects of our data set, including setting NA values and assigning data types and classes. Being organized in advance makes this work easier, and creating a data dictionary is one way to do that.