If it’s not already loaded, read in the data set
sq_data <- read.csv(file = 'data/2018-cp_squirrel_census.csv',
                    header = TRUE,
                    na.strings = c("", " ", "NA", "NULL", ".", "+")
)
In addition to logical variables, our data set also has categorical data and date data. Categorical data in R is called factor data. We’ll convert a single variable to factor to see how this behaves.
sq_data$Age <- as.factor(sq_data$Age)
str(sq_data$Age)
## Factor w/ 3 levels "?","Adult","Juvenile": NA NA NA 2 2 2 2 2 2 2 ...
levels(sq_data$Age)
## [1] "?" "Adult" "Juvenile"
This process highlights that we have one additional string, ?, that should be converted to NA on import. Calling str tells us that we have a factor with three levels or categories. Calling levels lists those categories.
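As a minimal sketch of the point above, the following uses a made-up character vector (not the census data) to show how a stray "?" becomes its own factor level, and one way to recode it to NA after the fact. The names age and age_fct and their values are assumptions for illustration only.

```r
# Toy vector standing in for an Age column (values invented for illustration)
age <- c("Adult", "?", "Juvenile", NA, "Adult")

age_fct <- as.factor(age)
levels(age_fct)   # "?" shows up as a level of its own

# Recode "?" to NA; %in% avoids NA values in the logical index
age_fct[age_fct %in% "?"] <- NA

# Drop the level that is no longer used
age_fct <- droplevels(age_fct)
levels(age_fct)
sum(is.na(age_fct))
```

Handling the value through na.strings at import time, as done later in this lesson, avoids this extra recoding step.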
Sometimes the levels are not in the order we want for ordinal categorical data. We can reassign them.
sq_data$Age <- factor(sq_data$Age, levels = c("Juvenile", "Adult", "?"))
levels(sq_data$Age)
## [1] "Juvenile" "Adult" "?"
str(sq_data$Age)
## Factor w/ 3 levels "Juvenile","Adult",..: NA NA NA 2 2 2 2 2 2 2 ...
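To see why level order matters, here is a small sketch on invented values (not the census data). It compares the default alphabetical ordering with an explicit ordering, and shows that adding ordered = TRUE, which the lesson code above does not use, also allows comparisons between levels.

```r
# Toy values invented for illustration
age <- c("Adult", "Juvenile", "Adult", "Adult")

age_default <- factor(age)                                   # alphabetical levels
age_releveled <- factor(age, levels = c("Juvenile", "Adult")) # explicit order

levels(age_default)
levels(age_releveled)

# The underlying integer codes follow the level order
as.integer(age_default)    # 1 2 1 1
as.integer(age_releveled)  # 2 1 2 2

# An ordered factor additionally supports < and > between levels
age_ord <- factor(age, levels = c("Juvenile", "Adult"), ordered = TRUE)
age_ord < "Adult"
```

Level order also determines the order of categories in summaries such as table() and in most plots.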
Using the help page for as.Date, convert the Date variable, currently structured as int, to a date class variable.
To parse a date using a format string, as.Date needs a character input, so we must first convert our variable to a character format.
sq_data$Date <- as.character(sq_data$Date)
Next, we need to tell as.Date how our data is structured: where the month, day, and year appear, whether the year is 4 digits or 2, and whether the months and days are abbreviated.
sq_data$Date <- as.Date(sq_data$Date, "%m%d%Y")
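The conversion above can be tried on a couple of standalone values. This is a sketch on toy integers in the same MMDDYYYY layout, with %m for a two-digit month, %d for a two-digit day, and %Y for a four-digit year. The last step is an assumption that goes beyond the lesson: if a data set had months before October, integer storage would drop the leading zero, and zero-padding with sprintf is one way to restore it before parsing.

```r
# Toy values in MMDDYYYY layout, stored as integers
date_int <- c(10142018L, 10192018L)

date_chr <- as.character(date_int)
parsed <- as.Date(date_chr, format = "%m%d%Y")

class(parsed)
format(parsed, "%Y-%m-%d")   # "2018-10-14" "2018-10-19"

# Hypothetical case: January 5, 2018 stored as an integer loses its leading zero
jan_int <- 1052018L
jan_chr <- sprintf("%08d", jan_int)   # pad back to 8 characters: "01052018"
jan_date <- as.Date(jan_chr, format = "%m%d%Y")
format(jan_date, "%Y-%m-%d")          # "2018-01-05"
```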
Having a data dictionary in advance goes a long way toward facilitating this process.
A data dictionary generally describes key attributes of your data: it lists your variables, provides a description of each, and indicates each variable's base data type.
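To make that structure concrete, here is a hypothetical four-row dictionary built by hand. Only the column name data.import.class comes from this lesson; the variable and description column names and the description text are invented for illustration, and the real dictionary file may be laid out differently.

```r
# Hypothetical data dictionary: one row per variable in the data set
toy_dict <- data.frame(
  variable = c("Unique.Squirrel.ID", "Date", "Age", "Running"),
  description = c("Identifier for the sighting (illustrative text)",
                  "Sighting date as MMDDYYYY (illustrative text)",
                  "Age category (illustrative text)",
                  "Whether running was observed (illustrative text)"),
  data.import.class = c("character", "character", "factor", "logical"),
  stringsAsFactors = FALSE
)

# A named vector of classes, one entry per variable
import_classes <- setNames(toy_dict$data.import.class, toy_dict$variable)
import_classes
```

A vector like import_classes is the kind of value that can be passed to the colClasses argument of read.csv.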
I have a very simple data dictionary for this project. Let’s load that in
dict_src <- "https://tinyurl.com/4kxt9n8e" # source of the file
download.file(url = dict_src, destfile = 'data/_datadictionary.csv') # download and name the file
data_dict <- read.csv("data/_datadictionary.csv") # read in the file
We’ll now import our data in a slightly more efficient way
na_values <- c("" , " ", ".", "NA", "NULL", "?", "+") # a vector of values to use for na.strings
data_types <- data_dict$data.import.class # a vector of data types to feed to colClasses
sq_data <- read.csv(file = "data/2018-cp_squirrel_census.csv",
                    na.strings = na_values,
                    colClasses = data_types
)
str(sq_data)
## 'data.frame': 3023 obs. of 31 variables:
## $ X : num -74 -74 -74 -74 -74 ...
## $ Y : num 40.8 40.8 40.8 40.8 40.8 ...
## $ Unique.Squirrel.ID : chr "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
## $ Hectare : chr "37F" "21B" "11B" "32E" ...
## $ Shift : chr "PM" "AM" "PM" "PM" ...
## $ Date : chr "10142018" "10192018" "10142018" "10172018" ...
## $ Hectare.Squirrel.Number : num 3 4 8 14 5 3 2 2 1 3 ...
## $ Age : Factor w/ 2 levels "Adult","Juvenile": NA NA NA 1 1 1 1 1 1 1 ...
## $ Primary.Fur.Color : Factor w/ 3 levels "Black","Cinnamon",..: NA NA 3 3 3 2 3 3 3 3 ...
## $ Highlight.Fur.Color : Factor w/ 10 levels "Black","Black, Cinnamon",..: NA NA NA NA 5 10 NA NA NA 5 ...
## $ Combination.of.Primary.and.Highlight.Color: Factor w/ 21 levels "Black+","Black+Cinnamon",..: NA NA 14 14 19 13 14 14 14 19 ...
## $ Color.notes : chr NA NA NA "Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments." ...
## $ Location : Factor w/ 2 levels "Above Ground",..: NA NA 1 NA 1 NA 2 2 2 1 ...
## $ Above.Ground.Sighter.Measurement : num NA NA 10 NA NA NA 0 0 0 30 ...
## $ Specific.Location : chr NA NA NA NA ...
## $ Running : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Chasing : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ Climbing : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Eating : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ Foraging : logi FALSE FALSE FALSE TRUE TRUE TRUE ...
## $ Other.Activities : chr NA NA NA NA ...
## $ Kuks : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Quaas : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Moans : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Tail.flags : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Tail.twitches : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ Approaches : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Indifferent : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ Runs.from : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
## $ Other.Interactions : chr NA NA NA NA ...
## $ Lat.Long : chr "POINT (-73.9561344937861 40.7940823884086)" "POINT (-73.9688574691102 40.7837825208444)" "POINT (-73.97428114848522 40.775533619083)" "POINT (-73.9596413903948 40.7903128889029)" ...
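The import pattern above can be tried without any files by reading a small inline CSV through the text argument of read.csv. The rows below are invented, and the four column names are borrowed from the census data purely for illustration.

```r
# Invented rows; column names borrowed from the census data for illustration
csv_text <- "Hectare,Date,Age,Running
37F,10142018,?,FALSE
21B,10192018,Adult,TRUE
11B,10142018,Juvenile,FALSE"

na_values <- c("", " ", ".", "NA", "NULL", "?", "+")
data_types <- c("character", "character", "factor", "logical")

toy <- read.csv(text = csv_text,
                na.strings = na_values,
                colClasses = data_types)

str(toy)
levels(toy$Age)   # "?" was read as NA, so it is not a level
```

Because "?" is treated as missing before the column is converted, the factor ends up with only the two real categories, matching the two-level Age factor in the output above. Date stays character here and can be converted afterward with as.Date and an explicit format.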
At this point, we've got a pretty clean imported data set. It's likely we'll still find some issues with it, but for now let's save it as a cleaned version 0.
write.csv(sq_data, "data/2018-cp_squirrel_census_cleaned_v0.csv", row.names = FALSE)
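One thing to keep in mind when saving to CSV is that the file stores only text, so classes such as factor are not recorded in it. The sketch below shows this on an invented two-column data frame written to a temporary file; the object names are assumptions for illustration.

```r
# Invented data frame with a factor and a logical column
toy <- data.frame(Age = factor(c("Adult", NA, "Juvenile")),
                  Running = c(TRUE, FALSE, NA))

out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)

# Reading back with defaults: the class of Age depends on the R version
# (character in R 4.0 and later, where stringsAsFactors defaults to FALSE)
back_default <- read.csv(out_file)
class(back_default$Age)

# Reading back with explicit classes restores the intended structure
back_typed <- read.csv(out_file, colClasses = c("factor", "logical"))
str(back_typed)
```

This is another reason to keep the data dictionary alongside the cleaned file: the same colClasses values can be reused whenever the file is read back in.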
On import, we can control many aspects of our data set, including setting NA values and assigning data types and classes. This work is much easier when we are organized in advance, and creating a data dictionary is a good way to get there.