We’re going to work with a (hopefully) fun data set today, which we’ll download and then import.
First things first, we’ll use R to set up a place for us to work. We’ll begin by creating a directory on our desktop:
dir.create("~/Desktop/r-for-beginners") # create a project directory
Most systems - macOS, Windows, Linux - will set your home directory to your user directory. The tilde, ~, is shorthand for your home directory. If your home directory is not your user directory, the above will not work.
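If you are unsure what the tilde resolves to on your machine, a quick check along these lines can help. This is a minimal sketch using base R only; the names home and project_dir are illustrative. On some Windows setups R may resolve ~ to the Documents folder rather than the user directory, which is one way the command above can fail.

```r
# path.expand() shows what "~" resolves to on this machine
home <- path.expand("~")
print(home)

# file.path() joins path components with "/" on every platform
project_dir <- file.path(home, "Desktop", "r-for-beginners")
print(project_dir)

# TRUE only if the parent folder (the Desktop) exists where we expect it
print(dir.exists(dirname(project_dir)))
```

If the last line prints FALSE, the Desktop folder is not where R expects it, and the project directory should be created somewhere else.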
Next, we’ll point our current session at this folder, verify where we are, and finally create a directory to hold our data.
setwd("~/Desktop/r-for-beginners") # set the working directory to the directory we just created
getwd() # print current working directory
## [1] "/Users/vdunbar/Desktop/r-for-beginners"
dir.create("data") # create a directory in our working directory called data
## Warning in dir.create("data"): 'data' already exists
list.dirs() # list the directories to verify things.
## [1] "." "./data"
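The warning above appears because the data directory was already there when dir.create() ran. One way to avoid it, sketched here under the assumption that you would rather check first than suppress the warning, is to test with dir.exists(). The example works inside a temporary folder (base_dir and data_dir are illustrative names) so it does not touch your real project.

```r
# work inside a temporary folder so the sketch does not touch real files
base_dir <- file.path(tempdir(), "r-for-beginners-demo")
data_dir <- file.path(base_dir, "data")

# recursive = TRUE also creates base_dir when it is missing
if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}

# the second pass skips dir.create(), so no 'already exists' warning is raised
if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}

print(dir.exists(data_dir))
```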
And then we’ll download our data:
data_src <- "https://tinyurl.com/mu8y9n29" # define the source of our data, in this case, a url
download.file(url = data_src, destfile = 'data/2018-cp_squirrel_census.csv') # download the file, and save it with a specified name
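Re-running the script downloads the file again each time. A small guard can skip the download when a local copy already exists. The helper below, download_if_missing(), is a hypothetical name and not part of base R; it is only a sketch of the idea. The demonstration uses a temporary file that already exists, so no network access is needed.

```r
# download_if_missing() is a hypothetical helper, not a base R function
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("Using existing copy: ", destfile)
    return(invisible(FALSE))  # FALSE means nothing was downloaded
  }
  # mode = "wb" writes the file byte for byte, which matters on Windows
  download.file(url = url, destfile = destfile, mode = "wb")
  invisible(TRUE)             # TRUE means a fresh copy was downloaded
}

# demonstration with a file that already exists, so the download is skipped
demo_file <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), demo_file)
downloaded <- download_if_missing("https://tinyurl.com/mu8y9n29", demo_file)
print(downloaded)
```

In the workshop script the call would take data_src and 'data/2018-cp_squirrel_census.csv' as its two arguments.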
Now that our data is downloaded and we have a local copy, we’ll pull it into R.
We can simply import the data and display it in our console:
read.csv('data/2018-cp_squirrel_census.csv', header = TRUE) # read the data
But this isn’t terribly useful. We want to store this data in memory, so we’ll load it into a variable:
sq_data <- read.csv('data/2018-cp_squirrel_census.csv', header = TRUE) # read the data into a variable called sq_data
We can now explore the data, looking at the data types and data structures R defaults to when importing this csv file.
str(sq_data) # explore the structure of the data set.
## 'data.frame': 3023 obs. of 31 variables:
## $ X : num -74 -74 -74 -74 -74 ...
## $ Y : num 40.8 40.8 40.8 40.8 40.8 ...
## $ Unique.Squirrel.ID : chr "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
## $ Hectare : chr "37F" "21B" "11B" "32E" ...
## $ Shift : chr "PM" "AM" "PM" "PM" ...
## $ Date : int 10142018 10192018 10142018 10172018 10172018 10102018 10102018 10082018 10062018 10102018 ...
## $ Hectare.Squirrel.Number : int 3 4 8 14 5 3 2 2 1 3 ...
## $ Age : chr "" "" "" "Adult" ...
## $ Primary.Fur.Color : chr "" "" "Gray" "Gray" ...
## $ Highlight.Fur.Color : chr "" "" "" "" ...
## $ Combination.of.Primary.and.Highlight.Color: chr "+" "+" "Gray+" "Gray+" ...
## $ Color.notes : chr "" "" "" "Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments." ...
## $ Location : chr "" "" "Above Ground" "" ...
## $ Above.Ground.Sighter.Measurement : int NA NA 10 NA NA NA 0 0 0 30 ...
## $ Specific.Location : chr "" "" "" "" ...
## $ Running : chr "false" "false" "false" "false" ...
## $ Chasing : chr "false" "false" "true" "false" ...
## $ Climbing : chr "false" "false" "false" "false" ...
## $ Eating : chr "false" "false" "false" "true" ...
## $ Foraging : chr "false" "false" "false" "true" ...
## $ Other.Activities : chr "" "" "" "" ...
## $ Kuks : chr "false" "false" "false" "false" ...
## $ Quaas : chr "false" "false" "false" "false" ...
## $ Moans : chr "false" "false" "false" "false" ...
## $ Tail.flags : chr "false" "false" "false" "false" ...
## $ Tail.twitches : chr "false" "false" "false" "false" ...
## $ Approaches : chr "false" "false" "false" "false" ...
## $ Indifferent : chr "false" "false" "false" "false" ...
## $ Runs.from : chr "false" "false" "false" "true" ...
## $ Other.Interactions : chr "" "" "" "" ...
## $ Lat.Long : chr "POINT (-73.9561344937861 40.7940823884086)" "POINT (-73.9688574691102 40.7837825208444)" "POINT (-73.97428114848522 40.775533619083)" "POINT (-73.9596413903948 40.7903128889029)" ...
We can get slightly better access to the data itself with View():
View(sq_data) # look at the data in a 'spreadsheet' like format. Note the capital V.
The str() output tells us that we’re working with a data frame, that there are 3023 rows (observations) and 31 columns (variables). It then lists the variables, tells us what data type each was interpreted as on import, and shows us the first few values of each variable.
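When you only need one of these facts, a few smaller base R functions report them individually. The sketch below uses a tiny made-up data frame called toy, with column names borrowed from the census file, so that it runs without the download; the same calls work on sq_data.

```r
# a small stand-in for sq_data; the values are made up for illustration
toy <- data.frame(
  Hectare = c("37F", "21B", "11B"),
  Shift = c("PM", "AM", "PM"),
  Hectare.Squirrel.Number = c(3L, 4L, 8L)
)

dim(toy)          # number of rows, then number of columns
names(toy)        # the variable names
head(toy, n = 2)  # the first two rows
```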
Before digging much deeper into the data set, one of the first things we’ll note is that there are a lot of missing values. Missing values need to be properly encoded to be programmatically useful. An application, or a user, may represent missing values in many ways. Sometimes it’s by using an out-of-range value: when a variable is binary and values are represented as 0 or 1, 99 may be used to indicate a missing value. Files exported from SPSS might encode missing values with a period, ".". In Excel, it’s extremely easy for a user to accidentally introduce spaces, " ", in an otherwise empty cell, or they may choose to write the characters NA. A database export might include either NA or NULL.
When exported, especially using a format like csv, all of these notations for missing values are converted to character strings or numbers. Ideally, when importing into a piece of software, like R, we would have a way to provide a list of possible ways of encoding missing values and standardize how these are presented.
In R, NA values are a specific way of indicating that a value is missing. And read.csv() has a specific argument for converting strings into NA encoded values on import.
Using the help documentation, ?read.csv, see if you can figure out how to update your data import,
sq_data <- read.csv('data/2018-cp_squirrel_census.csv', header = TRUE)
so that it converts a list of strings to NA values.
sq_data <- read.csv('data/2018-cp_squirrel_census.csv', header = TRUE, na.strings = c("", " ", "NA", "NULL", ".", "+"))
Review the data structure again…
str(sq_data)
## 'data.frame': 3023 obs. of 31 variables:
## $ X : num -74 -74 -74 -74 -74 ...
## $ Y : num 40.8 40.8 40.8 40.8 40.8 ...
## $ Unique.Squirrel.ID : chr "37F-PM-1014-03" "21B-AM-1019-04" "11B-PM-1014-08" "32E-PM-1017-14" ...
## $ Hectare : chr "37F" "21B" "11B" "32E" ...
## $ Shift : chr "PM" "AM" "PM" "PM" ...
## $ Date : int 10142018 10192018 10142018 10172018 10172018 10102018 10102018 10082018 10062018 10102018 ...
## $ Hectare.Squirrel.Number : int 3 4 8 14 5 3 2 2 1 3 ...
## $ Age : chr NA NA NA "Adult" ...
## $ Primary.Fur.Color : chr NA NA "Gray" "Gray" ...
## $ Highlight.Fur.Color : chr NA NA NA NA ...
## $ Combination.of.Primary.and.Highlight.Color: chr NA NA "Gray+" "Gray+" ...
## $ Color.notes : chr NA NA NA "Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments." ...
## $ Location : chr NA NA "Above Ground" NA ...
## $ Above.Ground.Sighter.Measurement : int NA NA 10 NA NA NA 0 0 0 30 ...
## $ Specific.Location : chr NA NA NA NA ...
## $ Running : chr "false" "false" "false" "false" ...
## $ Chasing : chr "false" "false" "true" "false" ...
## $ Climbing : chr "false" "false" "false" "false" ...
## $ Eating : chr "false" "false" "false" "true" ...
## $ Foraging : chr "false" "false" "false" "true" ...
## $ Other.Activities : chr NA NA NA NA ...
## $ Kuks : chr "false" "false" "false" "false" ...
## $ Quaas : chr "false" "false" "false" "false" ...
## $ Moans : chr "false" "false" "false" "false" ...
## $ Tail.flags : chr "false" "false" "false" "false" ...
## $ Tail.twitches : chr "false" "false" "false" "false" ...
## $ Approaches : chr "false" "false" "false" "false" ...
## $ Indifferent : chr "false" "false" "false" "false" ...
## $ Runs.from : chr "false" "false" "false" "true" ...
## $ Other.Interactions : chr NA NA NA NA ...
## $ Lat.Long : chr "POINT (-73.9561344937861 40.7940823884086)" "POINT (-73.9688574691102 40.7837825208444)" "POINT (-73.97428114848522 40.775533619083)" "POINT (-73.9596413903948 40.7903128889029)" ...
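To see the effect of na.strings in isolation, it can help to compare the two imports on a tiny example. The sketch below reads a made-up three-row csv from a string with read.csv(text = ...); csv_text, raw, and clean are illustrative names. It counts the NA values per column with colSums(is.na(...)).

```r
# a made-up csv with three different spellings of 'missing'
csv_text <- "Age,Primary.Fur.Color,Shift
,Gray,PM
Adult, ,AM
NULL,Cinnamon,PM"

# default import: empty strings, spaces, and NULL stay as ordinary text
raw <- read.csv(text = csv_text, header = TRUE)

# import with na.strings: the listed strings become NA
clean <- read.csv(text = csv_text, header = TRUE,
                  na.strings = c("", " ", "NA", "NULL"))

colSums(is.na(raw))    # NA count per column before
colSums(is.na(clean))  # NA count per column after
```

The same colSums(is.na(sq_data)) call gives a per-variable count of missing values for the squirrel census.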
We’ll see a number of other things that we need to sort out before we can do much with our data. For example, many of our variables have not been assigned the appropriate data type. We have logical variables, date variables, etc. that are not yet sorted.
We can isolate or target a single variable in our data set using the dollar sign, $.
sq_data$Running
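The dollar sign is not the only way to reach a column. Double square brackets do the same job and also accept a column name stored in a variable, which the dollar sign does not. A short sketch on a made-up data frame (toy and col are illustrative names):

```r
# a small stand-in for sq_data; the values are made up for illustration
toy <- data.frame(Running = c("false", "true"), Shift = c("PM", "AM"))

toy$Running       # target the column by name with $
toy[["Running"]]  # the same column with double square brackets

# [[ ]] also works when the column name is held in a variable
col <- "Shift"
toy[[col]]

# both forms return the same vector
identical(toy$Running, toy[["Running"]])
```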
Now that we can target a variable, we can assign new properties to that variable. R has a series of as. functions to allow us to manipulate data types and classes. To view all as. functions available to you:
apropos("^as\\.")
## [1] "as.array" "as.array.default" "as.call"
## [4] "as.character" "as.character.condition" "as.character.Date"
## ...
We can see in the list that we have several useful as. functions for our current data set, including as.Date and as.logical. We’ll start with the latter.
sq_data$Running <- as.logical(sq_data$Running) # re-assign the variable Running from character to logical
str(sq_data$Running) # view the results
## logi [1:3023] FALSE FALSE FALSE FALSE FALSE FALSE ...
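The same idea extends to several columns at once, and to the Date variable mentioned earlier. The sketch below is one possible approach on a made-up data frame, not necessarily how the workshop continues. It assumes the dates are stored as integers in month-day-year order, as the str() output suggests, and pads them to eight digits in case a leading zero was dropped on import.

```r
# a small stand-in for sq_data; the values are made up for illustration
toy <- data.frame(
  Date = c(10142018L, 10192018L),
  Running = c("false", "true"),
  Chasing = c("true", "false")
)

# convert several character columns to logical in one step
activity_cols <- c("Running", "Chasing")
toy[activity_cols] <- lapply(toy[activity_cols], as.logical)

# pad to 8 digits, then parse as month, day, four-digit year
toy$Date <- as.Date(sprintf("%08d", toy$Date), format = "%m%d%Y")

str(toy)
```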
read.csv allows us to assign variable types on import. Revisit the help documentation for read.csv and see if you can update your import line,
sq_data <- read.csv('data/2018-cp_squirrel_census.csv', na.strings = c("", " ", "NA", "NULL", ".", "+"))
so that it also converts the columns ‘Running’, ‘Chasing’, and ‘Climbing’ to logical on import.
sq_data <- read.csv('data/2018-cp_squirrel_census.csv',
                    header = TRUE,
                    na.strings = c("", " ", "NA", "NULL", ".", "+"),
                    colClasses = c("Chasing" = "logical",
                                   "Running" = "logical",
                                   "Climbing" = "logical")
)
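A quick way to confirm that colClasses did what we asked is to check the class of each column after import. The sketch below does this on a made-up two-row csv read from a string; csv_text and toy are illustrative names. Columns not named in colClasses are left for R to guess.

```r
# a made-up csv with the three activity columns plus one other column
csv_text <- "Running,Chasing,Climbing,Shift
false,true,false,PM
true,false,false,AM"

toy <- read.csv(text = csv_text, header = TRUE,
                colClasses = c("Running" = "logical",
                               "Chasing" = "logical",
                               "Climbing" = "logical"))

# report the class of every column
sapply(toy, class)
```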
Function | Description |
---|---|
dir.create | create a directory on your file system. |
setwd | set the working directory. See also getwd to get the current working directory. |
read.csv | read a csv file into R. |
str | display information about the data including structure, types, and a few values. |
View | open the data set in a spreadsheet-like viewer. |
apropos | search for functions and variables. |
as. | a family of functions for converting data types and structures. |