If we’re only dealing with variables of a single data type, matrices are great; in fact, they're very efficient if we only have one data type. But most of us are probably used to working with data that will have some variables that are numerical, some that are categorical, and some that are character based.
Consider the following table of data recording the number of times a presidential candidate used the words 'will', 'shall', and 'going to' in the presidential debates, and whether or not they won or lost the popular vote:
Year | Candidate | Won (W) or Lost (L) the Popular Vote | Number of 'will', 'shall', 'going to' |
---|---|---|---|
1960 | Kennedy | W | 163 |
1960 | Nixon | L | 122 |
1976 | Carter | W | 68 |
1976 | Ford | L | 32 |
1980 | Reagan | W | 19 |
1980 | Carter | L | 18 |
… | … | … | … |
Luckily, R
has a way for handling this spread sheet like data, called a dataframe.
Let’s create the above as a dataframe using the data.frame()
function:
presidentialElection <- data.frame(
Year = c(1960, 1960, 1976, 1976, 1980, 1980),
Candidate = c("Kennedy", "Nixon", "Carter", "Ford", "Reagan", "Carter"),
PopularVote = c("W", "L", "W", "L", "W", "L"),
TermCount = c(163, 122, 68, 32, 19, 18)
)
First, let’s take a look at our dataframe:
View(presidentialElection)
We can target a specific variable–or vector–in a dataframe with the $
symbol using the following formula:
dataframeName$variableName
For example, if we want to see the first few entries of the Candidate
data we would enter:
head(presidentialElection$Candidate)
## [1] "Kennedy" "Nixon" "Carter" "Ford" "Reagan" "Carter"
Since dataframes are made up of a series of vectors of equal length, we can query any column as we would any vector.
We can, for example, ask about the data type or even change the data type of a given variable. For example, we can ask about PopularVote
typeof(presidentialElection$PopularVote)
## [1] "character"
Which we see is character
. But we know this data would probably be more useful if interpreted as categorical data. We can adjust for that
presidentialElection$PopularVote <- as.factor(presidentialElection$PopularVote)
typeof(presidentialElection$PopularVote)
## [1] "integer"
levels(presidentialElection$PopularVote)
## [1] "L" "W"
Since dataframes are indexed and the structure is dimensionally similar to a matrix, we can also inquire about them like a matrix. So we can view a single row or a single data point using either an index point or a label:
presidentialElection[2,]
## Year Candidate PopularVote TermCount
## 2 1960 Nixon L 122
presidentialElection[2,2]
## [1] "Nixon"
presidentialElection[2,"Candidate"]
## [1] "Nixon"
While we can run the usual questions against our dataframe, such as nrow()
for get a sense of how many observations are in the dataset and summary()
for some basic computed stats, this is where str()
can be particularly useful
str(presidentialElection)
## 'data.frame': 6 obs. of 4 variables:
## $ Year : num 1960 1960 1976 1976 1980 ...
## $ Candidate : chr "Kennedy" "Nixon" "Carter" "Ford" ...
## $ PopularVote: Factor w/ 2 levels "L","W": 2 1 2 1 2 1
## $ TermCount : num 163 122 68 32 19 18
The take away here is that, just as we can do things to vectors and matrices, we can do similar things to dataframes which contain a mix of data types.