Spreadsheet Data

If we’re only dealing with variables of a single data type, matrices are great; in fact, they're very efficient if we only have one data type. But most of us are probably used to working with data that will have some variables that are numerical, some that are categorical, and some that are character based.

Consider the following table of data recording the number of times a presidential candidate used the words 'will', 'shall', and 'going to' in the presidential debates, and whether or not they won or lost the popular vote:

Year Candidate Won (W) or Lost (L) the Popular Vote Number of 'will', 'shall', 'going to'
1960 Kennedy W 163
1960 Nixon L 122
1976 Carter W 68
1976 Ford L 32
1980 Reagan W 19
1980 Carter L 18

Luckily, R has a way for handling this spread sheet like data, called a dataframe.

Creating a Dataframe

Let’s create the above as a dataframe using the data.frame() function:

presidentialElection <- data.frame(
    Year = c(1960, 1960, 1976, 1976, 1980, 1980),
    Candidate = c("Kennedy", "Nixon", "Carter", "Ford", "Reagan", "Carter"),
    PopularVote = c("W", "L", "W", "L", "W", "L"),
    TermCount = c(163, 122, 68, 32, 19, 18)
)

First, let’s take a look at our dataframe:

View(presidentialElection)

Targeting a Variable

We can target a specific variable–or vector–in a dataframe with the $ symbol using the following formula:

dataframeName$variableName

For example, if we want to see the first few entries of the Candidate data we would enter:

head(presidentialElection$Candidate)
## [1] "Kennedy" "Nixon"   "Carter"  "Ford"    "Reagan"  "Carter"

Querying a Dataframe

Since dataframes are made up of a series of vectors of equal length, we can query any column as we would any vector.

We can, for example, ask about the data type or even change the data type of a given variable. For example, we can ask about PopularVote

typeof(presidentialElection$PopularVote)
## [1] "character"

Which we see is character. But we know this data would probably be more useful if interpreted as categorical data. We can adjust for that

presidentialElection$PopularVote <- as.factor(presidentialElection$PopularVote)

typeof(presidentialElection$PopularVote)
## [1] "integer"
levels(presidentialElection$PopularVote)
## [1] "L" "W"

Indexing

Since dataframes are indexed and the structure is dimensionally similar to a matrix, we can also inquire about them like a matrix. So we can view a single row or a single data point using either an index point or a label:

presidentialElection[2,]
##   Year Candidate PopularVote TermCount
## 2 1960     Nixon           L       122
presidentialElection[2,2]
## [1] "Nixon"
presidentialElection[2,"Candidate"]
## [1] "Nixon"

Questions

While we can run the usual questions against our dataframe, such as nrow() for get a sense of how many observations are in the dataset and summary() for some basic computed stats, this is where str() can be particularly useful

str(presidentialElection)
## 'data.frame':    6 obs. of  4 variables:
##  $ Year       : num  1960 1960 1976 1976 1980 ...
##  $ Candidate  : chr  "Kennedy" "Nixon" "Carter" "Ford" ...
##  $ PopularVote: Factor w/ 2 levels "L","W": 2 1 2 1 2 1
##  $ TermCount  : num  163 122 68 32 19 18

What have we learned

The take away here is that, just as we can do things to vectors and matrices, we can do similar things to dataframes which contain a mix of data types.