Dataframes

Spreadsheet Data

Matrices work well, and are very efficient, when every variable has the same data type. But most of us are used to working with data in which some variables are numerical, some are categorical, and some are character-based.

Consider the following table, which records the number of times each presidential candidate used the words 'will', 'shall', and 'going to' in the presidential debates, and whether they won or lost the popular vote:

Year	Candidate	Won (W) or Lost (L) the Popular Vote	Number of 'will', 'shall', 'going to'
1960	Kennedy	W	163
1960	Nixon	L	122
1976	Carter	W	68
1976	Ford	L	32
1980	Reagan	W	19
1980	Carter	L	18
…	…	…	…

Luckily, R has a way of handling this spreadsheet-like data: the dataframe.

Creating a Dataframe

Let’s create the above as a dataframe using the data.frame() function:

presidentialElection <- data.frame(
    Year = c(1960, 1960, 1976, 1976, 1980, 1980),
    Candidate = c("Kennedy", "Nixon", "Carter", "Ford", "Reagan", "Carter"),
    PopularVote = c("W", "L", "W", "L", "W", "L"),
    TermCount = c(163, 122, 68, 32, 19, 18)
)

First, let’s take a look at our dataframe. View() opens it in a spreadsheet-style viewer:

View(presidentialElection)

Targeting a Variable

We can target a specific variable (or vector) in a dataframe with the $ symbol, using the following pattern:

dataframeName$variableName

For example, if we want to see the first few entries of the Candidate data we would enter:

head(presidentialElection$Candidate)

## [1] "Kennedy" "Nixon"   "Carter"  "Ford"    "Reagan"  "Carter"
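The $ symbol is not the only way to pull out a column. As a minimal sketch, double square brackets with a quoted column name return the same vector, and they also accept a name stored in a variable. The name columnName below is a hypothetical helper introduced only for illustration.

```r
# Rebuild the example dataframe so this snippet runs on its own
presidentialElection <- data.frame(
    Year = c(1960, 1960, 1976, 1976, 1980, 1980),
    Candidate = c("Kennedy", "Nixon", "Carter", "Ford", "Reagan", "Carter"),
    PopularVote = c("W", "L", "W", "L", "W", "L"),
    TermCount = c(163, 122, 68, 32, 19, 18)
)

# $ and [[ ]] both return the column as a plain vector
byDollar <- presidentialElection$Candidate
byBrackets <- presidentialElection[["Candidate"]]
identical(byDollar, byBrackets)   # TRUE

# [[ ]] also works when the column name is held in a variable
columnName <- "TermCount"   # hypothetical helper variable
presidentialElection[[columnName]]
```

The $ form is usually the quicker one to type at the console, while [[ ]] tends to be handier when the column name is not known until the code runs.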

Querying a Dataframe

Since dataframes are made up of a series of vectors of equal length, we can query any column as we would any vector.
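As a quick sketch of what querying a column like any other vector can look like, the snippet below runs a few ordinary vector functions on TermCount. The name termCounts is a hypothetical variable introduced only for this example.

```r
# Rebuild the example dataframe so this snippet runs on its own
presidentialElection <- data.frame(
    Year = c(1960, 1960, 1976, 1976, 1980, 1980),
    Candidate = c("Kennedy", "Nixon", "Carter", "Ford", "Reagan", "Carter"),
    PopularVote = c("W", "L", "W", "L", "W", "L"),
    TermCount = c(163, 122, 68, 32, 19, 18)
)

# Pull the column out as a numeric vector (hypothetical variable name)
termCounts <- presidentialElection$TermCount

length(termCounts)   # 6 entries, one per row
max(termCounts)      # 163, the largest count
mean(termCounts)     # the average count across all six rows

# A logical test on one column can select entries from another
presidentialElection$Candidate[termCounts > 100]   # "Kennedy" "Nixon"
```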

We can, for example, ask about the data type of a given variable, or even change it. Let's ask about PopularVote:

typeof(presidentialElection$PopularVote)

## [1] "character"

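We see that it is character. But this data would probably be more useful if interpreted as categorical data, so we can convert it to a factor. R stores a factor internally as integer codes, which is why typeof() reports "integer" after the conversion: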

presidentialElection$PopularVote <- as.factor(presidentialElection$PopularVote)

typeof(presidentialElection$PopularVote)

## [1] "integer"

levels(presidentialElection$PopularVote)

## [1] "L" "W"
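It may look odd that typeof() reports "integer" for a column holding "W" and "L". The sketch below, which uses a standalone vector with the hypothetical name popularVote rather than the dataframe column, shows how typeof() and class() describe a factor differently.

```r
# A standalone factor with the same values as the PopularVote column
popularVote <- as.factor(c("W", "L", "W", "L", "W", "L"))

typeof(popularVote)       # "integer": how the values are stored internally
class(popularVote)        # "factor": how R treats the vector
levels(popularVote)       # "L" "W", sorted alphabetically by default
as.integer(popularVote)   # 2 1 2 1 2 1, each value's position in levels()

# Count how many observations fall into each level
table(popularVote)
```

In short, typeof() describes the storage and class() describes the behaviour, so class() is usually the more informative check for a factor.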

Indexing

Since dataframes are indexed and have the same two-dimensional structure as a matrix, we can also index them like a matrix. So we can view a single row or a single data point using either a numeric index or a column name:

presidentialElection[2,]

##   Year Candidate PopularVote TermCount
## 2 1960     Nixon           L       122

presidentialElection[2,2]

## [1] "Nixon"

presidentialElection[2,"Candidate"]

## [1] "Nixon"
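The same bracket notation extends to whole columns, to several rows and columns at once, and to rows chosen by a condition. The following is a minimal sketch along those lines; the name winners is a hypothetical variable introduced only for illustration.

```r
# Rebuild the example dataframe so this snippet runs on its own
presidentialElection <- data.frame(
    Year = c(1960, 1960, 1976, 1976, 1980, 1980),
    Candidate = c("Kennedy", "Nixon", "Carter", "Ford", "Reagan", "Carter"),
    PopularVote = c("W", "L", "W", "L", "W", "L"),
    TermCount = c(163, 122, 68, 32, 19, 18)
)

# A whole column, by position or by name (leave the row slot empty)
presidentialElection[, 4]
presidentialElection[, "TermCount"]

# Several rows and several named columns at once
presidentialElection[1:2, c("Candidate", "TermCount")]

# Rows that satisfy a condition: a logical vector goes in the row slot
winners <- presidentialElection[presidentialElection$PopularVote == "W", ]
winners$Candidate   # "Kennedy" "Carter" "Reagan"
```

The comparison with "W" works whether PopularVote is still a character column or has already been converted to a factor.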

Questions

While we can run the usual questions against our dataframe, such as nrow() to get a sense of how many observations are in the dataset and summary() for some basic computed stats, this is where str() can be particularly useful:

str(presidentialElection)

## 'data.frame':    6 obs. of  4 variables:
##  $ Year       : num  1960 1960 1976 1976 1980 ...
##  $ Candidate  : chr  "Kennedy" "Nixon" "Carter" "Ford" ...
##  $ PopularVote: Factor w/ 2 levels "L","W": 2 1 2 1 2 1
##  $ TermCount  : num  163 122 68 32 19 18
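For comparison with str(), here is a short sketch of the other "usual questions" mentioned above. It rebuilds the dataframe and converts PopularVote to a factor so that it runs on its own; the output of summary() is left unshown because its layout depends on the column types.

```r
# Rebuild the example dataframe so this snippet runs on its own
presidentialElection <- data.frame(
    Year = c(1960, 1960, 1976, 1976, 1980, 1980),
    Candidate = c("Kennedy", "Nixon", "Carter", "Ford", "Reagan", "Carter"),
    PopularVote = c("W", "L", "W", "L", "W", "L"),
    TermCount = c(163, 122, 68, 32, 19, 18)
)
presidentialElection$PopularVote <- as.factor(presidentialElection$PopularVote)

nrow(presidentialElection)    # 6 observations (rows)
ncol(presidentialElection)    # 4 variables (columns)
names(presidentialElection)   # the column names

# Per-column summaries: quartiles for numbers, counts for factor levels
summary(presidentialElection)
```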

What have we learned?

The takeaway here is that, just as we can do things to vectors and matrices, we can do similar things to dataframes, which can contain a mix of data types.