When we collect data, we also frequently collect strings of text. For example, we might collect data in a free text field where we ask participants, "what's your favourite food?"
Let’s build a potential selection of choices into a variable called favouriteFood:
favouriteFood <- c("Costco hotdogs", "Pizza", "Pork and onion dumplings", "Anything deep fried", "Potato chips")
favouriteFood
## [1] "Costco hotdogs" "Pizza"
## [3] "Pork and onion dumplings" "Anything deep fried"
## [5] "Potato chips"
Note
Our text is contained in " ". Characters, or strings of text, always need to be enclosed in quotation marks, which tells R we're dealing with natural language, not units that we can compute with, such as numbers, or the result of a conditional test, like logical values.
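As a small illustration of the note above, the sketch below checks how R treats quoted and unquoted values. The variable name food is made up for this example.

```r
## a quoted value is stored as character data
food <- "Pizza"
is.character(food)
## [1] TRUE

## single quotes work the same way as double quotes
identical("Pizza", 'Pizza')
## [1] TRUE

## digits wrapped in quotes are text, not numbers we can compute with
is.numeric("42")
## [1] FALSE
is.numeric(42)
## [1] TRUE
```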
We can ask R about both the class of our object, with class(), and its data type, with typeof():
class(favouriteFood)
## [1] "character"
typeof(favouriteFood)
## [1] "character"
We could hold larger chunks of text in a variable if we wanted, and in fact, we can use R for text analysis, starting with a single work of text as a data import which we then break apart for analysis.
Often our character data is in fact categorical data. Categorical data, while represented by strings of characters, is not the same as prose. It would be nice to have a way to work with categorical text data as categorical data.
With favouriteFood we have 5 choices that participants can select from. It is common practice to code these choices, say from 1 through 5. We then have three things that we're working with: a variable, a set of labels, and a set of codes assigned to those labels. In describing this data set, say in a data dictionary that accompanies our data, we might record something like:
code | label |
---|---|
1 | Costco hotdogs |
2 | Pizza |
3 | Pork and onion dumplings |
4 | Anything deep fried |
5 | Potato chips |
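One possible way to hold such a data dictionary in R is as a small data frame, sketched below. The names foodDictionary, responses and responseCodes are made up for this illustration, and the three responses are hypothetical.

```r
## the data dictionary as a data frame: one row per choice
foodDictionary <- data.frame(
  code  = 1:5,
  label = c("Costco hotdogs", "Pizza", "Pork and onion dumplings",
            "Anything deep fried", "Potato chips"),
  stringsAsFactors = FALSE  ## keep the labels as plain text
)

## three hypothetical free-text responses
responses <- c("Pizza", "Potato chips", "Pizza")

## look up the code for each response
responseCodes <- foodDictionary$code[match(responses, foodDictionary$label)]
responseCodes
## [1] 2 5 2

## translate the codes back into their labels
foodDictionary$label[match(responseCodes, foodDictionary$code)]
## [1] "Pizza" "Potato chips" "Pizza"
```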
Why might it be useful to store categorical data as integers?
One reason would be to think about the amount of information we're working with. Costco hotdogs is comprised of thirteen letters; Pork and onion dumplings is a whopping 21. That’s a lot more information than 1 and 3. Storing and processing this data as integers tied to labels is much less resource heavy. We can see this by taking some large random samples from our two potential ways of storing this data:
labels <- sample(favouriteFood, 1000000, replace = TRUE) ## take a million samples from favouriteFood
codes <- sample(c(1:5), 1000000, replace = TRUE) ## take a million samples from the integer codes representing favouriteFood
labels.size <- format(object.size(labels), units = "Mb") ## calculate the amount of storage space needed to hold the labels
codes.size <- format(object.size(codes), units = "Mb") ## calculate the amount of storage space needed to hold the integer codes
## print the difference to the screen
paste0("Using labels, we'd be using ", labels.size, " worth of memory. Using numeric codes, we'd be using ", codes.size, ", which is half as much!")
## [1] "Using labels, we'd be using 7.6 Mb worth of memory. Using numeric codes, we'd be using 3.8 Mb, which is half as much!"
When we think about using R well, we want to be thinking in part about how to avoid inefficiencies. Storing categorical data as integers with reference labels is one way to do this.
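As a rough check of the same idea, the sketch below stores one set of labels twice: once as plain text and once as a factor, R's class for integers with labels. The seed, the sample size and the variable names are arbitrary choices for this illustration, and the comparison assumes a typical 64-bit installation of R.

```r
set.seed(1)  ## arbitrary seed so the sample is reproducible

foods <- c("Costco hotdogs", "Pizza", "Pork and onion dumplings",
           "Anything deep fried", "Potato chips")

## the same 100,000 responses, held as text and as a factor
foodText   <- sample(foods, 100000, replace = TRUE)
foodFactor <- as.factor(foodText)

## compare the memory used by each representation, in bytes
textBytes   <- as.numeric(object.size(foodText))
factorBytes <- as.numeric(object.size(foodFactor))
textBytes > factorBytes
## [1] TRUE
```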
Back to R and categories
R provides us with a semantic class, factors, to address this; factors are, just as we suggested above as a solution, integers with labels.
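To make "integers with labels" concrete, here is a minimal sketch that builds a factor from the favourite food codes and labels. The ten codes are hypothetical responses, and the names foodCodes, foodLabels and foodFactor are made up for this example.

```r
## hypothetical responses from ten participants, recorded as codes 1 to 5
foodCodes <- c(2, 5, 2, 1, 3, 4, 2, 5, 1, 3)

## the labels from the data dictionary, in code order
foodLabels <- c("Costco hotdogs", "Pizza", "Pork and onion dumplings",
                "Anything deep fried", "Potato chips")

## attach the labels to the codes
foodFactor <- factor(foodCodes, levels = 1:5, labels = foodLabels)

## printed using the labels ...
foodFactor[1:3]

## ... but stored as the integer codes
as.integer(foodFactor[1:3])
## [1] 2 5 2
```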
Say we collected information on eye colour from some survey participants. We can capture possible values in the following vector:
eyeColour <- c("blue", "green", "brown", "hazel", "amber", "grey")
eyeColour
## [1] "blue" "green" "brown" "hazel" "amber" "grey"
Now let's treat this more like true data capture, and create a random sample of eye colour from 50 participants. We'll look more at sample() shortly.
set.seed(120) ## make the example reproducible
eyeColour.Sample <- sample(eyeColour, 50, replace = TRUE) ## take a sample
eyeColour.Sample ## print the sample to the screen
## [1] "amber" "brown" "blue" "grey" "hazel" "green" "blue" "green" "blue"
## [10] "hazel" "amber" "hazel" "brown" "blue" "hazel" "brown" "brown" "green"
## [19] "brown" "hazel" "green" "brown" "amber" "hazel" "grey" "green" "blue"
## [28] "blue" "hazel" "green" "hazel" "grey" "grey" "grey" "green" "green"
## [37] "amber" "brown" "brown" "green" "green" "brown" "green" "blue" "green"
## [46] "brown" "amber" "hazel" "amber" "brown"
Let's first ask R what class() our object is and what typeof() data we're storing:
class(eyeColour.Sample)
## [1] "character"
typeof(eyeColour.Sample)
## [1] "character"
Note that both return "character". But we know that this is categorical data with text labels, so ideally class() would tell us that we have a vector holding a categorical variable, while typeof() would tell us that it is stored as numeric values.
So, we have character data that we want held as a factor, that is, as categorical data: we want our strings converted to integers, where each integer has its original string as its label.
R allows us to convert data between classes. R stores categorical data as a class called factor. We can convert our plain text character data to categorical data with the as.factor() function:
eyeColour.Sample <- as.factor(eyeColour.Sample)
eyeColour.Sample
## [1] amber brown blue grey hazel green blue green blue hazel amber hazel
## [13] brown blue hazel brown brown green brown hazel green brown amber hazel
## [25] grey green blue blue hazel green hazel grey grey grey green green
## [37] amber brown brown green green brown green blue green brown amber hazel
## [49] amber brown
## Levels: amber blue brown green grey hazel
We have now overwritten our variable eyeColour.Sample, and when we call it, R reports on levels, that is, categories, and it has dropped our " ". Also, note that we're no longer looking strictly at our underlying data and its raw structure; we're looking instead at the labels associated with our data.
If we ask about class() and typeof(), we now also see that this is a factor, made up of integers, which is displayed as text when called.
class(eyeColour.Sample)
## [1] "factor"
typeof(eyeColour.Sample)
## [1] "integer"
eyeColour.Sample
## [1] amber brown blue grey hazel green blue green blue hazel amber hazel
## [13] brown blue hazel brown brown green brown hazel green brown amber hazel
## [25] grey green blue blue hazel green hazel grey grey grey green green
## [37] amber brown brown green green brown green blue green brown amber hazel
## [49] amber brown
## Levels: amber blue brown green grey hazel
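To peek underneath the labels, the sketch below uses a small, fixed vector (named eyes here, a made-up name) so that the result does not depend on random sampling. It shows the integer codes and how they combine with the levels to give back the text.

```r
## a small, fixed example
eyes <- as.factor(c("brown", "blue", "hazel", "brown"))

## by default the levels are sorted alphabetically
levels(eyes)
## [1] "blue" "brown" "hazel"

## each value is stored as the position of its level
as.integer(eyes)
## [1] 2 1 3 2

## codes plus levels give back the original text
levels(eyes)[as.integer(eyes)]
## [1] "brown" "blue" "hazel" "brown"
```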
We can ask R to tell us only the labels used, as well as the number of labels or categories represented, using the functions levels() and nlevels():
levels(eyeColour.Sample)
## [1] "amber" "blue" "brown" "green" "grey" "hazel"
nlevels(eyeColour.Sample)
## [1] 6
eyeColour.Sample is nominal categorical data. Sometimes categorical data has an inherent order, such as storm classifications or education levels. The help page for factor, ?factor, tells us we can use as.ordered() to specify that our categorical data is ordinal, and is.ordered() to check whether it already is.
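We should then investigate whether the order being used by R is in fact the order we intend! The levels() function shows the current order. To change it, rebuild the factor with factor() and list the levels in the intended order in its levels argument; assigning new values to levels() renames the labels rather than re-ordering the data.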
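A minimal sketch of working with ordinal data follows, assuming a made-up set of education levels. The names education and educationOrdered, and the category labels, are hypothetical.

```r
## hypothetical education levels, recorded as text
education <- c("secondary", "doctorate", "bachelor", "secondary", "master")

## as.ordered() orders the levels alphabetically, which is not the order we intend
levels(as.ordered(education))
## [1] "bachelor" "doctorate" "master" "secondary"

## state the intended order explicitly instead
educationOrdered <- factor(education,
                           levels = c("secondary", "bachelor", "master", "doctorate"),
                           ordered = TRUE)
is.ordered(educationOrdered)
## [1] TRUE

## ordered factors support comparisons between levels
educationOrdered > "bachelor"
## [1] FALSE TRUE FALSE FALSE TRUE
```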
Note
In older versions of R (before 4.0.0), character data would be converted to factor on import. This is no longer the default behaviour. Some analyses of your data that you may undertake will only work if your character data is encoded as categorical data. There are ways to do this when importing your data, which we'll see later. After importing your data you can use as.factor() to achieve this.
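As a small sketch of both routes, the example below reads a tiny, made-up CSV supplied as text, so nothing needs to exist on disk. The stringsAsFactors argument of read.csv() is one option for converting at import time; the column and variable names are hypothetical.

```r
## a tiny, made-up CSV supplied as text so the example is self-contained
csvText <- "participant,eyeColour
1,blue
2,brown
3,blue"

## convert after import with as.factor()
survey <- read.csv(text = csvText)
survey$eyeColour <- as.factor(survey$eyeColour)
class(survey$eyeColour)
## [1] "factor"

## or ask for the conversion at import time
surveyFactors <- read.csv(text = csvText, stringsAsFactors = TRUE)
class(surveyFactors$eyeColour)
## [1] "factor"
```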
R allows us to convert between many data structures using the as. family of functions:
as.numeric
as.vector
as.character
etc…
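A few of these conversions are sketched below, including one that is easy to get wrong: calling as.numeric() on a factor returns the underlying integer codes, not the numbers shown in the labels. The variable ages is a made-up example.

```r
## text that looks like numbers can be converted to numbers
as.numeric(c("1", "2.5", "10"))
## [1] 1.0 2.5 10.0

## numbers can be converted to text
as.character(c(1, 2, 3))
## [1] "1" "2" "3"

## a factor whose labels look like numbers (hypothetical ages)
ages <- as.factor(c("30", "18", "30"))

## this returns the integer codes, not the ages
as.numeric(ages)
## [1] 2 1 2

## convert to text first to recover the ages
as.numeric(as.character(ages))
## [1] 30 18 30
```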
We’ve encountered a number of functions to this point that allow us to ask questions of our data and the objects holding our data. One other useful function is summary(), which provides an overview of our data. What summary() returns will depend on what you're calling the function on.
summary(eyeColour.Sample)
## amber blue brown green grey hazel
## 6 7 11 12 5 9
set.seed(120) ## make the example reproducible
eyeColour.Sample.Char <- sample(eyeColour, 50, replace = TRUE) ## take a sample
summary(eyeColour.Sample.Char) ## summarise the character vector
## Length Class Mode
## 50 character character
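To see how the result changes with the kind of data, the sketch below calls summary() on a small factor and on a numeric vector, and uses table() as one way to count the values of a character vector without converting it first. The vectors and names here are made up for illustration.

```r
## a small, fixed character vector (hypothetical responses)
eyes <- c("brown", "blue", "hazel", "brown")

## table() counts the values of a character vector directly
eyeCounts <- table(eyes)
eyeCounts

## summary() of a factor gives counts per level
factorSummary <- summary(as.factor(eyes))
factorSummary

## summary() of numeric data gives minimum, quartiles, mean and maximum
numericSummary <- summary(c(2, 4, 6, 8))
numericSummary
```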
as. functions available in R
R can work with plain text, which needs to be wrapped in " ".
R can hold categorical data numerically while representing it with labels, making it semantically significant.
R can convert between classes of data.