Character and Categorical Data

Introduction

When we collect data, we also frequently collect strings of text. For example, we might collect data in a free text field where we ask participants, "what's your favourite food?"

Let’s build a potential selection of choices into a variable called favouriteFood:

favouriteFood <- c("Costco hotdogs", "Pizza", "Pork and onion dumplings", "Anything deep fried", "Potato chips")

favouriteFood

## [1] "Costco hotdogs"           "Pizza"                   
## [3] "Pork and onion dumplings" "Anything deep fried"     
## [5] "Potato chips"

Note

Our text is contained in " ". Characters, or strings of text, always need to be encompassed in " ", which tells R we're dealing with natural language, not units that we can compute, such as numbers, or a conditional test, like logical values.

We can ask R about both our class() and data type typeof()

class(favouriteFood)

## [1] "character"

typeof(favouriteFood)

## [1] "character"

We could hold larger chunks of text in a variable if we wanted, and in fact, we can use R for text analysis starting with a single work of text as a data import which we then break apart for analysis.

Categorical Data

Often our character data is in fact categorical data. Categorical data, while represented by strings of characters, is not the same as prose. It would be nice to have a way to work with categorical text data as categorical data.

With favouriteFood we have 5 choices that participants can select from. It is common practice to code these choices, say from 1 through 5. We then have three things that we're working with–a variable, a set of labels, and a set of codes assigned to those labels. In describing this data set–say in a data dictionary that accompanied our data–we might record something like:

Variable: favouriteFood
Data type: character
Description: A selection of 5 choices of food types.
Possible values: "Costco hot dogs", "Pizza", "Pork and onion dumplings", "Anything deep fried", "Potato chips"
Code label assignment:

code	label
1	Costco hot dogs
2	Pizza
3	Pork and onion dumplings
4	Anything deep fried
5	Potato chips

Why might it be useful to store categorical data as integers?

One reason would be to think about the amount of information we're working with. Costco hot dogs is comprised of thirteen letters, Pork and onion dumplings is a whopping 21. That’s a lot more information than 1 and 3. Storing this information and processing this data as integer data tied to a label is much less resource heavy. We can see this by taking some large random samples from out two potential ways of storing this data

labels <- sample(favouriteFood, 1000000, replace = TRUE) ## take a million samples from favouriteFood

codes <- sample(c(1:5), 1000000, replace = TRUE) ## take a million samples from the integer codes representing favouritFood

labels.size <- format(object.size(labels), units = "Mb") ## calculate the amount of storage space needed to hold the labels

codes.size <- format(object.size(codes), units = "Mb") ## calculate the amount of storage space needed to hold the integer codes

## print the difference to the screen
paste0("Using labels, we'd be using ", labels.size, " worth of memory. Using numeric codes, we'd be using ", codes.size, ", which is half as much!")

## [1] "Using labels, we'd be using 7.6 Mb worth of memory. Using numeric codes, we'd be using 3.8 Mb, which is half as much!"

When we think about using R well, we want to be thinking in part of how to avoid inefficiencies. Storing categorical data as integers with reference labels is one way to do this.

Back to R and categories

R provides us with a semantic class, factors, to address this; factors are–just as we suggested above as a solution–integers with labels.

Say we collected information on eye colour from some survey participants. We can capture possible values in the following vector:

eyeColour <- c("blue", "green", "brown", "hazel", "amber", "grey")

eyeColour

## [1] "blue"  "green" "brown" "hazel" "amber" "grey"

Now let's treat this more like true data capture, and create a random sample of eye colour from 50 participants–we'll look more at sample() shortly.

set.seed(120) ## make the example reproducible

eyeColour.Sample <- sample(eyeColour, 50, replace = TRUE) ## take a sample

eyeColour.Sample ## print the sample to the screen

##  [1] "amber" "brown" "blue"  "grey"  "hazel" "green" "blue"  "green" "blue" 
## [10] "hazel" "amber" "hazel" "brown" "blue"  "hazel" "brown" "brown" "green"
## [19] "brown" "hazel" "green" "brown" "amber" "hazel" "grey"  "green" "blue" 
## [28] "blue"  "hazel" "green" "hazel" "grey"  "grey"  "grey"  "green" "green"
## [37] "amber" "brown" "brown" "green" "green" "brown" "green" "blue"  "green"
## [46] "brown" "amber" "hazel" "amber" "brown"

Let's first ask R what class() our object is and what typeof() data we're storing

class(eyeColour.Sample)

## [1] "character"

typeof(eyeColour.Sample)

## [1] "character"

Note that both return Character. But we know that this is categorical data with text labels, so ideally, class() would tell us that we have a vector of a categorical variable, while typeof() would tell us that this is store with numeric data values.

So, we have character data that we want held as factor, or categorical data–that is, we want our strings converted to integers where each integer has its original string as its label.

Converting Between Classes

R allows us to convert data between classes. R stores categorical data as a class called factor. We can convert our plain text character data to categorical data with the as.factor() function:

eyeColour.Sample <- as.factor(eyeColour.Sample)

eyeColour.Sample

##  [1] amber brown blue  grey  hazel green blue  green blue  hazel amber hazel
## [13] brown blue  hazel brown brown green brown hazel green brown amber hazel
## [25] grey  green blue  blue  hazel green hazel grey  grey  grey  green green
## [37] amber brown brown green green brown green blue  green brown amber hazel
## [49] amber brown
## Levels: amber blue brown green grey hazel

We have now overwritten our variable eyeColour.Sample and when we call it, R reports on levels, that is, categories, and it has dropped our " ". Also, note that we're no longer looking strictly at our underlying data and its raw structure, we're looking instead at the labels associated with our data.

If we ask about class() and typeof() we also now see that this is a factor, comprised of integers, which when called, is displayed with text.

class(eyeColour.Sample)

## [1] "factor"

typeof(eyeColour.Sample)

## [1] "integer"

eyeColour.Sample

##  [1] amber brown blue  grey  hazel green blue  green blue  hazel amber hazel
## [13] brown blue  hazel brown brown green brown hazel green brown amber hazel
## [25] grey  green blue  blue  hazel green hazel grey  grey  grey  green green
## [37] amber brown brown green green brown green blue  green brown amber hazel
## [49] amber brown
## Levels: amber blue brown green grey hazel

We can ask R to tell us only the labels used, as well as the number of labels or categories represented using the functions levels() and nlevels()

levels(eyeColour.Sample)

## [1] "amber" "blue"  "brown" "green" "grey"  "hazel"

nlevels(eyeColour.Sample)

## [1] 6

eyeColour.Sample is nominal categorical data. Sometimes categorical data has an inherent order, such as storm classifications or education levels. Calling the help page for factor–?factor tells us we can use as.ordered() or is.ordered() to specify that our categorical data is ordinal.

We should then investigate if the order being used by R is in fact the order we intend! The levels() function will allow you to re-order your ordinal data.

Note

In older versions of R, on import, character data would be converted to factor. This is no longer the default behaviour. Some analyses of your data that you may undertake will only work if your character data is encoded as categorical data. There are ways to do this when importing your data, which we'll see later. After importing your data you can use as.factor to achieve this.

R allows us to convert between many data structures using as.:

as.numeric
as.vector
as.character

etc…

Asking Questions

We’ve encountered a number of functions to this point that allow us to ask questions of our data and the objects holding our data. One other useful function is summary(), which provides an overview of our data. What summary() returns will depend on what you're calling the function on.

summary(eyeColour.Sample)

## amber  blue brown green  grey hazel 
##     6     7    11    12     5     9

set.seed(120) ## make the example reproducible

eyeColour.Sample.Char <- sample(eyeColour, 50, replace = TRUE) ## take a sample

summary(eyeColour.Sample.Char) ## print the sample to the screen

##    Length     Class      Mode 
##        50 character character

Exercise 3.2

Find all the as. functions available in R

What Have we Learned

R can work with plain text, which needs to be wrapped in " "
R can hold categorical data numerically while representing it with labels making it semantically significant.
R can convert between classes of data