Vectors are one-dimensional objects.
Arrays are multidimensional. A matrix is a special case of an array with exactly two dimensions: it has length and width but no depth. Think of it as an Excel document that has only one sheet.
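To make the distinction concrete, here is a minimal sketch that builds one object of each kind and asks R for its dimensions. The object names v, m and a are made up for this illustration.

```r
# A vector has no dim attribute; matrices and arrays do.
v <- 1:6                               # vector: one dimension
m <- matrix(1:6, nrow = 2, ncol = 3)   # matrix: rows and columns only
a <- array(1:24, dim = c(2, 3, 4))     # array: rows, columns, and 4 "sheets"

dim(v)        # NULL
dim(m)        # 2 3
dim(a)        # 2 3 4
is.array(m)   # TRUE: a matrix is a two-dimensional array
is.matrix(a)  # FALSE: three dimensions, so not a matrix
```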
Note
Generating all of our own data may seem a little tedious, but it's a great way both to troubleshoot and to get familiar with a new concept. Once you move into online forums for support, answers will often use ad hoc generated data to demonstrate a solution. In fact, if you're asking for help, you'll be asked to share a reproducible example, and you may not wish to do so using your actual data points, so it's good to be familiar with the processes involved in data generation.
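As a small illustration of why generated data can be shared reproducibly, the sketch below draws the same random sample twice by fixing the seed first. The seed value 42 and the names first_draw and second_draw are arbitrary choices for this example.

```r
set.seed(42)                                       # arbitrary seed for this sketch
first_draw <- sample(120:190, 5, replace = TRUE)

set.seed(42)                                       # resetting the seed restarts the random sequence
second_draw <- sample(120:190, 5, replace = TRUE)

identical(first_draw, second_draw)                 # TRUE: same seed, same sample
length(first_draw)                                 # 5
```

Anyone who runs these lines with the same seed should get the same values, which is what makes a generated example easy for others to reproduce.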
Imagine we're collecting data on people, including their height and weight. To simulate a random sample, create the following vectors:
set.seed(120) ## This way we know we're all getting the same values
Height <- sample(120:190, 20, replace = TRUE) # A random sample of 20 data points from the range 120 to 190, drawn with replacement (values can repeat)
Weight <- sample(50:90, 20, replace = TRUE) # A random sample of 20 data points from the range 50 to 90, drawn with replacement (values can repeat)
Height
## [1] 186 176 157 166 142 155 153 168 190 163 130 184 154 161 163 158 180 139 121
## [20] 136
Weight
## [1] 83 61 55 63 79 59 88 51 60 66 75 60 65 69 70 54 72 60 89 73
We now have two uni-dimensional objects; two vectors.
To convert these into a two-dimensional array, or matrix, we'll combine them into a single object called PhysCharacteristics.
To do this, we will use the function cbind(), as in column bind: a handy function that appends one vector of data onto another, giving us a matrix with 2 variables, each with 20 data points.
We will give cbind() two arguments: the variable names of our two vectors.
PhysCharacteristics <- cbind(Height, Weight)
head(PhysCharacteristics)
## Height Weight
## [1,] 186 83
## [2,] 176 61
## [3,] 157 55
## [4,] 166 63
## [5,] 142 79
## [6,] 155 59
Great, now we have a grid of data.
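As the output above suggests, cbind() labels each column with the name of the vector it came from. The sketch below, using made-up vectors a and b, shows that behaviour and also how named arguments can set different column names.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

m1 <- cbind(a, b)                    # column names come from the variable names
m2 <- cbind(First = a, Second = b)   # named arguments set the column names instead

colnames(m1)   # "a" "b"
colnames(m2)   # "First" "Second"
is.matrix(m1)  # TRUE
dim(m1)        # 3 2
```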
We could get a summary of our data if we wanted:
summary(PhysCharacteristics)
## Height Weight
## Min. :121.0 Min. :51.0
## 1st Qu.:150.2 1st Qu.:60.0
## Median :159.5 Median :65.5
## Mean :159.1 Mean :67.6
## 3rd Qu.:170.0 3rd Qu.:73.5
## Max. :190.0 Max. :89.0
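If only one or two of these statistics are needed, they can also be computed column by column. The following is a minimal sketch on a small made-up matrix (not the sampled data above), using colMeans() and apply().

```r
m <- cbind(Height = c(150, 160, 170), Weight = c(50, 60, 70))  # made-up values

colMeans(m)          # Height 160, Weight 60
apply(m, 2, max)     # column maxima: Height 170, Weight 70
apply(m, 2, range)   # row 1 holds the minima, row 2 the maxima
```

The second argument of apply() is 2 here because we want the function applied over columns; 1 would apply it over rows.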
Note
Matrices and arrays come with a limitation: each column of data must be the same length. If we try to combine two objects of different lengths, R will recycle the values of the shorter object until it matches the length of the longer one, and it warns us when the longer length is not a whole multiple of the shorter one.
v1 <- c(1:11)
v2 <- c(1,2)
v3 <- cbind(v1, v2)
## Warning in cbind(v1, v2): number of rows of result is not a multiple of vector
## length (arg 2)
v3
## v1 v2
## [1,] 1 1
## [2,] 2 2
## [3,] 3 1
## [4,] 4 2
## [5,] 5 1
## [6,] 6 2
## [7,] 7 1
## [8,] 8 2
## [9,] 9 1
## [10,] 10 2
## [11,] 11 1
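The warning above appears because 11 is not a multiple of 2. When the longer length is a multiple of the shorter one, the recycling happens silently, which is easy to miss. Here is a minimal sketch with made-up vectors long and short.

```r
long  <- 1:4
short <- c(1, 2)

# 4 is a multiple of 2, so cbind() recycles `short` without any warning
silent <- cbind(long, short)
silent[, "short"]   # 1 2 1 2

# 3 is not a multiple of 2, so this call raises a warning,
# which tryCatch() intercepts and turns into a message string
warned <- tryCatch(
  cbind(1:3, short),
  warning = function(w) "warning raised"
)
warned              # "warning raised"
```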
Also, matrices, like vectors, are atomic: a matrix or array can contain only one data type (numeric, character, logical, etc.). We cannot have a matrix where one column is numeric data and another is character data. We have other options for this that we'll look at in a moment.
Let's now create a matrix with three columns to see a few more things that we can do with matrices and arrays.
Say that we are also interested in collecting age in addition to height and weight. As before, create a sample of 20 ages from an undergraduate population, say between 20 and 25:
set.seed(120)
Age <- sample(20:25, 20, replace = TRUE) # A random sample of 20 data points from the range 20 to 25, drawn with replacement (values can repeat)
Age
## [1] 24 22 20 25 23 21 20 21 20 23 24 23 22 20 23 22 22 21 22 23
We'll employ the function cbind() again to join this new variable and associated data with our existing matrix.
PhysCharacteristics <- cbind(PhysCharacteristics, Age) ## overwrite PhysCharacteristics with a merger between PhysCharacteristics and Age
head(PhysCharacteristics) ## display the first few rows
## Height Weight Age
## [1,] 186 83 24
## [2,] 176 61 22
## [3,] 157 55 20
## [4,] 166 63 25
## [5,] 142 79 23
## [6,] 155 59 21
We now have three columns, one each for Height, Weight, and Age. And we’re finally starting to see something like a proper data set!
In the same way that we can add columns to our data, we can also add rows to our data. We are not limited to adding a single column or a single row at a time.
In the following steps, we will use rbind(), as in row bind, to merge two data sets together.
Say we get two data sets of height, weight, and age data, each collected by a different researcher. We already have one set, PhysCharacteristics. We'll create a second, ResearcherTwo, to work through this.
As before, we'll take three samples and cbind() them together.
set.seed(121) ## not strictly necessary, but makes sure we're all in the same place
## generate some data
Height2 <- sample(120:190, 20, replace = TRUE)
Weight2 <- sample(50:90, 20, replace = TRUE)
Age2 <- sample(20:25, 20, replace = TRUE)
## print that data to the screen to verify
Height2
## [1] 171 147 131 190 123 178 120 170 132 149 184 127 185 184 153 155 156 151 159
## [20] 174
Weight2
## [1] 67 61 65 63 55 57 67 83 55 83 66 54 62 80 76 62 76 83 69 83
Age2
## [1] 22 21 24 20 21 23 21 20 21 21 24 20 21 22 23 21 22 22 22 25
ResearcherTwo <- cbind(Height2, Weight2, Age2) ## column bind the data
head(ResearcherTwo) ## check things out
## Height2 Weight2 Age2
## [1,] 171 67 22
## [2,] 147 61 21
## [3,] 131 65 24
## [4,] 190 63 20
## [5,] 123 55 21
## [6,] 178 57 23
Now we put them together into a new matrix called completePhysDataSet.
completePhysDataSet <- rbind(PhysCharacteristics, ResearcherTwo)
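Note that the two matrices do not share column names (Height versus Height2, and so on), yet rbind() combines them anyway. The sketch below, using small made-up matrices first and second, shows that the result takes its column names from the first matrix and that columns are matched by position, not by name.

```r
first  <- cbind(Height = c(150, 160), Weight = c(55, 65))     # made-up values
second <- cbind(Height2 = c(170, 180), Weight2 = c(75, 85))   # different column names

combined <- rbind(first, second)
colnames(combined)    # "Height" "Weight": names come from the first matrix
dim(combined)         # 4 2

# Columns are matched by position, so a mismatched column order goes unnoticed
swapped <- rbind(first, second[, c("Weight2", "Height2")])
swapped[3, "Height"]  # 75: a weight has landed in the Height column
```

This is one reason to confirm that both data sets list their variables in the same order before binding them.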
We'll use View() to see if we were successful:
View(completePhysDataSet)
We can use summary() to get some basic stats about our data:
summary(completePhysDataSet)
## Height Weight Age
## Min. :120.0 Min. :51.00 Min. :20.00
## 1st Qu.:145.8 1st Qu.:60.00 1st Qu.:21.00
## Median :157.5 Median :66.00 Median :22.00
## Mean :158.0 Mean :67.97 Mean :21.93
## 3rd Qu.:174.5 3rd Qu.:76.00 3rd Qu.:23.00
## Max. :190.0 Max. :89.00 Max. :25.00
We can also get information about the structure of our new data object with str()
str(completePhysDataSet)
## int [1:40, 1:3] 186 176 157 166 142 155 153 168 190 163 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:3] "Height" "Weight" "Age"
Remember we said that matrices, like vectors, are atomic. Let’s see what the implications of this are.
Imagine we were also collecting eye colour as a variable. What we’ll do then is create a vector of 40 random samples of eye colour, named physData.eyeColour.sample, and add it as a fourth column to our matrix.
Remember, we already created a vector of eye colours from which we can generate a random sample.
eyeColour
## [1] "blue" "green" "brown" "hazel" "amber" "grey"
We'll use this to generate a sample of 40 data points that we can add to our completePhysDataSet
set.seed(120)
physData.eyeColour.sample <- sample(eyeColour, 40, replace = TRUE) # A random sample of 40 data points from the character options in eyeColour, drawn with replacement (values can repeat)
physData.eyeColour.sample
## [1] "amber" "brown" "blue" "grey" "hazel" "green" "blue" "green" "blue"
## [10] "hazel" "amber" "hazel" "brown" "blue" "hazel" "brown" "brown" "green"
## [19] "brown" "hazel" "green" "brown" "amber" "hazel" "grey" "green" "blue"
## [28] "blue" "hazel" "green" "hazel" "grey" "grey" "grey" "green" "green"
## [37] "amber" "brown" "brown" "green"
Now, we'll add this to completePhysDataSet with cbind():
completePhysDataSet <- cbind(completePhysDataSet, physData.eyeColour.sample)
head(completePhysDataSet)
## Height Weight Age physData.eyeColour.sample
## [1,] "186" "83" "24" "amber"
## [2,] "176" "61" "22" "brown"
## [3,] "157" "55" "20" "blue"
## [4,] "166" "63" "25" "grey"
## [5,] "142" "79" "23" "hazel"
## [6,] "155" "59" "21" "green"
Outcome: R has adapted on the fly to allow you to do this.
In doing so, however, it has converted all of your numerical data to character data! Remember we said that R is a little too easygoing and lets you do things it probably shouldn't? This is a case in point. Instead of raising an error or a warning, it has coerced the other data in your matrix to conform to the data type you've added.
You should always be careful and always check that what you intended to happen actually happened.
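One way to run such a check is to ask R for the type of the matrix before and after the change. This is a minimal sketch on a two-row matrix; the names nums and mixed are made up for the illustration.

```r
nums <- cbind(Height = c(186, 176), Weight = c(83, 61))  # first two rows from above
is.numeric(nums)     # TRUE
typeof(nums)         # "double"

mixed <- cbind(nums, Eye = c("amber", "brown"))          # add a character column
is.numeric(mixed)    # FALSE
typeof(mixed)        # "character"

# The numbers are now stored as text and must be converted before any arithmetic
mean(as.numeric(mixed[, "Height"]))   # 181
```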
Let's reset things…
completePhysDataSet <- rbind(PhysCharacteristics, ResearcherTwo) ## overwrite the variable holding the data set back to when we first added the two data sets together
Matrices, like vectors, are indexed, but now they have two indexed axes: rows and columns.
We can work with these just as we did with vectors, but we now need to specify whether we're interested in a row or a column of data, that is, an observation or a variable.
We do this in the following way: matrix[row, column].
So, say we wanted to see only the Height data. We can do this in one of two ways, with an index number or an index label:
completePhysDataSet[, 1]
## [1] 186 176 157 166 142 155 153 168 190 163 130 184 154 161 163 158 180 139 121
## [20] 136 171 147 131 190 123 178 120 170 132 149 184 127 185 184 153 155 156 151
## [39] 159 174
completePhysDataSet[, "Height"] ## since labels are characters, they need to be wrapped in " "
## [1] 186 176 157 166 142 155 153 168 190 163 130 184 154 161 163 158 180 139 121
## [20] 136 171 147 131 190 123 178 120 170 132 149 184 127 185 184 153 155 156 151
## [39] 159 174
If we want to see only Height and Age, we can concatenate the indices, just as we did with our vector:
head(completePhysDataSet[, c(1,3)])
## Height Age
## [1,] 186 24
## [2,] 176 22
## [3,] 157 20
## [4,] 166 25
## [5,] 142 23
## [6,] 155 21
head(completePhysDataSet[, c("Height", "Age")])
## Height Age
## [1,] 186 24
## [2,] 176 22
## [3,] 157 20
## [4,] 166 25
## [5,] 142 23
## [6,] 155 21
To see a row:
completePhysDataSet[1, ]
## Height Weight Age
## 186 83 24
completePhysDataSet[1:3, ]
## Height Weight Age
## [1,] 186 83 24
## [2,] 176 61 22
## [3,] 157 55 20
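Notice that the single row came back as a named vector, while the three rows came back as a matrix. By default R drops dimensions it considers unnecessary. The sketch below, on a small matrix m built from the first three rows above, shows how the drop = FALSE argument keeps the matrix shape.

```r
m <- cbind(Height = c(186, 176, 157), Weight = c(83, 61, 55))  # first three rows from above

one_row <- m[1, ]                  # dimensions are dropped: a named vector
is.matrix(one_row)                 # FALSE
names(one_row)                     # "Height" "Weight"

kept <- m[1, , drop = FALSE]       # drop = FALSE keeps the 1 x 2 matrix shape
is.matrix(kept)                    # TRUE
dim(kept)                          # 1 2
```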
Note
When we leave the row or column position blank, R returns every row or every column, respectively.
To see an intersection:
completePhysDataSet[1,3]
## Age
## 24
completePhysDataSet[1, "Age"]
## Age
## 24
completePhysDataSet[1:3, c("Height", "Age")]
## Height Age
## [1,] 186 24
## [2,] 176 22
## [3,] 157 20
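Index numbers and labels are not the only option. A logical vector can also go in the row position to keep only the rows that meet a condition. This is a minimal sketch on the first four Height and Age values shown above; the names m and tall are made up.

```r
m <- cbind(Height = c(186, 176, 157, 166),
           Age    = c(24, 22, 20, 25))      # first four Height/Age rows from above

tall <- m[, "Height"] > 170                 # logical vector, one value per row
tall                                        # TRUE TRUE FALSE FALSE
m[tall, ]                                   # only the rows where Height > 170
m[m[, "Age"] >= 24, "Height"]               # Heights of people aged 24 or more: 186 166
```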
Three questions that we can ask of multi-dimensional structures in R that can come in handy are:
nrow(), which tells us the number of rows
ncol(), which tells us the number of columns
dim(), which tells us both pieces of information
nrow(completePhysDataSet)
## [1] 40
ncol(completePhysDataSet)
## [1] 3
dim(completePhysDataSet)
## [1] 40 3
Note
The outputs of all of these inquiries are data structures themselves. In fact, each of the three functions above returns a vector consisting of either a single numeric value or two numeric values.
is.vector(nrow(completePhysDataSet))
## [1] TRUE
This means we can use these outputs for other calculations.
newCalculation <- nrow(completePhysDataSet) + ncol(completePhysDataSet)
newCalculation
## [1] 43
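Adding the two counts is only a demonstration. In practice these values are more often used as an index or as a divisor. The following sketch, on a small matrix built from the first three rows above, shows both uses; the variable names are made up.

```r
m <- cbind(Height = c(186, 176, 157), Weight = c(83, 61, 55))  # first three rows from above

last_row    <- m[nrow(m), ]                   # nrow() as an index: the final observation
mean_height <- sum(m[, "Height"]) / nrow(m)   # nrow() as a divisor: a mean by hand
n_cells     <- nrow(m) * ncol(m)              # total number of values stored

last_row      # Height 157, Weight 55
mean_height   # 173
n_cells       # 6
```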
cbind()
rbind()
Let’s bring together a few of the things we’ve learned to organize some of our data and, in doing so, introduce a few more useful functions.
Take a few minutes and try to:
order the Height vector data from lowest to highest value.
Do the above using the following functions:
sort()
unique()
as.factor()
summary()
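Before trying these on Height, it may help to see what each function returns on a tiny vector. This is only a sketch on made-up values (the vector x is not part of the lesson data), so the exercise itself is left to you.

```r
x <- c(3, 1, 2, 3, 1)        # made-up values, not the Height data

sort(x)                      # 1 1 2 3 3
sort(x, decreasing = TRUE)   # 3 3 2 1 1
unique(x)                    # 3 1 2 (first occurrences, in original order)

f <- as.factor(x)            # treat each distinct value as a category
levels(f)                    # "1" "2" "3"
summary(f)                   # counts per level: "1" twice, "2" once, "3" twice
```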