Matrix Data

Vectors are single dimension objects.

Arrays are multidimensional. A matrix is a special case of an array that contains only 2 dimensions; they have length and width but no depth. Think of this like a an excel document that has only one sheet.

  • Vectors have one dimension
  • Arrays have two or more dimensions
  • Matrices are a specific case of an array where there are only 2 dimensions

Note

Generating all of our own data may seem a little tedious, but it's both a great way to troubleshoot, get familiar with a new concept, and once you move into online forums for support, this support will often use ad hoc generated data to provide a solution–in fact if you're asking for help, you'll be asked to share a reprducible example, but you may not wish to do so using youractual data points–so it's good to be familiar with the processes involved in data generation.

Imagine we're collecting data on people, including their height and weight. To simulate a random sample create the following vectors:

set.seed(120) ## This way we know we're all getting the same values

Height <- sample(120:190, 20, replace = TRUE) # A random sample of 20 data points from the range 120 to 190 where values can be replaced

Weight <- sample(50:90, 20, replace = TRUE) # A random sample of 20 data points from the range 50 to 90 where values can be replaced

Height
##  [1] 186 176 157 166 142 155 153 168 190 163 130 184 154 161 163 158 180 139 121
## [20] 136
Weight
##  [1] 83 61 55 63 79 59 88 51 60 66 75 60 65 69 70 54 72 60 89 73

Combining Data

We now have two uni-dimensional objects; two vectors.

To convert these into a two dimensional array, or matrix, we'll combine them together into a single object called PhysCharacteristics.

cbind()

To do this, we will use the function cbind(), as in column bind; a handy function that allows us to append one vector of data onto another, which will give us a matrix with 2 variables each with 20 data points.

We will give cbind() two arguments, the variable names of our two vectors.

PhysCharacteristics <- cbind(Height, Weight)

head(PhysCharacteristics)
##      Height Weight
## [1,]    186     83
## [2,]    176     61
## [3,]    157     55
## [4,]    166     63
## [5,]    142     79
## [6,]    155     59

Great, now we have a grid of data.

We could get a summary of our data if we wanted

summary(PhysCharacteristics)
##      Height          Weight    
##  Min.   :121.0   Min.   :51.0  
##  1st Qu.:150.2   1st Qu.:60.0  
##  Median :159.5   Median :65.5  
##  Mean   :159.1   Mean   :67.6  
##  3rd Qu.:170.0   3rd Qu.:73.5  
##  Max.   :190.0   Max.   :89.0

Note

Matrices and arrays come with a limiting aspect; each column of data must be the same length. If we try to combine two objects of different lengths, we'll get a warning and R will recycle the values of the shorter object until it matches the length of the longer object.

v1 <- c(1:11)
v2 <- c(1,2)
v3 <- cbind(v1, v2)
## Warning in cbind(v1, v2): number of rows of result is not a multiple of vector
## length (arg 2)
v3
##       v1 v2
##  [1,]  1  1
##  [2,]  2  2
##  [3,]  3  1
##  [4,]  4  2
##  [5,]  5  1
##  [6,]  6  2
##  [7,]  7  1
##  [8,]  8  2
##  [9,]  9  1
## [10,] 10  2
## [11,] 11  1

Also, matrices, like vectors, are atomic, that is, an array can only contain one data type - numeric, character, logical etc. We cannot have a matrix where one vector is numeric data and the second is character data. We have other options that we'll look at for this in a moment.

Let's now create a matrix, with three columns to see a few more things that we can do with matrices and arrays.

Say that we are also interested in collecting age in addition to height and weight. As before create a sample of 20 ages from an undergraduate population, so, between 20 and 25:

set.seed(120)

Age <- sample(20:25, 20, replace = TRUE) # A random sample of 20 data points from the range 20 to 25 where values can be replaced

Age
##  [1] 24 22 20 25 23 21 20 21 20 23 24 23 22 20 23 22 22 21 22 23

We'll employ the function cbind() again to join this new variable and associated data with our existing matrix.

PhysCharacteristics <- cbind(PhysCharacteristics, Age) ## overwrite PhysCharacteristics with a merger between PhysCharacteristics and Age

head(PhysCharacteristics) ## display the first few rows
##      Height Weight Age
## [1,]    186     83  24
## [2,]    176     61  22
## [3,]    157     55  20
## [4,]    166     63  25
## [5,]    142     79  23
## [6,]    155     59  21

We now have three columns, one for Height, Weight and Age respectively. And we’re finally starting to see something like a proper data set!

rbind()

In the same way that we can add columns to our data, we can also add rows to our data. We are not limited to adding a single column or a single row at a time.

In the following steps, we will use rbind(), as in row bind, to merge two data sets together.

Say we get two data sets of height, weight, and age data, each collected by two different researchers. We already have one set PhysCharacteristics. We'll create a second, ResearcherTwo, to work through this.

As before, we'll take two samples and cbind them together.

set.seed(121) ## not strictly necessary, but makes sure we're all in the same place

## generate some data

Height2 <- sample(120:190, 20, replace = TRUE)

Weight2 <- sample(50:90, 20, replace = TRUE)

Age2 <- sample(20:25, 20, replace = TRUE)

## print that data to the screen to verify

Height2
##  [1] 171 147 131 190 123 178 120 170 132 149 184 127 185 184 153 155 156 151 159
## [20] 174
Weight2
##  [1] 67 61 65 63 55 57 67 83 55 83 66 54 62 80 76 62 76 83 69 83
Age2
##  [1] 22 21 24 20 21 23 21 20 21 21 24 20 21 22 23 21 22 22 22 25
ResearcherTwo <- cbind(Height2, Weight2, Age2) ## column bind the data

head(ResearcherTwo) ## check things out
##      Height2 Weight2 Age2
## [1,]     171      67   22
## [2,]     147      61   21
## [3,]     131      65   24
## [4,]     190      63   20
## [5,]     123      55   21
## [6,]     178      57   23

Now we put them together into a new matrix called completePhysDataSet.

completePhysDataSet <- rbind(PhysCharacteristics, ResearcherTwo)

We'll use View() to see if we were successful

View(completePhysDataSet)

While we can use summary() to get some basic stats about our data

summary(completePhysDataSet)
##      Height          Weight           Age       
##  Min.   :120.0   Min.   :51.00   Min.   :20.00  
##  1st Qu.:145.8   1st Qu.:60.00   1st Qu.:21.00  
##  Median :157.5   Median :66.00   Median :22.00  
##  Mean   :158.0   Mean   :67.97   Mean   :21.93  
##  3rd Qu.:174.5   3rd Qu.:76.00   3rd Qu.:23.00  
##  Max.   :190.0   Max.   :89.00   Max.   :25.00

We can also get information about the structure of our new data object with str()

str(completePhysDataSet)
##  int [1:40, 1:3] 186 176 157 166 142 155 153 168 190 163 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:3] "Height" "Weight" "Age"

Atomic data

Remember we said that matrices, like vectors, are atomic. Let’s see what the implications of this are.

Imagine we were also collecting eye colour as a variable. What we’ll do then is create a vector of 40 random samples of eye colour and add it as a fourth column to our matrix, named this time, physData.eyeColour.sample.

Remember, we already created a vector of eye colours from which we can generate a random sample.

eyeColour
## [1] "blue"  "green" "brown" "hazel" "amber" "grey"

We'll use this to generate a sample of 40 data points that we can add to our completePhysDataSet

set.seed(120)

physData.eyeColour.sample <- sample(eyeColour, 40, replace = TRUE) # A random sample of 40 data points from the character options in the variable eyeColour where values can be replaced

physData.eyeColour.sample
##  [1] "amber" "brown" "blue"  "grey"  "hazel" "green" "blue"  "green" "blue" 
## [10] "hazel" "amber" "hazel" "brown" "blue"  "hazel" "brown" "brown" "green"
## [19] "brown" "hazel" "green" "brown" "amber" "hazel" "grey"  "green" "blue" 
## [28] "blue"  "hazel" "green" "hazel" "grey"  "grey"  "grey"  "green" "green"
## [37] "amber" "brown" "brown" "green"

Now, we'll add this to completePhysDataSet with cbind()

completePhysDataSet <- cbind(completePhysDataSet, physData.eyeColour.sample)

head(completePhysDataSet)
##      Height Weight Age  physData.eyeColour.sample
## [1,] "186"  "83"   "24" "amber"                  
## [2,] "176"  "61"   "22" "brown"                  
## [3,] "157"  "55"   "20" "blue"                   
## [4,] "166"  "63"   "25" "grey"                   
## [5,] "142"  "79"   "23" "hazel"                  
## [6,] "155"  "59"   "21" "green"

Outcome: R has adapted on the fly to allow you to do this.

Inadvertently, however, it has converted all of your numerical data to character data! Remember we said that R is a little too easy going and let's you do things it probably shouldn't? This is one case in point. Instead of providing you with an error or a warning, it has coerced the other data in your matrix to conform to the data type you've added.

You should always be careful and always inquire to see if what you intended to have happen actually happened.

Let's reset things…

completePhysDataSet <- rbind(PhysCharacteristics, ResearcherTwo) ## overwrite the variable holding the data set back to when we first added the two data sets together

Indexing

Matrices, like vectors are indexed, but now they have two axes that are indexed, rows and columns.

We can work with these just as we did with vectors, but we now need to specify if we're interested in a row or a column of data - that is, a variable or an observation.

We do this in the following way: matrix[row, column].

So, say we wanted to see only the Height data. We could can do this in one of two ways, with an index number or an index label:

completePhysDataSet[, 1]
##  [1] 186 176 157 166 142 155 153 168 190 163 130 184 154 161 163 158 180 139 121
## [20] 136 171 147 131 190 123 178 120 170 132 149 184 127 185 184 153 155 156 151
## [39] 159 174
completePhysDataSet[, "Height"] ## since labels are characters, they need to wrapped in " "
##  [1] 186 176 157 166 142 155 153 168 190 163 130 184 154 161 163 158 180 139 121
## [20] 136 171 147 131 190 123 178 120 170 132 149 184 127 185 184 153 155 156 151
## [39] 159 174

If we wanted to see Height and Age only, we can concatenate just as we did with our vector:

head(completePhysDataSet[, c(1,3)])
##      Height Age
## [1,]    186  24
## [2,]    176  22
## [3,]    157  20
## [4,]    166  25
## [5,]    142  23
## [6,]    155  21
head(completePhysDataSet[, c("Height", "Age")])
##      Height Age
## [1,]    186  24
## [2,]    176  22
## [3,]    157  20
## [4,]    166  25
## [5,]    142  23
## [6,]    155  21

To see a row:

completePhysDataSet[1, ]
## Height Weight    Age 
##    186     83     24
completePhysDataSet[1:3, ]
##      Height Weight Age
## [1,]    186     83  24
## [2,]    176     61  22
## [3,]    157     55  20

Note

By not indicating what row or column we’re interested in, R defaults to showing us the entire row or column.

To see an intersection:

completePhysDataSet[1,3]
## Age 
##  24
completePhysDataSet[1, "Age"]
## Age 
##  24
completePhysDataSet[1:3, c("Height", "Age")]
##      Height Age
## [1,]    186  24
## [2,]    176  22
## [3,]    157  20

Questions

Three questions that we can ask of multi-dimensional structures in R that can come in handy include.

  • nrow() which tells us the number of rows
  • ncol() which tells us the number of columns
  • dim() which tells us both pieces of information
nrow(completePhysDataSet)
## [1] 40
ncol(completePhysDataSet)
## [1] 3
dim(completePhysDataSet)
## [1] 40  3

Note

The output of all of these inquiries are data structures themselves. In fact, the output of the three above is a vector consisting of either a single numeric value or two numeric values.

is.vector(nrow(completePhysDataSet))
## [1] TRUE

This means we can use these outputs for other calculations.

newCalculation <- nrow(completePhysDataSet) + ncol(completePhysDataSet)

newCalculation
## [1] 43

What have we learned

  • We can store 2 dimensional data in a matrix
  • A matrix is atomic, that is, it can only hold one data type
  • We can combine vectors to make a matrix with cbind()
  • We can combine datasets together with rbind()
  • The vectors of an array need be of the same length, or recycling happens.

Exercise 4.1

Let’s bring together a few of the things we’ve learned to organize some of our data. And in doing so, introduce a few more useful functions.

Take a few minutes and try to:

  • sort your Height vector data from lowest to highest value;
  • list only the unique values in this same data;
  • list a summary of your data as if it were categorical data;
  • sort your summary from above from lowest to highest.

Do the above using the following functions:

  • sort()
  • unique()
  • as.factor
  • summary