Next we look at creating random samples of data in more detail, how we can run calculations on our data, and how we can replace values in our data.
In addition to mathematical operators, we will learn about the following functions:
sample(), and some of its more specific variants, rnorm() and runif()
replace()
In doing so, we're also going to walk out of the console and move into a scripting environment.
Scripting
By writing our code in a script, we can save our code and don't need to retype it every time we want to run it.
Scripting is really the first step to creating a reproducible environment for your work that also saves you time in the long run.
From the File menu, select New File > R Script.
This next bit is important: save your script on your desktop in a folder called RScripts. This will be important later in this workshop.
R allows us to perform math on our data.
This can be as simple as typing in
2+2
## [1] 4
2*3
## [1] 6
3-2
## [1] 1
4/2
## [1] 2
But that's not overly exciting. We can get a little fancier:
3^3
## [1] 27
sqrt(81)
## [1] 9
log(1200)
## [1] 7.090077
5%%2
## [1] 1
4%%2
## [1] 0
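The last two lines use %%, the modulo operator, which returns the remainder left over after division. Here is a small sketch of how it pairs with integer division (%/%). The numbers and variable names are purely illustrative and are not part of the workshop data.

```r
# %% gives the remainder, %/% gives the whole-number quotient
remainder <- 17 %% 5    # 17 = 3 * 5 + 2, so the remainder is 2
quotient  <- 17 %/% 5   # the whole-number part of 17 / 5, which is 3

# a number is even when dividing it by 2 leaves no remainder
isEven <- c(4, 5) %% 2 == 0
isEven
```

This is why 5%%2 returns 1 (odd) and 4%%2 returns 0 (even).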
Still not terribly exciting, as this is basically just a glorified calculator. But let's look at this in more depth.
First, we know how to create a vector and assign that vector to a variable. So let's get some data prepped and try the following:
a <- c(1:5)
a
## [1] 1 2 3 4 5
b <- c(6:10)
b
## [1] 6 7 8 9 10
c <- a + b
c
## [1] 7 9 11 13 15
Second, let's do another calculation on c, this time multiplication:
d <- c * 5
d
## [1] 35 45 55 65 75
Note
Each time we're storing our new, computed data in a new variable.
If we're going to be running a computation on our data, it is advisable to store the result in a new variable, for the purposes of error tracking and transparency.
This last example highlights an interesting feature of R: vectorization is built into R from the ground up. To multiply every value in c by 5 in a language without vectorization, we'd need to iterate the process:
x <- vector() # create an empty vector to hold our results
for (i in seq_along(c)) { # for each position from 1 through to the length of c (safe even if c is empty)
x[i] <- c[i] * 5 # multiply that element by 5 and pop it into x sequentially
}
x # print the results
## [1] 35 45 55 65 75
That is, we would need to painstakingly say, "Here's an empty object ready to hold our computed data. Now, for every value in c, multiply that value by 5 and then iteratively put the result into the empty object, which then starts to fill up".
This is partly why R is so nice: you don't need to know how to program (how to loop through data, for example) for basic applications. And yet, you get many of the benefits of a programming environment.
There's another key takeaway here. R frequently recycles data values. In the above example, 5 is its own vector of data, with a length of one. When R hits the end of that vector, its values are recycled until it matches the length of the longer vector it's being computed against. To see this perhaps a bit more clearly:
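To check that the vectorized shortcut and the explicit loop really do the same job, we can run both and compare them. This is only a sketch. The names values, vectorized, and looped are made up for this illustration, and values simply repeats the numbers stored in c earlier.

```r
values <- c(7, 9, 11, 13, 15)   # same numbers as the vector c above

# vectorized: one expression handles every element
vectorized <- values * 5

# explicit loop: same result, more bookkeeping
looped <- numeric(length(values))   # pre-sized container for the results
for (i in seq_along(values)) {
  looped[i] <- values[i] * 5
}

identical(vectorized, looped)   # TRUE when both approaches agree
```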
short <- c(1,2)
long <- c(1:10)
unity <- short * long
short
## [1] 1 2
long
## [1] 1 2 3 4 5 6 7 8 9 10
unity
## [1] 1 4 3 8 5 12 7 16 9 20
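One way to picture recycling is to stretch the shorter vector out by hand and compare the result. The sketch below uses rep_len() for that. The names stretched and uneven are invented for this illustration.

```r
short <- c(1, 2)
long  <- 1:10

# stretch short out to the length of long by hand...
stretched <- rep_len(short, length(long))
stretched

# ...which reproduces what R did implicitly in short * long
all(stretched * long == short * long)

# if the longer length is not a multiple of the shorter one,
# R still recycles, but it also prints a warning
uneven <- c(1, 2, 3) * long
```

The warning in the last line is worth heeding: recycling that does not divide evenly is often a sign that two vectors were not meant to be combined.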
R can be used as a glorified calculator.
Now we're going to look at how we can isolate or inquire about only a portion of our data.
To do this, we're going to build a vector of 20 random numbers between 20 and 35, pretending that these are temperatures representing daily highs over a given period of time.
sample()
Generating sample data can be a great way of preparing your analyses in advance of doing data collection. When thinking about limiting bias in research design, the more that can be planned out in advance, the less of the overall process is decided ad hoc after the study is under way; this is of particular importance in confirmatory (hypothesis-testing) research. When doing exploratory research or data cleaning, sampling lets you refine your techniques on a smaller, more manageable data set.
While random sampling is a big topic, three common sampling functions that you'll find in R include rnorm(), for normally distributed data generation, runif(), for uniformly distributed data generation, and sample(), for, well, basic sampling.
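Before we focus on sample(), here is a brief sketch of the other two functions. The seed, the mean, the standard deviation, and the variable names are all arbitrary choices made for this illustration.

```r
set.seed(42)   # arbitrary seed, only so repeated runs give the same draws

# 5 draws from a normal distribution centred on 27 with a standard deviation of 4
normalTemps <- rnorm(5, mean = 27, sd = 4)

# 5 draws spread evenly between 20 and 35 (these are not whole numbers)
uniformTemps <- runif(5, min = 20, max = 35)

normalTemps
uniformTemps
```

Note that rnorm() can return values outside any range we have in mind, while runif() stays between min and max.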
The sample() function takes four arguments, which we apply in this pattern: sample(dataSource, sampleSize, repeatsAllowed, probabilities). In R's own documentation these arguments are named x, size, replace, and prob.
For this example, we'll ignore weighted probability.
set.seed(120) ## makes things reproducible
dailyHighs <- sample(20:35, 20, replace = TRUE) # 20 samples with replacement between 20 and 35
dailyHighs
## [1] 24 22 28 25 34 26 23 21 34 20 21 20 26 31 32 31 30 20 31 22
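For comparison, here is a sketch of the same function with its arguments named, without replacement, and with weighted probabilities. The labels and weights in the last call are invented for illustration, and no output is shown because the exact draws can differ between R versions.

```r
set.seed(120)

# the same kind of call as above, with the argument names written out
named <- sample(x = 20:35, size = 20, replace = TRUE)

# without replacement no value can be drawn twice,
# so size cannot exceed the 16 values available in 20:35
noRepeats <- sample(20:35, 10, replace = FALSE)

# weighted sampling: prob supplies one weight per value in x
weighted <- sample(c("cool", "warm", "hot"), 10, replace = TRUE,
                   prob = c(0.2, 0.5, 0.3))
```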
To get a bit more familiar with our sample, we'll test for values above a certain threshold, say above 27.
R allows us to test comparisons using <, >, and ==. Note that a single = is assignment (x = 5 stores 5 in x), while a double == is a test (x == 5 asks whether x equals 5).
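A short sketch of the comparison operators in action, including a few close relatives (<=, >=, and !=). The vector temps and the variable x are made up for this illustration.

```r
temps <- c(24, 27, 31)   # illustrative values

temps < 27    # strictly less than
temps <= 27   # less than or equal to
temps >= 27   # greater than or equal to
temps == 27   # exactly equal to
temps != 27   # not equal to

x = 27        # a single = assigns, just like x <- 27
x == 27       # a double == asks a question and returns TRUE or FALSE
```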
Our first inclination might be to type the variable with a greater than sign in the hopes that R will tell us what values are greater than 27 in the variable.
Let's give that a try:
dailyHighs > 27
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [13] FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
So, this is interesting, and not exactly what we wanted.
This first statement is not showing us the values. Instead, it tests the condition (dailyHighs is greater than 27) for each element and tells us whether it is TRUE or FALSE. The output is a logical vector. We can confirm that it is a vector:
is.vector(dailyHighs > 27)
## [1] TRUE
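is.vector() only tells us that the result is a vector. To confirm that it is specifically a logical one, is.logical() or class() can be used. In this sketch the values of dailyHighs are typed in from the output printed earlier so that the example does not depend on the random number generator, and test is a made-up variable name.

```r
# values copied from the dailyHighs output printed earlier
dailyHighs <- c(24, 22, 28, 25, 34, 26, 23, 21, 34, 20,
                21, 20, 26, 31, 32, 31, 30, 20, 31, 22)

test <- dailyHighs > 27
is.logical(test)   # TRUE: the elements are TRUE/FALSE values
class(test)        # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the days above 27
sum(test)
```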
Let’s try this again a little differently:
dailyHighs[dailyHighs > 27]
## [1] 28 34 34 31 32 31 30 31
That's better. This time, we successfully asked the system to print out the values of our vector dailyHighs where it is true that dailyHighs is greater than 27.
An alternative approach would be to store the logical vector as its own variable and then to pass that variable into dailyHighs.
greaterThan27 <- dailyHighs > 27 ## assign the logical output of the > test to "greaterThan27"
dailyHighs[greaterThan27] ## print the values of dailyHighs where it is TRUE that the temps are higher than 27
## [1] 28 34 34 31 32 31 30 31
We can explore this as a side by side too.
## dailyHighs greaterThan27
## 1 24 FALSE
## 2 22 FALSE
## 3 28 TRUE
## 4 25 FALSE
## 5 34 TRUE
## 6 26 FALSE
## 7 23 FALSE
## 8 21 FALSE
## 9 34 TRUE
## 10 20 FALSE
## 11 21 FALSE
## 12 20 FALSE
## 13 26 FALSE
## 14 31 TRUE
## 15 32 TRUE
## 16 31 TRUE
## 17 30 TRUE
## 18 20 FALSE
## 19 31 TRUE
## 20 22 FALSE
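The code that produced this side-by-side view is not shown here. One plausible way to build it, offered as an assumption rather than as the original method, is data.frame(). The values are typed in from the earlier output, and sideBySide is a made-up name.

```r
# values copied from the dailyHighs output printed earlier
dailyHighs <- c(24, 22, 28, 25, 34, 26, 23, 21, 34, 20,
                21, 20, 26, 31, 32, 31, 30, 20, 31, 22)
greaterThan27 <- dailyHighs > 27

# each vector becomes a column, named after the variable it came from
sideBySide <- data.frame(dailyHighs, greaterThan27)
head(sideBySide, 3)   # show just the first three rows
```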
What's happening here? Indexing. R maintains an index of the placement of your variable values in your data object. We ask about the value at a particular index using square brackets [ ].
If we were to type
dailyHighs[1]
## [1] 24
R would return the first value in our vector. We can ask about any range within our vector. For example, the first three values:
dailyHighs[1:3]
## [1] 24 22 28
Or, like above, we can ask for a range based on a condition, such as all values where the variable dailyHighs is greater than 27.
dailyHighs[dailyHighs > 27]
## [1] 28 34 34 31 32 31 30 31
dailyHighs[greaterThan27]
## [1] 28 34 34 31 32 31 30 31
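Square brackets accept a few other kinds of index as well. The sketch below shows some that build on the ideas above: which() for positions, negative numbers for exclusion, and two conditions joined with &. The values are typed in from the earlier output.

```r
# values copied from the dailyHighs output printed earlier
dailyHighs <- c(24, 22, 28, 25, 34, 26, 23, 21, 34, 20,
                21, 20, 26, 31, 32, 31, 30, 20, 31, 22)

which(dailyHighs > 27)    # the positions (not the values) where the test is TRUE
dailyHighs[c(1, 20)]      # the first and the last value
dailyHighs[-(1:3)]        # everything except the first three values

# two conditions at once: above 27 and below 32
dailyHighs[dailyHighs > 27 & dailyHighs < 32]
```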
In short:
dailyHighs > 27 # is testing a condition that returns a logical value of TRUE or FALSE
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [13] FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
dailyHighs[dailyHighs > 27] ## is inquiring about the indexed values that satisfy the condition
## [1] 28 34 34 31 32 31 30 31
dailyHighs[greaterThan27] ## is the same as above, but storing the output of `dailyHighs > 27` inside of a new variable
## [1] 28 34 34 31 32 31 30 31
A descriptive summary of dailyHighs:
summary(dailyHighs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 21.75 25.50 26.05 31.00 34.00
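The individual figures in a summary like this can also be computed one at a time, and the same functions work on an indexed subset. A sketch, with the values typed in from the earlier output and hotDays as a made-up name:

```r
# values copied from the dailyHighs output printed earlier
dailyHighs <- c(24, 22, 28, 25, 34, 26, 23, 21, 34, 20,
                21, 20, 26, 31, 32, 31, 30, 20, 31, 22)

min(dailyHighs)
median(dailyHighs)
mean(dailyHighs)
max(dailyHighs)
quantile(dailyHighs, c(0.25, 0.75))   # the 1st and 3rd quartiles

# the same summary, restricted to the days above 27
hotDays <- dailyHighs[dailyHighs > 27]
summary(hotDays)
```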
Note
Your values will be different because our samples are different, but the descriptive categories should be the same.
Now that we know how to isolate values in our variables, we can start to manipulate portions of our data. To do this, we'll explore the replace() function.
replace() requires three arguments, which we apply in this pattern: replace(vector, list, value). Here vector is the data to change, list identifies the elements to change (by position or by a logical test), and value is the replacement.
We'll start by replacing the first 3 temperatures with 0.
dailyHighs.zeroStart <- replace(dailyHighs, c(1,2,3), 0)
dailyHighs.zeroStart
## [1] 0 0 0 25 34 26 23 21 34 20 21 20 26 31 32 31 30 20 31 22
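Two things are worth noticing about replace(): it returns a modified copy and leaves the original vector alone, and it does the same job as assigning through square brackets. A sketch, with the values typed in from the earlier output and zeroStart and copy as made-up names:

```r
# values copied from the dailyHighs output printed earlier
dailyHighs <- c(24, 22, 28, 25, 34, 26, 23, 21, 34, 20,
                21, 20, 26, 31, 32, 31, 30, 20, 31, 22)

zeroStart <- replace(dailyHighs, 1:3, 0)
dailyHighs[1:3]   # the original is untouched: still 24 22 28

# the same change written as an indexed assignment on a copy
copy <- dailyHighs
copy[1:3] <- 0
identical(copy, zeroStart)   # TRUE when both approaches agree
```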
Outcome: Not very exciting, but if we know where a value is in our vector, we can change it.
As we've seen, one way that we can know where a value is located in our vector is to conditionally test for it.
We'll do this by replacing all temperatures that are greater than 27 with 0. As a quick reminder:
dailyHighs > 27 ## tests for a condition
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [13] FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
dailyHighs[dailyHighs > 27] ## while we can get a list of values where the condition is TRUE with this
## [1] 28 34 34 31 32 31 30 31
Let’s try this out.
dailyHighs.gt27 <- replace(dailyHighs, dailyHighs > 27, 0) # in the variable dailyHighs, where the condition is TRUE, replace with 0
dailyHighs.gt27
## [1] 24 22 0 25 0 26 23 21 0 20 21 20 26 0 0 0 0 20 0 22
Alternatively, we can use our stored variable greaterThan27:
dailyHighs.gt27 <- replace(dailyHighs, greaterThan27, 0)
dailyHighs.gt27
## [1] 24 22 0 25 0 26 23 21 0 20 21 20 26 0 0 0 0 20 0 22
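A conditional replacement like this can also be written with ifelse(), which is not covered in this section but follows the same logic: test a condition, then choose one value where it is TRUE and another where it is FALSE. A sketch for comparison, with the values typed in from the earlier output and viaReplace and viaIfelse as made-up names:

```r
# values copied from the dailyHighs output printed earlier
dailyHighs <- c(24, 22, 28, 25, 34, 26, 23, 21, 34, 20,
                21, 20, 26, 31, 32, 31, 30, 20, 31, 22)

viaReplace <- replace(dailyHighs, dailyHighs > 27, 0)

# ifelse(test, valueIfTrue, valueIfFalse), applied element by element
viaIfelse <- ifelse(dailyHighs > 27, 0, dailyHighs)

identical(viaReplace, viaIfelse)   # TRUE when both approaches agree
```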
See if you can now add a bit of math into this scenario. Exactly as above, replace the temperatures in dailyHighs that are above 27, but this time, replace them with values that are double their own, so if you have 30, it becomes 60, 32 becomes 64, and so on. Your output should look like the following:
## [1] 24 22 56 25 68 26 23 21 68 20 21 20 26 62 64 62 60 20 62 22