Scripting

Next we look at creating random samples of data in more detail, how we can run calculations on our data, and how we can replace values in our data.

In addition to mathematical operators, we will learn about the following functions:

  • sample() and the related random-generation functions rnorm() and runif()
  • replace()

In doing so, we're also going to walk out of the console and move into a scripting environment.

Scripting

By writing our code in a script, we can save our code and don't need to retype it every time we want to run it.

Scripting is really the first step to creating a reproducible environment for your work that also saves you time in the long run.

From the File menu, select New File > R Script.

This next bit is important

Save your script on your desktop in a folder called RScripts. This will be important later in this workshop.
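
Once a script is saved, all of it can be run in one go with source(). The sketch below is an illustration only: it writes a two-line script to a temporary folder (the file name example.R is hypothetical; your own script would live in the RScripts folder instead) and then runs it.

```r
# Hypothetical script location; in the workshop this would be a file in your RScripts folder
scriptPath <- file.path(tempdir(), "example.R")

# Write a tiny two-line script to that file
writeLines(c("total <- 2 + 2", "print(total)"), scriptPath)

# Run every line of the script, top to bottom
source(scriptPath)
## [1] 4
```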

Simple math

R allows us to perform math on our data.

  • +
  • -
  • *
  • /

This can be as simple as typing in

2+2
## [1] 4
2*3
## [1] 6
3-2
## [1] 1
4/2
## [1] 2

But that's not overly exciting. We can get a little fancier:

3^3
## [1] 27
sqrt(81)
## [1] 9
log(1200)
## [1] 7.090077
5%%2
## [1] 1
4%%2
## [1] 0
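
The last two lines use %%, the modulo operator, which returns the remainder after division. A small sketch pairing it with integer division, %/%, which returns the whole-number part:

```r
5 %/% 2 # integer division: how many whole times 2 goes into 5
## [1] 2
5 %% 2 # modulo: what is left over
## [1] 1
(5 %/% 2) * 2 + (5 %% 2) # the two pieces rebuild the original number
## [1] 5
```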

Still not terribly exciting, as this is basically just a glorified calculator. But let's look at this in more depth.

Math on Vectors

First: we know how to create a vector and assign that vector to a variable. So let's get some data prepped and try the following:

  1. Create 2 vectors of equal length
  2. Add them together with the results contained in a new variable.
a <- c(1:5)
a
## [1] 1 2 3 4 5
b <- c(6:10)
b
## [1]  6  7  8  9 10
c <- a + b
c
## [1]  7  9 11 13 15

Second: let’s do another calculation on c, this time multiplication:

d <- c * 5

d
## [1] 35 45 55 65 75
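
The same element-by-element behaviour applies to the other operators and to math functions such as sqrt(). A brief sketch using a small illustrative vector (squares is a name made up for this example):

```r
squares <- c(1, 4, 9, 16, 25)

sqrt(squares) # square root of every element
## [1] 1 2 3 4 5
squares %% 2 # remainder of every element after dividing by 2
## [1] 1 0 1 0 1
squares - 1 # subtract 1 from every element
## [1]  0  3  8 15 24
```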

Note

Each time we're storing our new, computed data in a new variable.

If we're going to be running a computation on our data, it is advisable to store the result in a new variable for the purposes of error tracking and transparency.

This last example highlights an interesting feature of R: vectorization is built in from the ground up. In a language without vectorization, multiplying every value in c by 5 would require iterating over the values one at a time:

x <- vector() # create an empty vector to hold our results

for (i in seq_along(c)) { # for each position from 1 through to the length of c
  x[i] <- c[i] * 5 # multiply that element by 5 and store it at the same position in x
}

x # print the results
## [1] 35 45 55 65 75

That is, we would need to painstakingly say, "Here’s an empty object ready to hold our computed data. Now, for every value in c, multiply that value by 5 and then iteratively put the result into the empty object, which then starts to fill up".

This is partly why R is so nice: you don't need to know how to program (for example, how to loop through data) for basic applications. And yet, you get many of the benefits of a programming environment.
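
To check that the two approaches really agree, here is a minimal sketch that runs both on the same values and compares the results (values, looped, and vectorized are names chosen for this illustration):

```r
values <- c(7, 9, 11, 13, 15)

# Loop version: fill a pre-sized result vector one position at a time
looped <- numeric(length(values))
for (i in seq_along(values)) {
  looped[i] <- values[i] * 5
}

# Vectorized version: one expression, no loop
vectorized <- values * 5

identical(looped, vectorized)
## [1] TRUE
```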

There's another key takeaway here. R frequently recycles data values. In the above example, 5 is its own vector of length one. When R reaches the end of that vector, its values are recycled until it matches the length of the longer vector it's being computed against. To see this perhaps a bit more clearly:

short <- c(1,2)
long <- c(1:10)

unity <- short * long

short
## [1] 1 2
long
##  [1]  1  2  3  4  5  6  7  8  9 10
unity
##  [1]  1  4  3  8  5 12  7 16  9 20
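
Recycling can be made explicit with rep(). In this sketch, stretched is an illustrative name for the short vector repeated out to the length of the long one, which mirrors what R does implicitly:

```r
short <- c(1, 2)
long <- c(1:10)

stretched <- rep(short, length.out = length(long)) # repeat short until it has 10 values
stretched
##  [1] 1 2 1 2 1 2 1 2 1 2

identical(stretched * long, short * long) # same result as implicit recycling
## [1] TRUE
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is often a sign that the vectors were not meant to be combined.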

What have we learned

  • R can be used as a glorified calculator
  • Mathematical operations are applied element by element across a vector in R, with shorter vectors recycled to match longer ones

Conditions and Testing

Now we're going to look at how we can isolate or inquire about only a portion of our data.

To do this, we're going to build a vector of 20 random numbers between 20 and 35, pretending that these are temperatures representing daily highs over a given period of time.

sample()

Generating sample data can be a great way of preparing your analyses before collecting data. When thinking about limiting bias in research design, the more that can be planned in advance, the less of the overall process is decided ad hoc after the study has begun; this is of particular importance in confirmatory (hypothesis-testing) research. When doing exploratory research or data cleaning, sampling lets you refine techniques on a smaller, more manageable data set.

While random sampling is a big topic, three common sampling functions that you'll find in R are rnorm() for normally distributed data generation, runif() for uniformly distributed data generation, and sample() for, well, basic sampling.
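
As a rough sketch of the first two functions, with illustrative parameter values (the mean of 27 and standard deviation of 4 are assumptions chosen for this example, not values from the workshop data):

```r
set.seed(120) # fixes the random number stream so the draws are reproducible

normalTemps <- rnorm(20, mean = 27, sd = 4) # 20 values drawn from a normal distribution
uniformTemps <- runif(20, min = 20, max = 35) # 20 values spread evenly between 20 and 35

length(normalTemps)
## [1] 20
range(uniformTemps) # the smallest and largest draw, both within 20 and 35
```

Both functions return decimal numbers rather than whole numbers, so they suit continuous measurements.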

The sample() function takes four arguments:

  • a range to sample from
  • the sample size
  • a declaration of whether or not repeat selections are allowed
  • a weighted probability

We apply the arguments in this pattern: sample(dataSource, sampleSize, repeatsAllowed, Probabilities). In R's own documentation these arguments are named x, size, replace, and prob.

For this example, we'll ignore weighted probability.

set.seed(120) ## makes things reproducible

dailyHighs <- sample(20:35, 20, replace = TRUE) # 20 samples with replacement between 20 and 35

dailyHighs
##  [1] 24 22 28 25 34 26 23 21 34 20 21 20 26 31 32 31 30 20 31 22
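
For completeness, a sketch of the fourth argument. The weather labels and weights here are invented for illustration; prob takes one weight per item in the source vector. Resetting the seed before each call also shows why set.seed() makes results reproducible:

```r
set.seed(120)
first <- sample(c("sunny", "rainy"), 10, replace = TRUE, prob = c(0.8, 0.2))

set.seed(120) # same seed again
second <- sample(c("sunny", "rainy"), 10, replace = TRUE, prob = c(0.8, 0.2))

identical(first, second) # same seed, same draw
## [1] TRUE
```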

To get a bit more familiar with our sample, we'll test for values above a certain threshold, let's say above 27.

R allows comparison testing using operators such as <, >, and ==. Note that a single = performs assignment (x = 5 stores 5 in x), while == performs a test (x == 5 asks whether x equals 5).
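
R also provides >=, <=, and != (not equal). A short sketch on a small illustrative vector (temps is a name made up for this example):

```r
temps <- c(24, 28, 27) # illustrative values

temps >= 27 # greater than or equal to
## [1] FALSE  TRUE  TRUE
temps == 27 # exactly equal to
## [1] FALSE FALSE  TRUE
temps != 27 # not equal to
## [1]  TRUE  TRUE FALSE
```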

Our first inclination might be to type the variable with a greater-than sign in the hope that R will tell us which values in the variable are greater than 27.

Let’s give that a try:

dailyHighs > 27
##  [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [13] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

So, this is interesting, and not exactly what we wanted.

This first statement is not showing us the values; it is testing the condition (dailyHighs is greater than 27) for each element and telling us whether it is TRUE or FALSE. The output is a logical vector. We can confirm this:

is.logical(dailyHighs > 27)
## [1] TRUE

Let’s try this again a little differently:

dailyHighs[dailyHighs > 27]
## [1] 28 34 34 31 32 31 30 31

That’s better. This time, we successfully asked R to print out the values of our vector dailyHighs where it is true that dailyHighs is greater than 27.

An alternative approach would be to store the logical vector as its own variable and then to pass that variable into dailyHighs.

greaterThan27 <- dailyHighs > 27 ## assign the logical output of the > test to "greaterThan27"

dailyHighs[greaterThan27] ## print the values of dailyHighs where it is TRUE that the temps are higher than 27
## [1] 28 34 34 31 32 31 30 31

We can explore this as a side by side too.
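
The code behind the table is not shown here; one plausible way to build it, assuming a data frame is used to line the two vectors up, is sketched below with the sample values typed in by hand so the block runs on its own (sideBySide is an illustrative name):

```r
dailyHighs <- c(24, 22, 28, 25, 34, 26, 23, 21, 34, 20,
                21, 20, 26, 31, 32, 31, 30, 20, 31, 22)
greaterThan27 <- dailyHighs > 27

sideBySide <- data.frame(dailyHighs, greaterThan27) # one column per vector
sideBySide # print the two columns next to each other
```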

##    dailyHighs greaterThan27
## 1          24         FALSE
## 2          22         FALSE
## 3          28          TRUE
## 4          25         FALSE
## 5          34          TRUE
## 6          26         FALSE
## 7          23         FALSE
## 8          21         FALSE
## 9          34          TRUE
## 10         20         FALSE
## 11         21         FALSE
## 12         20         FALSE
## 13         26         FALSE
## 14         31          TRUE
## 15         32          TRUE
## 16         31          TRUE
## 17         30          TRUE
## 18         20         FALSE
## 19         31          TRUE
## 20         22         FALSE

Indexing

What’s happening here? Indexing. R maintains an index of the placement of your variable values in your data object. We ask about the value at a particular index using square brackets [ ].

If we were to type

dailyHighs[1]
## [1] 24

R would return the first value in our vector. We can ask about any range within our vector. For example, the first three values:

dailyHighs[1:3]
## [1] 24 22 28

Or, like above, we can ask for a range based off of a condition, such as, all values where the variable dailyHighs is greater than 27.

dailyHighs[dailyHighs > 27]
## [1] 28 34 34 31 32 31 30 31
dailyHighs[greaterThan27]
## [1] 28 34 34 31 32 31 30 31
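
To see the index positions themselves rather than the values, which() can help. A minimal sketch on a small illustrative vector (temps is a made-up name), with a negative index included for comparison:

```r
temps <- c(24, 22, 28, 25, 34) # illustrative values

which(temps > 27) # positions, not values, where the condition is TRUE
## [1] 3 5
temps[which(temps > 27)] # the values at those positions
## [1] 28 34
temps[-1] # a negative index drops that position
## [1] 22 28 25 34
```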

In short:

dailyHighs > 27 # is testing a condition that returns a logical vector of TRUE and FALSE values
##  [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [13] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
dailyHighs[dailyHighs > 27] ## is inquiring about the indexed values that satisfy the condition
## [1] 28 34 34 31 32 31 30 31
dailyHighs[greaterThan27] ## is the same as above, but storing the output of `dailyHighs > 27` inside of a new variable
## [1] 28 34 34 31 32 31 30 31

Exercise 3.3

  1. How many of the values in dailyHighs are higher than 27?
  2. Get a summary, similar to this, of the data in dailyHighs.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   21.75   25.50   26.05   31.00   34.00

Note

Your values may be different if your sample differs from ours, but the descriptive categories should be the same.

What have we learned

  • We can test a vector for a condition and get a logical vector in return
  • We can retrieve the values from only a portion of the data in a vector. This subset of our data can be defined by:
    • an index point or range; or
    • a condition

replace()

Now that we know how to isolate values in our variables, we can start to manipulate portions of our data. To do this, we'll explore the replace() function.

replace() requires three arguments:

  • a vector
  • a list of index values to be replaced (or a condition on the vector that identifies them)
  • a new value for replacement (this can be a computed value)

We apply the arguments in this pattern: replace(vector, list, value)

We'll start by replacing the first 3 temperatures with 0.

dailyHighs.zeroStart <- replace(dailyHighs, c(1,2,3), 0)

dailyHighs.zeroStart
##  [1]  0  0  0 25 34 26 23 21 34 20 21 20 26 31 32 31 30 20 31 22

Outcome: Not very exciting, but if we know where a value is in our vector, we can change it.
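
Note that replace() returns a modified copy and leaves the original vector alone, which is why the result above was stored in a new variable. A small sketch with illustrative values (temps and zeroed are made-up names):

```r
temps <- c(24, 22, 28, 25, 34) # illustrative values

zeroed <- replace(temps, 1:3, 0) # replace positions 1 to 3 with 0

zeroed # the modified copy
## [1]  0  0  0 25 34
temps # the original is untouched
## [1] 24 22 28 25 34
```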

As we've seen, one way that we can know where a value is located in our vector is to conditionally test for it.

We'll do this by replacing all temperatures that are greater than 27 with 0. As a quick reminder:

dailyHighs > 27 ## tests for a condition
##  [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [13] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
dailyHighs[dailyHighs > 27] ## while we can get a list of values where the condition is TRUE with this
## [1] 28 34 34 31 32 31 30 31

Let’s try this out.

dailyHighs.gt27 <- replace(dailyHighs, dailyHighs > 27, 0) # in the variable dailyHighs, where the condition is TRUE, replace with 0

dailyHighs.gt27
##  [1] 24 22  0 25  0 26 23 21  0 20 21 20 26  0  0  0  0 20  0 22

Alternatively, we can use our stored variable greaterThan27:

dailyHighs.gt27 <- replace(dailyHighs, greaterThan27, 0)

dailyHighs.gt27
##  [1] 24 22  0 25  0 26 23 21  0 20 21 20 26  0  0  0  0 20  0 22
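
The same result can also be reached by assigning into an indexed subset with square brackets. This is offered as a sketch of an equivalent approach rather than the method used in this workshop; temps, viaReplace, and viaBrackets are illustrative names:

```r
temps <- c(24, 22, 28, 25, 34) # illustrative values

viaReplace <- replace(temps, temps > 27, 0)

viaBrackets <- temps # work on a copy so temps stays intact
viaBrackets[viaBrackets > 27] <- 0 # assign 0 wherever the condition is TRUE

identical(viaReplace, viaBrackets)
## [1] TRUE
```

The bracket form changes the vector in place, so copying first keeps the original data available for error tracking.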

Exercise 3.4

See if you can now add a bit of math into this scenario. Exactly as above, replace the temperatures in dailyHighs that are above 27, but this time, replace them with values that are double their own, so if you have 30, it becomes 60, 32 becomes 64 and so on. Your output should look like the following:

##  [1] 24 22 56 25 68 26 23 21 68 20 21 20 26 62 64 62 60 20 62 22

What have we learned

  • We can replace values in our data based on either:
    • A known index point
    • A condition being met that allows an index point to be targeted
  • We can perform multiple tasks in tandem, for example, replacing a value with a computed derivative of its original value.