Functions take in data and do something with it. We can write our own functions, but in R we’ll mostly use functions that have already been written. You’ve already seen three functions: sqrt(), log10(), and log2().

Calling a function requires two things: the name of the function and, inside the parentheses, any arguments the function allows us to specify. One of these arguments is usually the data to operate on, and there are often other parameters that control how the function behaves.
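As a minimal sketch of what a call looks like, the example below passes data as the first argument and then sets one extra argument by name. round() and its digits argument are part of base R but are not introduced in the text above; they are used here only for illustration, as is the small vector x.

```r
# A small illustrative vector (not from the text)
x <- c(4, 9, 16)

# One argument: the data
roots <- sqrt(x)
roots

# Two arguments: the data, plus a named argument that changes the behaviour
rounded <- round(sqrt(2), digits = 3)
rounded

# args() lists the arguments a function accepts
args(round)
```

Naming an argument (digits = 3) makes the call easier to read than relying on its position alone.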

Rivers & Descriptive Stats

We’ll explore the basics of a handful of functions for descriptive statistics using a built-in data set called rivers. We can look at the data in rivers simply by typing its name:

rivers # print the rivers data set
  [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315  870
 [16]  906  202  329  290 1000  600  505 1450  840 1243  890  350  407  286  280
 [31]  525  720  390  250  327  230  265  850  210  630  260  230  360  730  600
 [46]  306  390  420  291  710  340  217  281  352  259  250  470  680  570  350
 [61]  300  560  900  625  332 2348 1171 3710 2315 2533  780  280  410  460  260
 [76]  255  431  350  760  618  338  981 1306  500  696  605  250  411 1054  735
 [91]  233  435  490  310  460  383  375 1270  545  445 1885  380  300  380  377
[106]  425  276  210  800  420  350  360  538 1100 1205  314  237  610  360  540
[121] 1038  424  310  300  444  301  268  620  215  652  900  525  246  360  529
[136]  500  720  270  430  671 1770

rivers consists of a single variable: the lengths, in miles, of 141 major rivers in North America.

To calculate the mean of these data, we use the function mean():

mean(rivers) # calculate the mean of rivers
[1] 591.1844

For the median,

median(rivers) # calculate the median of rivers
[1] 425

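There is no built-in function for the statistical mode of a data set. (R does have a function named mode(), but it reports an object’s storage type, not its most frequent value.) We’ll look later at how we can calculate the mode.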

But we can calculate the variance and standard deviation. Both var() and sd() use the sample formula, which divides by n - 1 rather than n:

var(rivers) # variance of rivers
[1] 243908.4
sd(rivers) # standard deviation of rivers
[1] 493.8708
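To make the link between these two results concrete, here is a short sketch that recomputes the variance by hand and checks it against var() and sd(). The variable names n and manual_var are illustrative only.

```r
# Number of observations in the built-in rivers data set
n <- length(rivers)

# Sample variance by hand: squared deviations from the mean,
# summed and divided by n - 1
manual_var <- sum((rivers - mean(rivers))^2) / (n - 1)

# all.equal() compares numbers while allowing for tiny rounding differences
all.equal(manual_var, var(rivers))

# The standard deviation is the square root of the variance
all.equal(sqrt(var(rivers)), sd(rivers))
```

Both comparisons should report TRUE, since sd() is defined as the square root of var().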

Arguments

So far, the only ‘argument’ we’ve passed to any of these functions is the data itself.

When calculating the mean, it is not uncommon to trim a fraction of the observations from the lower and upper ends of the data set, which reduces the influence of extreme values. How much to trim is an argument we can pass to mean(). The value assigned to trim is the fraction removed from each end, from 0 (the default, no trimming) up to 0.5; at 0.5 the result is the median.

mean(rivers, trim = 0.1) # drops the lowest and highest 10% of the values before averaging
[1] 490.9469
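One way to see what trim does is to reproduce it by hand. The sketch below sorts the data, drops the same number of values from each end, and averages what is left. It assumes the number dropped from each end is the trim fraction times the number of observations, rounded down; the names k, sorted, and manual_trimmed are illustrative.

```r
n <- length(rivers)

# Number of values dropped from each end (assumed: rounded down)
k <- floor(n * 0.1)

# Sort, then keep only the middle values
sorted <- sort(rivers)
manual_trimmed <- mean(sorted[(k + 1):(n - k)])

# Compare with the built-in trimmed mean
all.equal(manual_trimmed, mean(rivers, trim = 0.1))

# At the upper limit, the trimmed mean matches the median
all.equal(mean(rivers, trim = 0.5), median(rivers))
```

Because a few very long rivers pull the ordinary mean upward, the trimmed mean sits closer to the median than the untrimmed mean does.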

Sampling

Another useful function is sample(), which takes a random sample from a data set. sample() accepts several arguments: the first is the data set to sample from, and the second is the number of values to draw. By default it samples without replacement, so no observation is drawn twice.

To take a random sample of 5 values from rivers,

sample(rivers, 5) # random sample of 5 from rivers. Will be different every time you run it.
[1]  270  383  460  390 1459

This is a useful tool for generating data to test code or building random subsets of a data set to support analysis. We’ll explore this application later.
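Because sample() returns different values on each run, it can help to fix the random number generator's starting point when results need to be repeatable. The sketch below uses set.seed() for that, and also shows the replace argument. Neither is covered in the text above, and the seed value 42 is arbitrary.

```r
# Setting the same seed before each call gives the same sample
set.seed(42)
first <- sample(rivers, 5)

set.seed(42)
second <- sample(rivers, 5)

identical(first, second)

# replace = TRUE allows the same observation to be drawn more than once
resampled <- sample(rivers, 10, replace = TRUE)
resampled
```

Sampling with replacement is what makes it possible to draw more values than the data set contains.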

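Other functions that generate random values include runif(), which draws from a uniform distribution, and rnorm(), which draws from a normal distribution. Unlike sample(), they create new numbers rather than picking from existing data.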
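A brief sketch of both functions follows. The argument values (a range of 0 to 10, a mean of 100, a standard deviation of 15) and the seed are arbitrary choices made for illustration.

```r
set.seed(1) # arbitrary seed so the draws can be repeated

# Five values spread evenly between min and max
u <- runif(5, min = 0, max = 10)
u

# Five values from a normal distribution with the given mean and sd
z <- rnorm(5, mean = 100, sd = 15)
z
```

In both functions the first argument is the number of values to generate, and the remaining arguments describe the distribution.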

Functions are mini programs that do things with our data. They generally have parameters, or arguments, that can be specified to customize how the function operates.

Function   Description
mean()     Calculate the mean of a set of values.
median()   Calculate the median of a set of values.
var()      Calculate the variance of a set of values.
sd()       Calculate the standard deviation of a set of values.
sample()   Take a random sample from a set of values.