Functions take in data and do things with that data. We can write our
own functions, but generally in R we’ll be using functions that have
already been written. You’ve already seen three functions:
sqrt(), log10(), and log2().
Calling a function requires two things: the name of the function and any arguments the function accepts. One of these arguments is usually the data to operate on, and there are often other parameters we can specify.
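As a small illustration of a call with more than one argument, the sketch below uses log(), a base R function not covered above. Its optional base argument can be matched by position or by name; the variable names are only illustrative.

```r
# The first argument is the data; 'base' is an optional second argument.
log_by_position <- log(100, 10)        # arguments matched by position
log_by_name <- log(100, base = 10)     # same call, argument matched by name

# Both calls give the same result as log10(100).
print(log_by_position)
print(log_by_name)

# args() shows the arguments a function accepts, with any defaults.
print(args(log))
```

Naming an argument, as in base = 10, tends to make a call easier to read once a function takes several parameters.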
We’ll explore the basics of a handful of functions for descriptive statistics with a built-in data set called rivers. We can look at the data in rivers simply by calling it:
rivers # call the rivers dataset
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870
[16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280
[31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600
[46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350
[61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
[76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735
[91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377
[106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540
[121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529
[136] 500 720 270 430 671 1770
rivers consists of a single variable: the lengths, in miles, of 141 major rivers in North America.
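Printing the whole vector works for a data set this small, but a few base R helpers give a quicker overview. This is a minimal sketch; length(), head(), and str() are not introduced above, and the variable names are only illustrative.

```r
# rivers ships with base R (in the datasets package), so nothing needs loading.
n_rivers <- length(rivers)    # how many values the vector holds
first_six <- head(rivers)     # the first six values, in stored order

print(n_rivers)
print(first_six)
str(rivers)                   # compact summary: type, length, first few values
```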
To calculate the mean of these data, we use the function mean():
mean(rivers) # calculate the mean of rivers
[1] 591.1844
For the median, we use median():
median(rivers) # calculate the median of rivers
[1] 425
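There is no built-in function for the statistical mode of a data set (R’s mode() function reports how an object is stored, not its most common value). We’ll look later at how we can calculate the mode.
We can, however, calculate the variance and standard deviation with var() and sd():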
var(rivers) # variance of rivers
[1] 243908.4
sd(rivers) # standard deviation of rivers
[1] 493.8708
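The two results above are related: the standard deviation is the square root of the variance, and var() uses the sample formula that divides by n - 1. The sketch below checks both relationships on rivers; the variable names are only illustrative.

```r
# Standard deviation computed directly and from the variance.
sd_direct <- sd(rivers)
sd_from_var <- sqrt(var(rivers))
print(all.equal(sd_direct, sd_from_var))

# Sample variance by hand: sum of squared deviations divided by n - 1.
n <- length(rivers)
var_by_hand <- sum((rivers - mean(rivers))^2) / (n - 1)
print(all.equal(var_by_hand, var(rivers)))
```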
So far, the only ‘argument’ we’ve passed to any of these functions is the data itself.
When calculating the mean, it is not uncommon to trim a fraction of the
values from the lower and upper ends of the data set. How much to trim is an
argument we can pass to mean(). The value assigned to trim is the fraction
removed from each end, from 0 (the default) to 0.5; at 0.5, the result is
the median.
mean(rivers, trim = 0.1) # drops the upper and lower 10% of the data set from the calculation
[1] 490.9469
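To see what trimming does, the same calculation can be sketched by hand: sort the values, drop the same number from each end, and average what remains. This assumes the number dropped from each end is the trim fraction times the number of values, rounded down; the variable names are only illustrative.

```r
trim <- 0.1
n <- length(rivers)
k <- floor(n * trim)                # number of values dropped from each end

sorted <- sort(rivers)              # smallest to largest
kept <- sorted[(k + 1):(n - k)]     # discard the k smallest and k largest

manual_trimmed_mean <- mean(kept)
print(manual_trimmed_mean)
print(mean(rivers, trim = 0.1))     # should agree with the manual version
```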
Another function worth knowing is sample(), which takes a random sample
from a data set. sample() can take several arguments, the first of which is
the data set to sample, and the second, the number of values to draw.
To take a random sample of 5 values from rivers,
sample(rivers, 5) # random sample of 5 from rivers. Will be different every time you run it.
[1] 270 383 460 390 1459
This is a useful tool for generating data to test code or building random subsets of a data set to support analysis. We’ll explore this application later.
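Because sample() returns different values on each run, it can help to fix the random seed when results need to be repeatable. The sketch below uses set.seed(), which is not covered above; the seed value 42 is arbitrary, and the variable names are only illustrative. It also shows the replace argument, which allows a value to be drawn more than once.

```r
set.seed(42)                          # arbitrary seed, chosen for illustration
first_draw <- sample(rivers, 5)

set.seed(42)                          # resetting the seed repeats the draw
second_draw <- sample(rivers, 5)
print(identical(first_draw, second_draw))

# With replace = TRUE the sample can be larger than the data set itself.
with_replacement <- sample(rivers, 200, replace = TRUE)
print(length(with_replacement))
```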
Other functions to create random samples include runif() for uniformly
distributed data and rnorm() for normally distributed data.
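A minimal sketch of both functions follows. Unlike sample(), they generate new values rather than drawing from an existing data set. The seed, the sample size of 5, and the variable names are arbitrary choices for illustration.

```r
set.seed(1)   # arbitrary seed so the values repeat from run to run

# Five values drawn uniformly between min and max.
uniform_values <- runif(5, min = 0, max = 1)

# Five values drawn from a normal distribution with the given mean and sd.
normal_values <- rnorm(5, mean = 0, sd = 1)

print(uniform_values)
print(normal_values)
```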
Functions are mini programs that do things with our data. They generally have parameters, or arguments, that can be specified to customize how the function operates.
Function | Description |
---|---|
mean | Calculate the mean of a range of values. |
median | Calculate the median of a range of values. |
var | Calculate the variance of a range of values. |
sd | Calculate the standard deviation of a range of values. |
sample | Take a random sample from a range of values. |