When we talk about data, we can talk about data types, data classes, and data structures.
Data types are fundamental building blocks for storing information. R has five atomic data types – the data types from which other objects are created. The three of importance to us here are:
Type | Representation |
---|---|
Numeric | Numbers |
Character | Text |
Logical | TRUE and FALSE values |
Character data, also known as strings, are always wrapped in “quotation marks”.
Numeric data can be stored two ways, as integers or as floating point, also called ‘double’.
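As a minimal sketch of the types described above, base R's typeof() reports how a value is stored, while class() and is.numeric() show how R groups integers and doubles together as numeric. The variable names here are illustrative only.

```r
# Illustrative values; the L suffix asks R to store a whole number as an integer
whole_number   <- 5L
decimal_number <- 5

typeof(whole_number)      # "integer"
typeof(decimal_number)    # "double"
class(decimal_number)     # "numeric"

# Both storage modes count as numeric data
is.numeric(whole_number)    # TRUE
is.numeric(decimal_number)  # TRUE

# Character and logical values for comparison
typeof("5")    # "character"
typeof(TRUE)   # "logical"
```

Note that a number typed without the L suffix is stored as a double, even when it has no decimal part.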
Data types can have specific attributes that influence what we can and cannot do with these data. One of these attributes is a class.
Consider the following numbers – 20220301.
Without context, this is simply one big number, or a sequence of smaller numbers. Classify it as a date, however, and it gains a set of rules for how a date is written (yyyymmdd) and a series of conventions for how dates behave (a specific calendar, the length of a year, month, or day, etc.). We can then perform date-specific operations with this data, like calculating a person’s age.
We’ll assign some numbers that could be a date to a variable
(numbers <- 20220301) # create variable numbers as atomic type numeric
[1] 20220301
# convert numbers to class date and assign to a new variable
(numbers_as_date <- as.Date(as.character(numbers), '%Y%m%d'))
[1] "2022-03-01"
After which we can inquire about their class
class(numbers) # inquire about the class
[1] "numeric"
class(numbers_as_date)
[1] "Date"
And see the utility of adding the date class
Sys.Date() # retrieve today's date
[1] "2023-02-14"
# calculate the number of days that have passed since numbers
(days_since_March_1 <- Sys.Date() - numbers) # doesn't make sense
[1] "-53339-10-26"
(days_since_March_1 <- Sys.Date() - numbers_as_date) # works
Time difference of 350 days
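To look a little closer at what the class is doing, the sketch below uses fixed dates (rather than Sys.Date(), whose value changes daily) so the results are reproducible. It suggests that a Date is still stored as an ordinary number underneath and that the class attribute is what tells R to treat it as a date. The names start_date and end_date are illustrative only.

```r
# The same date as in the example above, written as yyyymmdd
start_date <- as.Date("20220301", format = "%Y%m%d")

typeof(start_date)    # "double": the underlying atomic type
class(start_date)     # "Date": the attribute that adds date behaviour
unclass(start_date)   # 19052, the number of days since 1970-01-01

# Date arithmetic works because of the class
start_date + 30       # "2022-03-31"

# A fixed end date matching the output shown above
end_date <- as.Date("2023-02-14")
as.numeric(end_date - start_date)   # 350
```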
Data structures can be thought of as how these data are stored collectively: the structure that groups multiple values from a variable together, or the values from multiple variables together. R has a few basic data structures that you’ll frequently encounter. These include vectors, lists, matrices, and data frames.
A vector is the simplest data structure. It is one-dimensional (think of it as a single column or row of data) and it can only contain data of exactly the same type. So, if you have a sequence of numbers or words in R, these will likely be contained within a vector. In fact, the data set rivers is a vector:
rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870
[16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280
[31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600
[46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350
[61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
[76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735
[91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377
[106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540
[121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529
[136] 500 720 270 430 671 1770
To test if something is a vector, we have a couple of options. We can use is.vector(), but it’s more appropriate to use is.atomic(), because is.vector() also returns TRUE for lists:
is.vector(rivers)
[1] TRUE
is.atomic(rivers)
[1] TRUE
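The rule that a vector holds a single data type can be seen by trying to mix types. In this small sketch (the variable name mixed is illustrative), R silently coerces every value to the most flexible type present. The last two lines show how is.vector() and is.atomic() differ when given a list.

```r
# Mixing numbers, text and logicals: everything becomes character
mixed <- c(1, "two", TRUE)
mixed            # "1" "two" "TRUE"
typeof(mixed)    # "character"

# Mixing numbers and logicals: TRUE and FALSE become 1 and 0
c(1, TRUE, FALSE)            # 1 1 0
typeof(c(1, TRUE, FALSE))    # "double"

# A list passes is.vector() but not is.atomic()
is.vector(list(1, "a"))   # TRUE
is.atomic(list(1, "a"))   # FALSE
```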
A data frame essentially functions as a series of connected vectors, where each vector is a column. In this sense a data frame is also a special kind of list.
In a data frame, all vectors need to be of the same length. And while each vector must hold a single data type, different vectors can hold different data types. Data frames also allow us to apply column names.
(data.frame(
  numbers = c(1, 5, 8, 9, 11),
  words = c('I', 'want', 'to', 'learn', 'R')
))
numbers words
1 1 I
2 5 want
3 8 to
4 9 learn
5 11 R
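The idea that a data frame is a list of equal-length column vectors can be checked directly. This is a minimal sketch using the same values as above. The name df is illustrative, and stringsAsFactors = FALSE is set explicitly on the assumption that the code may run on an R version older than 4.0, where text columns were converted to factors by default.

```r
df <- data.frame(
  numbers = c(1, 5, 8, 9, 11),
  words = c("I", "want", "to", "learn", "R"),
  stringsAsFactors = FALSE  # keep text as character on older R versions
)

is.list(df)          # TRUE: a data frame is a special kind of list
is.data.frame(df)    # TRUE

# Each column is a vector with its own data type
sapply(df, class)    # numbers: "numeric", words: "character"
df$words             # "I" "want" "to" "learn" "R"

# All columns share the same length
nrow(df)             # 5
ncol(df)             # 2
```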
A list also essentially functions as a series of connected vectors, but frees us from the data frame’s requirement that every column be the same length. You can also nest a list within a list, which can start to get complicated.
(list(
  breakfast = c('Eggs', 'Muffins', 'Coffee'),
  lunch = c('Grilled Cheese Sandwich with Orange Juice'),
  numbers = c(1, 4, 6, 7)
))
$breakfast
[1] "Eggs" "Muffins" "Coffee"
$lunch
[1] "Grilled Cheese Sandwich with Orange Juice"
$numbers
[1] 1 4 6 7
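Nesting and element access are easier to follow with a small example. The sketch below adapts the list above. The names meals, extras, snack and counts are hypothetical, introduced only for illustration. It shows elements of different lengths, a list nested inside a list, and the difference between single and double square brackets.

```r
meals <- list(
  breakfast = c("Eggs", "Muffins", "Coffee"),
  lunch = "Grilled Cheese Sandwich with Orange Juice",
  extras = list(snack = "Apple", counts = c(1, 4, 6, 7))  # a nested list
)

# Elements do not need to be the same length
lengths(meals)             # breakfast 3, lunch 1, extras 2

# $ and [[ ]] extract the element itself
meals$breakfast[2]         # "Muffins"
meals[["extras"]]$counts   # 1 4 6 7

# Single brackets return a smaller list instead
class(meals["lunch"])      # "list"
class(meals[["lunch"]])    # "character"
```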
A matrix resembles a data frame when displayed on screen, but is more accurately a vector with a dim attribute that defines the number of rows and columns to divide the vector into. As a result, a matrix can only hold a single data type or class.
The following matrix holds a series of numeric data. Instead of column names, it has row and column indices.
(matrix(round(rnorm(12, 10, 1), 2), nrow = 3))
[,1] [,2] [,3] [,4]
[1,] 11.03 9.06 10.93 10.55
[2,] 9.76 9.98 7.81 9.31
[3,] 10.45 8.97 10.05 11.51
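Because the matrix above is built from random numbers, its values change on every run. As a reproducible sketch of the same idea, the example below uses the integers 1 to 12 (the names values and m are illustrative) and shows that adding a dim attribute to a plain vector gives the same result as calling matrix().

```r
values <- 1:12
m <- matrix(values, nrow = 3)

dim(m)           # 3 4
attributes(m)    # only $dim, holding 3 and 4

# Values fill the matrix column by column
m[2, 3]          # 8
m[, 1]           # 1 2 3

# Setting dim on the vector turns it into the same matrix
dim(values) <- c(3, 4)
is.matrix(values)       # TRUE
identical(values, m)    # TRUE
```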
Vectors are the building blocks of data frames, lists, and matrices. Matrices are vectors broken into columns of the same length and the same data type. Data frames are joined vectors of the same length that may differ in data type. Lists are joined vectors that may differ in both length and data type. Each is useful in certain situations.
Function | Description |
---|---|
class | reports the type of data or data structure. |
is. | a family of functions for identifying data types and structures. |