Getting the Data

I have a sample csv file available for you at this link. Please download it and place it in your RScripts folder that we created earlier on your desktop.

File Formats

Data are stored in a wide variety of formats, and R can handle many of these out of the box. We'll work with a simple example importing a csv–or comma separated value–file, in which column variables are separated by commas and rows of observations by a new line.

Comma Separated Values

In general, keeping your data in a plain text format like csv has many advantages including longevity of access, transferability across systems, and perhaps most importantly, encouraging good data handling practices; programs like Excel allow us to indicate information about our data using colours, to merge cells etc, all of which limit the ability of a program to read or interpret our data.

Knowing where your data is

Before we can import our data, we need to know what our working directory is, that is, what folder R thinks we're currently in and what folder our data is in.

We have two functions for this:

  • getwd() get working directory, which tells us where we are; and
  • setwd() set working directory, which tells R where we want to be.

This next bit is important

We want to set our working directory to the same location as our script. To do this, we'll use the following code:

Mac and Linux

setwd("~/Desktop/RScripts")

Windows

setwd("~/Desktop/RScripts")

Directories

Every system has a default home directory. We can access that directory with the ~. On Mac OS this should take you to /users/username, on Windows to \user\username and on Linux /home/username.

The above will work for this session on systems that have not had their default home directories changed.

Importing csv

We can read a csv with the function read.table(). read.table() has several iterations, such as read.csv() specifically for comma separated values and read.csv2 that accounts for the use of commas in place of decimals in certain countries, but read.table will allow us to explore a bit more what's happening on our import.

Swiss data set

R comes pre-loaded with several data sets that you can play around with. The Swiss data set is one example. This is a data set of standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888. The csv generated for this workshop is derived from this dataset, with a few changes introduced.

You can learn more about the data set with ?swiss

Inital Exploration

First, we'll open our file in a text editor, so you can see the comma separations.

We'LL also open it in Excel, so you can see what it looks like in a spread sheet application.

We can see that we have a series of observations, represented as rows containing province names in Switzerland. Each observation has several variables, including things like education level and infant mortality rates.

It's important to note that we have column names (variables, which will also be used as our labels) and row names (which will act as row labels), but that these don’t represent data points.

At this point, we should have a sense of what we need to import.

Import Options

Let's explore the options we can set when reading in this file considering the above. We'll start by just looking at the first few lines; a way to avoid loading a large data set that we just want to get a sense of.

read.table() provides us with several things that we can tell it on import, for instance how our data is separated - this could be comma, tab, or any value really - whether or not our data has row names, if there’s a header, and very importantly, if there are values that we want to replace with NA on import. We can also specify how many rows of data to import.

You can of course get a full list of the options available with ?read.table, but for the moment we’re going to forge ahead with the following:

read.table("~/Desktop/RScripts/SwissData.csv", sep = ",", row.names = 1, header = TRUE, nrows = 6, na.strings = c("NULL", "", " "))
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12       NA
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Let’s work through this.

  1. First, we identify the file.
  2. Then sep = "," indicates the column separator used, in this case a comma.
  3. row.names = 1 indicates that we have row names and that these appear in the first column, so, import these, but don’t include them as data, just labels.
  4. header = TRUE indicates that we have a header row naming our variables, so again, import these, but don’t include them as data points.
  5. nrows = 6 indicates to stop reading at the 6th line of data
  6. Lastly, na.strings = c("NULL", "", " ") tells R to replace any NULL values, blank values, or data points that are just spaces with NA.

Missing values

NULL values, empty cells, and random spaces routinely show up in data sets, either as artifacts of other applications, user error or errors introduced during data collection. All these will cause issues when it comes time to process our data. We can account for and then deal with these appropriately by replacing these with the value NA using na.strings.

At this stage, everything looks good, and we should be set to import our dataset, so we’ll do exactly as above, but we'll pull in the full data set and store it in a variable called SwissData:

SwissData <- (read.table("~/Desktop/RScripts/SwissData.csv", sep = ",", row.names = 1, header = TRUE, na.strings = c("NULL", "", " ", ".")))
head(SwissData)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12       NA
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

The one last thing we'll do is inquire about our structure

str(SwissData)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  NA 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

And at this point, we should be able to start working with this data set using all of the tools that weencountered so far. You should be able to look at observations that match only certain criteria, calculate descriptive stats, add data points etc all the basic elements we'd expect to be able to do with a data set. You also know how to look for and learn how to use various functions in R. And lastly, you hopefully have some of the language needed to be able use a search egine to ask ‘how do I this in R’ and understand some of the answers you might get on various forums or blogs.

Next steps

Up until now, we've been working through exploring, altering, and asking questions of our data with tools that are built into base R, that is with tools that come with a basic installation.

In the introduction, we talked about R being more than a programming language, and being a community of developers and users.

This community is continually creating new tools, that is new functions, to help you do more and to do this more efficiently. These functions come bundled as packages that you can install. You'll frequently see reference to these packages.

To see what packages you have installed on you system, you can run installed.packages()

installed.packages()

To know what functions are available in a give package you can run help(package="packageName")

To install a package that you don't have, you would run install.packages("PackageName")

Tidyverse

Tidyverse is a collection of R packages that is well worth exploring.

dplyr

One of these packages, called dplyr, allows you to do a lot of what we did today in a more streamlined way.

To install dplyr you can run

install.packages("dplyr")

And then to see the functions available, you can run

help(package = "dplyr")

To learn more about tidyverse and dplyr you can visit https://dplyr.tidyverse.org/

readxl

Another of these is readxl, which has several functions to allow you to import Excel files. It works very much like read.table but for Excel files.

To install readxl you can run

install.packages("readxl")

And then to see the functions available, you can run

help(package = "readxl")

To learn more about tidyverse and readxl you can visit https://readxl.tidyverse.org/