An application programming interface (API) is a formalized way of interacting with another piece of hardware or software without needing to know how that hardware or software actually works. As such, it is a convenient method of making requests to get, update, or push information between your system and a remote system. As a loose analogy, you can think of this as a menu in a restaurant; you don't need to know how the kitchen makes your food, but you do need to know what you can ask your server for. When you want scrambled eggs, you don't go into the kitchen and make them from scratch, you simply put in a request based on the options on the menu.
JSON
JavaScript Object Notation (JSON) is a method for storing and transmitting data, primarily as key-value pairs. It is compact, flexible, human-readable, and works well in a web-based environment. It is generally the default format in which data is transmitted from an API.
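As a quick illustration (a made-up record, not actual Abacus output), a JSON object pairs each key with a value:

# a toy JSON record held in an R string; keys on the left, values on the right
toy_json <- '{"name": "Labour Force Survey", "year": 2011, "producers": ["Statistics Canada"]}'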
Setup
The Abacus API is accessible over HTTP, which means we can access the API using any HTTP-compliant tool, including a web browser.
By default, the API returns data in JSON notation. One issue we face when working in R is that R excels with rectangular data, especially when working in the Tidyverse. So we need to wrangle the JSON-formatted data into a data structure more amenable to R. This begins by moving the content into a list, after which we can transition it to a dataframe, or tibble, as needed.
To do this, we'll use the following libraries, although, as we'll see later, we can get away with just the httr2 package (kableExtra is used only to format the tables displayed at the end):

library(httr2)      # http protocols
library(jsonlite)   # json to r data structures
library(kableExtra) # formatted tables for display
Steps
In the sections that follow, we'll cover these steps:
Using the search API to look for relevant datasets.
Plugging a unique ID associated with the dataset(s) of interest into the datasets API to identify the files of interest. This involves first identifying a new unique ID for each dataset, only accessible through the datasets API, as well as the version of the dataset we're interested in (an Abacus record is version controlled).
Using the URLs associated with the data files to download the relevant data.
We will be working with the Statistics Canada Labour Force Surveys for this exercise.
Caveats
This is a simplified overview of API interactions. We will not, for example, be doing any error handling, nor will we deal with authentication tokens. But these are easily built in once the general approach is understood, and the httr2 package has functions for handling both; see the sketch below.
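As a taste, a minimal sketch of two such helpers (illustrative only; the token is a placeholder):

# retry transient failures and attach a bearer token to a request
req <- request("https://abacus.library.ubc.ca/api/") |>
  req_retry(max_tries = 3) |>
  req_auth_bearer_token("YOUR_TOKEN_HERE")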
Abacus has more than one API. This is common when interacting with APIs: each API gives you access to certain functionality and returns certain pieces of data. Here, we'll review the search and datasets APIs; the former to locate content, the latter to retrieve that content. We will ultimately use the access API with download.file() to acquire the data.
In either case, we start with the base URL of the API, which we'll store in a variable.
base_url <- "https://abacus.library.ubc.ca/api/"
Next, we can add on the specific API we intend to search.
search_api <- "search?q="
The search API requires a query parameter, i.e., something to search for. This is articulated as a string following the ?q=.
Note: When using the httr2 package, if the query string is multi-word, spaces need to be replaced with %20. For example, a search for labour force will fail and needs to be written as labour%20force.
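Base R's URLencode() can perform this substitution for us:

URLencode("labour force") # returns "labour%20force"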
Search API
There are a variety of fields and file types that we can search through the search API. Note that, as a data repository, objects in Abacus are organized into dataverses, datasets, and files, where datasets reside in dataverses, and files within datasets. Datasets provide a high-level overview of the data they hold, including descriptions, the number of files, who produced the files, when they were uploaded, etc.
We've already built our base_url and added on the requisite text to access the search API, search_api. We'll now build our query, specifying the search string and the object type of interest, i.e., whether we want to search for a dataverse, dataset, or file.
search_query <- "title:labour" # retrieve everything with the word labour in the title
search_type <- "&type=dataset" # limit to datasets
We can then paste these together to construct a full query.
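Reconstructing the full call from the pieces defined above (search_api_call is the object name used below):

(search_api_call <- paste0(base_url, search_api, search_query, search_type)) # the full request URL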
To get a sense of what this looks like, we can load this query in our browser.
Implementing the Search in R
The httr2 package allows us to make HTTP requests from within R. The steps involved include:
a) building a request; and then
b) performing that request, which results in some data being sent back to us.
These two steps are done, respectively, with the request() and req_perform() functions. We feed our search_api_call into request() and then feed the result into req_perform().
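In pipe form, this looks as follows (search_resp is the object we'll work with below):

search_resp <- search_api_call |>
  request() |>   # build the request object
  req_perform()  # perform it, returning the server's response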
What we're most interested in here is the body. However, the body is sent to us structured as JSON and in raw bytes; this is generally the case with an API response. httr2 has a built-in function for converting the body from raw bytes to something R can work with, resp_body_json(), and we'll use this later, but we'll start with the jsonlite package to explore this process step by step.
summary(search_resp$body)
## Length Class Mode
## 24063   raw  raw
Since we only want the body going forward, we’ll isolate this.
search_body <- search_resp$body
Data Wrangling
Our first task is to convert this raw data to character data. We can do this with rawToChar().
search_body_char <- rawToChar(search_body)
We're presented with a character vector of length one which, if we print out the first bit of it, does not look very friendly.
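A sketch of both steps: peeking at the string, then parsing it with jsonlite's fromJSON(); the names search_body_list and search_data match those used below:

substr(search_body_char, 1, 100) # peek at the first 100 characters of the JSON string
search_body_list <- fromJSON(search_body_char) # parse the JSON into nested R structures
search_data <- search_body_list$data # the payload of interest sits under $data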
We're slowly whittling this down. This now looks almost identical to what we pulled up in our browser earlier. The actual relevant content is in the items list. But we'll also need the total_count for when we request all the data from the server in a loop. For now, let's look at the 26 fields available in items.
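We can quickly list the fields attached to each result:

names(search_data$items) # the fields available for each search result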
We don't need all this data. We should be able to do some evaluation with the name, description, fileCount, producers, published_at, url, and, for later retrieval, global_id fields. Let's look at these quickly, noting, importantly, that producers is another list, which we'll need to wrangle somehow.
vars_of_interest <- c("name", "description", "fileCount", "producers",
                      "published_at", "url", "global_id")
lapply(search_data$items[vars_of_interest], class) # show the class of each list item
At this stage, we have what we need to pull search results from Abacus. You'll note that the search returned a result set of 111 records, but only 10 items:
(search_body_list$data$total_count)
## [1] 111
(nrow(search_body_list$data$items))
## [1] 10
Going forward, we need to paginate through the result set to retrieve all the items.
A few things to note:
By default, the search API returns 10 results per request; it allows a maximum of 1000 per request. This is handled with the per_page parameter.
There is a start parameter that we can feed into the request; this indicates which record to start retrieving data from. This number needs to be updated as we iterate over our requests: for example, if the first call pulls records 1 to 1000, the second call needs to start at record 1001.
R starts counting at 1; other languages start at 0, so the first record returned by the API is indexed at 0 on the server, not 1. This means the first call passes start=0 and the second call passes start=1000.
APIs will have a limit on the number of requests you can make in any given time span. We'll build in a simple delay using Sys.sleep(), but httr2 has a built-in function for handling this as well, sketched below.
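For example, a minimal sketch of httr2's throttling helper (argument details vary somewhat across httr2 versions):

# cap outgoing requests at roughly 10 per second instead of sleeping manually
req <- request(base_url) |>
  req_throttle(rate = 10)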
Below is a function with the above steps built into it. It takes two arguments, a search string and the number of results to retrieve per request. It also prints some progress information to the console.
search_datasets <- function(query, per_page) {
  # Build the query
  url <- paste0("https://abacus.library.ubc.ca/api/search?q=", query, "&type=dataset")
  # Initial call to figure out total records
  resp <- url |>
    httr2::request() |>
    httr2::req_perform()
  resp_body <- jsonlite::fromJSON(rawToChar(resp$body))
  total_count <- resp_body$data$total_count
  # Give some feedback
  cat(paste0("There are ", total_count, " records to be fetched.\n",
             "Fetching ", per_page, " records per call.\nThis will require ",
             ceiling(total_count / per_page), " calls.\n\n"))
  # build place holders for the loop
  name <- vector()
  description <- vector()
  file_count <- vector()
  producers <- list()
  pub_date <- vector()
  global_id <- vector()
  handle <- vector()
  # establish the starting point
  start_point <- 0
  # create a counter for tracking calls
  call_counter <- 1
  # run the loop
  while (length(handle) < total_count) { # while the number of retrieved records is < the total count
    cat(paste0("Call ", call_counter, "\n")) # indicate which call we're on
    req <- httr2::request(paste0(url, "&start=", start_point, "&per_page=", per_page)) # create the request
    resp <- httr2::req_perform(req) # perform the request
    resp_body <- jsonlite::fromJSON(rawToChar(resp$body)) # convert the body from raw JSON to a list
    resp_body_name <- resp_body$data$items$name # get the name
    resp_body_desc <- resp_body$data$items$description # get the description
    resp_body_file_count <- resp_body$data$items$fileCount # get the file count
    resp_body_producers <- resp_body$data$items$producers # get the producers
    resp_body_handle <- resp_body$data$items$url # get the url
    resp_body_global_id <- resp_body$data$items$global_id # get the id
    resp_body_pubdate <- resp_body$data$items$published_at # get the publication date
    # update the place holders:
    name <- c(name, resp_body_name)
    description <- c(description, resp_body_desc)
    file_count <- c(file_count, resp_body_file_count)
    producers <- c(producers, resp_body_producers)
    pub_date <- c(pub_date, resp_body_pubdate)
    handle <- c(handle, resp_body_handle)
    global_id <- c(global_id, resp_body_global_id)
    # update counters
    start_point <- start_point + per_page # increment the start point
    call_counter <- call_counter + 1 # increment the counter
    Sys.sleep(.1) # pause before making a new call
  }
  # When all is said and done, compile the place holders into a list and return this object
  return(list("name" = name,
              "description" = description,
              "file_count" = file_count,
              "producers" = producers,
              "pub_date" = pub_date,
              "handle" = handle,
              "global_id" = global_id))
}
We can then call that function, passing a query that restricts the search to titles containing labour and asking for 50 results per call. We're being deliberately broad with our search.
labour <- search_datasets("title:labour", 50)
## There are 111 records to be fetched.
## Fetching 50 records per call.
## This will require 3 calls.
##
## Call 1
## Call 2
## Call 3
We now have record data that we can explore.
summary(labour)
##             Length Class  Mode
## name        111    -none- character
## description 111    -none- character
## file_count  111    -none- numeric
## producers   111    -none- list
## pub_date    111    -none- character
## handle      111    -none- character
## global_id   111    -none- character
We can make this a bit easier for ourselves by converting this into a dataframe or tibble. There are several ways to approach this, considering that producers is a list; it is a list because some records have more than one producer associated with them. Here, I'll treat datasets with multiple producers as a different record authority than if one of those co-producers had produced a standalone dataset. To do this, we'll first collapse the nested producers lists, and then flatten the result into a single list.
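A minimal sketch of that approach (the collapsed string then acts as the record authority):

# collapse each nested producers entry (possibly several names) into a single string
labour$producers <- sapply(labour$producers, paste, collapse = "; ")
# with every element now an equal-length atomic vector, flatten the list into a dataframe
labour <- as.data.frame(labour, stringsAsFactors = FALSE)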
We have several datasets with labour in the title. One way of exploring this is to strip the dates so that we can look for unique values. First we extract the date:
labour$year <- gsub(".*?([0-9]+).*$", "\\1", labour$name) # regex to extract ending numbers into a new variable
head(labour$year)
## [1] "2021" "2022" "2023" "2006" "2008" "2011"
Then we strip the name:
labour$name_no_year <- trimws(gsub("[[:punct:]]|[[:digit:]]", "", labour$name)) # regex to remove punctuation and remaining numbers
head(labour$name_no_year)
## [1] "Labour Force Survey" "Labour Force Survey" "Labour Force Survey"
## [4] "Labour Force Survey" "Labour Force Survey" "Labour Force Survey"
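The head() output still shows repeats; the distinct values below were presumably produced with something like:

unique(labour$name_no_year) # the distinct survey names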
## [1] "Labour Force Survey"
## [2] "Survey of Labour and Income Dynamics"
## [3] "Survey of Persons Not in the Labour Force"
## [4] "Labour Force Income Profiles"
## [5] "Cultural Labour Force Survey"
## [6] "Small Area Business and Labour Database"
## [7] "Survey of Labour and Income Dynamics Reweighted"
## [8] "Labour Force Historical Review"
## [9] "Labour Force Survey Rebased Revised"
## [10] "Labour Force Survey revision"
## [11] "Interwar Labour Database"
## [12] "Labour force historical review"
## [13] "Labour Income Profile for selected BC communities"
## [14] "Labour Market Activity Survey"
## [15] "Experienced Labour Force Population by Occupation custom tabulation"
## [16] "Experienced Labour Force Population by Industry and Occupation custom tabulation"
## [17] "Labour Market Activity Survey Crosssectional Files"
## [18] "Labour Market Activity Survey Longitudinal Files"
## [19] "New Survey of London Life and Labour"
## [20] "Labour Force years of age and older in private households by occupation for Quebec Census of Canada Sample Custom tabulation"
We only need the Labour Force Surveys themselves. These are occasionally revised, hence the multiple records with slightly modified names (Labour Force Survey, Labour Force Survey (Rebased, Revised), etc.).
lfs <- labour[grepl("^Labour Force Survey", labour$name_no_year), ] # regex to find names starting with "Labour Force Survey"
head(lfs)
| name | description | file_count | producers | pub_date | handle | global_id | year | name_no_year |
|------|-------------|------------|-----------|----------|--------|-----------|------|--------------|
| Labour Force Survey, 2021 | LFS data are used to produce the well-known ... | 91 | Statistics Canada | 2023-03-16T22:00:56Z | https://hdl.handle.net/11272.1/AB2/HP9TEK | hdl:11272.1/AB2/HP9TEK | 2021 | Labour Force Survey |
| Labour Force Survey, 2022 | LFS data are used to produce the well-known ... | 57 | Statistics Canada | 2023-03-16T22:01:02Z | https://hdl.handle.net/11272.1/AB2/SRAU7E | hdl:11272.1/AB2/SRAU7E | 2022 | Labour Force Survey |
| Labour Force Survey, 2023 | LFS data are used to produce the well-known ... | 31 | Statistics Canada | 2024-01-05T17:38:25Z | https://hdl.handle.net/11272.1/AB2/IJU1QK | hdl:11272.1/AB2/IJU1QK | 2023 | Labour Force Survey |
| Labour Force Survey, 2006 | The Labour Force Survey provides estimates o... | 26 | Statistics Canada | 2023-03-16T21:47:04Z | https://hdl.handle.net/11272.1/AB2/0B5LPL | hdl:11272.1/AB2/0B5LPL | 2006 | Labour Force Survey |
| Labour Force Survey, 2008 | The Labour Force Survey provides estimates o... | 26 | Statistics Canada | 2023-03-16T21:47:20Z | https://hdl.handle.net/11272.1/AB2/LA3WXI | hdl:11272.1/AB2/LA3WXI | 2008 | Labour Force Survey |
| Labour Force Survey, 2011 | The Labour Force Survey provides estimates o... | 26 | Statistics Canada | 2023-03-16T21:47:47Z | https://hdl.handle.net/11272.1/AB2/GK3SFF | hdl:11272.1/AB2/GK3SFF | 2011 | Labour Force Survey |
Looking at Files
We now know the datasets we're interested in. A dataset is made up of multiple files, however, and we likely don't want all the files from every dataset. We'll start by getting the file list for one of our datasets. Once we can do this for one dataset, we can iterate over all of them, although we won't implement that in this workshop.
Up until now, we've been using the search API. We'll now use the datasets API.
datasets_api <- "datasets/"
Like before, we'll build our query. We'll start with the 2011 LFS dataset. This is a multi-step process: the first thing we need to do is get the id for the dataset, which is different from the global_id we retrieved from the search API, and we get that id by passing the global_id to the datasets API.
id <- lfs[lfs$name == "Labour Force Survey, 2011", "global_id", drop = TRUE] # we want a simple vector returned
query <- paste0(":persistentId/?persistentId=", id) # limit to the id of interest, see the Dataverse API documentation
(dataset_api_call <- paste0(base_url, datasets_api, query)) # build the call
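The retrieval step itself isn't shown above, so here is a minimal sketch. It assumes the Dataverse native API's file-listing endpoint (datasets/{id}/versions/{versionId}/files) and uses resp_body_json(), the httr2 helper mentioned earlier, so that each file arrives as its own list element:

# fetch the dataset metadata via its persistent (global) id
dataset_meta <- dataset_api_call |>
  request() |>
  req_perform() |>
  resp_body_json()
(dataset_id <- dataset_meta$data$id) # the internal id, distinct from the global_id

# list the files attached to the latest version of this dataset
dataset_file_resp <- paste0(base_url, datasets_api, dataset_id, "/versions/:latest/files") |>
  request() |>
  req_perform() |>
  resp_body_json()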
Files are generally indicated as being either data, documentation, or command files. We'll isolate just the data files in the list. All relevant data files have a directoryLabel equal to Data; we can use this to access only the data files.
data_files <- Filter(function(x) x$directoryLabel == "Data", dataset_file_resp$data) # filter by directoryLabel
From this we'll see that there are two different formats available to us. Let's grab just the .tab files.
relevant_data_files <- Filter(function(x) grepl("tab$", x$label), data_files) # filter to files ending in .tab
And then download them.
dir.create("LFS_2011") # create a directory to hold the downloads
## Warning in dir.create("LFS_2011"): 'LFS_2011' already exists
dest_dir <- "LFS_2011/" # store the directory in a variable

# loop through the files and download them
for (i in 1:length(relevant_data_files)) {
  file_name <- relevant_data_files[[i]]$label
  file_id <- relevant_data_files[[i]]$dataFile$id
  url_root <- "https://abacus.library.ubc.ca/api/access/datafile/"
  download.file(url = paste0(url_root, file_id), destfile = paste0(dest_dir, file_name))
}
And build an annual data set:
lfs_2011_list <- lapply(list.files(dest_dir),
                        function(x) read.delim(file = paste0(dest_dir, x))) # read each tab file into a list
lfs_2011_df <- do.call(rbind, lfs_2011_list) # bind the list items into a single dataframe
summary(lfs_2011_df[, c(1:11)]) |>
  kbl() |>
  kable_styling(bootstrap_options = "striped")