7 Activities: Acquiring Data
7.1 Finding, Importing, and Cleaning Data
Throughout this exercise, it will be helpful to have the Data Import Cheat Sheet handy.
Additional readings:
- Data import from Wickham and Grolemund
- Missing data from Wickham and Grolemund
- DataCamp import tutorial
- DataCamp data cleaning tutorial
- An Introduction to Data Cleaning in R - de Jong & van der Loo
Required packages:
library(tidyverse)
library(ggplot2)
In practice, data science is not as glamorous as building classifiers and creating visualizations all the time. Data scientists commonly report spending around 80% of their time acquiring and cleaning data. While the skill of data acquisition is best learned through experience, this section of the course outlines the most common approaches to acquiring data.
When importing and cleaning a dataset, take careful notes in your R Markdown. Explain where you found the dataset. Record the steps you took to clean and import the data in case somebody else needs to replicate your analysis. You should also make sure to cite and credit the creator of the dataset if it is relevant.
7.1.1 Finding Existing Data Sets
Figure 7.1: An example Google search.
The easiest way to get data is by finding an existing dataset that has been created by somebody else. Search engines such as Google can be excellent tools, especially when using file type filters. For example, if you are looking for a dataset about movie reviews, you might search for “movie reviews filetype:csv”. You could also try searching for other common filetypes that are compatible with R, such as .tsv, .xls, .xlsx, or .rds.
Other good resources are compendiums of datasets such as the excellent and continuously evolving awesome-public-datasets GitHub repo, Kaggle datasets, or the data.world website. You can find links to other similar compendiums at the end of the awesome-public-datasets page.
7.1.2 Loading Datasets
Once you have a dataset, it’s time to load it into R. Don’t be frustrated if this step takes some time. A few tips:
- DataCamp has a fantastic import tutorial that demonstrates how to import data in many different common file formats.
- The import functions read_csv, read_csv2, and read_tsv from the readr package are faster than their base counterparts read.csv, read.csv2, and read.delim for large files. They also have more flexible parsers (e.g., for dates, times, percentages). We recommend you use these functions instead of the base functions like read.csv. The fread function from the data.table package is another fast option for large datasets. For smaller datasets (say 1 MB or less), there won't be much difference in time among these approaches. Note that read_csv2 is for semicolon-delimited files, whereas read_csv is for comma-delimited files.
- The readr functions automatically guess the type of data in each column (e.g., character, double, integer). You will often see a message just after the import telling you what it chose for each column. If you think there is an issue, you can use the problems() function to detect problems, and/or specify how the columns should be imported. See the section on “Parsing data types” in the Data Import Cheat Sheet for more info.
- If you have trouble importing a dataset, try first importing it into a different tool such as Google Sheets or Excel and then exporting it as a TSV or CSV before reading it into R.
- For really messy data, OpenRefine is complicated but powerful (YouTube demo).
- When you are importing a large file, you might want to first try importing a subset of the data. For example, if you want to take the first 17 rows only, you can write read_csv("file.csv", n_max = 17).
- Similarly, you might want to skip the first \(n\) lines of the file when importing, select only certain columns to read in, or choose a random subset of the rows. See the cheat sheet for instructions on these tasks or just google! A short sketch of a few of these options follows this list.
Exercise 1. Find a dataset not built into R related to 1) Macalester College, 2) a personal hobby or passion, or 3) your hometown, or a place you have lived. Load it into R, and construct one visualization of the data.
7.1.3 Cleaning Datasets
Cleaning Categorical Variables
First we want to make sure the factors are “clean.” For example, true, TRUE, and T will be treated as three different levels. The easiest way to manage this is to look at the levels of the factor and map each messy value to a clean one. For example, the following code cleans up the true/false values in column X:
levels(df$X) <- list("TRUE" = c("T", "true"), "FALSE" = c("f", "FALSE", "N", "No"))
(The replace() function, introduced in the hint for Exercise 3 below, offers another way to recode messy values.)
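To make the idea concrete, here is a small self-contained sketch on a made-up data frame (the column name and messy values are hypothetical):
# A toy data frame with inconsistently coded true/false values
df <- data.frame(X = factor(c("T", "true", "TRUE", "f", "No", "FALSE")))
# Collapse the messy levels into two clean ones
levels(df$X) <- list("TRUE"  = c("T", "true", "TRUE"),
                     "FALSE" = c("f", "FALSE", "No"))
# table(df$X) now reports 3 observations of "TRUE" and 3 of "FALSE"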
Exercise 2. Cleaning up IMDB
We will use a slightly “messied” version of the IMDB 5000 Dataset, collected by chuansun76 on Kaggle.12
- Use read_csv to load the IMDB 5000 dataset from “https://www.macalester.edu/~dshuman1/data/112/imdb_5000_messy.csv”, and save it as imdbMessy.
- Print out the column names.
- Examine the color column. What are the existing values?
- How often does each occur? Hint: use the count function from the plyr package.
- read_csv read in the color values as strings. For this exercise, let's convert them to a factor using the code: imdbMessy$color <- as.factor(imdbMessy$color).
- Select what you think is the best value for each level and replace “messy” versions of the value with clean ones by assigning to the levels() function as shown above. How many entries are there for each level now?
Addressing Missing Data
Finally, you should look for and address missing data, encoded as NA (not available) in R. There is no single formula for dealing with NAs. You should first look to see how many NAs appear in each column:
colSums(is.na(imdbMessy))
Study the individual observations with NAs carefully. Why do you think they are missing? Are certain types of observations more likely to have NAs?
You have several options for dealing with NAs:
- You can remove observations with one or more NAs (see complete.cases).
- You can remove columns with many NA values.
- You can replace NAs with a reasonable value (called imputing values). This could be a default value (like zero), or the average for the column.
- You can use packages such as missForest that fill in missing values with statistical predictions.13
There is no perfect approach to dealing with NAs, and you must think carefully about how removing or replacing missing data may affect your work.
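As a small illustration of the first option, here is a sketch using a hypothetical toy data frame:
# A toy data frame with some missing values
tiny_df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
# How many NAs are in each column?
colSums(is.na(tiny_df))
# Keep only the rows with no NAs in any column
tiny_complete <- tiny_df[complete.cases(tiny_df), ]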
Exercise 3. Addressing missing values in the IMDB dataset.
- Print out the number of NAs in each of the columns.
- Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
- Create a new dataframe that removes observations that have NAs for actor_1_facebook_likes.
- Create a second new dataframe that replaces NAs in actor_1_facebook_likes with 0.
Hint: There are a variety of ways to replace values in a dataset. One way is using the replace() function with dplyr. replace takes three arguments: the original data for the column, an expression that returns true if the value should be replaced, and the replacement value. For example, the following replaces NA values in the x column of the messy_df dataset with 0.
# Option 1. First create a cleaned version of the variable.
# Then replace the existing messy variable with the clean one.
cleaned_x <- replace(messy_df$x, is.na(messy_df$x), 0)
clean_df <-
messy_df %>%
mutate(x = cleaned_x)
# Option 2. Do it all in one go:
clean_df <-
messy_df %>%
mutate(x = replace(x, is.na(x), 0))
7.1.4 Additional exercises
For the additional data import activity you will use information collected by FiveThirtyEight about College Majors. Although you can find clean versions of this data online, your work below should not explicitly use any cleaned datasets.
Exercise 4: Import the recent-grads-messy.tsv dataset from “http://www.shilad.com/recent-grads-messy.tsv”.
Exercise 5: The Total variable requires some cleaning. When you are done, there should be 16 different levels that are consistently named. Hint: use the replace function to avoid having to write out all the levels.
Exercise 6: Find and address NAs in the dataset. Hint: In this dataset, you should be able to replace all the NAs with reasonable values.
7.2 Introduction to SQL
Additional readings and video tutorial:
1. SQL Tutorial for Data Analysis
2. Data Explorer Tutorial
3. Baumer, Kaplan, and Horton, Modern Data Science with R, Chapter 12
4. DataCamp: Intro to SQL for Data Science
In this activity we will use the following modules:
library(tidyverse)
If you find yourself analyzing data within a medium or large organization, you will probably draw on data stored within a centralized data warehouse. Data warehouses contain vast collections of information, far more than a desktop computer can easily analyze, and they typically rely on structured data repositories called SQL databases.
Data scientists interacting with data warehouses often follow a pattern that balances the scalability of SQL databases and the expressivity of data science languages like R and Python. After finding and understanding data relevant to a project14, a data scientist writes SQL code that creates a filtered, simplified, or aggregated version of the data within the warehouse. This smaller dataset is then exported as a CSV and analyzed in R.15
SQL databases can be thought of conceptually as collections of data tables, where each table resembles a data frame. While there is a core subset of SQL supported by all databases, different databases (Hive, Postgres, MySQL, Oracle, RedShift, etc.) use slightly different variants of SQL. Even though SQL is a complex language, the basic data wrangling techniques we learned earlier (filtering, joining, and summarizing) follow easily replicable patterns and cover the majority of needs.
7.2.1 Stack Exchange Data Explorer
Figure 7.2: Stack Exchange Explorer for up vs down votes for questions and answers by weekday
We will experiment with the Stack Exchange Data Explorer, a website that provides a SQL interface for all the data in StackExchange. StackExchange powers the StackOverflow programming question and answer site, but it also powers question and answer sites related to 126 topics including English, Travel, Bicycles, and Parenting.
StackExchange provides an in-depth Data Explorer Tutorial. We will take a quick walk through the basics of SQL using the data explorer. I chose to analyze the Travel Data Explorer, but you could perform the steps below on any one of the Stack Exchange data explorer sites.
Head to the Stack Exchange Data Explorer for Travel. You will see a list of queries other users have created in the past. These queries are written for all Stack Exchange sites, so some may not be relevant. Queries about your own activity (for example, “How many upvotes do I have for each tag?”) will also not be useful if you do not have activity on the particular site.
Click on one of them and you see the SQL code for the query. Then click the “Run Query” button to get results. For example, you might look at the number of up vs down votes for questions and answers by weekday and notice that for questions, Tuesday has the highest up vs. down vote ratio (21.6) and Saturday has the lowest (18.8). You can contemplate hypotheses for this difference!
7.2.2 Basic Select Queries
Let’s experiment with our own queries. Click on “Compose Query” in the upper right, and notice the tables are shown in the right. As a reminder, a table is similar to a data frame. Each table lists the columns stored within the table and the data types for the columns. Look through the tables for Posts, Users, and Comments. Do the columns generally make sense, and correspond to the StackOverflow website? There’s a description of the tables and columns (called a schema) available on StackExchange’s Meta Q&A Site.
Now enter your first query and click the “Run Query” button:
SELECT TOP(100) Id, Title, Score, Body, Tags
FROM Posts
In this query we already see several important features of SQL:
- SELECT tells SQL that a query is coming.
- TOP(100) only returns the first 100 rows.16
- Id, Title, Score, Body, Tags determines which columns are included in the result.
- FROM Posts determines the source dataset.
From glancing at the results, it appears that this table contains both questions and answers. Let's try to focus on questions. Looking again at the Schema Description, notice that there is a PostTypeId column in Posts, and a value of 1 corresponds to questions. Let's update our query to only include questions:
SELECT TOP(100)
Id, Title, Score, Body, Tags
FROM Posts
WHERE PostTypeId = 1
The SQL command WHERE is like the filter command we have been using in dplyr.17
Exercise 1: Find the title and score of Posts that have a score of at least 110. Hint: TOP is not necessary here because you want all result rows.
Exercise 2: Find posts whose title contains some place you are interested in (you pick!). Hint: use SQL’s LIKE operator.
Note that you can look up the actual webpage for any question using its Id. For example, if the Id is 19591, the webpage URL would be https://travel.stackexchange.com/questions/19591/. Look up a few of the questions by their Id.
It’s unclear how the 100 questions we saw were selected from among the 25,000 total questions.18 Let’s try to arrange the Posts by score. The following query surfaces the top scoring question: OK we’re all adults here, so really, how on earth should I use a squat toilet?
SELECT TOP(100)
Id, Title, Score, Body, Tags
FROM Posts
WHERE PostTypeId = 1
ORDER BY Score DESC
The ORDER BY ??? DESC syntax is similar to R's arrange(). You can leave off the DESC if you want the results ordered smallest to largest.
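If it helps to see the parallel, here is roughly how the same query might look in dplyr, assuming a hypothetical posts data frame with the same columns as the Posts table (a sketch, not something you can run against the data explorer):
posts %>%
  filter(PostTypeId == 1) %>%               # WHERE PostTypeId = 1
  select(Id, Title, Score, Body, Tags) %>%  # the SELECT column list
  arrange(desc(Score)) %>%                  # ORDER BY Score DESC
  head(100)                                 # TOP(100)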
We could also find the highest rated questions tagged “italy” (the top question is Does Venice Smell?):
SELECT TOP(100)
Id, Title, Score, Body, Tags
FROM Posts
WHERE PostTypeId = 1 AND Tags LIKE '%italy%'
ORDER BY Score DESC
Exercise 3: Pick two tags that interest you and that you think will occur together, and find the top voted posts that contain both.
7.2.3 SQL Summarization
So far, we have covered the equivalent of R’s selecting, filtering, and arranging. Let’s take a look at grouping and summarizing now, which has similar structures in both R and SQL. Imagine we want to see how many posts of each type there are. This query shows us that there are 25K questions and 42K answers.
SELECT
PostTypeId, COUNT(Id) numPosts
FROM posts
GROUP BY PostTypeId
ORDER BY PostTypeId
Note a few characteristics of SQL summarization here (a rough dplyr comparison follows the list):
- The GROUP BY clause indicates the table column for grouping, much like R's group_by.
- There is no explicit summarize. Instead, all columns that appear in the SELECT, except for those listed in GROUP BY, must make use of an aggregate function. COUNT(*) is one of these, and is the equivalent of R's n(). Many other aggregate functions exist, including MAX, SUM, and AVG. Every aggregate function requires a column as an argument (even COUNT(), which doesn't logically need one).
- The aggregate column (in this case COUNT(Id)) must immediately be followed by a name that will be used for it in the results (in this case numPosts). This can be particularly useful if you want to order by the aggregated value.
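For comparison, a rough dplyr equivalent of the query above (again assuming a hypothetical posts data frame) might look like this:
posts %>%
  group_by(PostTypeId) %>%          # GROUP BY PostTypeId
  summarize(numPosts = n()) %>%     # COUNT(Id) numPosts
  arrange(PostTypeId)               # ORDER BY PostTypeId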
Exercise 4: Change the previous query so it orders the result rows by the number of posts of that type. Hint: Reuse the name you assigned to the aggregate function.
Exercise 5: Find the most commonly used tagsets applied to posts. Note that I am not asking you to count the most common individual tags — this would be more complex because multiple tags are squashed into the Tags field.
7.2.4 SQL Joins
Finally, as with R, we often want to join data from two or more tables. The types of joins in SQL are the same as we saw with R (inner, outer, left, right). Most commonly we want to perform an INNER join, which is the default if you just say JOIN.
Let's say we wanted to enhance the earlier query to find the highest scoring questions with some information about each user.
SELECT TOP(100)
Title, Score, DisplayName, Reputation
FROM posts p
JOIN users u
ON p.OwnerUserId = u.Id
WHERE PostTypeId = 1
ORDER BY Score DESC
We see a few notable items here:
- The JOIN keyword must go in between the two tables we want to join.
- Each table is given a short name (an alias). In this case we named Posts p and Users u.
- We need to specify the relationship that joins the two tables. In this case, the OwnerUserId column in Posts refers to the Id column in the Users table. (A rough dplyr sketch of this join follows the list.)
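As promised, here is a rough dplyr sketch of the same join, assuming hypothetical posts and users data frames with the same columns as the SQL tables:
posts %>%
  inner_join(users, by = c("OwnerUserId" = "Id")) %>%  # JOIN ... ON p.OwnerUserId = u.Id
  filter(PostTypeId == 1) %>%
  select(Title, Score, DisplayName, Reputation) %>%
  arrange(desc(Score)) %>%
  head(100)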
Exercise 6: Create a query similar to the one above that identifies the top rated comments instead of posts.
7.2.5 Additional Exercises
The first few exercises will ask you to analyze Stack Exchange badges.
Exercise 7: Count the number of total badges that have been given out. Hint: count the number of rows in the relevant table.
Exercise 8: Find how many times each badge has been awarded, sorted from most awarded to least awarded.
Exercise 9: Find a badge that looks interesting to you. Find all the user DisplayNames that have received the badge, along with the date at which they received it.
Exercise 10: Show the users who have received the most badges, along with how many they have received.
The next few activities analyze user activity. These activities mimic the common workflow of creating datasets in SQL that you analyze in R. Note that StackExchange’s data explorer has a button that lets you download a CSV containing the results of a SQL query.
Exercise 11: Export a CSV file containing information about each user: DisplayName, Id, Reputation, Age, and CreationDate.
Exercise 12: Calculate the number of posts per user, and the total number of upvotes across all posts per user and export this file as a CSV.
Exercise 13: Calculate the number of comments per user, and the total number of upvotes across all comments per user (this is the sum of the Score variable under the Comments table) and export this file as a CSV.
Exercise 14: Import these three datasets into R. Visualize the relationship between the three datasets. Include at least one visualization comparing each of:
- information from the user CSV and the post CSV, and
- information from the user CSV and comment CSV
To receive full credit your visualizations must tell a compelling story.
7.3 Public Web APIs
Required libraries:
library(tidyverse)
library(ggplot2)
library(lubridate)
library(urltools)
In this lesson you'll learn how to collect data from websites such as The New York Times, Zillow, and Google. While these sites are primarily known for the information they provide to humans browsing the web, they (along with most large websites) also provide information to computer programs.
Humans use browsers such as Firefox or Chrome to navigate the web. Behind the scenes, our browsers communicate with web servers using a technology called HTTP.
Programming languages such as R can also use HTTP to communicate with web servers. We will show later how it is possible for R to “scrape” data from almost any web page. However, it’s easiest to interact with websites that are specifically designed to communicate with programs. These Web APIs focus on transmitting data, rather than images, colors, or other appearance-related information.
An enormous variety of web APIs provide data accessible to programs written in R (and almost any other programming language!). Almost all reasonably large commercial websites offer APIs. Todd Motto has compiled an excellent list of Public Web APIs on GitHub. Browse the list to see what kind of information is available.
7.3.1 Wrapper Packages
Possible readings:
- NY Times API
- NY Times Blog post announcing the API
- Video showing how to use the NY Times API
- RTimes Vignette
- rOpenSci has a good collection of wrapper packages
In R, it is easiest to use Web APIs through a wrapper package, an R package written specifically for a particular Web API. The R development community has already contributed wrapper packages for most large Web APIs. To find a wrapper package, search the web for “R Package” and the name of the website. For example, a search for “R Reddit Package” returns RedditExtractor and a search for “R Weather.com Package” surfaces weatherData.
This activity will build on the New York Times Web API, which provides access to news articles, movie reviews, book reviews, and many other data. Our activity will specifically focus on the news search API, which finds information about news articles that contain a particular word or phrase.
We will use the rtimes package, which provides functions for some (but not all) of the NY Times APIs. Take a look at the RTimes Vignette, which shows some of the available features of the rtimes package. Jump forward to the Article Search API section and answer the following questions:
Exercise 1: What do you think the rtimes function below does? How does it communicate with the NY Times? Where is the data about articles stored?
res <- as_search(q="gamergate", begin_date = "20140101", end_date = "20160101")
To get started with the NY Times API, you must register and get an authentication key. Signing up only takes a few seconds, and it lets the New York Times make sure nobody abuses their API for commercial purposes. It also lets them rate limit their API and ensure programs don't make too many requests per day. For the NY Times API, this limit is 1000 calls per day. Be aware that most APIs do have rate limits, especially for their free tiers.
Once you have signed up, the New York Times will email you an authentication key, which is a 32-character string of numbers and letters.
Store this in a variable as follows (this is just an example ID, not an actual one):
rtimes_key <- "c935b213b2dc1218050eec976283dbbd"
Now, let's use the key to issue our first API call. We'll adapt the code we see in the vignette to do what we need.
library(rtimes)
# Pause for one second; don't make requests too often
Sys.sleep(1)
# Tell rtimes what our API key is
Sys.setenv(NYTIMES_AS_KEY = rtimes_key)
# Issue our first API call
res <- as_search(q="gamergate", begin_date = "20140101", end_date = "20160101")
Something magical just happened. Your computer sent a message to the New York Times and asked for information about every article about Gamergate between 2014 and 2016. Thousands of public Web APIs allow your computer to tap into almost any piece of public digital information on the web.
Let’s take a peek at the structure of the results. The RTimes Vignette and documentation provide some details on the structure. You can also look at the data in the “Environment” tab in the upper right of RStudio. The result’s main data is contained in res$data:
data <- res$data
colnames(data)
## [1] "web_url" "snippet"
## [3] "source" "multimedia"
## [5] "keywords" "pub_date"
## [7] "document_type" "new_desk"
## [9] "section_name" "type_of_material"
## [11] "_id" "word_count"
## [13] "score" "abstract"
## [15] "print_page" "headline.main"
## [17] "headline.kicker" "headline.content_kicker"
## [19] "headline.print_headline" "headline.name"
## [21] "headline.seo" "headline.sub"
## [23] "byline.original" "byline.person"
## [25] "byline.organization"
data$web_url
## [1] "https://www.nytimes.com/2015/12/17/technology/personaltech/women-who-play-games-shun-gamer-label.html"
## [2] "https://www.nytimes.com/2015/11/17/world/europe/paris-attacks-give-rise-to-fakes-and-misinformation.html"
## [3] "https://bits.blogs.nytimes.com/2015/10/27/backlash-grows-over-sxsws-canceled-video-game-panels/"
## [4] "https://bits.blogs.nytimes.com/2015/10/30/sxsw-convenes-online-harassment-event-following-uproar/"
## [5] "https://bits.blogs.nytimes.com/2015/10/26/sxsw-cancels-gamer-panels-after-threats/"
## [6] "https://www.nytimes.com/2015/10/27/nytnow/your-tuesday-evening-briefing-federal-budget-world-series-donald-trump.html"
## [7] "https://www.nytimes.com/2015/10/28/nytnow/latest-news-dennis-hastert-walgreens-kansas-city-royals.html"
## [8] "https://bits.blogs.nytimes.com/2015/05/14/reddit-introduces-anti-harassment-policy/"
## [9] "https://www.nytimes.com/2015/05/02/opinion/enlisting-bystanders-to-fight-online-abuse.html"
## [10] "https://www.nytimes.com/2015/03/08/movies/in-the-documentary-gtfo-female-video-gamers-fight-back.html"
data$headline.main
## [1] "Women Who Play Games Shun ‘Gamer’ Label"
## [2] "Paris Attacks Give Rise to Fakes and Misinformation"
## [3] "Backlash Grows Over SXSW Interactive’s Canceled Video Game Panels"
## [4] "SXSW Convenes Online Harassment Event Following Uproar"
## [5] "SXSW Cancels Gamer Panels After Threats"
## [6] "Your Tuesday Evening Briefing: Federal Budget, World Series, Donald Trump"
## [7] "Your Wednesday Briefing: Dennis Hastert, Volkswagen, Kansas City Royals"
## [8] "Reddit Introduces Anti-Harassment Policy"
## [9] "Enlisting Bystanders to Fight Online Abuse"
## [10] "In the Documentary ‘GTFO,’ Female Video Gamers Fight Back"
data$abstract
## [1] NA
## [2] NA
## [3] "A day after SXSW Interactive canceled two video game panels related to the so-called GamerGate movement over threats of violence, two digital media organizations threatened to pull out of the tech conference."
## [4] "SXSW Interactive faced controversy earlier this week for canceling two panels related to the issue of online harassment. Now the confab has decided to hold a full day of panel discussions on the topic."
## [5] "Both panels appear to have been dedicated to exploring issues that have polarized the gaming community since last year, when an online movement known by the Twitter hashtag #GamerGate first formed."
## [6] NA
## [7] NA
## [8] "Reddit users who view or experience harassment on the service will now be able to e-mail the company and request to have content and offenders removed from the site."
## [9] "Editorial holds that until social networks develop improved means of preventing online harassment, enlisting help of bystanders is one of best protective measures victims can adopt; notes development of internet platforms, such as HeartMob, that provide ways for bystanders to help without directly confronting harasser."
## [10] "Shannon Sun-Higginson's movie GTFO is one of several documentaries that examine pervasive culture of sexism and harassment of female players and developers in video game community."While this looks interesting, it seems unlikely that only ten articles in the New York Times referenced Gamergate during this time period. What could be happening? The answer to this mystery lies in the meta attribute of the results:
res$meta
## # A tibble: 1 x 3
## hits offset time
## <int> <int> <int>
## 1 34 0 21The tibble in res$data is only showing us the first ten results of 34 total results. This fragmenting of large result sets is called pagination, and it is regularly used by Web APIs to reduce the network and computation needs of a single API call. By looking at the rtimes documentation, you may learn how to fetch the second page of results. Note that page counting starts at 0, so the first page is index 0 and the second page is index 1:
# This pauses for 1 second.
# It is required when knitting to prevent R from issuing too many requests to
# The NY Times API at a time. If you don't have it you will get an error that
# says "Too Many Requests (429)"
Sys.sleep(1)
res2 <- as_search(q="gamergate", begin_date = "20140101", end_date = '20160101', page = 1)
res2$meta
## # A tibble: 1 x 3
## hits offset time
## <int> <int> <int>
## 1 34 10 7
res2$data$headline.main
## [1] "Annalee Newitz on the Future: ‘I Want People to Be Able to Have Cultural Debates Without Having War’"
## [2] "Intel Allocates $300 Million for Workplace Diversity"
## [3] "ThinkUp Helps the Social Network User See the Online Self"
## [4] "What We're Reading: Great Times Reads of 2014 From Our Top Editors"
## [5] "2014: The Year in Questions – Quiz Answers"
## [6] "Jane Pratt on Why Writing for Young Women Never Gets Old"
## [7] "Trying to Swim in a Sea of Social Media Invective"
## [8] "Death Threats, and Other Growing Pains"
## [9] "Can Twitter Stop Harassment?"
## [10] "Twitter Improves Tools for Users to Report Harassment"We could write code that repeats this process for each of the 4 pages of search results and merges them into a single result dataset. Although this is what you would do for many APIs, the author of the rtimes package has given us a convenient way to do this using the all_results parameter for as_search19
Sys.sleep(1)
res3 <- as_search(q="gamergate",
begin_date = "20140101",
end_date = '20160101',
all_results = TRUE)
Exercise 2: Create a chart visualizing the number of articles appearing in the NY Times over time between Jan 1, 2014 and Jan 1, 2016. Hint: Year + month provides a good level of granularity for the analysis, and they appear as a consistently located substring within the data.20
Exercise 3: There is often a mismatch between the structure of the data returned by a Web API and R’s table-oriented format. Getting the data you need from a Web API may require extra data manipulation. For example, for this exercise, consider the keywords associated with each article in the res3 response.
- What type of data is in the keywords variable? Look at both the top-level data frame and an observation within the top-level data frame. Hint: You can look at the first element of a list (or list-column) by using [[1]].
- Retrieve the keywords for the first article.
- Create a “flattened” data frame for the keyword variable containing one row per observation of a keyword. Hint: look at the bind_rows dplyr function.
- Visualize the frequent keywords for the articles.
7.3.2 Accessing Web APIs Using JSON
Wrapper packages such as rtimes provide a convenient way to interact with Web APIs. However, many Web APIs have incomplete wrapper packages, or no wrapper package at all. Fortunately, most Web APIs share a common structure that R can access relatively easily. There are two parts to each Web API: the request, which corresponds to a function call, and the response, which corresponds to the function’s return value.21
As mentioned earlier, a Web API call differs from a regular function call in that the request is sent over the internet to a webserver, which performs the computation and calculates the return result, which is sent back over the Internet to the original computer.
7.3.2.1 Web API Requests
Possible readings:
1. Understanding URLs
2. urltools Vignette
The request for a Web API call is usually encoded through the URL, the web address associated with the API’s webserver. Let’s look at the URL associated with the first rtimes as_search example we did. Open the following URL in your browser (you should replace MY_KEY with the api key you were given earlier).
http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gamergate&api-key=MY_KEY
The text you see in the browser is the response data. We’ll talk more about that in a bit. Right now, let’s focus on the structure of the URL. You can see that it has a few parts:
- http:// — The scheme, which tells your browser or program how to communicate with the webserver. This will typically be either http: or https:.
- api.nytimes.com — The hostname, which is a name that identifies the webserver that will process the request.
- /svc/search/v2/articlesearch.json — The path, which tells the webserver what function you would like to call.
- ?q=gamergate&api-key=MY_KEY — The query parameters, which provide the parameters for the function you would like to call. Note that the query can be thought of as a table, where each row has a key and a value. In this case, the first row has key q and value gamergate, and the second row has key api-key and value MY_KEY. The query parameters are preceded by a ?. Rows in the key-value table are separated by &, and the key and value within a row are separated by an =.
Typically, each of these URL components will be specified in the API documentation. Sometimes, the scheme, hostname, and path (http://api.nytimes.com/svc/search/v2/articlesearch.json) will be referred to as the endpoint for the API call.
We will use the urltools module to build up a full URL from its parts. We start by creating a string with the endpoint and then add the parameters one by one using param_set and url_encode:
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
url <- param_set(url, "q", url_encode("marlon james"))
url <- param_set(url, "api-key", url_encode(rtimes_key))
url
Copy and paste the resulting URL into your browser to see what the NY Times response looks like!
Exercise 4: You may be wondering why we need to use param_set and url_encode instead of writing the full url by hand. This exercise will illustrate why we need to be careful.
- Repeat the above steps, but create a URL that finds articles related to Ferris Bueller's Day Off (note the apostrophe). What is interesting about how the title appears in the URL?
- Repeat the steps above for the phrase Nico & Vinz (make sure you use the punctuation mark &). What do you notice?
- Take a look at the Wikipedia page describing percent encoding. Explain how the process works.
7.3.2.2 Web API Responses
Possible readings:
1. A Non-Programmer’s Introduction to JSON
2. Getting Started With JSON and jsonlite
3. Fetching JSON data from REST APIs
We now discuss the structure of the web response, the return value of the Web API function. Web APIs generate string responses. If you visited the earlier New York Times API link in your browser, you would be shown the string response from the New York Times webserver:
{"status":"OK","copyright":"Copyright (c) 2017 The New York Times Company.
All Rights Reserved.","response":
{"docs":[{"web_url":"https://www.nytimes.com/2017/06/27/arts/milkshake-duck-meme.html","snippet":
"Oxford Dictionaries is keeping a close eye on a term that describes someone who
rapidly gains and inevitably loses the internet’s intense love.","blog":{},"source":
"The New York Times","multimedia":[{"type":"image","subtype":"xlarge","url":
"images/2017/06/23/world/23Ducktweet_xp/23Ducktweet_xp-articleLarge.jpg",
"height":350,"width":600,"rank":0,"legacy":
{"xlargewidth":600,"xlarge":"images/2017/06/23/world/23Ducktweet_xp/23Ducktweet_xp-articleLarge.jpg","xlargeheight":350}},{"type":"image","subtype":"wide",
"url":"images/2017/06/23/world/23Ducktweet_xp/23Ducktweet_xp-thumbWide.jpg",
"height":126,"width":190,"rank":0,"legacy":
{"wide":"images/2017/06/23/world/23Ducktweet_xp/23Ducktweet_xp-thumbWide.jpg","widewidth":190,"wideheight":126}},
{"type":"image","subtype":"thumbnail",
"url":"images/2017/06/23/world/23Ducktweet_xp/23Ducktweet_xp-thumbStandard-v2.jpg",
"height":75,"width":75,"rank":0,"legacy":
{"thumbnailheight":75,"thumbnail":"images/2017/06/23/world/23Ducktweet_xp/23Ducktweet_xp-thumbStandard-v2.jpg",
"thumbnailwidth":75}}],"headline":{"main":"How a Joke Becomes a Meme: The Birth of ‘Milkshake Duck’",
"print_headline":"How a Joke Becomes a Meme: The Birth of ‘Milkshake Duck’"},"keywords":[{"isMajor":"N","rank":2,"name":"subject","value":"Slang"},
{"isMajor":"N","rank":3,"name":"subject","value":"Social Media"},...
If you stared very hard at the above response, you may be able to interpret it. However, it would be much easier to interact with the response in some more structured, programmatic way. The vast majority of Web APIs, including the New York Times, use a standard called JSON (Javascript Object Notation) to take data and encode it as a string. To understand the structure of JSON, take the NY Times web response in your browser, and copy and paste it into an online JSON formatter. The formatter will add newlines and tabs to make the data more human interpretable. You’ll see the following:
{
"status":"OK",
"copyright":"Copyright (c) 2017 The New York Times Company. All Rights Reserved.",
"response":{
"docs":[
# A HUGE piece of data, with one object for each of the result articles
],
"meta":{
"hits":2350,
"offset":0,
"time":72
}
}
}
You’ll notice a few things in the JSON above:
- Strings are enclosed in double quotes, for example "status" and "OK".
- Numbers are written plainly, like 2350 or 72.
- Some data is enclosed in square brackets [ and ]. These data containers can be thought of as R lists.
- Some data is enclosed in curly braces { and }. These data containers are called Objects. An object can be thought of as a single observation in a table. The columns or variables for the observation appear as keys on the left (hits, offset, etc.). The values appear after the specific key, separated by a colon (2350 and 0, respectively). Thus, we can think of the meta object above as:
| hits | offset | time |
|---|---|---|
| 2350 | 0 | 72 |
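To see how jsonlite maps these JSON structures onto R data, here is a tiny self-contained example (the JSON strings below are made up, not an actual API response):
library(jsonlite)
# An object (curly braces) becomes a named list in R
fromJSON('{"hits": 2350, "offset": 0, "time": 72}')
# An array of objects (square brackets) becomes a data frame,
# with one row per object and one column per key
fromJSON('[{"name": "a", "score": 1}, {"name": "b", "score": 2}]')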
Let's repeat the NY Times search for gamergate, but this time we will perform the Web API call by hand instead of using the rtimes wrapper package. We will use the jsonlite package to retrieve the response from the webserver and turn the string response into an R object. The fromJSON function sends our request out across the web to the NY Times webserver, retrieves the response, and turns it from a JSON-formatted string into R data.
Sys.sleep(1)
library(jsonlite)
# Rebuild the URL
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
url <- param_set(url, "q", url_encode("gamergate"))
url <- param_set(url, "api-key", url_encode(rtimes_key))
# Send the request to the webserver over the Internet and
# retrieve the JSON response. Turn the JSON response into an
# R Object.
response_js <- fromJSON(url)
The jsonlite package makes the keys and values of an object available as attributes. For example, we can fetch the status:
response_js$status
## [1] "OK"While some keys in the object are associated with simple values, such as "status", others are associated with more complex data. For example, the key "response" is associated with an object that has two keys: "docs", and "meta". "meta" is another object: { "hits":63, "offset":0, "time":194 }. We can retrieve these nested attributes by sequentially accessing the object keys from the outside in. For example, the inner "hits" attribute would be accessed as follows:
response_js$response$meta$hits
## [1] 68
Exercise 5: Retrieve the data associated with 1) the copyright key of the response_js object, and 2) the time attribute nested within the meta object.
The majority of the data is stored under response, in docs. Notice that docs is a list, where each element of the list is a JSON object that looks like the following:
{
"web_url":"https://www.nytimes.com/2017/06/27/arts/milkshake-duck-meme.html",
"snippet":"Oxford Dictionaries is keeping a close eye on a term that describes someone who rapidly gains and inevitably loses the internet’s intense love.",
"blog":{ },
"source":"The New York Times",
"multimedia":[
... A LIST OF OBJECTS ...
],
"headline":{
"main":"How a Joke Becomes a Meme: The Birth of ‘Milkshake Duck’",
"print_headline":"How a Joke Becomes a Meme: The Birth of ‘Milkshake Duck’"
},
"keywords":[
... A LIST OF OBJECTS ...
],
"pub_date":"2017-06-27T12:24:20+0000",
"document_type":"article",
"new_desk":"Culture",
"byline":{
"original":"By JONAH ENGEL BROMWICH"
},
"type_of_material":"News",
"_id":"59524e7f7c459f257c1ac39f",
"word_count":1033,
"score":0.35532707,
"uri":"nyt://article/a3e5bf4a-6216-5dba-9983-73bc45a98e69"
},
jsonlite makes lists of objects available as a data frame, where the columns are the keys in the object (web_url, snippet, etc.):
docs_df <- response_js$response$docs
class(docs_df)
## [1] "data.frame"
colnames(docs_df)
## [1] "web_url" "snippet" "blog"
## [4] "source" "multimedia" "headline"
## [7] "keywords" "pub_date" "document_type"
## [10] "new_desk" "byline" "type_of_material"
## [13] "_id" "word_count" "score"
## [16] "uri" "print_page" "section_name"
dim(docs_df)
## [1] 10 18
7.3.2.3 Nested Data Frames
Here is some code to generate queries on NY Times articles about the Red Sox. It fetches the first thirty entries in batches of 10.
Sys.sleep(1)
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
url <- param_set(url, "q", url_encode("Red Sox"))
url <- param_set(url, "api-key", url_encode(rtimes_key))
url <- param_set(url, "page",0)
res1 <- fromJSON(url)
Sys.sleep(1)
url <- param_set(url, "page", 1)
res2 <- fromJSON(url)
Sys.sleep(1)
url <- param_set(url, "page", 2)
res3 <- fromJSON(url)
docs1 <- res1$response$docs
docs2 <- res2$response$docs
docs3 <- res3$response$docs
Each of these docs variables is a table with ten entries (articles) and the same 18 variables:
names(docs1)
## [1] "web_url" "snippet" "print_page"
## [4] "blog" "source" "multimedia"
## [7] "headline" "keywords" "pub_date"
## [10] "document_type" "new_desk" "section_name"
## [13] "byline" "type_of_material" "_id"
## [16] "word_count" "score" "uri"Now we want to stack the tables on top of each other to get a single table with 30 rows and 18 variables. If you try the following command:
bind_rows(docs1,docs2,docs3)
then you will get an error saying “Error in bind_rows_(x, .id) : Argument 4 can’t be a list containing data frames.”
What is happening???
Let’s check out the first column of the docs1 table:
docs1$web_url
## [1] "https://www.nytimes.com/2017/10/09/sports/baseball/mlb-playoffs-red-sox-astros.html"
## [2] "https://www.nytimes.com/2017/10/09/sports/red-sox-price-astros-sweep.html"
## [3] "https://www.nytimes.com/2017/11/14/obituaries/bobby-doerr-dead-red-sox.html"
## [4] "https://www.nytimes.com/2017/12/03/sports/shohei-ohtani-yankees.html"
## [5] "https://www.nytimes.com/2017/12/08/obituaries/tracy-stallard-who-gave-up-historic-maris-homer-dies-at-80.html"
## [6] "https://topics.nytimes.com/topic/organization/boston-red-sox"
## [7] "https://www.nytimes.com/2017/10/22/sports/alex-cora-red-sox-manager.html"
## [8] "https://www.nytimes.com/2017/10/11/sports/baseball/john-farrell-boston-red-sox-fired.html"
## [9] "https://www.nytimes.com/2017/10/05/sports/baseball/mlb-playoffs-astros-red-sox.html"
## [10] "https://www.nytimes.com/2017/10/06/sports/baseball/red-sox-rafael-devers-dominican-vitilla.html"It lists the web addresses of the first ten sites returned in the search. It is a vector of ten character string, which is just fine for one column of data in our table.
Now let’s check out the headline variable:
docs1$headline
## main
## 1 With Justin Verlander in Relief, Astros Oust Red Sox in A.L.D.S.
## 2 Red Sox Rouse Themselves to Beat Astros and Avoid Another Playoff Sweep
## 3 Bobby Doerr, 99, Red Sox Hall of Fame Second Baseman, Is Dead
## 4 Shohei Ohtani Spurns the Yankees, Seeking a Smaller Market
## 5 Tracy Stallard, Who Gave Up Historic Maris Homer, Dies at 80
## 6 Boston Red Sox
## 7 Alex Cora Hired as Red Sox’ Manager
## 8 Red Sox Fire Manager John Farrell
## 9 Jose Altuve Flexes His Muscles and the Astros Flatten Chris Sale
## 10 Dominican Players Sharpen Their Skills With a Broomstick and Bottle Cap
## kicker
## 1 Astros 5, Red Sox 4
## 2 Red Sox 10, Astros 3
## 3
## 4 <NA>
## 5 <NA>
## 6 <NA>
## 7 <NA>
## 8 <NA>
## 9 Astros 8, Red Sox 2 | Houston Leads Series, 1-0
## 10 <NA>
## content_kicker
## 1 Astros 5, Red Sox 4
## 2 Red Sox 10, Astros 3
## 3
## 4 <NA>
## 5 <NA>
## 6 <NA>
## 7 <NA>
## 8 <NA>
## 9 Astros 8, Red Sox 2 | Houston Leads Series, 1-0
## 10 <NA>
## print_headline
## 1 With Aces Showing Up in Relief, Astros Finish Off the Red Sox’ Season
## 2 Red Sox Snap Out of Their Doldrums on Offense and Avoid a Sweep
## 3 Bobby Doerr, 99, Second Baseman For Red Sox and Hall of Famer, Dies
## 4 Japanese Star Declines to Sign With Yankees
## 5 Tracy Stallard, 80; Gave Up a Historic Homer
## 6 <NA>
## 7 In a First For Boston, A Latino Is Manager
## 8 Red Sox Fire Manager John Farrell
## 9 Altuve Makes a Big Impression as the Astros Roll Over the Red Sox and Sale
## 10 Dominican Players Hone Skills With AidOf Broomsticks
## name seo sub
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## 7 NA NA NA
## 8 NA NA NA
## 9 NA NA NA
## 10 NA NA NA
The headline variable is actually a data frame of its own, containing columns such as main, kicker, and print_headline. That is, we have nested data frames. This is a common problem when scraping data from JSON files, and it is why we are not able to directly bind the rows of our three tables on top of each other.
We can check out the type of variable in each column with the class function:
sapply(docs1,class)
## web_url snippet print_page blog
## "character" "character" "character" "data.frame"
## source multimedia headline keywords
## "character" "list" "data.frame" "list"
## pub_date document_type new_desk section_name
## "character" "character" "character" "character"
## byline type_of_material _id word_count
## "data.frame" "character" "character" "integer"
## score uri
## "numeric" "character"We see that blog, headline, and byline are the three problem columns that each contain their own data frames.
The solution is to flatten these variables, which generates a new column in the outer table for each of the columns in the inner tables.
docs1_flat<-flatten(docs1)
names(docs1_flat)
## [1] "web_url" "snippet"
## [3] "print_page" "source"
## [5] "multimedia" "keywords"
## [7] "pub_date" "document_type"
## [9] "new_desk" "section_name"
## [11] "type_of_material" "_id"
## [13] "word_count" "score"
## [15] "uri" "headline.main"
## [17] "headline.kicker" "headline.content_kicker"
## [19] "headline.print_headline" "headline.name"
## [21] "headline.seo" "headline.sub"
## [23] "byline.original" "byline.person"
## [25] "byline.organization"
sapply(docs1_flat,class)
## web_url snippet print_page
## "character" "character" "character"
## source multimedia keywords
## "character" "list" "list"
## pub_date document_type new_desk
## "character" "character" "character"
## section_name type_of_material _id
## "character" "character" "character"
## word_count score uri
## "integer" "numeric" "character"
## headline.main headline.kicker headline.content_kicker
## "character" "character" "character"
## headline.print_headline headline.name headline.seo
## "character" "logical" "logical"
## headline.sub byline.original byline.person
## "logical" "character" "list"
## byline.organization
## "character"The headline variable is now replaced with three separate columns for headline.main, headline.kicker, and headline.print_headline. The byline variable actually only had one column in its inner table, which was called original, so now we have byline.original. The blog variable contained an empty data frame, so it has been removed. The overall result is a new flat table with 19 columns, and no more nested data frames.
Once the data is flattened, we can bind rows:
all_docs<-bind_rows(flatten(docs1),flatten(docs2),flatten(docs3))
dim(all_docs)
Exercise 6:
- Select your own article search query that interests you. You may want to play with NY Times online search or the API web search console to find a query that is interesting, but not overly popular. You can change any part of the query you would like. Your query should have at least 30 matches.
- Retrieve data for the first three pages of search results from the article search API. Hint: repeat the process we used above three times for your search query.
- Create a data frame that joins together the docs data frames (as shown above) for the three pages of results. Hint: remember the bind_rows function we used earlier.
- Visualize the number of search results per day or month in your result set.
Exercise 7:
Browse Todd Motto's list of Public APIs and abhishekbanthia's list of Public APIs. Your assignment is to visualize data from two of these APIs. This is an individual assignment, but you will need to coordinate with your group.
There are some constraints on the APIs you may choose:
- At least one API must be closely related to your project, but ideally both will have a (possibly tenuous) connection. This will make your life easier when you start your project!
- You must use two different API services. For example, you personally cannot use two different API calls to the NY Times.
- You cannot analyze the same exact API call as anybody else in your group (e.g. NY Times article search). Note that your group can analyze different API calls from the same service (e.g. NY Times article search vs. NY Times geo API)
For each API:
- Sketch out one interesting visualization that relies on the public API you selected earlier. Make sure the exact data you need is available. If it’s not, try a new visualization or API.
- If a wrapper package is available, you may use it, but you should also try to create the request URL and retrieve the JSON data using the techniques we showed earlier, without the wrapper package.
- Visualize the data you collected and describe the results.
Here are a few criteria you should consider when looking for APIs:
- You may use a wrapper package for at most one of the APIs.
- You must use the JSON approach we illustrated above for at least one of the APIs; not all APIs support JSON.22
- Stay away from APIs that require OAuth for authorization unless you are prepared for extra work before you get data! Most of the large social APIs (Facebook, LinkedIn, Twitter, etc.) require OAuth. Todd Motto's page lists this explicitly, but you'll need to dig a bit if the API is only on abhishekbanthia's list.
- You will probably need to explore several different APIs before you find one that works well for your interests and this assignment.
- Beware of the rate limits associated with the API you choose. These determine the maximum number of API calls you can make per second, hour, or day. Though these are not always officially published, you can find them by googling, for example, GitHub API rate limit. If you need to slow your program down to stay within the rate limit, insert calls to Sys.sleep(1) as we did above for the NY Times API.
7.4 Data Acquisition: Scraping
Possible Readings:
- https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/
- https://css-tricks.com/how-css-selectors-work/
Required modules:
library(rvest)
library(tidyverse)
While a great deal of data is available via Web APIs, not all of it is. Programs can use a process called web scraping to collect data that is available to humans (via web browsers) but not computer programs.
7.4.1 Finding CSS Selectors
In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the Macalester Registrar’s Fall 2017 Class Schedule you can visually see that the data is represented in a table. The first column shows the course number, the second the title, etc.
Figure 7.3: Subset of Macalester registration data for Fall 2017.
We will identify data in a webpage using a pattern matching language called CSS selectors, which can refer to specific patterns in HTML, the language used to write web pages. For example, the CSS selector "a" selects all hyperlinks in a webpage ("a" stands for "anchor" in HTML), while "table > tr > td:nth-child(2)" would find the second column of an HTML table.
Although you can learn how to write CSS selectors by hand, we will use a shortcut: installing the Selector Gadget for Chrome.23 You "teach" the Selector Gadget which data you are interested in on a web page, and it will show you the CSS selector for that data. We will eventually use this selector in R.
**Warning**: Websites change! Often! So if you are going to scrape a lot of data, it is probably worthwhile to save and date a copy of the website. Otherwise, you may return after some time and your scraping code will include all of the wrong CSS selectors.
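One lightweight way to do this is to download the page once and scrape the saved copy; a sketch (the file name is hypothetical):
# Save a dated copy of the page...
download.file("https://www.macalester.edu/registrar/schedules/2017fall/class-schedule/",
              destfile = "class-schedule-2017-10-01.html")
# ...and later scrape the local file instead of the live site
fall2017_saved <- rvest::read_html("class-schedule-2017-10-01.html")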
First watch the Selector Gadget video and install the plugin.
Figure 7.4: Selector Gadget icon.
Head over to the Macalester Registrar’s fall 2017 class schedule. Click the selector gadget icon in the top right corner of Chrome. As you mouse over the webpage, different parts will be highlighted in orange. Click on the first course number, AMST 101-01. You’ll notice that the Selector Gadget information in the lower right describes what you clicked on:
Figure 7.5: The Selector Gadget in action: Mousing over the course number (left), and the result pane after clicking on the course number (right).
Scroll through the page to verify that only the information you intend (the course number) is selected. The selector panel shows the CSS selector (.class-schedule-course-number) and the number of matches for that CSS selector (758).
Now that we have the selector for the course number, let's find the selector for the days of the week. Clear the selector by clicking the "Clear" button on the result pane, then click the W under days for AMST 101-01. You will notice that the selector is too broad and highlights information we don't want. You need to teach Selector Gadget a correct selector by clicking the information you don't want so that it turns red. Once this is done, you should have 758 matches and a CSS selector of .class-schedule-course-title + .class-schedule-label.
Figure 7.6: Results for the Selector Gadget after selecting the day field.
Exercise 1: Repeat the process above to find the correct selectors for the following fields. Make sure that each matches 758 results:
- Course Name
- Course Time
- Room
- Instructor
- Avail. / Max
- Course Description (make sure you only match 758!)
- General Education Requirements (make sure you only match 758, beware of the Mac copyright banner at the bottom of the page!)
- Distribution Requirements (make sure you only match 758!)
Answers: (others are possible, depending on how you teach Selector Gadget)
- Course Name: .class-schedule-course-title
- Course Time: .class-schedule-label:nth-child(4)
- Room: .class-schedule-label:nth-child(5)
- Instructor: .class-schedule-label:nth-child(6)
- Avail. / Max: .class-schedule-label:nth-child(7)
- Course Description: .collapsed p:nth-child(1)
- General Education Requirements: #content p:nth-child(2)
- Distribution Requirements: .collapsed p:nth-child(3)
7.4.2 Retrieving Data Using RVest & CSS Selector
Now that we have identified CSS selectors for the information we need, let’s fetch the data in R. We will be using the rvest package, which retrieves information from a webpage and turns it into R data frames:
library(tidyverse)
library(rvest)
fall2017 <- read_html("https://www.macalester.edu/registrar/schedules/2017fall/class-schedule/#crs10008")
Once the webpage is loaded, we can retrieve data using the CSS selectors we specified earlier. The following code retrieves the course numbers and names as vectors:
# Retrieve and inspect course numbers
course_nums <-
fall2017 %>%
html_nodes(".class-schedule-course-number") %>%
html_text()
head(course_nums)
## [1] "AMST 101-01" "AMST 103-01" "AMST 200-01" "AMST 203-01" "AMST 219-01"
## [6] "AMST 229-01"
# Retrieve and inspect course names
course_names <-
fall2017 %>%
html_nodes(".class-schedule-course-title") %>%
html_text()
head(course_names)
## [1] "Explorations of Race and Racism"
## [2] "The Problems of Race in US Social Thought and Policy"
## [3] "Critical Methods for American Studies Research"
## [4] "Politics and Inequality: American Welfare State"
## [5] "In Motion: African Americans in the United States"
## [6] "Narrating Black Women's Resistance"
course_df <- data.frame(number=course_nums, name=course_names)
head(course_df)
## number name
## 1 AMST 101-01 Explorations of Race and Racism
## 2 AMST 103-01 The Problems of Race in US Social Thought and Policy
## 3 AMST 200-01 Critical Methods for American Studies Research
## 4 AMST 203-01 Politics and Inequality: American Welfare State
## 5 AMST 219-01 In Motion: African Americans in the United States
## 6 AMST 229-01 Narrating Black Women's Resistance
What happens when we try to grab the instructors in the same manner?
course_instructors <-
fall2017 %>%
html_nodes(".class-schedule-label:nth-child(6)") %>%
html_text()
head(course_instructors)
## [1] "Instructor: Gutierrez, Harris"
## [2] "Instructor: Karin Aguilar-San Juan"
## [3] "Instructor: Nathan Titman"
## [4] "Instructor: Lesley Lavery"
## [5] "Instructor: Crystal Moten"
## [6] "Instructor: Crystal Moten"In front of each entry is “Instructor:”, which we don’t really need stored in every entry of our data frame. This is because the website is set up in a responsive manner to change when the browser window is narrowed or you are on a mobile device.
Figure 7.7: View of the schedule when the browser window is narrower.
To clean up the data, we can chop off the first 12 characters of each entry using the command substr:24
course_instructors_short<-substr(course_instructors,start=13,stop=nchar(course_instructors))
head(course_instructors_short)
## [1] "Gutierrez, Harris" "Karin Aguilar-San Juan"
## [3] "Nathan Titman" "Lesley Lavery"
## [5] "Crystal Moten" "Crystal Moten"Exercise 2: Create a data frame that contains all the information about courses you found selectors for earlier (9 columns). Do not include any extraneous information like “Instructor:”.
Exercise 3: Create a chart that shows the number of sections offered per department. Hint: The department is a substring of the course number.25
Exercise 4: Analyze the typical length of course names by department. To do so, create a dplyr pipeline that creates a new data frame based on your courses dataframe, with the following changes:
- New columns for the length of the title of a course and the length of the description of the course. Hint: nchar.
- Remove departments that have fewer than 10 sections of courses. To do so, group by department, then remove observations in groups with fewer than 10 sections (Hint: use filter with n()). Then ungroup the data so it flattens back into a regular dataframe. This is one of the rare cases when we will use group_by without summarize.
- Create a visualization of the differences across groups in lengths of course names or course descriptions. Think carefully about the visualization you should be using!
7.4.3 Additional Exercises: Analyze Alexa Top Ranks
This is an open-ended activity that asks you to analyze data from Alexa’s Top Sites. For this task, you will need to scrape at least one dataset from the Alexa site.
You are welcome to pursue an analysis that interests you. Here are some examples you might choose:
What is the relationship between the four primary data columns shown on the site?
How much do countries agree on their top sites? Do certain countries agree more often? You may want to limit your approach by sampling countries.
Note: The Selector Gadget isn’t very good at creating a CSS selector for the Site name. You can retrieve the name using the selector .DescriptionCell p:first-child a.
12. Another option for part (d) would be to leave them as strings and then use string processing to define the levels. We'll learn this technique soon.↩
13. This is dangerous unless you know what you are doing.↩
14. A word of caution: It can be difficult to discover and understand information in a data warehouse that is relevant to a project. Data is often produced across different arms of a large organization, and documentation describing the data can be scattered, missing, or out-of-date. The best way to decode information is by talking to the people who produced it!↩
15. An alternative is to use an R package like DBI to generate the SQL calls directly in R (similar to what we did with API wrapper packages); however, this often results in more trouble than it is worth.↩
16. The StackExchange data explorer uses a variant of SQL called Transact-SQL (T-SQL) that is supported by Microsoft databases. TOP(100) is a non-standard SQL feature supported by T-SQL. For most databases you would accomplish the same goal by adding LIMIT 100 to the end of the query.↩
17. Note that whereas we used the double equals == for comparison in R, the SQL WHERE command takes just a single =.↩
18. To count the number of posts, run SELECT COUNT(Id) FROM Posts WHERE PostTypeId = 1.↩
19. Warning: Setting all_results to TRUE asks rtimes to request each page individually. In the case of our query, this is 4 separate API requests to the New York Times. Recall that you are limited to 1000 API calls per day, so a query such as Obama or Trump would immediately use up our entire quota for the day. You can include cache=TRUE in the code chunk header to tell R to remember the results of the block after the first time you use it, so that it will not redo the query each time you knit your file.↩
20. You can either use substring or functions from the lubridate package.↩
21. Although we imply that a Web API call corresponds to a single function on the webserver, this is not necessarily the case. Still, we use this language because the analogy fits well.↩
22. If you want to use an API that does not support JSON, you can check if there is an R wrapper package.↩
23. Yes, you must have the Chrome web browser installed to do this!↩
24. This is a simple example of text processing. We'll learn how to deal with more complicated situations in the text processing unit.↩
25. Yes, COMP and MATH are the same department, but for this exercise you can just show the results by four letter department code, e.g., with COMP and MATH separate.↩