2 Activities: Introductory

2.1 Getting Started in RStudio

2.1.1 Review + Assignment

As you might guess from the name, “Data Science” requires data. Working with modern (large, messy) data sets requires statistical software. We’ll exclusively use RStudio. Why?

  • it’s free
  • it’s open source (the code is free & anybody can contribute to it)
  • it has a huge online community (which is helpful for when you get stuck)
  • it’s one of the industry standards
  • it can be used to create reproducible and lovely documents (In fact, this entire course manual that you’re currently reading was constructed entirely within RStudio!)



Download R & RStudio

To get started, take the following two steps in the given order. Even if you already have R/RStudio, make sure to update to the most recent versions. Further, if you get stuck, visit the ITS help desk.
STEP 1: Download & install the R statistical software at https://mirror.las.iastate.edu/CRAN/
STEP 2: Download & install the FREE version of RStudio at https://www.rstudio.com/products/rstudio/download/

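What’s the difference between R and RStudio? R is the statistical programming language that actually runs your code. RStudio is an interface that sits on top of R and makes it easier to write code, view plots, and manage files. RStudio requires R, so it does everything R does and more. We will be using RStudio exclusively.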
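
After installing or updating, one quick way to confirm which version of R your RStudio session is running is to ask R directly. This is only a minimal sketch: `R.version.string` and `R.version` are part of base R, and the exact text printed depends on your installation.

```r
# Print the version of R that this session is using
R.version.string

# The same information, broken into its major and minor version numbers
R.version$major
R.version$minor
```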




A quick tour of RStudio
Open RStudio! You should see four panes, each serving a different purpose.
You also watched a short video tour of RStudio that summarized some basic features of the console.





  1. Warm-Up
    1. Perform a simple calculation: calculate 90/3.
    2. RStudio has built-in functions to which we supply the necessary arguments: function(arguments). Use a built-in function to calculate the square root of 25.
    3. Use a built-in function to repeat the number “5” 8 times.
    4. Use the seq function to create the vector (0, 3, 6, 9, 12). (The video doesn’t cover this!)
    5. Repeat this vector 3 times.
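
If the function(arguments) pattern is new to you, the sketch below shows it with different numbers than the exercises above, so it is an illustration of the syntax rather than a set of answers. All functions used are part of base R.

```r
# Basic arithmetic works like a calculator
7 * 6

# Functions take their arguments inside parentheses: function(arguments)
sqrt(16)                           # square root of 16
rep(2, times = 3)                  # the number 2, repeated 3 times
seq(from = 10, to = 50, by = 10)   # 10, 20, 30, 40, 50

# Functions can be nested: the output of one becomes an argument of another
rep(seq(from = 1, to = 2, by = 1), times = 2)   # 1, 2, 1, 2
```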



  1. Assignment
    We often want to store our output for later use (why?). The basic idea in RStudio:
    name <- output

    Try the following syntax line by line. NOTE: RStudio ignores any content after the #. Thus we use this to ‘comment’ and organize our code.

    #type square_3
    square_3
    
    #calculate 3 squared
    3^2    
    
    #store this as "square_3"
    square_3 <- 3^2    
    
    #type square_3 again!
    square_3
    
    #do some math with square_3
    square_3 + 2
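
One reason to store output is that later steps can build on it without retyping the calculation. Here is a small sketch of that idea. The names `radius` and `area` are made up for this illustration.

```r
# Store an intermediate result once...
radius <- 2
area <- pi * radius^2

# ...then reuse it in later calculations without retyping the formula
area
area * 10   # e.g. the volume of a cylinder with this base and height 10

# Assigning to an existing name overwrites the old value
radius <- 3
radius
area        # still based on radius 2: stored values do not update themselves
```

Notice the last line: changing `radius` does not change `area`. To update `area`, we would need to rerun the line that creates it.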











2.1.2 Tidy Data

Not only does “Data Science” require statistical software, it requires DATA! Consider the Google definition:






Types of Data

With this definition in mind, which of the following are examples of data?

  • tables

    ##   family father mother sex height nkids
    ## 1      1   78.5   67.0   M   73.2     4
    ## 2      1   78.5   67.0   F   69.2     4
    ## 3      1   78.5   67.0   F   69.0     4
    ## 4      1   78.5   67.0   F   69.0     4
    ## 5      2   75.5   66.5   M   73.5     4
    ## 6      2   75.5   66.5   M   72.5     4
  • photo

  • video

  • text / tweets






Converting to Data Tables
We’ll mostly work with data that look like this:

##   family father mother sex height nkids
## 1      1   78.5   67.0   M   73.2     4
## 2      1   78.5   67.0   F   69.2     4
## 3      1   78.5   67.0   F   69.0     4
## 4      1   78.5   67.0   F   69.0     4
## 5      2   75.5   66.5   M   73.5     4
## 6      2   75.5   66.5   M   72.5     4

This isn’t as restrictive as it seems. How can we convert the above signals, photos, videos, and text to a data table format?
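
As one possible illustration for text, we might give each tweet its own row and record features of the text as variables. The sketch below is only one of many reasonable choices. The tweets are made up, and the column names (`tweet_id`, `n_characters`, `n_words`) are hypothetical.

```r
# Three made-up tweets (hypothetical text, for illustration only)
tweets <- data.frame(
  tweet_id = 1:3,
  text = c("I love data science",
           "R is fun",
           "Tidy data make analysis easier"),
  stringsAsFactors = FALSE
)

# Derive quantitative variables from the raw text
tweets$n_characters <- nchar(tweets$text)                # characters per tweet
tweets$n_words <- lengths(strsplit(tweets$text, " "))    # words per tweet

# Each row is now one tweet, and each column is one variable
tweets
```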





Example

After a scandal among FIFA officials, fivethirtyeight.com posted an analysis of FIFA viewership, “How to Break FIFA.” Here’s a snapshot of the data used in this article:

country         confederation  population_share  tv_audience_share  gdp_weighted_share
United States   CONCACAF                    4.5                4.3                11.3
Japan           AFC                         1.9                4.9                 9.1
China           AFC                        19.5               14.8                 7.3
Germany         UEFA                        1.2                2.9                 6.3
Brazil          CONMEBOL                    2.8                7.1                 5.4
United Kingdom  UEFA                        0.9                2.1                 4.2
Italy           UEFA                        0.9                2.1                 4.0
France          UEFA                        0.9                2.0                 4.0
Russia          UEFA                        2.1                3.1                 3.5
Spain           UEFA                        0.7                1.8                 3.1







Tidy Data

The data table above is in tidy format. Tidy data tables have three key features:

  1. Each row represents a unit of observation.
  2. Each column represents a variable (i.e., an attribute of the cases that can vary from case to case). Each variable is 1 of 2 types:
    • quantitative = numerical
    • categorical = discrete possibilities/categories
  3. The table contains only data: no analysis, summaries, footnotes, comments, etc.
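
To see these features in RStudio, the sketch below types in the first three rows of the height table shown earlier and checks how each column is stored. It assumes that each row describes one child. The name `kids` is made up for this illustration.

```r
# First three rows of the height table shown earlier, typed in by hand
kids <- data.frame(
  family = c(1, 1, 1),
  father = c(78.5, 78.5, 78.5),
  mother = c(67.0, 67.0, 67.0),
  sex    = c("M", "F", "F"),
  height = c(73.2, 69.2, 69.0),
  nkids  = c(4, 4, 4),
  stringsAsFactors = FALSE
)

# Each row is one unit of observation (assumed here to be one child)
nrow(kids)

# Each column is one variable; check which are stored as numbers
sapply(kids, is.numeric)
```

Being stored as a number does not automatically make a variable quantitative. For example, `family` is stored as a number here but works as an ID label, so it is worth asking what each variable means, not only how it is stored.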




  1. Units of observation & Variables
    1. What are the units of observation in the FIFA data?
    2. What are the variables? Which are quantitative? Which are categorical?
    3. Are these tidy data?



  1. Tidy vs Untidy
    Check out the following data. Explain why they are untidy and how we can tidy them.
    1. Data 1: FIFA

      country        confederation  population share  tv_share
      United States  CONCACAF       i don’t know*     4.3%      *look up later
      Japan          AFC            1.9               4.9%
      China          AFC            19.5              14.8%
                                                      total=24%
    2. Data 2: Gapminder life expectancies by country

              country      1952  1957  1962
      Asia    Afghanistan  28.8  30.3  32.0
              Bahrain      50.9  53.8  56.9
      Africa  Algeria      43.0  45.7  48.3
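
For the curious, here is a sketch of one possible tidy arrangement of the Gapminder snippet, in which each row is a country in a given year. It uses base R’s `reshape()` function, which is not necessarily the tool we will use later in the course. The column names `continent`, `year`, and `life_exp` are assumptions made for this illustration.

```r
# The untidy table, typed in by hand with the continent filled in on every row
wide <- data.frame(
  continent = c("Asia", "Asia", "Africa"),
  country   = c("Afghanistan", "Bahrain", "Algeria"),
  `1952`    = c(28.8, 50.9, 43.0),
  `1957`    = c(30.3, 53.8, 45.7),
  `1962`    = c(32.0, 56.9, 48.3),
  check.names = FALSE   # keep the year column names exactly as typed
)

# Stack the three year columns into two variables: year and life_exp
long <- reshape(
  wide,
  direction = "long",
  varying   = c("1952", "1957", "1962"),   # columns to stack
  v.names   = "life_exp",                  # name for the stacked values
  timevar   = "year",                      # name for the new year variable
  times     = c(1952, 1957, 1962),         # value of year for each stacked column
  idvar     = "country"                    # identifies each unit in the wide table
)

# 3 countries x 3 years = 9 rows, one per country-year
long
```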









2.1.3 Data Basics in RStudio

For now, we’ll focus on tidy data. In a couple of weeks, you’ll learn how to turn untidy data into tidy data.



  1. Import data
    The first step to working with data in RStudio is getting it in there! How we do this depends on its format (eg: Excel spreadsheet, csv file, txt file) and storage location (eg: online, within Wiki, desktop). Luckily for us, the fifa_audience data are stored in the fivethirtyeight R package.

    #load the fivethirtyeight package
    library(fivethirtyeight)
    
    #load the fifa data
    data("fifa_audience")
    
    #store this under a shorter, easier name
    fifa <- fifa_audience
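
Two practical notes. First, `library(fivethirtyeight)` only works once the package has been installed, which you can do one time with `install.packages("fivethirtyeight")`. Second, data stored in other formats need a different import function. As a hedged sketch, the code below writes a tiny csv file (three rows copied from the FIFA snapshot above) to a temporary location and reads it back with base R’s `read.csv()`. The names `csv_path` and `fifa_small` are made up for this illustration.

```r
# Write a tiny csv file to a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,confederation,population_share",
  "United States,CONCACAF,4.5",
  "Japan,AFC,1.9",
  "China,AFC,19.5"
), csv_path)

# Import the csv file into R as a data table
fifa_small <- read.csv(csv_path, stringsAsFactors = FALSE)
fifa_small
```

In practice, you would replace `csv_path` with the location of your own file. `read.csv()` also accepts a web address in place of a file path.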



  1. Examining data structure
    Before we can analyze our data, we must understand its structure. Try out the following functions. For each, write a comment after # that describes its action.

    #(what does View do?)
    View(fifa)  
    
    #(what does head do?)
    head(fifa)  
    
    #(what does dim do?)
    dim(fifa)           
    
    #(what does names do?)
    names(fifa)         
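
A few other base R functions can also help when exploring structure. The sketch below tries them on a small hand-typed table (three rows copied from the FIFA snapshot above) so that it runs without any packages. The name `fifa_small` is made up for this illustration.

```r
# A small hand-typed table to experiment with
fifa_small <- data.frame(
  country           = c("United States", "Japan", "China"),
  confederation     = c("CONCACAF", "AFC", "AFC"),
  tv_audience_share = c(4.3, 4.9, 14.8),
  stringsAsFactors  = FALSE
)

nrow(fifa_small)      # number of rows (units of observation)
ncol(fifa_small)      # number of columns (variables)
str(fifa_small)       # compact overview: each variable's type and first values
summary(fifa_small)   # quick summary of each variable
```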



  1. Codebooks
    Data are also only useful if we know what they measure! The fifa data table is tidy – it doesn’t have any helpful notes. Rather, information about the data is stored in a separate codebook. Codebooks can be stored in many ways (eg: Google docs, Word docs, etc). Here the authors have made their codebook available in RStudio (under the original fifa_audience name). Check it out:

    ?fifa_audience
    1. What does population_share measure?
    2. What are the units of population_share?



  1. Examining a single variable
    1. We might want to access & focus on a single variable. To this end, we can use the $ notation:

      fifa$tv_audience_share
      fifa$confederation
    2. It’s important to understand the format/class of each variable (quantitative, categorical, date, etc) in both its meaning and its structure within RStudio:

      class(fifa$tv_audience_share)
      class(fifa$confederation)
    3. If a variable is categorical (either in character or factor format), we can determine its levels / category labels:

      levels(fifa$confederation)
      levels(factor(fifa$confederation))
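
The two `levels()` calls above can give different results. The sketch below shows why, assuming (as those two calls suggest) that the variable is stored in character format. It uses the confederations from the first five rows of the FIFA snapshot, typed in by hand. The names `confed` and `confed_factor` are made up for this illustration.

```r
# Confederations from the first five rows of the FIFA snapshot
confed <- c("CONCACAF", "AFC", "AFC", "UEFA", "CONMEBOL")

class(confed)     # "character"
levels(confed)    # NULL: character vectors have no levels

# Converting to a factor creates the levels
confed_factor <- factor(confed)
class(confed_factor)     # "factor"
levels(confed_factor)    # the distinct categories, in alphabetical order

# Other ways to see the categories of a character variable
unique(confed)    # distinct values, in order of first appearance
table(confed)     # how many times each category occurs
```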




  1. New data!
    There’s a data set named comic_characters in the fivethirtyeight package.
    1. Load the data.
    2. Give a 1 sentence summary of what these data measure. (HINT: codebook!)
    3. What are the units of observation? How many observations are there?
    4. Examine the first rows of the data set.
    5. What’s the class of the date variable?
    6. Get a list of all variable names.













2.2 R Markdown and Reproducible Research

Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. - Reproducible Research, Coursera


Research often makes claims that are difficult to verify. A recent study of published psychology articles found that fewer than half of the published claims could be reproduced. One of the most common reasons claims cannot be reproduced is confusion about the data analysis. It may be unclear exactly how the data were prepared and analyzed, or there may be a mistake in the analysis.

In this course we will use an innovative new format called R Markdown that dramatically increases the transparency of data analysis. R Markdown interleaves data, R code, graphs, tables, and text, and packages them in an easily writable and publishable format.

To use R Markdown, you will write an R Markdown formatted file in RStudio and then ask RStudio to knit it into an HTML document (or occasionally a PDF or MS Word document). For an example, take a look at this Sample RMarkdown and the HTML webpage it creates.
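
Behind the scenes, knitting runs a function from the rmarkdown package. The sketch below performs the same step with code instead of the Knit button: it writes a tiny R Markdown file to a temporary location and knits it to HTML. The file content is made up for this illustration, and the knitting step assumes that the rmarkdown package and pandoc are available (both normally come with a standard RStudio setup), so the code checks for them first.

```r
# Three backticks mark the start and end of an R code chunk
fence <- strrep("`", 3)

# Lines of a tiny R Markdown file (hypothetical content, for illustration only)
rmd_lines <- c(
  "---",
  "title: \"A tiny example\"",
  "output: html_document",
  "---",
  "",
  "Some ordinary text.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
)

# Save the lines to a temporary .Rmd file
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Knit the file to HTML, but only if the rmarkdown package and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  html_path   # location of the knitted HTML file
}
```

In day-to-day work the Knit button is all you need. The code version is mainly useful for seeing what that button does.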


  1. Deduce the R Markdown Format
    Look at the Rmd and the HTML page linked above, side by side.

    1. How are bullets, italics, and section headers represented in the R Markdown file?

    2. How does R code appear in the R Markdown file?

    3. In the HTML webpage, do you see the R code, the output of the R code, or both?


Now take a look at the R Markdown Cheatsheet. Look up the R Markdown features from the previous question on the cheatsheet. There’s a great deal more information there.

Complete the following. If you get stuck along the way, refer to the R Markdown Cheatsheet linked above, search the web for answers, or ask for help!


  1. Create your first R Markdown file
    Create a new R Markdown document about your favorite food.
    1. Create a new file in RStudio (File -> New File -> R Markdown) called First Markdown.
    2. Make sure you can compile (Knit) the Markdown into a webpage.
    3. Create a brief essay about your favorite food. Make sure to include:
      • Two sections
      • A picture from the web
      • A bullet list
      • A numbered list
    4. Add R code to your R Markdown
      • Print the dimensions of the bechdel dataset (hint: the bechdel data are available in the fivethirtyeight package)
      • Create a second chunk that prints the first few rows of the bechdel dataset.
    5. Compile the document.