3 Activities: Visualization

3.1 Introduction to Data Visualization



3.1.1 Motivation

Check out this excerpt from the (2017) Wiki page on Global warming:

Or this excerpt of temperature data from NOAA:




The Wiki text and NOAA table both provide us with data on global warming. Yet visualizations would make these excerpts even more powerful! Why?

  • Visualizations help us understand what we’re working with: What are the scales of our variables? Are there any outliers, i.e. unusual cases? What are the patterns among our variables?
  • This understanding will inform our next steps: What method of analysis / model is appropriate?
  • Once our analysis is complete, visualizations are a powerful way to communicate our findings and tell a story.



Consider a bunch of different visualizations from the Wiki page (some more successful than others):






Image: Temperature over time (left) and 2010 greenhouse gas emissions by sector (right)





Image: Changes in glacier thickness (left) and global warming projections (right)





Image: Projected change in annual temperature, late 20th - mid 21st century (left) and a 6 foot sea level rise (right)







Beyond the traditional visualization of climate change:







More Examples





3.1.2 Features of Good (& Bad) Visualizations

As the “One dataset visualized 25 ways” example demonstrates, there’s not one right way to visualize a dataset. However, there are guiding principles that distinguish between “good” and “bad” graphics. It’s easy to be a critic, so let’s start with some bad visualizations. For each visualization below, identify areas for improvement. (NOTE: You can find more examples of bad viz at WTF Visualizations.)





Images: http://viz.wtf/











A better plot of changes in global temperature:













Properties of Effective Visualizations

  1. Content
    Display meaningful information.

  2. Design
    Minimize ambiguity: provide scales, label axes, etc
    Eliminate “chart junk” (distracting & unnecessary adornments)

  3. Ethics
    Do not present the data in a way that misleads the audience.




With this in mind, let’s examine what Edward Tufte, a noted data viz expert, considers to be “probably the best statistical graphic ever drawn” (drawn by Charles Minard; you can even buy a print for your dorm wall):









3.1.3 Grammar of Graphics

In this course we’ll largely construct visualizations using the ggplot function in RStudio. NOTE: gg is short for “grammar of graphics”. Though the ggplot learning curve can be steep, its grammar is intuitive and generalizable once mastered. Let’s explore the concepts behind this grammar before diving into the syntax.

The following plots represent the different components of graphics in general & ggplot in particular:



Figure 3.1: Blood pressure readings from a random subset of the NHANES data set.




Components of a graphic

  • frames
    The position scale describing how data are mapped to the x and y axes.


  • glyphs
    The basic graphical unit that represents a piece of information. Other terms used include mark and symbol. In its original sense, in archeology, a glyph is a carved symbol. For example, a hieroglyph:
    Hieroglyph
    A data glyph is also a mark that encodes the value of a variable or relationship among variables:


  • aesthetic
    a visual property of a glyph such as position, size, shape, color, etc.
    may be mapped based on data values: smoker -> color
    may be set to particular non-data related values: color is black


  • facet
    a subplot that separates a single graph into multiple graphs, one per subset of the data


  • scale
    a mapping that translates data values into aesthetics.
    example: never-> pink; former-> aqua; current-> green


  • guide
    An indication for the human viewer of the scale. This allows the viewer to translate aesthetics back into data values.
    examples: x- and y-axes, various sorts of legends
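The components above can be sketched in a single ggplot call. This is only an illustration: the `toy` data frame and its values are made up (it is not the NHANES data), and since R has no built-in color named "aqua", "cyan" stands in for it.

```r
library(ggplot2)

# Hypothetical toy data (NOT the NHANES data in Figure 3.1)
toy <- data.frame(
  age      = c(25, 40, 55, 33, 61, 47),
  systolic = c(112, 121, 135, 118, 142, 128),
  smoker   = c("never", "former", "current", "never", "current", "former"),
  sex      = c("female", "male", "female", "male", "female", "male")
)

p <- ggplot(toy, aes(x = age, y = systolic)) +       # frame: variables on the x and y axes
  geom_point(aes(color = smoker),                    # glyph: points; aesthetic mapped to data
             size = 3) +                             # aesthetic set to a fixed, non-data value
  scale_color_manual(                                # scale: data values -> colors
    values = c(never = "pink", former = "cyan", current = "green")
  ) +
  facet_wrap(~ sex) +                                # facet: one subplot per subset
  labs(x = "Age (years)",                            # guides: axis titles and legend title
       y = "Systolic blood pressure",
       color = "Smoking status")
p
```

Removing one layer at a time (for example, the `facet_wrap()` line) is a quick way to see what each component contributes.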







EXERCISE: Eye Training for the Layered Grammar of Graphics

For your assigned graphic, discuss the following questions with your group:

  • Frame
    What variables constitute the frame?

  • Glyphs
    What glyphs are used?

  • Aesthetics
    • What are the aesthetics for those glyphs?
    • Which variable is mapped to each aesthetic?
  • Facets
    • Are facets used?
    • If so, which variable is used for faceting?
  • Scales & Guides
    Which scales are displayed with a guide?

  • Data
    What raw data would be required for this plot, and what form should it be in?


Here are the graphics examples, all taken from the New York Times website:

  1. Admissions gap
  2. Medicare hospital charges
  3. Housing prices
  4. Baseball pitching
  5. Phillips curve
  6. School mathematics ratings
  7. Corporate taxes










3.2 Univariate Visualizations



3.2.1 Practice

Data Visualization Workflow + ggplot

There’s no end to the number and type of visualizations you could make. Thus the process can feel overwhelming. FlowingData has some good recommendations for data viz workflow:

  • Ask the data questions.
    Simple research questions will guide the types of visualizations that you should construct.

  • Start with the basics and work incrementally.
    Before constructing complicated or multivariate or interactive graphics, start with simple visualizations. An understanding of the simple patterns provides a foundation upon which to build more advanced analyses & visualizations.

  • Focus
    Reporting a large number of visualizations can overwhelm the audience & obscure your conclusions. Instead, pick out a focused yet comprehensive set of visualizations.


In this course we’ll largely construct visualizations using the ggplot function in RStudio. Though the ggplot learning curve can be steep, its “grammar” is intuitive and generalizable once mastered. The ggplot plotting function is stored in the ggplot2 package:

   

library(ggplot2)

The best way to learn about ggplot is to just play around. Don’t worry about memorizing the syntax. Rather, focus on the patterns and potential of their application. Here’s a link to a helpful cheat sheet:













Getting Started

The “Bechdel test”, named after cartoonist Alison Bechdel, tests whether movies meet the following criteria:

  • there are \(\ge\) 2 (named) female characters;
  • these women talk to each other…
  • about something other than a man

In the fivethirtyeight.com article “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”, the authors analyze which Hollywood movies do/don’t pass the test. Their data are available in the fivethirtyeight package:




library(fivethirtyeight)
data(bechdel)
head(bechdel)
year  imdb       title             clean_test  binary  budget_2013  domgross_2013  intgross_2013
2013  tt1711425  21 & Over         notalk      FAIL       13000000       25682380       42195766
2012  tt1343727  Dredd 3D          ok          PASS       45658735       13611086       41467257
2013  tt2024544  12 Years a Slave  notalk      FAIL       20000000       53107035      158607035
2013  tt1272878  2 Guns            notalk      FAIL       61000000       75612460      132493015
2013  tt0453562  42                men         FAIL       40000000       95020213       95020213
2013  tt1335975  47 Ronin          men         FAIL      225000000       38362475      145803842




  1. Before diving into any visualizations of these data, we first must understand their structure and contents.
    1. What are the units of observation and how many units are in this sample?
    2. What are the levels of the clean_test and binary variables?
    3. Check out the codebook for bechdel. What’s the difference between domgross_2013 and domgross?



  1. We’ll consider univariate visualizations of the clean_test and budget_2013 variables.
    1. What features would we like a visualization of the categorical clean_test variable to capture?
    2. What features would we like a visualization of the quantitative budget_2013 variable to capture?





CATEGORICAL UNIVARIATE VISUALIZATIONS



Research Question:

Among the movies in our sample, what fraction pass the Bechdel test? Among those that fail the test, in which way do they fail (eg: there are no women, there are women but they only talk about men, etc)?


  1. To answer the above research question, we can explore the categorical clean_test variable. A table provides a simple summary of the number of movies that fall into each clean_test category:

    table(bechdel$clean_test)
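Because the research question asks about fractions rather than counts, it can help to rescale the table. A minimal sketch, assuming a made-up `outcomes` vector that stands in for `bechdel$clean_test` (the values below are illustrative, not the real data):

```r
# Hypothetical toy outcomes, standing in for bechdel$clean_test
outcomes <- c("ok", "notalk", "ok", "men", "nowomen", "ok", "notalk", "ok")

counts <- table(outcomes)         # number of cases in each category
fractions <- prop.table(counts)   # each count divided by the total number of cases

counts
round(fractions, 2)
```

With the course data, the same two calls applied to `bechdel$clean_test` should give the fraction of movies in each category.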



A bar chart provides a visualization of this table. In examining the bar chart, keep your eyes on the following.


Visualizing Categorical Variables

In examining plots of a categorical variable, take note of the following features:

  • variability
    Are cases evenly spread out among the categories or are some categories more common than others?
  • contextual implications
    In the context of your research, what do you learn from the bar chart? How would you describe your findings to a broad audience?



  1. Try out the code below that builds up from a simple to a customized bar chart. At each step determine how each piece of code contributes to the plot.

    #plot 1: set up a plotting frame (a blank canvas)
    ggplot(bechdel, aes(x=clean_test))
    
    #plot 2: what changed / how did we change it?
    ggplot(bechdel, aes(x=clean_test)) + 
        geom_bar()
    
    #plot 3: what changed / how did we change it?
    ggplot(bechdel, aes(x=clean_test)) + 
        geom_bar() +
        labs(x="Outcome of Bechdel Test", y="Number of movies")
    
    #plot 4: what changed / how did we change it?
    ggplot(bechdel, aes(x=clean_test)) +
        geom_bar(color="purple") +
        labs(x="Outcome of Bechdel Test", y="Number of movies")
    
    #plot 5: what changed / how did we change it?
    ggplot(bechdel, aes(x=clean_test)) +
        geom_bar(fill="purple") +
        labs(x="Outcome of Bechdel Test", y="Number of movies")



  1. Summarize the visualization: what did you learn about the “distribution” of the clean_test variable?





QUANTITATIVE UNIVARIATE VISUALIZATIONS



Research Question:

Among the movies in our sample, what’s the range of budgets? What’s the typical budget? The largest/smallest?


We can answer the above research question by exploring the quantitative budget_2013 variable. Quantitative variables require different summary tools than categorical variables. We’ll explore 2 methods for graphing quantitative variables: histograms & density plots. Each of these has strengths and weaknesses in helping us visualize the distribution of observed values. In examining them, keep your eyes on the following.



Visualizing Quantitative Variables

In examining plots of a quantitative variable, take note of the following features:

  • center
    Where’s the center of the distribution? What’s a typical value of the variable?
  • variability
    How spread out are the values? A lot or a little?
  • shape
    How are values distributed along the observed range? Is the distribution symmetric, right-skewed, left-skewed, bi-modal, or uniform (flat)?
  • outliers
    Are there any outliers, i.e. values that are unusually large/small relative to the bulk of other values?
  • contextual implications
    Interpret these features in the context of your research. How would you describe your findings to a broad audience?



  1. Histograms are constructed by (1) dividing up the observed range of the variable into ‘bins’ of equal width; and (2) counting up the number of cases that fall into each bin. Try out the code below. At each step determine how each piece of code contributes to the plot.

    #plot 1: set up a plotting frame
    ggplot(bechdel, aes(x=budget_2013))
    
    #plot 2: what changed / how did we change it?
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_histogram()
    
    #plot 3: what changed / how did we change it?
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_histogram() +
        labs(x="Budget ($)", y="Number of movies")
    
    #plot 4: what changed / how did we change it?
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_histogram(color="white") +
        labs(x="Budget ($)", y="Number of movies")
    
    #plot 5: what changed / how did we change it?
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_histogram(fill="white") +
        labs(x="Budget ($)", y="Number of movies")
    
    #plot 6: what changed / how did we change it?
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_histogram(color="white", binwidth=500000) +
        labs(x="Budget ($)", y="Number of movies")
    
    #plot 7: what changed / how did we change it?
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_histogram(color="white", binwidth=200000000) +
        labs(x="Budget ($)", y="Number of movies")



  1. Summarize the visualizations.
    1. Describe the “goldilocks problem” in choosing a bin width that’s not too wide and not too narrow, but just right.
    2. What did you learn about the “distribution” of the budget_2013 variable?
    3. Why does adding color="white" improve the visualization?



  1. Density plots are essentially smooth versions of the histogram. Instead of sorting cases into discrete bins, the “density” of cases is calculated across the entire range of values. The greater the number of cases, the greater the density! The density is then scaled so that the area under the density curve always equals 1 and the area under any fraction of the curve represents the fraction of cases that lie in that range. Try the following code:

    #set up the plotting frame
    ggplot(bechdel, aes(x=budget_2013))
    
    #add a density curve
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_density()
    
    #add axis labels
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_density() +
        labs(x="Budget ($)")
    
    #add a color
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_density(color="red") +
        labs(x="Budget ($)")
    
    #add a fill
    ggplot(bechdel, aes(x=budget_2013)) +
        geom_density(fill="red") +
        labs(x="Budget ($)")
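The claim that the area under a density curve equals 1 can be checked numerically. This sketch uses base R's `density()` on simulated data; the simulated `budget` values and the `trapezoid_area()` helper are assumptions for illustration, not part of the course data or of ggplot2.

```r
set.seed(1)
# Hypothetical right-skewed "budgets" (simulated, not the bechdel data)
budget <- rexp(500, rate = 1 / 40)

d <- density(budget)   # kernel density estimate: a grid of x values with heights y

# Hypothetical helper: approximate the area under a curve with trapezoids
trapezoid_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

total_area <- trapezoid_area(d$x, d$y)
total_area                          # should be very close to 1

# Area under the curve up to 40 vs. the observed fraction of cases at or below 40
keep <- d$x <= 40
trapezoid_area(d$x[keep], d$y[keep])
mean(budget <= 40)
```

The last two numbers will not match exactly, since the curve is a smoothed estimate, but they should be in the same neighborhood.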



  1. The histogram and density plot both allow us to visualize the distribution of a quantitative variable. What are the pros/cons of each?









3.2.2 Exercises



  1. Good vs Bad visualizations
    1. Think of your favorite hobby or extracurricular interest. Find an example of a “good visualization” online related to this interest.
      • Include a screenshot of the visualization and a source for this visualization.
      • Summarize the content of the visualization. What does it communicate to the audience? What did you learn?
      • Summarize the features that make this a good visualization.
    2. Find an example of a “bad visualization” online. Be sure to choose a visualization from the wild, i.e. do not go directly to viz.wtf.
      • Include a screenshot of the visualization and a source for this visualization.
      • Summarize the content of the visualization. What is it trying to communicate to the audience?
      • Summarize the features that make this a bad visualization.



  1. In July 2016, fivethirtyeight.com published the article “Hip-Hop is Turning on Donald Trump.” You can find the supporting data table hiphop_cand_lyrics in the fivethirtyeight package:

    library(fivethirtyeight)
    data("hiphop_cand_lyrics")
    1. What are the cases in this data set?
    2. Use RStudio functions to:
      • summarize the number of cases in hiphop_cand_lyrics
      • examine the first cases of hiphop_cand_lyrics
      • list out the names of all variables in hiphop_cand_lyrics



  1. Let’s start our investigation of hip hop data by asking “Who?”. That is, let’s identify patterns in which 2016 presidential candidates popped up in hip hop lyrics.
    1. Use an RStudio function to determine the category labels used for the candidate variable.
    2. Construct a table of the number of cases that fall into each candidate category.
    3. Construct a single plot that allows you to investigate the prevalence of each candidate in hip hop. Make the following modifications:
      • change the axis labels
      • change the fill colors
    4. Summarize your findings about the 2016 candidates in hip hop.



  1. Next, consider the release dates of the hip hop songs.
    1. Construct a histogram of the release dates with the following modifications:
      • change the fill color of the bins
      • change the bin width to a meaningful size
    2. Construct a density plot of the release dates with the following modifications:
      • change the fill color
    3. Summarize your findings about release date



  1. No class will teach you everything you need to know about RStudio or programming in general, so being able to find help online is an important skill. To this end, make a single visualization that incorporates the following modifications to your density plot from above. This will require a little Googling.
    • Add a title.
    • Add transparency to the fill color.
    • Calculate the mean (i.e. average) release date and median release date:

      mean(hiphop_cand_lyrics$album_release_date)
      median(hiphop_cand_lyrics$album_release_date)

      Add 2 vertical lines to your plot, one representing the mean and the other representing the median. Use 2 different colors.
    • Change the limits of the x-axis to range from 1980-2020.

3.3 Bivariate Visualizations

Practice

Visualization of the Day

https://demographics.virginia.edu/DotMap/





  1. The story + exploring data structure
    The outcome of the 2016 presidential election surprised many people. To better understand it ourselves, we’ll explore county-level election outcomes and demographics. The following data set combines 2008/2012/2016 county-level election returns from Tony McGovern on GitHub, county-level demographics from the df_county_demographics data set within the choroplethr R package, and red/purple/blue state designations from http://www.270towin.com/:

    elect <- read.csv("https://www.macalester.edu/~ajohns24/data/electionDemographics16.csv")

    Let’s get to know these data.

    #Check out the first rows of elect.  What are the units of observation?  The variables?    
    
    #How much data do we have?    
    
    #What are the names of the variables?    



  1. Explore the win column
    The winrep_2016 variable indicates whether or not the Republican (Trump) won the county in 2016, and is thus categorical. Let’s construct both numerical and visual summaries of Trump wins/losses. (Before you do, what do you anticipate?)

    #Construct a table (a numerical summary) of the number of counties that Trump won/lost
    table(???)
    
    #Attach a library needed for ggplots
    library(???)
    
    #Construct a bar chart (a visual summary) of this variable.    
    ggplot(???, aes(???)) 
    ggplot(???, aes(???)) + 
        geom_???()




  1. Explore vote percentages
    The perrep_2016 variable includes a bit more detail about Trump’s support in each county.
    1. Since it’s quantitative we need different tools to visually explore the variability in perrep_2016. To this end, construct & interpret both a histogram and density plot of perrep_2016. (Before you do, what do you anticipate?)

      ggplot(elect, aes(???))
      
      #histogram
      ggplot(elect, aes(???)) + 
          geom_???(color="white")
      
      #density plot
      ggplot(elect, aes(???)) + 
          geom_???()
    2. Thus far, we have a good sense for how Trump’s support varied from county to county. We don’t yet have a good sense for why. What other variables (i.e. county features) might explain some of the variability in Trump’s support from county to county? Which of these variables do you think will be the best predictors of support? The worst?












Visualizing Relationships

We’ve come up with a list of variables that might explain some of the variability in Trump’s support from county to county. Thus we’re interested in the relationship between:

  • response variable: the variable whose variability we would like to explain
    (Trump’s percent of the vote)
  • predictors: variables that might explain some of the variability in the response
    (percent white, per capita income, state color, etc)

Our goal is to construct visualizations that allow us to examine/identify the following features of the relationships among these variables:

  • relationship trends
  • relationship strength (degree of variability from the trend)
  • outliers in the relationship







A THOUGHT EXPERIMENT

Consider a subset of counties & variables:

county              abb  perrep_2016  perrep_2012  winrep_2016  StateColor
Elbert County       CO         73.53        72.52  TRUE         blue
Rockdale County     GA         35.82        41.37  FALSE        purple
Clay County         MN         46.55        44.91  TRUE         blue
McDonald County     MO         80.15        72.84  TRUE         purple
Alcorn County       MS         79.95        75.11  TRUE         red
Roger Mills County  OK         87.94        83.75  TRUE         red

Before constructing visualizations of the relationship among any set of these variables, we need to understand what features these should have. As with univariate plots, the appropriate visualization also depends upon whether the variables are quantitative or categorical. In groups, draw a visualization of the relationship between the given pair of variables for the 6 counties above.

  1. Visualize the relationship between perrep_2016 (the response) and perrep_2012 (the predictor).

  2. Visualize the relationship between perrep_2016 (the response) and StateColor (the predictor). Think: how might we modify the below density plot of perrep_2016 to distinguish between counties in red/purple/blue states?

    ggplot(elect, aes(x=perrep_2016)) + 
        geom_density()

  3. Visualize the relationship between Trump’s county-level wins/losses winrep_2016 (the response) and StateColor (the predictor). Think: how might we modify the below bar plot of winrep_2016 to distinguish between counties in red/purple/blue states?

    ggplot(elect, aes(x=winrep_2016)) + 
        geom_bar()










Basic Rules for Constructing Graphics

Instead of memorizing which plot is appropriate for which situation, it’s best to simply recognize patterns in constructing graphics:

  • Each quantitative variable requires a new axis. (We’ll discuss later what to do when we run out of axes!)

  • Each categorical variable requires a new way to “group” the graphic (eg: using colors, shapes, separate facets, etc to capture the grouping)

  • For visualizations in which overlap in glyphs or plots obscures the patterns, try faceting or transparency.
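The three rules above can be sketched in one plot. To keep the example self-contained, it uses the `mpg` data set that ships with ggplot2 rather than the election data; the choice of variables is purely illustrative.

```r
library(ggplot2)

# mpg ships with ggplot2: displ and hwy are quantitative, drv is categorical
rules_plot <- ggplot(mpg, aes(x = displ, y = hwy,   # rule 1: one axis per quantitative variable
                              color = drv)) +       # rule 2: group the categorical variable by color
  geom_point(alpha = 0.5) +                         # rule 3: transparency for overlapping glyphs
  facet_wrap(~ drv)                                 # rule 3: or split the groups into facets
rules_plot
```

In practice color and facets would rarely both be needed for the same variable; they appear together here only to show the options side by side.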










QUANTITATIVE vs QUANTITATIVE

Let’s start by exploring the relationship between Trump’s 2016 support (perrep_2016) and Romney’s 2012 support (perrep_2012), both quantitative variables.

  1. Scatterplots & Glyphs
    Both perrep_2016 and perrep_2012 are quantitative, thus require their own axes. Traditionally, the response variable is placed on the y-axis. Once the axes are set up, each case is represented by a “glyph” at the coordinates defined by these axes.

    1. Plot a scatterplot of perrep_2016 vs perrep_2012 with different glyphs: points or text.

      #just a graphics frame
      ggplot(elect, aes(y=perrep_2016, x=perrep_2012))
      
      #add a layer with "point" glyphs
      ggplot(elect, aes(y=perrep_2016, x=perrep_2012)) + 
          geom_point()    
      
      #add a layer with symbol glyphs
      ggplot(elect, aes(y=perrep_2016, x=perrep_2012)) + 
          geom_point(shape=3)
      
      #add a layer with "text" glyphs
      ggplot(elect, aes(y=perrep_2016, x=perrep_2012)) + 
          geom_text(aes(label=abb))
    2. Summarize the relationship between the Republican candidates’ support in 2016 and 2012. Be sure to comment on:
      • the strength of the relationship (weak/moderate/strong)
      • the direction of the relationship (positive/negative)
      • outliers (In what state do counties deviate from the national trend? Explain why this might be the case)



  1. Capture the trend with “smooths”
    The trend of the relationship between perrep_2016 and perrep_2012 is clearly positive and (mostly) linear. We can highlight this trend by adding a model “smooth” to the plot.
    1. Add a layer with a model smooth:

      ggplot(elect, aes(y=perrep_2016, x=perrep_2012)) + 
          geom_point() + 
          geom_smooth()
    2. Construct a new plot that contains the model smooth but does not include the individual cases (eg: point glyphs).
    3. Notice that there are gray bands surrounding the blue model smooth line. What do these gray bands illustrate/capture and why are they widest at the “ends” of the model?
    4. By default, geom_smooth adds a smooth, localized model line. To examine the “best” linear model, we can specify method="lm":

      ggplot(elect, aes(y=perrep_2016, x=perrep_2012)) + 
          geom_point() + 
          geom_smooth(method="lm")
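To see what a "best" linear model looks like numerically, the same kind of least-squares line can be fit directly with `lm()`. This sketch uses only the six counties from the thought-experiment table above, not the full `elect` data, so its coefficients will differ from the line drawn on the full scatterplot. The 60% county in the last step is hypothetical.

```r
# Six counties from the thought-experiment table (not the full elect data)
six <- data.frame(
  perrep_2012 = c(72.52, 41.37, 44.91, 72.84, 75.11, 83.75),
  perrep_2016 = c(73.53, 35.82, 46.55, 80.15, 79.95, 87.94)
)

# Least-squares line: response on the left of ~, predictor on the right
fit <- lm(perrep_2016 ~ perrep_2012, data = six)
coef(fit)   # intercept and slope of the fitted line

# Predicted 2016 support for a hypothetical county with 60% Republican support in 2012
predict(fit, newdata = data.frame(perrep_2012 = 60))
```

A positive slope here corresponds to the upward trend visible in the scatterplot.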



  1. Modify the scatterplots
    As with univariate plots, we can change the aesthetics of scatterplots.
    1. Add appropriate axis labels to your scatterplot. Label the y-axis “Trump 2016 support (%)” and label the x-axis “Romney 2012 support (%)”.
    2. Change the color of the points.

      ggplot(elect, aes(y=perrep_2016, x=perrep_2012)) + 
          geom_point(color="brown")
    3. Add some transparency to the points. NOTE: alpha can be between 0 (complete transparency) and 1 (no transparency).

      ggplot(elect, aes(y=perrep_2016, x=perrep_2012)) + 
          geom_point(alpha=0.5)
      ggplot(elect, aes(y=perrep_2016, x=perrep_2012)) + 
          geom_point(alpha=0.2)
    4. Why is transparency useful in this particular graphic?



  1. More scatterplots
    2012 results aren’t the only possible predictor of 2016 results. Consider two more possibilities.
    1. Construct a scatterplot of perrep_2016 and median_rent. Summarize the relationship between these two variables.
    2. Construct a scatterplot of perrep_2016 and percent_white. Summarize the relationship between these two variables.
    3. Among perrep_2012, median_rent and percent_white, which is the best predictor of perrep_2016? Why?









QUANTITATIVE vs CATEGORICAL

Consider a univariate histogram & density plot of perrep_2016:

To visualize the relationship between Trump’s 2016 support (perrep_2016) and the StateColor (categorical) we need to incorporate a grouping mechanism. Work through the several options below.

  1. Side-by-side density plots
    1. Construct a density plot for each group.

      ggplot(elect, aes(x=perrep_2016, fill=StateColor)) + 
          geom_density()
    2. Notice that ggplot assigns its default colors to the groups in alphabetical order. In this example, the default colors don’t match the groups themselves (red/purple/blue)! We can fix this:

      ggplot(elect, aes(x=perrep_2016, fill=StateColor)) + 
          geom_density() + 
          scale_fill_manual(values=c("blue","purple","red"))
    3. The overlap between the groups makes it difficult to explore the features of each. One option is to add transparency to the density plots:

      ggplot(elect, aes(x=perrep_2016, fill=StateColor)) + 
          geom_density(alpha=0.5) + 
          scale_fill_manual(values=c("blue","purple","red"))
    4. Yet another option is to separate the density plots into separate “facets” defined by group:

      ggplot(elect, aes(x=perrep_2016, fill=StateColor)) + 
          geom_density(alpha=0.5) + 
          scale_fill_manual(values=c("blue","purple","red")) + 
          facet_wrap( ~ StateColor)



  1. Side-by-side histograms
    Let’s try a similar strategy using histograms to illustrate the relationship between perrep_2016 and StateColor.
    1. Start with the default histogram:

      ggplot(elect, aes(x=perrep_2016, fill=StateColor)) + 
          geom_histogram(color="white") 
    2. That’s not very helpful! Separate the histograms into separate facets for each StateColor group.





  1. Just for fun: more options!
    Density plots and histograms aren’t the only type of viz we might use…
    1. Construct side-by-side violins and side-by-side boxplots (see description below).

      #violins instead
      ggplot(elect, aes(y=perrep_2016, x=StateColor)) + 
          geom_violin() 
      
      #boxes instead
      ggplot(elect, aes(y=perrep_2016, x=StateColor)) + 
          geom_boxplot() 

      Box plots are constructed from 5 numbers - the minimum, 25th percentile, median, 75th percentile, and maximum value of a quantitative variable:

    2. In the future, we’ll typically use density plots instead of histograms, violins, and boxes. Explain at least 1 pro and 1 con of the density plot.
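The 5 numbers behind a box plot can be computed directly. This sketch uses the six perrep_2016 values from the thought-experiment table rather than the full data. Note that ggplot2's `geom_boxplot()` by default draws whiskers only out to 1.5 times the interquartile range and plots more extreme cases as individual points, so its whisker ends need not equal the minimum and maximum.

```r
# perrep_2016 for the six counties in the thought-experiment table
perrep_2016 <- c(73.53, 35.82, 46.55, 80.15, 79.95, 87.94)

# Minimum, 25th percentile, median, 75th percentile, maximum
five <- quantile(perrep_2016, probs = c(0, 0.25, 0.5, 0.75, 1))
five
```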





  1. Let’s not forget the most important purpose of these visualizations! Summarize how Trump’s 2016 county-level support differed among red/purple/blue states.









CATEGORICAL vs CATEGORICAL

Finally, suppose that instead of Trump’s percentage support, we simply want to explore his county-level wins/losses:

Specifically, let’s explore the relationship between winrep_2016 and StateColor, another categorical variable.


  1. Side-by-side bar plots
    We saw above that we can incorporate a new categorical variable into a visualization by using grouping features such as color or facets. Let’s add information about StateColor to our bar plot of winrep_2016.
    1. Construct the following 4 bar plot visualizations.

      #a stacked bar plot
      ggplot(elect, aes(x=StateColor, fill=winrep_2016)) + 
          geom_bar()
      
      #a side-by-side bar plot
      ggplot(elect, aes(x=StateColor, fill=winrep_2016)) + 
          geom_bar(position="dodge")
      
      #a proportional bar plot
      ggplot(elect, aes(x=StateColor, fill=winrep_2016)) + 
          geom_bar(position="fill")    
      
      #faceted bar plot
      ggplot(elect, aes(x=StateColor, fill=winrep_2016)) + 
          geom_bar() +
          facet_wrap( ~ winrep_2016)
    2. Name one pro and one con of using the “proportional bar plot” instead of one of the other 3 options.

    3. What’s your favorite bar plot from part 1? Why?
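The numbers behind the stacked and proportional bar plots can be tabulated directly: a two-way table gives the counts, and rescaling each row to sum to 1 gives the proportions that `position="fill"` displays. This sketch uses only the six counties from the thought-experiment table, not the full `elect` data.

```r
# Six counties from the thought-experiment table (not the full elect data)
six <- data.frame(
  StateColor  = c("blue", "purple", "blue", "purple", "red", "red"),
  winrep_2016 = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Counts: the heights of the segments in a stacked bar plot
counts <- table(six$StateColor, six$winrep_2016)
counts

# Proportions within each StateColor (each row sums to 1)
props <- prop.table(counts, margin = 1)
props
```

The proportions make groups of different sizes comparable, but they hide how many counties are in each group.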

















Exercises

  1. Hot dog warm-up!
    In the annual Nathan’s hot dog eating contest, people compete to eat as many hot dogs as possible in 10 minutes. Data on past competitions were compiled by Nathan Yau for “Visualize This: The FlowingData Guide to Design, Visualization, and Statistics”:

    hotdogs <- read.csv("http://datasets.flowingdata.com/hot-dog-contest-winners.csv")
    1. Construct a visualization of the winning number of hot dogs by year. THINK: Which is the response variable?
    2. Temporal trends are often visualized using a line plot. Add a geom_line() layer to your plot from part 1.
    3. Summarize your observations about the temporal trends in the hot dog contest.



  1. All but 2 of the past winners are from the U.S. or Japan:

    table(hotdogs$Country)
    ## 
    ##       Germany         Japan        Mexico United States 
    ##             1             9             1            20

    Use the following code to keep only the winners from the U.S. and Japan, and name the resulting data set hotdogsSub. (Don’t worry about the code itself - we’ll discuss similar syntax later in the semester!)

    library(dplyr)
    hotdogsSub <- hotdogs %>% 
        filter(Country %in% c("Japan","United States"))
    1. Using a density plot approach without facets, construct a visualization of how the number of hot dogs eaten varies by country.
    2. Repeat part 1 using a density plot approach with facets.
    3. Repeat part 1 using something other than a density plot approach. (There are a few options!)
    4. Summarize your observations about the number of hot dogs eaten by country.






  1. The Bechdel test
    Recall the “Bechdel test” data from the previous activity. As a reminder, the “Bechdel test” tests whether movies meet the following criteria:
    • there are \(\ge\) 2 female characters
    • the female characters talk to each other
    • at least 1 time, they talk about something other than a male character

    In the fivethirtyeight.com article “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”, the authors analyze which Hollywood movies do/don’t pass the test. Their data are available in the fivethirtyeight package:

    library(fivethirtyeight)
    data(bechdel)

    In investigating budgets and profits, the authors “focus on films released from 1990 to 2013, since the data has significantly more depth since then.” Use the following code to keep only the movies from these years and name the resulting data set Beyond1990 (don’t worry about the syntax):

    library(dplyr)
    Beyond1990 <- bechdel %>% 
        filter(year >= 1990)
    1. Construct a visualization that addresses the following research question: Do bigger budgets (budget_2013) pay off with greater box office returns (domgross_2013)? In constructing this visualization, add a smooth to highlight trends and pay attention to which of these variables is the response.
    2. Using your visualization as supporting evidence, answer the research question.
    3. Part of the fivethirtyeight article focuses on how budgets (budget_2013) differ among movies with different degrees of female character development (clean_test). Construct a visualization that highlights the relationship between these two variables. There are many options - some are better than others!
    4. Using your visualization as supporting evidence, address fivethirtyeight’s concerns.





  1. Recreating a fivethirtyeight graphic
    NOTE: The following exercise is inspired by a similar exercise proposed by Albert Kim, one of the fivethirtyeight package authors.
    Return to the fivethirtyeight.com article and examine the plot titled “The Bechdel Test Over Time”.
    1. Summarize the trends captured by this plot. (How has the representation of women in movies evolved over time?)
    2. Recreate this plot! To do so, you’ll need to create a new data set named newbechdel in which the order of the Bechdel categories (clean_test) and the year categories (yearCat) match those used by fivethirtyeight. Don’t worry about the syntax:

      library(dplyr)
      newbechdel <- bechdel %>% 
          mutate(clean_test=factor(clean_test, levels=c("nowomen","notalk","men","dubious","ok"))) %>% 
          mutate(yearCat=cut(year, breaks=seq(1969,2014,by=5)))

      Further, you’ll need to add the following layer in order to get a color scheme that’s close to that in the article:

      scale_fill_manual(values = c("red","salmon","pink","steelblue1","steelblue4"))

      NOTE that your plot won’t look exactly like the authors’, but should be close to this:



  1. Geographical Data: Point Processes

    The Starbucks data, provided by Danny Kaplan, contains information about every Starbucks in the world:

    Starbucks <- read.csv("https://www.macalester.edu/~ajohns24/Data/Starbucks.csv")
    1. Starbucks includes the Latitude and Longitude of each location. Construct a visualization of the relationship between these two. THINK: Which of these should go on the y-axis?

    2. The point pattern probably looks familiar! To highlight the geographical nature of this scatterplot, we can superimpose the points on top of a map. To this end, construct the following three maps using the ggmap function in the ggmap package. NOTE: You might first have to install the ggmap package. Recent versions of ggmap also require registering a Google Maps API key (see register_google()) before get_map() can download Google map tiles.

      library(ggmap)
      WorldMap <- get_map(location="Africa", zoom=2)
      ggmap(WorldMap) +
          geom_point(data=Starbucks, aes(x=Longitude,y=Latitude), alpha=0.2)
      
      
      US_map <- get_map(location="United States", zoom=3)
      ggmap(US_map) +
          geom_point(data=Starbucks, aes(x=Longitude,y=Latitude), alpha=0.2)
      
      
      TC_map <- get_map(location=c(lon=-93.1687,lat=44.9398))
      ggmap(TC_map) +
          geom_point(data=Starbucks, aes(x=Longitude,y=Latitude))
    3. Re-examine the syntax. Explain the purpose of the zoom argument and how it works.

    4. Construct a new map of Starbucks locations in your birth state (if you were born in the U.S.) or birth country (if you were born outside the U.S.).
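
For intuition about the zoom argument, here is a rough back-of-the-envelope sketch. It assumes a Google-style Web Mercator map in which the whole world is 256 pixels wide at zoom 0 and doubles in width with each zoom level, and it assumes a 640-pixel-wide image. Both numbers are assumptions about the map source rather than anything stated in this activity, and lon_span is a hypothetical helper name.

```r
# Assumed constants for a Google-style map (may differ for other map sources)
world_px_zoom0 <- 256   # assumed width of the whole world, in pixels, at zoom 0
image_px       <- 640   # assumed width of the downloaded image, in pixels

# Approximate number of degrees of longitude covered by the image
lon_span <- function(zoom) {
  world_px <- world_px_zoom0 * 2^zoom   # world width doubles per zoom level
  360 * image_px / world_px
}

spans <- sapply(c(2, 3, 10), lon_span)
spans
```

Under these assumptions, each one-unit increase in zoom halves the width of the region shown, which is why small zoom values give continent-scale maps and larger values give city-scale maps.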




  1. Geographical Data: Measurement by Area
    Geographical data needn’t be expressed by latitude & longitude. Reconsider the elect data which included county-level election and demographic variables:

    elect <- read.csv("https://www.macalester.edu/~ajohns24/data/electionDemographics16.csv")

    Thus instead of plotting point locations of some occurrence (eg: Starbucks presence), we want to visualize county-level measurements. First load the following libraries. Make sure they are installed first!

    library(choroplethr)
    library(choroplethrMaps)
    1. Construct the following three maps of Trump’s county-level support (perrep_2016). Note that county_choropleth requires the variable of interest to be stored as value in the elect data.

      #use but don't worry about this syntax
      elect <- elect %>% mutate(value=perrep_2016)
      
      #make the maps!
      county_choropleth(elect)
      county_choropleth(elect, state_zoom="minnesota")
      county_choropleth(elect, state_zoom="minnesota", reference_map = TRUE)
    2. Summarize the trends in the three plots above.

    3. Make and summarize the trends in a national map of winrep_2016, the indicator of whether or not Trump won each county. Don’t forget to first define this as your value of interest:

      elect <- elect %>% mutate(value=winrep_2016)
    4. Make and summarize the trends in a national map of a different elect variable of your choice!









3.4 Beyond Bivariate Relationships

3.4.1 Practice

Though far from a perfect assessment of academic preparedness, SAT scores are often used as one measurement of a state’s education system. The education data stored at https://www.macalester.edu/~ajohns24/data/sat.csv contain various education variables for each state:

head(education)
##        State expend ratio salary frac verbal math  sat  fracCat
## 1    Alabama  4.405  17.2 31.144    8    491  538 1029   (0,15]
## 2     Alaska  8.963  17.6 47.951   47    445  489  934 (45,100]
## 3    Arizona  4.778  19.3 32.175   27    448  496  944  (15,45]
## 4   Arkansas  4.459  17.1 28.934    6    482  523 1005   (0,15]
## 5 California  4.992  24.0 41.078   45    417  485  902  (15,45]
## 6   Colorado  5.443  18.4 34.571   29    462  518  980  (15,45]



  1. Getting Started
    After importing the data (and saving it as education), construct visualizations that help you address the following research questions.

    • Research question 1
      The fracCat variable categorizes the percentage of a state’s students that take the SAT into low (at most 15%), medium (above 15% and up to 45%), and high (above 45%). How many states fall into each of these categories?

    • Research question 2
      To what degree do average SAT scores vary from state to state? What’s a typical SAT?

    • Research question 3
      To what degree does per pupil spending (expend) explain this variability? What about teacher salary? Is there anything that surprises you here?! NOTE: Include some model smooths to help highlight the trends.











A THOUGHT EXPERIMENT

Both expend and salary explain some of the variability in sat scores. So why not include both in a plot of sat? Or incorporate other variables that might illuminate the counterintuitive result. Let’s! Take a look at some of the data below:

State       expend  ratio  salary  frac  verbal  math   sat  fracCat
Alabama      4.405   17.2  31.144     8     491   538  1029  (0,15]
Alaska       8.963   17.6  47.951    47     445   489   934  (45,100]
Arizona      4.778   19.3  32.175    27     448   496   944  (15,45]
Arkansas     4.459   17.1  28.934     6     482   523  1005  (0,15]
California   4.992   24.0  41.078    45     417   485   902  (15,45]
Colorado     5.443   18.4  34.571    29     462   518   980  (15,45]

Before constructing visualizations of the relationship among any set of these variables, we need to understand what features these should have. In groups, draw a visualization of the relationship between the given set of variables.

  1. Visualize how the variability in sat (the response) can be explained by expend and salary (the predictors). Try to come up with at least 2 approaches. THINK: How did we visualize sat vs expend? How can we adapt this to include salary information?

  2. Visualize how the variability in sat (the response) can be explained by expend and fracCat (the predictors). Try to come up with at least 2 approaches. THINK: How did we visualize sat vs expend? How can we adapt this to include fracCat information?









VISUALIZING >2 QUANTITATIVE VARIABLES

  1. Scatterplots for >2 Quantitative Variables
    Three dimensional scatterplots drawn on two dimensional surfaces are notoriously misleading. Consider the alternatives here.
    1. Construct each plot below and summarize the strategy that’s being used to include information about expend in the scatterplot of sat vs salary.

      #plot 1
      ggplot(education, aes(y=sat, x=salary, color=expend)) + 
          geom_point() + 
          geom_smooth(se=FALSE, method="lm")
      
      #plot 2
      ggplot(education, aes(y=sat, x=salary, size=expend)) + 
          geom_point() + 
          geom_smooth(se=FALSE, method="lm")
      
      #plot 3
      ggplot(education, aes(y=sat, x=salary, color=cut(expend,2))) + 
          geom_point() + 
          geom_smooth(se=FALSE, method="lm")
      
      #plot 4
      ggplot(education, aes(y=sat, x=salary, color=cut(expend,3))) + 
          geom_point() + 
          geom_smooth(se=FALSE, method="lm")
    2. Which of the plots is your favorite? Why?

    3. Summarize the trivariate relationship between sat, salary, and expend.
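
Plots 3 and 4 rely on cut() to turn a quantitative variable into categories. As a small sketch on made-up numbers (x is hypothetical, not the expend column), calling cut() with a single number of bins splits the range of the data into that many equal-width intervals:

```r
# Hypothetical spending values (NOT the education data)
x <- c(3.7, 4.4, 5.0, 5.4, 6.1, 7.2, 8.9, 9.8)

two_bins   <- cut(x, 2)   # 2 equal-width intervals spanning the range of x
three_bins <- cut(x, 3)   # 3 equal-width intervals spanning the range of x

# Each result is a factor; table() counts the values per interval
table(two_bins)
table(three_bins)
```

Note that the intervals have equal width, not equal counts, so some categories may contain far fewer cases than others.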





INCORPORATING CATEGORICAL VARIABLES

  1. Construct a visualization of sat by fracCat. Summarize the relationship between these two variables and explain why it makes intuitive sense.



  1. fracCat and expend both explain some of the variability in sat scores. Let’s incorporate both in our analysis.
    1. You have all the tools you need to construct a visualization of sat vs fracCat and expend. Be sure to incorporate a model line: geom_smooth(method="lm")

    2. In all previous plots of sat vs expend (without fracCat) we saw a negative relationship - SAT scores decrease as spending increases. What do you see now? What’s the relationship between sat and expend in states with a low fraction of students that take the SAT? States with a medium fraction? States with a high fraction?



  1. Simpson’s Paradox!
    Wait a minute: In the scatterplot of sat vs expend, it appeared that the more states spend on students, the worse their SAT scores. However, when we account for the fraction of the state’s students that take the test, we see that within each participation category SAT scores actually increase with per pupil expenditure. This reversal of a trend once a third variable is taken into account is known as “Simpson’s Paradox”.
    1. To convince yourself that this phenomenon isn’t merely due to the way in which we categorized the low/medium/high fracCat categories, check out a plot of sat vs expend and frac (the raw fractions):

      ggplot(education, aes(y=sat, x=expend, color=frac, size=frac)) + 
          geom_point()
    2. To get a better sense of what’s going on here, plot the state names at their coordinates:

      ggplot(education, aes(y=sat, x=expend, color=frac)) + 
          geom_text(aes(label=State))
    3. Putting all of this together, explain this Simpson’s Paradox. That is, why does it appear that SAT scores decrease as spending increases even though the opposite is true?
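
The reversal can also be checked numerically. The sketch below uses simulated data, not the education data: the group centers, the within-group slope of 20, and all variable names are made-up assumptions chosen so that groups that spend more also score lower on average.

```r
# Simulated data (hypothetical; NOT the education data)
set.seed(2017)
n <- 30
group <- rep(c("low", "medium", "high"), each = n)

# Assumed group centers: higher-participation groups spend more but score lower
center_spend <- c(low = 4.5, medium = 6.0, high = 7.5)
center_score <- c(low = 1050, medium = 950, high = 880)

spend <- unname(center_spend[group]) + rnorm(3 * n, sd = 0.5)
# Within every group, score rises by about 20 points per unit of spending
score <- unname(center_score[group]) +
  20 * (spend - unname(center_spend[group])) +
  rnorm(3 * n, sd = 10)
sim <- data.frame(group, spend, score)

# Slope when the grouping variable is ignored
overall_slope <- unname(coef(lm(score ~ spend, data = sim))["spend"])

# Slope within each group separately
within_slopes <- sapply(split(sim, sim$group), function(d) {
  unname(coef(lm(score ~ spend, data = d))["spend"])
})

overall_slope
within_slopes
```

The overall slope comes out negative even though every within-group slope is positive, because the differences between the groups dominate the pooled trend.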





3.4.2 Exercises

  1. The US_births_2000_2014 data within the fivethirtyeight package contain the number of U.S. births on each day from Jan 1, 2000 to Dec 31, 2014:

    #load the fivethirtyeight library
    suppressPackageStartupMessages(library(fivethirtyeight))
    
    #load the births data
    data(US_births_2000_2014)

    For now, let’s focus on just 2014. Use the following code (but don’t worry about the syntax) to create the Births2014 data set:

    library(dplyr)
    Births2014 <-  US_births_2000_2014 %>% 
        filter(year==2014)
    1. Construct a univariate plot that allows you to visualize the variability in births from day to day in 2014.

    2. The time of year might explain some of this variability. Construct a plot that illustrates the relationship between births and date in 2014. THINK: which of these should go on the y-axis?

    3. One goofy thing that stands out is the presence of 2-3 distinct groups of points. Add a layer to this plot that explains the distinction between these groups.

    4. Explain why you think births are lower in 1 of these groups than in the other.

    5. There are some exceptions to the rule revealed in parts 3 & 4, i.e. some cases that should belong to group 1 but behave like the cases in group 2. Explain why these cases are exceptions: what makes them special?

    6. Summarize your investigation in 1-2 sentences.
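
When a time series falls into distinct bands, coloring the points by a calendar feature is one way to investigate. The sketch below uses made-up counts for 14 consecutive days (toy and its values are hypothetical, not the births data) and derives a weekend indicator from the date in a locale-independent way.

```r
library(ggplot2)

# Hypothetical toy data: 14 consecutive days with made-up counts
toy <- data.frame(
  date  = as.Date("2014-01-05") + 0:13,   # 2014-01-05 was a Sunday
  count = c(7, 11, 12, 12, 12, 11, 8, 7, 11, 12, 12, 12, 11, 8)
)

# wday is a number from 0 (Sunday) to 6 (Saturday)
toy$wday    <- as.POSIXlt(toy$date)$wday
toy$weekend <- toy$wday %in% c(0, 6)

# Mapping the indicator to color reveals which band each point belongs to
p <- ggplot(toy, aes(x = date, y = count, color = weekend)) +
  geom_point()
```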



  1. The US_births_1994_2003 data set contains similar data from the previous decade. Combine US_births_1994_2003 and US_births_2000_2014 into 1 data table using the following code (don’t worry about the syntax):

    allyears <- full_join(US_births_1994_2003, US_births_2000_2014)
    1. Construct 1 graphic that illustrates birth trends across 1994-2014 and days of the week using geom_point().

    2. Construct 1 graphic that illustrates birth trends across 1994-2014 and days of the week using geom_smooth() (without geom_point()).

    3. Summarize your investigation in 1-2 sentences. Be sure to comment on both the common seasonal trends within years as well as trends across the years.
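
The data set names suggest that both tables cover 2000-2003, so it is worth knowing how full_join() treats overlapping rows. A minimal sketch on two made-up tables (early, late, and their values are hypothetical):

```r
library(dplyr)

# Hypothetical toy tables with one overlapping year (2000); values are made up
early <- data.frame(year = c(1999, 2000), births = c(100, 110))
late  <- data.frame(year = c(2000, 2001), births = c(111, 120))

# With no `by` argument, full_join() matches on ALL shared columns
# (here: year AND births). The two 2000 rows disagree on births,
# so they do not match each other and both are kept.
both <- full_join(early, late)

nrow(both)
both
```

If the two real sources report different counts for the same day, the combined table would likewise contain two rows for that day, which can show up as doubled points in a plot.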



  1. One of the focuses of the related fivethirtyeight.com article was birth trends on Friday the 13th (about which some people are superstitious). Use the following code (don’t worry about the syntax) to construct a data set that only contains Friday births and includes a variable fri13 which indicates whether the Friday falls on the 13th day of the corresponding month:

    frionly <- allyears %>%
            filter(day_of_week=="Fri") %>%
            mutate(fri13=(date_of_month == 13))

    Using the frionly data, construct a plot that illustrates the distribution of births among Fridays that fall on & off the 13th. Comment on whether you see any evidence of superstition.
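
To see what the filter() and mutate() steps do, here is the same pattern applied to a tiny made-up table (toy and its values are hypothetical; only the column names mirror those used above):

```r
library(dplyr)

# Hypothetical toy data with the same column names as the births data
toy <- data.frame(
  date_of_month = c(6, 13, 13, 20),
  day_of_week   = c("Fri", "Fri", "Sat", "Fri"),
  births        = c(12000, 11500, 8000, 12100)
)

fri <- toy %>%
  filter(day_of_week == "Fri") %>%        # keep only the Friday rows
  mutate(fri13 = (date_of_month == 13))   # TRUE when the Friday is a 13th

fri
```

The Saturday the 13th is dropped by filter(), and fri13 is a TRUE/FALSE variable that can be mapped to the x-axis, fill, or facets of a plot.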



  1. Stacking
    The dannyViz data, a tribute to Danny Kaplan, contains enrollment data for statistics-related courses at Mac from 2000-2016:

    dannyViz <- read.csv("https://www.macalester.edu/~ajohns24/data/dannyViz.csv")

    So that R recognizes the Course numbers as categories, be sure to run the following code:

    dannyViz$Course <- as.factor(dannyViz$Course)
    1. Construct a single visualization of how enrollments (Total) in each Course have fluctuated by Year. Since this is temporal data, it makes sense to use geom_line() instead of geom_point(). NOTE: 110 has turned into the current 112 in which you sit!

    2. That plot is nice for summarizing the trends of each individual course. However, it’s tough to get a sense of the cumulative enrollments in these courses over time. To this end, construct the following plot using geom_area(). Use this to summarize the overall and individual trends in statistics enrollments.

      ggplot(dannyViz, aes(x=Year, y=Total, fill=Course)) + 
          geom_area(color="black")
    3. The following data, motivated by the work of Nathan Yau in “Visualize This” and provided by http://flare.prefuse.org/, summarize occupation trends from 1850-2000:

      jobs <- read.csv("https://www.macalester.edu/~ajohns24/Data/jobtrends.csv")

      Construct a geom_area() visualization of occupation trends over time. Facet these by sex (1=male, 2=female).

    4. Summarize 3 interesting trends from this visualization.
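
In a stacked area plot, the top edge of the stack at each x value is the sum of the groups at that value. A small sketch with made-up enrollments (toy, course, and the totals are hypothetical) makes that explicit:

```r
library(ggplot2)

# Hypothetical toy data: two courses observed in three years
toy <- data.frame(
  year   = rep(2001:2003, times = 2),
  course = rep(c("A", "B"), each = 3),
  total  = c(10, 20, 30, 5, 5, 10)
)

# Height of the full stack in each year = sum of totals across courses
stack_height <- tapply(toy$total, toy$year, sum)
stack_height

# geom_area() stacks the courses, so the top edge traces stack_height
p <- ggplot(toy, aes(x = year, y = total, fill = course)) +
  geom_area(color = "black")
```

The thickness of each colored band shows one group's contribution, while the outer envelope shows the cumulative total.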



3.5 Visualization Wrap-Up

3.5.1 Practice

Now that we’ve learned the basics of constructing visualizations, let’s consider using visualizations to tell a story. Here are some examples:



  1. Tell a Story About the Bechdel Test
    As a class, let’s tell a story about movies that do/don’t pass the Bechdel test using the following prompts as inspiration:
    • How many movies fail/pass? In what way?
    • Why? Is it because movies that fail the test had more money?
    • What about the return on spending?


    To get you started:

    #load the data
    library(fivethirtyeight)
    library(dplyr)
    data(bechdel)
    
    #wrangle the data (don't worry about the syntax)
    bechdel <- mutate(bechdel, domgains=domgross_2013/budget_2013, intgains=intgross_2013/budget_2013)

    NOTE: To deal with the extreme right skew in budgets and gross earnings, you may need to convert the x-axis and/or y-axis to the log scale:

    #add at the end of a ggplot
    scale_x_log10()
    scale_y_log10()
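
As a sketch of what the log scales do, consider a made-up data frame in which values span several orders of magnitude (toy and its values are hypothetical, not the bechdel data):

```r
library(ggplot2)

# Hypothetical right-skewed toy data
toy <- data.frame(
  budget = c(1e5, 1e6, 1e7, 1e8),
  gross  = c(3e5, 2e6, 4e7, 1.5e8)
)

# On the log10 scale, each tenfold increase is one equal step
log10(toy$budget)

# The scale layers are added at the end of the plot with +
p <- ggplot(toy, aes(x = budget, y = gross)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
```

The axis labels still show the original units, but the points are positioned by their logarithms, so a handful of very large values no longer squash everything else into a corner.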





  1. Tell a Story About Jam of the Week
    Online spaces now augment physical spaces where people share, critique, and study musical performance. This research studies “Jam of the Week”, an online Facebook community with over 50,000 members.

    The jotw data contain the first 30,000+ posts to the Facebook group:

    jotw <- read.csv("https://www.macalester.edu/~ajohns24/data/jam_of_the_week.csv") 
    Each row represents a single jam posted to the group. Variables include:
    • gender = gender of the musician in the post
    • num_reactions, num_comments, etc = number of reactions to, comments on, etc the post
    • year_day, week_day, hour = indicators of when the jam was posted

    If you’re curious about the other variables, talk to the walking code book (Shilad).


    Working with the people around you, tell a story about Jam of the Week. In doing so, consider the following research question as a prompt: To what extent does online behavior adhere to or transcend existing biases related to gender?








3.5.2 Exercises

In January 2017, fivethirtyeight.com published an article on hate crime rates across the US. A tidied up version of their data is available at

https://www.macalester.edu/~ajohns24/data/hate_crimes_extra.csv
  1. Getting Started
    1. Load the data and store it as US_crime.

    2. What are the units of observation and how many observations are there?

    3. These US_crime data (which you should use for each exercise) were generated from the hate_crimes data set in the fivethirtyeight package. Examine the codebook for the hate_crimes data. Note that US_crime contains these same variables plus 4 more:
      • crimes_pre = average daily hate crimes per 100,000 population (2010-2015)
      • crimes_post = average daily hate crimes per 100,000 population (November 9-18, 2016)
      • crimes_diff = difference in the average daily hate crimes per 100,000 population after the election vs before the election (crimes_post - crimes_pre)
      • trump_win = an indicator of whether Trump won

      In comparing hate crime rates before and after the election, why is it better to examine crimes_pre vs crimes_post than avg_hatecrimes_per_100k_fbi vs hate_crimes_per_100k_splc?



  1. Trends in hate crimes
    1. Explain why, if we want to study possible connections between hate crime rates and the election, we should use crimes_diff instead of crimes_post in our analysis.

    2. Write a mini-story with 4 visualizations and 1-2 sentences per visualization that examines the following:
      • (univariate) variability in crimes_diff (be sure to note how these values compare to 0 & the contextual significance of this)
      • the relationship between crimes_diff and trump_win
      • the relationship between crimes_diff and share_vote_trump
      • the relationship between crimes_diff, gini_index, and trump_win



  1. MULTIVARIATE VISUALIZATIONS
    There are several variables we haven’t considered yet! Instead of cherry picking 2-3 variables at a time, we can visualize all at once. Before you get started, use the following syntax to eliminate some redundant variables and give the data row names. (Don’t worry about the syntax itself!)

    #make a copy of US_crime
    US_crime_new <- US_crime
    
    #treat states as row names
    row.names(US_crime_new) <- US_crime_new$state
    
    #take out some variables
    library(dplyr)
    US_crime_new <- select(US_crime_new, -c(state,median_house_inc,hate_crimes_per_100k_splc,avg_hatecrimes_per_100k_fbi,crimes_pre,crimes_post,trump_win))

    Confirm that your dimensions match those below:

    dim(US_crime_new)
    ## [1] 51  9
    1. Heat map (plain)
      Use the syntax below to construct a heat map. Note that each variable (column) is scaled separately, with colors ranging from low values (blue) to high values (pink) across the states (rows). With this in mind, you can scan across rows to assess which states are similar and across columns to assess which variables are related.

      crime_mat <- data.matrix(US_crime_new)
      heatmap(crime_mat, Rowv=NA, Colv=NA, scale="column", col=cm.colors(256))
    2. Heat map with row clusters
      It can be tough to identify interesting patterns by visually comparing across rows and columns. Including dendrograms helps to identify interesting clusters. First construct a heat map which identifies interesting clusters of rows (states). Comment on 2 interesting clusters. (Eg: do you note any regional patterns?)

      heatmap(crime_mat, Colv=NA, scale="column", col=cm.colors(256))
    3. Heat map with column clusters
      We can also construct a heat map which identifies interesting clusters of columns (variables). Comment on 2 interesting clusters. (Eg: Which variable seems to be the most indicative of Trump’s support?)

      heatmap(crime_mat, Rowv=NA, scale="column", col=cm.colors(256))
    4. Star plots
      There’s more than one way to visualize multivariate patterns. Construct the following 2 star plot visualizations. Like heat maps, these visualizations indicate the relative scale of each variable for each state. With this in mind, use the star plots to identify which state is the most “unusual”.

      stars(crime_mat, flip.labels=FALSE, key.loc=c(15,1.5))
      stars(crime_mat, flip.labels=FALSE, key.loc=c(15,1.5), draw.segments=TRUE)
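
The scale="column" argument used in the heat maps above centers and scales each column before coloring, so that variables measured in very different units become comparable. A small sketch of that standardization on a made-up matrix (m, its row names, and its values are hypothetical):

```r
# Hypothetical matrix: two variables on very different scales
m <- cbind(
  income = c(40000, 55000, 70000),
  gini   = c(0.42, 0.45, 0.48)
)
rownames(m) <- c("state1", "state2", "state3")

# scale() subtracts each column's mean and divides by its standard deviation
z <- scale(m)
z

# After standardizing, every column has mean 0 and standard deviation 1
colMeans(z)
apply(z, 2, sd)
```

Without this step, the colors would be driven almost entirely by whichever variable has the largest raw values.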