Lab 4. Advanced Plotting

Author

Jared Joseph

Introduction

Click here to access the lab on GitHub Classroom: GitHub Classroom Assignment for Lab 4: Advanced Plotting

Visualizations help us understand our own data and communicate it to other people. Today we will focus on making visuals that we would want to share with others. While there are some good rules of thumb, visualization is an art as well as a science. Take the time to make visualizations you think look good, as long as they stay faithful to the underlying data.

Some things to keep in mind:

What is the story of your visualization?
Does your visualization leave any question marks for the viewer?
Is there anything you can remove from your visualization and still keep your message clear?
Did you include the data source?
Have you tried to make your visualization accessible?
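
As a rough illustration of the last two points, here is a minimal sketch of a plot that credits its data source and uses a colorblind-friendly palette. It assumes the ggplot2 package and uses R's built-in mtcars data as a stand-in; neither the data nor the styling choices are requirements for this lab.

```r
library(ggplot2)

# Built-in mtcars data stands in for real lab data (an assumption for illustration)
p <- ggplot(mtcars, aes(x = wt, y = mpg,
                        color = factor(cyl), shape = factor(cyl))) +
  geom_point(size = 3) +
  # viridis palettes are designed to stay distinguishable for colorblind viewers
  scale_color_viridis_d() +
  labs(
    title = "Heavier cars tend to get fewer miles per gallon",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders",
    shape = "Cylinders",
    # Credit the data source directly on the plot
    caption = "Source: mtcars dataset (1974 Motor Trend), bundled with R"
  ) +
  theme_minimal()

p
```

Mapping the same variable to both color and shape means the groups can still be told apart when the plot is printed in grayscale.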

Tip

Our questions are going to start becoming more complex as we learn more data science skills. It is recommended you annotate your code with comments to keep track of what each step is doing.

The Data

The data we will be using today comes from the U.S. Department of Education College Scorecard. The College Scorecard collects important data about universities across the country, such as cost of attendance, acceptance rates, graduation rates, and the income of students from each institution after graduation. It also gives information on the student body, including how graduates from each major do after graduation. If you are curious, you can see Smith’s page here.

The full scorecard data set is huge. It includes information about over 6500 institutions in the U.S., and has more than 3000 columns documenting information about those institutions. Today we will be using the rscorecard package to get a subset of the data.

To use the rscorecard package, you will need to get an Application Programming Interface (API) key. API keys grant you direct access to data that is often otherwise limited. You get to pull data directly into R without needing to download files from the web, and the provider gets to limit the data you can actually get. This arrangement is usually beneficial for everyone.

To get an API key for the college scorecard data, you will need to request one on the data.gov web portal. Once you fill out your information, you should get an email almost immediately with a personalized API key. This key is unique to you, so it is important to keep it safe. You should never include your API keys in a code file, especially those you commit with git. Other people can look through your git history and find your personal key to use for nefarious purposes!
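
One common way to keep a key out of your code files is to store it in an environment variable and read it at run time. The sketch below uses only base R; the variable name SCORECARD_KEY and the helper get_scorecard_key() are hypothetical names chosen for illustration, not part of any package.

```r
# Hypothetical setup: add a line like
#   SCORECARD_KEY=your-key-here
# to your ~/.Renviron file (which should never be committed), then restart R.

# Hypothetical helper: read the key, failing loudly if it is missing
get_scorecard_key <- function(var = "SCORECARD_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found in environment variable ", var, call. = FALSE)
  }
  key
}
```

The returned string can then be passed to whichever function needs the key, so the key itself never appears in a script.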

Once you have your key, run the following command in your console, replacing <SCORECARD_KEY> with your unique key (keep the quotation marks, since the key must be passed as a string):

# Load our API library
library(rscorecard)

# Set our API key
sc_key("<SCORECARD_KEY>")

This will save your key to your R environment. You will need to re-run sc_key() whenever you restart R, so it may be helpful to save your key somewhere safe. Run the following code to get our data for today:

# set what variables we want

# school context
scorecard_variables_context <- c("unitid", "instnm", "city", "highdeg", "control",
                                 "hbcu", "annhi", "tribal", "aanapii", "hsi", "nanti")

# student info
scorecard_variables_students <- c("unitid", "instnm", "ugds", "adm_rate",
                                  "costt4_a", "costt4_p", "pcip27", "pctfloan",
                                  "pctpell", "admcon7", "cdr3")

# Get context data
scorecard_2020_context <- sc_init() |>       # Set up our API 'call'
  sc_year(2020) |>                           # Set the year to only 2020
  sc_filter(stabbr == "MA") |>               # Ask for only MA data
  sc_select_(scorecard_variables_context) |> # Set variables
  sc_get()                                   # Get the thing!

scorecard_2017_context <- sc_init() |>
  sc_year(2017) |>            
  sc_filter(stabbr == "MA") |>
  sc_select_(scorecard_variables_context) |>
  sc_get()

scorecard_2014_context <- sc_init() |>
  sc_year(2014) |>            
  sc_filter(stabbr == "MA") |>
  sc_select_(scorecard_variables_context) |>
  sc_get()

# Get student data
scorecard_2020_student <- sc_init() |>      
  sc_year(2020) |>                   
  sc_filter(stabbr == "MA") |>       
  sc_select_(scorecard_variables_students) |>
  sc_get()                           

scorecard_2017_student <- sc_init() |>
  sc_year(2017) |>            
  sc_filter(stabbr == "MA") |>
  sc_select_(scorecard_variables_students) |>
  sc_get()

scorecard_2014_student <- sc_init() |>
  sc_year(2014) |>            
  sc_filter(stabbr == "MA") |>
  sc_select_(scorecard_variables_students) |>
  sc_get()

We now have six dataframes containing data for MA universities and colleges. You will also need to download the scorecard documentation for this lab to understand the variables. You can find both the Data Dictionary and the Technical Documentation on the scorecard website. I recommend saving both in the docs/ directory within your project folder.

Once you have downloaded the documentation, take some time to read up about each of the variables we will be using. The search function is helpful here.

Question 1

Read through the data documentation and write a short description of what each of the variables in our dataframes means. Do you see any issues with the variables as they are defined?

REPLACE THIS TEXT WITH YOUR ANSWER

Question 2

Combine all of the scorecard dataframes into one called scorecard_all. You should have only one column for each variable. Optionally, remove the now redundant dataframes from your environment.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Exploratory Data Analysis (EDA) & Cleaning

Now that you have combined the data and familiarized yourself with the variables, take some time to understand the dataset.

Question 3

Perform some EDA on our dataset. Confirm that all of the variables are represented by the correct data type in R. Correct them if they are not.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 4

Do any other variables require re-coding? If so, do that now.

Tip

Depending on the tool you use to re-code, you may get an error saying “✖ Can’t convert <character> to <integer>.” This is the tidyverse trying to protect you, but being overzealous. You can sidestep this by converting the entire column into a character first, or using as.data.frame() to turn the tibble (a tidyverse dataframe) back into a normal R dataframe.
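
Here is a minimal sketch of both workarounds on a toy tibble. It assumes the tibble package is installed; the size_code column and its labels are made up for illustration and are not scorecard variables.

```r
library(tibble)

# Toy data: `size_code` is a hypothetical integer-coded variable
toy <- tibble(school = c("A", "B", "C"), size_code = c(1L, 2L, 1L))

# Assigning text into the integer column of a tibble is what triggers the error:
# toy[toy$size_code == 1L, "size_code"] <- "small"

# Option 1: convert the whole column to character first, then re-code
toy_chr <- toy
toy_chr$size_code <- as.character(toy_chr$size_code)
toy_chr[toy_chr$size_code == "1", "size_code"] <- "small"
toy_chr[toy_chr$size_code == "2", "size_code"] <- "large"

# Option 2: drop back to a base R dataframe, which coerces the column silently
toy_df <- as.data.frame(toy)
toy_df[toy_df$size_code == 1L, "size_code"] <- "small"
toy_df[toy_df$size_code == "2", "size_code"] <- "large"
```

Option 1 keeps the data as a tibble, which may be preferable if the rest of your code relies on tidyverse behavior.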

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Communicating with Plots

Question 5

Create a plot that shows the relationship between the admission rate, cost of attendance (4-year programs), and the control of the institution.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 6

What patterns can we see in this plot? If you had to explain the story here to someone who cannot see the plot themselves, what would you say?

REPLACE THIS TEXT WITH YOUR ANSWER

Question 7

Create a plot that shows the relationship between the highest degree awarded by an institution and the three-year cohort default rate. I want to see the distribution of the default rate per institution type, rather than just the frequency. Only use 2017 data for this plot.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 8

Is there any pattern in three-year cohort default rates by the highest degree type an institution grants?

REPLACE THIS TEXT WITH YOUR ANSWER

CHALLENGE QUESTION

Using any and all of the tools at your disposal (within R), create a “publication ready” (ready to be shared widely) data visualization that highlights (in a good or bad way) Smith College in the scorecard data.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>