Lab 2. Exploratory Data Analyses (EDA)

Author

Jared Joseph

Introduction

Tip

This lab will ask you to make several data visualizations. You can use any function/package you like to make these visualizations, as long as you create the correct type of visualization and can interpret the results.

Click here to access the lab on GitHub Classroom: GitHub Classroom Assignment for Lab 2: Exploratory Data Analyses (EDA)

This lab aims to reinforce the R coding fundamentals we learned last time, while also introducing the process of Exploratory Data Analyses (EDA). As we work, the lab will also make suggestions regarding the git workflow. A sense of how often you should commit your code develops over time and varies from person to person. In this lab I’ll highlight a few times you should consider committing your code and pushing it to GitHub.

Note

While we are using pushes to GitHub as a way to turn in our assignments, don’t mistake it for something like turning in your assignment on Moodle. You can, and should, push your code often. Each push saves your code on GitHub in case something terrible happens to your computer.

In terms of grading, we will look at whatever version you have uploaded when the lab is due.

The Data

Today we will be looking at stop and frisk data from New York City. Stop and frisk laws in New York allowed officers to stop anyone they had a “reasonable suspicion” of being the suspect in a crime. Stop and frisk became a cornerstone of NYC policing.

As a result of some high-profile shootings in the late 1990s and mid-2000s, both the New York State Attorney General’s Office and the New York Civil Liberties Union began to examine NYPD stop and frisk activity for racial profiling. With pressure from these organizations, information recorded about stop and frisk incidents was released in public databases in the mid-2000s. In the early 2010s, when the NYPD’s use of stop and frisk went before the US District Court, this data was integral in proving that stop and frisk was being carried out in NYC in an unconstitutional way. The ruling mandated that the NYPD create a policy outlining when stops were authorized, and the practice has since declined considerably. You can read the ACLU report if you would like to learn more.

We are going to be looking at the 2006 stop and frisk data, which was around the time when the tactic had firmly reached prominence. You will need to download the data separately from this lab repository. We will be downloading it from the National Archive of Criminal Justice Data. You can do so here. First navigate to that page, and click the “Download” button. Then select the “Delimited” option.

You will be asked to create an account. You will want to make one using your Smith Google account. Do so by selecting the Google option to log in, and then selecting your Smith account. You do not need to provide an address, but you will need to specify that Smith is a College/University, and your department (use your major or something close).

Accept the presented agreement, and wait for the download to finish. Once it has, open the zip file. Inside that zip file, navigate to the “ICPSR_21660” directory and then to the “DS0001” directory inside it. That directory contains two files: “21660-0001-Codebook.pdf” and “21660-0001-Data.tsv”. Place the “21660-0001-Codebook.pdf” file in the “docs” folder in your project directory, and place the “21660-0001-Data.tsv” file into the “data” folder in your project directory.

Getting the Data Set Up

Now that you have the data files in your project directory, we can load the data into R.

Question 1

Load the “21660-0001-Data.tsv” file into an R object called nyc_sf_2006_raw. This may take a moment as the data file is relatively large.

Warning

The data file, “21660-0001-Data.tsv,” is a TSV file rather than the more common CSV. All that really means here is that you need to use read.delim() instead of read.csv() to load the data.
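
As a minimal sketch of how read.delim() treats tab-separated text, the example below writes a tiny made-up file to a temporary location and reads it back. The file contents and the names toy_path and toy_data are hypothetical and are not part of the lab data.

```r
# Write a tiny tab-separated file to a temporary location (toy data, not the lab file)
toy_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tgroup\tvalue", "1\ta\t10", "2\tb\t20"), toy_path)

# read.delim() expects a header row and tab separators by default
toy_data <- read.delim(toy_path)

# Check that the columns were split as expected
str(toy_data)
```

If str() reports a single column whose values contain "\t", the separator was probably not recognized.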

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

For this lab, we will be working with the following variables from the larger dataset:

  • CASEID
  • YEAR
  • PCT
  • DATESTOP
  • DAYSTOP2
  • PERSTOP
  • ARSTMADE
  • FRISKED
  • SEARCHED
  • CONTRABN
  • WEPFOUND
  • RACE
  • SEX
  • AGE

Question 2

Make a new dataframe called sf_subset that contains all rows but only the columns mentioned above.
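
One common way to keep only some columns is to index a dataframe with a character vector of column names. The sketch below uses a made-up dataframe; toy, keep_cols, and toy_small are hypothetical names chosen for illustration.

```r
# A small made-up dataframe with four columns
toy <- data.frame(
  id     = 1:3,
  height = c(150, 160, 170),
  colour = c("red", "blue", "red"),
  note   = c("x", "y", "z")
)

# Names of the columns to keep
keep_cols <- c("id", "colour")

# Leaving the row index empty keeps every row; the column index selects by name
toy_small <- toy[, keep_cols]

names(toy_small)
```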

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Tip

Now would be a good time to push your code to GitHub. You’ve got code that will get your data into R, which is a common project milestone. You don’t want to lose it!

Understanding the Data

Question 3

Use the functions we have learned so far to investigate the structure of the sf_subset dataframe we just made and the data it contains. Do not create visualizations yet. This may require using multiple commands.
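
For reference, a few base R functions that are often used for this kind of inspection are sketched below on a made-up dataframe. Which functions were covered in class is not stated here, so treat this selection as an assumption; toy is a hypothetical object.

```r
# A small made-up dataframe with one missing value
toy <- data.frame(
  id    = 1:4,
  age   = c(21, 35, NA, 48),
  group = c("a", "b", "a", "c")
)

dim(toy)      # number of rows and columns
str(toy)      # type of each column and the first few values
head(toy, 2)  # first two rows
summary(toy)  # per-column summaries, including a count of NAs for numeric columns
```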

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 4

How many incidents of stop and frisk were there in 2006?

REPLACE THIS TEXT WITH YOUR ANSWER

Now that we have the dataframe we will be working with, take some time to get familiar with what all these variables mean. You can read about them in the “21660-0001-Codebook.pdf” we put in the docs/ directory of this project.

Question 5

Open the codebook PDF (21660-0001-Codebook.pdf) and search for each of the variables in our new sf_subset dataframe. Does anything stand out to you? Do you see any limitations in how the data was recorded?

REPLACE THIS TEXT WITH YOUR ANSWER

Data Exploration

Let’s start to ask some questions of our data and see what we can learn.

Question 6

Use a function to get a count of how many people in each racial categorization were stopped in our dataset.
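
One option for counting how often each category appears is table(). The sketch below runs it on a made-up vector; toy_groups and group_counts are hypothetical names.

```r
# Made-up category labels
toy_groups <- c("a", "b", "a", "c", "a", "b")

# table() counts how many times each distinct value occurs
group_counts <- table(toy_groups)

group_counts
```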

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 7

Create a data visualization which graphically represents the number of people in each racial categorization that were stopped in our dataset.
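
A bar chart is one common choice for category counts. As a rough sketch using base graphics and made-up data (any plotting package would work equally well), barplot() can draw one bar per entry of a table; the object names are hypothetical.

```r
# Made-up category labels
toy_groups <- c("a", "b", "a", "c", "a", "b")

# Count each category, then draw one bar per category
group_counts <- table(toy_groups)
barplot(group_counts)
```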

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 8

How does the presentation of the results from questions 6 & 7 change your understanding of the data? Is one more effective than the other?

REPLACE THIS TEXT WITH YOUR ANSWER

Question 9

Create a data visualization which graphically represents the duration of stops (how long police stopped people) in our dataset.
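
For a numeric variable such as a duration, a histogram is one reasonable option. The sketch below uses made-up values and hand-picked bin edges; toy_minutes and toy_hist are hypothetical, and the bin width of 10 is only an illustration.

```r
# Made-up durations in minutes
toy_minutes <- c(2, 5, 5, 10, 10, 10, 15, 20, 30, 60)

# breaks sets the bin edges; hist() draws the plot and also returns the bin counts
toy_hist <- hist(toy_minutes, breaks = seq(0, 60, by = 10))

toy_hist$counts
```

Trying a few different bin widths is worthwhile, since the apparent shape of a distribution can change with the binning.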

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 10

What patterns do you see in this visualization?

REPLACE THIS TEXT WITH YOUR ANSWER

Tip

Now would be a good time to commit and push. You’re changing topics, so committing now adds a checkpoint at the end of a topical section.

Question 11

Find the five-number summary for the age of people stopped in our dataset.
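
The five-number summary consists of the minimum, lower quartile, median, upper quartile, and maximum. A small sketch on made-up values follows; toy_ages is hypothetical. Note that fivenum() reports hinges while summary() reports quantiles, so the two can differ slightly on some datasets.

```r
# Made-up ages
toy_ages <- c(18, 22, 25, 31, 40, 52, 67)

# Minimum, lower hinge, median, upper hinge, maximum
fivenum(toy_ages)

# Similar information plus the mean
summary(toy_ages)
```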

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 12

Make a boxplot that graphically represents the age of people stopped in our dataset.
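
As a sketch with made-up values, base R's boxplot() both draws the plot and returns the numbers it used. The names toy_ages and toy_box are hypothetical. In this toy example no value is far enough from the box to be drawn as an outlier, so the whiskers reach the minimum and maximum; that will not hold for every dataset.

```r
# Made-up ages
toy_ages <- c(18, 22, 25, 31, 40, 52, 67)

# Draw the boxplot and keep the values it computed
toy_box <- boxplot(toy_ages, ylab = "Age (toy values)")

# Lower whisker, lower hinge, median, upper hinge, upper whisker
toy_box$stats
```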

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 13

Describe all parts of the box plot you just made and how they relate to the five-number summary.

REPLACE THIS TEXT WITH YOUR ANSWER

Important Relationships

Question 14

Create a table which counts how many stops were made by racial category. Then divide that table by the total number of stops, and multiply it by 100 to create a percentage for each racial category.
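
The arithmetic described above can be sketched on made-up data as follows. The object names are hypothetical, and prop.table() is mentioned only as an alternative that performs the division step.

```r
# Made-up category labels
toy_groups <- c("a", "b", "a", "c", "a", "b", "a", "c")

# Count per category
group_counts <- table(toy_groups)

# Divide each count by the total, then scale to a percentage
group_pct <- group_counts / sum(group_counts) * 100

group_pct

# Equivalent alternative: prop.table(group_counts) * 100
```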

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 15

What do you notice about the percentage breakdown of the recorded stops?

REPLACE THIS TEXT WITH YOUR ANSWER

Question 16

Create two new dataframes from sf_subset, one that contains only stops where the individual was categorized as “White” called sf_white and a second dataframe containing all other categorizations called sf_other.
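
Splitting a dataframe by a condition can be done with logical indexing on the rows. The sketch below uses a made-up dataframe and made-up category labels; how RACE is actually coded should be checked in the codebook, and the names here are hypothetical.

```r
# A small made-up dataframe
toy <- data.frame(
  id    = 1:5,
  group = c("a", "b", "a", "c", "b")
)

# Rows where the condition is TRUE are kept; the empty column index keeps all columns
toy_a     <- toy[toy$group == "a", ]
toy_not_a <- toy[toy$group != "a", ]

# Sanity check: the two pieces should account for every row
nrow(toy_a) + nrow(toy_not_a) == nrow(toy)
```

If the column contains missing values, this kind of indexing produces rows of NA, so it is worth checking for NAs first.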

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 17

Using both of these dataframes, calculate the percentage of stops that ended in an arrest, separately for individuals categorized as “White” and for individuals of other categorizations.
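
One way to get the percentage of rows meeting a condition is to take the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. The sketch below assumes a made-up column coded "Y"/"N"; the real coding of the lab variables is given in the codebook and may differ, and the names are hypothetical.

```r
# Made-up subset with a yes/no column (the "Y"/"N" coding is an assumption)
toy_a <- data.frame(flag = c("Y", "N", "N", "N"))

# Proportion of rows where flag is "Y", scaled to a percentage
pct_flag_a <- mean(toy_a$flag == "Y") * 100

pct_flag_a
```

Repeating the same calculation on a second subset gives two percentages that can be compared directly.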

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 18

Using both of these dataframes, calculate the percentage of stops in which the person stopped was carrying “contraband,” separately for individuals categorized as “White” and for individuals of other categorizations.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

Question 19

What story do the results from questions 14-18 tell us? Why are these numbers important?

REPLACE THIS TEXT WITH YOUR ANSWER

Tying it Together

The analyses you performed here are very similar to those used as part of the legal battle against the use of stop and frisk in New York City. Those trials ultimately concluded that the way NYC officers were performing stops was illegal. Even simple analyses, when looking at important issues, can result in major changes.

Stop and frisk is still used in New York City, but its use has declined dramatically, in part due to the new oversight imposed to ensure racial equity.

Figure: Stop and Frisk Use Over Time (NY ACLU, 2019)

Tip

Now would be a good time to commit and push. You’ve finished things!

CHALLENGE QUESTION

Create a data visualization that shows the day of the week when stops occurred, and also includes a descriptive title, axis labels, labels for the values, and describes the data source.
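
As a rough sketch of how these elements can be added in base graphics, the example below labels a bar chart built from made-up data. All names and label text are hypothetical, and other plotting packages offer equivalent options.

```r
# Made-up day labels; the factor levels fix the order of the bars
toy_days <- factor(
  c("Mon", "Tue", "Mon", "Wed", "Mon", "Tue"),
  levels = c("Mon", "Tue", "Wed")
)
day_counts <- table(toy_days)

# barplot() returns the x position of each bar, which is reused for the value labels
bar_mids <- barplot(
  day_counts,
  main = "Toy counts by day",
  xlab = "Day",
  ylab = "Count",
  sub  = "Source: made-up toy data",
  ylim = c(0, max(day_counts) + 1)  # leave room above the tallest bar
)

# Write each count just above its bar
text(
  x = bar_mids,
  y = as.vector(day_counts),
  labels = as.vector(day_counts),
  pos = 3
)
```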

#<REPLACE THIS COMMENT WITH YOUR ANSWER>