Lab 3. Tidy Data, Aggregation, and Merging

Author

Jared Joseph

Introduction

Click here to access the lab on Github Classroom: Github Classroom Assignment for Lab 3: Tidy Data, Aggregation, and Merging

On a data science project, making our data usable often takes far longer than running the actual analyses we are interested in. A key component of this process is combining multiple data sources. It is rare that a single data source has all the information we need to ask interesting questions, so understanding the different strategies and tools to combine data is a critical skill.

The Data

Today we will be working with census data about Massachusetts 2015-2020, specifically the American Community Survey 5-year Estimates. There are two files for each year, one containing population data (pop_acs5_20XX.csv), the other economic data (econ_acs5_20XX.csv). The census impacts nearly every aspect of life in the United States. In terms of governance, census numbers decide how many representatives each state has, where the districts are that send those representatives, the money that will be given to those districts by the federal government, what that money will be used for, and more. Businesses also use census data for market research, to decide where to expand or close stores, and how to advertise. Social scientists use the census for nearly everything; it provides the baseline against which almost every question is asked.

Once we have all of our data formatted correctly, we can start asking questions regarding the economic situations of the population. Specifically, we will be looking at poverty rates by sex.

To start the lab, load in all 12 CSV files from the data directory into dataframes that share their name (e.g. the pop_acs5_2020.csv will become the pop_acs5_2020 dataframe). After all of the dataframes are in your environment, we can start to prepare them for analyses.

Question 1

Load all of the dataframes into your R environment.

#<REPLACE THIS COMMENT WITH YOR ANSWER>

While not explicitly asked for here, it is a good idea to use the functions we have learned so far to get a sense of our data! The variables in this data are as follows:

pop_total
Total Population
pop_male
Population identified as male
pop_female
Population identified as female
inc_12_pov
Population living below the federal poverty line based on income in the past 12 months
inc_12_pov_m
Population living below the federal poverty line based on income in the past 12 months (identified male)
inc_12_pov_f
Population living below the federal poverty line based on income in the past 12 months (identified female)
inc_12_non
Population living above the federal poverty line based on income in the past 12 months
inc_12_non_m
Population living above the federal poverty line based on income in the past 12 months (identified male)
inc_12_non_f
Population living above the federal poverty line based on income in the past 12 months (identified female)
“estimate”
The estimated value of that variable
“moe”
The margin of error for the estimate

The Shape of Data

Question 2

What orientation is the data in? Wide or Long?

REPLACE THIS TEXT WITH YOUR ANSWER

The first step we will want to undertake is getting the data into a more manageable format. To do that we will need to pivot it.

Question 3

Pivot all of the dataframes into a more usable format.

Tip

Make sure to look at the function documentation! You will need to add some arguments. Remember whenever pivoting you need to think of the key (sometimes called name) which identifies your variables, and the value(s) which are the values associated with that key.

#<REPLACE THIS COMMENT WITH YOR ANSWER>

Combining Data

Now that all of our dataframes are pivoted, we can start to combine them. We have one very important step first though! We need to add the year to our dataframes. Otherwise, once they are all combined together, we won’t be able to tell what year the data is for!

Question 4

Add a new column to each of our dataframes called year to indicate what year the data is for.

#<REPLACE THIS COMMENT WITH YOR ANSWER>

Now, we can start combining. First, lets combine the two data sets so that we have all of our population data in one dataframe, and all of our economic data in another.

Question 5

Combine the 6 dataframes that contain population data into one dataframe called pop_acs5_all and all of the economic data into one dataframe called econ_acs5_all.

#<REPLACE THIS COMMENT WITH YOR ANSWER>
Tip

Your environment is started to get a little cluttered. If you would like to clean it up, you can use the rm() function to remove items from your environment. You can remove all of the single-year data now that we’ve merged everything.

Our last step before we can run analyses is to combine our two topical dataframes.

Question 6

Combine pop_acs5_all and econ_acs5_all into one dataframe called ma_acs5_all.

Tip

Make sure to check how many rows you have in your new dataframe! If you have way more than before, something went wrong.

#<REPLACE THIS COMMENT WITH YOR ANSWER>

Creating Measures

Now we can finally start asking questions! First we need to calculate some new columns. A problem we have right now is that the value in each of our cells is a count of how many people fit into each categorization. That’s not particularly helpful for making comparisons when looking at locations because some places simply have more people than others. So, we need to calculate a percentage of how many people in an area live below the poverty line out of all people in the area.

Question 7

Create a new column called per_pov_total which calculates the percentage of people per location living below the poverty line.

#<REPLACE THIS COMMENT WITH YOR ANSWER>
Question 8

Create a two new columns called per_pov_m and per_pov_f which calculates the percentage of people categorized as male and female per location living below the poverty line.

#<REPLACE THIS COMMENT WITH YOR ANSWER>
Question 9

Aggregate by location and find the mean percentage of people living below the poverty line for everyone, people who were identified as male, and people who were identified as female. Assign this table to a new object called agg_table.

#<REPLACE THIS COMMENT WITH YOR ANSWER>
Question 10

Using agg_table, look at the percentages on poverty in MA by sex (as categorized by the census). Use some of the tools we have learned thus far to understand these numbers. Then, write a small paragraph explaining what you found.

REPLACE THIS TEXT WITH YOUR ANSWER

#<REPLACE THIS COMMENT WITH YOR ANSWER>
CHALLANGE QUESTION

Make a plot showing poverty percentages in Northampton over time for both people identified as male and female.

#<REPLACE THIS COMMENT WITH YOR ANSWER>