#<REPLACE THIS COMMENT WITH YOR ANSWER>
Lab 3. Tidy Data, Aggregation, and Merging
Introduction
Click here to access the lab on Github Classroom: Github Classroom Assignment for Lab 3: Tidy Data, Aggregation, and Merging
On a data science project, making our data usable often takes far longer than running the actual analyses we are interested in. A key component of this process is combining multiple data sources. It is rare that a single data source has all the information we need to ask interesting questions, so understanding the different strategies and tools to combine data is a critical skill.
The Data
Today we will be working with census data about Massachusetts 2015-2020, specifically the American Community Survey 5-year Estimates. There are two files for each year, one containing population data (pop_acs5_20XX.csv
), the other economic data (econ_acs5_20XX.csv
). The census impacts nearly every aspect of life in the United States. In terms of governance, census numbers decide how many representatives each state has, where the districts are that send those representatives, the money that will be given to those districts by the federal government, what that money will be used for, and more. Businesses also use census data for market research, to decide where to expand or close stores, and how to advertise. Social scientists use the census for nearly everything; it provides the baseline against which almost every question is asked.
Once we have all of our data formatted correctly, we can start asking questions regarding the economic situations of the population. Specifically, we will be looking at poverty rates by sex.
To start the lab, load in all 12 CSV files from the data directory into dataframes that share their name (e.g. the pop_acs5_2020.csv will become the pop_acs5_2020
dataframe). After all of the dataframes are in your environment, we can start to prepare them for analyses.
While not explicitly asked for here, it is a good idea to use the functions we have learned so far to get a sense of our data! The variables in this data are as follows:
- pop_total
- Total Population
- pop_male
- Population identified as male
- pop_female
- Population identified as female
- inc_12_pov
- Population living below the federal poverty line based on income in the past 12 months
- inc_12_pov_m
- Population living below the federal poverty line based on income in the past 12 months (identified male)
- inc_12_pov_f
- Population living below the federal poverty line based on income in the past 12 months (identified female)
- inc_12_non
- Population living above the federal poverty line based on income in the past 12 months
- inc_12_non_m
- Population living above the federal poverty line based on income in the past 12 months (identified male)
- inc_12_non_f
- Population living above the federal poverty line based on income in the past 12 months (identified female)
- “estimate”
- The estimated value of that variable
- “moe”
- The margin of error for the estimate
The Shape of Data
The first step we will want to undertake is getting the data into a more manageable format. To do that we will need to pivot it.
#<REPLACE THIS COMMENT WITH YOR ANSWER>
Combining Data
Now that all of our dataframes are pivoted, we can start to combine them. We have one very important step first though! We need to add the year to our dataframes. Otherwise, once they are all combined together, we won’t be able to tell what year the data is for!
#<REPLACE THIS COMMENT WITH YOR ANSWER>
Now, we can start combining. First, lets combine the two data sets so that we have all of our population data in one dataframe, and all of our economic data in another.
#<REPLACE THIS COMMENT WITH YOR ANSWER>
Our last step before we can run analyses is to combine our two topical dataframes.
#<REPLACE THIS COMMENT WITH YOR ANSWER>
Creating Measures
Now we can finally start asking questions! First we need to calculate some new columns. A problem we have right now is that the value in each of our cells is a count of how many people fit into each categorization. That’s not particularly helpful for making comparisons when looking at locations because some places simply have more people than others. So, we need to calculate a percentage of how many people in an area live below the poverty line out of all people in the area.
#<REPLACE THIS COMMENT WITH YOR ANSWER>
#<REPLACE THIS COMMENT WITH YOR ANSWER>
#<REPLACE THIS COMMENT WITH YOR ANSWER>
#<REPLACE THIS COMMENT WITH YOR ANSWER>
#<REPLACE THIS COMMENT WITH YOR ANSWER>