Aside from subsetting, aggregating and merging data may be the most common tasks of a data scientist.
`group_by()` and `summarise()`
Oftentimes we want to know how different categories in our data compare to each other. This comparison lets us get a sense of whether one group is in some way different from the others, and potentially make decisions based on that difference. For a simple example, let's look at our class survey data.
```r
survey = read.csv("https://raw.githubusercontent.com/Intro-to-Data-Science-Template/intro_to_data_science_reader/main/content/class_worksheets/4_r_rstudio/data/survey_data.csv")
```
If we `skim()` our data, we have quite a few variables we could group our cases by to make comparisons.
```r
library(skimr)
skim(survey)
```
Data summary

| | |
|---|---|
| Name | survey |
| Number of rows | 15 |
| Number of columns | 23 |
| Column type frequency: | |
| character | 14 |
| logical | 6 |
| numeric | 3 |
| Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
fav_char | 0 | 1.00 | 4 | 39 | 0 | 15 | 0 |
fav_color | 1 | 0.93 | 3 | 50 | 0 | 10 | 0 |
b_month | 0 | 1.00 | 3 | 9 | 0 | 9 | 0 |
pets | 0 | 1.00 | 3 | 17 | 0 | 8 | 0 |
fav_art | 3 | 0.80 | 17 | 1158 | 0 | 12 | 0 |
coffee_days | 7 | 0.53 | 6 | 62 | 0 | 7 | 0 |
tea_days | 10 | 0.33 | 6 | 17 | 0 | 5 | 0 |
soda.pop_days | 9 | 0.40 | 7 | 62 | 0 | 5 | 0 |
juice_days | 4 | 0.73 | 6 | 62 | 0 | 8 | 0 |
none_days | 10 | 0.33 | 17 | 45 | 0 | 5 | 0 |
lt_location | 3 | 0.80 | 7 | 47 | 0 | 12 | 0 |
fict | 0 | 1.00 | 7 | 11 | 0 | 2 | 0 |
recreation | 0 | 1.00 | 51 | 232 | 0 | 15 | 0 |
key | 0 | 1.00 | 6 | 41 | 0 | 15 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
major | 15 | 0 | NaN | : |
other_classes | 15 | 0 | NaN | : |
car | 0 | 1 | 0.33 | FAL: 10, TRU: 5 |
pineapple_pizza | 0 | 1 | 0.67 | TRU: 10, FAL: 5 |
nerd | 0 | 1 | 0.73 | TRU: 11, FAL: 4 |
hotdog | 0 | 1 | 0.47 | FAL: 8, TRU: 7 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
fav_num | 0 | 1 | 15.60 | 14.92 | 3 | 7 | 9 | 20.5 | 56 | ▇▁▂▁▁ |
mint_choc | 0 | 1 | 3.60 | 1.72 | 1 | 2 | 4 | 5.0 | 5 | ▅▁▁▃▇ |
hours_sleep | 0 | 1 | 6.87 | 0.74 | 6 | 6 | 7 | 7.0 | 8 | ▆▁▇▁▃ |
For now, let us explore whether there is any difference between those who do and do not categorize a hot dog as a sandwich, given that it nearly evenly splits the class. We can do this using the `group_by()` function. In essence, we want to group our data by `TRUE` and `FALSE` in our `hotdog` column, and then perform some other analysis. We can direct that analysis using the `summarise()` function. For example:
```r
library(dplyr)
```

```
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
```
```r
survey |>
  group_by(hotdog) |>
  summarise(sleep = mean(hours_sleep))
```

```
# A tibble: 2 × 2
  hotdog sleep
  <lgl>  <dbl>
1 FALSE   7.12
2 TRUE    6.57
```
In this example, we take our `survey` data, pipe it to `group_by()` where we specify we want to group by hotdog status, then pipe that grouped data to `summarise()` where we say we want the mean of our `hours_sleep` column in an output called `sleep`. We can see in our example that those who do not consider a hot dog a sandwich, at least in our class, manage to get slightly more sleep on average.
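As a side note, `summarise()` can compute several outputs at once. For instance, dplyr's `n()` helper counts the rows in each group, which is useful for judging how much data each mean is based on. A minimal sketch:

```r
# group by hotdog status and compute several summaries at once
survey |>
  group_by(hotdog) |>
  summarise(
    sleep = mean(hours_sleep),  # average hours of sleep per group
    n     = n()                 # how many respondents fall in each group
  )
```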
Try copying the format from our example, but group by birthday month, and get both the mean of `hours_sleep` and a table of `nerd`. Do you see any concerns with the output?
```r
survey |>
  group_by(b_month) |>
  summarise(sleep = mean(hours_sleep), nerd = table(nerd))
```
We have so many groups that most of them likely have only one member, so there is no real use in grouping this way.
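A quick way to check group sizes before committing to a grouping is dplyr's `count()`, which is shorthand for `group_by()` plus `summarise(n = n())`:

```r
# tally how many respondents share each birth month;
# a "mean" over a group of one is just that person's value
survey |>
  count(b_month)
```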
To start, let's load in some data to practice merging. Execute the following to create several dataframes we will use throughout.
```r
source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/2_merge_examples.R")
```
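If the remote script is ever unavailable, you can build rough stand-ins yourself. Note that everything below is an assumption for illustration; the column names and values are invented and are not necessarily what the script actually creates:

```r
# hypothetical stand-ins for the sourced dataframes; the structure is assumed
upper_rbind_df <- data.frame(id = 1:3, hometown = c("Davis", "Sacramento", "Woodland"))
lower_rbind_df <- data.frame(id = 4:6, hometown = c("Dixon", "Winters", "Vacaville"))
left_cbind_df  <- data.frame(id = 1:3, fav_color = c("red", "blue", "green"))
right_cbind_df <- data.frame(pet = c("dog", "cat", "fish"))
left_merge_df  <- data.frame(hometown = c("Davis", "Dixon", "Winters"), left_value = 1:3)
right_merge_df <- data.frame(hometown = c("Davis", "Dixon", "Napa"), right_value = 4:6)
```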
`rbind`
If you have two dataframes and want to combine them, you need to decide what direction you want to combine them in. If you would like to combine them vertically, stacking the rows on top of each other, you can use `rbind`, or row bind. `rbind` is helpful if you have two dataframes with the same columns and you want to combine the cases (rows). However, it will not work if the columns are different. `rbind` accepts an arbitrary number of dataframes; here we use two: `upper_rbind_df` and `lower_rbind_df`.
```r
rbind_outcome = rbind(upper_rbind_df, lower_rbind_df)
```
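To see the failure mode described above, try binding two dataframes whose column names differ; `rbind()` will refuse. A small sketch with made-up dataframes:

```r
a <- data.frame(x = 1:2, y = 3:4)
b <- data.frame(x = 5:6, z = 7:8)  # second column named differently
# rbind(a, b)  # errors: "names do not match previous names"
```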
Try using `rbind` to combine our new `rbind_outcome` dataframe with `upper_rbind_df` again. Why does this work?
Both of the dataframes still have the same columns. Even if rows are repeated, they can still stack vertically.
`cbind`
If you would like to combine two dataframes horizontally, so that you add more columns onto a dataframe, you can use `cbind`, or column bind. The `cbind` function takes an arbitrary number of dataframes as its arguments.
```r
cbind_outcome = cbind(left_cbind_df, right_cbind_df)
```
We can see our new `cbind_outcome` dataframe was created as expected. This method can work, but you must be very confident about the structure of your data.
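Part of that brittleness is silent recycling: if one argument's length divides evenly into the other's row count, R simply repeats it without warning. A minimal sketch:

```r
# four rows bound to a length-two vector: the vector is quietly recycled
df <- data.frame(id = 1:4)
cbind(df, flag = c(TRUE, FALSE))
#>   id  flag
#> 1  1  TRUE
#> 2  2 FALSE
#> 3  3  TRUE
#> 4  4 FALSE
```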
Try using `cbind` to combine `left_cbind_df` with our `rbind_outcome` dataframe from above. What happens? Why is the result bad?
```r
cbind(left_cbind_df, rbind_outcome)
```
It works, but produces a result we might not expect, with repeated rows of data.
`cbind` may work in some situations, but it is very "brittle" (easy to break) with any changes to the data. A more reliable method of merging makes use of a common key between two sources of data. Think of a key like a luggage tag on a suitcase: it is a small bit of information that clearly links two things, even if they are separated. In terms of data, a key can be anything, like a numerical ID or a string, but keys must always be unique. Joins that use the key method will be familiar to anyone who has used SQL in the past, as they follow the same naming conventions as that language.
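To make the idea concrete, here is a hypothetical pair of tables that share a unique `id` key (the names are invented for illustration):

```r
# the "id" column acts as the luggage tag linking a student to their grade
students <- data.frame(id = c(1, 2, 3), name = c("Ada", "Ben", "Cam"))
grades   <- data.frame(id = c(2, 3, 4), grade = c("A", "B", "C"))
```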
An inner join tries to join two sets of data using a common key, keeping only the data that exists in both of the original data sets. This is commonly illustrated using a Venn diagram similar to the one below. Only the area highlighted in green will be included in the output.
In the context of our data, it might look something like the following if we use hometown as our key. We can use the following code to do an inner merge using dplyr's `inner_join()` function. `inner_join()` requires three arguments: `x` and `y`, which are the dataframes we would like to merge, and `by`, which is the key we would like to join by. We will be using `left_merge_df` and `right_merge_df`, and asking R to use `hometown` as the key to join by.
```r
library(dplyr)
inner_outcome = inner_join(x = left_merge_df, y = right_merge_df, by = "hometown")
```
Everything seems in order. We can see that when using an inner join, cases where both dataframes have a single row with our key will be matched and joined into a single output dataframe. However, if there are rows in either dataframe without a match, those rows will be dropped from our data.
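If you want to audit what an inner join would throw away, dplyr's `anti_join()` returns the rows of `x` that have no match in `y`:

```r
# rows of left_merge_df whose hometown never appears in right_merge_df;
# these are exactly the left-side rows inner_join() drops
anti_join(x = left_merge_df, y = right_merge_df, by = "hometown")
```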
Create a new dataframe using `rbind()` called `double_left` which binds two copies of `left_merge_df` on top of each other. Then perform an inner join with this new `double_left` and `right_merge_df`. What happens? What was the problem here?
```r
double_left = rbind(left_merge_df, left_merge_df)
inner_join(x = double_left, y = right_merge_df, by = "hometown")
```
The matches are performed multiple times. This essentially doubles our data set, which can cause big problems later.
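One way to guard against this is to confirm the key is unique in each dataframe before joining; `duplicated()` flags repeated values:

```r
# TRUE means the key repeats, so a join may multiply rows
any(duplicated(double_left$hometown))
```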
An outer join is the opposite of an inner join. Rather than just looking for those rows which have a key in common, it will join every row, regardless of the keys, inserting blank values where there is no match. You can get a sense of this merge from the Venn diagram below, where the green indicates good matches, while the orange indicates partial matches.
In the context of our data, an outer join may look something like the following, again using hometown as our key. The code for an outer join using dplyr is `full_join()`.
```r
full_outcome = full_join(x = left_merge_df, y = right_merge_df, by = "hometown")
```
In this example, we see that an outer join will keep all available cases, but we introduce `NA` values into the resulting dataframe where no match could be made. This may or may not be an issue depending on what your next steps are.
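If you want to gauge how much missingness a full join introduced, one quick check is counting `NA` values per column:

```r
# number of NA values in each column of the joined dataframe
colSums(is.na(full_outcome))
```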
A left join allows you to pick one of the two dataframes you are joining and prioritize it. It essentially takes all of the cases in the dataframe on the "left" side, and searches in the "right" dataframe to join what it can. Keeping with the Venn diagram representations, it would look like the following:
Using our example data, a left join would look like this. The R code is similar to the previous examples; in this case it is `left_join()`. Recall that in the function call, `x` is the first dataframe, or the "left" one.
```r
left_outcome = left_join(x = left_merge_df, y = right_merge_df, by = "hometown")
```
We can see that while all of the data from our "left" side is preserved, unmatched data on the "right" is discarded. You can also technically do a right join using `right_join()`, which will do the same thing with the sides reversed, but moving your prioritized data set to the left is more common.
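One more detail worth knowing: the key columns do not need to share a name. dplyr's joins accept a named vector in `by` that matches a column in `x` to a column in `y`. A sketch, renaming our left key to an invented name to demonstrate:

```r
# suppose the left table called its key "town" instead of "hometown"
left_renamed <- left_merge_df
names(left_renamed)[names(left_renamed) == "hometown"] <- "town"
# match "town" on the left to "hometown" on the right
left_join(x = left_renamed, y = right_merge_df, by = c("town" = "hometown"))
```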
Try performing a right merge which creates the same outcome as our left merge above. Describe the differences. Why are these differences present?
```r
right_outcome = right_join(x = right_merge_df, y = left_merge_df, by = "hometown")
```

Putting `right_merge_df` in the `x` position keeps every row of `left_merge_df`, so the same rows are matched as in our left join. The columns come out in a different order, though, because a join always lists the columns of `x` first.