Day 10 - Aggregation and Merging

Fall 2022

Dr. Jared Joseph

October 05, 2022

Overview

Timeline

  • Aggregation
  • Merging
  • Broken Promises

Goal

To learn commands to aggregate within data sets, and merge across data sets.

dplyr

dplyr bring SQL-like commands into R.


SQL (Structured Query Language) is a language specifically for databases.


There are good base R tools for doing aggregation and merging, but learning dplyr is more generalizable.

Aggregation

By Groups 1

```{r}
library(palmerpenguins)
penguins = data.frame(penguins)

penguins[c(1:3, 21:23, 31:33),]
```
   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1   Adelie Torgersen           39.1          18.7               181        3750
2   Adelie Torgersen           39.5          17.4               186        3800
3   Adelie Torgersen           40.3          18.0               195        3250
21  Adelie    Biscoe           37.8          18.3               174        3400
22  Adelie    Biscoe           37.7          18.7               180        3600
23  Adelie    Biscoe           35.9          19.2               189        3800
31  Adelie     Dream           39.5          16.7               178        3250
32  Adelie     Dream           37.2          18.1               178        3900
33  Adelie     Dream           39.5          17.8               188        3300
      sex year
1    male 2007
2  female 2007
3  female 2007
21 female 2007
22   male 2007
23 female 2007
31 female 2007
32   male 2007
33 female 2007

By Groups 2

By Groups 3

```{r}
library(dplyr)

body_group = penguins |>
  group_by(island) |>
  summarise(body = mean(
    body_mass_g,
    na.rm = TRUE))

body_group
```
# A tibble: 3 × 2
  island     body
  <fct>     <dbl>
1 Biscoe    4716.
2 Dream     3713.
3 Torgersen 3706.
```{r}
barplot(body_group$body,
        names.arg = body_group$island)
```

Merging

Binding

To Bind dataframes is to combine them by directly stacking them together.

Vertically using rbind

Horizontally using cbind

cbind Errors

Joining

To join dataframes is to combine them by using a key, or a common identifier between dataframes.

Inner Join

Inner Join

Outer (Full) Join

Outer (Full) Join

Left Join

Left Join

Broken Promises

Merging as a Problem

… In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office. (emphasis added)

The Dire Stats of De-Anonymization

What percentage of Americans can be uniquely identified using only ZIP code, birth date (including year), and sex?

87%


What percentage of users can be uniquely identified using 3 movie rentals from Netflix?

80%

For Next Time

Topic

LAB 3/Quiz 1 Open

To-Do

  • Finish Worksheet