Day 9 - Tidy Data

Fall 2022

Dr. Jared Joseph

September 26, 2022

Overview

Timeline

  • Lab 1 Grades
  • What is “Tidy” Data
  • Re-coding Variables
  • Data Formats

Goal

To learn what tidy data is and some basic data cleaning tools for getting there.

Lab 1 Assessment

What is “Tidy” Data

Anatomy of a Dataframe

From R for Data Science

What Tidy Means

There should only ever be one piece of information per cell.

|  | Data | Tidy Data |
|---|---|---|
| Names | [“Jared Joseph”] | [“Jared”] [“Joseph”] |
| Addresses | [“10 Elm Street, Northampton, MA 01063”] | [“10 Elm Street”] [“Northampton”] [“MA”] [“01063”] |
| Date Ranges | [“Sep 10, 2021 to Sep 21, 2021”] | [“Sep 10, 2021”] [“Sep 21, 2021”] |
| Counts | [“Blue: 20; Red: 10; Yellow: 15”] | [20] [10] [15] |
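
As a rough sketch of how the Names row might be tidied in R, the example below splits a combined name into two columns with separate() from tidyr. The people data frame, its second row, and the new column names are made up for illustration.

```r
library(tidyr)

# Hypothetical untidy data: two pieces of information in one cell
people <- data.frame(name = c("Jared Joseph", "Ada Lovelace"))

# Split on the space so that each cell holds one piece of information
people_tidy <- people |>
  separate(col = name, into = c("first_name", "last_name"), sep = " ")

people_tidy
```

Splitting on a single space only works when every name has exactly two parts; real name data usually needs more care.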

Never Touch Raw Data

Raw data files should never be altered. Make all your modifications using code, then save them into a new file if needed.
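
One way this workflow could look in practice is sketched below. The file names and the toy data are assumptions for illustration, and the raw file is created in a temporary folder only so the example runs on its own.

```r
# Stand-in for a raw data file (hypothetical); in practice it already exists
raw_path <- file.path(tempdir(), "survey_raw.csv")
write.csv(data.frame(id = 1:3, score = c(10, 20, 30)), raw_path, row.names = FALSE)

# Read the raw file, and never write back to raw_path
raw <- read.csv(raw_path)

# Make every modification in code so it can be re-run later
clean <- raw
clean$score_doubled <- clean$score * 2

# Save the result to a new file, leaving the raw file untouched
clean_path <- file.path(tempdir(), "survey_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Because the changes live in a script, anyone with the raw file can rebuild the cleaned file from scratch.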

Reproducibility is Paramount

Re-coding Variables

Re-Coding Data Types

What is wrong with this vector?

```{r}
mystery_vector
```
[1] "1"  "6"  "2"  "6"  "8"  "22" "5"  "7" 

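The quotation marks in the output show that R is storing these numbers as a character vector, so numeric operations such as sum() will not work on them.

We can force R to treat data as a specific type using the as.XXXX() family of functions, such as as.numeric(), as.character(), and as.logical().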

```{r}
as.numeric(mystery_vector)
```
[1]  1  6  2  6  8 22  5  7
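
The short sketch below rebuilds the vector from the output above and checks its type before and after conversion. The messy_vector example is hypothetical and shows what happens when a value cannot be converted.

```r
# Recreate the vector shown above
mystery_vector <- c("1", "6", "2", "6", "8", "22", "5", "7")

class(mystery_vector)  # "character"

# Convert, then confirm that math now works
fixed_vector <- as.numeric(mystery_vector)
class(fixed_vector)    # "numeric"
sum(fixed_vector)      # 57

# Values that cannot be converted become NA, and R raises a warning
# (suppressWarnings() hides that warning here)
messy_vector <- c("1", "two", "3")
converted <- suppressWarnings(as.numeric(messy_vector))
converted              # 1 NA 3
```

Checking for new NA values after a conversion is a quick way to spot entries that were not really numbers.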

Re-coding Specific Data

Code to Re-code

It’s turtles sub-setting all the way down.

```{r, eval=FALSE}
# Start with an empty column, then fill it in one day code at a time
sf_subset$recoded_day <- NA
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 1] <- "Sunday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 2] <- "Monday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 3] <- "Tuesday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 4] <- "Wednesday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 5] <- "Thursday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 6] <- "Friday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 7] <- "Saturday"
```
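
The same re-coding can also be sketched more compactly with a lookup vector. This is an alternative to the approach above, not a replacement for it, and it assumes the day codes are whole numbers from 1 to 7 (or NA). The toy sf_subset below is a hypothetical stand-in for the real data.

```r
# Hypothetical stand-in for sf_subset, with day codes 1 (Sunday) to 7 (Saturday)
sf_subset <- data.frame(DAYSTOP2 = c(1, 7, 3, 3, 5))

# Position in this vector matches the day code
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Use each day code as an index into the lookup vector
sf_subset$recoded_day <- day_names[sf_subset$DAYSTOP2]

sf_subset
```

Under the hood this is still sub-setting: each code picks out one element of day_names.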

Defining Dates

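Dates are a bit trickier. R reads them in as plain character strings, so we have to tell it they are dates. The lubridate package has parsing functions for this, such as ymd() for year-month-day strings.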

```{r}
library(lubridate)

sf_subset$DATESTOP[1:10]
class(sf_subset$DATESTOP)
```
 [1] "2006-12-28" "2006-08-09" "2006-12-22" "2006-11-22" "2006-05-18"
 [6] "2006-03-13" "2006-08-09" "2006-12-06" "2006-12-09" "2006-07-25"
[1] "character"


```{r}
ymd(sf_subset$DATESTOP)[1:10]
class(ymd(sf_subset$DATESTOP))
```
 [1] "2006-12-28" "2006-08-09" "2006-12-22" "2006-11-22" "2006-05-18"
 [6] "2006-03-13" "2006-08-09" "2006-12-06" "2006-12-09" "2006-07-25"
[1] "Date"
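
Once a column is a real Date, R can do date math with it. The sketch below uses two of the date strings from the output above; the mdy() line assumes a month/day/year string purely for illustration.

```r
library(lubridate)

# Two of the date strings shown above
date_strings <- c("2006-12-28", "2006-08-09")
stop_dates <- ymd(date_strings)

# Pull out parts of the date
wday(stop_dates, label = TRUE)  # day of the week as a labeled factor
month(stop_dates)               # 12 8

# Subtract dates to get the time between them
stop_dates[1] - stop_dates[2]   # Time difference of 141 days

# Other orderings have their own parsers, e.g. month/day/year
mdy("12/28/2006") == stop_dates[1]  # TRUE
```

None of this works on the original character strings, which is why the conversion comes first.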

Lubridate Cheatsheet

Dataframe Formats

Wide and Long Data

Data Science Workshops

Wide Data

```{r}
library(tidyr)

survey <- read.csv("https://raw.githubusercontent.com/Intro-to-Data-Science-Template/intro_to_data_science_reader/main/content/class_worksheets/4_r_rstudio/data/survey_data.csv")
```
```{r}
head(survey)
```
# A tibble: 6 × 23
  fav_char     major fav_c…¹ fav_num other…² b_month car   pinea…³ mint_…⁴ nerd 
  <chr>        <lgl> <chr>     <int> <lgl>   <chr>   <lgl> <lgl>     <int> <lgl>
1 Spike Spieg… NA    Orange        3 NA      Decemb… FALSE FALSE         4 TRUE 
2 Doreamon     NA    purple        9 NA      August  FALSE TRUE          1 FALSE
3 Sherlock Ho… NA    Seafoa…      27 NA      Septem… FALSE TRUE          5 TRUE 
4 Tiana        NA    Purple        6 NA      Septem… TRUE  FALSE         1 FALSE
5 Crush        NA    blue          8 NA      October TRUE  TRUE          5 TRUE 
6 Thor         NA    Yellow        8 NA      May     FALSE FALSE         5 FALSE
# … with 13 more variables: hours_sleep <int>, pets <chr>, fav_art <chr>,
#   coffee_days <chr>, tea_days <chr>, soda.pop_days <chr>, juice_days <chr>,
#   none_days <chr>, lt_location <chr>, fict <chr>, recreation <chr>,
#   hotdog <lgl>, key <chr>, and abbreviated variable names ¹​fav_color,
#   ²​other_classes, ³​pineapple_pizza, ⁴​mint_choc

To Long Data

```{r}
# Values of mixed types must share one column, so convert them all to character.
survey_long <- survey |>
  pivot_longer(cols = -fav_char, values_transform = as.character)
head(survey_long)
```
# A tibble: 6 × 3
  fav_char      name          value   
  <chr>         <chr>         <chr>   
1 Spike Spiegal major         <NA>    
2 Spike Spiegal fav_color     Orange  
3 Spike Spiegal fav_num       3       
4 Spike Spiegal other_classes <NA>    
5 Spike Spiegal b_month       December
6 Spike Spiegal car           FALSE   
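
The default column names name and value are not very descriptive. As a small sketch on made-up data, the names_to and values_to arguments of pivot_longer() let you choose your own; the drinks data frame and its column names are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one column per drink
drinks <- data.frame(
  student = c("A", "B"),
  coffee_days = c(5, 0),
  tea_days = c(2, 7)
)

# One row per student-drink pair, with descriptive column names
drinks_long <- drinks |>
  pivot_longer(
    cols = -student,
    names_to = "drink",
    values_to = "days_per_week"
  )

drinks_long
```

Here both value columns are numeric, so no values_transform is needed.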

Back to Wide Data

```{r}
# By default, column names come from `name` and cell values from `value`.
survey_wide <- survey_long |> pivot_wider()
head(survey_wide)
```
# A tibble: 6 × 23
  fav_char     major fav_c…¹ fav_num other…² b_month car   pinea…³ mint_…⁴ nerd 
  <chr>        <chr> <chr>   <chr>   <chr>   <chr>   <chr> <chr>   <chr>   <chr>
1 Spike Spieg… <NA>  Orange  3       <NA>    Decemb… FALSE FALSE   4       TRUE 
2 Doreamon     <NA>  purple  9       <NA>    August  FALSE TRUE    1       FALSE
3 Sherlock Ho… <NA>  Seafoa… 27      <NA>    Septem… FALSE TRUE    5       TRUE 
4 Tiana        <NA>  Purple  6       <NA>    Septem… TRUE  FALSE   1       FALSE
5 Crush        <NA>  blue    8       <NA>    October TRUE  TRUE    5       TRUE 
6 Thor         <NA>  Yellow  8       <NA>    May     FALSE FALSE   5       FALSE
# … with 13 more variables: hours_sleep <chr>, pets <chr>, fav_art <chr>,
#   coffee_days <chr>, tea_days <chr>, soda.pop_days <chr>, juice_days <chr>,
#   none_days <chr>, lt_location <chr>, fict <chr>, recreation <chr>,
#   hotdog <chr>, key <chr>, and abbreviated variable names ¹​fav_color,
#   ²​other_classes, ³​pineapple_pizza, ⁴​mint_choc
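
Notice that every column is now <chr>, because the values were converted to character on the way to long format. One possible way to recover the types, sketched here on a small hypothetical data frame rather than the survey itself, is base R's type.convert().

```r
library(tidyr)

# Hypothetical miniature version of the survey
toy <- data.frame(
  fav_char = c("Tiana", "Thor"),
  fav_num = c(6L, 8L),
  car = c(TRUE, FALSE)
)

# Round trip: wide to long (as character) and back to wide
toy_long <- toy |>
  pivot_longer(cols = -fav_char, values_transform = as.character)
toy_wide <- toy_long |> pivot_wider()
sapply(toy_wide, class)      # every column is "character"

# Let R guess each column's type again; as.is = TRUE keeps text as character
toy_restored <- type.convert(as.data.frame(toy_wide), as.is = TRUE)
sapply(toy_restored, class)  # character, integer, logical
```

The guesses are usually sensible, but it is worth checking the classes afterward instead of assuming they match the original.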

Code Along

For Next Time

Topic

Aggregation and Merging

To-Do

  • Today’s Worksheet
  • Lab 2 Due