Fall 2022
September 26, 2022
Goal: learn what tidy data formats are, and some basic data cleaning tools for getting data into them.
From R for Data Science
There should only ever be one piece of information per cell.
Example | Data | Tidy Data |
---|---|---|
Names | [“Jared Joseph”] | [“Jared”][“Joseph”] |
Addresses | [“10 Elm Street, Northampton, MA 01063”] | [“10 Elm Street”][“Northampton”][“MA”][“01063”] |
Date Ranges | [“Sep 10, 2021 to Sep 21, 2021”] | [“Sep 10, 2021”][“Sep 21, 2021”] |
Counts | [“Blue: 20; Red: 10; Yellow: 15”] | [20][10][15] |
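As a minimal sketch of the "one piece of information per cell" rule, the names example from the table can be split with `separate()` from tidyr. The `people` data frame, its second row, and the new column names are made up for illustration.

```{r}
library(tidyr)

# Hypothetical untidy data: first and last name share one cell
people <- data.frame(name = c("Jared Joseph", "Ada Lovelace"))

# separate() splits one column into several at the given separator
people_tidy <- separate(people, col = name,
                        into = c("first_name", "last_name"), sep = " ")
people_tidy
```

Each name part now sits in its own column, so it can be sorted or filtered on its own.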
Raw data files should never be altered. Make all your modifications using code, then save them into a new file if needed.
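One way this workflow might look in base R is sketched below. The file names and the `fav_color` values are hypothetical, and a temporary folder stands in for a real project folder.

```{r}
# Hypothetical paths; tempdir() stands in for a project's data folder
raw_path   <- file.path(tempdir(), "survey_raw.csv")
clean_path <- file.path(tempdir(), "survey_clean.csv")

# Stand-in for a raw file that already exists on disk
write.csv(data.frame(fav_color = c(" Orange", "purple ")),
          raw_path, row.names = FALSE)

# Read the raw file, make every change in code, save to a NEW file
raw <- read.csv(raw_path)
clean <- raw
clean$fav_color <- trimws(clean$fav_color)  # remove stray spaces
write.csv(clean, clean_path, row.names = FALSE)
```

The raw file is only ever read, so the cleaning steps can be rerun or corrected at any time.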
What is wrong with this vector?
R is treating this vector of numbers like a character vector.
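A small sketch of the problem and one possible fix, using a made-up vector `x` (the vector from the slide is not reproduced here):

```{r}
# Hypothetical vector: the quotes make R store the numbers as text
x <- c("1", "2", "3")
class(x)        # "character"

# as.numeric() converts the text to numbers so math works again
x_num <- as.numeric(x)
class(x_num)    # "numeric"
sum(x_num)      # 6
```

Calling `sum(x)` on the character version would fail with an error, which is often the first sign that a column was read in as text.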
It’s turtles sub-setting all the way down.
```{r, eval=FALSE}
# Start with an empty character column, then fill it one day code at a time
sf_subset$recoded_day <- NA_character_
# Each line picks the rows with a given DAYSTOP2 code and assigns the day name
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 1] <- "Sunday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 2] <- "Monday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 3] <- "Tuesday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 4] <- "Wednesday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 5] <- "Thursday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 6] <- "Friday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 7] <- "Saturday"
```
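The same recode can also be sketched with a lookup vector, which is a different technique from the one on the slide. It assumes the codes are whole numbers from 1 to 7, and the small `sf_subset` below is a made-up stand-in for the real data.

```{r}
# Hypothetical stand-in for the real data; DAYSTOP2 codes days as 1-7
sf_subset <- data.frame(DAYSTOP2 = c(1, 7, 3, NA, 2))

# Position in this vector matches the numeric code
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Indexing by the code pulls the matching name; missing codes stay NA
sf_subset$recoded_day <- day_names[sf_subset$DAYSTOP2]
sf_subset
```

This is still subsetting: the codes are used as positions to pull values out of `day_names`.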
Dates are a bit trickier: until they are converted, R stores them as plain character strings.
[1] "2006-12-28" "2006-08-09" "2006-12-22" "2006-11-22" "2006-05-18"
[6] "2006-03-13" "2006-08-09" "2006-12-06" "2006-12-09" "2006-07-25"
[1] "character"
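A minimal sketch of the conversion, using base R's `as.Date()` on three of the strings printed above. The object names are made up for illustration.

```{r}
# Three of the date strings shown above, still stored as text
dates_chr <- c("2006-12-28", "2006-08-09", "2006-12-22")
class(dates_chr)    # "character"

# as.Date() parses the text; the format string describes year-month-day
dates <- as.Date(dates_chr, format = "%Y-%m-%d")
class(dates)        # "Date"

# Real dates support date arithmetic, which character strings do not
dates[1] - dates[2] # Time difference of 141 days
```

Once the column is a `Date`, sorting and filtering by date also behave as expected.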
# A tibble: 6 × 23
fav_char major fav_c…¹ fav_num other…² b_month car pinea…³ mint_…⁴ nerd
<chr> <lgl> <chr> <int> <lgl> <chr> <lgl> <lgl> <int> <lgl>
1 Spike Spieg… NA Orange 3 NA Decemb… FALSE FALSE 4 TRUE
2 Doreamon NA purple 9 NA August FALSE TRUE 1 FALSE
3 Sherlock Ho… NA Seafoa… 27 NA Septem… FALSE TRUE 5 TRUE
4 Tiana NA Purple 6 NA Septem… TRUE FALSE 1 FALSE
5 Crush NA blue 8 NA October TRUE TRUE 5 TRUE
6 Thor NA Yellow 8 NA May FALSE FALSE 5 FALSE
# … with 13 more variables: hours_sleep <int>, pets <chr>, fav_art <chr>,
# coffee_days <chr>, tea_days <chr>, soda.pop_days <chr>, juice_days <chr>,
# none_days <chr>, lt_location <chr>, fict <chr>, recreation <chr>,
# hotdog <lgl>, key <chr>, and abbreviated variable names ¹fav_color,
# ²other_classes, ³pineapple_pizza, ⁴mint_choc
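```{r}
library(tidyr)
# Pivot every column except fav_char into name/value pairs; values become
# character so columns of different types can share the single value column
survey_long <- survey |>
  pivot_longer(cols = -fav_char, values_transform = as.character)
head(survey_long)
```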
# A tibble: 6 × 3
fav_char name value
<chr> <chr> <chr>
1 Spike Spiegal major <NA>
2 Spike Spiegal fav_color Orange
3 Spike Spiegal fav_num 3
4 Spike Spiegal other_classes <NA>
5 Spike Spiegal b_month December
6 Spike Spiegal car FALSE
# A tibble: 6 × 23
fav_char major fav_c…¹ fav_num other…² b_month car pinea…³ mint_…⁴ nerd
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Spike Spieg… <NA> Orange 3 <NA> Decemb… FALSE FALSE 4 TRUE
2 Doreamon <NA> purple 9 <NA> August FALSE TRUE 1 FALSE
3 Sherlock Ho… <NA> Seafoa… 27 <NA> Septem… FALSE TRUE 5 TRUE
4 Tiana <NA> Purple 6 <NA> Septem… TRUE FALSE 1 FALSE
5 Crush <NA> blue 8 <NA> October TRUE TRUE 5 TRUE
6 Thor <NA> Yellow 8 <NA> May FALSE FALSE 5 FALSE
# … with 13 more variables: hours_sleep <chr>, pets <chr>, fav_art <chr>,
# coffee_days <chr>, tea_days <chr>, soda.pop_days <chr>, juice_days <chr>,
# none_days <chr>, lt_location <chr>, fict <chr>, recreation <chr>,
# hotdog <chr>, key <chr>, and abbreviated variable names ¹fav_color,
# ²other_classes, ³pineapple_pizza, ⁴mint_choc
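The all-character table above looks like the long data pivoted back to wide. The slide does not show that code, so the sketch below is an assumption about how it might be produced with `pivot_wider()`. It uses a made-up miniature of the survey with only three columns.

```{r}
library(tidyr)

# Hypothetical miniature of the survey data
survey <- data.frame(
  fav_char = c("Thor", "Tiana"),
  fav_num  = c(8L, 6L),
  car      = c(FALSE, TRUE)
)

# Long: one row per character/variable pair, all values as character
survey_long <- pivot_longer(survey, cols = -fav_char,
                            values_transform = as.character)

# Wide again: one column per variable, but every column is still character
survey_wide <- pivot_wider(survey_long,
                           names_from = name, values_from = value)
survey_wide
```

The round trip restores the shape but not the original column types, which matches the `<chr>` columns in the output above.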
Aggregation and Merging
SDS 192-03: Intro to Data Science