Fall 2022
September 26, 2022
Goal: learn about tidy data formats and some basic data-cleaning tools for getting there.
There should only ever be one piece of information per cell.
Example | Data | Tidy Data |
---|---|---|
Names | [“Jared Joseph”] | [“Jared”][“Joseph”] |
Addresses | [“10 Elm Street, Northampton, MA 01063”] | [“10 Elm Street”][“Northampton”][“MA”][“01063”] |
Date Ranges | [“Sep 10, 2021 to Sep 21, 2021”] | [“Sep 10, 2021”][“Sep 21, 2021”] |
Counts | [“Blue: 20; Red: 10; Yellow: 15”] | [20][10][15] |
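As a minimal sketch of the one-piece-of-information-per-cell rule, the snippet below splits the name and count examples from the table using base R. The data frame `people`, the second name in it, and all column names are made up for illustration, and the name split assumes every name is exactly two words separated by one space.

```r
# Hypothetical data frame with two pieces of information in one cell
people <- data.frame(name = c("Jared Joseph", "Ada Lovelace"))

# Split each full name on the space, then bind the pieces into a matrix
name_parts <- do.call(rbind, strsplit(people$name, " ", fixed = TRUE))

# Give each piece its own column
people$first_name <- name_parts[, 1]
people$last_name <- name_parts[, 2]

# The count string from the table can be tidied the same way:
# first split into "color: count" pairs, then split each pair
counts_raw <- "Blue: 20; Red: 10; Yellow: 15"
pairs <- strsplit(strsplit(counts_raw, "; ", fixed = TRUE)[[1]], ": ", fixed = TRUE)
counts <- data.frame(
  color = vapply(pairs, `[`, character(1), 1),
  count = as.integer(vapply(pairs, `[`, character(1), 2))
)
```

Storing the counts as integers rather than text means they can be summed or plotted directly.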
Raw data files should never be altered. Make all your modifications using code, then save them into a new file if needed.
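One way this workflow might look in practice is sketched below. The file names, the `city` column, and its values are hypothetical, and the raw file is created in a temporary directory only so the example runs on its own.

```r
# Stand-in for a raw data file; in a real project this file already exists
raw_path <- file.path(tempdir(), "survey_raw.csv")
write.csv(
  data.frame(id = 1:3, city = c(" Northampton", "Amherst ", "Hadley")),
  raw_path,
  row.names = FALSE
)

# Read the raw file, but never write back to raw_path
raw <- read.csv(raw_path, stringsAsFactors = FALSE)

# Make every modification in code, on a copy of the data
clean <- raw
clean$city <- trimws(clean$city)  # remove stray leading/trailing spaces

# Save the cleaned version under a new name
clean_path <- file.path(tempdir(), "survey_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Because the raw file is untouched, rerunning the script from the top always reproduces the same cleaned file.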
What is wrong with this vector?
R is treating this vector of numbers like a character vector.
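A small sketch of the problem and one possible fix, using a made-up vector (`nums_chr` is not the vector from the slide):

```r
# Hypothetical vector: the quotes make every element a character string
nums_chr <- c("1", "2", "3", "4")
class(nums_chr)  # "character"

# Arithmetic on character data fails, so convert to numeric first
nums <- as.numeric(nums_chr)
class(nums)  # "numeric"

# Numeric summaries now work
mean(nums)
```

Any element that cannot be read as a number becomes `NA` with a warning, so it is worth checking for new `NA` values after converting.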
It’s turtles sub-setting all the way down.
```{r, eval=FALSE}
# Start with an empty character column, then fill it in one day code at a time
sf_subset$recoded_day <- NA_character_
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 1] <- "Sunday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 2] <- "Monday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 3] <- "Tuesday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 4] <- "Wednesday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 5] <- "Thursday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 6] <- "Friday"
sf_subset$recoded_day[sf_subset$DAYSTOP2 == 7] <- "Saturday"
```
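The same recoding can also be sketched with a lookup vector, which avoids repeating the sub-setting line seven times. This is an alternative, not the approach shown above; the small `sf_subset` built here is a hypothetical stand-in, and the sketch assumes `DAYSTOP2` only contains whole numbers from 1 to 7 or `NA`.

```r
# Hypothetical stand-in for sf_subset with day codes 1 (Sunday) to 7 (Saturday)
sf_subset <- data.frame(DAYSTOP2 = c(1, 7, 3, NA, 5))

# Position in this vector matches the day code
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Indexing the lookup vector by the code returns the matching name;
# missing codes stay NA
sf_subset$recoded_day <- day_names[sf_subset$DAYSTOP2]
```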
Dates are a bit more tricky.
[1] "2006-12-28" "2006-08-09" "2006-12-22" "2006-11-22" "2006-05-18"
[6] "2006-03-13" "2006-08-09" "2006-12-06" "2006-12-09" "2006-07-25"
[1] "character"
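The output above shows dates stored as plain text. A minimal sketch of converting them with base R's `as.Date()` follows; the variable names are assumptions, and only the first three date strings from the output are used.

```r
# A few of the date strings shown above, stored as character
stop_dates_chr <- c("2006-12-28", "2006-08-09", "2006-12-22")
class(stop_dates_chr)  # "character"

# as.Date() parses "YYYY-MM-DD" strings into real dates
stop_dates <- as.Date(stop_dates_chr, format = "%Y-%m-%d")
class(stop_dates)  # "Date"

# Date arithmetic and chronological sorting now work
days_between <- as.numeric(stop_dates[1] - stop_dates[2])  # 141 days
sorted_dates <- sort(stop_dates)
```

Dates written in another layout need a matching `format` string, for example `"%m/%d/%Y"` for `"12/28/2006"`.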
# A tibble: 6 × 23
fav_char major fav_c…¹ fav_num other…² b_month car pinea…³ mint_…⁴ nerd
<chr> <lgl> <chr> <int> <lgl> <chr> <lgl> <lgl> <int> <lgl>
1 Spike Spieg… NA Orange 3 NA Decemb… FALSE FALSE 4 TRUE
2 Doreamon NA purple 9 NA August FALSE TRUE 1 FALSE
3 Sherlock Ho… NA Seafoa… 27 NA Septem… FALSE TRUE 5 TRUE
4 Tiana NA Purple 6 NA Septem… TRUE FALSE 1 FALSE
5 Crush NA blue 8 NA October TRUE TRUE 5 TRUE
6 Thor NA Yellow 8 NA May FALSE FALSE 5 FALSE
# … with 13 more variables: hours_sleep <int>, pets <chr>, fav_art <chr>,
# coffee_days <chr>, tea_days <chr>, soda.pop_days <chr>, juice_days <chr>,
# none_days <chr>, lt_location <chr>, fict <chr>, recreation <chr>,
# hotdog <lgl>, key <chr>, and abbreviated variable names ¹fav_color,
# ²other_classes, ³pineapple_pizza, ⁴mint_choc
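```{r}
library(tidyr)  # provides pivot_longer()
# Stack all columns except fav_char into name/value pairs, as character
survey_long <- survey |>
  pivot_longer(cols = -fav_char, values_transform = as.character)
head(survey_long)
```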
# A tibble: 6 × 3
fav_char name value
<chr> <chr> <chr>
1 Spike Spiegal major <NA>
2 Spike Spiegal fav_color Orange
3 Spike Spiegal fav_num 3
4 Spike Spiegal other_classes <NA>
5 Spike Spiegal b_month December
6 Spike Spiegal car FALSE
# A tibble: 6 × 23
fav_char major fav_c…¹ fav_num other…² b_month car pinea…³ mint_…⁴ nerd
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Spike Spieg… <NA> Orange 3 <NA> Decemb… FALSE FALSE 4 TRUE
2 Doreamon <NA> purple 9 <NA> August FALSE TRUE 1 FALSE
3 Sherlock Ho… <NA> Seafoa… 27 <NA> Septem… FALSE TRUE 5 TRUE
4 Tiana <NA> Purple 6 <NA> Septem… TRUE FALSE 1 FALSE
5 Crush <NA> blue 8 <NA> October TRUE TRUE 5 TRUE
6 Thor <NA> Yellow 8 <NA> May FALSE FALSE 5 FALSE
# … with 13 more variables: hours_sleep <chr>, pets <chr>, fav_art <chr>,
# coffee_days <chr>, tea_days <chr>, soda.pop_days <chr>, juice_days <chr>,
# none_days <chr>, lt_location <chr>, fict <chr>, recreation <chr>,
# hotdog <chr>, key <chr>, and abbreviated variable names ¹fav_color,
# ²other_classes, ³pineapple_pizza, ⁴mint_choc
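The round trip from wide to long and back to wide, as in the outputs above, can be sketched on a toy table. The data frame `toy` is a hypothetical miniature of the survey, and using `pivot_wider()` for the return trip is an assumption about how the second wide table was produced.

```r
library(tidyr)

# Hypothetical miniature of the survey table
toy <- data.frame(
  fav_char = c("Spike Spiegal", "Tiana"),
  fav_num = c(3L, 6L),
  car = c(FALSE, TRUE)
)

# Wide to long: one row per respondent-variable pair, values as character
toy_long <- toy |>
  pivot_longer(cols = -fav_char, values_transform = as.character)

# Long to wide: one column per variable again
toy_wide <- toy_long |>
  pivot_wider(names_from = name, values_from = value)
```

The original column types do not come back on their own: after the round trip every column is character, which matches the `<chr>` types in the second wide table above.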
Aggregation and Merging
SDS 192-03: Intro to Data Science