Day 26 - Data Cleaning

Fall 2022

Dr. Jared Joseph

November 07, 2022

Overview

Timeline

  • Recap on Re-coding
  • Some More Types
  • Making Decisions
  • Cleaning Text Data
  • “A Living Hell”

Goal

Equip ourselves with some new tools for dealing with messy data.

Notices

  • Survey open for Recap day (please fill out today)
  • All grading caught up
  • OpenRefine cut
  • DSC WAV Deadline (You need 2 faculty references!)
  • VOTE

Recap on Re-coding

We can use class() to test data types.
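For example (made-up values; the outputs are what R returns):

```{r}
# check the type of a few values
class(TRUE)
class(42)
class("hello")
```
[1] "logical"
[1] "numeric"
[1] "character"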


The main data types in R are:

  • logical
  • numeric
  • character
  • factor

We can coerce data into these types with the as.XXXX() family of functions.

```{r}
# make example vector
type_vec = c(1, 0, 1, 1, 0)

# coerce to logical
as.logical(type_vec)
```
[1]  TRUE FALSE  TRUE  TRUE FALSE
```{r}
# "coerce" to numeric
as.numeric(type_vec)
```
[1] 1 0 1 1 0
```{r}
# coerce to characters
as.character(type_vec)
```
[1] "1" "0" "1" "1" "0"
```{r}
# coerce to factor
as.factor(type_vec)
```
[1] 1 0 1 1 0
Levels: 0 1

Some More Types

Factors

Factors are used for categorical data in R.


Different from characters because:

  • You can define all possible options, even if they don’t appear in this set of data
  • You can order the options to set a clear hierarchy
  • Under the hood, factors are labeled numbers
```{r}
cm_finish_reading = c("80%", "60%", "20%", "60%",
                      "0%", "80%", "20%", "80%",
                      "20%", "80%", "40%")
cm_finish_reading
```
 [1] "80%" "60%" "20%" "60%" "0%"  "80%" "20%" "80%" "20%" "80%" "40%"
```{r}
factor_reading = factor(cm_finish_reading,
                        levels = c("0%", "20%",
                                   "40%", "60%",
                                   "80%", "100%"),
                        ordered = TRUE)
factor_reading
```
 [1] 80% 60% 20% 60% 0%  80% 20% 80% 20% 80% 40%
Levels: 0% < 20% < 40% < 60% < 80% < 100%
```{r}
as.numeric(factor_reading)
```
 [1] 5 4 2 4 1 5 2 5 2 5 3

Date-time

```{r}
library(lubridate)

# timestamps stored as character strings
mts_times = c("11/1/2022 18:05:32", "11/1/2022 20:42:06", "11/2/2022 10:06:15",
              "11/2/2022 11:29:30", "11/2/2022 13:33:39", "11/2/2022 13:41:50",
              "11/3/2022 17:20:35", "11/4/2022 0:12:10", "11/4/2022 8:59:15",
              "11/4/2022 11:22:54", "11/4/2022 16:08:14")

mts_times
class(mts_times)
```
 [1] "11/1/2022 18:05:32" "11/1/2022 20:42:06" "11/2/2022 10:06:15"
 [4] "11/2/2022 11:29:30" "11/2/2022 13:33:39" "11/2/2022 13:41:50"
 [7] "11/3/2022 17:20:35" "11/4/2022 0:12:10"  "11/4/2022 8:59:15" 
[10] "11/4/2022 11:22:54" "11/4/2022 16:08:14"
[1] "character"


```{r}
mdy_hms(mts_times)
class(mdy_hms(mts_times))
```
 [1] "2022-11-01 18:05:32 UTC" "2022-11-01 20:42:06 UTC"
 [3] "2022-11-02 10:06:15 UTC" "2022-11-02 11:29:30 UTC"
 [5] "2022-11-02 13:33:39 UTC" "2022-11-02 13:41:50 UTC"
 [7] "2022-11-03 17:20:35 UTC" "2022-11-04 00:12:10 UTC"
 [9] "2022-11-04 08:59:15 UTC" "2022-11-04 11:22:54 UTC"
[11] "2022-11-04 16:08:14 UTC"
[1] "POSIXct" "POSIXt" 

Date-time Math

lubridate lets you do math with dates (and that’s really cool).


It will also make your dates play nice with plots.

```{r}
today()
```
[1] "2022-11-07"
```{r}
today() + 27
```
[1] "2022-12-04"
```{r}
# make an interval
date_interval = interval(mdy("June 1st, 2011"),
                         mdy("July 6th, 2011"))
date_interval
```
[1] 2011-06-01 UTC--2011-07-06 UTC
```{r}
mdy("June 5th, 2011") %within% date_interval
```
[1] TRUE

Making Decisions

Outliers

An outlier is a value far from most others in a set of data.

There are many ways to define an outlier. One commonly taught in stats classes uses the Interquartile Range (IQR).

A value x is an outlier if x < Q1 - 1.5 × IQR OR x > Q3 + 1.5 × IQR.

However, just because something is an outlier does not mean it is invalid.
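As a quick sketch of that rule in R (using a made-up vector x):

```{r}
# hypothetical data with one extreme value
x = c(2, 3, 3, 4, 5, 5, 6, 40)

# quartiles and interquartile range
q1 = quantile(x, 0.25)
q3 = quantile(x, 0.75)
iqr = q3 - q1

# values outside the 1.5 * IQR fences
x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]
```
[1] 40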

```{r}
# Pageviews is a numeric vector of page view counts,
# assumed loaded earlier from the SCMA web data
page_box = boxplot(Pageviews,
                   main = "SCMA Page Views",
                   ylab = "Views",
                   sub = "Source: SCMA 2021-2022 AY")
```

```{r}
page_box$out[1:10]
```
 [1] 43966 16413  6883  6763  4379  4026  3737  3665  3494  3454

Sanity Checks

A “sanity check” looks for violations of reasonable assumptions:

  • In a dataset from an elementary school: Ages > 20?
  • In a highway traffic dataset: Speed < 10 or Speed > 100?
  • In an income dataset: Income < 0?
  • In a votes dataset: Polling > 90%?
  • etc.

A violation does not necessarily mean the data is bad or should be removed, but it should be investigated.

Making Rules

If you think of a good check, don’t let it disappear!

In a dataset from an elementary school: Ages > 20?
if(any(school$ages > 20)){stop("Ages are suspect! (> 20)")}
In a highway traffic dataset: Speed < 10 or Speed > 100?
if(any(traffic$speed > 100 | traffic$speed < 10)){stop("Some speeds are abnormal.")}
In an income dataset: Income < 0?
if(any(money$income < 0)){stop("Someone made negative income?")}
In a votes dataset: Polling > 90%?
if(any(polling$share >= 90)){stop("An individual got a suspicious number of votes.")}
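One lightweight way to keep rules like these in a script is base R's stopifnot(), sketched here with a hypothetical school data frame. It stays silent when the check passes and stops with an error when it fails.

```{r}
# hypothetical data for illustration
school = data.frame(ages = c(6, 7, 9, 11))

# stopifnot() throws an error if any condition is FALSE
stopifnot(all(school$ages < 20))
```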

Imputation

Imputation is the process of filling in unknown values based on known ones.

Some common imputation methods:

  • Simple
    • Fill missing values with the mean of all other values
    • Fill missing values with the mode of all other values
  • Complex
    • Perform a regression to predict and fill missing values
    • Use K Nearest Neighbors to find similar cases and fill from them
    • Use some other machine learning algorithm to fill

Imputation is a powerful tool, but potentially dangerous. Always remember it is an educated guess at best!
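As a minimal sketch of the simplest approach, mean imputation with a made-up vector:

```{r}
# hypothetical vector with missing values
ages = c(23, 31, NA, 27, NA, 35)

# replace each NA with the mean of the known values
ages[is.na(ages)] = mean(ages, na.rm = TRUE)
ages
```
[1] 23 31 29 27 29 35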

Cleaning Text Data

The stringr Package

stringr is to text as ggplot is to plotting.


It offers many tools that give you more options than the base grep() and sub() functions.
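A taste of some common stringr verbs (illustrative examples, not from the lecture):

```{r}
library(stringr)

# detect a pattern, replace it, and trim whitespace
str_detect(c("cat", "dog"), "c")
str_replace("80%", "%", "")
str_trim("  messy  ")
```
[1]  TRUE FALSE
[1] "80"
[1] "messy"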

“regex” AKA “Regular Expressions”

Regular expressions let you search for parts of a string using very complex rules.

regex can do nearly anything; it is the nuclear option of working with text. It is basically its own coding language.
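For example, a single pattern can encode a rule like “one to three digits followed by a percent sign” (a made-up illustration):

```{r}
library(stringr)

# match strings that are one to three digits followed by "%"
str_detect(c("80%", "eighty", "100%"), "^\\d{1,3}%$")
```
[1]  TRUE FALSE  TRUE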


It is also one of the most painful things to code and bug-test in existence.


https://regex101.com/ is your best friend.

String Distance with stringdist

You can use various algorithms to test the “distance” or difference between two strings.

```{r}
library(stringdist)

# measure the distance between Northampton and
# Easthampton using optimal string alignment ("osa")
stringdist("Northampton",
           "Easthampton",
           method = "osa")
```
[1] 3
```{r}
# measure the distance between Northampton and
# Easthampton using the soundex phonetic algorithm
stringdist("Northampton",
           "Easthampton",
           method = "soundex")
```
[1] 1

Approximate (Fuzzy) Matching

We can use string distance to make approximate (or fuzzy) matches, or matches that are not exact.


In this example, I compare every element of one character vector to every element of another, and then match based on the pairs with the smallest distance.


This can be immensely helpful in cases where you do not have a clean key. However, it is just an educated guess!

```{r}
vec1 = c("cat", "dog", "fish", "rock")
vec2 = c("the cat", "a dog",
         "the best rock", "some fish")
dist_matrix = stringdistmatrix(vec1, vec2,
                               method = "cosine")
dist_matrix
```
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2301996 0.7418011 0.6026403 1.0000000
[2,] 1.0000000 0.2254033 0.8675468 0.8259223
[3,] 0.8333333 1.0000000 0.7705843 0.2462216
[4,] 0.8333333 0.7763932 0.5411685 0.8492443
```{r}
best_matches = amatch(vec1, vec2,
                      method = "cosine",
                      maxDist = 1)
best_matches
```
[1] 1 2 4 3
```{r}
data.frame(vec1, vec2[best_matches])
```
  vec1 vec2.best_matches.
1  cat            the cat
2  dog              a dog
3 fish          some fish
4 rock      the best rock

“A Living Hell”

Lazy Data Cleaning

So back in 2002, when MaxMind was first choosing the default point on its digital map for the center of the U.S., it decided to clean up the measurements and go with a simpler, nearby latitude and longitude: 38°N 97°W or 38.0000,-97.0000.

As a result, for the last 14 years, every time MaxMind’s database has been queried about the location of an IP address in the United States it can’t identify, it has spit out the default location of a spot two hours away from the geographic center of the country.

And that precise GPS location is exactly where the Arnold family lives.

For Next Time

Topic

  • Class Survey
  • Recap Day

To-Do

  • Finish Worksheet
  • Fill out recap survey