```{r}
# make example vector
type_vec = c(1, 0, 1, 1, 0)
# coerce to logical
as.logical(type_vec)
```
[1] TRUE FALSE TRUE TRUE FALSE
Fall 2022
November 07, 2022
Equip ourselves with some new tools for dealing with messy data.
We can use `class()` to test data types.
The main data types in R are:
We can coerce data into these types with the `as.XXXX()` family of functions.
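A minimal sketch of checking and coercing types. The values below are made up for illustration and are not from the original slides.

```{r}
# hypothetical example values
x_num = 3.14
x_chr = "3.14"
x_lgl = TRUE

# test data types
class(x_num)   # "numeric"
class(x_chr)   # "character"
class(x_lgl)   # "logical"

# coerce with the as.XXXX() family
as.numeric(x_chr)     # 3.14
as.character(x_num)   # "3.14"
as.integer(x_lgl)     # 1

# coercion that cannot succeed gives NA (with a warning)
bad = suppressWarnings(as.numeric("eighty"))
is.na(bad)            # TRUE
```

Note that a failed coercion does not stop your code. It quietly produces `NA`, so it is worth checking for new missing values afterwards.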
Factors are used for categorical data in R.
Different from characters because:
```{r}
# example character vector of percentages
cm_finish_reading = c("80%", "60%", "20%", "60%",
                      "0%", "80%", "20%", "80%",
                      "20%", "80%", "40%")
cm_finish_reading
```
[1] "80%" "60%" "20%" "60%" "0%" "80%" "20%" "80%" "20%" "80%" "40%"
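A sketch of turning this vector into a factor. The level order below is an assumption (lowest to highest percentage); the slides do not show which order was used.

```{r}
cm_finish_reading = c("80%", "60%", "20%", "60%",
                      "0%", "80%", "20%", "80%",
                      "20%", "80%", "40%")

# assumed level order: lowest to highest percentage
reading_levels = c("0%", "20%", "40%", "60%", "80%")
reading_fct = factor(cm_finish_reading, levels = reading_levels)

class(reading_fct)        # "factor"
levels(reading_fct)       # the fixed set of allowed categories
table(reading_fct)        # counts are reported in level order
as.integer(reading_fct)   # factors store integer codes, not text
```

Setting `levels` explicitly controls the order used in tables and plots. Without it, R sorts the levels alphabetically.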
[1] "11/1/2022 18:05:32" "11/1/2022 20:42:06" "11/2/2022 10:06:15"
[4] "11/2/2022 11:29:30" "11/2/2022 13:33:39" "11/2/2022 13:41:50"
[7] "11/3/2022 17:20:35" "11/4/2022 0:12:10" "11/4/2022 8:59:15"
[10] "11/4/2022 11:22:54" "11/4/2022 16:08:14"
[1] "character"
[1] "2022-11-01 18:05:32 UTC" "2022-11-01 20:42:06 UTC"
[3] "2022-11-02 10:06:15 UTC" "2022-11-02 11:29:30 UTC"
[5] "2022-11-02 13:33:39 UTC" "2022-11-02 13:41:50 UTC"
[7] "2022-11-03 17:20:35 UTC" "2022-11-04 00:12:10 UTC"
[9] "2022-11-04 08:59:15 UTC" "2022-11-04 11:22:54 UTC"
[11] "2022-11-04 16:08:14 UTC"
[1] "POSIXct" "POSIXt"
`lubridate` lets you do math with dates (and that’s really cool).
It will also make your dates play nice with plots.
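A minimal sketch of date math with `lubridate`, reusing two of the timestamps printed above. Using `mdy_hms()` to parse them is an assumption: it matches the printed UTC output, but the original call is not shown.

```{r}
library(lubridate)

# two of the timestamps from the output above, as character strings
stamps = c("11/1/2022 18:05:32", "11/4/2022 0:12:10")
class(stamps)                 # "character"

# parse month/day/year hour:minute:second (time zone defaults to UTC)
parsed = mdy_hms(stamps)
class(parsed)                 # "POSIXct" "POSIXt"

# math with dates
parsed[2] - parsed[1]                            # time between the two
difftime(parsed[2], parsed[1], units = "hours")  # same gap, in hours
parsed[1] + days(7)                              # one week later
wday(parsed, label = TRUE)                       # day of the week
```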
An outlier is a value far from most others in a set of data.
There are many ways to define an outlier. One commonly taught in stats classes uses the Interquartile Range (IQR).
$$\text{outlier} < Q_1 - 1.5 \times \text{IQR} \quad \text{OR} \quad \text{outlier} > Q_3 + 1.5 \times \text{IQR}$$
However, just because something is an outlier does not mean it is invalid.
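The IQR rule can be sketched in a few lines of base R. The vector `x` is made-up data for illustration.

```{r}
# hypothetical data with one value far from the rest
x = c(12, 14, 15, 15, 16, 17, 18, 19, 45)

# first and third quartiles, and the interquartile range
q1 = quantile(x, 0.25)[[1]]
q3 = quantile(x, 0.75)[[1]]
iqr = IQR(x)

# fences from the rule: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# flag values outside the fences
is_outlier = x < lower | x > upper
x[is_outlier]   # 45
```

This only flags values for a closer look. Whether to keep them is a separate decision.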
A “sanity check” is checking for violations of reasonable assumptions.
A violation does not necessarily mean the data is bad or should be removed, but it should be investigated.
If you think of a good check, don’t let it disappear!
if(any(school$ages >= 20)){stop("Ages are suspect! (>= 20)")}
if(any(traffic$speed > 100 | traffic$speed < 10)){stop("Some speeds are abnormal.")}
if(any(money$income <= 0)){stop("Someone made zero or negative income?")}
if(any(polling$share >= 90)){stop("An individual got a suspicious number of votes.")}
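One way to keep a check around is to wrap it in a small function. This is a sketch, not the course's own code: `check_ages()` and the `school` data frame below are hypothetical. It also adds `na.rm = TRUE`, because `any()` can return `NA` when the data has missing values, and `if(NA)` is itself an error.

```{r}
# hypothetical data with a missing value
school = data.frame(ages = c(14, 15, NA, 17))

# hypothetical helper: stop if any age is at or above max_age
check_ages = function(ages, max_age = 20) {
  if (any(ages >= max_age, na.rm = TRUE)) {
    stop("Ages are suspect! (>= ", max_age, ")")
  }
  invisible(TRUE)
}

check_ages(school$ages)   # passes silently

# the same idea as a one-liner with stopifnot()
stopifnot(all(school$ages < 20, na.rm = TRUE))
```

Putting checks like this near the top of a script means they run again every time the data is updated.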
Imputation is the process of filling in unknown values given known ones.
Some common imputation methods:
Imputation is a powerful tool, but potentially dangerous. Always remember it is an educated guess at best!
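A sketch of one simple approach, median imputation, on made-up data. It is shown only as an illustration and may not be one of the methods the slides list.

```{r}
# hypothetical incomes with two unknown values
income = c(42000, NA, 38000, 51000, NA, 47000)

# keep a record of which values were guesses
was_imputed = is.na(income)

# fill the unknowns with the median of the known values
income_filled = income
income_filled[was_imputed] = median(income, na.rm = TRUE)

income_filled   # the two NAs are now 44500
was_imputed     # FALSE TRUE FALSE FALSE TRUE FALSE
```

Keeping `was_imputed` alongside the filled values makes it possible to rerun an analysis without the guesses later.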
stringr
Package `stringr` is to text as `ggplot` is to plotting.
Many tools that give you more options than the base `grep` and `sub` functions.
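A short sketch comparing a few `stringr` functions with their base R counterparts. The `responses` vector is a made-up example.

```{r}
library(stringr)

# hypothetical percentage responses stored as text
responses = c("80%", "60%", "0%")

# stringr: consistent names, string always comes first
str_detect(responses, "8")                 # TRUE FALSE FALSE
str_remove(responses, "%")                 # "80" "60" "0"
str_length(responses)                      # 3 3 2

# base R equivalents of the first two
grepl("8", responses)                      # TRUE FALSE FALSE
sub("%", "", responses)                    # "80" "60" "0"

# clean the text, then coerce to numbers
as.numeric(str_remove(responses, "%")) / 100   # 0.8 0.6 0.0
```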
Regular expressions let you search for parts of a string using very complex rules.
Regex can do nearly anything; it is the nuclear option of working with text. It is basically its own coding language.
It is also one of the most painful things to code and bug-test in existence.
https://regex101.com/ is your best friend.
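A small sketch of regular expressions in use, pulling pieces out of timestamp strings like the ones shown earlier. The patterns here are illustrative choices, not the only way to write them.

```{r}
library(stringr)

stamps = c("11/1/2022 18:05:32", "11/4/2022 0:12:10")

# \\d{1,2} = one or two digits, \\d{2} = exactly two digits
# pull out the time of day
str_extract(stamps, "\\d{1,2}:\\d{2}:\\d{2}")   # "18:05:32" "0:12:10"

# parentheses capture groups: month, day, year
# ^ anchors the match to the start of the string
parts = str_match(stamps, "^(\\d{1,2})/(\\d{1,2})/(\\d{4})")
parts[, 2]   # months: "11" "11"
parts[, 3]   # days:   "1" "4"
parts[, 4]   # years:  "2022" "2022"
```

In R strings the backslash must be doubled, so the regex `\d` is written `"\\d"`.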
stringdist
You can use various algorithms to test the “distance” or difference between two strings.
We can use string distance to make approximate (or fuzzy) matches, or matches that are not exact.
In this example, I compare every element of one character vector with every element of another, and then match each element to the one with the smallest distance.
This can be immensely helpful in cases where you do not have a clean key. However, it is just an educated guess!
[,1] [,2] [,3] [,4]
[1,] 0.2301996 0.7418011 0.6026403 1.0000000
[2,] 1.0000000 0.2254033 0.8675468 0.8259223
[3,] 0.8333333 1.0000000 0.7705843 0.2462216
[4,] 0.8333333 0.7763932 0.5411685 0.8492443
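A sketch of how a distance matrix like this can drive a fuzzy match. The two vectors are made-up, and the Jaro-Winkler method (`"jw"`) is an assumption; the slides do not say which algorithm produced the numbers above.

```{r}
library(stringdist)

# hypothetical messy names and a clean key
messy = c("Massachusets", "New Yrok", "Conecticut")
clean = c("New York", "Connecticut", "Massachusetts", "Vermont")

# distance between every messy name (rows) and every clean name (columns)
d = stringdistmatrix(messy, clean, method = "jw")

# for each row, the column with the smallest distance
best = apply(d, 1, which.min)

# pair each messy name with its closest clean name
data.frame(messy = messy,
           match = clean[best],
           distance = d[cbind(seq_along(messy), best)])
```

Keeping the `distance` column is a design choice: a large smallest distance is a hint that the "best" match is still a poor one and deserves a manual look.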
So back in 2002, when MaxMind was first choosing the default point on its digital map for the center of the U.S., it decided to clean up the measurements and go with a simpler, nearby latitude and longitude: 38°N 97°W or 38.0000,-97.0000.
As a result, for the last 14 years, every time MaxMind’s database has been queried about the location of an IP address in the United States it can’t identify, it has spit out the default location of a spot two hours away from the geographic center of the country.
And that precise GPS location is exactly where the Arnold family lives.
SDS 192-03: Intro to Data Science