Day 7 - Exploratory Data Analysis (EDA)

Fall 2022

Dr. Jared Joseph

September 21, 2022

Overview

Timeline

  • What is EDA
  • Numeric Summaries
  • Visualizing Data
  • Code Along

Goal

Learn the fundamental tools for understanding datasets in R.

What is EDA

Common EDA Questions

Data Source

  • Where is this data from?
  • Who made it?
  • For what purpose?
  • What is the population?
  • What is the sample like?
  • What do the values mean?
  • What is the margin of error?

Data Structure

  • What data type is each variable?
  • Does anything need to be recoded?
  • What is the missingness like?
  • What cleaning needs to be done?
  • Are there outliers?
  • What do those outliers mean?
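
Several of the data structure questions above can be checked with a few lines of base R. The sketch below uses a small data frame called `toy` that is invented purely for illustration (it is not the class survey), and the "more than 24 hours" recoding rule is likewise an assumption.

```{r}
# Hypothetical toy data, invented for illustration (not the class survey)
toy <- data.frame(
  rating = c(1, 5, 4, NA, 5, 3),
  hours  = c(7, 6, 8, 7, 30, 6),   # 30 is a deliberately implausible value
  pet    = c("cat", "dog", NA, "cat", "none", "dog"),
  stringsAsFactors = FALSE
)

# What data type is each variable?
var_types <- sapply(toy, class)
var_types

# What is the missingness like? Count NA values in each column
n_missing <- colSums(is.na(toy))
n_missing

# Are there outliers? Flag values more than 1.5 * IQR beyond the hinges
flagged <- boxplot.stats(toy$hours)$out
flagged

# Does anything need to be recoded? Assumed rule: more than 24 hours is an error
toy$hours_clean <- ifelse(toy$hours > 24, NA, toy$hours)
```

Whether a flagged value is an entry error or a real but unusual case is a judgment call that the code cannot make for you.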

Numeric Summaries

Single Number Summaries

Mean
Arithmetic average of a vector; swayed by outliers
Standard Deviation
How far the values in a vector typically fall from the mean
Median
Middle value of a vector once it is sorted, so the values must be orderable
Mode
Most common value in a vector
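
As a rough sketch, each of these summaries can be computed in base R. The `mint_choc` vector here is rebuilt from the frequency table shown later in these slides rather than read from the survey file, so the order of the values is an assumption. The extra value of 50 is made up to show how an outlier moves the mean.

```{r}
# Ratings rebuilt from the frequency table in these slides:
# 1 appears four times, 3 once, 4 three times, 5 seven times (order assumed)
mint_choc <- rep(c(1, 3, 4, 5), times = c(4, 1, 3, 7))

mean(mint_choc)    # arithmetic average
sd(mint_choc)      # sample standard deviation
median(mint_choc)  # middle value after sorting

# Base R's mode() reports the storage type, not the most common value,
# so tabulate the values and keep the one with the largest count
counts <- table(mint_choc)
most_common <- as.numeric(names(counts)[which.max(counts)])
most_common

# A single made-up extreme value pulls the mean much more than the median
with_outlier <- c(mint_choc, 50)
mean(with_outlier)
median(with_outlier)
```

If two values tie for the largest count, `which.max()` keeps only the first, so check the full table when ties are possible.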

Multi-Value Summaries

Summary

```{r}
summary(survey$mint_choc)
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     2.0     4.0     3.6     5.0     5.0 

Table

```{r}
table("mint_choc" = survey$mint_choc)
```
mint_choc
1 3 4 5 
4 1 3 7 
```{r}
table("mint_choc" = survey$mint_choc,
      "hotdog" = survey$hotdog)
```
         hotdog
mint_choc FALSE TRUE
        1     1    3
        3     1    0
        4     2    1
        5     4    3
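
Raw counts are often easier to read with totals and proportions added. The sketch below retypes the two-way table above by hand so that it runs without the survey file. The object names `xtab` and `row_props` are placeholders.

```{r}
# Counts retyped from the two-way table above (rows: mint_choc, columns: hotdog)
xtab <- as.table(matrix(
  c(1, 3,
    1, 0,
    2, 1,
    4, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(mint_choc = c("1", "3", "4", "5"),
                  hotdog    = c("FALSE", "TRUE"))
))

# Add row and column totals
addmargins(xtab)

# Share of each hotdog answer within each rating (each row sums to 1)
row_props <- prop.table(xtab, margin = 1)
round(row_props, 2)
```

With the real data, `table()` also accepts `useNA = "ifany"` to show missing values as their own category.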

Skimr

```{r}
skimr::skim(survey)
```
Data summary
Name                     survey
Number of rows           15
Number of columns        23
_______________________
Column type frequency:
  character              14
  logical                6
  numeric                3
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min   max  empty  n_unique  whitespace
fav_char               0           1.00    4    39      0        15           0
fav_color              1           0.93    3    50      0        10           0
b_month                0           1.00    3     9      0         9           0
pets                   0           1.00    3    17      0         8           0
fav_art                3           0.80   17  1158      0        12           0
coffee_days            7           0.53    6    62      0         7           0
tea_days              10           0.33    6    17      0         5           0
soda.pop_days          9           0.40    7    62      0         5           0
juice_days             4           0.73    6    62      0         8           0
none_days             10           0.33   17    45      0         5           0
lt_location            3           0.80    7    47      0        12           0
fict                   0           1.00    7    11      0         2           0
recreation             0           1.00   51   232      0        15           0
key                    0           1.00    6    41      0        15           0

Variable type: logical

skim_variable    n_missing  complete_rate  mean  count
major                   15              0   NaN  :
other_classes           15              0   NaN  :
car                      0              1  0.33  FAL: 10, TRU: 5
pineapple_pizza          0              1  0.67  TRU: 10, FAL: 5
nerd                     0              1  0.73  TRU: 11, FAL: 4
hotdog                   0              1  0.47  FAL: 8, TRU: 7

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd  p0  p25  p50   p75  p100  hist
fav_num                0              1  15.60  14.92   3    7    9  20.5    56  ▇▁▂▁▁
mint_choc              0              1   3.60   1.72   1    2    4   5.0     5  ▅▁▁▃▇
hours_sleep            0              1   6.87   0.74   6    6    7   7.0     8  ▆▁▇▁▃

Visualizing Data

Why Visualize?

Bar Chart

```{r}
#| fig-cap: Barplot of Mint x Chocolate Ratings

barplot(table(survey$mint_choc))
```

Barplot of Mint x Chocolate Ratings
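
One thing to watch for: `table()` only lists values that actually occur, so a rating nobody chose gets no bar at all. The sketch below assumes the rating scale runs from 1 to 5 (the slides only show the observed minimum and maximum) and rebuilds the ratings from the frequency table shown earlier. The axis labels are placeholders.

```{r}
# Ratings rebuilt from the frequency table in these slides (order assumed)
mint_choc <- rep(c(1, 3, 4, 5), times = c(4, 1, 3, 7))

# Declare every possible rating so that unused ones are counted as zero
# (assumes the scale is 1 to 5)
rating_counts <- table(factor(mint_choc, levels = 1:5))
rating_counts

barplot(rating_counts,
        xlab = "Rating",
        ylab = "Number of responses",
        main = "Mint x Chocolate Ratings")
```

Keeping the empty bar for a rating of 2 makes the gap in the responses visible rather than hiding it.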

Histogram

```{r}
#| fig-cap: Histogram of Typical Hours of Sleep

hist(survey$hours_sleep)
```

Histogram of Typical Hours of Sleep
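
The shape of a histogram depends on how the values are binned. As a small sketch, the code below uses the built-in `mtcars` data (the survey file is not included here) to compare two bin settings. Note that a single number passed to `breaks` is treated by `hist()` as a suggestion, not an exact bin count.

```{r}
# Draw the same variable with two different bin suggestions, side by side
par(mfrow = c(1, 2))
hist(mtcars$mpg, breaks = 5,  main = "About 5 bins",  xlab = "Miles per gallon")
hist(mtcars$mpg, breaks = 15, main = "About 15 bins", xlab = "Miles per gallon")
par(mfrow = c(1, 1))

# hist() can also return the bins without drawing anything
bins <- hist(mtcars$mpg, breaks = 15, plot = FALSE)
bins$breaks  # bin edges that were actually used
bins$counts  # number of cars in each bin
```

Trying a few values of `breaks` is a quick way to check whether a pattern is real or just a result of where the bin edges fall.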

Scatterplot

```{r}
#| fig-cap: Scatterplot of Weight x Miles per Gallon

plot(mtcars$wt, mtcars$mpg)
```

Scatterplot of Weight x Miles per Gallon
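
A scatterplot is easier to read with axis labels, and a fitted line can help judge the trend. The sketch below adds both to the plot above. A straight-line fit is an assumption made for illustration, not a claim that the relationship is linear.

```{r}
# Same scatterplot with readable axis labels
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Weight x Miles per Gallon")

# Overlay a least-squares line as a rough guide to the trend
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, lty = 2)

# Correlation as a single-number companion to the picture
r <- cor(mtcars$wt, mtcars$mpg)
r
```

The correlation is strongly negative here: heavier cars tend to get fewer miles per gallon.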

Line Chart
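
A line chart connects observations in order, which suits values measured over time. As a minimal sketch, the code below uses R's built-in `Nile` series (annual river flow) because the survey has no time variable. The choice of dataset is an assumption made for illustration.

```{r}
# Nile is a built-in yearly time series; pull out plain vectors for plotting
years <- as.numeric(time(Nile))
flow  <- as.numeric(Nile)

# type = "l" draws connected lines instead of points
plot(years, flow, type = "l",
     xlab = "Year",
     ylab = "Annual flow",
     main = "Line Chart of Nile River Flow")
```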

Boxplot

```{r}
summary(mtcars$mpg)
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 
```{r}
#| fig-cap: Boxplot of Miles per Gallon

boxplot(mtcars$mpg)
```

Boxplot of Miles per Gallon
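
The numbers behind a boxplot can be pulled out with `boxplot.stats()`, which helps connect the picture to the `summary()` output above. The grouped plot at the end is an extra illustration, and splitting by `cyl` is just one possible choice.

```{r}
# Values used to draw the box and whiskers:
# lower whisker, lower hinge, median, upper hinge, upper whisker
box_numbers <- boxplot.stats(mtcars$mpg)
box_numbers$stats

# Points drawn individually as outliers (more than 1.5 * IQR beyond the hinges)
box_numbers$out

# One box per group makes comparisons easy
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders",
        ylab = "Miles per gallon")
```

The hinges are computed slightly differently from the quartiles in `summary()`, so the box edges may not match the 1st Qu. and 3rd Qu. values exactly.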

Code Along

For Next Time

Topic

Lab 2

To-Do

  • EDA Worksheet