Day 21 - Lists and Apply

Fall 2022

Dr. Jared Joseph

October 26, 2022

Overview

Timeline

  • Iteration Review
  • Lists
  • The Apply Family
  • Parallelization

Goal

To learn the differences and use cases for lists and the apply family of functions.

Iteration Review

In R, iterating on something is working through a vector one element at a time.

Vector = c(2, 4, 6, 8, 10)

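  • Iteration 1: c(2, 4, 6, 8, 10) (current element: 2)
  • Iteration 2: c(2, 4, 6, 8, 10) (current element: 4)
  • Iteration 3: c(2, 4, 6, 8, 10) (current element: 6)
  • Iteration 4: c(2, 4, 6, 8, 10) (current element: 8)
  • Iteration 5: c(2, 4, 6, 8, 10) (current element: 10)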

for(X in Y) { Do Z }


Useful when:

  • We want to repeat the same operation several times
  • There is dependence on the outcome of previous operations
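
As a minimal sketch of the second case, the loop below builds a running total, so each iteration needs the result of the one before it. The names `values` and `running_total` are illustrative and not part of the original slides.

```{r}
# the vector from the example above
values = c(2, 4, 6, 8, 10)

# storage for the result of each iteration
running_total = numeric(length(values))

for(i in seq_along(values)) {
  if(i == 1) {
    # the first iteration has no previous result to build on
    running_total[i] = values[i]
  } else {
    # later iterations depend on the outcome of the previous one
    running_total[i] = running_total[i - 1] + values[i]
  }
}

running_total
```

The result is 2, 6, 12, 20, 30: every value after the first is built from the previous result.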

Lists

What is a List

Lists are a bit like super-vectors (similar to JSON).


They can contain anything in their elements. You could have:

  • A list with one number in each element
  • A list with a vector of temperatures per day
  • A list of dataframes
  • A list of lists
```{r}
list(1, 2, 3, 4)
```
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4
```{r}
list("last_week" = c(50, 42, 50, 54, 57, 60, 59),
     "this_week" = c(58, 58, 65, 67, 60))
```
$last_week
[1] 50 42 50 54 57 60 59

$this_week
[1] 58 58 65 67 60
```{r}
list(data.frame("id" = 1:2, "let" = c("a", "b")),
     data.frame("id" = 3:4, "let" = c("c", "d")))
```
[[1]]
  id let
1  1   a
2  2   b

[[2]]
  id let
1  3   c
2  4   d
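
The last case in the bullets, a list of lists, could look something like the sketch below. The name `nested` and the temperature values are made up for illustration.

```{r}
# a list whose elements are themselves lists (values are made up)
nested = list("monday"  = list("high" = 60, "low" = 42),
              "tuesday" = list("high" = 65, "low" = 50))

# str() prints a compact summary of the nested structure
str(nested)
```

Nesting like this is what makes lists feel JSON-like: each level can hold named elements of any type.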

Accessing Lists

Getting the content of lists requires special syntax!


Each list element is accessed using double square brackets [[ ]], either by position or by name.

```{r}
test_list = list("num_vec" = c(1, 2, 3, 4, 5),
                 "let_vec" = c("a", "b", "c", "c"),
                 "df" = head(mtcars))

test_list
```
$num_vec
[1] 1 2 3 4 5

$let_vec
[1] "a" "b" "c" "c"

$df
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```{r}
test_list[[1]]
```
[1] 1 2 3 4 5
```{r}
test_list[["num_vec"]]
```
[1] 1 2 3 4 5
```{r}
test_list[["let_vec"]][1]
```
[1] "a"
```{r}
test_list[["df"]]$mpg
```
[1] 21.0 21.0 22.8 21.4 18.7 18.1
```{r}
cars_df = test_list[[3]]

cars_df$cyl
```
[1] 6 6 4 6 8 6
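
It may help to see why double brackets matter. The sketch below uses a small list named `demo_list` (a hypothetical name, not from the slides) to compare single brackets, double brackets, and `$`.

```{r}
# a small list for illustration
demo_list = list("num_vec" = c(1, 2, 3),
                 "let_vec" = c("a", "b"))

# single brackets keep the list wrapper: the result is a list of length 1
class(demo_list[1])

# double brackets pull out the element itself: the result is a numeric vector
class(demo_list[[1]])

# for named elements, $ works much like [[ ]] with the name
identical(demo_list$num_vec, demo_list[["num_vec"]])
```

In short, `[ ]` gives you a smaller list, while `[[ ]]` and `$` give you what is stored inside it.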

In Other Words

IF

example_vec_1 = c(1, 2, 3)

example_vec_2 = c("a", "b", "c")

AND

example_list = list(example_vec_1, example_vec_2)

THEN

example_list[[1]] == example_vec_1 == c(1, 2, 3)

example_list[[2]][3] == example_vec_2[3] == "c"
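
The chained == above reads as shorthand rather than code to run. One way to check the same statements in R, sketched here with `identical()`, is:

```{r}
example_vec_1 = c(1, 2, 3)
example_vec_2 = c("a", "b", "c")
example_list = list(example_vec_1, example_vec_2)

# identical() returns a single TRUE or FALSE for whole objects
identical(example_list[[1]], example_vec_1)
identical(example_list[[2]][3], example_vec_2[3])
identical(example_list[[2]][3], "c")
```

All three comparisons should come back TRUE.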

Apply Family

Logic of Apply

The apply family of functions takes every element of a sequence and does the same thing to each one.

Anatomy of an Apply Function



apply(X, FUN = function)

Apply does the same thing to each element (roughly) all at once: every call to FUN is independent of the others.

Apply FUN to element 1 in X.

Apply FUN to element 2 in X.

Apply FUN to element 3 in X.

Apply FUN to element 4 in X.

Apply FUN to element 5 in X.

Apply FUN to element 6 in X.

Apply FUN to element 7 in X.

Loops vs Apply

Loops

Loops iterate through every element of a sequence one element at a time.

This allows dependence.

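  • Iteration 1: c(2, 4, 6, 8, 10) (current element: 2)
  • Iteration 2: c(2, 4, 6, 8, 10) (current element: 4)
  • Iteration 3: c(2, 4, 6, 8, 10) (current element: 6)
  • Iteration 4: c(2, 4, 6, 8, 10) (current element: 8)
  • Iteration 5: c(2, 4, 6, 8, 10) (current element: 10)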

Apply

Apply functions apply the given function to every element (roughly) at the same time.

This does not allow dependence.

c( 2, 4, 6, 8, 10 )
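
When no element depends on another, a loop and an apply function can do the same job. The sketch below squares each element both ways; the names `loop_result` and `apply_result` are illustrative.

```{r}
values = c(2, 4, 6, 8, 10)

# loop version: fill in the result one element at a time
loop_result = numeric(length(values))
for(i in seq_along(values)) {
  loop_result[i] = values[i]^2
}

# apply version: hand the same operation to sapply
apply_result = sapply(X = values, FUN = function(value) value^2)

# both approaches give the same answer
identical(loop_result, apply_result)
```

The apply version is shorter because it handles the bookkeeping (creating storage, indexing) for you.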

lapply

lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

For every column in mtcars, apply the mean() function. Only the first six rows are shown below; the means are computed over all rows.

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```{r}
lapply(X = mtcars, FUN = mean)
```
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725

$qsec
[1] 17.84875

$vs
[1] 0.4375

$am
[1] 0.40625

$gear
[1] 3.6875

$carb
[1] 2.8125
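
lapply works on ordinary lists as well as dataframes. As a small sketch, the code below reuses the temperature vectors from the Lists section and stores them under the illustrative name `temps`.

```{r}
temps = list("last_week" = c(50, 42, 50, 54, 57, 60, 59),
             "this_week" = c(58, 58, 65, 67, 60))

# one mean per list element, returned as a list with the same names
weekly_means = lapply(X = temps, FUN = mean)

# pull out a single result with double brackets
weekly_means[["this_week"]]
```

Because the result is itself a list, everything from the Accessing Lists section applies to it.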

sapply

sapply is similar to lapply, but it returns a vector (or matrix) if it can. Be careful, as its results can surprise you!

For every column in mtcars, apply the mean() function.

```{r}
sapply(X = mtcars, FUN = mean, simplify = TRUE)
```
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 
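
One way sapply can surprise you is sketched below: the type of the result depends on what FUN returns. The list contents are made up, and vapply is shown only as one possible stricter alternative.

```{r}
# every result has length 1, so sapply simplifies to a named vector
same_length = sapply(X = list("a" = 1:3, "b" = 4:6), FUN = mean)
class(same_length)

# results have different lengths, so sapply quietly returns a list instead
different_length = sapply(X = list("a" = 1:3, "b" = 4:5),
                          FUN = function(x) x * 2)
class(different_length)

# vapply is stricter: it errors if a result does not match FUN.VALUE
checked = vapply(X = list("a" = 1:3, "b" = 4:6), FUN = mean,
                 FUN.VALUE = numeric(1))
checked
```

If later code assumes a vector, the silent switch to a list in the second case is where things tend to break.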

apply

apply is used for matrices or dataframes. Supply the MARGIN argument to choose the direction: MARGIN = 1 works over rows, MARGIN = 2 works over columns.

For every column in mtcars, and then for every row in head(mtcars), apply the mean() function.

Columns

```{r}
apply(X = mtcars, MARGIN = 2, FUN = mean)
```
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

Rows

```{r}
apply(X = head(mtcars), MARGIN = 1, FUN = mean)
```
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
         29.90727          29.98136          23.59818          38.73955 
Hornet Sportabout           Valiant 
         53.66455          35.04909 
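
The two MARGIN values are easier to tell apart on a tiny matrix. This is a minimal sketch with made-up numbers stored under the illustrative name `scores`.

```{r}
# a small 2 x 3 matrix with made-up values
scores = matrix(c(1, 2, 3,
                  4, 5, 6),
                nrow = 2, byrow = TRUE)

# MARGIN = 1 works across each row
row_sums = apply(X = scores, MARGIN = 1, FUN = sum)
row_sums

# MARGIN = 2 works down each column
col_sums = apply(X = scores, MARGIN = 2, FUN = sum)
col_sums
```

Two rows give two row sums (6 and 15), and three columns give three column sums (5, 7, and 9).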

You can write FUN!

You can pass any function to FUN, including one you write!


This means you can run any operation you can write over every element of a large collection of data.

```{r}
lapply(X = mtcars, FUN = function(column){
  
  # get the largest value in this column
  largest = max(column)
  
  # get the smallest value in this column
  smallest = min(column)
  
  # get the difference
  result = largest - smallest
  
  # return the difference
  return(result)
})
```
$mpg
[1] 23.5

$cyl
[1] 4

$disp
[1] 400.9

$hp
[1] 283

$drat
[1] 2.17

$wt
[1] 3.911

$qsec
[1] 8.4

$vs
[1] 1

$am
[1] 1

$gear
[1] 2

$carb
[1] 7
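
The function does not have to be written inline. As a sketch, the same calculation can be stored under a name and reused; `value_range` and its `na.rm` option are illustrative additions, not part of the original example.

```{r}
# a named version of the function above, with an option to ignore missing values
value_range = function(values, na.rm = FALSE) {
  max(values, na.rm = na.rm) - min(values, na.rm = na.rm)
}

# arguments placed after FUN are passed along to it
ranges = sapply(X = mtcars, FUN = value_range, na.rm = TRUE)

ranges
```

Naming the function makes it easy to test on a single column first, for example `value_range(mtcars$mpg)`, before applying it to everything.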

Parallelization

What is Parallelization

Parallelization in R

The built-in parallel package in R offers several tools to run code in parallel.


Mostly, these take the form of apply family functions.


There can be no dependence between elements.

```{r}
library(parallel)

# make a cluster of workers
cl <- makeCluster(getOption("cl.cores", 2))

# perform an sapply in parallel
parSapply(cl, 1:20, get("+"), 3)

# stop the cluster
stopCluster(cl)
rm(cl)
```
 [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
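
The same pattern works with a function you write yourself. The sketch below uses parLapply, the parallel counterpart of lapply; the worker count of 2 is an assumption and may need adjusting for your machine.

```{r}
library(parallel)

# make a cluster of two workers (an assumed number; adjust as needed)
cl = makeCluster(2)

# each worker squares its share of the inputs independently
squares = parLapply(cl = cl, X = 1:4, fun = function(value) value^2)

# always stop the cluster when finished
stopCluster(cl)

# parLapply returns a list, just like lapply
unlist(squares)
```

For a task this small, starting the cluster likely costs more time than it saves; parallel code tends to pay off only when each element takes a while to process.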

For Next Time

Topic

Lab 6 & Quiz 2 Open

To-Do

  • Finish Worksheet
  • Turn in Project 1