The apply family of functions can be a handy way to perform analyses and transformations of data quickly. Today we will be using it to improve some previous tasks: first we will revisit pet_split() from last time, and then we will solve an old annoyance regarding reading in several data files.
We are going to be using class survey data for lab today. Please load it in using the following:
survey = read.csv("https://raw.githubusercontent.com/Intro-to-Data-Science-Template/intro_to_data_science_reader/main/content/class_worksheets/4_r_rstudio/data/survey_data.csv")
pet_split()
The first thing we will be doing today is combining our pet_split() function and lapply() in order to split all of our <DRINK>_days comma-separated columns at once. I've provided our fully generalized version of pet_split() below. Run the following to add it to your environment.
pet_split = function(pet_vector, possible_columns){
  # make a base dataframe with rows for each of our cases
  pet_output = data.frame(
    "id" = 1:length(pet_vector)
  )
  # iterate through all options and create a column of NAs for each
  for(option in possible_columns){
    # make a new column with a character version of each possible option
    pet_output[, as.character(option)] = NA
  }
  # fill output df
  for(option in possible_columns){
    # fill the dataframe iteratively
    pet_output[, option] = grepl(option, pet_vector, ignore.case = TRUE)
  }
  # clear all known options
  for(option in possible_columns){
    # remove all known options
    pet_vector = gsub(pattern = option, replacement = '', x = pet_vector, ignore.case = TRUE)
  }
  # clear commas and whitespace
  pet_vector = gsub(pattern = ',', replacement = '', x = pet_vector)
  pet_vector = trimws(pet_vector)
  # fill in 'other'
  pet_output$other = pet_vector
  # turn blanks into NAs
  pet_output[pet_output$other == "" & !is.na(pet_output$other), 'other'] = NA
  # return output
  return(pet_output)
}
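If the grepl()/gsub() pair inside the function feels opaque, here is the core idea on a small made-up vector (not from the survey data):

```r
pets = c("dog, cat", "cat", "fish")
# grepl() returns TRUE/FALSE for whether each element contains the pattern,
# which is how each option column gets filled
grepl("cat", pets, ignore.case = TRUE)
# TRUE TRUE FALSE
# gsub() removes every match, which is how known options get cleared out,
# leaving only the "other" leftovers
gsub(pattern = "cat", replacement = "", x = pets, ignore.case = TRUE)
# "dog, " ""     "fish"
```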
Once you've got the function in your environment, we are going to use lapply() to apply it over all the relevant columns in our survey dataframe.
Use lapply() to apply pet_split() to all of the <DRINK>_days columns in our survey dataframe. Save the results as drink_dfs. DO NOT include the () after pet_split when providing it as an argument; it will produce an error.
drink_dfs = lapply(X = survey[, c("coffee_days", "tea_days", "soda.pop_days", "juice_days", "none_days")],
                   FUN = pet_split,
                   possible_columns = c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"))
You will have gotten a list object of length 5 back. Recall that lists are super-vectors. In this case, each element of our list contains an entire dataframe!
Working with lists is tricky. Because they can contain anything, it is hard to have set rules. The one you can always rely on is that you need to use double square brackets [[ ]] to work with the contents of a list. For example, in our new drink_dfs list, we can ask for the contents of the first element either by position (drink_dfs[[1]]) or by name (drink_dfs[["coffee_days"]]). If you only use a single square bracket with a list (drink_dfs[1]), you instead get a slice of the list: a list of length 1 with the contents still inside of it. Confusing, I know.
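A tiny made-up list makes the difference concrete:

```r
# a toy list holding a dataframe and a string
my_list = list(a = data.frame(x = 1:3), b = "hello")
class(my_list[["a"]])  # "data.frame" -- the contents themselves
class(my_list["a"])    # "list" -- a slice of the list
length(my_list["a"])   # 1 -- a length-1 list holding the dataframe
```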
Run str() on both str(drink_dfs[[1]]) and str(drink_dfs[1]). What are the differences between the two? The first says it is a dataframe, while the second says it is a list of length 1 with a dataframe inside of it.
Once we subset the list using [[ ]], we can treat it like a normal dataframe; auto-complete will even still work! Start typing drink_dfs[["coffee_days"]]$ into the console and hit tab. All of the columns of that dataframe will appear as normal!
A list of dataframes is actually a handy structure, because it allows us to work with all of those dataframes simultaneously. In the following example, I use lapply() to perform a custom function on each of our drink dataframes all at once. First I subset to just the day columns, then I get the column sums for each day.
lapply(drink_dfs, FUN = function(drink){
  # for each drink, find what day it was most frequently consumed by our class
  # subset to the day columns
  days = drink[, c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday")]
  # get the column sums for each day
  # recall each row is a person in class, and each column is a day,
  # so the column sums are the number of people who had that drink on each day
  class_consumed = colSums(days)
  # return output
  return(class_consumed)
})
$coffee_days
monday tuesday wednesday thursday friday saturday sunday
4 5 5 2 2 3 5
$tea_days
monday tuesday wednesday thursday friday saturday sunday
0 0 2 0 1 1 3
$soda.pop_days
monday tuesday wednesday thursday friday saturday sunday
4 4 2 2 3 1 3
$juice_days
monday tuesday wednesday thursday friday saturday sunday
6 6 7 7 8 6 5
$none_days
monday tuesday wednesday thursday friday saturday sunday
4 3 1 4 2 2 2
That would have taken so much more typing otherwise. An apply function like this can make our code much cleaner and easier to read. Typing it all out by hand would also increase the chance of making a mistake somewhere, leaving one of our calculations different from the others. This way, we know all of them are exactly the same. You could also accomplish the same thing with a for() loop, and in most cases the differences are minimal. Do whatever makes most sense to you!
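For comparison, here is the same computation sketched as a for() loop. It uses a small made-up list of dataframes in place of drink_dfs so the snippet runs on its own:

```r
# a toy stand-in for drink_dfs: a named list of dataframes with day columns
toy_dfs = list(
  coffee_days = data.frame(monday = c(TRUE, FALSE), tuesday = c(TRUE, TRUE)),
  tea_days    = data.frame(monday = c(FALSE, FALSE), tuesday = c(TRUE, FALSE))
)
# pre-allocate the output list, then fill it one element at a time
sums = vector("list", length(toy_dfs))
names(sums) = names(toy_dfs)
for(drink_name in names(toy_dfs)){
  sums[[drink_name]] = colSums(toy_dfs[[drink_name]])
}
sums$coffee_days
# monday tuesday
#      1       2
```

Notice the loop version needs the pre-allocation and naming bookkeeping that lapply() handles for you.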
I promised I would share how to read multiple files at once, and here we are. This is actually a common way I use lapply(). To start, we need a directory of data files with identical structures. For this example, we will use the data from lab 3 (aggregation and merging). First, specify the path to the data folder of lab 3. You will have to find the specific path on your computer.
# change this to match your system
lab_3_data = "path/to/lab-3-tidy-agg-merge-NAME/data/"
Next, we will get a vector of all the file paths of the data inside that folder using list.files(). I use the pattern argument here to say "I only want files whose names contain this." In this case, it is the "econ_" prefix, so I only get our "econ_acs5_YEAR.csv" files.
econ_data_paths = list.files(lab_3_data, pattern = "econ_", full.names = TRUE)
Now, we will use lapply() to read in all of the econ_X dataframes from that lab at once.
all_econ_data = lapply(econ_data_paths, read.csv)
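If you want to see this pattern in isolation, here is a self-contained version that writes two small CSVs to a temporary folder and reads them back. The file names and contents are made up purely for illustration:

```r
# write two small CSVs with identical structure to a temp folder
tmp = tempdir()
write.csv(data.frame(x = 1:2), file.path(tmp, "econ_2019.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(tmp, "econ_2020.csv"), row.names = FALSE)
# collect their full paths, filtering on the shared prefix
paths = list.files(tmp, pattern = "econ_", full.names = TRUE)
# read them all in with one lapply() call
dfs = lapply(paths, read.csv)
length(dfs)  # 2
```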
You now have a list of length 6 with all of the econ data from that lab! No copying and pasting required. You can even pivot and merge them all at once too. First, I'll pivot everything from long to wide:
library(tidyr)
all_econ_data_wide = lapply(all_econ_data,
FUN = pivot_wider,
id_cols = c("GEOID", "NAME"),
names_from = "variable",
values_from = c("estimate", "moe"))
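To see what that pivot does on its own, here is a toy long-format dataframe (the values are invented, but the column names mirror our econ files) pivoted the same way. This assumes tidyr is installed:

```r
library(tidyr)
# toy long-format data: one row per (GEOID, variable) pair
long = data.frame(
  GEOID    = c("01", "01", "02", "02"),
  NAME     = c("A", "A", "B", "B"),
  variable = c("income", "rent", "income", "rent"),
  estimate = c(100, 50, 200, 75),
  moe      = c(5, 2, 8, 3)
)
wide = pivot_wider(long,
                   id_cols = c("GEOID", "NAME"),
                   names_from = "variable",
                   values_from = c("estimate", "moe"))
# one row per GEOID, with columns like estimate_income, moe_rent, ...
nrow(wide)  # 2
```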
Next, we'll use basename() to give each dataframe a year identifier (we'll just use the file name for now). I'll do this one in a for() loop, as it's easier to match up the file-names vector with our data list elements.
for(i in seq_along(econ_data_paths)){
  # get the file name I want
  file_name = basename(econ_data_paths[i])
  # add that as a column to the matching list element
  all_econ_data_wide[[i]]$file_name = file_name
}
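basename() simply strips the directory part of a path, which is why each dataframe ends up tagged with just its source file name (the path below is made up for illustration):

```r
# basename() drops everything up to and including the last path separator
path = "some/folder/econ_acs5_2019.csv"
basename(path)  # "econ_acs5_2019.csv"
```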
Now we can bind them all together. I'll use do.call() here. Really, this is the only time I ever use it, and I don't know what else it is helpful for.
merged_econ = do.call(rbind, all_econ_data_wide)
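What do.call() does is call a function with the elements of a list as its arguments, so do.call(rbind, dfs) is equivalent to writing out rbind(dfs[[1]], dfs[[2]], ...). A tiny made-up example:

```r
# two one-row dataframes to stand in for our list of wide dataframes
dfs = list(data.frame(x = 1), data.frame(x = 2))
# do.call(rbind, dfs) expands to rbind(dfs[[1]], dfs[[2]])
merged = do.call(rbind, dfs)
nrow(merged)  # 2
```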
Repeat this process with the “pop_acs5_XXXX” CSVs from lab 3.
lab_3_data = "path/to/lab-3-tidy-agg-merge-NAME/data/"
pop_data_paths = list.files(lab_3_data, pattern = "pop_", full.names = TRUE)
all_pop_data = lapply(pop_data_paths, read.csv)
all_pop_data_wide = lapply(all_pop_data,
                           FUN = pivot_wider,
                           id_cols = c("GEOID", "NAME"),
                           names_from = "variable",
                           values_from = c("estimate", "moe"))
for(i in seq_along(pop_data_paths)){
  # get the file name I want
  file_name = basename(pop_data_paths[i])
  # add that as a column to the matching list element
  all_pop_data_wide[[i]]$file_name = file_name
}
merged_pop = do.call(rbind, all_pop_data_wide)