Lab 8. Web Scraping

Author

Jared Joseph

Introduction

Web scraping gives you the ability to create your own datasets when you need them. While getting information from a single page, like our worksheet from Wednesday, can be helpful, it is really only the first step of building a proper dataset. Today we will continue our work scraping the Smith website, with the goal of practicing a full web scraping workflow.

Click here to access the lab on GitHub Classroom: GitHub Classroom Assignment for Lab 8: Web Scraping

The Target

The first target of our scraping will again be the Smith Statistical & Data Sciences webpage. However, instead of stopping with the information on the home page, we will teach our scraper to “crawl” through the site and visit multiple pages to build our dataset.

We will eventually want to visit each faculty page (here is mine, for example), and combine the information on that page (email, office hours, etc.) with what we already have from the home page (name, title, links). Once we have code to do that, we can expand from one program to several.

Homepage Code

Let’s recap the code we wrote on Wednesday. I’ve provided a working copy below. It first reads the program homepage into R, makes a dataframe with a row for each faculty member, then adds columns for their titles and the relative links to their personal pages.

library(rvest)

# get the home page into R
sds_page = read_html('https://www.smith.edu/academics/statistics')

# make a dataframe with all the names
sds_faculty = data.frame('name' = html_text2(html_elements(sds_page, '.fac-inset h3')))

# add titles to the dataframe
sds_faculty$title = html_text2(html_elements(sds_page, '.fac-inset p'))

# get the relative links to each faculty page
sds_faculty$link = html_attr(html_elements(sds_page, '.linkopacity'), name = 'href')

Question 1

Take the code from Wednesday’s worksheet and turn it into a function called scrape_homepage. This function should accept an argument called url, which will be the URL of the program homepage we want to scrape. It should output the sds_faculty dataframe shown above.

Then, use your new function to scrape the SDS homepage into an object called sds_faculty.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>
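
If you get stuck, here is a minimal sketch of one way the function could look. It simply reuses the selectors from the homepage code above; nothing new is assumed beyond the page structure we already scraped.

library(rvest)

# wrap Wednesday's code in a function that takes the homepage URL
scrape_homepage = function(url){
  
  # get the home page into R
  page = read_html(url)
  
  # make a dataframe with all the names
  faculty = data.frame('name' = html_text2(html_elements(page, '.fac-inset h3')))
  
  # add titles and relative links
  faculty$title = html_text2(html_elements(page, '.fac-inset p'))
  faculty$link = html_attr(html_elements(page, '.linkopacity'), name = 'href')
  
  return(faculty)
}

# use the function to scrape the SDS homepage
sds_faculty = scrape_homepage('https://www.smith.edu/academics/statistics')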

Digging Deeper

Now that we have the homepage data again, we’re going to dig a little deeper and follow those faculty links. We want to scrape each individual faculty page to get that faculty member’s email and office hours.

Question 2

Write code to iterate through the link column of sds_faculty and scrape email addresses and office hours info.

USE THE TEMPLATE BELOW TO MAKE SURE YOUR BOTS ARE POLITE AND WAIT BETWEEN EACH PAGE.

The Sys.sleep(10) function will make R wait 10 seconds between page requests, just like the https://www.smith.edu/robots.txt file asks us to. It also means the code will take roughly 10 seconds per faculty member to run, so don’t be surprised if it takes a while.

Tip

When getting your selector targets from the faculty pages, you will probably have to try multiple versions before your code works for every faculty member. Try selecting the same info on several different pages to find selectors that work for all of them. There is no way around it: web scraping is just messy.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>

# Template

# for(x in y){
#   
#   # your code here
#   
#   # wait the 10 seconds requested by robots.txt
#   Sys.sleep(10)
#   
# }
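
Here is one possible sketch that fills in the template. Note that the CSS selectors '.fac-email a' and '.fac-hours' are hypothetical placeholders, and prepending 'https://www.smith.edu' to each relative link is an assumption; you will need to verify both against the real pages.

# a sketch only: the selectors below are HYPOTHETICAL placeholders, and
# the base URL prefix is an assumption about how the relative links work
sds_faculty$email = NA
sds_faculty$office_hours = NA

for(i in seq_along(sds_faculty$link)){
  
  # turn the relative link into a full URL and read the page
  fac_page = read_html(paste0('https://www.smith.edu', sds_faculty$link[i]))
  
  # pull the email and office hours (replace with selectors that work!)
  sds_faculty$email[i] = html_text2(html_element(fac_page, '.fac-email a'))
  sds_faculty$office_hours[i] = html_text2(html_element(fac_page, '.fac-hours'))
  
  # wait the 10 seconds requested by robots.txt
  Sys.sleep(10)
  
}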

Tying it Together

Now that we have code to go to each individual faculty page, let’s package that up nicely as well.

Question 3

Incorporate the code from the previous section into our scrape_homepage function, such that you can give the function a program homepage URL and it will return a dataframe with the names, titles, page URLs, emails, and office hours of all faculty in that program.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>
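
As a reference point, a sketch of the combined function might look like the following, with the same hypothetical selectors and base URL assumption as above.

# a sketch combining the two pieces; selectors are still placeholders
scrape_homepage = function(url){
  
  page = read_html(url)
  
  # homepage info: names, titles, relative links
  faculty = data.frame('name' = html_text2(html_elements(page, '.fac-inset h3')))
  faculty$title = html_text2(html_elements(page, '.fac-inset p'))
  faculty$link = html_attr(html_elements(page, '.linkopacity'), name = 'href')
  
  # individual page info: email and office hours
  faculty$email = NA
  faculty$office_hours = NA
  
  for(i in seq_along(faculty$link)){
    fac_page = read_html(paste0('https://www.smith.edu', faculty$link[i]))
    faculty$email[i] = html_text2(html_element(fac_page, '.fac-email a'))
    faculty$office_hours[i] = html_text2(html_element(fac_page, '.fac-hours'))
    Sys.sleep(10)
  }
  
  return(faculty)
}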

Building a Database

We now have a function that can take a department’s homepage URL and get info on all of its faculty. The next layer of the onion is doing this for every program.

CHALLENGE QUESTION

Using some form of iteration, run our scrape_homepage() function on the homepages of all 50 major programs at Smith. We want each row in our final dataframe to correspond to an individual faculty member. Additionally, modify the scrape_homepage() function so it also includes the name of the program the faculty member is a part of. If you run into bugs, handle them with error checking inside the function.

Actually running that code could take quite a while given the speed restrictions.

#<REPLACE THIS COMMENT WITH YOUR ANSWER>
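
One rough sketch of the outer loop is below. It assumes you have already collected the program homepage URLs and names into two hypothetical vectors, program_urls and program_names, and it tags the program after each call rather than inside the function; tryCatch() is one way to keep a single broken page from killing the whole run.

# program_urls and program_names are HYPOTHETICAL vectors you would
# need to build yourself (e.g., by scraping Smith's list of programs)
all_faculty = data.frame()

for(i in seq_along(program_urls)){
  
  # if one program page errors out, record NULL and keep going
  program_faculty = tryCatch(scrape_homepage(program_urls[i]),
                             error = function(e) NULL)
  
  if(!is.null(program_faculty)){
    # tag each row with its program, then stack onto the running dataframe
    program_faculty$program = program_names[i]
    all_faculty = rbind(all_faculty, program_faculty)
  }
  
}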