Lab 8. Web Scraping

Introduction

Web scraping can give you the ability to create your own datasets when needed. While getting information from one page (like our worksheet from Wednesday) can be helpful, it is really only the first step of building a proper dataset. Today we will continue our work scraping the Smith website, with the goal of practicing a full web scraping workflow.

Click here to access the lab on Github Classroom: Github Classroom Assignment for Lab 8: Web Scraping

The Target

The first target of our scraping will again be the Smith Statistical & Data Sciences webpage. However, instead of stopping with the information on the home page, we will teach our scraper to “crawl” through the site and visit multiple pages to build our dataset.

We will eventually want to visit each faculty page (here is mine, for example), and combine the information on that page (email, office hours, etc.) with what we already have from the home page (name, title, links). Once we have code to do that, we can expand from one program to several.

Homepage Code

Let’s recap the code we wrote on Wednesday. I’ve provided a copy that works below. It first reads the program homepage into R, makes a dataframe with a row for each faculty member, then adds columns for their titles and the relative links to their personal pages.

library(rvest)

# get the home page into R
sds_page = read_html('https://www.smith.edu/academics/statistics')

# make a dataframe with all the names
sds_faculty = data.frame('name' = html_text2(html_elements(sds_page, '.fac-inset h3')))

# add titles to the dataframe
sds_faculty$title = html_text2(html_elements(sds_page, '.fac-inset p'))

# get the relative links to each faculty page
sds_faculty$link = html_attr(html_elements(sds_page, '.linkopacity'), name = 'href')
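If the scrape worked, sds_faculty should now have one row per faculty member, with name, title, and link columns; a quick sanity check is to peek at the first few rows:

head(sds_faculty)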
Digging Deeper
Now that we have the homepage data again, we’re going to dig a little deeper and follow those faculty links. We want to scrape each individual faculty member page to get the faculty member’s email and office hours info.
#<REPLACE THIS COMMENT WITH YOUR ANSWER>
# Template
# for(x in y){
#
# # your code here
#
# # wait the 10 seconds requested by robots.txt
# Sys.sleep(10)
#
# }
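To make the template concrete, here is a minimal sketch of how the filled-in loop might look. It assumes the relative links just need 'https://www.smith.edu' prepended, that each faculty page lists the email in a mailto: link, and that office hours live in an element matched by '.fac-hours'; that last selector is a placeholder, so you will need to inspect a real faculty page to find the right one.

# columns to fill in as we visit each page
sds_faculty$email = NA
sds_faculty$hours = NA

for(i in seq_len(nrow(sds_faculty))){

  # the links are relative, so prepend the base URL (assumed)
  fac_page = read_html(paste0('https://www.smith.edu', sds_faculty$link[i]))

  # pull the address out of the first mailto: link on the page
  email_href = html_attr(html_element(fac_page, 'a[href^="mailto:"]'), 'href')
  sds_faculty$email[i] = sub('mailto:', '', email_href)

  # '.fac-hours' is a placeholder selector for the office hours block
  sds_faculty$hours[i] = html_text2(html_element(fac_page, '.fac-hours'))

  # wait the 10 seconds requested by robots.txt
  Sys.sleep(10)
}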
Tying it Together
Now that we have code to visit each individual faculty page, let’s package the whole workflow up nicely as a single function.
#<REPLACE THIS COMMENT WITH YOUR ANSWER>
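As one possible packaging, here is a sketch under the same assumptions as the loop above: a function that takes a department homepage URL and returns the finished faculty dataframe. The name scrape_department is made up, and the homepage selectors are reused from the statistics page, so they may need adjusting for other departments.

scrape_department = function(url){
  page = read_html(url)

  # same homepage selectors as before; these may differ on other pages
  faculty = data.frame('name' = html_text2(html_elements(page, '.fac-inset h3')))
  faculty$title = html_text2(html_elements(page, '.fac-inset p'))
  faculty$link = html_attr(html_elements(page, '.linkopacity'), name = 'href')

  # visit each faculty page for email and office hours
  faculty$email = NA
  faculty$hours = NA
  for(i in seq_len(nrow(faculty))){
    fac_page = read_html(paste0('https://www.smith.edu', faculty$link[i]))
    email_href = html_attr(html_element(fac_page, 'a[href^="mailto:"]'), 'href')
    faculty$email[i] = sub('mailto:', '', email_href)
    # '.fac-hours' is still a placeholder selector
    faculty$hours[i] = html_text2(html_element(fac_page, '.fac-hours'))
    # wait the 10 seconds requested by robots.txt
    Sys.sleep(10)
  }
  faculty
}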
Building a Database
We now have a function that can take in a department’s homepage URL and get info on all of its faculty. The next layer of the onion is repeating this for every department.
#<REPLACE THIS COMMENT WITH YOUR ANSWER>
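Assuming a scrape_department() function along the lines sketched above, the outer layer can simply map it over a vector of department homepage URLs and stack the results. Only the statistics URL is known from this lab, so the vector would be extended by hand:

# department homepage URLs to crawl; any beyond statistics would be
# looked up on the Smith site
dept_urls = c('https://www.smith.edu/academics/statistics')

# scrape each department and stack the resulting dataframes into one
all_faculty = do.call(rbind, lapply(dept_urls, scrape_department))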