Day 29 - Web Scraping

Fall 2022

Dr. Jared Joseph

November 16, 2022

Overview

Timeline

  • What is Web Scraping
  • The Legalities of Web Scraping
  • The Ethics of Web Scraping
  • The Tools & Limits

Goal

Build a bespoke scraper for the SDS website and understand how it works.

What is Web Scraping

A Quick Definition

Web scraping refers to the process of programmatically collecting data from the internet, typically from web pages.


This is often done in a way not intended by website owners.


Web scraping can be useful, but must be used responsibly.

```mermaid
graph TD
  A[Website] --> B{R}
  B --> C[Dataset]
```

A Word of Warning




If you screw this up, you can get the entire university banned from a website.

An Analogy

Web scraping is like going to an event, eating the hors d’oeuvres, and leaving.


The event (website) wants you to stick around, look at some ads, maybe buy something.


If you take your food politely, you’re probably fine. If you’re a jerk, you’re going to cause a scene.

(Analogy: Chris Schaer)

A Use Case

Say you want to see where Smith faculty earned their degrees: find the most common institutions, compare Ivy League versus not, and so on.


You could:

Go to every faculty page on the Smith website and copy/paste the info into a spreadsheet.

OR

Write some code to do all that for you.

The Legalities of Web Scraping

(General ideas, I’m not a lawyer; not legal advice, etc.)

A Primer

The legality of web scraping is a grey area. It depends on several factors, including:

  • The kind of data you are trying to get
  • How you are getting and saving it
  • What you plan to do with the data once you have it

In general, you should never scrape:

  • Anything under copyright
  • Anything about private people
  • Anything you need to log in to see

Terms of Service

Most Terms of Service (ToS) will explicitly state whether they allow scraping.


The ToS is often linked in the footer, at the bottom of the home page.


Look out for terms like:

  • automated
  • bot
  • scrape
  • crawl
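
If a ToS is long, you can skim it programmatically. Here is a minimal sketch in R, where `tos_text` is a hypothetical character vector holding the ToS one line per element:

```{r}
# flag any ToS lines that mention scraping-related terms
# (tos_text is a stand-in; load the real ToS text yourself)
tos_text = c("You may not use automated means to access the Site.",
             "Content is provided for your personal use.")
grep("automated|bot|scrape|crawl", tos_text,
     ignore.case = TRUE, value = TRUE)
```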

Robots.txt

Most sites also have a robots.txt file.


You can find this file by going to the top page of a website and adding /robots.txt to the end of the URL.


It often has detailed rules on which pages you can scrape, and how often. For example, the Smith website asks bots to pause for 10 seconds between pages, and lists many pages that are disallowed.
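
You can also pull a robots.txt straight into R to read it. A quick sketch, assuming the file lives at the usual /robots.txt location:

```{r}
# fetch Smith's robots.txt and peek at the first few lines
robots = readLines("https://www.smith.edu/robots.txt")
head(robots, 10)
```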

The Ethics of Web Scraping

The T3 Dataset

The Tastes, Ties, and Time (T3) Dataset contained the following for nearly an entire cohort (grade level) of students, following them every year from 2006 to 2009:

  • Race
  • Gender
  • Political views
  • Home state/country
  • Major
  • Relationships
  • Official housing records

This data was collected directly from the university and Facebook with no input from the students.

“With permission from Facebook and the university in question, we first accessed Facebook on March 10 and 11, 2006 and downloaded the profile and network data provided by one cohort of college students. This population, the freshman class of 2009 at a diverse private college in the Northeast U.S., has an exceptionally high participation rate on Facebook: of the 1640 freshmen students enrolled at the college, 97.4% maintained Facebook profiles at the time of download and 59.2% of these students had last updated their profile within 5 days. (Lewis et al. 2008, p. 331)”

Serious Concerns

“The ‘non-identifiability’ of such a dataset is up for debate. A friend network can be thought of as a fingerprint; it is likely that no two networks will be exactly similar, meaning individuals may be able to be identified in the dataset post-hoc… Further, the authors of the dataset plan to release student ‘Favorite’ data in 2011, which will provide further information that may lead to identification. (Stutzman 2008)”

“I think it’s hard to imagine that some of this anonymity wouldn’t be breached with some of the participants in the sample. For one thing, some nationalities are only represented by one person. Another issue is that the particular list of majors makes it quite easy to guess which specific school was used to draw the sample. Put those two pieces of information together and I can imagine all sorts of identities becoming rather obvious to at least some people. (Hargittai 2008)”

“The Data was Already Public”

The T3 authors said that:

“We have not accessed any information not otherwise available on Facebook”

The T3 project used student research assistants (some of whom had privileged access to other students' networks through mutual or direct friendships) to collect all the data.


Given that, was the data really public?


Even if it was, people did not expect their data to be collected and aggregated in this way.

The Tools & Limits

Simple Vs. Complex Pages

Simple

Static pages that display pre-set content.

Complex

Dynamic pages that update given user input or other factors.

A Primer on HTML

HTML (HyperText Markup Language) is the language used to build pretty much everything you see on the web.


Very generally, it creates sections on a web page, and you can apply certain properties to that section. For example, the color, size, and style of text. We can use that section structure to get the data we want.


Fun fact, these slides are all HTML! That’s why you view them in a web browser.

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1>Heading 1</h1>
  <p>Hello world! <b>Bold Hello world!</b></p>
  <a href='jnjoseph.com'>I am a link!</a>
</body>
</html>
```

SelectorGadget

SelectorGadget helps us isolate the elements we want from a web page. In this case, faculty names from the SDS page.

rvest

The rvest package simplifies a lot of basic web scraping.


We can give rvest the CSS selectors from SelectorGadget to easily compile the data we want.


We can use this process to make our own data from information on the web.

Scrape the Data

Here I’ll grab the names of all the faculty on the SDS page.


I first read the entire page into R using read_html() from rvest.


After that, I essentially subset that web page using the HTML sections I got from SelectorGadget.


Note: If the website ever changes significantly, our scraper code will probably break!

```{r}
library(rvest)

# download the SDS page
sds_page = read_html(
  "https://www.smith.edu/academics/statistics")

# what is it?
class(sds_page)
```
[1] "xml_document" "xml_node"    
```{r}
# Get the names of all SDS faculty
html_text2(
  html_elements(sds_page, ".fac-inset h3")
  )
```
 [1] "Ben Baumer"            "Shiya Cao"             "Kaitlyn Cook"         
 [4] "Rosie Dutt"            "Randi L. Garcia"       "Katherine Halvorsen"  
 [7] "Will Hopper"           "Nicholas Horton"       "Jared Joseph"         
[10] "Albert Young-Sun Kim"  "Katherine M. Kinnaird" "Scott LaCombe"        
[13] "Lindsay Poirier"       "Nutcha Wattanachit"    "Faith Zhang"          

Make a Dataframe

We can repeat this process to create a whole dataframe of information!

```{r}
sds_df = data.frame(
  "name" = html_text2(html_elements(sds_page, ".fac-inset h3")),
  "title" = html_text2(html_elements(sds_page, ".fac-inset p")),
  "rel_link" = html_attr(html_elements(sds_page, ".linkopacity"), name = "href")
)
```
| name | title | rel_link |
|------|-------|----------|
| Ben Baumer | Associate Professor of Statistical & Data Sciences | /academics/faculty/ben-baumer |
| Shiya Cao | MassMutual Assistant Professor of Statistical and Data Sciences | /academics/faculty/shiya-cao |
| Kaitlyn Cook | Assistant Professor of Statistical & Data Sciences | /academics/faculty/kaitlyn-cook |
| Rosie Dutt | Lecturer in Statistical and Data Sciences | /academics/faculty/rosie-dutt |
| Randi L. Garcia | Associate Professor of Psychology and of Statistical & Data Sciences | /academics/faculty/randi-garcia |
| Katherine Halvorsen | Professor Emerita of Mathematics & Statistics | /academics/faculty/katherine-halvorsen |
| Will Hopper | Lecturer in Statistical & Data Sciences | /academics/faculty/will-hopper |
| Nicholas Horton | Research Associate in Statistical & Data Sciences | /academics/faculty/nicholas-horton |
| Jared Joseph | Visiting Assistant Professor of Statistical and Data Sciences | /academics/faculty/jared-joseph |
| Albert Young-Sun Kim | Assistant Professor of Statistical & Data Sciences | /academics/faculty/albert-kim |
| Katherine M. Kinnaird | Clare Boothe Luce Assistant Professor of Computer Science and of Statistical & Data Sciences | /academics/faculty/katherine-kinnaird |
| Scott LaCombe | Assistant Professor of Government and of Statistical & Data Sciences | /academics/faculty/scott-lacombe |
| Lindsay Poirier | Assistant Professor of Statistics & Data Sciences | /academics/faculty/lindsay-poirier |
| Nutcha Wattanachit | UMass Teaching Associate, Statistical and Data Sciences | /academics/faculty/faculty-nutcha-wattanachit |
| Faith Zhang | Lecturer of Statistical and Data Sciences | /academics/faculty/faculty-faith-zhang |

Now Iterate

We now know how to get data from a single web page.


We also have a dataframe column with links to all the specific faculty pages.


We could iterate over those links to go to each of the pages and get more information.

This is where the danger is!


If we program a bot to go to more pages, it will do so as fast as possible unless we tell it otherwise.


Good bots take breaks so as to not overload the website. You can do that in R using Sys.sleep().
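
For example, here is a minimal sketch of polite iteration over the rel_link column from earlier. The `.fac-bio` selector is hypothetical; check the real faculty pages with SelectorGadget before running it.

```{r}
# visit each faculty page, pausing between requests
# NOTE: ".fac-bio" is a hypothetical selector; find the real
# one with SelectorGadget before running this loop
faculty_bios = vector("character", nrow(sds_df))

for (i in seq_len(nrow(sds_df))) {
  # build the full URL from the relative link
  fac_page = read_html(
    paste0("https://www.smith.edu", sds_df$rel_link[i]))

  # take the first matching element's text (NA if none match)
  faculty_bios[i] = html_text2(html_elements(fac_page, ".fac-bio"))[1]

  # be a good bot: honor the 10-second pause from robots.txt
  Sys.sleep(10)
}
```

The 10-second pause matches the crawl delay that Smith's robots.txt asks for.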

Code-Along

For Next Time

Topic

Remote Servers & APIs

To-Do

  • Finish Worksheet
  • Work on Project 2