Day 34 - Text as Data

Fall 2022

Dr. Jared Joseph

November 30, 2022

Overview

Timeline

  • Text as Data
  • The Text Mining Workflow
  • Code-Along

Goal

To introduce advanced text mining methods and show some of what is possible.

Text as Data

Text so Far

So far, text has been an occasional part of our data sets.


We have tried to split strings to isolate and count specific items or categories.


We have essentially tried to make text into something else.

Pets
None
None
Dog, Plants (two)
None
Dog
Dog
Cat, Rock
Dog
Spider Plant
Cat
Dog, Reptile
Dog
Cat
None
Reptile, Plant
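
As a reminder of the "split and count" approach, here is a minimal sketch in base R. The `pets` vector is a hand-typed subset of the column above, and the object names are made up for illustration.

```r
# a few of the pet responses from the table above, typed in by hand
pets = c("None", "Dog, Plants (two)", "Dog", "Cat, Rock", "Dog, Reptile")

# split each response on commas (and any spaces that follow them)
pet_list = strsplit(pets, ",\\s*")

# flatten the list into one vector and count each category
pet_counts = table(unlist(pet_list))
pet_counts
```

Notice that the text only matters here as a source of categories; entries like "Plants (two)" show how quickly free text resists this treatment.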

Text as Data

Making text the focus of your analyses requires a different perspective and a different set of tools.


Our training so far is a prerequisite for working with text, but text in R uses new object structures and workflows that we will only scratch the surface of today.


Today is a survey to let you know some of what is possible, not an exhaustive catalog!

Chapter 1. Mr. Sherlock Holmes

Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table. I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before. It was a fine, thick piece of wood, bulbous-headed, of the sort which is known as a “Penang lawyer.” Just under the head was a broad silver band nearly an inch across. “To James Mortimer, M.R.C.S., from his friends of the C.C.H.,” was engraved upon it, with the date “1884.” It was just such a stick as the old-fashioned family practitioner used to carry—dignified, solid, and reassuring. “Well, Watson, what do you make of it?” Holmes was sitting with his back to me, and I had given him no sign of my occupation. “How did you know what I was doing? I believe you have eyes in the back of your head.” “I have, at least, a well-polished, silver-plated coffee-pot in front of me,” said he. “But, tell me, Watson, what do you make of our visitor’s stick? Since we have been so unfortunate as to miss him and have no notion of his errand, this accidental souvenir becomes of importance. Let me hear you reconstruct the man by an examination of it.” “I think,” said I, following as far as I could the methods of my companion, “that Dr. Mortimer is a successful, elderly medical man, well-esteemed since those who know him give him this mark of their appreciation.” “Good!” said Holmes. “Excellent!” “I think also that the probability is in favour of his being a country practitioner who does a great deal of his visiting on foot.”

Key Terms

Corpus
A collection of all of our text data. You could think of this like a library.
Document
A unit of observation in our corpus. Think of a single book on a shelf in a library.
Token
The thing that makes up our documents. In our books, this would typically be individual words (but not always!).
Metadata
Data about our data. Important, but not something we want to mix in with our content.
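
To make these terms concrete, here is a small sketch using quanteda. The two texts, their names, and the `year` values are all made up for illustration; only the function names come from quanteda.

```r
library(quanteda)

# two made-up documents; the names become the document names
texts = c(book_a = "The dog barked. The cat ran.",
          book_b = "A spider plant sat on the shelf.")

# the corpus holds the documents, with metadata stored alongside (not inside) the text
toy_corpus = corpus(texts, docvars = data.frame(year = c(1901, 1902)))

ndoc(toy_corpus)      # how many documents are in the corpus
docvars(toy_corpus)   # the metadata for each document

# tokens are the pieces that make up each document, here individual words
toy_tokens = tokens(toy_corpus, remove_punct = TRUE)
ntoken(toy_tokens)    # how many tokens each document contains
```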

Text Mining in R

The tools to do text mining (and the more advanced natural language processing) move fast!


quanteda and its pals seem to be the front runner in R right now.

  • quanteda
  • quanteda.textmodels
  • quanteda.textstats
  • quanteda.textplots
  • quanteda.sentiment
  • quanteda.tidy

The Text Mining Workflow

Roadmap

Text mining often follows a familiar flow.

  1. Define Documents/Tokens
  2. Prepare Documents
  3. Make Corpus
  4. Clean Tokens
  5. Analyse

1. Define Documents/Tokens

Documents

What do you want to compare?

  • Whole books
  • Chapters from books
  • Paragraphs

Tokens

What holds meaning?

  • Words
  • Word groups
  • Sentences

The choice is important, but always specific to your project!
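
Both choices can be changed in code. The sketch below assumes a made-up one-document corpus (two sentences borrowed from the Sherlock Holmes excerpt) and shows one way to redefine documents as sentences, and one way to redefine tokens as word pairs.

```r
library(quanteda)

# a made-up corpus with a single "chapter" document
chapter = corpus(c(ch1 = "Holmes was sitting with his back to me. I had given him no sign."))

# option 1: change the document, so that each sentence is its own document
sentence_corpus = corpus_reshape(chapter, to = "sentences")
ndoc(sentence_corpus)

# option 2: change the token, so that each token is a pair of neighboring words
word_tokens = tokens(chapter, remove_punct = TRUE)
bigram_tokens = tokens_ngrams(word_tokens, n = 2)
bigram_tokens
```

Neither option is better in general; which one makes sense depends on what you want to compare.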

2. Prepare Documents

Once you have decided what you want to compare, you need to get your data into that format.


Your documents should contain only the meaningful content (tokens) you want to analyse.


As an example, I’ll use our class survey data, specifically what all of you said about your favorite art. Every response will be a document.

Favorite Art
I really enjoy all the work done by TeamLab. Their TeamLab Planets exhibit in Tokyo was one of the coolest things I have ever seen. It is an entire complex of large, interactive art that you walk through. One room was filled with ankle deep water that you waded through, while holographic fish swam by. If you touched any of the fish, they would explode into flowers. I also appreciate it from a technical perspective, as all the art is coded in Python.
NA
To preface, this is kind of an impossible question, because I can’t pick one favorite work of art. However, my favorite form of media is probably books, so I’ll just describe one that I like a lot. I really like the novel Ender’s Game, by Orson Scott Card. I first read it when I was about eleven or twelve (I have since read it at least twice and seen a movie adaptation), and I definitely associate it with the nostalgia of childhood. But that’s certainly not the main reasonI’ve always enjoyed science fiction and literally any novel containing a well-developed plot and a skillful plot twist, as well as social satire and allegories. This work embodies all of those criteria, incorporating science, a structurally sound and intriguing narrative, and social critique. And on top of that, it accomplishes all this using a well-built world populated with well-developed characters, including the protagonist Ender, who is profoundly relatable and likable. It’s the story of a brilliant child caught between a rock and a hard place, a story about war and morality, a story about space and other worlds and extinction and compassion across species. I love it.
The song “All Night” by Beyonce is without a doubt my favorite song of all time. In my opinion, Beyonce is the best performer or all time, living or dead. Her choreography and overall stage presence is unmatched. Furthermore, she is one of the best vocalists of our time. Specifically, in “All Night” she gives us a calming song that still has a strong beat that one could bop to. The song has good vibes and versatility. It can be played at any type of event. Additionally, the background behind the lyrics is complex and further attests to Beyonce’s incredible lyricism. Finally, “All Night” lives in Beyonce’s most iconic album, Lemonade, which she made in response to her husband’s infidelity. “All Night” perfectly captures how Beyonce can turn lemons into lemonade.
My favorite work of art would be photography, because I enjoy looking at everything visually. I feel that I forget moments quickly so being able to see an image lets me relive some memories and notice something new. I also enjoy the process of taking photos but probably when they’re less structured. A quick photo or even a video would do well for me.
Rap and R&B music
As of right now, my favorite work of art is the Percy Jackson series by Rick Riordan. Written for children, it manages to remain humorous to my 20-year-old self. It manages to be both relevant and rich in symbolism pertaining to ancient Greek mythology. It was also my favorite series as a child, so it has that nostalgia value.
My favorite artwork is an artist book by Barbra Kruger named “Thinking of Me, I Mean You, I Mean Me”. This work challenges reality and highlights the wicked ways in which capitalism manipulates us. She does so through photography, intense graphics, and brief but thought-provoking written statements.
Shepard Fairey artworks is my favorite due to the meaning behind the pieces and The Nighthawks Painting which captures my attention with the mood conveyed, style, and colors.
I don’t know if this would count as a single work of art, but one exhibit I’ve seen recently that I really like was Beatrice Glow’s exhibit at the Baltimore Museum of Art called “Once the Smoke Clears”. I liked it a lot because of how it included many different mediums of art, including some that I’ve rarely/never seen in museums, including 3D printed objects and ‘scent experiences’. I also really loved how in it she examined the relationship between colonialism and racism and the tobacco industry throughout history, which wasn’t something that I had really thought that much about before.
My favorite work of art is called Water Lilies by Claude Monet because since my mom really likes Monet, she took us to the MET and I saw a few of his artworks in person. So I instantly loved his work, while looking on the internet I came across Water Lilies and noticed that many of his paintings contained scenes with water in it. I just so happened to pick Water Lilies however because it combines to of my favorite things water and flowers.
my favorite work of art is Claude Monet’s garden paintings, such as the water lilies. I like it because i find it very appealing and beautiful. Theres are different versions of the same garden, painted in different seasons, and weather, which is shown through the art.
NA
NA
The marble sculpture, “Undine Rising from the Fountain”, created by the American artist Chauncey Ives in 1880 depicts the nymph Undine morphing from water form into human form. Ives utilizes technical skill to transform cold marble into flowing, soft fabric that drapes the feminine beauty of his subject. I like it because the stark difference between the media and the subject. It is similar to Raffaelle Monti’s sculptures, which are also a favorite of mine, however, I like that you can see Undine’s expression compared to Monti’s veiled faces.
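
Several responses above are NA. One possible preparation step, sketched below in base R, is to drop unanswered rows before building anything from them. The small `survey` data frame here is a made-up stand-in for the class data, and dropping NAs is an optional choice rather than a required step.

```r
# a made-up stand-in for the class survey, with one unanswered response
survey = data.frame(
  fav_char = c("Spike Spiegal", "Doreamon", "Thor"),
  fav_art  = c("I really enjoy all the work done by TeamLab. ", NA, "Rap and R&B music")
)

# keep only the rows where the question was answered
answered = survey[!is.na(survey$fav_art), ]

# trim stray whitespace from the start and end of each response
answered$fav_art = trimws(answered$fav_art)

nrow(answered)
```

Keeping the NA rows is also reasonable if you want every respondent to stay in the data; they simply become empty documents.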

3. Make Corpus

Now that we have our documents prepared, we can make our corpus.


A corpus stores our documents, like a bookshelf holds books.

library(quanteda)

# make our corpus to store our documents
survey_corpus = corpus(survey$fav_art,
                       docnames = survey$fav_char)
# get an idea of what is in our corpus
summary(survey_corpus)

Corpus consisting of 15 documents, showing 15 documents:

                                    Text Types Tokens Sentences
                           Spike Spiegal    66     93         6
                                Doreamon     0      0         0
                         Sherlock Holmes   135    221         8
                                   Tiana    91    157        10
                                   Crush    59     69         4
                                    Thor     6      6         1
 Rhys (from A Court of Thorns and Roses)    50     66         4
                                   Buffy    46     56         3
                             Sasha Braus    26     31         1
                                   Catra    80    114         3
                                 Pikachu    61     89         3
                               My Melody    39     53         3
                           Claire Fraser     0      0         0
                                Shinchan     0      0         0
                                 Kakashi    72    101         4

4. Clean Tokens

The goal is to turn our tokens into the cleanest representations of meaning that we can.

Options include:

  • Setting everything to lower case
  • Removing punctuation
  • Removing symbols
  • Removing slashes and separators
  • Removing numbers
  • Removing stopwords
  • Simplifying word forms

# Get all of the tokens from the documents in our corpus
survey_tokens = tokens(survey_corpus,
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_separators = TRUE)

# set everything to lower case
survey_tokens = tokens_tolower(survey_tokens)

# remove all the connective tissue
survey_tokens = tokens_remove(survey_tokens, stopwords("en"))

# convert tokens to simple word stems
survey_tokens = tokens_wordstem(survey_tokens)

# take a look at the cleaned tokens
survey_tokens

Tokens consisting of 15 documents.
Spike Spiegal :
 [1] "realli"  "enjoy"   "work"    "done"    "teamlab" "teamlab" "planet" 
 [8] "exhibit" "tokyo"   "one"     "coolest" "thing"  
[ ... and 29 more ]

Doreamon :
character(0)

Sherlock Holmes :
 [1] "prefac"   "kind"     "imposs"   "question" "pick"     "one"     
 [7] "favorit"  "work"     "art"      "howev"    "favorit"  "form"    
[ ... and 93 more ]

Tiana :
 [1] "song"    "night"   "beyonc"  "without" "doubt"   "favorit" "song"   
 [8] "time"    "opinion" "beyonc"  "best"    "perform"
[ ... and 60 more ]

Crush :
 [1] "favorit"     "work"        "art"         "photographi" "enjoy"      
 [6] "look"        "everyth"     "visual"      "feel"        "forget"     
[11] "moment"      "quick"      
[ ... and 22 more ]

Thor :
[1] "rap"   "r"     "b"     "music"

[ reached max_ndoc ... 9 more documents ]
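
To see what each cleaning step does, it can help to run the same steps on a single sentence. The sentence below is made up, and this sketch also uses `remove_numbers`, one of the listed options that the survey code above does not use.

```r
library(quanteda)

# one made-up sentence to clean
example_text = "In 1880, I really enjoyed the paintings and the sculptures!"

# split into words, dropping punctuation and numbers
example_tokens = tokens(example_text,
                        remove_punct = TRUE,
                        remove_numbers = TRUE)

# lower case, drop English stopwords, then reduce words to their stems
example_tokens = tokens_tolower(example_tokens)
example_tokens = tokens_remove(example_tokens, stopwords("en"))
example_tokens = tokens_wordstem(example_tokens)

# pull the cleaned tokens out as a plain character vector
clean_words = unlist(as.list(example_tokens), use.names = FALSE)
clean_words
```

Stems are not always real words (the survey output above turns "really" into "realli"), which is fine as long as related word forms end up with the same stem.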

5. Analyse

Word Clouds

Frequency Plots
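
Frequency plots start from word counts. As a minimal sketch, the made-up tokens below are turned into a document-feature matrix with `dfm()`, and `topfeatures()` reports the most common tokens across all documents.

```r
library(quanteda)

# two tiny made-up documents
toy_tokens = tokens(c(doc1 = "water lilies water flowers",
                      doc2 = "water garden lilies"))

# a document-feature matrix: one row per document, one column per token
toy_dfm = dfm(toy_tokens)

# the most frequent token across the whole corpus
top_word = topfeatures(toy_dfm, n = 1)
top_word
```

The counts returned by `topfeatures()` are a named numeric vector, so they can be passed to whatever plotting tools you already know.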

Outputs

Code-Along

For Next Time

Topic

  • Networks as Data
  • Quiz 4 Open

To-Do