Fall 2022
November 30, 2022
To introduce advanced text mining methods and show some of what is possible.
So far, text has been an occasional part of our data sets.
We have tried to split strings to isolate and count specific items or categories.
We have essentially tried to make text into something else.
Pets |
---|
None |
None |
Dog, Plants (two) |
None |
Dog |
Dog |
Cat, Rock |
Dog |
Spider Plant |
Cat |
Dog, Reptile |
Dog |
Cat |
None |
Reptile, Plant |
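The split-and-count approach described above can be sketched in base R. This is a minimal illustration, not the code used in earlier classes: the vector name `pets` and the choice to split on commas are assumptions, and the values are copied from the table above.

```r
# Values copied from the Pets column above; the vector name is hypothetical
pets = c("None", "None", "Dog, Plants (two)", "None", "Dog", "Dog",
         "Cat, Rock", "Dog", "Spider Plant", "Cat", "Dog, Reptile",
         "Dog", "Cat", "None", "Reptile, Plant")

# Split each response on commas, then trim the leftover whitespace
pet_items = trimws(unlist(strsplit(pets, ",")))

# Count how often each item appears, most common first
pet_counts = sort(table(pet_items), decreasing = TRUE)
pet_counts
```

Note that "Plant", "Plants (two)", and "Spider Plant" are still counted as three separate items. Splitting strings turns text into categories, but it does not understand what the text means.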
Making text the focus of your analyses requires a different perspective and a different set of tools.
Our training so far is a prerequisite for working with text, but text in R uses new object structures and workflows that we will only scratch the surface of today.
Today is a survey to let you know some of what is possible, not an exhaustive catalog!
Chapter 1. Mr. Sherlock Holmes
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table. I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before. It was a fine, thick piece of wood, bulbous-headed, of the sort which is known as a “Penang lawyer.” Just under the head was a broad silver band nearly an inch across. “To James Mortimer, M.R.C.S., from his friends of the C.C.H.,” was engraved upon it, with the date “1884.” It was just such a stick as the old-fashioned family practitioner used to carry—dignified, solid, and reassuring. “Well, Watson, what do you make of it?” Holmes was sitting with his back to me, and I had given him no sign of my occupation. “How did you know what I was doing? I believe you have eyes in the back of your head.” “I have, at least, a well-polished, silver-plated coffee-pot in front of me,” said he. “But, tell me, Watson, what do you make of our visitor’s stick? Since we have been so unfortunate as to miss him and have no notion of his errand, this accidental souvenir becomes of importance. Let me hear you reconstruct the man by an examination of it.” “I think,” said I, following as far as I could the methods of my companion, “that Dr. Mortimer is a successful, elderly medical man, well-esteemed since those who know him give him this mark of their appreciation.” “Good!” said Holmes. “Excellent!” “I think also that the probability is in favour of his being a country practitioner who does a great deal of his visiting on foot.”
The tools to do text mining (and the more advanced natural language processing) move fast!
quanteda and its pals seem to be the front-runners in R right now.
quanteda
quanteda.textmodels
quanteda.textstats
quanteda.textplots
quanteda.sentiment
quanteda.tidy
Text mining often follows a familiar flow.
What do you want to compare?
What holds meaning?
The choice is important, but always specific to your project!
Once you have decided what you want to compare, you need to get your data into that format.
Your documents should contain only the meaningful content (tokens) you want to analyse.
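To make the choice of token concrete, here is a small sketch that tokenizes the same text two ways with quanteda's `tokens()` function. The two sentences are borrowed from one of the class survey responses; the object names are made up for illustration.

```r
library(quanteda)

# Two sentences borrowed from a survey response; "doc1" is a hypothetical name
txt = c(doc1 = "The song has good vibes and versatility. It can be played at any type of event.")

# Option 1: every word is a token
word_tokens = tokens(txt, what = "word", remove_punct = TRUE)

# Option 2: every sentence is a token
sentence_tokens = tokens(txt, what = "sentence")

# Number of tokens per document under each choice
ntoken(word_tokens)
ntoken(sentence_tokens)
```

The same document has 16 word tokens but only 2 sentence tokens, so whatever you count or compare later depends on this decision.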
As an example, I’ll use our class survey data, specifically what all of you said about your favorite art. Every response will be a document.
Favorite Art |
---|
I really enjoy all the work done by TeamLab. Their TeamLab Planets exhibit in Tokyo was one of the coolest things I have ever seen. It is an entire complex of large, interactive art that you walk through. One room was filled with ankle deep water that you waded through, while holographic fish swam by. If you touched any of the fish, they would explode into flowers. I also appreciate it from a technical perspective, as all the art is coded in Python. |
NA |
To preface, this is kind of an impossible question, because I can’t pick one favorite work of art. However, my favorite form of media is probably books, so I’ll just describe one that I like a lot. I really like the novel Ender’s Game, by Orson Scott Card. I first read it when I was about eleven or twelve (I have since read it at least twice and seen a movie adaptation), and I definitely associate it with the nostalgia of childhood. But that’s certainly not the main reasonI’ve always enjoyed science fiction and literally any novel containing a well-developed plot and a skillful plot twist, as well as social satire and allegories. This work embodies all of those criteria, incorporating science, a structurally sound and intriguing narrative, and social critique. And on top of that, it accomplishes all this using a well-built world populated with well-developed characters, including the protagonist Ender, who is profoundly relatable and likable. It’s the story of a brilliant child caught between a rock and a hard place, a story about war and morality, a story about space and other worlds and extinction and compassion across species. I love it. |
The song “All Night” by Beyonce is without a doubt my favorite song of all time. In my opinion, Beyonce is the best performer or all time, living or dead. Her choreography and overall stage presence is unmatched. Furthermore, she is one of the best vocalists of our time. Specifically, in “All Night” she gives us a calming song that still has a strong beat that one could bop to. The song has good vibes and versatility. It can be played at any type of event. Additionally, the background behind the lyrics is complex and further attests to Beyonce’s incredible lyricism. Finally, “All Night” lives in Beyonce’s most iconic album, Lemonade, which she made in response to her husband’s infidelity. “All Night” perfectly captures how Beyonce can turn lemons into lemonade. |
My favorite work of art would be photography, because I enjoy looking at everything visually. I feel that I forget moments quickly so being able to see an image lets me relive some memories and notice something new. I also enjoy the process of taking photos but probably when they’re less structured. A quick photo or even a video would do well for me. |
Rap and R&B music |
As of right now, my favorite work of art is the Percy Jackson series by Rick Riordan. Written for children, it manages to remain humorous to my 20-year-old self. It manages to be both relevant and rich in symbolism pertaining to ancient Greek mythology. It was also my favorite series as a child, so it has that nostalgia value. |
My favorite artwork is an artist book by Barbra Kruger named “Thinking of Me, I Mean You, I Mean Me”. This work challenges reality and highlights the wicked ways in which capitalism manipulates us. She does so through photography, intense graphics, and brief but thought-provoking written statements. |
Shepard Fairey artworks is my favorite due to the meaning behind the pieces and The Nighthawks Painting which captures my attention with the mood conveyed, style, and colors. |
I don’t know if this would count as a single work of art, but one exhibit I’ve seen recently that I really like was Beatrice Glow’s exhibit at the Baltimore Museum of Art called “Once the Smoke Clears”. I liked it a lot because of how it included many different mediums of art, including some that I’ve rarely/never seen in museums, including 3D printed objects and ‘scent experiences’. I also really loved how in it she examined the relationship between colonialism and racism and the tobacco industry throughout history, which wasn’t something that I had really thought that much about before. |
My favorite work of art is called Water Lilies by Claude Monet because since my mom really likes Monet, she took us to the MET and I saw a few of his artworks in person. So I instantly loved his work, while looking on the internet I came across Water Lilies and noticed that many of his paintings contained scenes with water in it. I just so happened to pick Water Lilies however because it combines to of my favorite things water and flowers. |
my favorite work of art is Claude Monet’s garden paintings, such as the water lilies. I like it because i find it very appealing and beautiful. Theres are different versions of the same garden, painted in different seasons, and weather, which is shown through the art. |
NA |
NA |
The marble sculpture, “Undine Rising from the Fountain”, created by the American artist Chauncey Ives in 1880 depicts the nymph Undine morphing from water form into human form. Ives utilizes technical skill to transform cold marble into flowing, soft fabric that drapes the feminine beauty of his subject. I like it because the stark difference between the media and the subject. It is similar to Raffaelle Monti’s sculptures, which are also a favorite of mine, however, I like that you can see Undine’s expression compared to Monti’s veiled faces. |
Now that we have our documents prepared, we can make our corpus.
A corpus stores our documents, like a bookshelf holds books.
Corpus consisting of 15 documents, showing 15 documents:
Text Types Tokens Sentences
Spike Spiegal 66 93 6
Doreamon 0 0 0
Sherlock Holmes 135 221 8
Tiana 91 157 10
Crush 59 69 4
Thor 6 6 1
Rhys (from A Court of Thorns and Roses) 50 66 4
Buffy 46 56 3
Sasha Braus 26 31 1
Catra 80 114 3
Pikachu 61 89 3
My Melody 39 53 3
Claire Fraser 0 0 0
Shinchan 0 0 0
Kakashi 72 101 4
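A summary like the one above can be produced along these lines. This is a sketch under assumptions: the data frame `survey` and its column names `alias` and `favorite_art` are hypothetical stand-ins for the real survey data, and only three short rows are included.

```r
library(quanteda)

# Hypothetical stand-in for the survey data: one row per respondent
survey = data.frame(
  alias = c("Thor", "Doreamon", "Claire Fraser"),
  favorite_art = c("Rap and R&B music", NA, NA),
  stringsAsFactors = FALSE
)

# Turn missing responses into empty strings so they become empty documents
survey$favorite_art[is.na(survey$favorite_art)] = ""

# Each response becomes a document, named after the alias column
survey_corpus = corpus(survey,
                       docid_field = "alias",
                       text_field = "favorite_art")

# One row per document with counts of types, tokens, and sentences
summary(survey_corpus)
```

Respondents who left the question blank stay in the corpus as documents with zero tokens, which is why some rows in the summary above are all zeros.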
The goal is to turn our tokens into the cleanest representations of meaning that we can.
Options include:
library(quanteda)

# Get all of the tokens from the documents in our corpus,
# dropping punctuation, symbols, and separators along the way
survey_tokens = tokens(survey_corpus,
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_separators = TRUE)
# Set everything to lower case
survey_tokens = tokens_tolower(survey_tokens)
# Remove all the connective tissue (common English stop words)
survey_tokens = tokens_remove(survey_tokens, stopwords("en"))
# Convert tokens to simple word stems
survey_tokens = tokens_wordstem(survey_tokens)
Tokens consisting of 15 documents.
Spike Spiegal :
[1] "realli" "enjoy" "work" "done" "teamlab" "teamlab" "planet"
[8] "exhibit" "tokyo" "one" "coolest" "thing"
[ ... and 29 more ]
Doreamon :
character(0)
Sherlock Holmes :
[1] "prefac" "kind" "imposs" "question" "pick" "one"
[7] "favorit" "work" "art" "howev" "favorit" "form"
[ ... and 93 more ]
Tiana :
[1] "song" "night" "beyonc" "without" "doubt" "favorit" "song"
[8] "time" "opinion" "beyonc" "best" "perform"
[ ... and 60 more ]
Crush :
[1] "favorit" "work" "art" "photographi" "enjoy"
[6] "look" "everyth" "visual" "feel" "forget"
[11] "moment" "quick"
[ ... and 22 more ]
Thor :
[1] "rap" "r" "b" "music"
[ reached max_ndoc ... 9 more documents ]
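To see what each cleaning step contributes, it can help to run the same pipeline on a single sentence. The sketch below uses the first sentence of the first survey response; the object names `txt` and `toks` are made up for illustration.

```r
library(quanteda)

# First sentence of the first survey response; "example" is a hypothetical name
txt = c(example = "I really enjoy all the work done by TeamLab.")

# Split into word tokens and drop the punctuation
toks = tokens(txt, remove_punct = TRUE)

# Lower-case everything so "TeamLab" and "teamlab" are treated as one token
toks = tokens_tolower(toks)

# Drop common English stop words such as "i", "all", "the", and "by"
toks = tokens_remove(toks, stopwords("en"))

# Reduce each remaining word to its stem ("really" becomes "realli")
toks = tokens_wordstem(toks)

# Show the cleaned tokens as a plain list
as.list(toks)
```

The result should match the first five stems printed for the first document above. Printing `toks` after each step is a quick way to check that a step removes only what you intended.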
SDS 192-03: Intro to Data Science