Sentiment Analysis in R
Ted Kwartler - DataCamp
Course Description
Add sentiment analysis to your text mining toolkit! Sentiment analysis is used by text miners in marketing, politics, customer service and elsewhere. In this course you will learn to identify positive and negative language, specific emotional intent, and make compelling visualizations. You will end the course by applying your sentiment analysis skills to Airbnb reviews to learn what makes for a good rental.
1 Fast & dirty: Polarity scoring
In the first chapter, you will learn how to apply qdap's sentiment function called polarity().
1.1 Let's talk about our feelings
1.1.1 Jump right in! Visualize polarity
Sentiment analysis helps you extract an author's feelings towards a subject. This exercise will give you a taste of what's to come!
We created text_df
representing a conversation with person
and text
columns.
Use qdap
's polarity()
function to score text_df
. polarity()
will accept a single character object or data frame with a grouping variable to calculate a positive or negative score.
In this example you will use the magrittr
package's dollar pipe operator %$%
. The dollar sign forwards the data frame into polarity()
and you declare a text column name or the text column and a grouping variable without quotes.
text_data_frame %$% polarity(text_column_name)
To create an object with the dollar sign operator:
polarity_object <- text_data_frame %$%
polarity(text_column_name, grouping_column_name)
More specifically, to make a quantitative judgement about the sentiment
of some text, you need to give it a score. A simple method is a positive
or negative value related to a sentence, passage or a collection of
documents called a corpus. Scoring with positive or negative values only
is called "polarity." A useful function for extracting polarity scores
is counts()
applied to the polarity object. For a quick visual call plot()
on the polarity()
outcome.
Examine the text_df conversation data frame.
# Examine the text data
text_df=read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSr1GbdxxFhoZcAqH_pkr-E61NMiKnffJdAPlbfLv5FrfJkTgOeDq8KCv1-WolHMf0N0K-5nUcMH3Ta/pub?gid=6240657&single=true&output=csv")
text_df
## person text
## 1 Nick DataCamp courses are the best
## 2 Jonathan I like talking to students
## 3 Martijn Other online data science curricula are boring.
## 4 Nicole What is for lunch?
## 5 Nick DataCamp has lots of great content!
## 6 Jonathan Students are passionate and are excited to learn
## 7 Martijn Other data science curriculum is hard to learn and to understand
## 8 Nicole I think the food here is good.
Using the %$% operator, pass text_df to polarity() along with the column name text without quotes. This will print the polarity for all text.
library(magrittr)
library(qdap)
# Calc overall polarity score
text_df %$% polarity(text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 8 53 0.214 0.393 0.544
Create datacamp_conversation by forwarding text_df with %$% to polarity(). Pass in text followed by the grouping person column. This will calculate polarity for each individual person. Since it is all within parentheses, the result will be printed too.
# Calc polarity score by person
(datacamp_conversation <- text_df %$% polarity(text, person))
## person total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Jonathan 2 13 0.577 0.184 3.141
## 2 Martijn 2 18 -0.340 0.054 -6.284
## 3 Nick 2 11 0.428 0.028 15.524
## 4 Nicole 2 11 0.189 0.267 0.707
Apply counts() to datacamp_conversation to print the specific emotional words that were found.
# Counts table from datacamp_conversation
counts(datacamp_conversation)
## person wc polarity pos.words neg.words text.var
## 1 Nick 5 0.447 best - DataCamp courses are the best
## 2 Jonathan 5 0.447 like - I like talking to students
## 3 Martijn 7 -0.378 - boring Other online data science curricula are boring.
## 4 Nicole 4 0.000 - - What is for lunch?
## 5 Nick 6 0.408 great - DataCamp has lots of great content!
## 6 Jonathan 8 0.707 passionate, excited - Students are passionate and are excited to learn
## 7 Martijn 11 -0.302 - hard Other data science curriculum is hard to learn and to understand
## 8 Nicole 7 0.378 good - I think the food here is good.
Call plot() on datacamp_conversation.
# Plot the conversation polarity
plot(datacamp_conversation)
Excellent, that was easy!
1.1.2 TM refresher (I)
In the Text Mining: Bag of Words course you learned that a corpus is a set of texts, and you studied some functions for preprocessing the text. To recap, one way to create & clean a corpus is with the functions below. Even though this is a different course, sentiment analysis is part of text mining so a refresher can be helpful.
- Turn a character vector into a text source using VectorSource().
- Turn a text source into a corpus using VCorpus().
- Remove unwanted characters from the corpus using cleaning functions like removePunctuation() and stripWhitespace() from tm, and replace_abbreviation() from qdap.
In this exercise a custom clean_corpus()
function has been created using standard preprocessing functions for easier application.
clean_corpus()
accepts the output of VCorpus()
and applies cleaning functions. For example:
processed_corpus <- clean_corpus(my_corpus)
Your R session has a text vector, tm_define
, containing two small documents and the function clean_corpus()
.
Create tm_vector by applying VectorSource() to tm_define.
# clean_corpus(), tm_define are pre-defined
clean_corpus=function(corpus){
corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
tm_define=c("Text mining is the process of distilling actionable insights from text.","Sentiment analysis represents the set of tools to extract an author's feelings towards a subject.")
library(tm)
# Create a VectorSource
tm_vector <- VectorSource(tm_define)
Create tm_corpus using VCorpus() on tm_vector.
# Apply VCorpus
tm_corpus <- VCorpus(tm_vector)
Use content() to examine the contents of the first document in tm_corpus, selected with [[1]].
# Examine the first document's contents
content(tm_corpus[[1]])
## [1] "Text mining is the process of distilling actionable insights from text."
Apply clean_corpus() to tm_corpus. Call this new object tm_clean.
# Clean the text
tm_clean <- clean_corpus(tm_corpus)
Examine the tm_clean object again to see how the text changed after clean_corpus() was applied.
# Reexamine the contents of the first doc
content(tm_clean[[1]])
## [1] "text mining process distilling actionable insights text"
Feeling fresh! If you work with text, it's useful to know how to manipulate corpora.
1.1.3 TM refresher (II)
Now let's create a Document Term Matrix (DTM). In a DTM:
- Each row of the matrix represents a document.
- Each column is a unique word token.
- Values of the matrix correspond to an individual document's word usage.
The DTM is the basis for many bag of words analyses. Later in the course, you will also use the related Term Document Matrix (TDM). This is the transpose; that is, columns represent documents and rows represent unique word tokens.
You should construct a DTM after cleaning the corpus (using clean_corpus()
). To do so, call DocumentTermMatrix()
on the corpus object.
tm_dtm <- DocumentTermMatrix(tm_clean)
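If you also want the transposed representation described above, tm provides TermDocumentMatrix(). A minimal sketch, assuming the tm_clean corpus and the tm package from the previous exercise:
# TDM: rows are unique terms, columns are documents
tm_tdm <- TermDocumentMatrix(tm_clean)
# Compare: the DTM and TDM hold the same counts with the dimensions flipped
dim(tm_dtm)
dim(tm_tdm)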
If you need a more in-depth refresher check out the Text Mining with Bag-of-Words in R course. Hopefully these two exercises have prepared you well enough to embark on your sentiment analysis journey!
Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).
We've created a VCorpus()
object called clean_text
containing 1000 tweets mentioning coffee. The tweets have been cleaned
with the previously mentioned preprocessing steps and your goal is to
create a DTM from it.
Apply DocumentTermMatrix() to the clean_text corpus to create a term frequency weighted DTM called tf_dtm.
tweets <- read.csv("https://raw.githubusercontent.com/ThanhDatIU/datacamp/main/coffee.csv", stringsAsFactors = FALSE)
coffee_tweets <- tweets$text
coffee_source <- VectorSource(coffee_tweets)
coffee_corpus <- VCorpus(coffee_source)
clean_text=clean_corpus(coffee_corpus)
# clean_text is pre-defined
clean_text
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1000
# Create tf_dtm
tf_dtm <- DocumentTermMatrix(clean_text)
Convert the DocumentTermMatrix() object into a simple matrix with as.matrix(). Call the new object tf_dtm_m.
# Create tf_dtm_m
tf_dtm_m <- as.matrix(tf_dtm)
Check the dimensions of the matrix with dim().
# Dimensions of DTM matrix
dim(tf_dtm_m)
## [1] 1000 3098
# Subset part of tf_dtm_m for comparison
tf_dtm_m[16:20, 2975:2984]
## Terms
## Docs went were west westin westside wet wfriends what whatever whatislifeee
## 16 0 0 0 0 0 0 0 0 0 0
## 17 0 0 0 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0 0
Delightful use of a DocumentTermMatrix! These things crop up regularly in text mining.
1.2 Zipf's law & subjectivity lexicon
1.2.1 What is a subjectivity lexicon?
As discussed in the video a lexicon is a list of words. What is the purpose of a subjectivity lexicon?
- A subjectivity lexicon lets you extract meaningful insights from text.
- A subjectivity lexicon is a predefined list of words associated with emotional context such as positive/negative.
- A subjectivity lexicon lets you subjectively argue your point!
1.2.2 Where can you observe Zipf's law?
Although Zipf observed a steep and predictable decline in word usage you may not buy into Zipf's law. You may be thinking "I know plenty of words, and have a distinctive vocabulary". That may be the case, but the same can't be said for most people! To prove it, let's construct a visual from 3 million tweets mentioning "#sb". Keep in mind that the visual doesn't follow Zipf's law perfectly, the tweets all mentioned the same hashtag so it is a bit skewed. That said, the visual you will make follows a steep decline showing a small lexical diversity among the millions of tweets. So there is some science behind using lexicons for natural language analysis!
In this exercise, you will use the package metricsgraphics
. Although the author suggests using the pipe %>%
operator, you will construct the graphic step-by-step to learn about
the various aspects of the plot. The main function of the package metricsgraphics
is the mjs_plot()
function which is the first step in creating a JavaScript plot. Once
you have that, you can add other layers on top of the plot.
An example metricsgraphics
workflow without using the %>%
operator is below:
metro_plot <- mjs_plot(data, x = x_axis_name, y = y_axis_name, show_rollover_text = FALSE)
metro_plot <- mjs_line(metro_plot)
metro_plot <- mjs_add_line(metro_plot, line_one_values)
metro_plot <- mjs_add_legend(metro_plot, legend = c('names', 'more_names'))
metro_plot
Call head() on sb_words to review the top words.
sb_words=read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSr1GbdxxFhoZcAqH_pkr-E61NMiKnffJdAPlbfLv5FrfJkTgOeDq8KCv1-WolHMf0N0K-5nUcMH3Ta/pub?gid=842100586&single=true&output=csv")
# Examine sb_words
head(sb_words)
## word freq rank
## 1 sb 1984423 1
## 2 rt 1700564 2
## 3 the 1101899 3
## 4 to 588803 4
## 5 a 428598 5
## 6 for 388390 6
Create expectations by dividing the largest word frequency, freq[1], by the rank column.
# Create expectations
sb_words$expectations <- sb_words %$%
{freq[1] / rank}
Create sb_plot using mjs_plot().
- Pass in sb_words with x = rank and y = freq.
- Within mjs_plot() set show_rollover_text to FALSE.
library(metricsgraphics)
# Create metrics plot
sb_plot <- mjs_plot(sb_words, x = rank, y = freq, show_rollover_text = FALSE)
Add a line to sb_plot using mjs_line() and pass in sb_plot.
# Add 1st line
sb_plot <- mjs_line(sb_plot)
Add another line to sb_plot with mjs_add_line().
- Pass in the previous sb_plot object and the vector, expectations.
# Add 2nd line
sb_plot <- mjs_add_line(sb_plot, expectations)
Add a legend to the sb_plot object using mjs_add_legend().
- Pass in the previous sb_plot object.
- The legend labels should consist of "Frequency" and "Expectation".
# Add legend
sb_plot <- mjs_add_legend(sb_plot, legend = c("Frequency", "Expectation"))
Call sb_plot to display the plot. Mouseover a point to simultaneously highlight a freq and Expectation point. The magic of JavaScript!
# Display plot
sb_plot
Great job! While you may not obey Zipf's Law, it seems like most people on Twitter do!
1.2.3 Polarity on actual text
So far you have learned the basic components needed for assessing positive or negative intent in text. Remember the following points so you can feel confident in your results.
- The subjectivity lexicon is a predefined list of words associated with emotions or positive/negative feelings.
- You don't have to list every word in a subjectivity lexicon because Zipf's law describes human expression.
A quick way to get started is to use the polarity()
function which has a built-in subjectivity lexicon.
The function scans the text to identify words in the lexicon. It then creates a cluster around an identified subjectivity word. Within the cluster valence shifters adjust the score. Valence shifters are words that amplify or negate the emotional intent of the subjectivity word. For example, "well known" is positive while "not well known" is negative. Here "not" is a negating term and reverses the emotional intent of "well known." In contrast, "very well known" employs an amplifier increasing the positive intent.
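To see valence shifters in action, you can score such phrases directly; a minimal sketch using qdap (exact scores depend on the built-in subjectivity lexicon):
library(qdap)
# Negators and amplifiers shift the score around the subjectivity word "well"
counts(polarity("It is well known"))
counts(polarity("It is not well known"))
counts(polarity("It is very well known"))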
The polarity()
function then calculates a score using
subjectivity terms, valence shifters and the total number of words in
the passage. This exercise demonstrates a simple polarity calculation.
In the next video we look under the hood of polarity()
for more detail.
Calculate the polarity()
of positive
in a new object called pos_score
. Encase the entire call in parentheses so the output is also printed.
# Example statements
positive <- "DataCamp courses are good for learning"
# Calculate polarity of both statements
(pos_score <- polarity(positive))
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 6 0.408 NA NA
Manually perform the same polarity calculation.
Apply counts() to the polarity object.
# From previous step
positive <- "DataCamp courses are good for learning"
pos_score <- polarity(positive)
# Get counts
(pos_counts <- counts(pos_score))
## all wc polarity pos.words neg.words text.var
## 1 all 6 0.408 good - DataCamp courses are good for learning
The pos_counts object contains a $pos.words element vector. Find the number of positive words, n_good, by calling length() on the first part of the $pos.words element.
# Number of positive words
n_good <- length(pos_counts$pos.words[[1]])
Capture the total number of words as n_words. This value is stored in pos_counts as the wc element.
# Total number of words
n_words <- pos_counts$wc
Verify the polarity() calculation by dividing n_good by the sqrt() of n_words. Compare the result to the pos_score value printed earlier.
# Verify polarity score
n_good / sqrt(n_words)
## [1] 0.4082483
Well done! Using the polarity()
function is much easier, and still gets the same answer!
1.3 qdap's polarity & lexicon
1.3.1 Happy songs!
Of course just positive and negative words aren't enough. In this
exercise you will learn about valence shifters which tell you about the
author's emotional intent. Previously you applied polarity()
to text without valence shifters. In this example you will see amplification and negation words in action.
Recall that an amplifying word adds 0.8 to a positive word in polarity()
so the positive score becomes 1.8. For negative words 0.8 is subtracted
so the total becomes -1.8. Then the score is divided by the square root
of the total number of words.
Consider the following example from Frank Sinatra:
- "It was a very good year"
"Good" equals 1 and "very" adds another 0.8. So, 1.8/sqrt(6) results in 0.73 polarity.
A negating word such as "not" will inverse the subjectivity score. Consider the following example from Bobby McFerrin:
- "Don't worry Be Happy"
"worry is now 1 due to the negation "don't." Adding the "happy", +1, equals 2. With 4 total words, 2 / sqrt(4)
equals a polarity score of 1.
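You can reproduce both hand calculations directly in R; a quick sketch of the arithmetic described above:
# "It was a very good year": good (+1) amplified by very (+0.8), 6 total words
(1 + 0.8) / sqrt(6)   # roughly 0.73
# "Don't worry Be Happy": worry flipped to +1 by don't, happy is +1, 4 total words
(1 + 1) / sqrt(4)     # equals 1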
Examine conversation. Note the valence shifters like "never" in the text column.
library(tidyverse)
conversation=tribble(~student,~text,
"Martijn","This restaurant is never bad",
"Nick","The lunch was very good",
"Nicole","It was awful I got food poisoning and was extremely ill")
# Examine conversation
conversation
## # A tibble: 3 × 2
## student text
## <chr> <chr>
## 1 Martijn This restaurant is never bad
## 2 Nick The lunch was very good
## 3 Nicole It was awful I got food poisoning and was extremely ill
Apply polarity() to the text column of conversation to calculate polarity for the entire conversation.
# Polarity - All
polarity(conversation$text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 3 21 0.317 0.565 0.561
Create student_pol.
- Call polarity() again, this time passing two columns of conversation.
- The text variable is text and the grouping variable is student.
# Polarity - Grouped
student_pol <- conversation %$%
polarity(text, student)
Call scores() on student_pol.
# Student results
scores(student_pol)
## student total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Martijn 1 5 0.447 NA NA
## 2 Nick 1 5 0.805 NA NA
## 3 Nicole 1 11 -0.302 NA NA
The counts() function applied to student_pol will print the sentence-level polarity for the entire data frame along with the lexicon words identified.
# Sentence by sentence
counts(student_pol)
## student wc polarity pos.words neg.words text.var
## 1 Martijn 5 0.447 - bad This restaurant is never bad
## 2 Nick 5 0.805 good - The lunch was very good
## 3 Nicole 11 -0.302 - awful It was awful I got food poisoning and was extremely ill
The student_pol object can be plotted with plot().
# qdap plot
plot(student_pol)
It was a very good piece of code you just wrote! 'Extremely good' is more positive than 'very good', which is more positive than 'good', which is more positive than 'quite good'.
1.3.2 LOL, this song is wicked good
Even with Zipf's law in action, you will still need to adjust lexicons
to fit the text source (for example twitter versus legal documents) or
the author's demographics (teenager versus the elderly). This exercise
demonstrates the explicit components of polarity()
so you can change it if needed.
In Trey Songz's song "Lol :)" there is a lyric "LOL smiley face, LOL smiley face." In the basic polarity()
function, "LOL" is not defined as positive. However, "LOL" stands for
"Laugh Out Loud" and should be positive. As a result, you should adjust
the lexicon to fit the text's context which includes pop-culture slang.
If your analysis contains text from a specific channel (Twitter's
"LOL"), location (Boston's "Wicked Good"), or age group (teenagers'
"sick") you will likely have to adjust the lexicon.
In this exercise you are not adjusting the subjectivity lexicon or qdap
dictionaries containing valence shifters. Instead you are examining the
existing word data frame objects so you can change them in the
following exercise.
We've created text
containing two excerpts from Beyoncé's "Crazy in Love" lyrics for the exercise.
Print key.pol to see a portion of the subjectivity words and values.
# Examine the key.pol
key.pol
## x y
## 1: a plus 1
## 2: abnormal -1
## 3: abolish -1
## 4: abominable -1
## 5: abominably -1
## ---
## 6775: zealously -1
## 6776: zenith 1
## 6777: zest 1
## 6778: zippy 1
## 6779: zombie -1
Print negation.words to see all the negating terms.
# Negators
negation.words
## [1] "ain't" "aren't" "can't" "couldn't" "didn't" "doesn't"
## [7] "don't" "hasn't" "isn't" "mightn't" "mustn't" "neither"
## [13] "never" "no" "nobody" "nor" "not" "shan't"
## [19] "shouldn't" "wasn't" "weren't" "won't" "wouldn't"
Print amplification.words to see the words that add value to the lexicon.
# Amplifiers
amplification.words
## [1] "acute" "acutely" "certain" "certainly"
## [5] "colossal" "colossally" "deep" "deeply"
## [9] "definite" "definitely" "enormous" "enormously"
## [13] "extreme" "extremely" "great" "greatly"
## [17] "heavily" "heavy" "high" "highly"
## [21] "huge" "hugely" "immense" "immensely"
## [25] "incalculable" "incalculably" "massive" "massively"
## [29] "more" "particular" "particularly" "purpose"
## [33] "purposely" "quite" "real" "really"
## [37] "serious" "seriously" "severe" "severely"
## [41] "significant" "significantly" "sure" "surely"
## [45] "true" "truly" "vast" "vastly"
## [49] "very"
Print deamplification.words to see the words that reduce the lexicon values.
# De-amplifiers
deamplification.words
## [1] "barely" "faintly" "few" "hardly" "little"
## [6] "only" "rarely" "seldom" "slightly" "sparsely"
## [11] "sporadically" "very few" "very little"
Print text to see the conversation.
# Examine
text=tribble(~speaker,~words,"beyonce","I know I dont understand Just how your love can do what no one else can","jay_z","They cant figure him out they like hey, is he insane")
text
## # A tibble: 2 × 2
## speaker words
## <chr> <chr>
## 1 beyonce I know I dont understand Just how your love can do what no one else c…
## 2 jay_z They cant figure him out they like hey, is he insane
Calculate polarity() as follows:
- Set text.var to text$words.
- Set grouping.var to text$speaker.
- Set polarity.frame to key.pol.
- Set negators to negation.words.
- Set amplifiers to amplification.words.
- Set deamplifiers to deamplification.words.
# Complete the polarity parameters
polarity(
text.var = text$words,
grouping.var = text$speaker,
polarity.frame = key.pol,
negators = negation.words,
amplifiers = amplification.words,
deamplifiers = deamplification.words
)
## speaker total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 beyonce 1 16 0.25 NA NA
## 2 jay_z 1 11 0.00 NA NA
Powerful polarizing! The polarity()
function is very flexible and allows you to override the score given to each word.
1.3.3 Stressed Out!
Here you will adjust the negative words to account for the specific text. You will then compare the basic and custom polarity()
scores.
A popular song from Twenty One Pilots is called "Stressed Out". If you scan the song lyrics, you will observe the song is about youthful nostalgia. Overall, most people would say the polarity is negative. Repeatedly the lyrics mention stress, fears and pretending.
Let's compare the song lyrics using the default subjectivity lexicon and also a custom one.
To start, you need to verify the key.pol
subjectivity lexicon does not already have the term you want to add. One way to check is with grep()
. The grep()
function returns the row containing characters that match a search pattern. Here is an example used while indexing.
data_frame[grep("search_pattern", data_frame$column), ]
After verifying the slang or new word is not already in the key.pol
lexicon you need to add it. The code below uses sentiment_frame()
to construct the new lexicon. Within the code sentiment_frame()
accepts the original positive word vector, positive.words
. Next, the original negative.words
are concatenated to "smh" and "kappa", both considered negative slang.
Although you can declare the positive and negative weights, the default
is 1 and -1 so they are not included below.
custom_pol <- sentiment_frame(positive.words, c(negative.words, "smh", "kappa"))
Now you are ready to apply polarity and it will reference the custom subjectivity lexicon!
We've created stressed_out
which contains the lyrics to the song "Stressed Out", by Twenty One Pilots.
Call polarity() on stressed_out to see the default score.
stressed_out="I wish I found some better sounds no ones ever heard\nI wish I had a better voice that sang some better words\nI wish I found some chords in an order that is new\nI wish I didnt have to rhyme every time I sang\nI was told when I get older all my fears would shrink\nBut now Im insecure and I care what people think\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWere stressed out\nSometimes a certain smell will take me back to when I was young\nHow come Im never able to identify where its coming from\nId make a candle out of it if I ever found it\nTry to sell it never sell out of it Id probably only sell one\nItd be to my brother, cause we have the same nose\nSame clothes homegrown a stones throw from a creek we used to roam\nBut it would remind us of when nothing really mattered\nOut of student loans and tree-house homes we all would take the latter\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWe used to play pretend, give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face #\nSaying, Wake up you need to make money\nYeah\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nUsed to play pretend, used to play pretend bunny\nWe used to play pretend wake up, you need the money\nUsed to play pretend used to play pretend bunny\nWe used to play pretend, wake up, you need the money\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah"
# stressed_out has been pre-defined
head(stressed_out)
## [1] "I wish I found some better sounds no ones ever heard\nI wish I had a better voice that sang some better words\nI wish I found some chords in an order that is new\nI wish I didnt have to rhyme every time I sang\nI was told when I get older all my fears would shrink\nBut now Im insecure and I care what people think\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWere stressed out\nSometimes a certain smell will take me back to when I was young\nHow come Im never able to identify where its coming from\nId make a candle out of it if I ever found it\nTry to sell it never sell out of it Id probably only sell one\nItd be to my brother, cause we have the same nose\nSame clothes homegrown a stones throw from a creek we used to roam\nBut it would remind us of when nothing really mattered\nOut of student loans and tree-house homes we all would take the latter\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWe used to play pretend, give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face #\nSaying, Wake up you need to make money\nYeah\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nUsed to play pretend, used to play pretend bunny\nWe used to play pretend wake up, you need the money\nUsed to play pretend used to play pretend bunny\nWe used to play pretend, wake up, you need the money\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah"
# Basic lexicon score
polarity(stressed_out)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 518 -0.255 NA NA
Search key.pol for any words containing "stress". Use grep() to index the data frame by searching in the x column.
# Check the subjectivity lexicon
key.pol[grep("stress", x)]
## x y
## 1: distress -1
## 2: distressed -1
## 3: distressing -1
## 4: distressingly -1
## 5: mistress -1
## 6: stress -1
## 7: stresses -1
## 8: stressful -1
## 9: stressfully -1
Create custom_pol as a new sentiment data frame. Use sentiment_frame() and pass positive.words as the first argument without concatenating any new terms. Use c() to combine negative.words with the new terms "stressed" and "turn back".
# New lexicon
custom_pol <- sentiment_frame(positive.words, c(negative.words, "stressed", "turn back"))
Apply polarity() to stressed_out with the additional parameter polarity.frame = custom_pol to compare how the new words change the score to a more accurate representation of the song.
# Compare new score
polarity(stressed_out, polarity.frame = custom_pol)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 518 -0.826 NA NA
Great work! It's important to take the specific features of the text you're analyzing into account so that you can make sure your results are accurate.
2 Sentiment analysis the tidytext way
In the second chapter you will explore 3 subjectivity lexicons from tidytext. Then you will do an inner join to score some text.
2.1 Wheel of emotion
2.1.1 One theory of emotion
What is the philosophical basis for the Plutchik's wheel of emotion?
- Plutchik wanted a round framework so made it like a wheel.
- Plutchik was an angry person and wanted to explain his actions to others.
- Plutchik believed the primary emotions were formed as survival mechanisms in humans and animals.
- Plutchik performed extensive field tests with sloths in the field because they are slow to react.
2.1.2 DTM vs. tidytext matrix
The tidyverse is a collection of R packages that share common philosophies and are designed to work together. This chapter covers some tidy functions to manipulate data. In this exercise you will compare a DTM to a tidy text data frame called a tibble.
Within the tidyverse, each observation is a single row in a data frame.
That makes working in different packages much easier since the
fundamental data structure is the same. Parts of this course borrow
heavily from the tidytext
package which uses this data organization.
For example, you may already be familiar with the %>%
operator from the magrittr
package. This forwards an object on its left-hand side as the first argument of the function on its right-hand side.
In the example below, you are forwarding the data
object to function1()
. Notice how the parentheses are empty. This in turn is forwarded to function2()
. In the last function you don't have to add the data
object because it was forwarded from the output of function1()
. However, you do add a fictitious parameter, some_parameter
as TRUE
. These pipe forwards ultimately create the object
.
object <- data %>%
function1() %>%
function2(some_parameter = TRUE)
To use the %>%
operator, you don't necessarily need to load the magrittr
package, since it is also available in the dplyr
package. dplyr
also contains the functions inner_join()
(which you'll learn more about later) and count()
for tallying data. The last function you'll need is mutate()
to create new variables or modify existing ones.
object <- data %>%
mutate(new_Var_name = Var1 - Var2)
or to modify a variable
object <- data %>%
mutate(Var1 = as.factor(Var1))
You will also use tidyr
's spread()
function to organize the data with each row being a line from the book and the positive and negative values as columns.
| index | negative | positive |
|---|---|---|
| 42 | 2 | 0 |
| 43 | 0 | 1 |
| 44 | 1 | 0 |
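Below is a minimal tidyr sketch showing how a table of that shape comes out of spread(); the toy index values and counts simply mirror the table above:
library(dplyr)
library(tidyr)
library(tibble)
# Long format: one row per (line, sentiment) pair, counts in n
long_counts <- tribble(
  ~index, ~sentiment, ~n,
      42, "negative",  2,
      43, "positive",  1,
      44, "negative",  1
)
# Wide format: each row is a line, sentiments become columns, missing counts become 0
long_counts %>%
  spread(sentiment, n, fill = 0)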
To change a DTM to a tidy format use tidy()
from the broom
package.
tidy_format <- tidy(Document_Term_Matrix)
This exercise uses text from the Greek tragedy, Agamemnon. Agamemnon is a story about marital infidelity and murder.
We've already created a clean DTM called ag_dtm
for this exercise.
Create ag_dtm_m by applying as.matrix() to ag_dtm.
library(tidytext)
file="https://raw.githubusercontent.com/ThanhDatIU/datacamp/main/pg14417.txt"
ag=read_lines(file, skip = 0, n_max = -1L)
ag_source <- VectorSource(ag)
ag_corpus <- VCorpus(ag_source)
ag_text=clean_corpus(ag_corpus)
# Create tf_dtm
ag_dtm <- DocumentTermMatrix(ag_text)
# As matrix
ag_dtm_m <- as.matrix(ag_dtm)
Using [ and ], index ag_dtm_m at row 2206 and columns 245:250.
# Examine line 2206 and columns 245:250
ag_dtm_m[2206, 245:250]
## birds birdthroated birth bite bitter black
## 0 0 0 0 0 0
Apply tidy() to ag_dtm. Call the new object ag_tidy.
library(tidytext)
# Tidy up the DTM
ag_tidy <- tidy(ag_dtm)
Examine ag_tidy at rows [831:835, ] to compare the tidy format. You will see a common word from the examined part of ag_dtm_m in step 2.
# Examine tidy with a word you saw
ag_tidy[831:835, ]
## # A tibble: 5 × 3
## document term count
## <chr> <chr> <dbl>
## 1 207 whateer 1
## 2 207 zeus 2
## 3 208 hear 1
## 4 208 love 1
## 5 208 name 1
Aces! See the difference?
2.1.3 Getting Sentiment Lexicons
So far you have used a single lexicon. Now we will transition to using three, each measuring sentiment in different ways.
The tidytext
package contains a function called get_sentiments() which, along with the textdata package, allows you to download and interact with well-researched lexicons. Here is a small section of the loughran lexicon.
| Word | Sentiment |
|---|---|
| abandoned | negative |
| abandoning | negative |
| abandonment | negative |
| abandonments | negative |
| abandons | negative |
This lexicon contains 4150 terms with corresponding information. We will be exploring other lexicons but the structure & method to get them is similar.
Let's use tidytext
with textdata
to explore other lexicons' word labels!
Use get_sentiments() to obtain the "afinn" lexicon, assigning to afinn_lex.
library(textdata)
afinn_lex=read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vS4tUUP2pJ1A6crxDKSY6Be7Rp2QlZjase7ubLcPUXcnwE7xHKkXGuV3V8WpxsJFQpOEDuFpxb2qfbh/pub?gid=1069070907&single=true&output=csv")
Take a count() of value in afinn_lex.
# Count AFINN scores
afinn_lex %>%
count(value)
## value n
## 1 -5 16
## 2 -4 43
## 3 -3 264
## 4 -2 966
## 5 -1 309
## 6 0 1
## 7 1 208
## 8 2 448
## 9 3 172
## 10 4 45
## 11 5 5
Do the same again, this time with the "nrc" lexicon. That is,
- get the sentiments, assigning to nrc_lex, then
- count the sentiment column, assigning to nrc_counts.
library(syuzhet)
library(tidytext)
# Subset to nrc
nrc_lex <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRci0CCoGxbcZrBd_PR5kcgSt9jqYICsgHxBj40hbUuSxr4XUKPmvV1EssEy9EnvC5eD9LH12w2wnWI/pub?gid=1381586899&single=true&output=csv")
# Make the nrc counts object
nrc_counts <- nrc_lex %>%
count(sentiment)
- Create a ggplot labeling the y-axis as n vs. an x-axis of sentiment.
- Add a col layer using geom_col(). (This is like geom_bar(), but used when you've already summarized with count().)
# From previous step
nrc_counts <- nrc_lex %>%
count(sentiment)
library(ggthemes)
# Plot n vs. sentiment
ggplot(nrc_counts, aes(x = sentiment, y = n)) +
# Add a col layer
geom_col() +
theme_gdocs()
Lovely lexicon exploration! Negative words are the most common type in the NRC lexicon.
2.2 Bing lexicon
2.2.1 Bing tidy polarity: Simple example
Now that you understand the basics of an inner join, let's apply this to the "Bing" lexicon. Keep in mind the inner_join()
function comes from dplyr
and the lexicon object is obtained using tidytext
's get_sentiments()
function.
The Bing lexicon labels words as positive or negative. The next three
exercises let you interact with this specific lexicon. To use get_sentiments()
pass in a string such as "afinn", "bing", "nrc", or "loughran" to download the specific lexicon.
The inner join workflow:
- Obtain the correct lexicon using get_sentiments().
- Pass the lexicon and the tidy text data to inner_join().
- In order for inner_join() to work there must be a shared column name. If there are no shared column names, declare them with an additional parameter, by, equal to c() with column names like below.
object <- x %>%
inner_join(y, by = c("column_from_x" = "column_from_y"))
- Perform some aggregation and analysis on the table intersection.
We've loaded ag_txt
containing the text of Agamemnon and ag_tidy
which is the tidy version.
Calculate polarity() on ag_txt.
ag_txt=ag
# Qdap polarity
polarity(ag_txt)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 2683 15145 -0.054 0.315 -0.171
"bing"
lexicon by passing that string to get_sentiments()
.
# Get Bing lexicon
bing <- get_sentiments("bing")
Perform an inner_join() with ag_tidy and bing.
- The word columns are called "term" in ag_tidy and "word" in the lexicon, so declare the by argument.
- Call the new object ag_bing_words.
# Join text to lexicon
ag_bing_words <- inner_join(ag_tidy, bing, by = c("term" = "word"))
Print ag_bing_words and look at some of the words that are in the result.
# Examine
ag_bing_words
## # A tibble: 1,425 × 4
## document term count sentiment
## <chr> <chr> <dbl> <chr>
## 1 7 waste 1 negative
## 2 8 respite 1 positive
## 3 10 well 1 positive
## 4 11 lonely 1 negative
## 5 13 great 1 positive
## 6 13 heavenly 1 positive
## 7 19 dark 1 negative
## 8 20 fear 1 negative
## 9 21 warning 1 negative
## 10 22 well 1 positive
## # … with 1,415 more rows
Pass ag_bing_words to count() of sentiment using the pipe operator, %>%. Compare the polarity() score to the sentiment count ratio.
# Get counts by sentiment
ag_bing_words %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 904
## 2 positive 521
Great work! Did you notice the sentiment count() ratio? It's 904:521.
2.2.2 Bing tidy polarity: Count & spread the white whale
In this exercise you will apply another inner_join()
using the "bing"
lexicon.
Then you will manipulate the results with both count()
from dplyr
and spread()
from tidyr
to learn about the text.
The spread()
function spreads a key-value pair across
multiple columns. In this case the key is the sentiment & the values
are the frequency of positive or negative terms for each line. Using spread()
changes the data so that each row now has positive and negative values, even if it is 0.
In this exercise, your R session has m_dick_tidy
which contains the book Moby Dick and bing
, containing the lexicon similar to the previous exercise.
Perform an inner_join() on m_dick_tidy and bing.
- As before, join the "term" column in m_dick_tidy to the "word" column in the lexicon.
- Call the new object moby_lex_words.
file="https://raw.githubusercontent.com/ThanhDatIU/datacamp/main/moby10b.txt"
m_dick_tidy=read_lines(file, skip = 0)
dick_tidy_source <- VectorSource(m_dick_tidy)
dick_corpus <- VCorpus(dick_tidy_source)
dick_text=clean_corpus(dick_corpus)
# Create tf_dtm
dick_dtm <- DocumentTermMatrix(dick_text)
# As matrix
dick_dtm_m <- as.matrix(dick_dtm)
m_dick_tidy=tidy(dick_dtm)
bing=get_sentiments("bing")
# Inner join
moby_lex_words <- inner_join(m_dick_tidy, bing, by = c("term" = "word"))
Create a new column, index, equal to as.numeric() applied to document. This occurs within mutate() in the tidyverse.
moby_lex_words <- moby_lex_words %>%
# Set index to numeric document
mutate(index = as.numeric(document))
Create moby_count by forwarding moby_lex_words to count(), passing in sentiment, index.
moby_count <- moby_lex_words %>%
# Count by sentiment, index
count(sentiment, index)
# Examine the counts
moby_count
## # A tibble: 10,531 × 3
## sentiment index n
## <chr> <dbl> <int>
## 1 negative 9 1
## 2 negative 10 1
## 3 negative 22 1
## 4 negative 42 1
## 5 negative 43 2
## 6 negative 45 1
## 7 negative 58 1
## 8 negative 65 1
## 9 negative 67 1
## 10 negative 70 1
## # … with 10,521 more rows
Create moby_spread by piping moby_count to spread(), which contains sentiment, n, and fill = 0.
moby_spread <- moby_count %>%
# Spread sentiments
spread(sentiment, n, fill = 0)
# Review the spread data
moby_spread
## # A tibble: 9,246 × 3
## index negative positive
## <dbl> <dbl> <dbl>
## 1 9 1 0
## 2 10 1 0
## 3 13 0 1
## 4 17 0 1
## 5 19 0 1
## 6 22 1 0
## 7 24 0 1
## 8 25 0 1
## 9 31 0 2
## 10 35 0 2
## # … with 9,236 more rows
Excellent work! You slew the data wrangling white whale!
2.2.3 Bing tidy polarity: Call me Ishmael (with ggplot2)!
The last Bing lexicon exercise! In this exercise you will use the pipe operator (%>%
) to create a timeline of the sentiment in Moby Dick.
In the end you will also create a simple visual following the code
structure below. The next chapter goes into more depth for visuals.
ggplot(spread_data, aes(index_column, polarity_column)) +
geom_smooth()
Inner join moby to the bing lexicon.
- Call inner_join() to join the tibbles.
- Join by the term column in the text and the word column in the lexicon.
Count by sentiment and index.
- Call spread().
- The key column (to split into multiple columns) is sentiment.
- The value column (containing the counts) is n.
- Also specify fill = 0 to fill out missing values with a zero.
Use mutate() to add the polarity column. Define it as the difference between the positive and negative columns.
moby=m_dick_tidy
moby_polarity <- moby %>%
# Inner join to lexicon
inner_join(bing, by = c("term" = "word")) %>% mutate(index=row_number()) %>%
# Count the sentiment scores
count(sentiment, index) %>%
# Spread the sentiment into positive and negative columns
spread(sentiment, n, fill = 0) %>%
# Add polarity column
mutate(polarity = positive - negative)
Using moby_polarity, plot polarity vs. index.
# From previous step
moby_polarity <- moby %>%
inner_join(bing, by = c("term" = "word")) %>% mutate(index=row_number()) %>%
count(sentiment, index) %>%
spread(sentiment, n, fill = 0) %>%
mutate(polarity = positive - negative)
Add a smooth trend curve with geom_smooth(), called with no arguments.
# Plot polarity vs. index
ggplot(moby_polarity, aes(index, polarity)) +
# Add a smooth trend curve
geom_smooth()
Call me pleased with your work! Does Moby Dick have a happy ending?
2.3 AFINN & NRC lexicon
2.3.1 AFINN: I'm your Huckleberry
Now we transition to the AFINN lexicon. The AFINN lexicon has numeric
values from 5 to -5, not just positive or negative. Unlike the Bing
lexicon's sentiment
, the AFINN lexicon's sentiment score column is called value
.
As before, you apply inner_join()
then count()
. Next, to sum the scores of each line, we use dplyr
's group_by()
and summarize()
functions. The group_by()
function takes an existing data frame and converts it into a grouped
data frame where operations are performed "by group". Then, the summarize()
function lets you calculate a value for each group in your data frame using a function that aggregates data, like sum()
or mean()
. So, in our case we can do something like
data_frame %>%
group_by(book_line) %>%
summarize(total_value = sum(value))
In the tidy version of Huckleberry Finn, line 9703 contains words "best", "ever", "fun", "life" and "spirit". "best" and "fun" have AFINN scores of 3 and 4 respectively. After aggregating, line 9703 will have a total score of 7.
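A small self-contained sketch of that aggregation, with toy data mirroring line 9703 (only the AFINN-scored words "best" and "fun" are kept here):
library(dplyr)
library(tibble)
# Toy AFINN-scored words from a single book line
line_9703 <- tribble(
  ~line, ~term,  ~value,
   9703, "best",      3,
   9703, "fun",       4
)
# Sum the AFINN values for each line: 3 + 4 = 7
line_9703 %>%
  group_by(line) %>%
  summarize(total_value = sum(value))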
In the tidyverse, filter()
is preferred to subset()
because it combines the functionality of subset()
with simpler syntax. Here is an example that filter()
s data_frame
where some value in column1
is equal to 24
. Notice the column name is not in quotes.
filter(data_frame, column1 == 24)
The afinn
object contains the AFINN lexicon. The huck
object is a tidy version of Mark Twain's Adventures of Huckleberry Finn for analysis.
Line 5400 is All the loafers looked glad; I reckoned they was used to having fun out of Boggs. Stopwords and punctuation have already been removed in the dataset.
df=readRDS("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/all_books.rds")
huck=df %>% filter(book=="huck_finn") %>% mutate(line=document) %>% select(term,count,line)
# See abbreviated line 5400
huck %>% filter(line == 5400)
## # A tibble: 7 × 3
## term count line
## <chr> <dbl> <chr>
## 1 all 1 5400
## 2 fun 1 5400
## 3 glad 1 5400
## 4 loafers 1 5400
## 5 looked 1 5400
## 6 reckoned 1 5400
## 7 used 1 5400
afinn=afinn_lex
# What are the scores of the sentiment words?
afinn %>% filter(word %in% c("fun", "glad"))
## word value
## 1 fun 4
## 2 glad 3
Use inner_join() to join huck to the afinn lexicon.
- Remember huck is already piped into the function so just add the lexicon.
- Join by the term column in the text and the word column in the lexicon.
Use count() with value and line to tally observations by group.
- Assign the result to huck_afinn.
huck_afinn <- huck %>%
# Inner Join to AFINN lexicon
inner_join(afinn, by = c("term" = "word")) %>%
# Count by value and line
count(value, line)
Aggregate by piping huck_afinn to group_by() and passing line without quotes.
- Create huck_afinn_agg using summarize(), setting total_value equal to the sum() of value * n.
huck_afinn_agg <- huck_afinn %>%
# Group by line
group_by(line) %>%
# Sum values times n (by line)
summarize(total_value = sum(value * n))
Use filter() on huck_afinn_agg with line == 5400 to review a single line.
huck_afinn_agg %>%
# Filter for line 5400
filter(line == 5400)
## # A tibble: 1 × 2
## line total_value
## <chr> <int>
## 1 5400 7
- Create a sentiment timeline. Pass huck_afinn_agg to the data argument of ggplot().
- Then specify the x and y within aes() as line and total_value without quotes.
- Add a layer with geom_smooth().
# Plot total_value vs. line
ggplot(huck_afinn_agg, aes(line, total_value)) +
# Add a smooth trend curve
geom_smooth()
Wow, you're a tidytext wizard! Huckleberry Finn has a not-quite-happy ending.
2.3.2 The wonderful wizard of NRC
Last but not least, you get to work with the NRC lexicon which labels words across multiple emotional states. Remember Plutchik's wheel of emotion? The NRC lexicon tags words according to Plutchik's 8 emotions plus positive/negative.
In this exercise there is a new operator, %in%
, which matches a vector to another. In the code below %in%
will return FALSE
, FALSE
, TRUE
. This is because within some_vec
, 1
and 2
are not found within some_other_vector
but 3
is found and returns TRUE
. The %in%
is useful to find matches.
some_vec <- c(1, 2, 3)
some_other_vector <- c(3, "a", "b")
some_vec %in% some_other_vector
Another new operator is !
. For logical conditions, adding !
will inverse the result. In the above example, the FALSE
, FALSE
, TRUE
will become TRUE
, TRUE
, FALSE
. Using it in concert with %in%
will inverse the response and is good for removing items that are matched.
!some_vec %in% some_other_vector
We've created oz
which is the tidy version of The Wizard of Oz along with nrc
containing the "NRC" lexicon with renamed columns.
Inner join oz to the nrc lexicon.
- Call inner_join() to join the tibbles.
- Join by the term column in the text and the word column in the lexicon.
- Use filter() to keep rows where the sentiment is not "positive" or "negative".
- Call group_by(), passing sentiment without quotes.
- Call summarize(), setting total_count equal to the sum() of count.
- Assign the result to oz_plutchik.
oz_txt=readLines("https://raw.githubusercontent.com/kwartler/text_mining/master/Wizard_Of_Oz.txt")
oz_source=VectorSource(oz_txt)
oz_corpus <- VCorpus(oz_source)
clean_text=clean_corpus(oz_corpus)
tf_dtm <- DocumentTermMatrix(clean_text)
nrc=nrc_lex
oz=tidy(tf_dtm)
oz_plutchik <- oz %>%
# Join to nrc lexicon by term = word
inner_join(nrc, by = c("term" = "word")) %>%
# Only consider Plutchik sentiments
filter(!sentiment %in% c("positive", "negative")) %>%
# Group by sentiment
group_by(sentiment) %>%
# Get total count by sentiment
summarize(total_count = sum(count))
- Create a bar plot with ggplot().
- Pass in oz_plutchik to the data argument.
- Then specify the x and y aesthetics, calling aes() and passing sentiment and total_count without quotes.
- Add a column geom with geom_col(). (This is the same as geom_bar(), but doesn't summarize the data, since you've done that already.)
# Plot total_count vs. sentiment
ggplot(oz_plutchik, aes(x = sentiment, y = total_count)) +
# Add a column geom
geom_col()
Your childhood memories are correct: The Wizard of Oz is a scary story. Fear is the most prevalent sentiment in this text.
3 Visualizing sentiment
Make compelling visuals with your sentiment output.
3.1 Parlor trick or worthwhile?
3.1.1 Real insight?
You are given a stack of 10 employee surveys and told to figure out the team's sentiment. The two question survey has 1 question with a numeric scale (1-10) where employees answer how inspired they are at work and a second question for free form text.
You are asked to perform a sentiment analysis on the free form text. Would performing sentiment analysis on the text be appropriate?
- Yes, the sentiment analysis confirms the employee ratings.
- No, the free form text will correlate with the ratings and with only 10 surveys the results may have selection and simultaneity bias.
3.1.2 Unhappy ending? Chronological polarity
Sometimes you want to track sentiment over time. For example, during an ad campaign you could track brand sentiment to see the campaign's effect. You saw a few examples of this at the end of the last chapter.
In this exercise you'll recap the workflow for exploring sentiment over time using the novel Moby Dick. One should expect that happy moments in the book would have more positive words than negative. Conversely dark moments and sad endings should use more negative language. You'll also see some tricks to make your sentiment time series more visually appealing.
Recall that the workflow is:
- Inner join the text to the lexicon by word.
- Count the sentiments by line.
- Reshape the data so each sentiment has its own column.
- (Depending upon the lexicon) Calculate the polarity as positive score minus negative score.
- Draw the polarity time series.
This exercise should look familiar: it extends Bing tidy polarity: Call me Ishmael (with ggplot2)!.
Use inner_join() to join the pre-loaded tidy version of Moby Dick, moby, to the bing lexicon.
- Join by the "term" column in the text and the "word" column in the lexicon.
Count by sentiment and index.
Call spread() with the column sentiment and the counts column called n.
- Also specify fill = 0 to fill out missing values with a zero.
Using mutate(), add two columns: polarity and line_number.
- Set polarity equal to the positive score minus the negative score.
- Set line_number equal to the row number using the row_number() function.
moby_polarity <- moby %>%
# Inner join to the lexicon
inner_join(bing, by = c("term" = "word")) %>% mutate(index=row_number()) %>%
# Count by sentiment, index
count(sentiment, index) %>%
# Spread sentiments
spread(sentiment, n, fill = 0) %>%
mutate(
# Add polarity field
polarity = positive - negative,
# Add line number field
line_number = row_number()
)
- Create a sentiment time series with ggplot().
- Pass in moby_polarity to the data argument.
- Call aes() and pass in line_number and polarity without quotes.
- Add a smoothed curve with geom_smooth().
- Add a red horizontal line at zero by calling geom_hline(), with parameters 0 and "red".
- Add a title with ggtitle() set to "Moby Dick Chronological Polarity".
# Plot polarity vs. line_number
ggplot(moby_polarity, aes(line_number, polarity)) +
# Add a smooth trend curve
geom_smooth() +
# Add a horizontal line at y = 0
geom_hline(yintercept = 0, color = "red") +
# Add a plot title
ggtitle("Moby Dick Chronological Polarity") +
theme_gdocs()
Nice data viz! The story isn't much happier this time around!
3.1.3 Word impact, frequency analysis
One of the easiest ways to explore data is with a frequency analysis.
Although not difficult, in sentiment analysis this simple method can be
surprisingly illuminating. Specifically, you will build a barplot. In
this exercise you are once again working with moby
and bing
to construct your visual.
To get the bars ordered from lowest to highest, you will use a trick with factors. reorder()
lets you change the order of factor levels based upon another scoring
variable. In this case, you will reorder the factor variable term
by the scoring variable polarity
.
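As a quick, standalone illustration of that trick (the words and scores below are made up for demonstration):
# Toy words and polarity-like scores
word  <- c("bad", "awesome", "fine")
score <- c(-5, 7, 1)
# Default factor levels are alphabetical
levels(factor(word))
# reorder() re-levels word by score, from lowest to highest
levels(reorder(word, score))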
Create moby_tidy_sentiment.
moby_tidy_sentiment <- moby %>%
# Inner join to bing lexicon by term = word
inner_join(bing, by = c("term" = "word")) %>%
# Count by term and sentiment, weighted by count
count(term, sentiment, wt = count) %>%
# Spread sentiment, using n as values
spread(sentiment, n, fill = 0) %>%
# Mutate to add a polarity column
mutate(polarity = positive - negative)
Print moby_tidy_sentiment to review and compare it to the previous exercise.
# Review
moby_tidy_sentiment
## # A tibble: 2,344 × 4
## term negative positive polarity
## <chr> <dbl> <dbl> <dbl>
## 1 abominable 3 0 -3
## 2 abominate 1 0 -1
## 3 abomination 1 0 -1
## 4 abound 0 3 3
## 5 abruptly 2 0 -2
## 6 absence 5 0 -5
## 7 absurd 3 0 -3
## 8 absurdly 1 0 -1
## 9 abundance 0 3 3
## 10 abundant 0 2 2
## # … with 2,334 more rows
moby_tidy_pol <- moby_tidy_sentiment %>%
# Filter for absolute polarity at least 50
filter(abs(polarity) >= 50) %>%
# Add positive/negative status
mutate(
pos_or_neg = ifelse(polarity > 0, "positive", "negative")
)
- Using moby_tidy_pol, plot polarity vs. term, reordered by polarity (reorder(term, polarity)), filled by pos_or_neg.
- Inside element_text(), rotate the x-axis text 90 degrees by setting angle = 90 and shifting the vertical justification with vjust = -0.1.
# Plot polarity vs. (term reordered by polarity), filled by pos_or_neg
ggplot(moby_tidy_pol, aes(reorder(term, polarity), polarity, fill = pos_or_neg)) +
geom_col() +
ggtitle("Moby Dick: Sentiment Word Frequency") +
theme_gdocs() +
# Rotate text and vertically justify
theme(axis.text.x = element_text(angle = 90, vjust = -0.1))
Amazing! You went all the way from documents to visualizations in no time at all.
3.2 Introspection
3.2.1 Divide & conquer: Using polarity for a comparison cloud
Now that you have seen how polarity can be used to divide a corpus, let's do it! This code will walk you through dividing a corpus based on sentiment so you can peer into the information in subsets instead of holistically.
Your R session has oz_pol
which was created by applying polarity()
to "The Wonderful Wizard of Oz."
For simplicity's sake, we created a simple custom function called pol_subsections()
which will divide the corpus by polarity score. First, the function
accepts a data frame with each row being a sentence or document of the
corpus. The data frame is subset anywhere the polarity values are
greater than or less than 0. Finally, the positive and negative
sentences, non-zero polarities, are pasted with parameter collapse
so that the terms are grouped into a single corpus. Lastly, the two
documents are concatenated into a single vector of two distinct
documents.
pol_subsections <- function(df) {
  x.pos <- subset(df$text, df$polarity > 0)
  x.neg <- subset(df$text, df$polarity < 0)
x.pos <- paste(x.pos, collapse = " ")
x.neg <- paste(x.neg, collapse = " ")
all.terms <- c(x.pos, x.neg)
return(all.terms)
}
At this point you have omitted the neutral sentences and want to focus
on organizing the remaining text. In this exercise we use the %>%
operator again to forward objects to functions. After some simple cleaning use comparison.cloud()
to make the visual.
- Create oz_df from the all element of oz_pol.
- Using select(), declare the first column text as text.var, which is the raw text. The second column polarity should refer to the polarity scores.
oz_pol=polarity(oz_txt)
oz_df <- oz_pol$all %>%
# Select text.var as text and polarity
select(text = text.var, polarity = polarity)
Apply the custom function pol_subsections() to oz_df. Call the new object all_terms.
pol_subsections=function(df) {
x.pos <- subset(df$text, df$polarity > 0)
x.neg <- subset(df$text, df$polarity < 0)
x.pos <- paste(x.pos, collapse = " ")
x.neg <- paste(x.neg, collapse = " ")
all.terms <- c(x.pos, x.neg)
return(all.terms)
}
# Apply custom function pol_subsections()
all_terms <- pol_subsections(oz_df)
To create all_corpus, apply VectorSource() to all_terms and then %>% to VCorpus().
all_corpus <- all_terms %>%
# Source from a vector
VectorSource() %>%
# Make a volatile corpus
VCorpus()
- Create a term-document matrix, all_tdm, using TermDocumentMatrix() on all_corpus.
- Add in the parameter control = list(removePunctuation = TRUE, stopwords = stopwords(kind = "en")).
- Then %>% to as.matrix() and %>% again to set_colnames(c("positive", "negative")).
all_tdm=readRDS("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/all_tdm.rds")
all_tdm <- TermDocumentMatrix(
# Create TDM from corpus
all_corpus,
control = list(
# Yes, remove the punctuation
removePunctuation = TRUE,
# Use English stopwords
stopwords = stopwords(kind = "en")
)
) %>%
# Convert to matrix
as.matrix() %>%
# Set column names
set_colnames(c("positive", "negative"))
Apply comparison.cloud()
to all_tdm
with parameters max.words = 50
, and colors = c("darkgreen","darkred")
.
library(wordcloud)
comparison.cloud(
# Create plot from the all_tdm matrix
all_tdm,
# Limit to 50 words
max.words = 50,
# Use darkgreen and darkred colors
colors = c("darkgreen", "darkred")
)
Fantastic work! Word clouds are a great way to get an overview of your data.
3.2.2 Emotional introspection
In this exercise you go beyond subsetting on positive and negative language. Instead you will subset text by each of the 8 emotions in Plutchik's emotional wheel to construct a visual. With this approach you will get more clarity in word usage by mapping to a specific emotion instead of just positive or negative.
Using the tidytext
subjectivity lexicon, "nrc", you perform an inner_join()
with your text. The "nrc" lexicon has the 8 emotions plus positive and
negative term classes. So you will have to drop positive and negative
words after performing your inner_join()
. One way to do so is with the negation, !
, and grepl()
.
The "Global Regular Expression Print Logical" function, grepl(), will return TRUE or FALSE depending on whether a string pattern is identified in each row. In this exercise you will search for positive OR negative using the |
operator, representing "or" as shown below. Often this straight line is above the enter key on a keyboard. Since the !
negation precedes grepl()
, the T or F is switched so the "positive|negative"
is dropped instead of kept.
Object <- tibble %>%
filter(!grepl("positive|negative", column_name))
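As a concrete illustration on a made-up vector of sentiment labels (not course data):
# grepl() flags the elements matching "positive" OR "negative"
sentiments <- c("joy", "positive", "anger", "negative", "trust")
grepl("positive|negative", sentiments)
## [1] FALSE  TRUE FALSE  TRUE FALSE
# Negating with ! keeps only the emotion classes
sentiments[!grepl("positive|negative", sentiments)]
## [1] "joy"   "anger" "trust"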
Next you apply count()
on the identified words along with spread()
to get the data frame organized.
comparison.cloud()
requires its input to have row names, so you'll have to convert it to a base-R data.frame
, calling data.frame()
with the row.names
argument.
- inner_join() moby to nrc.
- filter() with a negation (!) and grepl() search for "positive|negative". The column to search is called sentiment.
- count() to count by sentiment and term.
- spread(), passing in sentiment, n, and fill = 0.
- data.frame(), making the term column into rownames.
moby_tidy <- moby %>%
# Inner join to nrc lexicon
inner_join(nrc, by = c("term" = "word")) %>%
# Drop positive or negative
filter(!grepl("positive|negative", sentiment)) %>%
# Count by sentiment and term
count(sentiment, term) %>%
# Spread sentiment, using n for values
spread(sentiment, n, fill = 0) %>%
# Convert to data.frame, making term the row names
data.frame(row.names = "term")
Examine moby_tidy using head().
# Examine
head(moby_tidy)
## anger anticipation disgust fear joy sadness surprise trust
## abandon 0 0 0 3 0 3 0 0
## abandoned 7 0 0 7 0 7 0 0
## abandonment 2 0 0 2 0 2 2 0
## abhorrent 1 0 1 1 0 0 0 0
## abominable 0 0 3 3 0 0 0 0
## abomination 1 0 1 1 0 0 0 0
- Using moby_tidy, draw a comparison.cloud().
- Limit to 50 words.
- Increase the title size to 1.5.
library(wordcloud)
# From previous step
moby_tidy <- m_dick_tidy %>% mutate(document=as.numeric(document)) %>%
inner_join(nrc, by = c("term" = "word")) %>%
filter(!grepl("positive|negative", sentiment)) %>%
count(sentiment, term) %>%
spread(sentiment, n, fill = 0) %>%
data.frame(row.names = "term")
# Plot comparison cloud
comparison.cloud(moby_tidy, max.words = 50, title.size = 1.5)
That's great! How does this cloud compare to the one from the previous exercise?
3.2.3 Compare & contrast stacked bar chart
Another way to slice your text is to understand how much of the document(s) are made of positive or negative words. For example a restaurant review may have some positive aspects such as "the food was good" but then continue to add "the restaurant was dirty, the staff was rude and parking was awful." As a result, you may want to understand how much of a document is dedicated to positive vs negative language. In this example it would have a higher negative percentage compared to positive.
One method for doing so is to count()
the positive and
negative words then divide by the number of subjectivity words
identified. In the restaurant review example, "good" would count as 1
positive and "dirty," "rude," and "awful" count as 3 negative terms. A
simple calculation would lead you to believe the restaurant review is
25% positive and 75% negative since there were 4 subjectivity terms.
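The arithmetic behind that 25%/75% split is just counts over the total number of subjectivity words. A quick sketch using the four restaurant terms mentioned above:
# One positive term ("good") and three negative terms ("dirty", "rude", "awful")
review_terms <- c("positive", "negative", "negative", "negative")
100 * table(review_terms) / length(review_terms)
## review_terms
## negative positive
##       75       25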
Start by performing the inner_join()
on a unified tidy data frame containing 4 books, Agamemnon, Oz, Huck
Finn, and Moby Dick. Just like the previous exercise you will use filter()
and grepl()
.
To perform the count()
you have to group the data by book
and then sentiment. For example all the positive words for Agamemnon
have to be grouped then tallied so that positive words from all books
are not mixed. Luckily, you can pass multiple variables into count()
directly.
- Inner join all_books to the lexicon, nrc.
- Filter to keep rows where sentiment contains "positive" or "negative". That is, use grepl() on the sentiment column, checking without the negation so that "positive|negative" terms are kept.
all_books=readRDS("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/all_books.rds")
# Review tail of all_books
tail(all_books)
## # A tibble: 6 × 5
## term document count author book
## <chr> <chr> <dbl> <chr> <chr>
## 1 ebooks 19117 1 twain innocents_abroad
## 2 email 19117 1 twain innocents_abroad
## 3 hear 19117 1 twain innocents_abroad
## 4 new 19117 1 twain innocents_abroad
## 5 newsletter 19117 1 twain innocents_abroad
## 6 subscribe 19117 1 twain innocents_abroad
# Count by book & sentiment
books_sent_count <- all_books %>%
# Inner join to nrc lexicon
inner_join(nrc, by = c("term" = "word")) %>%
# Keep only positive or negative
filter(grepl("positive|negative", sentiment)) %>%
# Count by book and by sentiment
count(book, sentiment)
Review the entire object to see the counts by book and sentiment.
# Review entire object
books_sent_count
## # A tibble: 22 × 3
## book sentiment n
## <chr> <chr> <int>
## 1 bartleby negative 531
## 2 bartleby positive 854
## 3 confidence_man negative 3456
## 4 confidence_man positive 5738
## 5 ct_yankee negative 3985
## 6 ct_yankee positive 6053
## 7 hamlet negative 1666
## 8 hamlet positive 2205
## 9 huck_finn negative 2401
## 10 huck_finn positive 3440
## # … with 12 more rows
- Group books_sent_count by book.
- Mutate to add a column named percent_positive. This should be calculated as 100 times n divided by the sum of n.
book_pos <- books_sent_count %>%
# Group by book
group_by(book) %>%
# Mutate to add % positive column
mutate(percent_positive = 100 * n / sum(n))
- Using book_pos, plot percent_positive vs. book, using sentiment as the fill color.
- Add a column layer with geom_col().
# Plot percent_positive vs. book, filled by sentiment
ggplot(book_pos, aes(book, percent_positive, fill = sentiment)) +
# Add a col layer
geom_col()
Cruising along! Now you know how to see the proportional positivity in text.
3.3 Interpreting visualizations
3.3.1 Kernel density plot
Now that you learned about a kernel density plot you can create one! Remember it's like a smoothed histogram but isn't affected by binwidth. This exercise will help you construct a kernel density plot from sentiment values.
In this exercise you will plot 2 kernel densities. One for Agamemnon and
another for The Wizard of Oz. For both you will perform an inner_join()
with the "afinn" lexicon. Recall the "afinn" lexicon has terms scored
from -5 to 5. Once in a tidy format, both books will retain words and
corresponding scores for the lexicon.
After that, you need to row bind the results into a larger data frame using bind_rows()
and create a plot with ggplot2
.
From the visual you will be able to understand which book uses more positive versus negative language. There is clearly overlap as negative things happen to Dorothy but you could infer the kernel density is demonstrating a greater probability of positive language in the Wizard of Oz compared to Agamemnon.
We've loaded ag
and oz
as tidy versions of Agamemnon and The Wizard of Oz respectively, and created afinn
as a subset of the tidytext
"afinn"
lexicon.
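If you want to inspect the lexicon yourself, it can be pulled with get_sentiments() (a sketch; this route needs the textdata package and a one-time download, which is why this transcript loads a CSV copy instead). The lexicon is a two-column tibble with a word column and an integer value column between -5 and 5.
library(tidytext)
# Fetch the AFINN lexicon (prompts a download via the textdata package)
afinn <- get_sentiments("afinn")
head(afinn)   # columns: word, value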
Inner join ag to the lexicon, afinn, assigning to ag_afinn.
afinn=read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vS4tUUP2pJ1A6crxDKSY6Be7Rp2QlZjase7ubLcPUXcnwE7xHKkXGuV3V8WpxsJFQpOEDuFpxb2qfbh/pub?gid=0&single=true&output=csv")
ag=ag_tidy
ag_afinn <- ag %>%
# Inner join to afinn lexicon
inner_join(afinn, by = c("term" = "word"))
Do the same with the oz dataset, assigning to oz_afinn.
oz_afinn <- oz %>%
# Inner join to afinn lexicon
inner_join(afinn, by = c("term" = "word"))
Use bind_rows() to combine ag_afinn and oz_afinn. Set the .id argument to "book" to create a new column with the name of each book.
# Combine
all_df <- bind_rows(agamemnon = ag_afinn, oz = oz_afinn, .id = "book")
- Using all_df, plot value, using book as the fill color.
- Set the alpha transparency to 0.3.
# Plot value, filled by book
ggplot(all_df, aes(x = value, fill = book)) +
# Set transparency to 0.3
geom_density(alpha = 0.3) +
theme_gdocs() +
ggtitle("AFINN Score Densities")
Not bad. Kernel densities are great for understanding a distribution.
3.3.2 Box plot
An easy way to compare multiple distributions is with a box plot. This code will help you construct multiple box plots to make a compact visual.
In this exercise the all_book_polarity
object is already loaded. The data frame contains two columns, book
and polarity
. It comprises all books with qdap
's polarity()
function applied. Here are the first 3 rows of the large object.
 | book | polarity |
---|---|---|
14 | huck | 0.2773501 |
22 | huck | 0.2581989 |
26 | huck | -0.5773503 |
This exercise introduces tapply()
which allows you to apply functions over a ragged array. You input a
vector of values and then a vector of factors. For each factor, value
combination the third parameter, a function like min()
, is applied. For example here's some code with tapply()
used on two vectors.
f1 <- as.factor(c("Group1", "Group2", "Group1", "Group2"))
stat1 <- c(1, 2, 1, 2)
tapply(stat1, f1, sum)
The result is an array where Group1
has a value of 2 (1+1) and Group2
has a value of 4 (2+2).
Examine all_book_polarity with str().
all_book_polarity=readRDS("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/all_book_polarity.rds")
# Examine
str(all_book_polarity)
## 'data.frame': 14437 obs. of 2 variables:
## $ book : Factor w/ 4 levels "huck","agamemnon",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ polarity: num 0.277 0.258 -0.577 0.25 0.516 ...
Using tapply(), pass in all_book_polarity$polarity, all_book_polarity$book and the summary() function. This will print the summary statistics for the 4 books in terms of their polarity() scores. You would expect Oz and Huck Finn to have higher averages than Agamemnon or Moby Dick. Pay close attention to the median.
# Summary by document
tapply(all_book_polarity$polarity, all_book_polarity$book, summary)
## $huck
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.38695 -0.25820 0.23570 0.04156 0.26726 1.60357
##
## $agamemnon
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4667 -0.3780 -0.3333 -0.1266 0.3333 1.2247
##
## $moby
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.13333 -0.28868 -0.25000 -0.02524 0.28868 1.84752
##
## $oz
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2728 -0.2774 0.2582 0.0454 0.2887 1.1877
- Start a ggplot() by passing in all_book_polarity.
- Aesthetics should be aes(x = book, y = polarity).
- Using a +, add the geom_boxplot() with col = "darkred". Pay close attention to the dark line in each box representing the median.
- Next add another layer called geom_jitter() to add points for each of the words.
# Box plot
ggplot(all_book_polarity, aes(x = book, y = polarity)) +
geom_boxplot(fill = c("#bada55", "#F00B42", "#F001ED", "#BA6E15"), col = "darkred") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 0.02) +
theme_gdocs() +
ggtitle("Book Polarity")
Boom goes the dynamite! Box plots help you quickly compare multiple distributions.
3.3.3 Radar chart
Remember Plutchik's wheel of emotion? The NRC lexicon has the 8 emotions corresponding to the first ring of the wheel. Previously you created a comparison.cloud()
according to the 8 primary emotions. Now you will create a radar chart similar to the wheel in this exercise.
A radarchart
is a two-dimensional representation of multidimensional data (at least
3). In this case the tally of the different emotions for a book are
represented in the chart. Using a radar chart, you can review all 8
emotions simultaneously.
As before we've loaded the "nrc" lexicon as nrc
and moby_huck
which is a combined tidy version of both Moby Dick and Huck Finn.
In this exercise you once again use a negated grepl()
to remove "positive|negative"
emotional classes from the chart. As a refresher here is an example:
object <- tibble %>%
filter(!grepl("positive|negative", column_name))
This exercise reintroduces spread()
which rearranges the tallied emotional words. As a refresher consider this raw data called datacamp
.
people | food | like |
---|---|---|
Nicole | bread | 78 |
Nicole | salad | 66 |
Ted | bread | 99 |
Ted | salad | 21 |
If you applied spread()
as in spread(datacamp, people, like)
the data looks like this.
food | Nicole | Ted |
---|---|---|
bread | 78 | 99 |
salad | 66 | 21 |
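That reshaping can be reproduced in a few lines; the datacamp data frame below is rebuilt by hand just for illustration:
library(tidyr)
# Rebuild the toy data from the table above
datacamp <- data.frame(
  people = c("Nicole", "Nicole", "Ted", "Ted"),
  food = c("bread", "salad", "bread", "salad"),
  like = c(78, 66, 99, 21)
)
# Spread people into columns, filling cells with the like values
spread(datacamp, people, like)
##    food Nicole Ted
## 1 bread     78  99
## 2 salad     66  21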
Review moby_huck with tail().
moby_huck=df %>% filter(book=="moby_dick")
# Review tail of moby_huck
tail(moby_huck)
## # A tibble: 6 × 5
## term document count author book
## <chr> <chr> <dbl> <chr> <chr>
## 1 ebooks 21574 1 melville moby_dick
## 2 email 21574 1 melville moby_dick
## 3 hear 21574 1 melville moby_dick
## 4 new 21574 1 melville moby_dick
## 5 newsletter 21574 1 melville moby_dick
## 6 subscribe 21574 1 melville moby_dick
- inner_join() moby_huck and nrc.
- filter(), negating "positive|negative" in the sentiment column. Assign the result to books_pos_neg.
- books_pos_neg is forwarded to group_by() with book and sentiment. Then tally() the object with an empty function.
- spread() the books_tally by the book and n column.
scores <- moby_huck %>%
# Inner join to lexicon
inner_join(nrc, by = c("term" = "word")) %>%
# Drop positive or negative sentiments
filter(!grepl("positive|negative", sentiment)) %>%
# Count by book and sentiment
count(book, sentiment) %>%
# Spread book, using n as values
spread(book, n)
Review the scores data.
# Review scores
scores
## # A tibble: 8 × 2
## sentiment moby_dick
## <chr> <int>
## 1 anger 2811
## 2 anticipation 4426
## 3 disgust 1996
## 4 fear 4177
## 5 joy 2781
## 6 sadness 3306
## 7 surprise 2074
## 8 trust 4784
Call chartJSRadar()
on scores
which is an htmlwidget
from the radarchart
package.
library(radarchart)
# JavaScript radar chart
chartJSRadar(scores)
Radical radar plotting! Bar plots are usually a clearer alternative, but radar charts do look pretty.
3.3.4 Treemaps for groups of documents
Often you will find yourself working with documents in groups, such as
author, product or by company. This exercise lets you learn about the
text while retaining the groups in a compact visual. For example, with
customer reviews grouped by product you may want to explore multiple
dimensions of the customer reviews at the same time. First you could
calculate the polarity()
of the reviews. Another dimension
may be length. Document length can demonstrate the emotional intensity.
If a customer leaves a short "great shoes!" one could infer they are
actually less enthusiastic compared to a lengthier positive review. You
may also want to group reviews by product type such as women's, men's
and children's shoes. A treemap lets you examine all of these
dimensions.
For text analysis, within a treemap each individual box represents a document such as a tweet. Documents are grouped in some manner such as author. The size of each box is determined by a numeric value such as number of words or letters. The individual colors are determined by a sentiment score.
After you organize the tibble, you use the treemap
library containing the function treemap()
to make the visual. The code example below declares the data, grouping variables, size, color and other aesthetics.
treemap(
data_frame,
index = c("group", "individual_document"),
vSize = "doc_length",
vColor = "avg_score",
type = "value",
title = "Sentiment Scores by Doc",
palette = c("red", "white", "green")
)
The pre-loaded all_books
object contains a combined tidy
format corpus with 4 Shakespeare, 3 Melville and 4 Twain books. Based on
the treemap you should be able to tell who writes longer books, and the
polarity of the author as a whole and for individual books.
Calculate each book's length in a new object called book_length
using count()
with the book
column.
book_length <- all_books %>%
# Count number of words per book
count(book)
- Inner join all_books to the lexicon, afinn.
- Group by author and book.
- Use summarize() to calculate the mean_value as the mean() of value.
- Inner join again, this time to book_length. Join by the book column.
book_tree <- all_books %>%
# Inner join to afinn lexicon
inner_join(afinn, by = c("term" = "word")) %>%
# Group by author, book
group_by(author, book) %>%
# Calculate mean book value
summarize(mean_value = mean(value)) %>%
# Inner join by book
inner_join(book_length, by = "book")
# Examine the results
book_tree
## # A tibble: 11 × 4
## # Groups: author [3]
## author book mean_value n
## <chr> <chr> <dbl> <int>
## 1 melville bartleby 0.0962 8871
## 2 melville confidence_man 0.484 48834
## 3 melville moby_dick 0.144 109996
## 4 shakespeare hamlet 0.0779 18725
## 5 shakespeare julius_caesar 0.0604 13165
## 6 shakespeare macbeth 0.206 12240
## 7 shakespeare romeo_juliet 0.151 16870
## 8 twain ct_yankee 0.189 58229
## 9 twain huck_finn 0.0727 55198
## 10 twain innocents_abroad 0.397 99031
## 11 twain tom_sawyer -0.0292 38831
- Draw a treemap, setting the following arguments.
- Use the book_tree from the previous step.
- Specify the aggregation index columns as "author" and "book".
- Specify the vertex size column, vSize, as "n".
- Specify the vertex color column, vColor, as "mean_value".
- Specify a direct mapping from vColor to the palette by setting type = "value".
library(treemap)
treemap(
# Use the book tree
book_tree,
# Index by author and book
index = c("author", "book"),
# Use n as vertex size
vSize = "n",
# Color vertices by mean_value
vColor = "mean_value",
# Draw a value type
type = "value",
title = "Book Sentiment Scores",
palette = c("red", "white", "green")
)
Terrific treemapping! Treemaps are great ways to explore grouped data.
4 Case study: Airbnb reviews
Is your property a good rental? What do people look for in a good rental?
4.1 The text mining workflow
4.1.1 Step 1: What do you want to know?
Throughout this chapter you will analyze the text of a corpus of Airbnb housing rental reviews. Which of the following questions can you answer using a sentiment analysis of these reviews?
- What document clusters exist in the reviews?
- How many words are associated with rental reviews?
- What property qualities are listed in positive or negative comments?
- What named entities are in the documents?
4.1.2 Step 2: Identify Text Sources
In this short exercise you will load and examine a small corpus of
property rental reviews from around Boston. Hopefully you already know read.csv()
which enables you to load a comma separated file. In this exercise you will also need to specify stringsAsFactors = FALSE
when loading the corpus. This ensures that the reviews are character
vectors, not factors. This may seem mundane but the point of this
chapter is to get you doing an entire workflow from start to finish so
let's begin with data ingestion!
Next you simply apply str() to review the data frame's structure. It is a convenient function for compactly displaying initial values and class types for vectors.
Lastly you will apply dim() to print the dimensions of the data frame. For a data frame, your console will print the number of rows and the number of columns.
Other functions like head()
, tail()
or summary()
are often used for data exploration but in this case we keep the
examination short so you can get to the fun sentiment analysis!
The Boston property rental reviews are stored in a CSV file located by the predefined variable bos_reviews_file
.
# bos_reviews_file has been pre-defined
bos_reviews_file
Load bos_reviews_file with read.csv(). Call the object bos_reviews. Be sure to pass in the parameter stringsAsFactors = FALSE so the comments are not unique factors.
# load raw text
bos_reviews =readRDS("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/bos_reviews.rds")
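The line above loads a pre-saved copy of the data. For reference, the read.csv() import described in the instructions would look like this (a sketch; bos_reviews_file is assumed to hold the path to the CSV):
# Import the reviews, keeping comments as character rather than factor
bos_reviews <- read.csv(bos_reviews_file, stringsAsFactors = FALSE)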
Next, examine the object with the str() function applied to bos_reviews.
# Structure
str(bos_reviews)
## 'data.frame': 1000 obs. of 2 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ comments: chr "My daughter and I had a wonderful stay with Maura. She kept in close touch with us throughout the day as we wer"| __truncated__ "We stay at Elizabeth's place for 3 nights in October 2014.\nThe apartment is really a great place to stay. \nLo"| __truncated__ "If you're staying in South Boston, this is a terrific place to camp out. The apartment and bedroom are lovely, "| __truncated__ "Derian and Brian were great and prompt with their communications with us. The room was as described; it was a s"| __truncated__ ...
Call dim() on bos_reviews.
# Dimensions
dim(bos_reviews)
## [1] 1000 2
Hurrah! Now that you've imported the data, let's get started with the sentiment analysis.
4.1.3 Quickly examine the basic polarity
When starting a sentiment project, sometimes a quick polarity()
will help you set expectations or learn about the problem. In this exercise (to save time), you will apply polarity()
to a portion of the comments
vector while the larger polarity object is loaded in the background.
Using a kernel density plot you should notice the reviews do not center on 0. Often there are two causes for this sentiment "grade inflation." First, social norms may lead respondents to be pleasant instead of neutral. This, of course, is channel specific. Particularly snarky channels like e-sports or social media posts may skew negative leading to "deflation." These channels have different expectations. A second possible reason could be "feature based sentiment". In some reviews an author may write "the bed was comfortable and nice but the kitchen was dirty and gross." The sentiment of this type of review encompasses multiple features simultaneously and therefore could make an average score skewed.
In a subsequent exercise you will adjust this "grade inflation" but here explore the reviews without any change.
Create practice_pol using polarity() on the first six reviews, as in bos_reviews$comments[1:6].
# Practice apply polarity to first 6 reviews
practice_pol <- polarity(bos_reviews$comments[1:6])
Review practice_pol.
# Review the object
practice_pol
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 6 390 0.747 0.398 1.875
Call summary() on practice_pol$all$polarity - this will access the overall polarity for all 6 comments.
# Check out the practice polarity
summary(practice_pol$all$polarity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.5009 0.6594 0.7466 1.0779 1.2455
The full polarity object, bos_pol, has been created for you. Now apply summary() to the correct list element that returns all polarity scores of bos_pol.
bos_pol=readRDS("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/bos_pol.rds")
# Summary for all reviews
summary(bos_pol$all$polarity)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.9712 0.6047 0.8921 0.9022 1.2063 3.7510 1
Finally, plot the distribution of the scores in bos_pol$all; polarity represents a column of this data frame.
# Plot Boston polarity all element
ggplot(bos_pol$all, aes(x = polarity, y = ..density..)) +
geom_histogram(binwidth = 0.25, fill = "#bada55", colour = "grey60") +
geom_density(size = 0.75) +
theme_gdocs()
Out of the gate and you're crushing it! Quick and easy yet polarity can help you get familiar with your data.
4.2 Organize (& clean) the text
4.2.1 Create Polarity Based Corpora
In this exercise you will perform Step 3 of the text mining workflow. Although qdap
isn't a tidy package you will mutate()
a new column based on the returned polarity
list representing all polarity (that's a hint BTW) scores. In chapter 3 we used a custom function pol_subsections
which uses only base R declarations. However, in following the tidy principles this exercise uses filter()
then introduces pull()
. The pull()
function works like [[
to extract a single variable.
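As a quick reminder of how pull() behaves, here is a sketch on a throwaway data frame (made up for illustration):
library(dplyr)
mini <- data.frame(
  id = 1:3,
  comments = c("great stay", "noisy street", "very clean"),
  stringsAsFactors = FALSE
)
# pull() returns a plain vector, just like [[ ]]
pull(mini, comments)
identical(pull(mini, comments), mini[["comments"]])
## [1] TRUE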
Once segregated you collapse all the positive and negative comments into two larger documents representing all words among the positive and negative rental reviews.
Lastly, you will create a Term Frequency Inverse Document Frequency
(TFIDF) weighted Term Document Matrix (TDM). Since this exercise code
starts with a tidy structure, some of the functions borrowed from tm
are used along with the %>%
operator to keep the style consistent. If the basics of the tm
package aren't familiar check out the Text Mining with Bag-of-Words in R
course. Instead of counting the number of times a word is used
(frequency), the values in the TDM are penalized for over used terms,
which helps reduce non-informative words.
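To see what the weighting change does, here is a tiny hedged sketch with two made-up documents; a term that appears in every document (like "location" below) gets an inverse document frequency of zero and drops out of the TFIDF matrix:
library(tm)
mini_corpus <- VCorpus(VectorSource(c("great location location", "dirty location")))
# Raw frequencies: "location" dominates both documents
as.matrix(TermDocumentMatrix(mini_corpus))
# TFIDF weighting: "location" is downweighted to 0, leaving the informative terms
as.matrix(TermDocumentMatrix(mini_corpus, control = list(weighting = weightTfIdf)))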
- Mutate to add a polarity column, equal to bos_pol$all$polarity.
- Filter to keep rows where polarity is greater than zero.
- Use pull() to extract the comments column. (Pass this column without quotes.)
- Collapse into a single string, separated by spaces, using paste(), passing collapse = " ".
pos_terms <- bos_reviews %>%
# Add polarity column
mutate(polarity = bos_pol$all$polarity) %>%
# Filter for positive polarity
filter(polarity > 0) %>%
# Extract comments column
pull(comments) %>%
# Paste and collapse
paste(collapse = " ")
pos_terms=readLines("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/pos.txt")
- Do the same again, this time with negative comments.
- Mutate to add a polarity column, equal to bos_pol$all$polarity.
- Filter to keep rows where polarity is less than zero.
- Extract the comments column.
- Collapse into a single string, separated by spaces.
neg_terms <- bos_reviews %>%
# Add polarity column
mutate(polarity = bos_pol$all$polarity) %>%
# Filter for negative polarity
filter(polarity < 0) %>%
# Extract comments column
pull(comments) %>%
# Paste and collapse
paste(collapse = " ")
neg_terms=readLines("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/neg.txt")
- Create a corpus of both positive and negative comments.
- Use c() to concatenate pos_terms and neg_terms.
- Source the text using VectorSource() without arguments.
- Convert to a volatile corpus by calling VCorpus(), again without arguments.
# Concatenate the terms
all_corpus <- c(pos_terms, neg_terms) %>%
# Source from a vector
VectorSource() %>%
# Create a volatile corpus
VCorpus()
- Create a term-document matrix from all_corpus.
- Use term frequency inverse document frequency weighting by setting weighting to weightTfIdf.
- Remove punctuation by setting removePunctuation to TRUE.
- Use English stopwords by setting stopwords to stopwords(kind = "en").
all_tdm <- TermDocumentMatrix(
# Use all_corpus
all_corpus,
control = list(
# Use TFIDF weighting
weighting = weightTfIdf,
# Remove the punctuation
removePunctuation = TRUE,
# Use English stopwords
stopwords = stopwords(kind = "en")
)
)
# Examine the TDM
# all_tdm
Congrats now you have a TFIDF weighted TDM splitting up your text!
4.2.2 Create a Tidy Text Tibble!
Since you learned about tidy principles this code helps you organize your data into a tibble so you can then work within the tidyverse!
Previously you learned that applying tidy()
on a TermDocumentMatrix()
object will convert the TDM to a tibble. In this exercise you will create the word data directly from the review column called comments
.
First you use unnest_tokens()
to make the text lowercase and tokenize the reviews into single words.
Sometimes it is useful to capture the original word order within each group of a corpus. To do so, use mutate()
. In mutate()
you will use seq_along()
to create a sequence of numbers from 1 to the length of the object. This will capture the word order as it was written.
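seq_along() simply numbers the elements of whatever vector it is given, which is why it can record word order; a one-line illustration on a made-up vector:
seq_along(c("my", "daughter", "and", "i"))
## [1] 1 2 3 4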
In the tm
package, you would use removeWords()
to remove stopwords. In the tidyverse you first need to load the stop words lexicon and then apply an anti_join()
between the tidy text data frame and the stopwords.
Create tidy_reviews by piping (%>%) the original reviews object bos_reviews to the unnest_tokens() function. Pass in a new column name, word, and declare the comments column. Remember in the tidyverse you don't need a $ or quotes.
# Vector to tibble
tidy_reviews <- bos_reviews %>%
unnest_tokens(word, comments)
Update tidy_reviews by piping tidy_reviews to group_by() with the column id. Then %>% it again to mutate(). Within mutate() create a new variable original_word_order equal to seq_along(word).
# Group by and mutate
tidy_reviews <- tidy_reviews %>%
group_by(id) %>%
mutate(original_word_order = seq_along(word))
Take a quick look at tidy_reviews.
# Quick review
tidy_reviews
## # A tibble: 70,975 × 3
## # Groups: id [1,000]
## id word original_word_order
## <int> <chr> <int>
## 1 1 my 1
## 2 1 daughter 2
## 3 1 and 3
## 4 1 i 4
## 5 1 had 5
## 6 1 a 6
## 7 1 wonderful 7
## 8 1 stay 8
## 9 1 with 9
## 10 1 maura 10
## # … with 70,965 more rows
Load the stopwords lexicon with data("stop_words"), then print stop_words to review it.
# Load stopwords
stop_words
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
Create tidy_reviews_without_stopwords by passing the original tidy_reviews to anti_join() with a %>%. Within anti_join() pass in the predetermined stop_words lexicon.
# Perform anti-join
tidy_reviews_without_stopwords <- tidy_reviews %>%
anti_join(stop_words)
Tidy Text Tibbles are a mouthful but you did it!
4.2.3 Compare Tidy Sentiment to Qdap Polarity
Here you will learn that differing sentiment methods will cause
different results. Often you will simply need to have results align
directionally although the specifics may be different. In the last
exercise you created tidy_reviews
which is a data frame of rental reviews without stopwords. Earlier in the chapter, you calculated and plotted qdap
's basic polarity()
function. This showed you the reviews tend to be positive.
Now let's perform a similar analysis the tidytext
way! Recall from an earlier chapter you will perform an inner_join()
followed by count()
and then a spread()
.
Lastly, you will create a new column using mutate()
and passing in positive - negative
.
The get_sentiments() function with "bing" will obtain the bing subjectivity lexicon. Call the lexicon bing.
# Get the correct lexicon
bing <- get_sentiments("bing")
Complete the pipeline below, filling in the lexicon bing, the new column name (polarity) and its calculation within mutate().
# Calculate polarity for each review
pos_neg <- tidy_reviews %>%
inner_join(bing) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(polarity = positive - negative)
Call summary() on the new object pos_neg. Although the values are different, are most rental reviews similarly positive compared to using polarity()? Do you see "grade inflation"?
# Check outcome
summary(pos_neg)
## id negative positive polarity
## Min. : 1 Min. : 0.0000 Min. : 0.000 Min. :-10.000
## 1st Qu.: 251 1st Qu.: 0.0000 1st Qu.: 4.000 1st Qu.: 3.000
## Median : 499 Median : 0.0000 Median : 6.000 Median : 5.000
## Mean : 500 Mean : 0.6633 Mean : 6.571 Mean : 5.908
## 3rd Qu.: 748 3rd Qu.: 1.0000 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :1000 Max. :14.0000 Max. :42.000 Max. : 37.000
Hooray! Often different polarity methods yield similar results.
4.3 Feature Extraction & Analysis
4.3.2 Comparison Cloud
This exercise will create a common visual for you to understand term
frequency. Specifically, you will review the most frequent terms from
among the positive and negative collapsed documents. Recall the TermDocumentMatrix all_tdm
you created earlier. Instead of 1000 rental reviews the matrix contains 2 documents containing all reviews separated by the polarity()
score.
It's usually easier to change the TDM to a matrix. From there you simply rename the columns. Remember that the colnames()
function is called on the left side of the assignment operator as shown below.
colnames(OBJECT) <- c("COLUMN_NAME1", "COLUMN_NAME2")
Once done, you will reorder the matrix to see the most positive and negative words. Review these terms so you can answer the conclusion exercises!
Lastly, you'll visualize the terms using comparison.cloud()
.
Change all_tdm to a matrix called all_tdm_m using as.matrix().
# Matrix
all_tdm_m <- as.matrix(all_tdm)
Use colnames() on all_tdm_m to declare c("positive", "negative").
# Column names
colnames(all_tdm_m) <- c("positive", "negative")
Apply order() to all_tdm_m[, 1] and set decreasing = TRUE.
# Top pos words
order_by_pos <- order(all_tdm_m[, 1], decreasing = TRUE)
Index all_tdm_m by order_by_pos, then pipe (%>%) to head() with n = 10.
# Review top 10 pos words
all_tdm_m[order_by_pos, ] %>% head(10)
## Docs
## Terms positive negative
## walk 0.004546528 0
## definitely 0.004133207 0
## quiet 0.003749410 0
## staying 0.003719887 0
## wonderful 0.003040860 0
## city 0.003011337 0
## restaurants 0.003011337 0
## highly 0.002745631 0
## station 0.002627539 0
## enjoyed 0.002420879 0
Do the same with order() on the second column, all_tdm_m[, 2], and use decreasing = TRUE.
# Top neg words
order_by_neg <- order(all_tdm_m[, 2], decreasing = TRUE)
Index all_tdm_m by order_by_neg. Pipe this to head() with n = 10.
# Review top 10 neg words
all_tdm_m[order_by_neg, ] %>% head(10)
## Docs
## Terms positive negative
## condition 0 0.002156722
## demand 0 0.001437815
## disappointed 0 0.001437815
## dumpsters 0 0.001437815
## hygiene 0 0.001437815
## inform 0 0.001437815
## nasty 0 0.001437815
## safety 0 0.001437815
## shouldve 0 0.001437815
## sounds 0 0.001437815
Draw a comparison.cloud()
on all_tdm_m
. Specify max.words
equal to 20
.
comparison.cloud(
# Use the term-document matrix
all_tdm_m,
# Limit to 20 words
max.words = 20,
colors = c("darkgreen", "darkred")
)
Success. Overused…yes. Still useful…yes!
4.3.3 Scaled Comparison Cloud
Recall the "grade inflation" of polarity scores on the rental reviews?
Sometimes, another way to uncover an insight is to scale the scores back
to 0 then perform the corpus subset. This means some of the previously
positive comments may become part of the negative subsection or vice
versa since the mean is changed to 0. This exercise will help you scale
the scores and then re-plot the comparison.cloud()
. Removing the "grade inflation" can help provide additional insights.
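scale() centers the scores on their mean (and, by default, also divides by the standard deviation), so reviews that were merely "less positive than average" end up below zero. A sketch on made-up polarity scores:
# Made-up polarity scores showing "grade inflation": all positive, mean of 0.9
raw_scores <- c(0.25, 0.75, 0.9, 1.2, 1.4)
round(as.numeric(scale(raw_scores)), 2)
## [1] -1.47 -0.34  0.00  0.68  1.13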
Previously you applied polarity()
to the bos_reviews$comments
and created a comparison.cloud()
. In this exercise you will scale()
the outcome before creating the comparison.cloud()
. See if this shows something different in the visual!
Since this is largely a review exercise, a lot of the code exists, just fill in the correct objects and parameters!
Review bos_pol$all while indexing [1:6, 1:3].
# Review
bos_pol$all[1:6,1:3]
## all wc polarity
## 1 all 77 1.1851900
## 2 all 78 1.2455047
## 3 all 39 0.4803845
## 4 all 101 0.7562283
## 5 all 16 0.2500000
## 6 all 79 0.5625440
Create a new column bos_reviews$scaled_polarity with scale() applied to the polarity score column bos_pol$all$polarity.
# Scale/center & append
bos_reviews$scaled_polarity <- scale(bos_pol$all$polarity)
Create pos_comments using subset() where the new column bos_reviews$scaled_polarity is greater than (>) zero.
# Subset positive comments
pos_comments <- subset(bos_reviews$comments, bos_reviews$scaled_polarity > 0)
Create neg_comments using subset() where the new column bos_reviews$scaled_polarity is less than (<) zero.
# Subset negative comments
neg_comments <- subset(bos_reviews$comments, bos_reviews$scaled_polarity < 0)
Create pos_terms using paste() on pos_comments.
pos_comments=readLines("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/pos_comments.txt")
# Paste and collapse the positive comments
pos_terms <- paste(pos_comments, collapse = " ")
Create neg_terms with paste() on neg_comments.
neg_comments=readLines("/Users/apple/Documents/Rstudio/DataCamp/SentimentAnalysisinR/neg_comments.txt")
# Paste and collapse the negative comments
neg_terms <- paste(neg_comments, collapse = " ")
Organize the pos_terms and neg_terms documents into a single corpus called all_terms.
# Organize
all_terms <- c(pos_terms,neg_terms)
Follow the usual tm workflow by nesting VectorSource() inside VCorpus(), applied to all_terms.
# VCorpus
all_corpus <- VCorpus(VectorSource(all_terms))
Create a TermDocumentMatrix() using the all_corpus object. Note this is a TfIdf weighted TDM with basic cleaning functions.
# TDM
all_tdm <- TermDocumentMatrix(
all_corpus,
control = list(
weighting = weightTfIdf,
removePunctuation = TRUE,
stopwords = stopwords(kind = "en")
)
)
Change all_tdm to all_tdm_m using as.matrix(). Then rename the columns in the existing code to "positive" and "negative".
# Column names
all_tdm_m <- as.matrix(all_tdm)
colnames(all_tdm_m) <- c("positive", "negative")
Apply comparison.cloud() to the matrix object all_tdm_m. Take notice of the new most frequent negative words. Maybe it will uncover an unknown insight!
# Comparison cloud
comparison.cloud(
all_tdm_m,
max.words = 100,
colors = c("darkgreen", "darkred")
)
Almost there! Another comparison cloud to help you extract your insights.
4.4 Reach a conclusion
4.4.1 Confirm an expected conclusion
Refer to the following plot from the exercise "Comparison Cloud":
It's not surprising that the most common positive words for rentals included "walk", "restaurants", "subway" and "stations". In contrast, top negative terms included "condition", "dumpsters", "hygiene", "safety" and "sounds".
If you were looking to rent out your clean apartment and it was close to public transit and good restaurants, would it get a favorable review?
- Yes
- No
4.4.2 Choose a less expected insight
Refer to the following plot from the exercise "Scaled Comparison Cloud":
For your rental, should you use an automated posting?
- Yes
- No