Add sentiment analysis to your text mining toolkit! Sentiment analysis is used by text miners in marketing, politics, customer service and elsewhere. In this course you will learn to identify positive and negative language, specific emotional intent, and make compelling visualizations. You will end the course by applying your sentiment analysis skills to Airbnb reviews to learn what makes for a good rental.
1 Fast & dirty: Polarity scoring
In the first chapter, you will learn how to apply qdap's sentiment function, polarity().
1.1 Let's talk about our feelings
1.1.1 Jump right in! Visualize polarity
Sentiment analysis helps you extract an author's feelings towards a subject. More specifically, to make a quantitative judgement about the sentiment of some text, you need to give it a score. A simple method is to assign a positive or negative value to a sentence, a passage, or a collection of documents called a corpus. Scoring with positive or negative values only is called "polarity." A useful function for extracting polarity scores is counts() applied to the polarity object. For a quick visual, call plot() on the polarity() result.
Examine the text_df conversation data frame.
# Examine the text data
text_df
## person text
## 1 Nick DataCamp courses are the best
## 2 Jonathan I like talking to students
## 3 Martijn Other online data science curricula are boring.
## 4 Nicole What is for lunch?
## 5 Nick DataCamp has lots of great content!
## 6 Jonathan Students are passionate and are excited to learn
## 7 Martijn Other data science curriculum is hard to learn and to understand
## 8 Nicole I think the food here is good.
# Calc overall polarity score
text_df %$% polarity(text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 8 53 0.214 0.393 0.544
# Calc polarity score by person
(datacamp_conversation <- text_df %$% polarity(text, person))
## person total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Jonathan 2 13 0.577 0.184 3.141
## 2 Martijn 2 18 -0.340 0.054 -6.284
## 3 Nick 2 11 0.428 0.028 15.524
## 4 Nicole 2 11 0.189 0.267 0.707
Apply counts() to datacamp_conversation to print the specific emotional words that were found.
# Counts table from datacamp_conversation
counts(datacamp_conversation)
## person wc polarity pos.words neg.words text.var
## 1 Nick 5 0.447 best - DataCamp courses are the best
## 2 Jonathan 5 0.447 like - I like talking to students
## 3 Martijn 7 -0.378 - boring Other online data science curricula are boring.
## 4 Nicole 4 0.000 - - What is for lunch?
## 5 Nick 6 0.408 great - DataCamp has lots of great content!
## 6 Jonathan 8 0.707 passionate, excited - Students are passionate and are excited to learn
## 7 Martijn 11 -0.302 - hard Other data science curriculum is hard to learn and to understand
## 8 Nicole 7 0.378 good - I think the food here is good.
Plot the datacamp_conversation object with plot().
# Plot the conversation polarity
plot(datacamp_conversation)
Excellent, that was easy!
1.1.2 TM refresher (I)
In the Text Mining: Bag of Words course you learned that a corpus is a set of texts, and you studied some functions for preprocessing the text. To recap, one way to create & clean a corpus is with the functions below.
- Turn a character vector into a text source using VectorSource().
- Turn a text source into a corpus using VCorpus().
- Remove unwanted characters from the corpus using cleaning functions such as replace_abbreviation().
In this exercise a custom clean_corpus() function has been created using standard preprocessing functions for easier application.
clean_corpus() accepts the output of VCorpus() and applies cleaning functions. For example:
processed_corpus <- clean_corpus(my_corpus)
Your R session has a text vector, tm_define, containing two small documents, and the function clean_corpus().
Create an object called tm_vector by applying VectorSource() to tm_define.
# Create a VectorSource
tm_vector <- VectorSource(tm_define)
Make tm_corpus using VCorpus() on tm_vector.
# Apply VCorpus
tm_corpus <- VCorpus(tm_vector)
Use content() to examine the contents of the first document in tm_corpus.
# Examine the first document's contents
content(tm_corpus[[1]])
## [1] "Text mining is the process of distilling actionable insights from text."
Apply clean_corpus() on tm_corpus. Call this new object tm_clean.
# Clean the text
tm_clean <- clean_corpus(tm_corpus)
Examine the first document of the tm_clean object again to see how the text changed after clean_corpus() was applied.
# Reexamine the contents of the first doc
content(tm_clean[[1]])
## [1] "text mining process distilling actionable insights text"
Feeling fresh! If you work with text, it's useful to know how to manipulate corpora.
1.1.3 TM refresher (II)
Now let's create a Document Term Matrix (DTM). In a DTM:
- Each row of the matrix represents a document.
- Each column is a unique word token.
- Values of the matrix correspond to an individual document's word usage.
The DTM is the basis for many bag of words analyses. Later in the course, you will also use the related Term Document Matrix (TDM). This is the transpose; that is, columns represent documents and rows represent unique word tokens.
You should construct a DTM after cleaning the corpus (using clean_corpus()). To do so, call DocumentTermMatrix() on the corpus object.
tm_dtm <- DocumentTermMatrix(tm_clean)
If you need a more in-depth refresher check out the Text Mining with Bag-of-Words in R course.
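To make the row/column layout concrete, here is a minimal base R sketch that builds a tiny DTM by hand. It does not use tm; the two documents and all object names (docs, tokens, vocab, dtm, tdm) are made up for illustration, and the result is a plain matrix rather than a real DocumentTermMatrix object.

```r
# Two tiny, already-cleaned documents (illustrative text, not course data)
docs <- c(doc1 = "text mining finds insights text",
          doc2 = "sentiment mining scores text")

# Split each document into word tokens
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary is the sorted set of unique tokens
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document
# (vapply returns one column per document, i.e. a term-by-document layout)
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)

# Term Document Matrix: rows are terms, columns are documents
tdm <- matrix(counts, nrow = length(vocab),
              dimnames = list(vocab, names(docs)))

# Document Term Matrix: the transpose, rows are documents
dtm <- t(tdm)
dtm
```

Each cell holds how often a term occurs in a document, so a word that a document never uses simply appears as 0 in that row.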
Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).
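We've created a VCorpus() object called clean_text containing 1000 tweets mentioning coffee. The tweets have been cleaned with the previously mentioned preprocessing steps and your goal is to create a DTM from it.
Apply DocumentTermMatrix() to the clean_text corpus to create a term frequency weighted DTM called tf_dtm.
# Create a term frequency weighted DTM
tf_dtm <- DocumentTermMatrix(clean_text)
# Convert to a matrix and check its dimensions
tf_dtm_m <- as.matrix(tf_dtm)
dim(tf_dtm_m)
## [1] 1000 3098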
# Subset part of tf_dtm_m for comparison
tf_dtm_m[16:20, 2975:2984]
## Terms
## Docs went were west westin westside wet wfriends what whatever whatislifeee
## 16 0 0 0 0 0 0 0 0 0 0
## 17 0 0 0 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0 0
Delightful use of a DocumentTermMatrix! These things crop up regularly in text mining.
1.2 Zipf's law & subjectivity lexicon
1.2.1 What is a subjectivity lexicon?
As discussed in the video, a lexicon is a list of words. What is the purpose of a subjectivity lexicon?
- A subjectivity lexicon lets you extract meaningful insights from text.
- A subjectivity lexicon is a predefined list of words associated with emotional context such as positive/negative.
- A subjectivity lexicon lets you subjectively argue your point!
1.2.2 Where can you observe Zipf's law?
Although Zipf observed a steep and predictable decline in word usage you may not buy into Zipf's law. You may be thinking "I know plenty of words, and have a distinctive vocabulary". That may be the case, but the same can't be said for most people! To prove it, let's construct a visual from 3 million tweets mentioning "#sb".
In this exercise, you will use the package metricsgraphics. Although the author suggests using the pipe %>% operator, you will construct the graphic step-by-step to learn about the various aspects of the plot. The main function of the package metricsgraphics is the mjs_plot() function which is the first step in creating a JavaScript plot. Once you have that, you can add other layers on top of the plot.
An example metricsgraphics workflow without using the %>% operator is below:
metro_plot <- mjs_plot(data, x = x_axis_name, y = y_axis_name, show_rollover_text = FALSE)
metro_plot <- mjs_line(metro_plot)
metro_plot <- mjs_add_line(metro_plot, line_one_values)
metro_plot <- mjs_add_legend(metro_plot, legend = c('names', 'more_names'))
Use head() on sb_words to review top words.
Create sb_plot using mjs_plot().
- Pass in sb_words with x = rank and y = freq.
# Create metrics plot
sb_plot <- mjs_plot(sb_words, x = rank, y = freq, show_rollover_text = FALSE)
Overwrite sb_plot using mjs_line() and pass in sb_plot.
# Add 1st line
sb_plot <- mjs_line(sb_plot)
Add to sb_plot with mjs_add_line().
- Pass in the previous sb_plot object and the vector, expectations.
# Add 2nd line
sb_plot <- mjs_add_line(sb_plot, expectations)
Place a legend on the sb_plot object using mjs_add_legend().
- Pass in the previous sb_plot object.
- The legend labels should consist of "Frequency" and "Expectation".
# Add legend
sb_plot <- mjs_add_legend(sb_plot, legend = c("Frequency", "Expectation"))
Call sb_plot to display the plot. Mouseover a point to simultaneously highlight a freq and Expectation point. The magic of JavaScript!
# Display plot
sb_plot
Great job! While you may not obey Zipf's Law, it seems like most people on Twitter do!
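The exercise passes a ready-made expectations vector to mjs_add_line() without showing where it comes from. One plausible construction, sketched below in base R, uses the simplest form of Zipf's law: the expected frequency at rank r is the top frequency divided by r. The word counts and the word_freq name are made up for illustration, and how the course actually built expectations is an assumption.

```r
# Hypothetical word counts, already sorted from most to least frequent
word_freq <- data.frame(
  word = c("the", "to", "game", "ad", "halftime"),
  freq = c(1200, 610, 390, 310, 235),
  stringsAsFactors = FALSE
)

# Rank 1 is the most frequent word
word_freq$rank <- seq_len(nrow(word_freq))

# Zipf's law (simplest form): expected frequency = top frequency / rank
word_freq$expectations <- word_freq$freq[1] / word_freq$rank

word_freq
```

Plotting freq and expectations against rank, as the exercise does, shows how closely the observed decline follows the predicted one.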
1.2.3 Polarity on actual text
So far you have learned the basic components needed for assessing positive or negative intent in text.
- The subjectivity lexicon is a predefined list of words associated with emotions or positive/negative feelings.
- You don't have to list every word in a subjectivity lexicon because Zipf's law describes human expression.
A quick way to get started is to use the polarity() function which has a built-in subjectivity lexicon.
The function scans the text to identify words in the lexicon. It then creates a cluster around an identified subjectivity word. Within the cluster valence shifters adjust the score. Valence shifters are words that amplify or negate the emotional intent of the subjectivity word. For example, "well known" is positive while "not well known" is negative. Here "not" is a negating term and reverses the emotional intent of "well known." In contrast, "very well known" employs an amplifier increasing the positive intent.
The polarity() function then calculates a score using subjectivity terms, valence shifters and the total number of words in the passage. This exercise demonstrates a simple polarity calculation. In the next video we look under the hood of polarity() for more detail.
Calculate the polarity() of positive in a new object called pos_score. Encase the entire call in parentheses so the output is also printed.
# Calc polarity of positive
(pos_score <- polarity(positive))
# Get counts
(pos_counts <- counts(pos_score))
## all wc polarity pos.words neg.words text.var
## 1 all 6 0.408 good - DataCamp courses are good for learning
The pos.words column of pos_counts is a list with a single element vector. Find the number of positive words, n_good, by calling length() on the first element of the $pos.words list.
# Number of positive words
n_good <- length(pos_counts$pos.words[[1]])
Capture the total number of words in n_words. This value is stored in pos_counts as the wc element.
# Total number of words
n_words <- pos_counts$wc
Verify the polarity calculation by dividing n_good by sqrt() of n_words. Compare the equation's result to the polarity value reported in pos_counts.
# Verify polarity score
n_good / sqrt(n_words)
## [1] 0.4082483
Well done! Using the polarity() function is much easier, and still gets the same answer!
1.3 qdap's polarity & lexicon
1.3.1 Happy songs!
Of course just positive and negative words aren't enough. In this exercise you will learn about valence shifters, which tell you about the author's emotional intent. Previously you applied polarity() to text without valence shifters. In this example you will see amplification and negation words in action.
Recall that an amplifying word adds 0.8 to a positive word in polarity() so the positive score becomes 1.8. For negative words 0.8 is subtracted so the total becomes -1.8. Then the score is divided by the square root of the total number of words.
Consider the following example from Frank Sinatra:
- "It was a very good year"
"Good" equals 1 and "very" adds another 0.8. So, 1.8/sqrt(6) results in 0.73 polarity.
A negating word such as "not" will invert the subjectivity score. Consider the following example from Bobby McFerrin:
- "Don't worry Be Happy"
"Worry" is now 1 due to the negation "don't." Adding the "happy", +1, equals 2. With 4 total words, 2 / sqrt(4) equals a polarity score of 1.
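The two worked examples can be reproduced with a small base R sketch. This is a deliberately simplified stand-in, not qdap's algorithm: it only looks at the single word immediately before a subjectivity word, whereas polarity() examines a cluster of words around it. The function name toy_polarity(), its arguments, and the four-word lexicon are all hypothetical.

```r
# Simplified polarity: sum of (shifted) word scores / sqrt(total words).
# Only the word directly before a subjectivity word can act as a shifter.
toy_polarity <- function(sentence,
                         lexicon    = c(good = 1, happy = 1, worry = -1, bad = -1),
                         negators   = c("don't", "not", "never"),
                         amplifiers = c("very", "extremely"),
                         amp_weight = 0.8) {
  words <- strsplit(tolower(sentence), " ", fixed = TRUE)[[1]]
  total <- 0
  for (i in seq_along(words)) {
    base <- lexicon[words[i]]
    if (is.na(base)) next                 # not a subjectivity word
    prev  <- if (i > 1) words[i - 1] else ""
    score <- unname(base)
    # An amplifier pushes the score 0.8 further from zero
    if (prev %in% amplifiers) score <- score * (1 + amp_weight)
    # A negator flips the sign
    if (prev %in% negators) score <- -score
    total <- total + score
  }
  total / sqrt(length(words))
}

toy_polarity("It was a very good year")   # 1.8 / sqrt(6)
toy_polarity("Don't worry Be Happy")      # 2 / sqrt(4)
```

Under these assumptions the first call gives roughly 0.73 and the second gives 1, matching the hand calculations above.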
# Examine conversation
conversation
## # A tibble: 3 × 2
## student text
## <chr> <chr>
## 1 Martijn This restaurant is never bad
## 2 Nick The lunch was very good
## 3 Nicole It was awful I got food poisoning and was extremely ill
Apply polarity() to the text column of conversation to calculate polarity for the entire conversation.
# Polarity - All
conversation %$% polarity(text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 3 21 0.317 0.565 0.561
Calculate polarity again, this time passing two columns of conversation.
- The text variable is text and the grouping variable is student.
# Polarity - Grouped
student_pol <- conversation %$%
polarity(text, student)
To see the student level results, use scores() on student_pol.
# Student results
scores(student_pol)
## student total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Martijn 1 5 0.447 NA NA
## 2 Nick 1 5 0.805 NA NA
## 3 Nicole 1 11 -0.302 NA NA
The counts() function applied to student_pol will print the sentence level polarity for the entire data frame along with lexicon words identified.
# Sentence by sentence
counts(student_pol)
## student wc polarity pos.words neg.words text.var
## 1 Martijn 5 0.447 - bad This restaurant is never bad
## 2 Nick 5 0.805 good - The lunch was very good
## 3 Nicole 11 -0.302 - awful It was awful I got food poisoning and was extremely ill
The polarity object, student_pol, can be plotted with plot().
# qdap plot
plot(student_pol)
It was a very good piece of code you just wrote! 'Extremely good' is more positive than 'very good', which is more positive than 'good', which is more positive than 'quite good'.
1.3.2 LOL, this song is wicked good
Even with Zipf's law in action, you will still need to adjust lexicons to fit the text source (for example, Twitter versus legal documents) or the author's demographics (teenagers versus the elderly). This exercise demonstrates the explicit components of polarity() so you can change them if needed.
In Trey Songz's song "Lol :)" there is a lyric "LOL smiley face, LOL smiley face." In the basic polarity() function, "LOL" is not defined as positive. However, "LOL" stands for "Laugh Out Loud" and should be positive. As a result, you should adjust the lexicon to fit the text's context, which includes pop-culture slang. If your analysis contains text from a specific channel (Twitter's "LOL"), location (Boston's "Wicked Good"), or age group (teenagers' "sick"), you will likely have to adjust the lexicon.
In this exercise you are not adjusting the subjectivity lexicon or the qdap dictionaries containing valence shifters. Instead you are examining the existing word data frame objects so you can change them in the following exercise.
We've created text containing two excerpts from Beyoncé's "Crazy in Love" lyrics for the exercise.
Print key.pol to see a portion of the subjectivity words and values.
# Examine the key.pol
key.pol
## x y
## 1: a plus 1
## 2: abnormal -1
## 3: abolish -1
## 4: abominable -1
## 5: abominably -1
## ---
## 6775: zealously -1
## 6776: zenith 1
## 6777: zest 1
## 6778: zippy 1
## 6779: zombie -1
Examine negation.words to print all the negating terms.
# Negators
negation.words
## [1] "ain't" "aren't" "can't" "couldn't" "didn't" "doesn't"
## [7] "don't" "hasn't" "isn't" "mightn't" "mustn't" "neither"
## [13] "never" "no" "nobody" "nor" "not" "shan't"
## [19] "shouldn't" "wasn't" "weren't" "won't" "wouldn't"
Now look at amplification.words to see the words that add values to the lexicon.
# Amplifiers
amplification.words
## [1] "acute" "acutely" "certain" "certainly"
## [5] "colossal" "colossally" "deep" "deeply"
## [9] "definite" "definitely" "enormous" "enormously"
## [13] "extreme" "extremely" "great" "greatly"
## [17] "heavily" "heavy" "high" "highly"
## [21] "huge" "hugely" "immense" "immensely"
## [25] "incalculable" "incalculably" "massive" "massively"
## [29] "more" "particular" "particularly" "purpose"
## [33] "purposely" "quite" "real" "really"
## [37] "serious" "seriously" "severe" "severely"
## [41] "significant" "significantly" "sure" "surely"
## [45] "true" "truly" "vast" "vastly"
## [49] "very"
Check deamplification.words to print the words that reduce the lexicon values.
# De-amplifiers
deamplification.words
## [1] "barely" "faintly" "few" "hardly" "little"
## [6] "only" "rarely" "seldom" "slightly" "sparsely"
## [11] "sporadically" "very few" "very little"
Call text to see the conversation.
# Examine
text
## # A tibble: 2 × 2
## speaker words
## <chr> <chr>
## 1 beyonce I know I dont understand Just how your love can do what no one else c…
## 2 jay_z They cant figure him out they like hey, is he insane
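Calculate polarity() as follows.
- Set text.var to text$words.
- Set grouping.var to text$speaker.
- Set polarity.frame to key.pol.
- Set negators to negation.words.
- Set amplifiers to amplification.words.
- Set deamplifiers to deamplification.words.
# Complete the polarity parameters
polarity(
  text.var = text$words,
  grouping.var = text$speaker,
  polarity.frame = key.pol,
  negators = negation.words,
  amplifiers = amplification.words,
  deamplifiers = deamplification.words
)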
## speaker total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 beyonce 1 16 0.25 NA NA
## 2 jay_z 1 11 0.00 NA NA
Powerful polarizing! The polarity() function is very flexible and allows you to override the score given to each word.
1.3.3 Stressed Out!
Here you will adjust the negative words to account for the specific text. You will then compare the basic and custom polarity() scores.
A popular song from Twenty One Pilots is called "Stressed Out". If you scan the song lyrics, you will observe the song is about youthful nostalgia. Overall, most people would say the polarity is negative. Repeatedly the lyrics mention stress, fears and pretending.
Let's compare the song lyrics using the default subjectivity lexicon and also a custom one.
To start, you need to verify the key.pol subjectivity lexicon does not already have the term you want to add. One way to check is with grep(). The grep() function returns the positions of the elements that match a search pattern, which you can then use as row indices. Here is an example used while indexing.
data_frame[grep("search_pattern", data_frame$column), ]
After verifying the slang or new word is not already in the key.pol lexicon, you need to add it. The code below uses sentiment_frame() to construct the new lexicon. Within the code, sentiment_frame() accepts the original positive word vector, positive.words. Next, the original negative.words are concatenated with "smh" and "kappa", both considered negative slang. Although you can declare the positive and negative weights, the defaults are 1 and -1 so they are not included below.
custom_pol <- sentiment_frame(positive.words, c(negative.words, "smh", "kappa"))
Now you are ready to apply polarity and it will reference the custom subjectivity lexicon!
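The check-then-extend pattern can be tried without qdap on a toy lexicon. The sketch below is base R only; the four-row lexicon data frame is made up (it merely borrows the x and y column names shown for key.pol), and rbind() stands in for sentiment_frame() purely for illustration.

```r
# A tiny stand-in lexicon with word (x) and score (y) columns
lexicon <- data.frame(
  x = c("distress", "good", "stress", "stressful"),
  y = c(-1, 1, -1, -1),
  stringsAsFactors = FALSE
)

# grep() returns the positions of matching elements;
# using them as row indices pulls out the matching rows
lexicon[grep("stress", lexicon$x), ]

# Keep only the candidate terms that are not already in the lexicon
new_terms     <- c("smh", "kappa")
missing_terms <- setdiff(new_terms, lexicon$x)

# Append the missing terms with the default negative weight of -1
custom_lexicon <- rbind(
  lexicon,
  data.frame(x = missing_terms,
             y = rep(-1, length(missing_terms)),
             stringsAsFactors = FALSE)
)
custom_lexicon
```

Note that grep("stress", ...) is a substring match, so it also returns "distress"; setdiff() is used for the exact-match check before appending.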
We've created stressed_out which contains the lyrics to the song "Stressed Out", by Twenty One Pilots.
Use polarity() on stressed_out to see the default score.
# Basic lexicon score
polarity(stressed_out)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 518 -0.255 NA NA
Check key.pol for any words containing "stress". Use grep() to index the data frame by searching in the x column.
# Check the subjectivity lexicon
key.pol[grep("stress", x)]
## x y
## 1: distress -1
## 2: distressed -1
## 3: distressing -1
## 4: distressingly -1
## 5: mistress -1
## 6: stress -1
## 7: stresses -1
## 8: stressful -1
## 9: stressfully -1
Create custom_pol as a new sentiment data frame.
- Call sentiment_frame() and pass positive.words as the first argument without concatenating any new terms.
- Next, use c() to combine negative.words with new terms "stressed" and "turn back".
# New lexicon
custom_pol <- sentiment_frame(positive.words, c(negative.words, "stressed", "turn back"))
Reapply polarity() to stressed_out with the additional parameter polarity.frame = custom_pol to compare how the new words change the score to a more accurate representation of the song.
# Compare new score
polarity(stressed_out, polarity.frame = custom_pol)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 518 -0.826 NA NA
Great work! It's important to take the specific features of the text you're analyzing into account so that you can make sure your results are accurate.
2 Sentiment analysis the tidytext way
In the second chapter you will explore 3 subjectivity lexicons from tidytext. Then you will do an inner join to score some text.
2.1 Wheel of emotion
2.1.1 One theory of emotion
What is the philosophical basis for Plutchik's wheel of emotion?
- Plutchik wanted a round framework so made it like a wheel.
- Plutchik was an angry person and wanted to explain his actions to others.
- Plutchik believed the primary emotions were formed as survival mechanisms in humans and animals.
- Plutchik performed extensive field tests with sloths because they are slow to react.
2.1.2 DTM vs. tidytext matrix
The tidyverse is a collection of R packages that share common philosophies and are designed to work together. This chapter covers some tidy functions to manipulate data.
Within the tidyverse, each observation is a single row in a data frame. That makes working in different packages much easier since the fundamental data structure is the same. Parts of this course borrow heavily from the tidytext package which uses this data organization.
For example, you may already be familiar with the %>% operator from the magrittr package. This forwards an object on its left-hand side as the first argument of the function on its right-hand side.
In the example below, you are forwarding the data object to function1(). Notice how the parentheses are empty. This in turn is forwarded to function2(). In the last function you don't have to add the data object because it was forwarded from the output of function1(). However, you do add a fictitious parameter, some_parameter. These pipe forwards ultimately create the object named object.
object <- data %>%
function1() %>%
function2(some_parameter = TRUE)
To use the %>% operator, you don't necessarily need to load the magrittr package, since it is also available in the dplyr package. dplyr also contains the functions inner_join() (which you'll learn more about later) and count() for tallying data. The last function you'll need is mutate() to create new variables or modify existing ones.
object <- data %>%
mutate(new_Var_name = Var1 - Var2)
or to modify a variable
object <- data %>%
mutate(Var1 = as.factor(Var1))
You will also use tidyr's spread() function to organize the data with each row being a line from the book and the positive and negative values as columns.
index | negative | positive |
42 | 2 | 0 |
43 | 0 | 1 |
44 | 1 | 0 |
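To see how long sentiment counts turn into the wide table above, here is a base R sketch that reproduces it with xtabs() instead of tidyr. This swaps in a different technique purely for illustration; long_counts and wide are hypothetical names and the three input rows are chosen to match the table.

```r
# Long format: one row per (line index, sentiment) pair that occurred
long_counts <- data.frame(
  index     = c(42, 43, 44),
  sentiment = c("negative", "positive", "negative"),
  n         = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# xtabs() sums n for every index/sentiment combination;
# combinations that never occurred are filled with 0
wide <- as.data.frame.matrix(xtabs(n ~ index + sentiment, data = long_counts))

# Recover the index as a regular column
wide$index <- as.numeric(rownames(wide))
wide
```

The zero fill plays the same role as fill = 0 in spread(): every line ends up with both a negative and a positive value.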
To change a DTM to a tidy format, use tidy() from the broom package (tidytext supplies the method for DTM objects).
tidy_format <- tidy(Document_Term_Matrix)
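What the tidy format looks like can be sketched in base R on a plain matrix. This is not how tidy() is implemented; it only illustrates the idea of keeping one row per non-zero document/term pair. The matrix values and the names dtm_m, nz and tidy_like are made up.

```r
# A small document-term matrix with made-up counts
dtm_m <- matrix(
  c(1, 0, 2,
    0, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("207", "208"), c("hear", "love", "zeus"))
)

# Row/column positions of the non-zero cells
nz <- which(dtm_m > 0, arr.ind = TRUE)

# One row per document/term pair that actually occurs
tidy_like <- data.frame(
  document = rownames(dtm_m)[nz[, 1]],
  term     = colnames(dtm_m)[nz[, 2]],
  count    = dtm_m[nz],
  stringsAsFactors = FALSE
)

# Sort by document, then term, for readability
tidy_like <- tidy_like[order(tidy_like$document, tidy_like$term), ]
tidy_like
```

Zero cells disappear in the tidy layout, which is why a sparse DTM with thousands of columns becomes a much shorter three-column table.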
This exercise uses text from the Greek tragedy, Agamemnon. Agamemnon is a story about marital infidelity and murder.
We've already created a clean DTM called ag_dtm for this exercise.
Create ag_dtm_m by applying as.matrix() to ag_dtm.
# As matrix
ag_dtm_m <- as.matrix(ag_dtm)
Using [ and ], index ag_dtm_m to row 2206 and columns 245:250.
# Examine line 2206 and columns 245:250
ag_dtm_m[2206, 245:250]
## birds birdthroated birth bite bitter black
## 0 0 0 0 0 0
Apply tidy() to ag_dtm. Call the new object ag_tidy.
# Tidy up the DTM
ag_tidy <- tidy(ag_dtm)
Examine ag_tidy at rows [831:835, ] to compare the tidy format. You will see a common word from the examined part of ag_dtm_m in step 2.
# Examine tidy with a word you saw
ag_tidy[831:835, ]
## # A tibble: 5 × 3
## document term count
## <chr> <chr> <dbl>
## 1 207 whateer 1
## 2 207 zeus 2
## 3 208 hear 1
## 4 208 love 1
## 5 208 name 1
Aces! See the difference?
2.1.3 Getting Sentiment Lexicons
So far you have used a single lexicon. Now we will transition to using three, each measuring sentiment in different ways.
The tidytext package contains a function called get_sentiments() which, along with the textdata package, allows you to download and interact with well-researched lexicons. Here is a small section of the loughran lexicon.
Word | Sentiment |
abandoned | negative |
abandoning | negative |
abandonment | negative |
abandonments | negative |
abandons | negative |
This lexicon contains 4150 terms with corresponding information. We will be exploring other lexicons but the structure & method to get them is similar.
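Let's use tidytext with textdata to explore other lexicons' word labels!
Use get_sentiments() to obtain the "afinn" lexicon, assigning to afinn_lex.
Review the overall count() of value in afinn_lex.
# Get the AFINN lexicon
afinn_lex <- get_sentiments("afinn")
# Count AFINN scores
afinn_lex %>%
  count(value)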
## value n
## 1 -5 16
## 2 -4 43
## 3 -3 264
## 4 -2 966
## 5 -1 309
## 6 0 1
## 7 1 208
## 8 2 448
## 9 3 172
## 10 4 45
## 11 5 5
Do the same again, this time with the "nrc" lexicon. That is,
- get the sentiments, assigning to nrc_lex, then
- count the sentiment column, assigning to nrc_counts.
# Subset to nrc
nrc_lex <- read.csv("")
# Make the nrc counts object
nrc_counts <- nrc_lex %>%
  count(sentiment)
Create a ggplot labeling the y-axis as n vs. x-axis of sentiment.
- Add a col layer using geom_col(). (This is like geom_bar(), but used when you've already summarized with count().)
# From previous step
nrc_counts <- nrc_lex %>%
  count(sentiment)
# Plot n vs. sentiment
ggplot(nrc_counts, aes(x = sentiment, y = n)) +
  # Add a col layer
  geom_col()
Lovely lexicon exploration! Negative words are the most common type in the NRC lexicon.
2.2 Bing lexicon
2.2.1 Bing tidy polarity: Simple example
Now that you understand the basics of an inner join, let's apply this to the "Bing" lexicon. Keep in mind the inner_join() function comes from dplyr and the lexicon object is obtained using tidytext's get_sentiments().
The Bing lexicon labels words as positive or negative. The next three exercises let you interact with this specific lexicon. To use get_sentiments() pass in a string such as "afinn", "bing", "nrc", or "loughran" to download the specific lexicon.
The inner join workflow:
- Obtain the correct lexicon using get_sentiments().
- Pass the lexicon and the tidy text data to inner_join().
- In order for inner_join() to work there must be a shared column name. If there are no shared column names, declare them with an additional parameter, by, equal to c() with column names like below.
object <- x %>%
inner_join(y, by = c("column_from_x" = "column_from_y"))
- Perform some aggregation and analysis on the table intersection.
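The workflow above can be sketched without dplyr by using base R's merge(), which performs an inner join by default. The two small tables are illustrative: the lexicon rows reuse word/label pairs that appear in this exercise's output, while "watch" is a made-up word that is not in the toy lexicon. The names tidy_text, lexicon and joined are hypothetical.

```r
# Tidy text: one row per document/term pair
tidy_text <- data.frame(
  document = c("7", "8", "8", "10"),
  term     = c("waste", "respite", "watch", "well"),
  count    = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Lexicon: the word column has a different name than in tidy_text
lexicon <- data.frame(
  word      = c("respite", "waste", "well", "lonely"),
  sentiment = c("positive", "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Inner join: keep only rows whose word appears in BOTH tables.
# by.x / by.y play the role of by = c("term" = "word")
joined <- merge(tidy_text, lexicon, by.x = "term", by.y = "word")
joined

# Aggregate on the intersection: tally words per sentiment
table(joined$sentiment)
```

"watch" drops out because it has no lexicon entry, and "lonely" drops out because it never occurs in the text, which is exactly the table intersection the last step aggregates over.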
We've loaded ag_txt containing the first 100 lines from Agamemnon and ag_tidy which is the tidy version.
Calculate polarity() on ag_txt.
# Qdap polarity
polarity(ag_txt)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 2683 15145 -0.054 0.315 -0.171
Get the "bing" lexicon by passing that string to get_sentiments().
# Get Bing lexicon
bing <- get_sentiments("bing")
Perform an inner_join() with ag_tidy and bing.
- The word columns are called "term" in ag_tidy and "word" in the lexicon, so declare the by argument.
- Call the new object ag_bing_words.
# Join text to lexicon
ag_bing_words <- inner_join(ag_tidy, bing, by = c("term" = "word"))
Print ag_bing_words, and look at some of the words that are in the result.
# Examine
ag_bing_words
## # A tibble: 1,425 × 4
## document term count sentiment
## <chr> <chr> <dbl> <chr>
## 1 7 waste 1 negative
## 2 8 respite 1 positive
## 3 10 well 1 positive
## 4 11 lonely 1 negative
## 5 13 great 1 positive
## 6 13 heavenly 1 positive
## 7 19 dark 1 negative
## 8 20 fear 1 negative
## 9 21 warning 1 negative
## 10 22 well 1 positive
## # … with 1,415 more rows
Pass ag_bing_words to count() of sentiment using the pipe operator, %>%. Compare the polarity() score to sentiment count ratio.
# Get counts by sentiment
ag_bing_words %>%
  count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 904
## 2 positive 521
Great work! Did you notice the sentiment count() ratio? It's 904:521.
2.2.2 Bing tidy polarity: Count & spread the white whale
In this exercise you will apply another inner_join() using the "bing" lexicon. Then you will manipulate the results with both count() from dplyr and spread() from tidyr to learn about the text.
The spread() function spreads a key-value pair across multiple columns. In this case the key is the sentiment and the values are the frequency of positive or negative terms for each line. Using spread() changes the data so that each row now has positive and negative values, even if a value is 0.
In this exercise, your R session has m_dick_tidy, which contains the book Moby Dick, and bing, containing the lexicon as in the previous exercise.
Perform inner_join() on m_dick_tidy and bing.
- As before, join the "term" column in m_dick_tidy to the "word" column in the lexicon.
- Call the new object moby_lex_words.
# Inner join
moby_lex_words <- inner_join(m_dick_tidy, bing, by = c("term" = "word"))
Create a column index, equal to as.numeric() applied to document. This occurs within mutate() in the tidyverse.
moby_lex_words <- moby_lex_words %>%
# Set index to numeric document
mutate(index = as.numeric(document))
Create moby_count by forwarding moby_lex_words to count(), passing in sentiment, index.
moby_count <- moby_lex_words %>%
# Count by sentiment, index
count(sentiment, index)
# Examine the counts
moby_count
## # A tibble: 10,531 × 3
## sentiment index n
## <chr> <dbl> <int>
## 1 negative 9 1
## 2 negative 10 1
## 3 negative 22 1
## 4 negative 42 1
## 5 negative 43 2
## 6 negative 45 1
## 7 negative 58 1
## 8 negative 65 1
## 9 negative 67 1
## 10 negative 70 1
## # … with 10,521 more rows
Create moby_spread by piping moby_count to spread() which contains sentiment, n, and fill = 0.
moby_spread <- moby_count %>%
# Spread sentiments
spread(sentiment, n, fill = 0)
# Review the spread data
moby_spread
## # A tibble: 9,246 × 3
## index negative positive
## <dbl> <dbl> <dbl>
## 1 9 1 0
## 2 10 1 0
## 3 13 0 1
## 4 17 0 1
## 5 19 0 1
## 6 22 1 0
## 7 24 0 1
## 8 25 0 1
## 9 31 0 2
## 10 35 0 2
## # … with 9,236 more rows
Excellent work! You slew the data wrangling white whale!
2.2.3 Bing tidy polarity: Call me Ishmael (with ggplot2)!
The last Bing lexicon exercise! In this exercise you will use the pipe operator (%>%) to create a timeline of the sentiment in Moby Dick. In the end you will also create a simple visual following the code structure below. The next chapter goes into more depth for visuals.
ggplot(spread_data, aes(index_column, polarity_column)) +
  geom_smooth()
Inner join moby to the bing lexicon.
- Call inner_join() to join the tibbles.
- Join by the term column in the text and the word column in the lexicon.
Count by sentiment and index.
Reshape so that each sentiment has its own column.
- Call spread().
- The key column (to split into multiple columns) is sentiment.
- The value column (containing the counts) is n.
- Also specify fill = 0 to fill out missing values with a zero.
Use mutate() to add the polarity column. Define it as the difference between the positive and negative columns.
moby_polarity <- moby %>%
# Inner join to lexicon
inner_join(bing, by = c("term" = "word")) %>% mutate(index=row_number()) %>%
# Count the sentiment scores
count(sentiment, index) %>%
# Spread the sentiment into positive and negative columns
spread(sentiment, n, fill = 0) %>%
# Add polarity column
mutate(polarity = positive - negative)
Using moby_polarity, plot polarity vs. index.
# From previous step
moby_polarity <- moby %>%
inner_join(bing, by = c("term" = "word")) %>% mutate(index=row_number()) %>%
count(sentiment, index) %>%
spread(sentiment, n, fill = 0) %>%
mutate(polarity = positive - negative)
Add a smooth trend curve by calling geom_smooth() with no arguments.
# Plot polarity vs. index
ggplot(moby_polarity, aes(index, polarity)) +
  # Add a smooth trend curve
  geom_smooth()
Call me pleased with your work! Does Moby Dick have a happy ending?
2.3 AFINN & NRC lexicon
2.3.1 AFINN: I'm your Huckleberry
Now we transition to the AFINN lexicon. The AFINN lexicon has numeric values from 5 to -5, not just positive or negative. Unlike the Bing lexicon's sentiment column, the AFINN lexicon's sentiment score column is called value.
As before, you apply inner_join() then count(). Next, to sum the scores of each line, we use dplyr's group_by() and summarize() functions. The group_by() function takes an existing data frame and converts it into a grouped data frame where operations are performed "by group". Then, the summarize() function lets you calculate a value for each group in your data frame using a function that aggregates data, like sum() or mean(). So, in our case we can do something like
data_frame %>%
  group_by(book_line) %>%
  summarize(total_value = sum(value))
In the tidy version of Huckleberry Finn, line 9703 contains words "best", "ever", "fun", "life" and "spirit". "best" and "fun" have AFINN scores of 3 and 4 respectively. After aggregating, line 9703 will have a total score of 7.
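The group-then-sum step can be checked with a small base R sketch that uses aggregate() in place of group_by() and summarize(). The scored data frame is a hypothetical, already-joined table; its scores for "best", "fun" and "glad" are the AFINN values quoted in this exercise, and only the sentiment-bearing words of each line are listed.

```r
# Words that matched the lexicon, with their line number and AFINN value
scored <- data.frame(
  line  = c(9703, 9703, 5400, 5400),
  term  = c("best", "fun", "fun", "glad"),
  value = c(3, 4, 4, 3),
  stringsAsFactors = FALSE
)

# Sum the values within each line (the "by group" aggregation)
totals <- aggregate(value ~ line, data = scored, FUN = sum)
names(totals)[2] <- "total_value"
totals
```

Both lines come out with a total of 7, in agreement with the totals stated for lines 9703 and 5400 in this exercise.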
In the tidyverse, filter() is preferred to subset() because it combines the functionality of subset() with simpler syntax. Here is an example that filters data_frame to the rows where column1 is equal to 24. Notice the column name is not in quotes.
filter(data_frame, column1 == 24)
The afinn
object contains the AFINN lexicon. The huck
object is a tidy version of Mark Twain's Adventures of Huckleberry Finn for analysis.
Line 5400 is "All the loafers looked glad; I reckoned they was used to having fun out of Boggs." Stopwords and punctuation have already been removed in the dataset.
huck=df %>% filter(book=="huck_finn") %>% mutate(line=document) %>% select(term,count,line)
# See abbreviated line 5400
huck %>% filter(line == 5400)
## # A tibble: 7 × 3
## term count line
## <chr> <dbl> <chr>
## 1 all 1 5400
## 2 fun 1 5400
## 3 glad 1 5400
## 4 loafers 1 5400
## 5 looked 1 5400
## 6 reckoned 1 5400
## 7 used 1 5400
# What are the scores of the sentiment words?
afinn %>% filter(word %in% c("fun", "glad"))
## word value
## 1 fun 4
## 2 glad 3
- Inner join huck to the afinn lexicon. huck is already piped into the function, so just add the lexicon.
- Join by the term column in the text and the word column in the lexicon.
- Use count() with value and line to tally observations by group.
- Assign the result to huck_afinn.
huck_afinn <- huck %>%
# Inner Join to AFINN lexicon
inner_join(afinn, by = c("term" = "word")) %>%
# Count by value and line
count(value, line)
- Pass huck_afinn to group_by(), passing line without quotes.
- Use summarize(), setting total_value equal to the sum() of value * n.
huck_afinn_agg <- huck_afinn %>%
# Group by line
group_by(line) %>%
# Sum values times n (by line)
summarize(total_value = sum(value * n))
Use filter() on huck_afinn_agg with line == 5400 to review a single line.
huck_afinn_agg %>%
# Filter for line 5400
filter(line == 5400)
## # A tibble: 1 × 2
## line total_value
## <chr> <int>
## 1 5400 7
Create a sentiment timeline.
- Pass huck_afinn_agg to the data argument of ggplot().
- Then specify the x and y aesthetics with aes(), passing line and total_value without quotes.
- Add a layer with geom_smooth().
# Plot total_value vs. line
ggplot(huck_afinn_agg, aes(line, total_value)) +
# Add a smooth trend curve
geom_smooth()
Wow, you're a tidytext
wizard! Huckleberry Finn has a not-quite-a-happy-ending.
2.3.2 The wonderful wizard of NRC
Last but not least, you get to work with the NRC lexicon which labels words across multiple emotional states. Remember Plutchik's wheel of emotion? The NRC lexicon tags words according to Plutchik's 8 emotions plus positive/negative.
In this exercise there is a new operator, %in%, which checks whether each element of one vector is found in another. In the code below, %in% returns FALSE, FALSE, TRUE. This is because, within some_vec, 1 and 2 are not found within some_other_vector, but 3 is found and returns TRUE. The %in% operator is useful for finding matches.
some_vec <- c(1, 2, 3)
some_other_vector <- c(3, "a", "b")
some_vec %in% some_other_vector
Another new operator is !. For logical conditions, adding ! inverts the result. In the above example, each FALSE becomes TRUE and the TRUE becomes FALSE. Using it in concert with %in% inverts the response and is good for removing items that are matched.
!some_vec %in% some_other_vector
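To see how matching and negation work together when subsetting, here is a small base-R sketch. It reuses the two vectors from above; the names matches, kept and dropped are introduced only for illustration.

```r
some_vec <- c(1, 2, 3)
some_other_vector <- c(3, "a", "b")

# Which elements of some_vec appear in some_other_vector?
matches <- some_vec %in% some_other_vector

# Keep the matched items, or remove them with the negation
kept    <- some_vec[matches]
dropped <- some_vec[!matches]
```

The same idea carries over to filter(): a negated %in% condition removes every row whose value appears in the lookup vector.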
We've created oz
which is the tidy version of The Wizard of Oz along with nrc
containing the "NRC" lexicon with renamed columns.
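- Inner join oz to the nrc lexicon, calling inner_join() to join the tibbles.
- Join by the term column in the text and the word column in the lexicon.
- Use filter() to keep rows where the sentiment is not "positive" or "negative".
- Group by sentiment, passing sentiment without quotes.
- Use summarize(), setting total_count equal to the sum() of count.
- Assign the result to oz_plutchik.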
oz_corpus <- VCorpus(oz_source)
tf_dtm <- DocumentTermMatrix(clean_text)
oz_plutchik <- oz %>%
# Join to nrc lexicon by term = word
inner_join(nrc, by = c("term" = "word")) %>%
# Only consider Plutchik sentiments
filter(!sentiment %in% c("positive", "negative")) %>%
# Group by sentiment
group_by(sentiment) %>%
# Get total count by sentiment
summarize(total_count = sum(count))
Create a bar plot with ggplot().
- Pass in oz_plutchik to the data argument.
- Then specify the x and y aesthetics, calling aes() and passing sentiment and total_count without quotes.
- Add a column geom with geom_col(). (This is the same as geom_bar(), but doesn't summarize the data, since you've done that already.)
# Plot total_count vs. sentiment
ggplot(oz_plutchik, aes(x = sentiment, y = total_count)) +
# Add a column geom
geom_col()
Your childhood memories are correct: The Wizard of Oz is a scary story. Fear is the most prevalent sentiment in this text.
3 Visualizing sentiment
Make compelling visuals with your sentiment output.
3.1 Parlor trick or worthwhile?
3.1.1 Real insight?
You are given a stack of 10 employee surveys and told to figure out the team's sentiment. The two question survey has 1 question with a numeric scale (1-10) where employees answer how inspired they are at work and a second question for free form text.
You are asked to perform a sentiment analysis on the free form text. Would performing sentiment analysis on the text be appropriate?
Yes, the sentiment analysis confirms the employee ratings.
No, the free form text will correlate with the ratings and with only 10 surveys the results may have selection and simultaneity bias.
3.1.2 Unhappy ending? Chronological polarity
Sometimes you want to track sentiment over time. For example, during an ad campaign you could track brand sentiment to see the campaign's effect. You saw a few examples of this at the end of the last chapter.
In this exercise you'll recap the workflow for exploring sentiment over time using the novel Moby Dick. One should expect that happy moments in the book would have more positive words than negative. Conversely dark moments and sad endings should use more negative language. You'll also see some tricks to make your sentiment time series more visually appealing.
Recall that the workflow is:
- Inner join the text to the lexicon by word.
- Count the sentiments by line.
- Reshape the data so each sentiment has its own column.
- (Depending upon the lexicon) Calculate the polarity as positive score minus negative score.
- Draw the polarity time series.
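The first four steps of this workflow can be sketched on toy data. Everything here is hypothetical: toy_text and toy_lexicon are hand-made stand-ins for a tidy book and a Bing-style lexicon, and the sketch assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical tidy text: one row per term per line
toy_text <- tibble(
  line = c(1, 1, 2, 2, 2),
  term = c("happy", "whale", "dark", "doom", "good")
)

# Hypothetical miniature lexicon in the Bing layout (word, sentiment)
toy_lexicon <- tibble(
  word      = c("happy", "good", "dark", "doom"),
  sentiment = c("positive", "positive", "negative", "negative")
)

toy_polarity <- toy_text %>%
  # 1. Inner join the text to the lexicon by word
  inner_join(toy_lexicon, by = c("term" = "word")) %>%
  # 2. Count the sentiments by line
  count(sentiment, line) %>%
  # 3. Reshape so each sentiment has its own column
  spread(sentiment, n, fill = 0) %>%
  # 4. Polarity is positive minus negative
  mutate(polarity = positive - negative)

toy_polarity
```

Line 1 has one positive word and no negative words, so fill = 0 is what keeps its negative count from becoming NA.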
This exercise should look familiar: it extends Bing tidy polarity: Call me Ishmael (with ggplot2)!.
- Inner join the pre-loaded tidy version of Moby Dick, moby, to the bing lexicon. Join by the "term" column in the text and the "word" column in the lexicon.
- Count by sentiment and index.
- Spread with the column sentiment and the counts column called n. Also specify fill = 0 to fill out missing values with a zero.
- Using mutate(), add two columns, polarity and line_number: set polarity equal to the positive score minus the negative score, and line_number equal to the row number using the row_number() function.
moby_polarity <- moby %>%
# Inner join to the lexicon
inner_join(bing, by = c("term" = "word")) %>% mutate(index=row_number()) %>%
# Count by sentiment, index
count(sentiment, index) %>%
# Spread sentiments
spread(sentiment, n, fill = 0) %>%
mutate(
# Add polarity field
polarity = positive - negative,
# Add line number field
line_number = row_number()
)
Create a sentiment time series with ggplot().
- Pass in moby_polarity to the data argument.
- Call aes() and pass in line_number and polarity without quotes.
- Add a smoothed curve with geom_smooth().
- Add a red horizontal line at zero by calling geom_hline(), with parameters 0 and "red".
- Add a title with ggtitle() set to "Moby Dick Chronological Polarity".
# Plot polarity vs. line_number
ggplot(moby_polarity, aes(line_number, polarity)) +
# Add a smooth trend curve
geom_smooth() +
# Add a horizontal line at y = 0
geom_hline(yintercept = 0, color = "red") +
# Add a plot title
ggtitle("Moby Dick Chronological Polarity")
Nice data viz! The story isn't much happier this time around!
3.1.3 Word impact, frequency analysis
One of the easiest ways to explore data is with a frequency analysis.
Although not difficult, in sentiment analysis this simple method can be
surprisingly illuminating. Specifically, you will build a barplot. In
this exercise you are once again working with moby
and bing
to construct your visual.
To get the bars ordered from lowest to highest, you will use a trick with factors. reorder()
lets you change the order of factor levels based upon another scoring
variable. In this case, you will reorder the factor variable term
by the scoring variable polarity
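To make the factor trick concrete, here is a small base-R sketch. The data frame toy and its values are made up for illustration.

```r
# Toy data (hypothetical terms and polarity values)
toy <- data.frame(
  term     = c("good", "dark", "great", "death"),
  polarity = c(60, -55, 80, -90)
)

# reorder() returns a factor whose levels are sorted by the scoring variable
ordered_term <- reorder(toy$term, toy$polarity)
levels(ordered_term)
```

Because ggplot2 draws discrete axes in level order, mapping reorder(term, polarity) to x places the bars from lowest to highest polarity.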
moby_tidy_sentiment <- moby %>%
# Inner join to bing lexicon by term = word
inner_join(bing, by = c("term" = "word")) %>%
# Count by term and sentiment, weighted by count
count(term, sentiment, wt = count) %>%
# Spread sentiment, using n as values
spread(sentiment, n, fill = 0) %>%
# Mutate to add a polarity column
mutate(polarity = positive - negative)
Review moby_tidy_sentiment and compare it to the previous exercise.
# Review
moby_tidy_sentiment
## # A tibble: 2,344 × 4
## term negative positive polarity
## <chr> <dbl> <dbl> <dbl>
## 1 abominable 3 0 -3
## 2 abominate 1 0 -1
## 3 abomination 1 0 -1
## 4 abound 0 3 3
## 5 abruptly 2 0 -2
## 6 absence 5 0 -5
## 7 absurd 3 0 -3
## 8 absurdly 1 0 -1
## 9 abundance 0 3 3
## 10 abundant 0 2 2
## # … with 2,334 more rows
moby_tidy_pol <- moby_tidy_sentiment %>%
# Filter for absolute polarity at least 50
filter(abs(polarity) >= 50) %>%
# Add positive/negative status
mutate(pos_or_neg = ifelse(polarity > 0, "positive", "negative"))
- Using moby_tidy_pol, plot polarity vs. term, reordered by polarity (reorder(term, polarity)), filled by pos_or_neg.
- Inside element_text(), rotate the x-axis text 90 degrees by setting angle = 90 and shifting the vertical justification with vjust = -0.1.
# Plot polarity vs. (term reordered by polarity), filled by pos_or_neg
ggplot(moby_tidy_pol, aes(reorder(term, polarity), polarity, fill = pos_or_neg)) +
geom_col() +
ggtitle("Moby Dick: Sentiment Word Frequency") +
theme_gdocs() +
# Rotate text and vertically justify
theme(axis.text.x = element_text(angle = 90, vjust = -0.1))
Amazing! You went all the way from documents to visualizations in no time at all.
3.2 Introspection
3.2.1 Divide & conquer: Using polarity for a comparison cloud
Now that you have seen how polarity can be used to divide a corpus, let's do it! This code will walk you through dividing a corpus based on sentiment so you can peer into the information in subsets instead of holistically.
Your R session has oz_pol
which was created by applying polarity()
to "The Wonderful Wizard of Oz."
For simplicity's sake, we created a simple custom function called pol_subsections(), which divides the corpus by polarity score. First, the function accepts a data frame in which each row is a sentence or document of the corpus. The data frame is then subset wherever the polarity values are greater than or less than 0. Next, the positive and negative sentences (those with non-zero polarities) are each pasted together with the collapse parameter so that the terms of each group form a single document. Lastly, the two documents are concatenated into a single vector of two distinct documents.
pol_subsections <- function(df) {
x.pos <- subset(df$text, df$polarity > 0)
x.neg <- subset(df$text, df$polarity < 0)
x.pos <- paste(x.pos, collapse = " ")
x.neg <- paste(x.neg, collapse = " ")
all.terms <- c(x.pos, x.neg)
return(all.terms)
}
At this point you have omitted the neutral sentences and want to focus
on organizing the remaining text. In this exercise we use the %>%
operator again to forward objects to functions. After some simple cleaning use comparison.cloud()
to make the visual.
Using select(), extract two columns from oz_pol$all. Declare the first column, text, as text.var, which is the raw text. The second column, polarity, should refer to the polarity scores, polarity.
oz_df <- oz_pol$all %>%
# Select text.var as text and polarity
select(text = text.var, polarity = polarity)
Apply pol_subsections() to oz_df. Call the new object all_terms.
pol_subsections <- function(df) {
x.pos <- subset(df$text, df$polarity > 0)
x.neg <- subset(df$text, df$polarity < 0)
x.pos <- paste(x.pos, collapse = " ")
x.neg <- paste(x.neg, collapse = " ")
all.terms <- c(x.pos, x.neg)
return(all.terms)
}
# Apply custom function pol_subsections()
all_terms <- pol_subsections(oz_df)
To create all_corpus, apply VectorSource() to all_terms and then %>% to VCorpus().
all_corpus <- all_terms %>%
# Source from a vector
VectorSource() %>%
# Make a volatile corpus
VCorpus()
Create a term-document matrix, all_tdm, using TermDocumentMatrix().
- Add in the parameters control = list(removePunctuation = TRUE, stopwords = stopwords(kind = "en")).
- Then call as.matrix() to convert the result to a matrix.
- Then %>% again to set_colnames(c("positive", "negative")).
all_tdm <- TermDocumentMatrix(
# Create TDM from corpus
all_corpus,
control = list(
# Yes, remove the punctuation
removePunctuation = TRUE,
# Use English stopwords
stopwords = stopwords(kind = "en")
)
) %>%
# Convert to matrix
as.matrix() %>%
# Set column names
set_colnames(c("positive", "negative"))
Apply comparison.cloud() to all_tdm with parameters max.words = 50 and colors = c("darkgreen", "darkred").
comparison.cloud(
# Create plot from the all_tdm matrix
all_tdm,
# Limit to 50 words
max.words = 50,
# Use darkgreen and darkred colors
colors = c("darkgreen", "darkred")
)
Fantastic work! Word clouds are a great way to get an overview of your data.
3.2.2 Emotional introspection
In this exercise you go beyond subsetting on positive and negative language. Instead you will subset text by each of the 8 emotions in Plutchik's emotional wheel to construct a visual. With this approach you will get more clarity in word usage by mapping to a specific emotion instead of just positive or negative.
Using the tidytext
subjectivity lexicon, "nrc", you perform an inner_join()
with your text. The "nrc" lexicon has the 8 emotions plus positive and
negative term classes. So you will have to drop positive and negative
words after performing your inner_join()
. One way to do so is with the negation, !
, and grepl()
The "Global Regular Expression Print Logical" function, grepl()
, will return a True or False if a string pattern is identified in each row. In this exercise you will search for positive OR negative using the |
operator, representing "or" as shown below. Often this straight line is above the enter key on a keyboard. Since the !
negation precedes grepl()
, the T or F is switched so the "positive|negative"
is dropped instead of kept.
Object <- tibble %>%
filter(!grepl("positive|negative", column_name))
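A small base-R sketch shows what the negated grepl() does before it is used inside filter(). The sentiment vector below is a made-up sample of labels, not real lexicon rows.

```r
# Toy sentiment labels (hypothetical rows from a joined lexicon)
sentiment <- c("anger", "positive", "joy", "negative", "trust")

# TRUE where the label matches "positive" OR "negative"
is_pos_neg <- grepl("positive|negative", sentiment)

# Negating the result keeps only the emotion classes
emotions_only <- sentiment[!is_pos_neg]
emotions_only
```

Inside filter(), the same logical vector decides which rows survive, so the negation drops the "positive" and "negative" rows.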
Next you apply count()
on the identified words along with spread()
to get the data frame organized.
comparison.cloud() requires its input to have row names, so you'll have to convert it to a base-R data.frame, calling data.frame() with the row.names argument.
- Inner join moby to nrc.
- Filter with a negation (!) and a grepl() search for "positive|negative". The column to search is called sentiment.
- Use count() to count by sentiment and term.
- Use spread(), passing in sentiment, n, and fill = 0.
- Convert to a data.frame, making the term column into rownames.
moby_tidy <- moby %>%
# Inner join to nrc lexicon
inner_join(nrc, by = c("term" = "word")) %>%
# Drop positive or negative
filter(!grepl("positive|negative", sentiment)) %>%
# Count by sentiment and term
count(sentiment, term) %>%
# Spread sentiment, using n for values
spread(sentiment, n, fill = 0) %>%
# Convert to data.frame, making term the row names
data.frame(row.names = "term")
Examine moby_tidy using head().
# Examine
head(moby_tidy)
## anger anticipation disgust fear joy sadness surprise trust
## abandon 0 0 0 3 0 3 0 0
## abandoned 7 0 0 7 0 7 0 0
## abandonment 2 0 0 2 0 2 2 0
## abhorrent 1 0 1 1 0 0 0 0
## abominable 0 0 3 3 0 0 0 0
## abomination 1 0 1 1 0 0 0 0
- Using moby_tidy, draw a comparison.cloud().
- Limit to 50 words.
- Increase the title size to 1.5.
# From previous step
moby_tidy <- m_dick_tidy %>% mutate(document=as.numeric(document)) %>%
inner_join(nrc, by = c("term" = "word")) %>%
filter(!grepl("positive|negative", sentiment)) %>%
count(sentiment, term) %>%
spread(sentiment, n, fill = 0) %>%
data.frame(row.names = "term")
# Plot comparison cloud
comparison.cloud(moby_tidy, max.words = 50, title.size = 1.5)
That's great! How does this cloud compare to the one from the previous exercise?
3.2.3 Compare & contrast stacked bar chart
Another way to slice your text is to understand how much of the document(s) are made of positive or negative words. For example a restaurant review may have some positive aspects such as "the food was good" but then continue to add "the restaurant was dirty, the staff was rude and parking was awful." As a result, you may want to understand how much of a document is dedicated to positive vs negative language. In this example it would have a higher negative percentage compared to positive.
One method for doing so is to count()
the positive and
negative words then divide by the number of subjectivity words
identified. In the restaurant review example, "good" would count as 1
positive and "dirty," "rude," and "awful" count as 3 negative terms. A
simple calculation would lead you to believe the restaurant review is
25% positive and 75% negative since there were 4 subjectivity terms.
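The arithmetic for the restaurant review can be written out in a couple of lines of base R. The named vector review_counts is a hand-made tally for illustration.

```r
# Toy tally from the restaurant review example
review_counts <- c(positive = 1, negative = 3)

# Share of each class among all subjectivity terms
percent <- 100 * review_counts / sum(review_counts)
percent
```

The same expression, 100 * n / sum(n), works per group once the counts are grouped by document.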
Start by performing the inner_join()
on a unified tidy data frame containing 4 books, Agamemnon, Oz, Huck
Finn, and Moby Dick. Just like the previous exercise you will use filter()
and grepl()
To perform the count()
you have to group the data by book
and then sentiment. For example all the positive words for Agamemnon
have to be grouped then tallied so that positive words from all books
are not mixed. Luckily, you can pass multiple variables into count()
- Inner join all_books to the lexicon, nrc.
- Filter to keep only rows where sentiment contains "positive" or "negative". That is, use grepl() on the sentiment column, checking without the negation so that "positive|negative" are kept.
# Review tail of all_books
tail(all_books)
## # A tibble: 6 × 5
## term document count author book
## <chr> <chr> <dbl> <chr> <chr>
## 1 ebooks 19117 1 twain innocents_abroad
## 2 email 19117 1 twain innocents_abroad
## 3 hear 19117 1 twain innocents_abroad
## 4 new 19117 1 twain innocents_abroad
## 5 newsletter 19117 1 twain innocents_abroad
## 6 subscribe 19117 1 twain innocents_abroad
# Count by book & sentiment
books_sent_count <- all_books %>%
# Inner join to nrc lexicon
inner_join(nrc, by = c("term" = "word")) %>%
# Keep only positive or negative
filter(grepl("positive|negative", sentiment)) %>%
# Count by book and by sentiment
count(book, sentiment)
Count by book and sentiment, then review the result.
# Review entire object
books_sent_count
## # A tibble: 22 × 3
## book sentiment n
## <chr> <chr> <int>
## 1 bartleby negative 531
## 2 bartleby positive 854
## 3 confidence_man negative 3456
## 4 confidence_man positive 5738
## 5 ct_yankee negative 3985
## 6 ct_yankee positive 6053
## 7 hamlet negative 1666
## 8 hamlet positive 2205
## 9 huck_finn negative 2401
## 10 huck_finn positive 3440
## # … with 12 more rows
- Group books_sent_count by book.
- Mutate to add a column named percent_positive. This should be calculated as 100 times n divided by the sum of n.
book_pos <- books_sent_count %>%
# Group by book
group_by(book) %>%
# Mutate to add % positive column
mutate(percent_positive = 100 * n / sum(n))
- Using book_pos, plot percent_positive vs. book, using sentiment as the fill color.
- Add a column layer with geom_col().
# Plot percent_positive vs. book, filled by sentiment
ggplot(book_pos, aes(book, percent_positive, fill = sentiment)) +
# Add a col layer
geom_col()
Cruising along! Now you know how to see the proportional positivity in text.
3.3 Interpreting visualizations
3.3.1 Kernel density plot
Now that you learned about a kernel density plot you can create one! Remember it's like a smoothed histogram but isn't affected by binwidth. This exercise will help you construct a kernel density plot from sentiment values.
In this exercise you will plot 2 kernel densities. One for Agamemnon and
another for The Wizard of Oz. For both you will perform an inner_join()
with the "afinn" lexicon. Recall the "afinn" lexicon has terms scored
from -5 to 5. Once in a tidy format, both books will retain words and
corresponding scores for the lexicon.
After that, you need to row bind the results into a larger data frame using bind_rows()
and create a plot with ggplot2
From the visual you will be able to understand which book uses more positive versus negative language. There is clearly overlap as negative things happen to Dorothy but you could infer the kernel density is demonstrating a greater probability of positive language in the Wizard of Oz compared to Agamemnon.
We've loaded ag and oz as tidy versions of Agamemnon and The Wizard of Oz respectively, and created afinn as a subset of the tidytext lexicons.
Inner join ag to the lexicon, afinn, assigning to ag_afinn.
ag_afinn <- ag %>%
# Inner join to afinn lexicon
inner_join(afinn, by = c("term" = "word"))
Do the same for the oz dataset, assigning to oz_afinn.
oz_afinn <- oz %>%
# Inner join to afinn lexicon
inner_join(afinn, by = c("term" = "word"))
Use bind_rows() to combine ag_afinn and oz_afinn. Set the .id argument to "book" to create a new column with the name of each book.
# Combine
all_df <- bind_rows(agamemnon = ag_afinn, oz = oz_afinn, .id = "book")
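To see what the .id argument contributes, here is a sketch on two tiny hand-made tibbles. The objects toy_ag and toy_oz and their scores are hypothetical, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical scored terms for two books
toy_ag <- tibble(term = c("grief", "joy"), value = c(-2, 3))
toy_oz <- tibble(term = c("good"), value = c(3))

# The names given to bind_rows() become the values of the .id column
toy_all <- bind_rows(agamemnon = toy_ag, oz = toy_oz, .id = "book")
toy_all
```

Each row keeps a label for the tibble it came from, which is what lets the plot fill by book.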
- Using all_df, plot value, using book as the fill color.
- Set the alpha transparency to 0.3.
# Plot value, filled by book
ggplot(all_df, aes(x = value, fill = book)) +
# Set transparency to 0.3
geom_density(alpha = 0.3) +
theme_gdocs() +
ggtitle("AFINN Score Densities")
Not bad. Kernel densities are great for understanding a distribution.
3.3.2 Box plot
An easy way to compare multiple distributions is with a box plot. This code will help you construct multiple box plots to make a compact visual.
In this exercise the all_book_polarity
object is already loaded. The data frame contains two columns, book
and polarity
. It comprises all books with qdap
's polarity()
function applied. Here are the first 3 rows of the large object.
   | book | polarity |
14 | huck | 0.2773501 |
22 | huck | 0.2581989 |
26 | huck | -0.5773503 |
This exercise introduces tapply()
which allows you to apply functions over a ragged array. You input a
vector of values and then a vector of factors. For each factor level,
the third parameter, a function like min()
, is applied to the values in that group. For example here's some code with tapply()
used on two vectors.
f1 <- as.factor(c("Group1", "Group2", "Group1", "Group2"))
stat1 <- c(1, 2, 1, 2)
tapply(stat1, f1, sum)
The result is an array where Group1
has a value of 2 (1+1) and Group2
has a value of 4 (2+2).
Examine all_book_polarity with str().
# Examine
str(all_book_polarity)
## 'data.frame': 14437 obs. of 2 variables:
## $ book : Factor w/ 4 levels "huck","agamemnon",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ polarity: num 0.277 0.258 -0.577 0.25 0.516 ...
Using tapply(), pass in all_book_polarity$polarity
, all_book_polarity$book
and the summary()
function. This will print the summary statistics for the 4 books in terms of their polarity()
scores. You would expect to see Oz and Huck Finn to have higher
averages than Agamemnon or Moby Dick. Pay close attention to the median.
# Summary by document
tapply(all_book_polarity$polarity, all_book_polarity$book, summary)
## $huck
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.38695 -0.25820 0.23570 0.04156 0.26726 1.60357
## $agamemnon
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4667 -0.3780 -0.3333 -0.1266 0.3333 1.2247
## $moby
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.13333 -0.28868 -0.25000 -0.02524 0.28868 1.84752
## $oz
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2728 -0.2774 0.2582 0.0454 0.2887 1.1877
- Create a box plot with ggplot() by passing in all_book_polarity. Aesthetics should be aes(x = book, y = polarity).
- Using a +, add geom_boxplot() with col = "darkred". Pay close attention to the dark line in each box representing the median.
- Next add another layer called geom_jitter() to add points for each of the words.
# Box plot
ggplot(all_book_polarity, aes(x = book, y = polarity)) +
geom_boxplot(fill = c("#bada55", "#F00B42", "#F001ED", "#BA6E15"), col = "darkred") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 0.02) +
theme_gdocs() +
ggtitle("Book Polarity")
Boom goes the dynamite! Box plots help you quickly compare multiple distributions.
3.3.3 Radar chart
Remember Plutchik's wheel of emotion? The NRC lexicon has the 8 emotions corresponding to the first ring of the wheel. Previously you created a comparison.cloud()
according to the 8 primary emotions. Now you will create a radar chart similar to the wheel in this exercise.
A radarchart
is a two-dimensional representation of multidimensional data (at least
3). In this case the tally of the different emotions for a book are
represented in the chart. Using a radar chart, you can review all 8
emotions simultaneously.
As before we've loaded the "nrc" lexicon as nrc
and moby_huck
which is a combined tidy version of both Moby Dick and Huck Finn.
In this exercise you once again use a negated grepl()
to remove "positive|negative"
emotional classes from the chart. As a refresher here is an example:
object <- tibble %>%
filter(!grepl("positive|negative", column_name))
This exercise reintroduces spread()
which rearranges the tallied emotional words. As a refresher consider this raw data called datacamp
people | food | like |
Nicole | bread | 78 |
Nicole | salad | 66 |
Ted | bread | 99 |
Ted | salad | 21 |
If you applied spread()
as in spread(datacamp, people, like)
the data looks like this.
food | Nicole | Ted |
bread | 78 | 99 |
salad | 66 | 21 |
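The refresher above can be run directly. This sketch rebuilds the datacamp table as a data frame and assumes the tidyr package is installed; the object name wide is introduced only for illustration.

```r
library(tidyr)

datacamp <- data.frame(
  people = c("Nicole", "Nicole", "Ted", "Ted"),
  food   = c("bread", "salad", "bread", "salad"),
  like   = c(78, 66, 99, 21)
)

# people supplies the new column names, like supplies the cell values
wide <- spread(datacamp, people, like)
wide
```

In current tidyr releases spread() is superseded; pivot_wider(datacamp, names_from = people, values_from = like) gives the same layout.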
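- Review moby_huck with tail().
- Inner join moby_huck and nrc.
- Filter, negating "positive|negative" in the sentiment column. Assign the result to books_pos_neg.
- books_pos_neg is forwarded to group_by() with book and sentiment. Then tally() the object with an empty function.
- Spread the books_tally by the book and n columns.
The code below condenses these steps into a single pipeline, using count() in place of group_by() and tally().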
scores <- moby_huck %>%
# Inner join to lexicon
inner_join(nrc, by = c("term" = "word")) %>%
# Drop positive or negative sentiments
filter(!grepl("positive|negative", sentiment)) %>%
# Count by book and sentiment
count(book, sentiment) %>%
# Spread book, using n as values
spread(book, n)
# Review scores
## # A tibble: 8 × 2
## sentiment moby_dick
## <chr> <int>
## 1 anger 2811
## 2 anticipation 4426
## 3 disgust 1996
## 4 fear 4177
## 5 joy 2781
## 6 sadness 3306
## 7 surprise 2074
## 8 trust 4784
Call chartJSRadar() on scores. chartJSRadar() is an htmlwidget from the radarchart library.
# JavaScript radar chart
chartJSRadar(scores)
Radical radar plotting! Bar plots are usually a clearer alternative, but radar charts do look pretty.
3.3.4 Treemaps for groups of documents
Often you will find yourself working with documents in groups, such as
author, product or by company. This exercise lets you learn about the
text while retaining the groups in a compact visual. For example, with
customer reviews grouped by product you may want to explore multiple
dimensions of the customer reviews at the same time. First you could
calculate the polarity()
of the reviews. Another dimension
may be length. Document length can demonstrate the emotional intensity.
If a customer leaves a short "great shoes!" one could infer they are
actually less enthusiastic compared to a lengthier positive review. You
may also want to group reviews by product type such as women's, men's
and children's shoes. A treemap lets you examine all of these
dimensions at once.
For text analysis, within a treemap each individual box represents a document such as a tweet. Documents are grouped in some manner such as author. The size of each box is determined by a numeric value such as number of words or letters. The individual colors are determined by a sentiment score.
After you organize the tibble, you use the treemap
library containing the function treemap()
to make the visual. The code example below declares the data, grouping variables, size, color and other aesthetics.
treemap(data_frame,
index = c("group", "individual_document"),
vSize = "doc_length",
vColor = "avg_score",
type = "value",
title = "Sentiment Scores by Doc",
palette = c("red", "white", "green")
)
The pre-loaded all_books
object contains a combined tidy
format corpus with 4 Shakespeare, 3 Melville and 4 Twain books. Based on
the treemap you should be able to tell who writes longer books, and the
polarity of the author as a whole and for individual books.
Calculate each book's length in a new object called book_length using count() with the book column.
book_length <- all_books %>%
# Count number of words per book
count(book)
- Inner join all_books to the lexicon, afinn.
- Group by author and book.
- Use summarize() to calculate the mean_value as the mean() of value.
- Inner join again, this time to book_length. Join by the book column.
book_tree <- all_books %>%
# Inner join to afinn lexicon
inner_join(afinn, by = c("term" = "word")) %>%
# Group by author, book
group_by(author, book) %>%
# Calculate mean book value
summarize(mean_value = mean(value)) %>%
# Inner join by book
inner_join(book_length, by = "book")
# Examine the results
book_tree
## # A tibble: 11 × 4
## # Groups: author [3]
## author book mean_value n
## <chr> <chr> <dbl> <int>
## 1 melville bartleby 0.0962 8871
## 2 melville confidence_man 0.484 48834
## 3 melville moby_dick 0.144 109996
## 4 shakespeare hamlet 0.0779 18725
## 5 shakespeare julius_caesar 0.0604 13165
## 6 shakespeare macbeth 0.206 12240
## 7 shakespeare romeo_juliet 0.151 16870
## 8 twain ct_yankee 0.189 58229
## 9 twain huck_finn 0.0727 55198
## 10 twain innocents_abroad 0.397 99031
## 11 twain tom_sawyer -0.0292 38831
Draw a treemap, setting the following arguments.
- Use the book_tree from the previous step.
- Specify the aggregation index columns as "author" and "book".
- Specify the vertex size column, vSize, as "n".
- Specify the vertex color column, vColor, as "mean_value".
- Specify a direct mapping from vColor to the palette by setting type = "value".
treemap(
# Use the book tree
book_tree,
# Index by author and book
index = c("author", "book"),
# Use n as vertex size
vSize = "n",
# Color vertices by mean_value
vColor = "mean_value",
# Draw a value type
type = "value",
title = "Book Sentiment Scores",
palette = c("red", "white", "green")
)
Terrific treemapping! Treemaps are great ways to explore grouped data.
4 Case study: Airbnb reviews
Is your property a good rental? What do people look for in a good rental?
4.1 The text mining workflow
4.1.1 Step 1: What do you want to know?
Throughout this chapter you will analyze the text of a corpus of Airbnb housing rental reviews. Which of the following questions can you answer using a sentiment analysis of these reviews?
What document clusters exist in the reviews?
How many words are associated with rental reviews?
What property qualities are listed in positive or negative comments?
What named entities are in the documents?
4.1.2 Step 2: Identify Text Sources
In this short exercise you will load and examine a small corpus of
property rental reviews from around Boston. Hopefully you already know read.csv()
which enables you to load a comma separated file. In this exercise you will also need to specify stringsAsFactors = FALSE
when loading the corpus. This ensures that the reviews are character
vectors, not factors. This may seem mundane but the point of this
chapter is to get you doing an entire workflow from start to finish so
let's begin with data ingestion!
Next you simply apply str() to review the data frame's structure. It is a convenient function for compactly displaying initial values and class types for vectors.
Lastly you will apply dim() to print the dimensions of the data frame. For a data frame, your console will print the number of rows and the number of columns.
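The three calls can be tried on a tiny stand-in before touching the real file. The CSV text below is made up for illustration and is read from a string rather than from disk.

```r
# Hypothetical two-row CSV standing in for the reviews file
csv_text <- "id,comments\n1,Wonderful stay\n2,Great location"

# stringsAsFactors = FALSE keeps the comments as character strings
toy_reviews <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(toy_reviews)   # compact structure: column names, types, first values
dim(toy_reviews)   # number of rows, then number of columns
```

Since R 4.0.0 the default for stringsAsFactors is already FALSE, so passing it explicitly mainly matters on older installations.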
Other functions like head()
, tail()
or summary()
are often used for data exploration but in this case we keep the
examination short so you can get to the fun sentiment analysis!
The Boston property rental reviews are stored in a CSV file located by the predefined variable bos_reviews_file.
# bos_reviews_file has been pre-defined
Load the file with read.csv(). Call the object bos_reviews. Be sure to pass in the parameter stringsAsFactors = FALSE so the comments are not unique factors.
# load raw text
bos_reviews <- read.csv(bos_reviews_file, stringsAsFactors = FALSE)
Examine the structure with the str() function applied to bos_reviews.
# Structure
str(bos_reviews)
## 'data.frame': 1000 obs. of 2 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ comments: chr "My daughter and I had a wonderful stay with Maura. She kept in close touch with us throughout the day as we wer"| __truncated__ "We stay at Elizabeth's place for 3 nights in October 2014.\nThe apartment is really a great place to stay. \nLo"| __truncated__ "If you're staying in South Boston, this is a terrific place to camp out. The apartment and bedroom are lovely, "| __truncated__ "Derian and Brian were great and prompt with their communications with us. The room was as described; it was a s"| __truncated__ ...
Call dim() on the bos_reviews data frame.
# Dimensions
dim(bos_reviews)
## [1] 1000 2
Hurrah! Now that you've imported the data, let's get started with the sentiment analysis.
4.1.3 Quickly examine the basic polarity
When starting a sentiment project, sometimes a quick polarity()
will help you set expectations or learn about the problem. In this exercise (to save time), you will apply polarity()
to a portion of the comments
vector while the larger polarity object is loaded in the background.
Using a kernel density plot you should notice the reviews do not center on 0. Often there are two causes for this sentiment "grade inflation." First, social norms may lead respondents to be pleasant instead of neutral. This, of course, is channel specific. Particularly snarky channels like e-sports or social media posts may skew negative leading to "deflation." These channels have different expectations. A second possible reason could be "feature based sentiment". In some reviews an author may write "the bed was comfortable and nice but the kitchen was dirty and gross." The sentiment of this type of review encompasses multiple features simultaneously and therefore could make an average score skewed.
In a subsequent exercise you will adjust this "grade inflation" but here explore the reviews without any change.
Create practice_pol using polarity()
on the first six reviews as in bos_reviews$comments[1:6].
# Practice apply polarity to first 6 reviews
practice_pol <- polarity(bos_reviews$comments[1:6])
# Review the object
practice_pol
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 6 390 0.747 0.398 1.875
Call summary() on practice_pol$all$polarity
- this will access the overall polarity for all 6 comments.
# Check out the practice polarity
summary(practice_pol$all$polarity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.5009 0.6594 0.7466 1.0779 1.2455
The larger polarity object is loaded as bos_pol. Now apply summary()
to the correct list element that returns all polarity scores of bos_pol.
# Summary for all reviews
summary(bos_pol$all$polarity)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.9712 0.6047 0.8921 0.9022 1.2063 3.7510 1
In the plot below, polarity represents a column of the bos_pol$all data frame.
# Plot Boston polarity all element
ggplot(bos_pol$all, aes(x = polarity, y = ..density..)) +
geom_histogram(binwidth = 0.25, fill = "#bada55", colour = "grey60") +
geom_density(size = 0.75)
Out of the gate and you're crushing it! Quick and easy, yet polarity can help you get familiar with your data.
4.2 Organize (& clean) the text
4.2.1 Create Polarity Based Corpora
In this exercise you will perform Step 3 of the text mining workflow. Although qdap
isn't a tidy package you will mutate()
a new column based on the returned polarity
list representing all polarity (that's a hint BTW) scores. In chapter 3 we used a custom function pol_subsections
which uses only base R declarations. However, in following the tidy principles this exercise uses filter()
then introduces pull()
. The pull()
function works like [[
to extract a single variable.
Once segregated you collapse all the positive and negative comments into two larger documents representing all words among the positive and negative rental reviews.
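The filter, pull and collapse steps described above can be sketched on a toy data frame. The data and the names toy, pos_vec and pos_doc are hypothetical and only illustrate that pull() hands back a plain vector, much like toy[["comments"]] would.

```r
library(dplyr)

# Hypothetical toy data: three short reviews with made-up polarity scores
toy <- data.frame(
  comments = c("great place", "dirty room", "lovely host"),
  polarity = c(0.8, -0.5, 0.6),
  stringsAsFactors = FALSE
)

# pull() returns the column as a plain vector, like toy[["comments"]]
pos_vec <- toy %>%
  filter(polarity > 0) %>%
  pull(comments)

# Collapse the positive comments into one document
pos_doc <- paste(pos_vec, collapse = " ")
pos_doc
```

Because pull() returns a vector rather than a data frame, the result can be passed straight to paste().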
Lastly, you will create a Term Frequency Inverse Document Frequency
(TFIDF) weighted Term Document Matrix (TDM). Since this exercise code
starts with a tidy structure, some of the functions borrowed from tm
are used along with the %>%
operator to keep the style consistent. If the basics of the tm
package aren't familiar, check out the Text Mining with Bag-of-Words in R
course. Instead of counting the number of times a word is used
(frequency), the values in the TDM are penalized for over used terms,
which helps reduce non-informative words.
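To make the penalty concrete, here is a small base R sketch of TF-IDF weighting on made-up counts. It assumes the common formulation of term frequency normalized by document length, multiplied by log2 of the number of documents over the number of documents containing the term. To my understanding this matches the default behaviour of weightTfIdf in tm, but treat it as an illustration rather than the package's exact code.

```r
# Toy term counts (hypothetical): rows are terms, columns are the two documents
counts <- matrix(
  c(4, 3,   # "stay" appears in both documents
    2, 0,   # "quiet" appears only in the positive document
    0, 1),  # "dirty" appears only in the negative document
  nrow = 3, byrow = TRUE,
  dimnames = list(c("stay", "quiet", "dirty"), c("positive", "negative"))
)

# Term frequency: share of each document's words taken up by each term
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: log2(number of documents / documents containing the term)
idf <- log2(ncol(counts) / rowSums(counts > 0))

# TF-IDF: terms shared by every document get weight 0
tfidf <- tf * idf
round(tfidf, 3)
```

With only two documents, as in this exercise, any term used in both the positive and the negative document receives a weight of 0 under this formulation, so only terms unique to one side keep a non-zero score.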
- Mutate to add a polarity column, equal to bos_pol$all$polarity.
- Filter to keep rows where polarity is greater than zero.
- Use pull() to extract the comments column. (Pass this column without quotes.)
- Collapse into a single string, separated by spaces using paste(), passing collapse = " ".
pos_terms <- bos_reviews %>%
# Add polarity column
mutate(polarity = bos_pol$all$polarity) %>%
# Filter for positive polarity
filter(polarity > 0) %>%
# Extract comments column
pull(comments) %>%
# Paste and collapse
paste(collapse = " ")
Do the same again, this time with negative comments.
- Mutate to add a polarity column, equal to bos_pol$all$polarity.
- Filter to keep rows where polarity is less than zero.
- Extract the comments column.
- Collapse into a single string, separated by spaces.
neg_terms <- bos_reviews %>%
# Add polarity column
mutate(polarity = bos_pol$all$polarity) %>%
# Filter for negative polarity
filter(polarity < 0) %>%
# Extract comments column
pull(comments) %>%
# Paste and collapse
paste(collapse = " ")
Create a corpus of both positive and negative comments.
- Use c() to concatenate pos_terms and neg_terms.
- Source the text using VectorSource() without arguments.
- Convert to a volatile corpus by calling VCorpus(), again without arguments.
# Concatenate the terms
all_corpus <- c(pos_terms, neg_terms) %>%
# Source from a vector
VectorSource() %>%
# Create a volatile corpus
VCorpus()
Create a term-document matrix from all_corpus.
- Use term frequency inverse document frequency weighting by setting weighting to weightTfIdf.
- Remove punctuation by setting removePunctuation to TRUE.
- Use English stopwords by setting stopwords to stopwords(kind = "en").
all_tdm <- TermDocumentMatrix(
  # Use all_corpus
  all_corpus,
  control = list(
    # Use TFIDF weighting
    weighting = weightTfIdf,
    # Remove the punctuation
    removePunctuation = TRUE,
    # Use English stopwords
    stopwords = stopwords(kind = "en")
  )
)
# Examine the TDM
# all_tdm
Congrats, now you have a TFIDF weighted TDM splitting up your text!
4.2.2 Create a Tidy Text Tibble!
Since you learned about tidy principles this code helps you organize your data into a tibble so you can then work within the tidyverse!
Previously you learned that applying tidy()
on a TermDocumentMatrix()
object will convert the TDM to a tibble. In this exercise you will create the word data directly from the review column called comments.
First you use unnest_tokens()
to make the text lowercase and tokenize the reviews into single words.
Sometimes it is useful to capture the original word order within each group of a corpus. To do so, use mutate()
. In mutate()
you will use seq_along()
to create a sequence of numbers from 1 to the length of the object. This will capture the word order as it was written.
In the tm
package, you would use removeWords()
to remove stopwords. In the tidyverse you first need to load the stop words lexicon and then apply an anti_join()
between the tidy text data frame and the stopwords.
Create tidy_reviews by piping (%>%
) the original reviews object bos_reviews
to the unnest_tokens()
function. Pass in a new column name, word
and declare the comments
column. Remember in the tidyverse you don't need a $
or quotes.
# Vector to tibble
tidy_reviews <- bos_reviews %>%
unnest_tokens(word, comments)
Update tidy_reviews by piping tidy_reviews
to group_by()
with the column id
. Then %>%
it again to mutate()
. Within mutate() create a new variable original_word_order
equal to seq_along(word).
# Group by and mutate
tidy_reviews <- tidy_reviews %>%
group_by(id) %>%
mutate(original_word_order = seq_along(word))
# Quick review
tidy_reviews
## # A tibble: 70,975 × 3
## # Groups: id [1,000]
## id word original_word_order
## <int> <chr> <int>
## 1 1 my 1
## 2 1 daughter 2
## 3 1 and 3
## 4 1 i 4
## 5 1 had 5
## 6 1 a 6
## 7 1 wonderful 7
## 8 1 stay 8
## 9 1 with 9
## 10 1 maura 10
## # … with 70,965 more rows
# Load stopwords
data("stop_words")
stop_words
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
Create tidy_reviews_without_stopwords by passing the original tidy_reviews
to anti_join()
with a %>%
. Within anti_join()
pass in the predetermined stop_words lexicon.
# Perform anti-join
tidy_reviews_without_stopwords <- tidy_reviews %>%
  anti_join(stop_words)
Tidy Text Tibbles are a mouthful but you did it!
4.2.3 Compare Tidy Sentiment to Qdap Polarity
Here you will learn that differing sentiment methods will cause
different results. Often you will simply need to have results align
directionally although the specifics may be different. In the last
exercise you created tidy_reviews
which is a tidy data frame of rental reviews with one word per row. Earlier in the chapter, you calculated and plotted qdap
's basic polarity()
function. This showed you the reviews tend to be positive.
Now let's perform a similar analysis the tidytext
way! Recall from an earlier chapter you will perform an inner_join()
followed by count()
and then a spread().
Lastly, you will create a new column using mutate()
and passing in positive - negative.
The get_sentiments() function with "bing" will obtain the bing subjectivity lexicon. Call the lexicon bing.
# Get the correct lexicon
bing <- get_sentiments("bing")
Supply the lexicon object bing, the new column name (polarity) and its calculation within mutate().
# Calculate polarity for each review
pos_neg <- tidy_reviews %>%
inner_join(bing) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(polarity = positive - negative)
Call summary() on the new object pos_neg. Although the values are different, are most rental reviews similarly positive compared to using polarity()? Do you see "grade inflation?"
# Check outcome
summary(pos_neg)
## id negative positive polarity
## Min. : 1 Min. : 0.0000 Min. : 0.000 Min. :-10.000
## 1st Qu.: 251 1st Qu.: 0.0000 1st Qu.: 4.000 1st Qu.: 3.000
## Median : 499 Median : 0.0000 Median : 6.000 Median : 5.000
## Mean : 500 Mean : 0.6633 Mean : 6.571 Mean : 5.908
## 3rd Qu.: 748 3rd Qu.: 1.0000 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :1000 Max. :14.0000 Max. :42.000 Max. : 37.000
Hooray! Often different polarity methods yield similar results.
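The join, count, spread and mutate pattern used in this exercise can be traced on a handful of tokens. The sketch below uses a made-up five-word lexicon (toy_lexicon) in place of get_sentiments("bing") and made-up tokens (toy_tokens). It also counts by id explicitly instead of relying on the data already being grouped by id.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens from two reviews
toy_tokens <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  word = c("wonderful", "clean", "noisy", "dirty", "nasty"),
  stringsAsFactors = FALSE
)

# Hypothetical mini lexicon standing in for get_sentiments("bing")
toy_lexicon <- data.frame(
  word      = c("wonderful", "clean", "noisy", "dirty", "nasty"),
  sentiment = c("positive", "positive", "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

toy_pos_neg <- toy_tokens %>%
  # Keep only words found in the lexicon
  inner_join(toy_lexicon, by = "word") %>%
  # Count positive and negative words per review
  count(id, sentiment) %>%
  # One column per sentiment, filling missing combinations with 0
  spread(sentiment, n, fill = 0) %>%
  # Net polarity per review
  mutate(polarity = positive - negative)

toy_pos_neg
```

In this toy case review 1 has two positive words and one negative word, and review 2 has two negative words only, so the fill = 0 argument is what keeps review 2 from ending up with a missing positive count.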
4.3 Feature Extraction & Analysis
4.3.2 Comparison Cloud
This exercise will create a common visual for you to understand term
frequency. Specifically, you will review the most frequent terms from
among the positive and negative collapsed documents. Recall the TermDocumentMatrix all_tdm
you created earlier. Instead of 1000 rental reviews the matrix contains 2 documents containing all reviews separated by the polarity() score.
It's usually easier to change the TDM to a matrix. From there you simply rename the columns. Remember that the colnames()
function is called on the left side of the assignment operator as shown below.
colnames(OBJECT) <- c("COLUMN_NAME1", "COLUMN_NAME2")
Once done, you will reorder the matrix to see the most positive and negative words. Review these terms so you can answer the conclusion exercises!
Lastly, you'll visualize the terms using comparison.cloud().
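The renaming and ordering steps described above can be tried on a small matrix first. The weights and the name toy_m below are made up for illustration; only base R is used.

```r
# Hypothetical TF-IDF weights: rows are terms, columns are the two documents
toy_m <- matrix(
  c(0.004, 0,
    0,     0.002,
    0.003, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("walk", "condition", "quiet"), Docs = c("1", "2"))
)

# colnames() sits on the left of the assignment to rename the columns
colnames(toy_m) <- c("positive", "negative")

# order() returns row positions, largest positive weight first
order_by_pos <- order(toy_m[, 1], decreasing = TRUE)

# Index the rows by that ordering to see the top positive terms
toy_m[order_by_pos, ]
```

Note that order() returns positions rather than sorted values, which is why the result is used as a row index into the matrix.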
Convert all_tdm to a matrix called all_tdm_m
using as.matrix().
# Matrix
all_tdm_m <- as.matrix(all_tdm)
Use colnames() on all_tdm_m
to declare c("positive", "negative").
# Column names
colnames(all_tdm_m) <- c("positive", "negative")
Apply order() to all_tdm_m[,1]
and set decreasing = TRUE.
# Top pos words
order_by_pos <- order(all_tdm_m[, 1], decreasing = TRUE)
Index all_tdm_m by order_by_pos. Pipe this (%>%) to head()
with n = 10.
# Review top 10 pos words
all_tdm_m[order_by_pos, ] %>% head(10)
## Docs
## Terms positive negative
## walk 0.004546528 0
## definitely 0.004133207 0
## quiet 0.003749410 0
## staying 0.003719887 0
## wonderful 0.003040860 0
## city 0.003011337 0
## restaurants 0.003011337 0
## highly 0.002745631 0
## station 0.002627539 0
## enjoyed 0.002420879 0
Create order_by_neg by applying order() to the second column, all_tdm_m[,2],
and use decreasing = TRUE.
# Top neg words
order_by_neg <- order(all_tdm_m[, 2], decreasing = TRUE)
Index all_tdm_m by order_by_neg. Pipe this to head()
with n = 10.
# Review top 10 neg words
all_tdm_m[order_by_neg, ] %>% head(10)
## Docs
## Terms positive negative
## condition 0 0.002156722
## demand 0 0.001437815
## disappointed 0 0.001437815
## dumpsters 0 0.001437815
## hygiene 0 0.001437815
## inform 0 0.001437815
## nasty 0 0.001437815
## safety 0 0.001437815
## shouldve 0 0.001437815
## sounds 0 0.001437815
Draw a comparison.cloud() on all_tdm_m. Specify max.words equal to 20.
comparison.cloud(
  # Use the term-document matrix
  all_tdm_m,
  # Limit to 20 words
  max.words = 20,
  colors = c("darkgreen", "darkred")
)
Success. Overused…yes. Still useful…yes!
4.3.3 Scaled Comparison Cloud
Recall the "grade inflation" of polarity scores on the rental reviews?
Sometimes, another way to uncover an insight is to scale the scores back
to 0 then perform the corpus subset. This means some of the previously
positive comments may become part of the negative subsection or vice
versa since the mean is changed to 0. This exercise will help you scale
the scores and then re-plot the comparison.cloud().
Removing the "grade inflation" can help provide additional insights.
Previously you applied polarity()
to the bos_reviews$comments
and created a comparison.cloud(). In this exercise you will scale()
the outcome before creating the comparison.cloud().
See if this shows something different in the visual!
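A quick base R sketch shows how centering can move reviews across the zero line. The scores in toy_polarity are invented to skew positive and are not taken from bos_pol.

```r
# Hypothetical polarity scores that skew positive, like the rental reviews
toy_polarity <- c(0.2, 0.4, 0.9, 1.2, -0.2)

# scale() subtracts the mean and divides by the standard deviation
toy_scaled <- as.numeric(scale(toy_polarity))

# Reviews counted as positive before and after centering
sum(toy_polarity > 0)
sum(toy_scaled > 0)
```

Here four of the five toy scores are above zero before scaling but only two remain above zero afterwards, because mildly positive scores fall below the inflated mean.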
Since this is largely a review exercise, a lot of the code exists, just fill in the correct objects and parameters!
Review a section of the pre-loaded bos_pol$all while indexing [1:6,1:3].
# Review
bos_pol$all[1:6, 1:3]
## all wc polarity
## 1 all 77 1.1851900
## 2 all 78 1.2455047
## 3 all 39 0.4803845
## 4 all 101 0.7562283
## 5 all 16 0.2500000
## 6 all 79 0.5625440
Add a new column called scaled_polarity to bos_reviews with scale()
applied to the polarity score column bos_pol$all$polarity.
# Scale/center & append
bos_reviews$scaled_polarity <- scale(bos_pol$all$polarity)
Create pos_comments with subset() on bos_reviews$comments where the new column bos_reviews$scaled_polarity
is greater than (>) zero.
# Subset positive comments
pos_comments <- subset(bos_reviews$comments, bos_reviews$scaled_polarity > 0)
Create neg_comments with subset() on bos_reviews$comments where the new column bos_reviews$scaled_polarity
is less than (<) zero.
# Subset negative comments
neg_comments <- subset(bos_reviews$comments, bos_reviews$scaled_polarity < 0)
Create pos_terms using paste()
on pos_comments.
# Paste and collapse the positive comments
pos_terms <- paste(pos_comments, collapse = " ")
Create neg_terms with paste()
on neg_comments.
# Paste and collapse the negative comments
neg_terms <- paste(neg_comments, collapse = " ")
Organize the collapsed pos_terms and neg_terms
documents into a single corpus called all_terms.
# Organize
all_terms <- c(pos_terms,neg_terms)
Follow the usual tm workflow by nesting VectorSource()
inside VCorpus()
applied to all_terms.
# VCorpus
all_corpus <- VCorpus(VectorSource(all_terms))
Make a TermDocumentMatrix() using the all_corpus
object. Note this is a TfIdf weighted TDM with basic cleaning functions.
all_tdm <- TermDocumentMatrix(
  all_corpus,
  control = list(
    weighting = weightTfIdf,
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en")
  )
)
Convert all_tdm to all_tdm_m
using as.matrix()
. Then rename the columns in the existing code to "positive"
and "negative".
# Column names
all_tdm_m <- as.matrix(all_tdm)
colnames(all_tdm_m) <- c("positive", "negative")
Apply comparison.cloud() to the matrix object all_tdm_m. Take notice of the new most frequent negative words. Maybe it will uncover an unknown insight!
# Comparison cloud
comparison.cloud(
  all_tdm_m,
  max.words = 100,
  colors = c("darkgreen", "darkred")
)
Almost there! Another comparison cloud to help you extract your insights.
4.4 Reach a conclusion
4.4.1 Confirm an expected conclusion
Refer to the following plot from the exercise "Comparison Cloud":
It's not surprising that the most common positive words for rentals included "walk", "restaurants", "subway" and "stations". In contrast, top negative terms included "condition", "dumpsters", "hygiene", "safety" and "sounds".
If you were looking to rent your clean apartment and it was close to public transit and good restaurants, would it get a favorable review?
4.4.2 Choose a less expected insight
Refer to the following plot from the exercise "Scaled Comparison Cloud":
For your rental, should you use an automated posting?