Cluster analysis is a form of exploratory data analysis (EDA) in which observations are divided into meaningful groups that share common characteristics (features).
Distance vs Similarity
Calculate & plot the distance between two players
You’ve obtained the coordinates relative to the center of the field for two players in a soccer match and would like to calculate the distance between them.
In this exercise you will plot the positions of the two players and manually calculate the distance between them using the Euclidean distance formula.
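For reference, the Euclidean distance between two points (x1, y1) and (x2, y2) is sqrt((x1 - x2)^2 + (y1 - y2)^2), which is exactly what the manual calculation below computes.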
x <- c(5, 15)
y <- c(4, 10)
two_players <- data.frame(x,y)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Plot the positions of the players
ggplot(two_players, aes(x = x, y = y)) +
geom_point() +
# Assuming a 40x60 field
lims(x = c(-30,30), y = c(-20, 20))
# Split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]
# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$x - player2$x)^2 + (player1$y - player2$y)^2 )
player_distance
## [1] 11.6619
Excellent work! Using the formula is a great way to learn how distance is measured between two observations.
Using the dist() function
Using the Euclidean formula manually may be practical for two observations but quickly becomes cumbersome when measuring the distances between many observations.
The dist() function simplifies this process by calculating distances between our observations (rows) using their features (columns). In this case the observations are the player positions and the dimensions are their x and y coordinates.
Note: the default distance calculation for the dist() function is Euclidean distance.
three_players <- rbind(two_players, c(0, 20))
# Calculate the Distance Between two_players
dist_two_players <- dist(two_players)
dist_two_players
## 1
## 2 11.6619
# Calculate the Distance Between three_players
dist_three_players <- dist(three_players)
dist_three_players
## 1 2
## 2 11.66190
## 3 16.76305 18.02776
The dist() function makes life easier when working with many dimensions and observations.
Scaling our features
Effects of scale
You have learned that when a variable is on a larger scale than the other variables in your data it may disproportionately influence the resulting distances between your observations. Let's see this in action by observing a sample of data from the trees data set.
You will leverage the scale() function, which by default centers & scales our column features.
Our variables are the following:
Girth <- c(8.3, 8.6, 10.5)
Height <- c(840, 780, 864)
three_trees <- data.frame(Girth, Height)
# Calculate distance for three_trees
dist_trees <- dist(three_trees)
# Scale three trees & calculate the distance
scaled_three_trees <- scale(three_trees)
dist_scaled_trees <- dist(scaled_three_trees)
# Output the results of both Matrices
print('Without Scaling')
## [1] "Without Scaling"
dist_trees
## 1 2
## 2 60.00075
## 3 24.10062 84.02149
print('With Scaling')
## [1] "With Scaling"
dist_scaled_trees
## 1 2
## 2 1.409365
## 3 1.925659 2.511082
Notice that before scaling, observations 1 & 3 are the closest, but after scaling, observations 1 & 2 have the smallest distance.
Jaccard index (similarity): J(A, B) = |A ∩ B| / |A ∪ B|, where the intersection counts the cases where both are true and the union counts the cases where either is ever true. The Jaccard distance is 1 − J(A, B).
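As a quick illustration (a sketch; jaccard is a hypothetical helper), the index can be computed directly from two binary vectors, and dist(..., method = 'binary') returns the corresponding Jaccard distance, 1 - J:
jaccard <- function(a, b) {
  both <- sum(a == 1 & b == 1)   # cases where both are true
  either <- sum(a == 1 | b == 1) # cases where either is ever true
  both / either
}
a <- c(1, 0, 1, 1)
b <- c(1, 1, 0, 1)
jaccard(a, b)                        # 0.5
dist(rbind(a, b), method = 'binary') # 0.5, i.e. 1 - jaccard(a, b)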
Calculating distance between categorical variables
In this exercise you will explore how to calculate binary (Jaccard) distances. In order to calculate distances we will first have to dummify our categories using the dummy.data.frame() function from the dummies library.
You will use a small collection of survey observations stored in the data frame job_survey with the following columns:
- job_satisfaction: possible options are "High", "Mid", "Low"
- is_happy: possible options are "Yes", "No"
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
job_satisfaction <- as.factor(c("Low", "Low", "High", "Low", "Mid"))
is_happy <- as.factor(c("No", "No", "Yes", "No", "No"))
job_survey <- data.frame(job_satisfaction, is_happy)
# Dummify the Survey Data
(dummy_survey <- dummy.data.frame(job_survey))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## job_satisfactionHigh job_satisfactionLow job_satisfactionMid is_happyNo
## 1 0 1 0 1
## 2 0 1 0 1
## 3 1 0 0 0
## 4 0 1 0 1
## 5 0 0 1 1
## is_happyYes
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
# Calculate the Distance
dist_survey <- dist(dummy_survey, method = 'binary')
# Print the Original Data
job_survey
## job_satisfaction is_happy
## 1 Low No
## 2 Low No
## 3 High Yes
## 4 Low No
## 5 Mid No
# Print the Distance Matrix
dist_survey
## 1 2 3 4
## 2 0.0000000
## 3 1.0000000 1.0000000
## 4 0.0000000 0.0000000 1.0000000
## 5 0.6666667 0.6666667 1.0000000 0.6666667
Notice that this distance metric successfully captures that observations 1, 2 & 4 are identical (distance of 0).
Hierarchical clustering
Linkage criteria
- Complete linkage: maximum distance between two sets
- Single linkage: minimum distance between two sets
- Average linkage: average distance between two sets
Calculating linkage
Let us revisit the example with three players on a field. The distance matrix between these three players is shown below and is available as the variable dist_players.
From this we can tell that the first group to form is between players 1 & 2, since they are the closest to one another with a Euclidean distance of approximately 11.
Now you want to apply the three linkage methods you have learned to determine what the distance of this group is to player 3.
|   | 1  | 2  |
|---|----|----|
| 2 | 11 |    |
| 3 | 16 | 18 |
dist_players <- dist(three_players)
# Extract the pair distances
distance_1_2 <- dist_players[1]
distance_1_3 <- dist_players[2]
distance_2_3 <- dist_players[3]
# Calculate the complete distance between group 1-2 and 3
complete <- max(c(distance_1_3, distance_2_3))
complete
## [1] 18.02776
# Calculate the single distance between group 1-2 and 3
single <- min(c(distance_1_3, distance_2_3))
single
## [1] 16.76305
# Calculate the average distance between group 1-2 and 3
average <- mean(c(distance_1_3, distance_2_3))
average
## [1] 17.39541
The choice of the linkage method can drastically change the result.
Assign cluster membership
In this exercise you will leverage the hclust() function to calculate the iterative linkage steps and the cutree() function to extract the cluster assignments for the desired number (k) of clusters.
You are given the positions of 12 players at the start of a 6v6 soccer match. This is stored in the lineup data frame.
You know that this match has two teams (k = 2), let’s use the clustering methods you learned to assign which team each player belongs in based on their position.
Note: the linkage method is specified via hclust(distance_matrix, method = "complete").
x <- c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26)
y <- c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
(lineup <- tibble(x, y))
## # A tibble: 12 x 2
## x y
## <dbl> <dbl>
## 1 -1 1
## 2 -2 -3
## 3 8 6
## 4 7 -8
## 5 -12 8
## 6 -15 0
## 7 -13 -10
## 8 15 16
## 9 21 2
## 10 12 -15
## 11 -25 1
## 12 26 0
# Calculate the Distance
dist_players <- dist(lineup)
# Perform the hierarchical clustering using the complete linkage
hc_players <- hclust(dist_players, method = "complete")
# Calculate the assignment vector with a k of 2
clusters_k2 <- cutree(hc_players, k = 2)
# Create a new data frame storing these results
lineup_k2_complete <- mutate(lineup, cluster = clusters_k2)
Fantastic job! In the next exercise we will explore this result.
Exploring the clusters
Because cluster analysis is always in part qualitative, it is incredibly important to have the necessary tools to explore the results of the clustering.
In this exercise you will explore the data frame you created in the previous exercise, lineup_k2_complete.
Reminder: the lineup_k2_complete data frame contains the x & y positions of 12 players at the start of a 6v6 soccer game, to which you have added cluster assignments built with Euclidean distance, complete linkage, and k = 2.
# Count the cluster assignments
count(lineup_k2_complete, cluster)
## # A tibble: 2 x 2
## cluster n
## <int> <int>
## 1 1 6
## 2 2 6
# Plot the positions of the players and color them using their cluster
ggplot(lineup_k2_complete, aes(x = x, y = y, color = factor(cluster))) +
geom_point()
You’re doing great! Think carefully about whether these results make sense to you and why.
Comparing average, single & complete linkage
You are now ready to analyze the clustering results of the lineup dataset using the dendrogram plot. This will give you a new perspective on the effect the choice of linkage method has on your resulting cluster analysis.
# Prepare the Distance Matrix
dist_players <- dist(lineup)
# Generate hclust for complete, single & average linkage methods
hc_complete <- hclust(dist_players, method = "complete")
hc_single <- hclust(dist_players, method = "single")
hc_average <- hclust(dist_players, method = "average")
# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1,3))
plot(hc_complete, main = 'Complete Linkage') # maximum Euclidean distance <= height of the branch
plot(hc_single, main = 'Single Linkage')     # minimum Euclidean distance <= height of the branch
plot(hc_average, main = 'Average Linkage')   # average Euclidean distance <= height of the branch
Excellent! Did you notice how the trees all look different? In the coming exercises you will see how visualizing this structure can be helpful for building clusters.
All members of the clusters shown here will have a Euclidean distance between them no greater than the cut height of 15.
Clusters based on height
In previous exercises you have grouped your observations into clusters using a pre-defined number of clusters (k). In this exercise you will leverage the visual representation of the dendrogram in order to group your observations into clusters using a maximum height (h), below which clusters form.
You will work with the color_branches() function from the dendextend library in order to visually inspect the clusters that form at any height along the dendrogram.
The hc_players object has been carried over from your previous work with the soccer lineup data.
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.14.0
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")
# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)
# Plot the dendrogram
plot(dend_players)
# Color branches by cluster formed from the cut at a height of 20 & plot
dend_20 <- color_branches(dend_players, h = 20)
# Plot the dendrogram with clusters colored below height 20
plot(dend_20)
# Color branches by cluster formed from the cut at a height of 40 & plot
dend_40 <- color_branches(dend_players, h = 40)
# Plot the dendrogram with clusters colored below height 40
plot(dend_40)
Excellent! Can you see that the height that you use to cut the tree greatly influences the number of clusters and their size? Consider taking a moment to play with other values of height before continuing.
Exploring the branches cut from the tree
The cutree() function you used in exercises 5 & 6 can also be used to cut a tree at a given height by using the h parameter. Take a moment to explore the clusters you have generated from the previous exercises based on the heights 20 & 40.
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")
# Calculate the assignment vector with a h of 20
clusters_h20 <- cutree(hc_players, h = 20)
# Create a new data frame storing these results
lineup_h20_complete <- mutate(lineup, cluster = clusters_h20)
# Calculate the assignment vector with a h of 40
clusters_h40 <- cutree(hc_players, h = 40)
# Create a new data frame storing these results
lineup_h40_complete <- mutate(lineup, cluster = clusters_h40)
# Plot the positions of the players and color them using their cluster for height = 20
ggplot(lineup_h20_complete, aes(x = x, y = y, color = factor(cluster))) +
geom_point()
# Plot the positions of the players and color them using their cluster for height = 40
ggplot(lineup_h40_complete, aes(x = x, y = y, color = factor(cluster))) +
geom_point()
Great job! You can now explore your clusters using both k and h parameters.
The height of any branch is determined by the linkage and distance decisions (in this case complete linkage and Euclidean distance), and the members of every cluster that forms below a desired height have a maximum linkage distance amongst themselves that is less than that height.
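A quick sanity check of this property (a sketch, reusing lineup and the h = 20 cut from the exercise above): under complete linkage, the largest pairwise distance within each cluster should not exceed the cut height.
clusters_h20 <- cutree(hc_players, h = 20)
sapply(split(lineup, clusters_h20), function(members) {
  if (nrow(members) < 2) return(0) # singleton clusters trivially satisfy the bound
  max(dist(members))               # should be <= 20 for every cluster
})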
Exploring more than 2 dimensions
Segment wholesale customers
You’re now ready to use hierarchical clustering to perform market segmentation (i.e. use consumer characteristics to group them into subgroups).
In this exercise you are provided with the amount spent by 45 different clients of a wholesale distributor for the food categories Milk, Grocery & Frozen. This is stored in the data frame customers_spend. Assign these clients into meaningful clusters.
Note: for this exercise you can assume that, because the data is all of the same type (amount spent), you will not need to scale it.
customers_spend <- readRDS("_data/ws_customers.rds")
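A quick way to convince yourself of that assumption (a sketch, not part of the original exercise):
# All three features are dollar amounts; compare their ranges and spreads
summary(customers_spend)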
# Calculate Euclidean distance between customers
dist_customers <- dist(customers_spend)
# Generate a complete linkage analysis
hc_customers <- hclust(dist_customers, method = "complete")
# Plot the dendrogram
plot(hc_customers)
# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h = 15000)
# Generate the segmented customers data frame
segment_customers <- mutate(customers_spend, cluster = clust_customers)
Excellent! Let’s move on to the next exercise and explore these clusters.
Explore wholesale customer clusters
Continuing your work on the wholesale dataset you are now ready to analyze the characteristics of these clusters.
Since you are working with more than 2 dimensions it would be challenging to visualize a scatter plot of the clusters; instead you will rely on summary statistics to explore them. In this exercise you will analyze the mean amount spent in each cluster for all three categories.
dist_customers <- dist(customers_spend)
hc_customers <- hclust(dist_customers)
clust_customers <- cutree(hc_customers, h = 15000)
segment_customers <- mutate(customers_spend, cluster = clust_customers)
# Count the number of customers that fall into each cluster
count(segment_customers, cluster)
## cluster n
## 1 1 5
## 2 2 29
## 3 3 5
## 4 4 6
# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)
# Plot the colored dendrogram
plot(dend_colored)
# Calculate the mean for each category
segment_customers %>%
group_by(cluster) %>%
summarise_all(list(mean))
## # A tibble: 4 x 4
## cluster Milk Grocery Frozen
## <int> <dbl> <dbl> <dbl>
## 1 1 16950 12891. 991.
## 2 2 2513. 5229. 1796.
## 3 3 10452. 22551. 1355.
## 4 4 1250. 3917. 10889.
Great work! You’ve gathered a bunch of information about these clusters, now let’s see what can be interpreted from them.
All 4 statements are reasonable, but whether they are meaningful depends heavily on the business context of the clustering.
K-means on a soccer field
In the previous chapter you used the lineup dataset to learn about hierarchical clustering; in this chapter you will use the same data to learn about k-means clustering. As a reminder, the lineup data frame contains the positions of 12 players at the start of a 6v6 soccer match.
Just like before, you know that this match has two teams on the field so you can perform a k-means analysis using k = 2 in order to determine which player belongs to which team.
Note that in the kmeans() function k is specified using the centers parameter.
# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)
# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster
# Create a new data frame appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)
# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
geom_point()
Well done! Knowing the desired number of clusters ahead of time can be very helpful when performing a k-means analysis. In the next section we will see what happens when we use an incorrect value of k.
K-means on a soccer field (part 2)
In the previous exercise you successfully used the k-means algorithm to cluster the two teams from the lineup data frame. This time, let's explore what happens when you use a k of 3.
You will see that the algorithm will still run, but does it actually make sense in this context…
# Build a kmeans model
model_km3 <- kmeans(lineup, centers = 3)
# Extract the cluster assignment vector from the kmeans model
clust_km3 <- model_km3$cluster
# Create a new data frame appending the cluster assignment
lineup_km3 <- mutate(lineup, cluster = clust_km3)
# Plot the positions of the players and color them using their cluster
ggplot(lineup_km3, aes(x = x, y = y, color = factor(cluster))) +
geom_point()
Does this result make sense? Remember we only have 2 teams on the field. It's very important to remember that k-means will run with any k from 1 up to the number of observations, but that doesn't mean the results will be meaningful.
Many K’s many models
While the lineup dataset clearly has a known value of k, often the optimal number of clusters isn't known and must be estimated.
In this exercise you will leverage map_dbl() from the purrr library to run k-means using values of k ranging from 1 to 10 and extract the total within-cluster sum of squares metric from each one. This will be the first step towards visualizing the elbow plot.
library(purrr)
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = lineup, centers = k)
model$tot.withinss
})
# Generate a data frame containing both k and tot_withinss
(elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
))
## k tot_withinss
## 1 1 3489.9167
## 2 2 1434.5000
## 3 3 940.8333
## 4 4 637.2500
## 5 5 406.0000
## 6 6 265.1667
## 7 7 171.0000
## 8 8 134.5000
## 9 9 82.0000
## 10 10 23.0000
Great work! In the next exercise you will plot the elbow plot from this data.
Elbow (Scree) plot
In the previous exercises you have calculated the total within-cluster sum of squares for values of k ranging from 1 to 10. You can visualize this relationship using a line plot to create what is known as an elbow plot (or scree plot).
When looking at an elbow plot you want to see a sharp decline from one k to another followed by a more gradual decrease in slope. The last value of k before the slope of the plot levels off suggests a “good” value of k.
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = lineup, centers = k)
model$tot.withinss
})
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
Fantastic! You have learned how to create and visualize elbow plots as a tool for finding a “good” value of k. In the next section you will add another tool to your arsenal for finding k.
You can see that there is a sharp change in the slope of this line at k = 2 that makes an "elbow" shape. Furthermore, this is supported by the prior knowledge that there are two teams in this data, so a k of 2 is desired.
Silhouette analysis
Silhouette analysis allows you to calculate how similar each observation is to its assigned cluster relative to other clusters. This metric (silhouette width) ranges from -1 to 1 for each observation in your data and can be interpreted as follows:
- Values close to 1 suggest that the observation is well matched to its assigned cluster
- Values close to 0 suggest that the observation is borderline between two clusters
- Negative values suggest that the observation may be better matched to a neighboring cluster
In this exercise you will leverage the pam() and silhouette() functions from the cluster library to perform silhouette analysis and compare the results of models with a k of 2 and a k of 3. You'll continue working with the lineup dataset.
Pay close attention to the silhouette plot, does each observation clearly belong to its assigned cluster for k = 3?
library(cluster)
# Generate a k-means model using the pam() function with a k = 2
pam_k2 <- pam(lineup, k = 2)
# Plot the silhouette visual for the pam_k2 model
plot(silhouette(pam_k2))
# Generate a k-means model using the pam() function with a k = 3
pam_k3 <- pam(lineup, k = 3)
# Plot the silhouette visual for the pam_k3 model
plot(silhouette(pam_k3))
Great work! Did you notice that for k = 2, no observation has a silhouette width close to 0? What about the fact that for k = 3, observation 3 is close to 0 and is negative? This suggests that k = 3 is not the right number of clusters.
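If you prefer a number to a plot, the average silhouette width can also be read straight off the pam model object (the same silinfo$avg.width slot used later in this course):
# Average silhouette width for each model; higher is better
pam_k2$silinfo$avg.width
pam_k3$silinfo$avg.width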
Segmenting with K-means
Revisiting wholesale data: “Best” k
At the end of Chapter 2 you explored the wholesale distributor data customers_spend using hierarchical clustering. This time you will analyze this data using the k-means clustering tools covered in this chapter.
The first step will be to determine the “best” value of k using average silhouette width.
A refresher about the data: it contains records of the amount spent by 45 different clients of a wholesale distributor for the food categories Milk, Grocery & Frozen. This is stored in the data frame customers_spend. For this exercise you can assume that, because the data is all of the same type (amount spent), you will not need to scale it.
# Use map_dbl to run many models with varying value of k
(sil_width <- map_dbl(2:10, function(k){
model <- pam(x = customers_spend, k = k)
model$silinfo$avg.width
}))
## [1] 0.5842282 0.3741916 0.4225053 0.4397891 0.3609199 0.3270878 0.3263239
## [8] 0.3319012 0.3255676
# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)
# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
You’re doing great! From the plot I hope you noticed that k = 2 has the highest average silhouette width and is the “best” value of k, which we will move forward with.
Revisiting wholesale data: Exploration
From the previous analysis you have found that k = 2 has the highest average silhouette width. In this exercise you will continue to analyze the wholesale customer data by building and exploring a k-means model with 2 clusters.
set.seed(42)
# Build a k-means model for the customers_spend with a k of 2
model_customers <- kmeans(customers_spend, centers = 2)
# Extract the vector of cluster assignments from the model
clust_customers <- model_customers$cluster
# Build the segment_customers data frame
segment_customers <- mutate(customers_spend, cluster = clust_customers)
# Calculate the size of each cluster
count(segment_customers, cluster)
## cluster n
## 1 1 35
## 2 2 10
# Calculate the mean for each category
segment_customers %>%
group_by(cluster) %>%
summarise_all(list(mean))
## # A tibble: 2 x 4
## cluster Milk Grocery Frozen
## <int> <dbl> <dbl> <dbl>
## 1 1 2296. 5004 3354.
## 2 2 13701. 17721 1173
Well done! It seems that in this case cluster 1 consists of individuals who proportionally spend more on Frozen food, while cluster 2 customers spend more on Milk and Grocery. Did you notice that when you explored this data using hierarchical clustering, the method resulted in 4 clusters, while k-means gave you 2? Both of these results are valid, but deciding which one is appropriate would require more subject matter expertise. Before you proceed with the next chapter, remember: generating clusters is a science, but interpreting them is an art.
Next steps: hierarchical clustering
Initial exploration of the data
You are presented with data from the Occupational Employment Statistics (OES) program which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. You would like to use this data to identify clusters of occupations that maintained similar income trends.
The data is stored in your environment as the data.matrix oes.
Before you begin to cluster this data you should determine whether any pre-processing steps (such as scaling and imputation) are necessary.
Leverage the functions head() and summary() to explore the oes data in order to determine which pre-processing steps are necessary:
There are no missing values, no categorical variables, and the features are on the same scale, so no pre-processing is required.
oes <- readRDS("_data/oes.rds")
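The exploration itself isn't shown in these notes; a minimal sketch of the checks described above:
# Peek at the data, check the summary statistics, and count missing values
head(oes)
summary(oes)
sum(is.na(oes))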
# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = 'euclidean')
# Generate an average linkage analysis
hc_oes <- hclust(dist_oes, method = 'average')
# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)
# Plot the dendrogram
plot(dend_oes)
# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)
# Plot the colored dendrogram
plot(dend_colored)
Well done! Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another. Let’s continue this exploration.
Hierarchical clustering: Preparing for exploration
You have now created a potential clustering for the oes data; before you can explore these clusters with ggplot2 you will need to process the oes data matrix into a tidy data frame with each occupation assigned to its cluster.
dist_oes <- dist(oes, method = 'euclidean')
hc_oes <- hclust(dist_oes, method = 'average')
library(tibble)
library(tidyr)
# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')
# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)
# Generate the segmented oes data frame
(clust_oes <- mutate(df_oes, cluster = cut_oes))
## occupation 2001 2002 2003 2004 2005 2006 2007 2008
## 1 Management 70800 78870 83400 87090 88450 91930 96150 100310
## 2 Business Operations 50580 53350 56000 57120 57930 60000 62410 64720
## 3 Computer Science 60350 61630 64150 66370 67100 69240 72190 74500
## 4 Architecture/Engineering 56330 58020 60390 63060 63910 66190 68880 71430
## 5 Life/Physical/Social Sci. 49710 52380 54930 57550 58030 59660 62020 64280
## 6 Community Services 34190 34630 35800 37050 37530 39000 40540 41790
## 7 Legal 69030 77330 78590 81180 81070 85360 88450 92270
## 8 Education/Training/Library 39130 40160 41390 42810 43450 45320 46610 48460
## 9 Arts/Design/Entertainment 39770 41660 43350 43820 44310 46110 48410 50670
## 10 Healthcare Practitioners 49930 53990 56240 58310 59170 62030 65020 67890
## 11 Healthcare Support 21900 22410 22960 23510 23850 24610 25600 26340
## 12 Protective Service 32530 33330 34430 35240 35750 37040 38750 40200
## 13 Food Preparation 16720 17180 17400 17620 17840 18430 19440 20220
## 14 Grounds Cleaning & Maint. 20380 20850 21290 21670 21930 22580 23560 24370
## 15 Personal Care 21010 21370 21570 22080 22180 22920 23980 24120
## 16 Sales 28920 30610 31560 32280 32800 34350 35240 36080
## 17 Office Administrative 27230 27910 28540 29390 29710 30370 31200 32220
## 18 Farming/Fishing/Forestry 19630 20220 20290 20670 21010 21810 22640 23560
## 19 Construction 35450 36340 37000 37890 38260 39290 40620 42350
## 20 Installation/Repair/Maint. 34960 35780 36560 37620 38050 39060 39930 41230
## 21 Production 27600 28190 28930 29480 29890 30480 31310 32320
## 22 Transportation/Moving 26560 27220 27630 28250 28820 29460 30680 31450
## 2010 2011 2012 2013 2014 2015 2016 cluster
## 1 105440 107410 108570 110550 112490 115020 118020 1
## 2 67690 68740 69550 71020 72410 73800 75070 2
## 3 77230 78730 80180 82010 83970 86170 87880 2
## 4 75550 77120 79000 80100 81520 82980 84300 2
## 5 66390 67470 68360 69400 70070 71220 72930 2
## 6 43180 43830 44240 44710 45310 46160 47200 3
## 7 96940 98380 98570 99620 101110 103460 105980 1
## 8 50440 50870 51210 51500 52210 53000 54520 3
## 9 52290 53850 54490 55580 55790 56980 58390 3
## 10 71280 72730 73540 74740 76010 77800 79160 2
## 11 26920 27370 27780 28300 28820 29520 30470 3
## 12 42490 42730 43050 43510 43980 44610 45810 3
## 13 21240 21430 21380 21580 21980 22850 23850 3
## 14 25300 25560 25670 26010 26370 27080 28010 3
## 15 24590 24620 24550 24710 24980 25650 26510 3
## 16 36790 37520 37990 38200 38660 39320 40560 3
## 17 33470 34120 34410 34900 35530 36330 37260 3
## 18 24330 24300 24230 24330 25160 26360 27810 3
## 19 43870 44630 44960 45630 46600 47580 48900 3
## 20 42810 43390 43870 44420 45220 45990 46690 3
## 21 33770 34220 34500 34930 35490 36220 37190 3
## 22 32660 33200 33590 33860 34460 35160 36070 3
# Create a tidy data frame by gathering the year and values into two columns
(gathered_oes <- gather(data = clust_oes,
key = year,
value = mean_salary,
-occupation, -cluster))
## occupation cluster year mean_salary
## 1 Management 1 2001 70800
## 2 Business Operations 2 2001 50580
## 3 Computer Science 2 2001 60350
## 4 Architecture/Engineering 2 2001 56330
## 5 Life/Physical/Social Sci. 2 2001 49710
## 6 Community Services 3 2001 34190
## 7 Legal 1 2001 69030
## 8 Education/Training/Library 3 2001 39130
## 9 Arts/Design/Entertainment 3 2001 39770
## 10 Healthcare Practitioners 2 2001 49930
## 11 Healthcare Support 3 2001 21900
## 12 Protective Service 3 2001 32530
## 13 Food Preparation 3 2001 16720
## 14 Grounds Cleaning & Maint. 3 2001 20380
## 15 Personal Care 3 2001 21010
## 16 Sales 3 2001 28920
## 17 Office Administrative 3 2001 27230
## 18 Farming/Fishing/Forestry 3 2001 19630
## 19 Construction 3 2001 35450
## 20 Installation/Repair/Maint. 3 2001 34960
## 21 Production 3 2001 27600
## 22 Transportation/Moving 3 2001 26560
## 23 Management 1 2002 78870
## 24 Business Operations 2 2002 53350
## 25 Computer Science 2 2002 61630
## 26 Architecture/Engineering 2 2002 58020
## 27 Life/Physical/Social Sci. 2 2002 52380
## 28 Community Services 3 2002 34630
## 29 Legal 1 2002 77330
## 30 Education/Training/Library 3 2002 40160
## 31 Arts/Design/Entertainment 3 2002 41660
## 32 Healthcare Practitioners 2 2002 53990
## 33 Healthcare Support 3 2002 22410
## 34 Protective Service 3 2002 33330
## 35 Food Preparation 3 2002 17180
## 36 Grounds Cleaning & Maint. 3 2002 20850
## 37 Personal Care 3 2002 21370
## 38 Sales 3 2002 30610
## 39 Office Administrative 3 2002 27910
## 40 Farming/Fishing/Forestry 3 2002 20220
## 41 Construction 3 2002 36340
## 42 Installation/Repair/Maint. 3 2002 35780
## 43 Production 3 2002 28190
## 44 Transportation/Moving 3 2002 27220
## 45 Management 1 2003 83400
## 46 Business Operations 2 2003 56000
## 47 Computer Science 2 2003 64150
## 48 Architecture/Engineering 2 2003 60390
## 49 Life/Physical/Social Sci. 2 2003 54930
## 50 Community Services 3 2003 35800
## 51 Legal 1 2003 78590
## 52 Education/Training/Library 3 2003 41390
## 53 Arts/Design/Entertainment 3 2003 43350
## 54 Healthcare Practitioners 2 2003 56240
## 55 Healthcare Support 3 2003 22960
## 56 Protective Service 3 2003 34430
## 57 Food Preparation 3 2003 17400
## 58 Grounds Cleaning & Maint. 3 2003 21290
## 59 Personal Care 3 2003 21570
## 60 Sales 3 2003 31560
## 61 Office Administrative 3 2003 28540
## 62 Farming/Fishing/Forestry 3 2003 20290
## 63 Construction 3 2003 37000
## 64 Installation/Repair/Maint. 3 2003 36560
## 65 Production 3 2003 28930
## 66 Transportation/Moving 3 2003 27630
## 67 Management 1 2004 87090
## 68 Business Operations 2 2004 57120
## 69 Computer Science 2 2004 66370
## 70 Architecture/Engineering 2 2004 63060
## 71 Life/Physical/Social Sci. 2 2004 57550
## 72 Community Services 3 2004 37050
## 73 Legal 1 2004 81180
## 74 Education/Training/Library 3 2004 42810
## 75 Arts/Design/Entertainment 3 2004 43820
## 76 Healthcare Practitioners 2 2004 58310
## 77 Healthcare Support 3 2004 23510
## 78 Protective Service 3 2004 35240
## 79 Food Preparation 3 2004 17620
## 80 Grounds Cleaning & Maint. 3 2004 21670
## 81 Personal Care 3 2004 22080
## 82 Sales 3 2004 32280
## 83 Office Administrative 3 2004 29390
## 84 Farming/Fishing/Forestry 3 2004 20670
## 85 Construction 3 2004 37890
## 86 Installation/Repair/Maint. 3 2004 37620
## 87 Production 3 2004 29480
## 88 Transportation/Moving 3 2004 28250
## 89 Management 1 2005 88450
## 90 Business Operations 2 2005 57930
## 91 Computer Science 2 2005 67100
## 92 Architecture/Engineering 2 2005 63910
## 93 Life/Physical/Social Sci. 2 2005 58030
## 94 Community Services 3 2005 37530
## 95 Legal 1 2005 81070
## 96 Education/Training/Library 3 2005 43450
## 97 Arts/Design/Entertainment 3 2005 44310
## 98 Healthcare Practitioners 2 2005 59170
## 99 Healthcare Support 3 2005 23850
## 100 Protective Service 3 2005 35750
## 101 Food Preparation 3 2005 17840
## 102 Grounds Cleaning & Maint. 3 2005 21930
## 103 Personal Care 3 2005 22180
## 104 Sales 3 2005 32800
## 105 Office Administrative 3 2005 29710
## 106 Farming/Fishing/Forestry 3 2005 21010
## 107 Construction 3 2005 38260
## 108 Installation/Repair/Maint. 3 2005 38050
## 109 Production 3 2005 29890
## 110 Transportation/Moving 3 2005 28820
## 111 Management 1 2006 91930
## 112 Business Operations 2 2006 60000
## 113 Computer Science 2 2006 69240
## 114 Architecture/Engineering 2 2006 66190
## 115 Life/Physical/Social Sci. 2 2006 59660
## 116 Community Services 3 2006 39000
## 117 Legal 1 2006 85360
## 118 Education/Training/Library 3 2006 45320
## 119 Arts/Design/Entertainment 3 2006 46110
## 120 Healthcare Practitioners 2 2006 62030
## 121 Healthcare Support 3 2006 24610
## 122 Protective Service 3 2006 37040
## 123 Food Preparation 3 2006 18430
## 124 Grounds Cleaning & Maint. 3 2006 22580
## 125 Personal Care 3 2006 22920
## 126 Sales 3 2006 34350
## 127 Office Administrative 3 2006 30370
## 128 Farming/Fishing/Forestry 3 2006 21810
## 129 Construction 3 2006 39290
## 130 Installation/Repair/Maint. 3 2006 39060
## 131 Production 3 2006 30480
## 132 Transportation/Moving 3 2006 29460
## 133 Management 1 2007 96150
## 134 Business Operations 2 2007 62410
## 135 Computer Science 2 2007 72190
## 136 Architecture/Engineering 2 2007 68880
## 137 Life/Physical/Social Sci. 2 2007 62020
## 138 Community Services 3 2007 40540
## 139 Legal 1 2007 88450
## 140 Education/Training/Library 3 2007 46610
## 141 Arts/Design/Entertainment 3 2007 48410
## 142 Healthcare Practitioners 2 2007 65020
## 143 Healthcare Support 3 2007 25600
## 144 Protective Service 3 2007 38750
## 145 Food Preparation 3 2007 19440
## 146 Grounds Cleaning & Maint. 3 2007 23560
## 147 Personal Care 3 2007 23980
## 148 Sales 3 2007 35240
## 149 Office Administrative 3 2007 31200
## 150 Farming/Fishing/Forestry 3 2007 22640
## 151 Construction 3 2007 40620
## 152 Installation/Repair/Maint. 3 2007 39930
## 153 Production 3 2007 31310
## 154 Transportation/Moving 3 2007 30680
## 155 Management 1 2008 100310
## 156 Business Operations 2 2008 64720
## 157 Computer Science 2 2008 74500
## 158 Architecture/Engineering 2 2008 71430
## 159 Life/Physical/Social Sci. 2 2008 64280
## 160 Community Services 3 2008 41790
## 161 Legal 1 2008 92270
## 162 Education/Training/Library 3 2008 48460
## 163 Arts/Design/Entertainment 3 2008 50670
## 164 Healthcare Practitioners 2 2008 67890
## 165 Healthcare Support 3 2008 26340
## 166 Protective Service 3 2008 40200
## 167 Food Preparation 3 2008 20220
## 168 Grounds Cleaning & Maint. 3 2008 24370
## 169 Personal Care 3 2008 24120
## 170 Sales 3 2008 36080
## 171 Office Administrative 3 2008 32220
## 172 Farming/Fishing/Forestry 3 2008 23560
## 173 Construction 3 2008 42350
## 174 Installation/Repair/Maint. 3 2008 41230
## 175 Production 3 2008 32320
## 176 Transportation/Moving 3 2008 31450
## 177 Management 1 2010 105440
## 178 Business Operations 2 2010 67690
## 179 Computer Science 2 2010 77230
## 180 Architecture/Engineering 2 2010 75550
## 181 Life/Physical/Social Sci. 2 2010 66390
## 182 Community Services 3 2010 43180
## 183 Legal 1 2010 96940
## 184 Education/Training/Library 3 2010 50440
## 185 Arts/Design/Entertainment 3 2010 52290
## 186 Healthcare Practitioners 2 2010 71280
## 187 Healthcare Support 3 2010 26920
## 188 Protective Service 3 2010 42490
## 189 Food Preparation 3 2010 21240
## 190 Grounds Cleaning & Maint. 3 2010 25300
## 191 Personal Care 3 2010 24590
## 192 Sales 3 2010 36790
## 193 Office Administrative 3 2010 33470
## 194 Farming/Fishing/Forestry 3 2010 24330
## 195 Construction 3 2010 43870
## 196 Installation/Repair/Maint. 3 2010 42810
## 197 Production 3 2010 33770
## 198 Transportation/Moving 3 2010 32660
## 199 Management 1 2011 107410
## 200 Business Operations 2 2011 68740
## 201 Computer Science 2 2011 78730
## 202 Architecture/Engineering 2 2011 77120
## 203 Life/Physical/Social Sci. 2 2011 67470
## 204 Community Services 3 2011 43830
## 205 Legal 1 2011 98380
## 206 Education/Training/Library 3 2011 50870
## 207 Arts/Design/Entertainment 3 2011 53850
## 208 Healthcare Practitioners 2 2011 72730
## 209 Healthcare Support 3 2011 27370
## 210 Protective Service 3 2011 42730
## 211 Food Preparation 3 2011 21430
## 212 Grounds Cleaning & Maint. 3 2011 25560
## 213 Personal Care 3 2011 24620
## 214 Sales 3 2011 37520
## 215 Office Administrative 3 2011 34120
## 216 Farming/Fishing/Forestry 3 2011 24300
## 217 Construction 3 2011 44630
## 218 Installation/Repair/Maint. 3 2011 43390
## 219 Production 3 2011 34220
## 220 Transportation/Moving 3 2011 33200
## 221 Management 1 2012 108570
## 222 Business Operations 2 2012 69550
## 223 Computer Science 2 2012 80180
## 224 Architecture/Engineering 2 2012 79000
## 225 Life/Physical/Social Sci. 2 2012 68360
## 226 Community Services 3 2012 44240
## 227 Legal 1 2012 98570
## 228 Education/Training/Library 3 2012 51210
## 229 Arts/Design/Entertainment 3 2012 54490
## 230 Healthcare Practitioners 2 2012 73540
## 231 Healthcare Support 3 2012 27780
## 232 Protective Service 3 2012 43050
## 233 Food Preparation 3 2012 21380
## 234 Grounds Cleaning & Maint. 3 2012 25670
## 235 Personal Care 3 2012 24550
## 236 Sales 3 2012 37990
## 237 Office Administrative 3 2012 34410
## 238 Farming/Fishing/Forestry 3 2012 24230
## 239 Construction 3 2012 44960
## 240 Installation/Repair/Maint. 3 2012 43870
## 241 Production 3 2012 34500
## 242 Transportation/Moving 3 2012 33590
## 243 Management 1 2013 110550
## 244 Business Operations 2 2013 71020
## 245 Computer Science 2 2013 82010
## 246 Architecture/Engineering 2 2013 80100
## 247 Life/Physical/Social Sci. 2 2013 69400
## 248 Community Services 3 2013 44710
## 249 Legal 1 2013 99620
## 250 Education/Training/Library 3 2013 51500
## 251 Arts/Design/Entertainment 3 2013 55580
## 252 Healthcare Practitioners 2 2013 74740
## 253 Healthcare Support 3 2013 28300
## 254 Protective Service 3 2013 43510
## 255 Food Preparation 3 2013 21580
## 256 Grounds Cleaning & Maint. 3 2013 26010
## 257 Personal Care 3 2013 24710
## 258 Sales 3 2013 38200
## 259 Office Administrative 3 2013 34900
## 260 Farming/Fishing/Forestry 3 2013 24330
## 261 Construction 3 2013 45630
## 262 Installation/Repair/Maint. 3 2013 44420
## 263 Production 3 2013 34930
## 264 Transportation/Moving 3 2013 33860
## 265 Management 1 2014 112490
## 266 Business Operations 2 2014 72410
## 267 Computer Science 2 2014 83970
## 268 Architecture/Engineering 2 2014 81520
## 269 Life/Physical/Social Sci. 2 2014 70070
## 270 Community Services 3 2014 45310
## 271 Legal 1 2014 101110
## 272 Education/Training/Library 3 2014 52210
## 273 Arts/Design/Entertainment 3 2014 55790
## 274 Healthcare Practitioners 2 2014 76010
## 275 Healthcare Support 3 2014 28820
## 276 Protective Service 3 2014 43980
## 277 Food Preparation 3 2014 21980
## 278 Grounds Cleaning & Maint. 3 2014 26370
## 279 Personal Care 3 2014 24980
## 280 Sales 3 2014 38660
## 281 Office Administrative 3 2014 35530
## 282 Farming/Fishing/Forestry 3 2014 25160
## 283 Construction 3 2014 46600
## 284 Installation/Repair/Maint. 3 2014 45220
## 285 Production 3 2014 35490
## 286 Transportation/Moving 3 2014 34460
## 287 Management 1 2015 115020
## 288 Business Operations 2 2015 73800
## 289 Computer Science 2 2015 86170
## 290 Architecture/Engineering 2 2015 82980
## 291 Life/Physical/Social Sci. 2 2015 71220
## 292 Community Services 3 2015 46160
## 293 Legal 1 2015 103460
## 294 Education/Training/Library 3 2015 53000
## 295 Arts/Design/Entertainment 3 2015 56980
## 296 Healthcare Practitioners 2 2015 77800
## 297 Healthcare Support 3 2015 29520
## 298 Protective Service 3 2015 44610
## 299 Food Preparation 3 2015 22850
## 300 Grounds Cleaning & Maint. 3 2015 27080
## 301 Personal Care 3 2015 25650
## 302 Sales 3 2015 39320
## 303 Office Administrative 3 2015 36330
## 304 Farming/Fishing/Forestry 3 2015 26360
## 305 Construction 3 2015 47580
## 306 Installation/Repair/Maint. 3 2015 45990
## 307 Production 3 2015 36220
## 308 Transportation/Moving 3 2015 35160
## 309 Management 1 2016 118020
## 310 Business Operations 2 2016 75070
## 311 Computer Science 2 2016 87880
## 312 Architecture/Engineering 2 2016 84300
## 313 Life/Physical/Social Sci. 2 2016 72930
## 314 Community Services 3 2016 47200
## 315 Legal 1 2016 105980
## 316 Education/Training/Library 3 2016 54520
## 317 Arts/Design/Entertainment 3 2016 58390
## 318 Healthcare Practitioners 2 2016 79160
## 319 Healthcare Support 3 2016 30470
## 320 Protective Service 3 2016 45810
## 321 Food Preparation 3 2016 23850
## 322 Grounds Cleaning & Maint. 3 2016 28010
## 323 Personal Care 3 2016 26510
## 324 Sales 3 2016 40560
## 325 Office Administrative 3 2016 37260
## 326 Farming/Fishing/Forestry 3 2016 27810
## 327 Construction 3 2016 48900
## 328 Installation/Repair/Maint. 3 2016 46690
## 329 Production 3 2016 37190
## 330 Transportation/Moving 3 2016 36070
Great work! You now have the data frames necessary to explore the results of this clustering.
Hierarchical clustering: Plotting occupational clusters
You have successfully created all the parts necessary to explore the results of this hierarchical clustering work. In this exercise you will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.
# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)
## Management Legal
## 1 1
## Business Operations Computer Science
## 2 2
## Architecture/Engineering Life/Physical/Social Sci.
## 2 2
## Healthcare Practitioners Community Services
## 2 3
## Education/Training/Library Arts/Design/Entertainment
## 3 3
## Healthcare Support Protective Service
## 3 3
## Food Preparation Grounds Cleaning & Maint.
## 3 3
## Personal Care Sales
## 3 3
## Office Administrative Farming/Fishing/Forestry
## 3 3
## Construction Installation/Repair/Maint.
## 3 3
## Production Transportation/Moving
## 3 3
# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
geom_line(aes(group = occupation))
Cool, huh! From this work it looks like both the Management & Legal professions (cluster 1) experienced the most rapid growth over these 15 years. Let’s see what we can learn by exploring this data using k-means.
Next steps: k-means clustering
K-means: Elbow analysis
In the previous exercises you used the dendrogram to propose a clustering with 3 clusters. In this exercise you will leverage the k-means elbow plot to propose the “best” number of clusters.
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = oes, centers = k)
model$tot.withinss
})
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
Fascinating! So the elbow analysis proposes a different value of k; in the next section let’s see what we can learn from silhouette width analysis.
K-means: Average Silhouette Widths
So hierarchical clustering resulted in 3 clusters and the elbow method suggests 2. In this exercise, use average silhouette widths to explore what the “best” value of k should be.
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
model <- pam(oes, k = k)
model$silinfo$avg.width
})
# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)
# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
Great work! It seems that this analysis results in another value of k, this time 7 is the top contender (although 2 comes very close).
All of the above methods are correct, but the best way to cluster is highly dependent on how you intend to use the results afterwards.
There is no quantitative way to determine which of these clustering approaches is the right one without further exploration. You are done with the course! If you enjoyed the material, feel free to send Dmitriy a thank you via Twitter. He’ll appreciate it.