Start by loading your data. You can find the sample data used for this tutorial here.
The example data come in two forms. If you are loading an R data set (.rda), use the following:
load(file.choose())
If you are loading a .csv file, then you will have the easiest time with the scripts below if you name the data frame “data”. So, use the following for .csv:
data <- read.csv(file.choose(), header = TRUE, na = "")
Next, install the two packages that will do most of the work for you: ‘ClustOfVar’ and ‘cluster’. ‘ClustOfVar’ was developed to make it easy to cluster qualitative and quantitative variables together, and the ‘cluster’ package was developed for clustering in general.
install.packages("ClustOfVar", dependencies = TRUE)
install.packages("cluster", dependencies = TRUE)
We will be using both of these packages today. The ‘ClustOfVar’ package will be used to find clusters in the variables (essentially data reduction), and the ‘cluster’ package will be used to find clusters among the respondents.
Load the ‘ClustOfVar’ package.
library(ClustOfVar)
Now, take a look at the variables and data you are using. The ‘names()’ command will give you a list of variable names, and the ‘summary()’ command will give you an idea of what each of them contains.
names(data)
## [1] "PARTICIP" "SAFETY" "USE" "GENDER" "SEXEXP" "PREVIOUS"
## [7] "SELFCON" "PERCEIVE"
summary(data)
## PARTICIP SAFETY USE GENDER
## Min. : 1.00 Min. :0.00 Unprotected:57 Male :50
## 1st Qu.: 25.75 1st Qu.:1.00 Condom Used:43 Female:50
## Median : 50.50 Median :3.00
## Mean : 50.50 Mean :2.29
## 3rd Qu.: 75.25 3rd Qu.:3.00
## Max. :100.00 Max. :6.00
## SEXEXP PREVIOUS SELFCON
## Min. : 0.00 No Condom :50 Min. : 0.00
## 1st Qu.: 2.00 Condom used :47 1st Qu.: 2.00
## Median : 3.00 First Time with partner: 3 Median : 4.00
## Mean : 4.01 Mean : 3.94
## 3rd Qu.: 6.00 3rd Qu.: 5.25
## Max. :10.00 Max. :11.00
## PERCEIVE
## Min. :0.00
## 1st Qu.:2.00
## Median :3.50
## Mean :3.11
## 3rd Qu.:4.00
## Max. :7.00
In this example, I am using all the variables. But I need to separate them into categorical and numeric bins. The ‘xquant’ object will contain all the numeric variables, and the ‘xqual’ object will be used to hold all the categorical variables.
For your own data, take the time to think through the variables that you wish to use. There should be some logical reason for selecting the variables that you use.
Having stated that, though, this is actually a data reduction technique at this point, so you can use it to reduce a large mass of variables to a few meaningful clusters if you so desire.
xquant <- data[,c(1,2,5,7,8)] # Numeric variables
xqual <- data[,c(3,4,6)] # Categorical variables
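As an aside (not part of the original tutorial), you can also split the columns by type rather than by index, so the split still works if the columns are reordered. The toy data frame below is a made-up stand-in for your own data:

```r
# Hedged alternative: select columns by type instead of by position.
df <- data.frame(
  SEXEXP = c(1.5, 2.0, 3.5),                                        # numeric
  USE    = factor(c("Unprotected", "Condom Used", "Unprotected")),  # categorical
  SAFETY = c(1, 3, 2)                                               # numeric
)
xquant <- df[, sapply(df, is.numeric), drop = FALSE]  # numeric columns only
xqual  <- df[, sapply(df, is.factor),  drop = FALSE]  # factor columns only
names(xquant)  # "SEXEXP" "SAFETY"
names(xqual)   # "USE"
```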
Next, you can seek out patterns in the variables using a dendrogram (tree diagram). A dendrogram is a graphic display that shows how closely the variables are related to one another.
tree <- hclustvar(xquant, xqual)
plot(tree)
Looking at the dendrogram, above, note that “particip” and “previous” are closely related, as indicated by the fact that they are linked together very directly by converging branches. “selfcon”, “perceive”, and “use” are more closely related to one another than they are to any other variables in the data set.
Use this to try to decide how many clusters you think may be in the data set. Personally, I would say that there are three clusters of variables that are closely related, given how high you have to go on the “tree”. But, as you will see below, other methods may disagree with this initial assessment.
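If your variables happen to be all numeric and you cannot install ‘ClustOfVar’, a rough base-R sketch of the same idea uses correlation as a similarity between variables. The built-in ‘mtcars’ data set stands in for the tutorial's data here; ‘hclustvar()’ remains the right tool for mixed numeric and categorical variables:

```r
# Sketch, base R only: cluster numeric variables by correlation distance.
d    <- as.dist(1 - abs(cor(mtcars)))   # dissimilarity between variables
tree <- hclust(d, method = "average")   # hierarchical clustering of the variables
plot(tree)                              # dendrogram, read as described above
grp  <- cutree(tree, k = 3)             # hard assignment into 3 variable clusters
```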
Another method of estimating the number of clusters is to use a “stability” plot. This is a plot that “Evaluates the stability of partitions obtained from a hierarchy of p variables.” For more information on how it works, enter ‘?stability’ in the R console.
stab <- stability(tree, B=50) # "B=50" refers to the number of bootstrap samples to use in the estimation.
# If a plot is not produced, you can generate one using:
#plot(stab)
Look at the peaks in the plot above. The maximum for stability is 1.0, so the higher the peak, the better. In this case, the plot seems to suggest four stable partitions are present in the data.
K-means clustering is another popular method for grouping or partitioning variables (or respondents - see below) into groups.
k.means <- kmeansvar(xquant, xqual, init=4)
summary(k.means)
##
## Call:
## kmeansvar(X.quanti = xquant, X.quali = xqual, init = 4)
##
## number of iterations: 1
##
## Data:
## number of observations: 100
## number of variables: 8
## number of numerical variables: 5
## number of categorical variables: 3
## number of clusters: 4
##
## Cluster 1 :
## squared loading
## SEXEXP 0.54
## SELFCON 0.54
##
##
## Cluster 2 :
## squared loading
## PARTICIP 0.27
## SAFETY 0.60
## PERCEIVE 0.62
##
##
## Cluster 3 :
## squared loading
## USE 0.62
## PREVIOUS 0.62
##
##
## Cluster 4 :
## squared loading
## GENDER 1
##
##
## Gain in cohesion (in %): 46.7
k.means$cluster # This produces a list of what variables are in which cluster.
## PARTICIP SAFETY SEXEXP SELFCON PERCEIVE USE GENDER PREVIOUS
## 2 2 1 1 2 3 4 3
You can also save the cluster information in a .csv if you like.
write.csv(k.means$cluster, file="variableClusters.csv")
You can use basically the same tools to cluster respondents into “types”, based on their responses.
library(cluster)
d <- daisy(data, metric="gower")
# Keep in mind that this is only for mixed numeric and categorical data.
# See the note below if you are using only numeric variables.
Note: if you are using only numeric variables, then you will get better results using ‘d <- dist(data, method=“euclidean”)’.
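One caveat (not covered in the original tutorial): with Euclidean distance, variables measured on large scales dominate the distance simply because of their units, so it usually pays to standardize first. A minimal sketch, with ‘mtcars’ standing in for your numeric data:

```r
# Sketch: standardize numeric columns before computing Euclidean distances.
num <- scale(mtcars)                    # z-score every column (mean 0, sd 1)
d   <- dist(num, method = "euclidean")  # pairwise distances between rows
summary(as.vector(d))                   # quick look at the distance spread
```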
Use this dissimilarity matrix just as you used the tree above, only now you are clustering respondents rather than variables.
fit <- hclust(d=d, method="complete") # Also try: method="ward.D"
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=4) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=4, border="red") # draw dendrogram with red borders around the 4 clusters
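After cutting the tree, it is worth checking how the respondents spread across the clusters; a very lopsided split (one huge cluster, several tiny ones) often suggests a poor choice of k. A base-R sketch, again using ‘mtcars’ rows as stand-in respondents:

```r
# Sketch: count respondents per cluster after cutting the dendrogram.
d      <- dist(scale(mtcars))            # stand-in numeric data
fit    <- hclust(d, method = "complete")
groups <- cutree(fit, k = 4)             # one cluster label per respondent
table(groups)                            # respondents in each cluster
```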
As above, you can group the respondents using k-means and plot the resulting clusters.
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
kfit$cluster
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 3 1 3 1 3 3 1 3 3 1 1 3 3 3 1 3 1 1
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 1 1 3 3 1 1 3 1 1 3 3 3 1 3 3 1 3 1
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 1 1 3 3 1 3 1 1 3 3 1 1 1 1 3 3 1 3
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 3 1 1 2 4 4 2 4 2 4 2 4 4 4 2 4 2 2
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 4 4 4 4 2 2 2 4 2 4 2 4 2 2 4 2 2 2
## 91 92 93 94 95 96 97 98 99 100
## 4 2 2 4 2 2 4 4 4 4
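The tutorial fixes k at 4 on the strength of the stability plot; another common check (not part of the original) is the “elbow” method, which plots the total within-cluster sum of squares for a range of k values and looks for the bend. A base-R sketch, with ‘mtcars’ as stand-in data:

```r
# Sketch: elbow plot for choosing k in k-means.
set.seed(42)                                  # k-means starting points are random
x   <- scale(mtcars)
wss <- sapply(1:6, function(k)
  kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The within-cluster sum of squares always falls as k grows; the point where the curve flattens is the candidate number of clusters.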
To save the results as a .csv file, use:
write.csv(kfit$cluster, file="RespondentClusters.csv")
Alternatively, you can add the cluster information into the data set. To add a variable titled “Cluster”:
data[,"Cluster"] <- kfit$cluster
write.csv(data, file="ClusteredRespondents.csv")