Start by loading your data. You can find the sample data used for this tutorial here.
The example data come in two forms. If you are loading an R data set (.rda), use the following:
load(file.choose())
If you are loading a .csv file, then you will have the easiest time with the scripts below if you name the data frame “data”. So, use the following for .csv:
data <- read.csv(file.choose(), header = TRUE, na = "")
Next, install the two packages that will do most of the work for you: ‘ClustOfVar’ and ‘cluster’. ‘ClustOfVar’ was developed to make it easy to cluster qualitative and quantitative variables together, and the ‘cluster’ package was developed for clustering in general.
install.packages("ClustOfVar", dependencies = TRUE)
install.packages("cluster", dependencies = TRUE)
We will be using both of these packages today. The ‘ClustOfVar’ package will be used to find clusters in the variables (essentially data reduction), and the ‘cluster’ package will be used to find clusters among the respondents.
Load the ‘ClustOfVar’ package.
library(ClustOfVar)
Now, take a look at the variables and data you are using. The ‘names()’ command will give you a list of variable names, and the ‘summary()’ command will give you an idea of what each of them contains.
names(data)
## [1] "PARTICIP" "SAFETY" "USE" "GENDER" "SEXEXP" "PREVIOUS"
## [7] "SELFCON" "PERCEIVE"
summary(data)
## PARTICIP SAFETY USE GENDER
## Min. : 1.00 Min. :0.00 Unprotected:57 Male :50
## 1st Qu.: 25.75 1st Qu.:1.00 Condom Used:43 Female:50
## Median : 50.50 Median :3.00
## Mean : 50.50 Mean :2.29
## 3rd Qu.: 75.25 3rd Qu.:3.00
## Max. :100.00 Max. :6.00
## SEXEXP PREVIOUS SELFCON
## Min. : 0.00 No Condom :50 Min. : 0.00
## 1st Qu.: 2.00 Condom used :47 1st Qu.: 2.00
## Median : 3.00 First Time with partner: 3 Median : 4.00
## Mean : 4.01 Mean : 3.94
## 3rd Qu.: 6.00 3rd Qu.: 5.25
## Max. :10.00 Max. :11.00
## PERCEIVE
## Min. :0.00
## 1st Qu.:2.00
## Median :3.50
## Mean :3.11
## 3rd Qu.:4.00
## Max. :7.00
In this example, I am using all the variables. But I need to separate them into categorical and numeric bins. The ‘xquant’ object will contain all the numeric variables, and the ‘xqual’ object will be used to hold all the categorical variables.
For your own data, take the time to think through the variables that you wish to use. There should be some logical reason for selecting the variables that you use.
Having stated that, though, this is actually a data reduction technique at this point, so you can use it to reduce a large mass of variables to a few meaningful clusters if you so desire.
xquant <- data[,c(1,2,5,7,8)] # Numeric variables
xqual <- data[,c(3,4,6)] # Categorical variables
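As an aside (not part of the original tutorial), you can also split the columns by type rather than by index, so the split still works if the columns are reordered. The toy data frame below is a made-up stand-in for your own data:

```r
# Hedged alternative: select columns by type instead of by position.
df <- data.frame(
  SEXEXP = c(1.5, 2.0, 3.5),                                        # numeric
  USE    = factor(c("Unprotected", "Condom Used", "Unprotected")),  # categorical
  SAFETY = c(1, 3, 2)                                               # numeric
)
xquant <- df[, sapply(df, is.numeric), drop = FALSE]  # numeric columns only
xqual  <- df[, sapply(df, is.factor),  drop = FALSE]  # factor columns only
names(xquant)  # "SEXEXP" "SAFETY"
names(xqual)   # "USE"
```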
Next, you can seek out patterns in the variables using a dendrogram (tree diagram). A dendrogram is a graphic display that shows how closely the variables are related to one another.
tree <- hclustvar(xquant, xqual)
plot(tree)
Looking at the dendrogram, above, note that “particip” and “previous” are closely related, as indicated by the fact that they are linked together very directly by converging branches. “selfcon”, “perceive”, and “use” are more closely related to one another than they are to any other variables in the data set.
Use this to try to decide how many clusters you think may be in the data set. Personally, I would say that there are three clusters of variables that are closely related, given how high you have to go on the “tree”. But, as you will see below, other methods may disagree with this initial assessment.
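If your variables happen to be all numeric and you cannot install ‘ClustOfVar’, a rough base-R sketch of the same idea uses correlation as a similarity between variables. The built-in ‘mtcars’ data set stands in for the tutorial's data here; ‘hclustvar()’ remains the right tool for mixed numeric and categorical variables:

```r
# Sketch, base R only: cluster numeric variables by correlation distance.
d    <- as.dist(1 - abs(cor(mtcars)))   # dissimilarity between variables
tree <- hclust(d, method = "average")   # hierarchical clustering of the variables
plot(tree)                              # dendrogram, read as described above
grp  <- cutree(tree, k = 3)             # hard assignment into 3 variable clusters
```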
Another method of estimating the number of clusters is to use a “stability” plot. This is a plot that “Evaluates the stability of partitions obtained from a hierarchy of p variables.” For more information on how it works, enter ‘?stability’ in the R console.
stab <- stability(tree, B=50) # "B=50" refers to the number of bootstrap samples to use in the estimation.
# If a plot is not produced, you can generate one using:
#plot(stab)
Look at the peaks in the plot above. The maximum for stability is 1.0, so the higher the peak, the better. In this case, the plot seems to suggest four stable partitions are present in the data.
K-means clustering is another popular method for grouping or partitioning variables (or respondents - see below) into groups.
k.means <- kmeansvar(xquant, xqual, init=4)
summary(k.means)
##
## Call:
## kmeansvar(X.quanti = xquant, X.quali = xqual, init = 4)
##
## number of iterations: 1
##
## Data:
## number of observations: 100
## number of variables: 8
## number of numerical variables: 5
## number of categorical variables: 3
## number of clusters: 4
##
## Cluster 1 :
## squared loading
## SEXEXP 0.54
## SELFCON 0.54
##
##
## Cluster 2 :
## squared loading
## PARTICIP 0.27
## SAFETY 0.60
## PERCEIVE 0.62
##
##
## Cluster 3 :
## squared loading
## USE 0.62
## PREVIOUS 0.62
##
##
## Cluster 4 :
## squared loading
## GENDER 1
##
##
## Gain in cohesion (in %): 46.7
k.means$cluster # This produces a list of what variables are in which cluster.
## PARTICIP SAFETY SEXEXP SELFCON PERCEIVE USE GENDER PREVIOUS
## 2 2 1 1 2 3 4 3
You can also save the cluster information in a .csv if you like.
write.csv(k.means$cluster, file="variableClusters.csv")
You can use basically the same tools to cluster respondents into “types”, based on their responses.
library(cluster)
d <- daisy(data, metric="gower")
# Keep in mind that this is only for mixed numeric and categorical data.
# See the note below if you are using only numeric variables.
Note: if you are using only numeric variables, then you will get better results using ‘d <- dist(data, method=“euclidean”)’.
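One caveat (not covered in the original tutorial): with Euclidean distance, variables measured on large scales dominate the distance simply because of their units, so it usually pays to standardize first. A minimal sketch, with ‘mtcars’ standing in for your numeric data:

```r
# Sketch: standardize numeric columns before computing Euclidean distances.
num <- scale(mtcars)                    # z-score every column (mean 0, sd 1)
d   <- dist(num, method = "euclidean")  # pairwise distances between rows
summary(as.vector(d))                   # quick look at the distance spread
```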
Use this dissimilarity matrix just as you used the tree above, only now you are clustering respondents rather than variables.
fit <- hclust(d=d, method="complete") # Also try: method="ward.D"
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=4) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=4, border="red") # draw dendrogram with red borders around the 4 clusters
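After cutting the tree, it is worth checking how the respondents spread across the clusters; a very lopsided split (one huge cluster, several tiny ones) often suggests a poor choice of k. A base-R sketch, again using ‘mtcars’ rows as stand-in respondents:

```r
# Sketch: count respondents per cluster after cutting the dendrogram.
d      <- dist(scale(mtcars))            # stand-in numeric data
fit    <- hclust(d, method = "complete")
groups <- cutree(fit, k = 4)             # one cluster label per respondent
table(groups)                            # respondents in each cluster
```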
As above, you can group the respondents using k-means and plot the resulting clusters.
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
kfit$cluster
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 3 1 3 1 3 3 1 3 3 1 1 3 3 3 1 3 1 1
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 1 1 3 3 1 1 3 1 1 3 3 3 1 3 3 1 3 1
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 1 1 3 3 1 3 1 1 3 3 1 1 1 1 3 3 1 3
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 3 1 1 2 4 4 2 4 2 4 2 4 4 4 2 4 2 2
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 4 4 4 4 2 2 2 4 2 4 2 4 2 2 4 2 2 2
## 91 92 93 94 95 96 97 98 99 100
## 4 2 2 4 2 2 4 4 4 4
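The tutorial fixes k at 4 on the strength of the stability plot; another common check (not part of the original) is the “elbow” method, which plots the total within-cluster sum of squares for a range of k values and looks for the bend. A base-R sketch, with ‘mtcars’ as stand-in data:

```r
# Sketch: elbow plot for choosing k in k-means.
set.seed(42)                                  # k-means starting points are random
x   <- scale(mtcars)
wss <- sapply(1:6, function(k)
  kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The within-cluster sum of squares always falls as k grows; the point where the curve flattens is the candidate number of clusters.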
To save the results as a .csv file, use:
write.csv(kfit$cluster, file="RespondentClusters.csv")
Alternatively, you can add the cluster information into the data set. To add a variable titled “Cluster”:
data[,"Cluster"] <- kfit$cluster
write.csv(data, file="ClusteredRespondents.csv")