Exploratory Factor Analysis
Required packages: psych and nFactors
For more info: http://www.statmethods.net/advstats/factor.html
To omit cases with missing values, create a new data set without them:
Generally, factor analysis (and principal components analysis) does not play well with missing data. You will have to remove cases that have missing data.
Beware, however, of doing that before selecting your variables of interest from the data set. If you leave all variables in when you remove missing cases, you will frequently discover that one particularly problematic variable (one with a lot of missings) has culled many observations you could otherwise have used.
The code for removing missing cases is straightforward, but it is a blunt force tool.
If, in the course of doing a factor analysis, you decide to discard one or more variables, then you are strongly advised to go back, subset the original data set, and remove the missings from that subset before continuing.
To remove missings, use: dat <- na.omit(my.Original.data) # note: I rename the data set each time to save work later.
Also, since these examples use the hypothetical data set “mydata”, you may want to go ahead and rename your data set “mydata” to make this really easy.
mydata <- dat
Also, keep in mind that factor analysis is designed for continuous data. Although it is possible to include categorical data in a factor analysis, the procedures below will not allow that. Keep this in mind when subsetting data.
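If your data set mixes numeric and categorical columns, a quick sketch like the one below keeps only the numeric (continuous) variables. This is just one convenient way to do it, not part of the original workflow.
mydata <- mydata[ , sapply(mydata, is.numeric)] # keep only numeric columns; drops factors and character variables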
As with any analysis, begin by evaluating the data you have.
The following evaluations are important. Provided you pay attention to what they tell you, they will frequently save you a huge amount of grief. There is little reward in getting to the middle or end of an analysis and discovering that your data were not appropriate for the tool you are applying.
Try each of the following analyses to familiarize yourself with the data and evaluate what you may need to modify before applying a factor analysis. Later tools are designed to tell you, in advance, whether factor analysis will render a satisfactory set of conclusions in your data.
Think of these as time-savers, even though they are a time investment themselves. The iterative nature of factor analysis can easily result in an exercise in chasing rabbits for a while before giving up due to lack of convergence.
Start by examining the variables you are analyzing.
If nothing else, look for the number of missings. You can, however, also see whether each variable was read in correctly as a continuous variable.
You may need to clean up problematic variables before going any further.
summary(mydata)
At the very least, identify which variables have the most missing values. They will be the most problematic, as they will force you to reduce the number of observations in the data most.
For the sake of having some output to show, I am using the "UN.2014.rda" data from the course website. As you will see, these data are not ideal for factor analysis. But, they do give us the opportunity to see what it is like to clean data and get it ready for this procedure.
colSums(is.na(mydata))
## life school ExSchool GNI
## 4 8 5 5
## CHI IneqLiEx IneqLiEx.index InEd
## 50 10 10 42
## InEd.index InIN InIN.index QuintRat
## 43 44 44 78
## PalmaRat Gini GenderIneq MatMort
## 91 57 43 15
## AdBR FemPar GenderHDI ImmDTP
## 10 5 47 2
## ImmMeasles UndFiveMort Antenatal SecondEd
## 2 2 32 33
## PrePrimary Primary Secondary Tertiary
## 21 14 18 30
## dropout PupTeaRat EdGDP GDP
## 40 20 37 15
## GDPPC CPI FoodPI PriceVolitility
## 15 15 68 63
## EmpPopRat VulnerPop YouthEmp Unemployment
## 23 106 82 46
## ChildLabor WorkingPoor MaternityLeave BirthReg
## 89 99 61 34
## Pension refugees IntTrade FDI
## 22 5 22 11
## PrivateCapFlow DevAsst Remittances reserves
## 41 32 38 64
## CO2 NatResDep ForestArea FreshWater
## 6 28 6 24
## PopOnBadLand NatDisasters Pop YoungPop
## 49 17 0 10
## OldPop UrbanPop MedianAge DependencyYoung
## 10 0 10 10
## DepencencyOld SexRatio FemaleEd MaleEd
## 10 10 34 34
## FeEcPart MaEcPart FemHDI MaleHDI
## 18 18 47 47
## InMortRate HIVFemale HIVMale HIVTotal
## 2 94 94 94
## FemaleMort MaleMort Obesity HealthExp
## 3 3 7 7
## Literacy GFC Credit ExternalDebt
## 49 23 17 77
## WorkingPoor.1 LongUnemploy Homeless PrisonPop
## 99 118 65 5
## Homocide NetMigration Tourists Internet
## 18 10 10 5
## IncomingIntPhone OutgoingIntPhone FuelSupply MDPovertyIndex
## 80 62 60 104
## IntenMDPov PopNearPoverty PopSeverePoverty SuperPoor
## 104 104 104 120
## PoorIn Fertility
## 104 10
missings <- colSums(is.na(mydata)) # Count # missing in each column
summary(missings) # Evaluate the breakdown of missings
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 26.00 37.77 59.25 120.00
# missings # Alternatively, just look at them
mydata <- mydata[ , missings<40] # Keep all columns with fewer than 40 missings
If there are missings, you will have to eliminate them. You may also choose to eliminate particularly problematic variables (those with many missings) in the interest of maximizing your sample size.
To see the dimensions of the data set, use dim(mydata).
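For example, here is a small sketch that reuses the na.omit() call from above and checks the sample size before and after:
dim(mydata) # rows and columns before removing missing cases
mydata <- na.omit(mydata) # listwise deletion: drop any row with a missing value
dim(mydata) # rows and columns after removing missing cases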
library(RcmdrMisc)
rcorr.adjust(mydata) # This function is built into R Commander.
## If you want to run this before eliminating missing values use:
# rcorr.adjust(mydata, use="pairwise.complete.obs")
write.csv(cor(mydata)>0.8, file="Suspect_Correlations.csv")
write.csv(cor(mydata), file="Correlation_Values.csv")
Use the CSV files to find variables that are too highly correlated and select one variable from each pair to omit from the data set. The first file flags pairs with r > 0.80; the second gives the actual correlation values, so you can apply a stricter cutoff (e.g., r > 0.90) if you prefer.
overcorrelated<- c("SecondEd", "OldPop", "FemaleEd", "school",
"IneqLiEx", "UndFiveMort", "YoungPop",
"MaleEd", "InMortRate", "IneqLiEx.index",
"GNI", "DepencencyOld")
mydata <- mydata[ ,!(names(mydata) %in% overcorrelated)] # Remove listed variables from data
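As a check (or as an alternative to scanning the CSV by eye), a short sketch like the one below lists any remaining pairs with |r| > 0.90 directly in R. The object names here are my own; adapt them as needed.
r <- cor(mydata, use="pairwise.complete.obs") # correlation matrix
r[lower.tri(r, diag=TRUE)] <- NA # keep each pair only once
high <- which(abs(r) > 0.90, arr.ind=TRUE) # locate any |r| > 0.90
data.frame(var1=rownames(r)[high[,1]], var2=colnames(r)[high[,2]], r=round(r[high], 2))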
The Kaiser-Meyer-Olkin (KMO) test measures sampling adequacy (how well suited the data are to factor analysis), both overall and for each individual variable. Kaiser provided the following values for interpreting the results:
* 0.00 to 0.49 unacceptable
* 0.50 to 0.59 miserable
* 0.60 to 0.69 mediocre
* 0.70 to 0.79 middling
* 0.80 to 0.89 meritorious
* 0.90 to 1.00 marvelous
# install.packages("psych", dependencies = TRUE)
library(psych)
KMO(mydata)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mydata)
## Overall MSA = 0.59
## MSA for each item =
## life ExSchool MatMort AdBR FemPar
## 0.86 0.74 0.78 0.70 0.17
## ImmDTP ImmMeasles Antenatal PrePrimary Primary
## 0.70 0.55 0.44 0.61 0.14
## Secondary Tertiary PupTeaRat EdGDP GDP
## 0.79 0.73 0.75 0.28 0.45
## GDPPC CPI EmpPopRat BirthReg Pension
## 0.66 0.34 0.57 0.67 0.67
## refugees IntTrade FDI DevAsst Remittances
## 0.38 0.35 0.16 0.63 0.11
## CO2 NatResDep ForestArea FreshWater NatDisasters
## 0.77 0.19 0.28 0.35 0.25
## Pop UrbanPop MedianAge DependencyYoung SexRatio
## 0.28 0.58 0.67 0.67 0.60
## FeEcPart MaEcPart FemaleMort MaleMort Obesity
## 0.46 0.55 0.65 0.69 0.80
## HealthExp GFC Credit PrisonPop Homocide
## 0.57 0.15 0.66 0.30 0.30
## NetMigration Tourists Internet Fertility
## 0.30 0.50 0.82 0.89
The overall result (MSA = 0.59) is miserable by Kaiser's standards, and many of the individual variables fall in the unacceptable range. I am going to deal with that by eliminating all low-contribution variables.
mydat <- mydata[, KMO(mydata)$MSAi>0.50] # Get rid of all variables with MSA < 0.50
mydata <- mydat
Let’s take a look at the KMO value now that those problematic variables have been removed.
round( KMO(mydata)$MSA, 2 )
## [1] 0.89
That is much better. These data look more suitable for running a factor model.
Please keep in mind that I do not fully condone this approach. I am just doing it out of expediency at the moment and for the sake of having something to demonstrate output below.
Bartlett’s test compares the observed correlation matrix to an identity matrix (ones on the diagonal and zeroes everywhere else). The null hypothesis for Bartlett’s test is that the variables are uncorrelated, i.e., that the correlation matrix is an identity matrix; rejecting it tells you there are relationships in the data worth factoring.
library(psych)
cortest.bartlett(mydata)
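If you only want the p-value rather than the full printout, you can pull it out of the list that cortest.bartlett() returns (the list also contains $chisq and $df):
cortest.bartlett(mydata)$p.value # a small p-value rejects the identity-matrix null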
This brings us through steps 2 to 5.
Theoretically, you could have as many factors as you do variables. But that, of course, defeats the purpose of running a factor analysis.
Rather, we are seeking latent structure within a set of data. Therefore, we will only be interested in factors that explain a substantial proportion of variation within the data.
We have a few ways of doing that. But, they generally all have something to do with the scree plot like the one that we develop below.
ev <- eigen(cor(mydata)) # get eigenvalues
ev$values
## [1] 15.89088158 2.09852113 1.82657814 1.13201882 1.02702103 0.89445313
## [7] 0.65379609 0.58186191 0.52961144 0.42492476 0.40935656 0.36205490
## [13] 0.35611955 0.28487848 0.26191759 0.22357035 0.20112354 0.17531559
## [19] 0.15282141 0.13482767 0.09161749 0.08034115 0.06528069 0.04150550
## [25] 0.03778802 0.02947327 0.01750228 0.01483793
scree(mydata, pc=FALSE) # Use pc=FALSE for factor analysis
You can check this against a parallel analysis.
fa.parallel(mydata, fa="fa")
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
The eigenvalue method (“Kaiser’s rule”) is telling us that five factors may be best. The scree plot is putting us somewhere between three and five factors. Parallel analysis is revealing only two factors. Let’s try each.
Now that you have at least some idea of how many factors to extract, you can get started. But, you still have another important decision to make. You need to consider how you plan to rotate the factors.
Think of factor rotation as being similar to focusing a telescope. You are essentially searching for a clearer association between individual factors and the various variables. The way you do this will depend on your assumption about whether the factors should be correlated with one another.
Under ideal circumstances, we would hope that the various factors are independent and, therefore, uncorrelated. But that is not always realistic. For that reason, some analysts suggest starting out with the assumption that the factors are non-independent by using an oblique rotation method. Oblique rotations will provide a correlation matrix that describes the relationships between factors. Use that correlation matrix to evaluate whether the correlations are worth keeping (e.g., r > |0.30|). Alternatively, if the correlations are fairly low (e.g., r < |0.30|), then you could be justified in switching to the assumption that the factors are independent (orthogonal) and should not be correlated.
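The loading plots and tables below assume a fitted factor model stored in an object named fit. The fitting call itself is not shown in this section, so here is a minimal sketch of one way to produce a compatible object: maximum-likelihood factor analysis via factanal() with four factors (one of the candidates identified above) and an oblique promax rotation (per the discussion of rotation just above). Treat both the number of factors and the rotation as choices to revisit, not as the definitive specification.
fit <- factanal(mydata, factors=4, rotation="promax") # assumed fitting step: 4 factors, oblique rotation
print(fit, digits=3, cutoff=0.3, sort=TRUE) # print loadings, hiding any with |loading| < 0.3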
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7)
You can also visualize the factor model to ease your interpretation. Though, as you can see, this does not work very well when you have a lot of variables.
library(psych)
loads <- fit$loadings
fa.diagram(loads)
To simplify the interpretation further, cut down the number of factors you are considering.
loads2 <- fit$loadings[,1:2]
fa.diagram(loads2)
loads3 <- fit$loadings[,3:4]
fa.diagram(loads3)
From the work above, you will need to consider whether the rotation you selected was appropriate.
Do the factors look stable and distinct?
To help interpret the factors and report findings, it can be nice to export the complete factor loadings. Exporting the loadings in the form of a CSV will make it a lot easier to report the findings. Note, though, that this process will not retain the sorting that you currently see in the output; you will have to sort the columns yourself.
For whatever reason, the default print method for loadings suppresses some small values if you merely call them using fit$loadings. To work around this, we need to make sure that we specify which variables we are including in the output.
So, first check to see how many variables are present in the output and then specify that when you extract the loadings. It is also helpful to go ahead and round the output to three digits after the decimal. This will save a step later and give you one less annoyance when formatting the output.
dim(fit$loadings)
## [1] 28 4
# There are 28 variables and four factors
round(fit$loadings[ 1:28,], 3)
## Factor1 Factor2 Factor3 Factor4
## life 0.161 -0.734 0.160 -0.114
## ExSchool 0.717 0.064 0.288 -0.111
## MatMort -0.631 0.393 0.144 0.093
## AdBR -0.634 0.121 -0.023 0.117
## ImmDTP 0.072 0.139 -0.070 0.868
## ImmMeasles -0.227 0.074 0.147 0.871
## PrePrimary 0.592 -0.137 0.005 -0.123
## Secondary 0.843 -0.063 0.041 -0.040
## Tertiary 0.791 0.050 0.244 0.070
## PupTeaRat -0.712 0.209 -0.045 -0.052
## GDPPC 0.081 -0.324 0.801 0.030
## EmpPopRat -0.560 0.151 -0.039 -0.170
## BirthReg 0.655 -0.095 -0.077 -0.193
## Pension 0.899 0.346 0.132 -0.068
## DevAsst -0.228 0.477 0.017 0.041
## CO2 0.505 0.046 0.532 0.097
## UrbanPop 0.421 -0.230 0.198 0.028
## MedianAge 0.706 -0.108 0.276 0.055
## DependencyYoung -0.886 0.139 0.067 -0.013
## SexRatio 0.389 -0.281 -0.166 0.406
## MaEcPart -0.824 -0.237 -0.072 -0.155
## FemaleMort -0.252 0.793 0.047 0.044
## MaleMort 0.253 1.064 -0.188 0.034
## Obesity 0.612 -0.064 -0.027 -0.096
## HealthExp 0.103 0.322 0.498 -0.151
## Credit 0.035 -0.423 0.480 0.088
## Internet 0.292 -0.316 0.530 0.037
## Fertility -0.518 0.339 0.019 0.101
If the output looks like something that you can use, then save it as a data object and then write it to your working directory as a csv file.
FactorLoadings <- round(fit$loadings[1:28, ], 3)
write.csv(FactorLoadings, file="FacLoads.csv")
Once you have the CSV file, you can open it as a spreadsheet in Excel, Google Sheets, Numbers, LibreOffice, or whatever your favorite spreadsheet program may be. From there, you can decide whether there are any variables that do not load sufficiently well onto any of the factors (for example, no loading above your chosen cutoff).
In this example, we don’t have any variables that don’t load sufficiently onto any of the factors. But, if we did, removing those additional variables could be done as we have above.
Of course, if you do elect to remove variables, the loadings are very likely to change. You will, therefore, have to re-run the factor analysis with the newly reduced data set before you can make your final interpretation and report.
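If you did need to drop one or more weak variables, the pattern is the same as earlier. Here is a hypothetical sketch; "WeakVariable" is a placeholder name, not a variable from this data set, and the fitting call should be whatever you used above (factanal() in the earlier sketch).
weak <- c("WeakVariable") # placeholder: list the variables that fail to load
mydata <- mydata[ , !(names(mydata) %in% weak)] # drop them from the data
fit <- factanal(mydata, factors=4, rotation="promax") # refit before re-interpreting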
For our purposes, we are satisfied with this model. So, we can move on to evaluating our factors for their internal consistency.
How good are these factors? We are interested in whether the variables in each factor form a coherent whole when used together. In other words, are the variances internally consistent between variables in a factor?
Cronbach developed a measure of internal consistency that he named “alpha.” He named it alpha because he thought it would be the first in a series of measures like it. Although he later fell out of love with the measure, its simplicity and notoriety have ensured that it remains one of the most widely used measures of its kind.
To run alpha on your factors you will first need to specify which variables belong to each factor.
f1 <- mydata[ , c("ExSchool", "MatMort", "AdBR", "PrePrimary",
"Secondary", "Tertiary", "PupTeaRat",
"EmpPopRat", "BirthReg", "Pension", "MedianAge",
"DependencyYoung", "MaEcPart", "Obesity", "Fertility")]
f2 <- mydata[ , c("Fertility", "life", "FemaleMort", "MaleMort")]
f3 <- mydata[ , c("GDPPC", "CO2", "Internet", "HealthExp")]
f4 <- mydata[ , c("ImmDTP", "ImmMeasles")]
Once you have defined each of the factors you can use that information to run the analysis.
Generally, values of Cronbach’s alpha will look a lot like positive correlation values, varying between zero and 1.0. It is, however, possible to see reported values greater than 1.0 if two or more variables in the set are highly correlated (r > 0.90). If you see values greater than 1.0, double-check the correlations between the variables. You may have some collinearity that you missed on the first pass.
As a rule, you are looking for an alpha of at least 0.70 (the conventional minimum) if you wish to consider the factor to be internally consistent, though it is better to have a value of 0.80 or higher if possible.
The version of the function that we use here is found in the psych library, and it is almost disappointingly simple to run.
alpha(f1)
## Number of categories should be increased in order to count frequencies.
## Warning in alpha(f1): Some items were negatively correlated with the total scale and probably
## should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( MatMort AdBR PupTeaRat EmpPopRat DependencyYoung MaEcPart Fertility ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
##
## Reliability analysis
## Call: alpha(x = f1)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## -0.82 -0.99 0.57 -0.034 -0.5 0.22 53 10 -0.41
##
## lower alpha upper 95% confidence boundaries
## -1.24 -0.82 -0.39
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r
## ExSchool -0.79 -1.51 0.38 -0.045 -0.60 0.21 0.42
## MatMort -0.14 -0.47 0.66 -0.023 -0.32 0.19 0.42
## AdBR -1.48 -0.56 0.64 -0.026 -0.36 0.35 0.44
## PrePrimary -0.41 -1.15 0.56 -0.040 -0.53 0.13 0.44
## Secondary -0.45 -1.25 0.45 -0.041 -0.56 0.14 0.41
## Tertiary -0.53 -1.54 0.42 -0.045 -0.61 0.15 0.42
## PupTeaRat -0.99 -0.42 0.68 -0.021 -0.29 0.26 0.42
## EmpPopRat -0.97 -0.82 0.56 -0.033 -0.45 0.24 0.45
## BirthReg -0.50 -1.34 0.49 -0.043 -0.57 0.15 0.44
## Pension -0.43 -1.35 0.49 -0.043 -0.57 0.13 0.44
## MedianAge -0.70 -1.07 0.50 -0.038 -0.52 0.19 0.41
## DependencyYoung -1.22 -0.46 0.64 -0.023 -0.31 0.30 0.41
## MaEcPart -0.90 -0.65 0.63 -0.029 -0.39 0.23 0.45
## Obesity -0.70 -1.15 0.50 -0.040 -0.54 0.19 0.45
## Fertility -0.84 -0.45 0.67 -0.023 -0.31 0.22 0.43
## med.r
## ExSchool -0.41
## MatMort -0.41
## AdBR -0.41
## PrePrimary -0.41
## Secondary -0.41
## Tertiary -0.41
## PupTeaRat -0.41
## EmpPopRat -0.41
## BirthReg -0.43
## Pension -0.42
## MedianAge -0.41
## DependencyYoung -0.41
## MaEcPart -0.42
## Obesity -0.43
## Fertility -0.41
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## ExSchool 63 -0.49 0.533 0.660 -0.50 12 2.6
## MatMort 63 0.90 -0.167 -0.316 -0.56 177 183.9
## AdBR 63 0.66 -0.067 -0.196 0.45 56 43.1
## PrePrimary 63 -0.38 0.363 0.281 -0.55 55 36.3
## Secondary 63 -0.61 0.418 0.513 -0.70 73 25.1
## Tertiary 63 -0.42 0.544 0.651 -0.54 31 24.8
## PupTeaRat 63 0.58 -0.229 -0.393 0.52 27 12.1
## EmpPopRat 63 0.53 0.151 0.067 0.46 68 12.3
## BirthReg 63 -0.44 0.462 0.449 -0.56 79 26.5
## Pension 63 -0.30 0.462 0.450 -0.50 47 38.9
## MedianAge 63 -0.55 0.317 0.389 -0.59 27 8.4
## DependencyYoung 63 0.71 -0.181 -0.275 0.63 51 22.3
## MaEcPart 63 0.38 0.014 -0.155 0.33 76 9.1
## Obesity 63 -0.48 0.366 0.352 -0.53 15 9.4
## Fertility 63 0.58 -0.189 -0.377 0.57 3 1.4
In our case, some of the variables are loading negatively onto the factor. For that reason, we should rerun this with the option check.keys=TRUE, which will allow the function to reverse those loadings in the evaluation.
alpha(f1, check.keys=TRUE)
## Number of categories should be increased in order to count frequencies.
## Warning in alpha(f1, check.keys = TRUE): Some items were negatively correlated with total scale and were automatically reversed.
## This is indicated by a negative sign for the variable name.
##
## Reliability analysis
## Call: alpha(x = f1, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.78 0.96 0.98 0.64 27 0.019 365 26 0.65
##
## lower alpha upper 95% confidence boundaries
## 0.74 0.78 0.81
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## ExSchool 0.78 0.96 0.97 0.63 24 0.019 0.015 0.63
## MatMort- 0.91 0.96 0.97 0.63 24 0.011 0.016 0.63
## AdBR- 0.74 0.96 0.97 0.64 25 0.021 0.016 0.65
## PrePrimary 0.75 0.96 0.98 0.65 26 0.020 0.016 0.65
## Secondary 0.75 0.96 0.97 0.62 23 0.021 0.014 0.63
## Tertiary 0.75 0.96 0.97 0.63 24 0.020 0.016 0.63
## PupTeaRat- 0.76 0.96 0.97 0.63 24 0.020 0.016 0.63
## EmpPopRat- 0.77 0.96 0.97 0.66 27 0.019 0.015 0.66
## BirthReg 0.75 0.96 0.98 0.65 26 0.020 0.016 0.65
## Pension 0.75 0.96 0.98 0.65 26 0.020 0.016 0.65
## MedianAge 0.77 0.96 0.97 0.63 23 0.019 0.014 0.62
## DependencyYoung- 0.75 0.96 0.97 0.62 23 0.021 0.014 0.63
## MaEcPart- 0.77 0.96 0.98 0.66 27 0.019 0.014 0.66
## Obesity 0.77 0.96 0.97 0.65 26 0.020 0.015 0.66
## Fertility- 0.78 0.96 0.97 0.64 25 0.019 0.016 0.64
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## ExSchool 63 0.84 0.88 0.88 0.84 12 2.6
## MatMort- 63 0.95 0.86 0.85 0.82 623 183.9
## AdBR- 63 0.78 0.78 0.77 0.73 744 43.1
## PrePrimary 63 0.74 0.75 0.73 0.69 55 36.3
## Secondary 63 0.92 0.93 0.93 0.91 73 25.1
## Tertiary 63 0.82 0.87 0.86 0.80 31 24.8
## PupTeaRat- 63 0.86 0.89 0.88 0.85 773 12.1
## EmpPopRat- 63 0.63 0.69 0.68 0.61 732 12.3
## BirthReg 63 0.75 0.76 0.74 0.72 79 26.5
## Pension 63 0.72 0.76 0.74 0.67 47 38.9
## MedianAge 63 0.87 0.92 0.93 0.87 27 8.4
## DependencyYoung- 63 0.93 0.93 0.94 0.92 749 22.3
## MaEcPart- 63 0.60 0.68 0.66 0.59 724 9.1
## Obesity 63 0.69 0.71 0.70 0.68 15 9.4
## Fertility- 63 0.80 0.81 0.80 0.80 797 1.4
That is better. Now, let’s walk through the output.
We are concerned with the “raw_alpha” value for the entire factor. That tells us, overall, how consistent the variables are within the factor. Here, the alpha is 0.78. That is pretty good. But it is possible that it could be better.
The next thing to consider is how alpha would change if each of the variables were removed. Consider the “raw_alpha” column under “Reliability if an item is dropped:”. We can see that removing one variable in particular, “MatMort-”, would increase alpha to 0.91. As above, if you do elect to remove the variable from the data set, be sure to rerun the analysis before reevaluating.
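For instance, here is a quick sketch of dropping MatMort from the first factor and recomputing alpha (using the f1 data frame defined above):
f1.reduced <- f1[ , names(f1) != "MatMort"] # drop MatMort from factor 1
alpha(f1.reduced, check.keys=TRUE)$total[1] # compare the new raw_alpha with the 0.78 above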
If you wish to get just the overall alpha for each factor, you can pull the first element (“raw_alpha”) out of the $total portion of the output.
alpha(f1, check.keys=TRUE)$total[1]
alpha(f2, check.keys=TRUE)$total[1]
alpha(f3, check.keys=TRUE)$total[1]
alpha(f4, check.keys=TRUE)$total[1]
## Number of categories should be increased in order to count frequencies.
## raw_alpha
## 0.7752715
## Number of categories should be increased in order to count frequencies.
## raw_alpha
## 0.6871495
## Number of categories should be increased in order to count frequencies.
## raw_alpha
## 0.006105121
## Number of categories should be increased in order to count frequencies.
## raw_alpha
## 0.8437348
Okay. Factors two and three don’t look so good in these terms. We will have to go back and reevaluate what we have done.
Once we have our final, refined factor representation, it is time to write up the results. Be sure to include all details on how the analysis was run (number of factors, how they were chosen, type of rotation, etc.). Also include the factor scores and the correlation matrix, if you used an oblique rotation. Last, be sure to interpret factors and discuss implications.
So that’s it.
Easy, right?
Good luck!