Exploratory Factor Analysis
Required packages: psych and nFactors
For more info: http://www.statmethods.net/advstats/factor.html
To omit cases with missing values, create a new data set without them:
Generally, factor analysis (and principal components analysis) does not play well with missing data. You will have to remove cases that have missing data.
Beware, however, of doing that before selecting your variables of interest from the data set. If you leave all variables in when you remove missing cases, you will frequently discover that one particularly problematic variable (one with a lot of missings) has culled many observations you could otherwise have used.
The code for removing missing cases is straightforward, but it is a blunt force tool.
If, in the course of doing a factor analysis, you decide to discard one or more variables, then you are strongly advised to go back, subset the original data set, and remove the missings from that subset before continuing.
To remove missings, use: dat <- na.omit(my.Original.data) # note: I rename the data set each time to save work later.
Also, since these examples use the hypothetical data set “mydata”, you may want to go ahead and rename your data set “mydata” to make this really easy.
mydata <- dat
Also, keep in mind that factor analysis is designed for continuous data. Although it is possible to include categorical data in a factor analysis, the procedures below will not allow that. Keep this in mind when subsetting data.
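If your data set mixes numeric and categorical columns, a quick sketch like the one below keeps only the numeric (continuous) variables. This is just one convenient way to do it, not part of the original workflow.
mydata <- mydata[ , sapply(mydata, is.numeric)] # keep only numeric columns; drops factors and character variables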
As with any analysis, begin by evaluating the data you have.
The following evaluations are important. Provided you pay attention to what they tell you, they will frequently save you a huge amount of grief. There is little reward in getting to the middle or end of an analysis and discovering that your data were not appropriate for the tool you are applying.
Try each of the following analyses to familiarize yourself with the data and evaluate what you may need to modify before applying a factor analysis. Later tools are designed to tell you, in advance, whether factor analysis will render a satisfactory set of conclusions in your data.
Think of these as time-savers, even though they are a time investment themselves. The iterative nature of factor analysis can easily result in an exercise in chasing rabbits for a while before giving up due to lack of convergence.
Start by examining the variables you are analyzing.
If nothing else, look for the number of missings. You can, however, also see whether each variable was read in correctly as a continuous variable.
You may need to clean up problematic variables before going any further.
summary(mydata)
At the very least, identify which variables have the most missing values. They will be the most problematic, as they will force you to reduce the number of observations in the data most.
For the sake of having some output to show, I am using the "UN.2014.rda" data from the course website. As you will see, these data are not ideal for factor analysis. But, they do give us the opportunity to see what it is like to clean data and get it ready for this procedure.
colSums(is.na(mydata))
## life school ExSchool GNI
## 4 8 5 5
## CHI IneqLiEx IneqLiEx.index InEd
## 50 10 10 42
## InEd.index InIN InIN.index QuintRat
## 43 44 44 78
## PalmaRat Gini GenderIneq MatMort
## 91 57 43 15
## AdBR FemPar GenderHDI ImmDTP
## 10 5 47 2
## ImmMeasles UndFiveMort Antenatal SecondEd
## 2 2 32 33
## PrePrimary Primary Secondary Tertiary
## 21 14 18 30
## dropout PupTeaRat EdGDP GDP
## 40 20 37 15
## GDPPC CPI FoodPI PriceVolitility
## 15 15 68 63
## EmpPopRat VulnerPop YouthEmp Unemployment
## 23 106 82 46
## ChildLabor WorkingPoor MaternityLeave BirthReg
## 89 99 61 34
## Pension refugees IntTrade FDI
## 22 5 22 11
## PrivateCapFlow DevAsst Remittances reserves
## 41 32 38 64
## CO2 NatResDep ForestArea FreshWater
## 6 28 6 24
## PopOnBadLand NatDisasters Pop YoungPop
## 49 17 0 10
## OldPop UrbanPop MedianAge DependencyYoung
## 10 0 10 10
## DepencencyOld SexRatio FemaleEd MaleEd
## 10 10 34 34
## FeEcPart MaEcPart FemHDI MaleHDI
## 18 18 47 47
## InMortRate HIVFemale HIVMale HIVTotal
## 2 94 94 94
## FemaleMort MaleMort Obesity HealthExp
## 3 3 7 7
## Literacy GFC Credit ExternalDebt
## 49 23 17 77
## WorkingPoor.1 LongUnemploy Homeless PrisonPop
## 99 118 65 5
## Homocide NetMigration Tourists Internet
## 18 10 10 5
## IncomingIntPhone OutgoingIntPhone FuelSupply MDPovertyIndex
## 80 62 60 104
## IntenMDPov PopNearPoverty PopSeverePoverty SuperPoor
## 104 104 104 120
## PoorIn Fertility
## 104 10
missings <- colSums(is.na(mydata)) # Count # missing in each column
summary(missings) # Evaluate the breakdown of missings
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 26.00 37.77 59.25 120.00
# missings # Alternatively, just look at them
mydata <- mydata[ , missings<40] # Keep all columns with fewer than 40 missings
If there are missings, you will have to eliminate them. You may also choose to eliminate particularly problematic variables (those with many missings) in the interest of maximizing your sample size.
To see the dimensions of the data set, use dim(mydata).
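For example, here is a small sketch that reuses the na.omit() call from above and checks the sample size before and after:
dim(mydata) # rows and columns before removing missing cases
mydata <- na.omit(mydata) # listwise deletion: drop any row with a missing value
dim(mydata) # rows and columns after removing missing cases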
library(RcmdrMisc)
rcorr.adjust(mydata) # This function is built into R Commander.
## If you want to run this before eliminating missing values use:
# rcorr.adjust(mydata, use="pairwise.complete.obs")
write.csv(cor(mydata)>0.8, file="Suspect_Correlations.csv")
write.csv(cor(mydata), file="Correlation_Values.csv")
Use the CSV files to find variables that are too highly correlated and select one variable from each pair to omit from the data set. The first file flags pairs with r > 0.80; the second gives the actual correlation values, so you can apply a stricter cutoff (e.g., r > 0.90) if you prefer.
overcorrelated<- c("SecondEd", "OldPop", "FemaleEd", "school",
"IneqLiEx", "UndFiveMort", "YoungPop",
"MaleEd", "InMortRate", "IneqLiEx.index",
"GNI", "DepencencyOld")
mydata <- mydata[ ,!(names(mydata) %in% overcorrelated)] # Remove listed variables from data
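As a check (or as an alternative to scanning the CSV by eye), a short sketch like the one below lists any remaining pairs with |r| > 0.90 directly in R. The object names here are my own; adapt them as needed.
r <- cor(mydata, use="pairwise.complete.obs") # correlation matrix
r[lower.tri(r, diag=TRUE)] <- NA # keep each pair only once
high <- which(abs(r) > 0.90, arr.ind=TRUE) # locate any |r| > 0.90
data.frame(var1=rownames(r)[high[,1]], var2=colnames(r)[high[,2]], r=round(r[high], 2))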
The Kaiser-Meyer-Olkin (KMO) test measures sampling adequacy (how well suited the data are to factor analysis), both overall and for each individual variable. Kaiser provided the following values for interpreting the results:
* 0.00 to 0.49 unacceptable
* 0.50 to 0.59 miserable
* 0.60 to 0.69 mediocre
* 0.70 to 0.79 middling
* 0.80 to 0.89 meritorious
* 0.90 to 1.00 marvelous
# install.packages("psych", dependencies = TRUE)
library(psych)
KMO(mydata)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mydata)
## Overall MSA = 0.59
## MSA for each item =
## life ExSchool MatMort AdBR FemPar
## 0.86 0.74 0.78 0.70 0.17
## ImmDTP ImmMeasles Antenatal PrePrimary Primary
## 0.70 0.55 0.44 0.61 0.14
## Secondary Tertiary PupTeaRat EdGDP GDP
## 0.79 0.73 0.75 0.28 0.45
## GDPPC CPI EmpPopRat BirthReg Pension
## 0.66 0.34 0.57 0.67 0.67
## refugees IntTrade FDI DevAsst Remittances
## 0.38 0.35 0.16 0.63 0.11
## CO2 NatResDep ForestArea FreshWater NatDisasters
## 0.77 0.19 0.28 0.35 0.25
## Pop UrbanPop MedianAge DependencyYoung SexRatio
## 0.28 0.58 0.67 0.67 0.60
## FeEcPart MaEcPart FemaleMort MaleMort Obesity
## 0.46 0.55 0.65 0.69 0.80
## HealthExp GFC Credit PrisonPop Homocide
## 0.57 0.15 0.66 0.30 0.30
## NetMigration Tourists Internet Fertility
## 0.30 0.50 0.82 0.89
The overall result (MSA = 0.59) is miserable by Kaiser's standards, and many of the individual variables fall in the unacceptable range. I am going to deal with that by eliminating all low-contribution variables.
mydat <- mydata[, KMO(mydata)$MSAi>0.50] # Get rid of all variables with MSA < 0.50
mydata <- mydat
Let’s take a look at the KMO value now that those problematic variables have been removed.
round( KMO(mydata)$MSA, 2 )
## [1] 0.89
That is much better. These data look more suitable for running a factor model.
Please keep in mind that I do not fully condone this approach. I am just doing it out of expediency at the moment and for the sake of having something to demonstrate output below.
Bartlett’s test compares the observed correlation matrix to an identity matrix (ones on the diagonal and zeroes everywhere else). The null hypothesis for Bartlett’s test is that the variables are uncorrelated, i.e., that the correlation matrix is an identity matrix; rejecting it tells you there are relationships in the data worth factoring.
library(psych)
cortest.bartlett(mydata)
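If you only want the p-value rather than the full printout, you can pull it out of the list that cortest.bartlett() returns (the list also contains $chisq and $df):
cortest.bartlett(mydata)$p.value # a small p-value rejects the identity-matrix null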
This brings us through steps 2 to 5.
Theoretically, you could have as many factors as you do variables. But that, of course, defeats the purpose of running a factor analysis.
Rather, we are seeking latent structure within a set of data. Therefore, we will only be interested in factors that explain a substantial proportion of variation within the data.
We have a few ways of doing that. But, they generally all have something to do with the scree plot like the one that we develop below.
ev <- eigen(cor(mydata)) # get eigenvalues
ev$values
## [1] 15.89088158 2.09852113 1.82657814 1.13201882 1.02702103 0.89445313
## [7] 0.65379609 0.58186191 0.52961144 0.42492476 0.40935656 0.36205490
## [13] 0.35611955 0.28487848 0.26191759 0.22357035 0.20112354 0.17531559
## [19] 0.15282141 0.13482767 0.09161749 0.08034115 0.06528069 0.04150550
## [25] 0.03778802 0.02947327 0.01750228 0.01483793
scree(mydata, pc=FALSE) # Use pc=FALSE for factor analysis
You can check this against a parallel analysis.
fa.parallel(mydata, fa="fa")
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
The eigenvalue method (“Kaiser’s rule”) is telling us that five factors may be best. The scree plot is putting us somewhere between three and five factors. Parallel analysis is revealing only two factors. Let’s try each.
Now that you have at least some idea of how many factors to extract, you can get started. But, you still have another important decision to make. You need to consider how you plan to rotate the factors.
Think of factor rotation as being similar to focusing a telescope. You are essentially searching for a clearer association between individual factors and the various variables. The way you do this will depend on your assumption about whether the factors should be correlated with one another.
Under ideal circumstances, we would hope that the various factors are independent and, therefore, uncorrelated. But that is not always realistic. For that reason, some analysts suggest starting out with the assumption that the factors are non-independent by using an oblique rotation method. Oblique rotations will provide a correlation matrix that describes the relationships between factors. Use that correlation matrix to evaluate whether the correlations are worth keeping (e.g., r > |0.30|). Alternatively, if the correlations are fairly low (e.g., r < |0.30|), then you could be justified in switching to the assumption that the factors are independent (orthogonal) and should not be correlated.
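The loading plots and tables below assume a fitted factor model stored in an object named fit. The fitting call itself is not shown in this section, so here is a minimal sketch of one way to produce a compatible object: maximum-likelihood factor analysis via factanal() with four factors (one of the candidates identified above) and an oblique promax rotation (per the discussion of rotation just above). Treat both the number of factors and the rotation as choices to revisit, not as the definitive specification.
fit <- factanal(mydata, factors=4, rotation="promax") # assumed fitting step: 4 factors, oblique rotation
print(fit, digits=3, cutoff=0.3, sort=TRUE) # print loadings, hiding any with |loading| < 0.3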
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7)
You can also visualize the factor model to ease your interpretation. Though, as you can see, this does not work very well when you have a lot of variables.
library(psych)
loads <- fit$loadings
fa.diagram(loads)
To simplify the interpretation further, cut down the number of factors you are considering.
loads2 <- fit$loadings[,1:2]
fa.diagram(loads2)
loads3 <- fit$loadings[,3:4]
fa.diagram(loads3)
From the work above, you will need to consider whether the rotation you selected was appropriate.
Do the factors look stable and distinct?
To help interpret the factors and report findings, it can be nice to export the complete factor loadings. Exporting the loadings in the form of a CSV will make it a lot easier to report the findings. Note, though, that this process will not retain the sorting that you currently see in the output; you will have to sort the columns yourself.
For whatever reason, the default print method for loadings suppresses some small values if you merely call them using fit$loadings. To work around this, we need to make sure that we specify which variables we are including in the output.
So, first check to see how many variables are present in the output and then specify that when you extract the loadings. It is also helpful to go ahead and round the output to three digits after the decimal. This will save a step later and give you one less annoyance when formatting the output.
dim(fit$loadings)
## [1] 28 4
# There are 28 variables and four factors
round(fit$loadings[ 1:28,], 3)
## Factor1 Factor2 Factor3 Factor4
## life 0.161 -0.734 0.160 -0.114
## ExSchool 0.717 0.064 0.288 -0.111
## MatMort -0.631 0.393 0.144 0.093
## AdBR -0.634 0.121 -0.023 0.117
## ImmDTP 0.072 0.139 -0.070 0.868
## ImmMeasles -0.227 0.074 0.147 0.871
## PrePrimary 0.592 -0.137 0.005 -0.123
## Secondary 0.843 -0.063 0.041 -0.040
## Tertiary 0.791 0.050 0.244 0.070
## PupTeaRat -0.712 0.209 -0.045 -0.052
## GDPPC 0.081 -0.324 0.801 0.030
## EmpPopRat -0.560 0.151 -0.039 -0.170
## BirthReg 0.655 -0.095 -0.077 -0.193
## Pension 0.899 0.346 0.132 -0.068
## DevAsst -0.228 0.477 0.017 0.041
## CO2 0.505 0.046 0.532 0.097
## UrbanPop 0.421 -0.230 0.198 0.028
## MedianAge 0.706 -0.108 0.276 0.055
## DependencyYoung -0.886 0.139 0.067 -0.013
## SexRatio 0.389 -0.281 -0.166 0.406
## MaEcPart -0.824 -0.237 -0.072 -0.155
## FemaleMort -0.252 0.793 0.047 0.044
## MaleMort 0.253 1.064 -0.188 0.034
## Obesity 0.612 -0.064 -0.027 -0.096
## HealthExp 0.103 0.322 0.498 -0.151
## Credit 0.035 -0.423 0.480 0.088
## Internet 0.292 -0.316 0.530 0.037
## Fertility -0.518 0.339 0.019 0.101
If the output looks like something that you can use, then save it as a data object and then write it to your working directory as a csv file.
FactorLoadings <- round(fit$loadings[1:28, ], 3)
write.csv(FactorLoadings, file="FacLoads.csv")
Once you have the CSV file, you can open it as a spreadsheet in Excel, Google Sheets, Numbers, LibreOffice, or whatever your favorite spreadsheet program may be. From there, you can decide whether there are any variables that do not load sufficiently well onto any of the factors (for example, no loading above your chosen cutoff).
In this example, we don’t have any variables that don’t load sufficiently onto any of the factors. But, if we did, removing those additional variables could be done as we have above.
Of course, if you do elect to remove variables, the loadings are very likely to change. You will, therefore, have to re-run the factor analysis with the newly reduced data set before you can make your final interpretation and report.
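If you did need to drop one or more weak variables, the pattern is the same as earlier. Here is a hypothetical sketch; "WeakVariable" is a placeholder name, not a variable from this data set, and the fitting call should be whatever you used above (factanal() in the earlier sketch).
weak <- c("WeakVariable") # placeholder: list the variables that fail to load
mydata <- mydata[ , !(names(mydata) %in% weak)] # drop them from the data
fit <- factanal(mydata, factors=4, rotation="promax") # refit before re-interpreting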
For our purposes, we are satisfied with this model. So, we can move on to evaluating our factors for their internal consistency.
How good are these factors? We are interested in whether the variables in each factor form a coherent whole when used together. In other words, are the variances internally consistent between variables in a factor?
Cronbach developed a measure of internal consistency that he named “alpha.” He named it alpha because he thought it would be the first in a series of measures like it. Although he later fell out of love with the measure, its simplicity and notoriety have ensured that it remains one of the most widely used measures of its kind.
To run alpha on your factors you will first need to specify which variables belong to each factor.
f1 <- mydata[ , c("ExSchool", "MatMort", "AdBR", "PrePrimary",
"Secondary", "Tertiary", "PupTeaRat",
"EmpPopRat", "BirthReg", "Pension", "MedianAge",
"DependencyYoung", "MaEcPart", "Obesity", "Fertility")]
f2 <- mydata[ , c("Fertility", "life", "FemaleMort", "MaleMort")]
f3 <- mydata[ , c("GDPPC", "CO2", "Internet", "HealthExp")]
f4 <- mydata[ , c("ImmDTP", "ImmMeasles")]
Once you have defined each of the factors you can use that information to run the analysis.
Generally, values of Cronbach’s alpha will look a lot like positive correlation values, varying between zero and 1.0. It is, however, possible to see reported values greater than 1.0 if two or more variables in the set are highly correlated (r > 0.90). If you see values greater than 1.0, double-check the correlations between the variables. You may have some collinearity that you missed on the first pass.
As a rule, you are looking for an alpha of at least 0.70 (the conventional minimum) if you wish to consider the factor to be internally consistent, though it is better to have a value of 0.80 or higher if possible.
The version of the function that we use here is found in the psych library, and it is almost disappointingly simple to run.
alpha(f1)
## Number of categories should be increased in order to count frequencies.
## Warning in alpha(f1): Some items were negatively correlated with the total scale and probably
## should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( MatMort AdBR PupTeaRat EmpPopRat DependencyYoung MaEcPart Fertility ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
##
## Reliability analysis
## Call: alpha(x = f1)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## -0.82 -0.99 0.57 -0.034 -0.5 0.22 53 10 -0.41
##
## lower alpha upper 95% confidence boundaries
## -1.24 -0.82 -0.39
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r
## ExSchool -0.79 -1.51 0.38 -0.045 -0.60 0.21 0.42
## MatMort -0.14 -0.47 0.66 -0.023 -0.32 0.19 0.42
## AdBR -1.48 -0.56 0.64 -0.026 -0.36 0.35 0.44
## PrePrimary -0.41 -1.15 0.56 -0.040 -0.53 0.13 0.44
## Secondary -0.45 -1.25 0.45 -0.041 -0.56 0.14 0.41
## Tertiary -0.53 -1.54 0.42 -0.045 -0.61 0.15 0.42
## PupTeaRat -0.99 -0.42 0.68 -0.021 -0.29 0.26 0.42
## EmpPopRat -0.97 -0.82 0.56 -0.033 -0.45 0.24 0.45
## BirthReg -0.50 -1.34 0.49 -0.043 -0.57 0.15 0.44
## Pension -0.43 -1.35 0.49 -0.043 -0.57 0.13 0.44
## MedianAge -0.70 -1.07 0.50 -0.038 -0.52 0.19 0.41
## DependencyYoung -1.22 -0.46 0.64 -0.023 -0.31 0.30 0.41
## MaEcPart -0.90 -0.65 0.63 -0.029 -0.39 0.23 0.45
## Obesity -0.70 -1.15 0.50 -0.040 -0.54 0.19 0.45
## Fertility -0.84 -0.45 0.67 -0.023 -0.31 0.22 0.43
## med.r
## ExSchool -0.41
## MatMort -0.41
## AdBR -0.41
## PrePrimary -0.41
## Secondary -0.41
## Tertiary -0.41
## PupTeaRat -0.41
## EmpPopRat -0.41
## BirthReg -0.43
## Pension -0.42
## MedianAge -0.41
## DependencyYoung -0.41
## MaEcPart -0.42
## Obesity -0.43
## Fertility -0.41
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## ExSchool 63 -0.49 0.533 0.660 -0.50 12 2.6
## MatMort 63 0.90 -0.167 -0.316 -0.56 177 183.9
## AdBR 63 0.66 -0.067 -0.196 0.45 56 43.1
## PrePrimary 63 -0.38 0.363 0.281 -0.55 55 36.3
## Secondary 63 -0.61 0.418 0.513 -0.70 73 25.1
## Tertiary 63 -0.42 0.544 0.651 -0.54 31 24.8
## PupTeaRat 63 0.58 -0.229 -0.393 0.52 27 12.1
## EmpPopRat 63 0.53 0.151 0.067 0.46 68 12.3
## BirthReg 63 -0.44 0.462 0.449 -0.56 79 26.5
## Pension 63 -0.30 0.462 0.450 -0.50 47 38.9
## MedianAge 63 -0.55 0.317 0.389 -0.59 27 8.4
## DependencyYoung 63 0.71 -0.181 -0.275 0.63 51 22.3
## MaEcPart 63 0.38 0.014 -0.155 0.33 76 9.1
## Obesity 63 -0.48 0.366 0.352 -0.53 15 9.4
## Fertility 63 0.58 -0.189 -0.377 0.57 3 1.4
In our case, some of the variables are loading negatively onto the factor. For that reason, we should rerun this with the option check.keys=TRUE, which will allow the function to reverse those loadings in the evaluation.
alpha(f1, check.keys=TRUE)
## Number of categories should be increased in order to count frequencies.
## Warning in alpha(f1, check.keys = TRUE): Some items were negatively correlated with total scale and were automatically reversed.
## This is indicated by a negative sign for the variable name.
##
## Reliability analysis
## Call: alpha(x = f1, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.78 0.96 0.98 0.64 27 0.019 365 26 0.65
##
## lower alpha upper 95% confidence boundaries
## 0.74 0.78 0.81
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## ExSchool 0.78 0.96 0.97 0.63 24 0.019 0.015 0.63
## MatMort- 0.91 0.96 0.97 0.63 24 0.011 0.016 0.63
## AdBR- 0.74 0.96 0.97 0.64 25 0.021 0.016 0.65
## PrePrimary 0.75 0.96 0.98 0.65 26 0.020 0.016 0.65
## Secondary 0.75 0.96 0.97 0.62 23 0.021 0.014 0.63
## Tertiary 0.75 0.96 0.97 0.63 24 0.020 0.016 0.63
## PupTeaRat- 0.76 0.96 0.97 0.63 24 0.020 0.016 0.63
## EmpPopRat- 0.77 0.96 0.97 0.66 27 0.019 0.015 0.66
## BirthReg 0.75 0.96 0.98 0.65 26 0.020 0.016 0.65
## Pension 0.75 0.96 0.98 0.65 26 0.020 0.016 0.65
## MedianAge 0.77 0.96 0.97 0.63 23 0.019 0.014 0.62
## DependencyYoung- 0.75 0.96 0.97 0.62 23 0.021 0.014 0.63
## MaEcPart- 0.77 0.96 0.98 0.66 27 0.019 0.014 0.66
## Obesity 0.77 0.96 0.97 0.65 26 0.020 0.015 0.66
## Fertility- 0.78 0.96 0.97 0.64 25 0.019 0.016 0.64
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## ExSchool 63 0.84 0.88 0.88 0.84 12 2.6
## MatMort- 63 0.95 0.86 0.85 0.82 623 183.9
## AdBR- 63 0.78 0.78 0.77 0.73 744 43.1
## PrePrimary 63 0.74 0.75 0.73 0.69 55 36.3
## Secondary 63 0.92 0.93 0.93 0.91 73 25.1
## Tertiary 63 0.82 0.87 0.86 0.80 31 24.8
## PupTeaRat- 63 0.86 0.89 0.88 0.85 773 12.1
## EmpPopRat- 63 0.63 0.69 0.68 0.61 732 12.3
## BirthReg 63 0.75 0.76 0.74 0.72 79 26.5
## Pension 63 0.72 0.76 0.74 0.67 47 38.9
## MedianAge 63 0.87 0.92 0.93 0.87 27 8.4
## DependencyYoung- 63 0.93 0.93 0.94 0.92 749 22.3
## MaEcPart- 63 0.60 0.68 0.66 0.59 724 9.1
## Obesity 63 0.69 0.71 0.70 0.68 15 9.4
## Fertility- 63 0.80 0.81 0.80 0.80 797 1.4
That is better. Now, let’s walk through the output.
We are concerned with the “raw_alpha” value for the entire factor. That tells us, overall, how consistent the variables are within the factor. Here, the alpha is 0.78. That is pretty good. But it is possible that it could be better.
The next thing to consider is how alpha would change if each of the variables were removed. Consider the “raw_alpha” column under “Reliability if an item is dropped:”. We can see that removing one variable in particular, “MatMort-”, would increase alpha to 0.91. As above, if you do elect to remove the variable from the data set, be sure to rerun the analysis before reevaluating.
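For instance, here is a quick sketch of dropping MatMort from the first factor and recomputing alpha (using the f1 data frame defined above):
f1.reduced <- f1[ , names(f1) != "MatMort"] # drop MatMort from factor 1
alpha(f1.reduced, check.keys=TRUE)$total[1] # compare the new raw_alpha with the 0.78 above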
If you wish to get just the overall alpha for each factor, you can pull the first element (“raw_alpha”) out of the $total portion of the output.
alpha(f1, check.keys=TRUE)$total[1]
alpha(f2, check.keys=TRUE)$total[1]
alpha(f3, check.keys=TRUE)$total[1]
alpha(f4, check.keys=TRUE)$total[1]
## Number of categories should be increased in order to count frequencies.
## raw_alpha
## 0.7752715
## Number of categories should be increased in order to count frequencies.
## raw_alpha
## 0.6871495
## Number of categories should be increased in order to count frequencies.
## raw_alpha
## 0.006105121
## Number of categories should be increased in order to count frequencies.
## raw_alpha
## 0.8437348
Okay. Factors two and three don’t look so good in these terms. We will have to go back and reevaluate what we have done.
Once we have our final, refined factor representation, it is time to write up the results. Be sure to include all details on how the analysis was run (number of factors, how they were chosen, type of rotation, etc.). Also include the factor scores and the correlation matrix, if you used an oblique rotation. Last, be sure to interpret factors and discuss implications.
So that’s it.
Easy, right?
Good luck!