Multilevel CFA With R and Lavaan
by Arndt Regorz, MSc.
November 1, 2024
If you have nested (= hierarchical) data and want to run a confirmatory factor analysis (CFA), then one option is a multilevel CFA to address the dependence inherent in the data structure. This tutorial shows you how to do that with R and the lavaan package. Why Multilevel CFA?
Structural equation modeling and confirmatory factor analysis are linear models. And as with most linear models one crucial assumption is the independence of observations.
This assumption is violated if you have a nested or hierarchical data structure. Such a multilevel structure can occur in cross-sectional data, e.g. if you investigate students nested within classes. And it can occur in longitudinal data, e.g. if you investigate time points nested within participants.
Without addressing the nesting of the observations you risk getting false results, probably standard errors that are too small, test statistics that are too large, and p-values that are too small - with the risk of false positive test results.
There are different ways to address the dependence of the data. Probably the easiest way is using cluster robust standard errors, see this blog post: Cluster Robust Standard Errors With Lavaan
But especially if you are interested in the behavior of a measurement instrument on different levels, then you might choose multilevel confirmatory factor analysis (multilevel CFA) as an analysis method.
Lavaan provides limited support for such an analysis. Currently, you can analyze a two-level CFA model with random intercepts, based on continuous data. Maybe in the future there will be the option to analyze models with random slopes, too.
Two-level CFA With the Same Factor Structure on Both Levels
In this tutorial we use an inbuilt dataset in the lavaan package, Demo.twolevel. From this dataset we will analyze the six items y1, y2, y3, y4, y5, and y6.
First, we want to investigate a model where we have the same factor structure on both levels. We assume a one-factor structure with all six items loading on the only factor.
library(lavaan)
After loading lavaan we start with the model specification:
model1 <- '
level: 1
factor_w =~ y1 + y2 + y3 + y4 + y5 + y6
level: 2
factor_b =~ y1 + y2 + y3 + y4 + y5 + y6
'
Here we have to specify the factor models separately on both levels, the within-level (level 1) and the between-level (level 2). Both model parts are introduced with the key phrase “level: 1” or “level: 2” (not: “level 1:” or “level 2:” - that would lead to an error).
In this simple example the specification of each model part consists of only one line, the measurement model for the single factor.
I recommend using different names for the latent variables on both levels (e.g., factor_w and factor_b).
And if you want to use labels for (any) effects in a multilevel CFA, then please make sure to use different labels for the within-model and the between-model. Because using the same label for a within-effect and a between-effect specifies an equality constraint, forcing the estimation of the same value for within-effect and between-effect.
Following the two-level model specification we can estimate the CFA model:
fit1 <- cfa(model = model1,
data = Demo.twolevel,
cluster = "cluster")
In addition to the normal parameters, model and data, we have to include a parameter cluster that contains the clustering variable in the dataframe. In this example the clustering variable has the name “cluster”.
Having estimated the model, we can look at the results.
summary(fit1,
fit.measures = T,
standardized = T)
Before we are allowed to interpret the output we must check the model fit.
lavaan 0.6-18 ended normally after 47 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 30
Number of observations 2500
Number of clusters [cluster] 200
Model Test User Model:
Test statistic 455.957
Degrees of freedom 18
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 2121.379
Degrees of freedom 30
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.791
Tucker-Lewis Index (TLI) 0.651
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -24190.361
Loglikelihood unrestricted model (H1) -23962.383
Akaike (AIC) 48440.723
Bayesian (BIC) 48615.444
Sample-size adjusted Bayesian (SABIC) 48520.126
Root Mean Square Error of Approximation:
RMSEA 0.099
90 Percent confidence interval - lower 0.091
90 Percent confidence interval - upper 0.107
P-value H_0: RMSEA <= 0.050 0.000
P-value H_0: RMSEA >= 0.080 1.000
Standardized Root Mean Square Residual (corr metric):
SRMR (within covariance matrix) 0.029
SRMR (between covariance matrix) 0.328
The model test was significant (but it almost always is). Looking at the key fit indices the model fit is quite bad:
CFI = .791
RMSEA = .099
SRMR (within) = .029
SRMR (between) = .328
The most interesting information is contained in the SRMR because here we get different values for both levels. The SRMR for the within-level is quite good (.029). However, the SRMR for the between-level is completely unacceptable (.328). This indicates that the within-model works, but that the between-model has to change to get acceptable model fit.
In order to pinpoint the exact problems we request modification indices.
modindices(fit1,
minimum.value = 10,
sort = T)
lhs op rhs block group level mi epc sepc.lv sepc.all sepc.nox
68 y4 ~~ y5 2 1 2 124.341 0.797 0.797 0.895 0.895
69 y4 ~~ y6 2 1 2 121.020 0.771 0.771 0.877 0.877
70 y5 ~~ y6 2 1 2 108.859 0.590 0.590 0.856 0.856
53 y4 ~~ y5 1 1 1 27.980 0.127 0.127 0.115 0.115
54 y4 ~~ y6 1 1 1 16.457 0.091 0.091 0.088 0.088
Compared to the output for modification indices in a normal CFA we get additional columns. Of primary interest is the column level, that is we get different possible model changes for the two levels.
Looking at the indices we can see that the first three large modification indices are all about level 2 which fits our results for the SRMR.
Here we have three items, y4, y5, and y6, that show additional correlation with each other. A pattern like this is an indication of another factor.
For that reason we will try out a different model in which on the between-level we have two factors whereas on the within factor there remains one factor.
Two-level CFA With Different Factor Structures on Both Levels
Here is the model specification for a single factor on level 1 and two correlated factors on level 2:
model2 <- '
level: 1
factor_w =~ y1 + y2 + y3 + y4 + y5 + y6
level: 2
factor1_b =~ y1 + y2 + y3
factor2_b =~ y4 + y5 + y6
'
fit2 <- cfa(model = model2,
data = Demo.twolevel,
cluster = "cluster")
summary(fit2,
fit.measures = T,
standardized = T)
Running this code we get an acceptable model fit, CFI = .983, RMSEA = .029, SRMR-within = .029, SRMR-between = .025. Thus we can continue with the parameter estimates.
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Level 1 [within]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
factor_w =~
y1 1.000 1.018 0.720
y2 0.753 0.039 19.071 0.000 0.766 0.592
y3 0.717 0.038 18.950 0.000 0.730 0.585
y4 0.338 0.028 11.874 0.000 0.344 0.316
y5 0.267 0.028 9.518 0.000 0.271 0.246
y6 0.182 0.026 7.043 0.000 0.185 0.180
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.y1 0.964 0.055 17.432 0.000 0.964 0.482
.y2 1.087 0.043 25.413 0.000 1.087 0.649
.y3 1.025 0.040 25.784 0.000 1.025 0.658
.y4 1.066 0.033 32.259 0.000 1.066 0.900
.y5 1.142 0.035 32.996 0.000 1.142 0.939
.y6 1.021 0.031 33.475 0.000 1.021 0.968
factor_w 1.036 0.070 14.768 0.000 1.000 1.000
Level 2 [cluster]:
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
factor1_b =~
y1 1.000 0.944 0.971
y2 0.713 0.056 12.817 0.000 0.673 0.888
y3 0.578 0.050 11.481 0.000 0.546 0.818
factor2_b =~
y4 1.000 1.007 0.946
y5 0.772 0.045 17.312 0.000 0.777 0.929
y6 0.766 0.045 16.841 0.000 0.772 0.922
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
factor1_b ~~
factor2_b 0.051 0.078 0.653 0.514 0.054 0.054
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.y1 0.021 0.076 0.276 0.783 0.021 0.021
.y2 -0.019 0.061 -0.309 0.757 -0.019 -0.025
.y3 -0.044 0.055 -0.807 0.420 -0.044 -0.066
.y4 0.031 0.079 0.390 0.697 0.031 0.029
.y5 0.061 0.064 0.953 0.340 0.061 0.073
.y6 0.012 0.063 0.189 0.850 0.012 0.014
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.y1 0.054 0.049 1.094 0.274 0.054 0.057
.y2 0.121 0.032 3.799 0.000 0.121 0.211
.y3 0.148 0.028 5.267 0.000 0.148 0.332
.y4 0.118 0.040 2.937 0.003 0.118 0.105
.y5 0.096 0.029 3.370 0.001 0.096 0.138
.y6 0.104 0.028 3.674 0.000 0.104 0.149
factor1_b 0.892 0.122 7.333 0.000 1.000 1.000
factor2_b 1.015 0.127 8.018 0.000 1.000 1.000
The output looks similar to a normal CFA, but we get two parts of the output. First, we get an output for the within-model (level 1), then an output for the between-model (level 2). The elements of the output are basically the same as in a normal, one-level CFA.
However, in a two-level CFA you have to interpret the output twice (level 1 and level 2). Results and their interpretation can be quite different on the two levels.
Two-level CFA With Saturated Model on one of the Levels
In order to estimate a two-level CFA with lavaan you have to specify a model for each of the levels. But what if you are only interested in one level (most often: level 1)? In that case you can simply specify a saturated model for the level you are not interested in, specifying covariances between all measured variables, as in this example code (with a saturated model for level 2):
model3 <- '
level: 1
factor_w =~ y1 + y2 + y3 + y4 + y5 + y6
level: 2
y1 ~~ y2 + y3 + y4 + y5 + y6
y2 ~~ y3 + y4 + y5 + y6
y3 ~~ y4 + y5 + y6
y4 ~~ y5 + y6
y5 ~~ y6
'
fit3 <- cfa(model = model3,
data = Demo.twolevel,
cluster = "cluster")
summary(fit3,
fit.measures = T,
standardized = T)
Alternative Estimator (EM Algorithm)
If you run into convergence problems, the most likely reason for that is some kind of model misspecification. Apart from changing your model (which is not easy without getting any preliminary results) you could try a different estimator, the EM algorithm:
fit4 <- cfa(model = model3,
data = Demo.twolevel,
cluster = "cluster",
verbose = T,
optim.method = "em")
summary(fit4,
fit.measures = T,
standardized = T)
Please note that the EM algorithm can take a long time to converge.
Citation
Regorz, A. (2024, November 1). Multilevel CFA With R and Lavaan. Regorz Statistik. http://www.regorz-statistik.de/blog/lavaan_multilevel_cfa.html