Quick and Dirty Categorical lavaan

I was tagged today on Twitter in a question about categorical variables in lavaan. I will say I have not done much with categorical variables, either endogenous or exogenous. I put together a quick reproducible example with exogenous categorical variables, and I will refer you to the help guide for lavaan here.

You will need both the lavaan and psych packages to reproduce this code. Ironically, this data is binary outcome data (the epi dataset in psych). That wasn’t intentional; I just knew it was a good dataset for testing how to handle exogenous categorical variables.

First, let’s make a model that works (I do assume you know a bit about lavaan here, feel free to ask questions):

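#load libraries and data
#note: in newer releases of psych the epi data may have moved to the
#psychTools package; if epi is not found, try library(psychTools)
library(psych)
library(lavaan)
DF = epi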

#lavaan model syntax
epi.model = 'latent =~ V1+V2+V3+V4
latent2 =~ V5+V6+V7+V8'

#analyze the model
epi.fit = cfa(model = epi.model, 
              data = DF)

#show a summary
summary(epi.fit)

The cfa and summary did not throw any errors, so the model at least runs smoothly, even if it is not a “good” model. For good measure, you can also use semPlot to create a picture of this two-factor model:

library(semPlot)

#semPaths with basic options
semPaths(epi.fit,
         whatLabels = "std",
         edge.label.cex = 1)

Two-factor lavaan model

Next, I created a fake categorical variable with three levels (it gets dummy coded below), although you could easily scale this up to more levels:

DF$category = c(rep("group", nrow(epi)/3),
                rep("group2", nrow(epi)/3),
                rep("group3", nrow(epi)/3))
DF$category = as.factor(DF$category)
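The rep() calls above only line up with the data when the number of rows divides evenly by three. As a minimal sketch for the general case (the row count n here is made up for illustration and is not from the epi data), base R's gl() can build the same kind of blocked factor for any length:

```r
#hypothetical row count that does not divide evenly by 3
n = 10

#gl() repeats each level in blocks of ceiling(n / 3) and stops at length n
category = gl(3, ceiling(n / 3), length = n,
              labels = c("group", "group2", "group3"))

#check how many rows landed in each level
table(category)
```

The last group simply ends up a little smaller than the others, which is fine for a fake grouping variable like this one.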

When I tried to run a new model with the category variable, lavaan was not happy:

Warning message:
In lav_data_full(data = data, group = group, cluster = cluster,  :
  lavaan WARNING: unordered factor(s) with more than 2 levels detected in data: category

Fine, let’s dummy code them with the gloriously easy dummy.code function in psych:

#dummy code and combine with DF
DF_dc = cbind(DF, dummy.code(DF$category))

However, I will warn you that psych gives you K columns, where K is the number of levels. Real dummy coding uses K - 1 columns, so I find it odd that psych gives you K. For example, it took our group, group2, group3 labels and transformed them into three new columns, with 0 meaning not in that group and 1 meaning in that group. Because each row has exactly one 1 across the three columns, any one column is completely determined by the other two. Therefore, I advise you to pick your favorite combination of K - 1 columns and not use all of them, or you will create a singular covariance matrix that will be difficult to troubleshoot in any regression-based analysis. Here’s an example of that error:

Error in lav_samplestats_icov(COV = cov[[g]], ridge = ridge, x.idx = x.idx[[g]],  : 
  lavaan ERROR: sample covariance matrix is not positive-definite
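To see why all K columns cause trouble, here is a minimal base R sketch. It uses a small made-up factor rather than the epi data, and it builds the K columns by hand instead of calling dummy.code, so treat it as an illustration of the idea rather than what psych does internally:

```r
#hypothetical three-level factor, mirroring the category variable above
category = factor(rep(c("group", "group2", "group3"), each = 4))

#K columns, one per level (the same shape of output described above)
full = sapply(levels(category), function(lv) as.numeric(category == lv))

#every row sums to 1, so any one column is determined by the others
rowSums(full)

#the covariance matrix of all K columns is rank deficient (rank K - 1)
qr(cov(full))$rank

#K - 1 columns: model.matrix() drops the first level as the reference
reduced = model.matrix(~ category)[, -1]
colnames(reduced)

#with K - 1 columns the covariance matrix is full rank again
qr(cov(reduced))$rank
```

Note that model.matrix() uses the first level (group) as the reference, while the model below keeps group and group2 and leaves group3 as the reference. Either choice works; it only changes which comparison each coefficient represents.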

I can add the first two dummy coded columns to the model as predictors of one of the latents, using ~ for regression rather than =~ for creating a latent:

#model syntax
epi.model2 = 'latent =~ V1+V2+V3+V4
latent2 =~ V5+V6+V7+V8
latent ~ group
latent ~ group2'

#analyze the model with the new DF
epi.fit2 = cfa(model = epi.model2,
               data = DF_dc)

#summarize the model               
summary(epi.fit2)

In your output, you will get two new lines for regression:

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  latent ~                                            
    group             0.026    0.014    1.847    0.065
    group2           -0.001    0.014   -0.082    0.935

The interpretation here is that the group variable compares group 1 to group 3 (the reference group, which is coded 0 on both variables). Its estimate means that group 1 scored 0.026 higher on latent than group 3, although that difference does not reach the usual .05 cutoff (p = .065). The second variable, group2, compares group 2 to group 3, and they basically have no difference on latent. You can learn more about dummy coding here.
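If the "difference from the reference group" reading feels abstract, here is a small sketch with an observed outcome instead of a latent one. The data are simulated with made-up group means (this is not the epi data), and lm() stands in for the lavaan regression, so it only illustrates the logic of the coefficients:

```r
set.seed(42)

#simulated outcome with three groups and made-up means
category = factor(rep(c("group", "group2", "group3"), each = 50))
y = rnorm(150, mean = c(0.5, 0.2, 0)[as.integer(category)])

#dummy codes with group3 as the reference (0 on both variables)
group = as.numeric(category == "group")
group2 = as.numeric(category == "group2")

#regression on the two dummy codes
coefs = coef(lm(y ~ group + group2))

#group means for comparison
means = tapply(y, category, mean)

#each slope equals that group's mean minus the group3 mean
coefs["group"]
means["group"] - means["group3"]
coefs["group2"]
means["group2"] - means["group3"]
```

In the lavaan model the outcome is a latent variable, so the differences are in the metric of latent rather than a raw score, but the comparison each coefficient makes is the same.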

Here’s the picture of that analysis:

#plot the second model, the one with the dummy coded predictors
semPaths(epi.fit2,
         whatLabels = "std",
         edge.label.cex = 1)

Two-factor lavaan model with dummy coded variables

Remember that, by default, lavaan only correlates exogenous latent variables automatically. Now that latent is endogenous (it is predicted by the dummy codes), that automatic correlation is gone, so we have a second latent variable hanging out in space that we would want to either predict with our dummy coded variables or connect in some other way. So, I would probably either add the correlation between latents back in with latent ~~ latent2, or add regressions that use the categoricals to predict latent2: latent2 ~ group and latent2 ~ group2.
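Here is a rough sketch of what those two options could look like as model syntax. The names epi.model3 and epi.model4 are mine for illustration, and I only parse the syntax with lavaanify(), which builds the parameter table without needing any data, so this shows the specification rather than a fitted result:

```r
library(lavaan)

#option 1: add the covariance between the latents back in
epi.model3 = 'latent =~ V1+V2+V3+V4
latent2 =~ V5+V6+V7+V8
latent ~ group
latent ~ group2
latent ~~ latent2'

#option 2: use the dummy codes to predict latent2 as well
epi.model4 = 'latent =~ V1+V2+V3+V4
latent2 =~ V5+V6+V7+V8
latent ~ group
latent ~ group2
latent2 ~ group
latent2 ~ group2'

#parse the syntax into parameter tables (no data required)
pt3 = lavaanify(epi.model3)
pt4 = lavaanify(epi.model4)

#show the regression and covariance rows that were specified
pt3[pt3$op %in% c("~", "~~"), c("lhs", "op", "rhs")]
pt4[pt4$op %in% c("~", "~~"), c("lhs", "op", "rhs")]
```

To actually fit either one, you would pass it to cfa() with the dummy coded data frame, the same way as the second model above.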

More lavaan help can be found on my YouTube channel!
