Lasso Myself Numbers

Hey everybody!

The last couple days I have been trying to learn LASSO regression, which stands for Least Absolute Shrinkage and Selection Operator. I have several datasets with many variables, and I thought these would be a good opportunity to learn about how to lasso, while maybe answering a few questions about words.

Right, the part I forgot about is that I have repeated measures data, which always complicates things. One day I will give up liking words and do simple between-subjects studies, but today is not that day.

This project can be found at: https://github.com/doomlab/Nick-Thesis - although, if you are truly into the background details, I’d probably tell you to read the published manuscript pilot part of this study at: https://osf.io/fcesn/. Nick and I are interested in bad judgments - how can we predict the sensitivity of these judgments?

These studies are played Family Feud style. You are shown two words rock-roll and asked “if I said the first word (rock), how many times out of 100 would someone say the second word (roll)”. You guess the probability of their association, and then we compare your answer to the real answer. Our previous work has shown that people are fairly insensitive to the differences in highly related pairs and less related pairs, so they have a shallow slope (which means they can’t tell the difference between things that are only a little bit related and a lot related). Why does that matter? We guess probabilities all the time in our daily lives, but especially important as this exact same effect is shown in judgments of learning (i.e., how well do you think you will do on a test over this material?). So, to students, understanding what influences judgments seems important.

Now, I’m more interested in what part of the word network influences these judgments because words are interesting. :) We generally use measures of similarity (word pair combinations) to predict these judgments because that’s what you are asking participants to do - estimate similarity. However, there are lots of individual word influences that might affect judgments - part of speech, length, familiarity, etc.

We originally did this analysis by using ordinary least squares (“regular”) regression with a lot of steps (hierarchical). It’s a lot of variables, but stepwise regression has a bad name in the social sciences. I’m going to use LASSO regression here to see if I find anything useful.

First, let’s get the full data (across two experiments, with different word pairs) from GitHub.

library(RCurl)
thesis.dat <- read.csv(text = getURL("https://raw.githubusercontent.com/doomlab/Nick-Thesis/master/3%20manuscripts/Melted%20Data%20pilot.csv"))

pilot.dat <- read.csv(text = getURL("https://raw.githubusercontent.com/doomlab/Nick-Thesis/master/3%20manuscripts/master.csv"))

Next, data cleanup:

Fixing column names, factor label names, and using only the columns I was interested in. I dropped the phonemes and syllables columns because these are nearly perfectly correlated with word length - because longer words naturally have more sounds. Please note that the 1 and 2 at the end of variables denotes which word it relates to (cue versus target).

#we didn't name columns the same thing
colnames(thesis.dat)[2] <- "Judgment"

#put together the two datasets
full.dat <- rbind(thesis.dat, pilot.dat)

#cleaning out when people estimated more than 100
full.dat$Judged.Value[ full.dat$Judged.Value > 100] <- NA

#again consistent naming
full.dat$Judgment <- tolower(full.dat$Judgment)
full.dat$Judgment <- as.factor(full.dat$Judgment)

#all the variables I am going to use
xcolumns <- c("COS", "LSA", "FSG","QSS.1","QCON.1",
              "LogSub.1","Length.1" ,"POS.1","Ortho.1","Phono.1",
              "Morphemes.1","AOA.1", "Valence.1","Imageability.1",
              "Familiarity.1","QCON.2","LogSub.2","Length.2",
              "POS.2","Ortho.2","Phono.2","Morphemes.2",
              "AOA.2","Valence.2","Imageability.2","Familiarity.2",
              "TSS.2","FSS.1","FSS.2","COSC.1","COSC.2", "Judgment", 
              "Partno")

#just that data
full.dat <- full.dat[ ,c(xcolumns, "Judged.Value") ]

#merge several mostly empty categories together 
full.dat$POS.1 <- as.character(full.dat$POS.1)
full.dat$POS.2 <- as.character(full.dat$POS.2)
full.dat$POS.1 <- gsub("Other", "ADJ", full.dat$POS.1)
full.dat$POS.2 <- gsub("Other", "ADJ", full.dat$POS.2)
full.dat$POS.2 <- gsub("JJ", "ADJ", full.dat$POS.2)

full.dat$POS.1 <- as.factor(full.dat$POS.1)
full.dat$POS.2 <- as.factor(full.dat$POS.2)
full.dat$POS.1 <- droplevels(full.dat$POS.1)
full.dat$POS.2 <- droplevels(full.dat$POS.2)

#found out you can't have missing data for this analysis 
full.dat <- na.omit(full.dat)

Now, one thing I figured out when running the analysis below is that everyone must have a complete dataset. Meaning, if a participant skipped trial 7 or actually hit 500 on a judgment (which we would set to NA because it can’t be more than 100)…then you lose the whole participant. That is unfortunate, and I would love to know if I am wrong here, but I couldn’t get it to run otherwise.

So, I factored the participant number, calculated the number of trials that they had, and dropped everyone who was not equal to 63.

full.dat$Partno <- factor(full.dat$Partno)
temp <- as.data.frame(table(full.dat$Partno))

full.dat2 <- subset(full.dat,
                    Partno %in% temp$Var1[temp$Freq == 63])

full.dat2$Partno <- droplevels(full.dat2$Partno)

Everything I read about the LASSO indicated it needed some sort of scaling. I tried this analysis by using z-scores of everything, and the numbers I got were WILD. It seemed more appropriate to me to put the data all in the same basic scale, which means that the coefficients can be compared more directly. Since the dependent variable (Judged Value) is 0 to 100, I scaled all the data to be 0 to 100 based on the possible values of the variables.

#scale the data 0 to 100
vars <- c("COS", "LSA", "FSG")
full.dat2[ , vars] <- full.dat2[ , vars] * 100

For example, frequency (LogSub), neighborhoods (how many words sound and visually look alike), and set sizes (how many words are similar > 0), can all be 0 to possibly infinity. So I took each score, subtracted the minimum (0) and divided by the max in the data, since you can’t really determine the upper bounds of these.

#scale the data 0 to inf
vars <- c("QSS.1", "LogSub.1", "Ortho.1", "Phono.1", "AOA.1", "LogSub.2", 
          "Ortho.2", "Phono.2", "AOA.2", "TSS.2", "COSC.1", "COSC.2")
full.dat2[ , vars] <- apply(full.dat2[ , vars], 2, function(x)
  {
  (x - 0)/max(x) * 100 
  })

Length, number of morphemes, and feature set sizes are minimally 1, so did the same thing with a minimum of 1 in this group.

#scale the data 1 to inf
vars <- c("Length.1", "Morphemes.1", "Length.2", "Morphemes.2",
          "FSS.1", "FSS.2")
full.dat2[ , vars] <- apply(full.dat2[ , vars], 2, function(x)
  {
  (x - 1)/max(x) * 100 
  })

Last, the data collected for concreteness, familiarity, and imageability all ranged from 1 to 7, while valence ranged from 1 to 10.

#scale the data 1 to 7
vars <- c("QCON.1", "Imageability.1", "Familiarity.1",
          "QCON.2", "Imageability.2", "Familiarity.2")
full.dat2[ , vars] <- apply(full.dat2[ , vars], 2, function(x)
  {
  (x - 1)/7 * 100 
  })

#scale the data 1 to 10
vars <- c("Valence.1", "Valence.2")
full.dat2[ , vars] <- apply(full.dat2[ , vars], 2, function(x)
  {
  (x - 1)/10 * 100 
  })

Now, on to the good part. I am using glmmLasso because my data is repeated measures (sigh, I like mixed models, but learning new things in mixed models is hard). I took the tutorial from the glmmLasso package. In this section, you are setting up the possible penalty (\(\lambda\)) values, and estimating good starting values for your random effects. The tricky part was figuring out what the delta vector should have been - it’s fixed effects, followed by 0s for all the possible fixed effects of the number of predictors, followed by the estimated random intercepts for the participants. I am using participants here as my random variable, but I could also try items. My research tends to focus on the participants’ intercept (bias) and slope (sensitivity), so participants’ variability seems more appropriate to me, but I can see it both ways. I have tried both before and usually it doesn’t help to add the second one. I also know that participants vary a lot in their slopes, so I’ll try estimating those random effects later.

library(glmmLasso)

lambda <- seq(100, 0, by=-5)

BIC_vec <- rep(Inf, length(lambda))

## first fit good starting model
library(MASS)
library(nlme)

PQL <- glmmPQL(Judged.Value ~ 1,
             random = ~1|Partno,
             family = "gaussian",
             data = full.dat2)

## iteration 1

## iteration 2

Delta.start <- c(as.numeric(PQL$coef$fixed),
                rep(0, 40), #number of predictors
                as.numeric(t(PQL$coef$random$Partno)))

Q.start <- as.numeric(VarCorr(PQL)[1,1])

This section I ran a couple times with various lambda values (even larger ones). The purpose of this loop is to find the best penalty value - I actually found that values > 100 were all equal in BIC and everything <= 95 were all a slightly smaller value. Since I want to reduce variables, treating lambda as zero seemed inappropriate, so I just went with 95. I am also terrible at try-catch so props to the package for providing an example. PS this model runs slowly. Several sandwiches were eaten.

for (j in 1:10)
  {
    print(paste("Iteration ", j, sep=""))
    
    glm1 <- try(
      glmmLasso(Judged.Value ~ COS + LSA + FSG + QSS.1 + QCON.1 + LogSub.1 + 
            Length.1 + as.factor(POS.1) + Ortho.1 + Phono.1 + Morphemes.1 + 
            AOA.1 + Valence.1 + Imageability.1 + Familiarity.1 + QCON.2 + 
            LogSub.2 + Length.2 + Ortho.2 + Phono.2 + as.factor(POS.2) + 
            Morphemes.2 + AOA.2 + Valence.2 + Imageability.2 + Familiarity.2 + 
            TSS.2 + FSS.1 + FSS.2 + COSC.1 + COSC.2 + as.factor(Judgment),
          rnd = list(Partno = ~1), 
          family = gaussian(link = "identity"),
          data = full.dat2,
          lambda = lambda[j],
          switch.NR = TRUE,
          final.re = TRUE,
          control=list(start = Delta.start,
                       q_start = Q.start)), 
      silent = TRUE)
    if(class(glm1)!="try-error")
      {  
      BIC_vec[j]<-glm1$bic
      }
  }
    
which.min(BIC_vec)

Here’s the final model. You use the normal Y ~ X + X + X... notation to create your formula. I just threw everything in. Even though part of speech (POS) and Judgment type (Judgment) were already factored, you must use as.factor() around them in the formula. That took a while to figure out, so be noted!

Random notation is somewhat similar to lme4 or nlme with the Variable = ~1 for the intercept. I assumed these variables were somewhat normal, so I used the gaussian family to estimate the model. The rest are all default/suggested start values that I pulled from the vignette.

final_lambda <- 95
        
glm1_final <- glmmLasso(Judged.Value ~ COS + LSA + FSG + QSS.1 + QCON.1 + LogSub.1 + 
            Length.1 + as.factor(POS.1) + Ortho.1 + Phono.1 + Morphemes.1 + 
            AOA.1 + Valence.1 + Imageability.1 + Familiarity.1 + QCON.2 + 
            LogSub.2 + Length.2 + Ortho.2 + Phono.2 + as.factor(POS.2) + 
            Morphemes.2 + AOA.2 + Valence.2 + Imageability.2 + Familiarity.2 + 
            TSS.2 + FSS.1 + FSS.2 + COSC.1 + COSC.2 + as.factor(Judgment),
          rnd = list(Partno = ~1), 
          family = gaussian(link = "identity"),
          data = full.dat2,
          lambda = final_lambda,
          switch.NR = TRUE,
          final.re = TRUE,
          control=list(start = Delta.start,
                       q_start = Q.start))

## Warning in if (class(InvFisher2) == "try-error") InvFisher2 <-
## try(solve(F_gross), : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(InvFisher2) == "try-error") {: the condition has length > 1
## and only the first element will be used

## Warning in if (class(InvFisher2) == "try-error") InvFisher2 <-
## try(solve(F_gross), : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(InvFisher2) == "try-error") {: the condition has length > 1
## and only the first element will be used

## Warning in if (class(InvFisher2) == "try-error") InvFisher2 <-
## try(solve(F_gross), : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(InvFisher2) == "try-error") {: the condition has length > 1
## and only the first element will be used

## Warning in if (class(InvFisher) == "try-error") InvFisher <- solve(F_gross): the
## condition has length > 1 and only the first element will be used

## Warning in if (class(InvFisher) == "try-error") InvFisher <-
## try(solve(F_gross), : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(InvFisher) == "try-error") {: the condition has length > 1
## and only the first element will be used

## Warning in if (class(InvFisher) == "try-error") InvFisher <-
## try(solve(F_gross), : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(InvFisher) == "try-error") {: the condition has length > 1
## and only the first element will be used

## Warning in if (class(InvFisher) == "try-error") InvFisher <-
## try(solve(F_gross), : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(InvFisher) == "try-error") {: the condition has length > 1
## and only the first element will be used

## Warning in if (class(InvFisher) == "try-error") InvFisher <-
## try(solve(F_gross), : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(InvFisher) == "try-error") {: the condition has length > 1
## and only the first element will be used

Before we see the results, let me say I asked Nick to preregister an “important value” of the estimate to be useful. He suggested 5-10 points meaning that the variable needs to influence the slope 5 to 10 points before we should consider it useful. That seemed reasonable until I remembered that the traditional formula for one of these simple judgment models is \(\hat{Y} = .23*FSG + .50\)…which means that for every one point increase in forward strength (how often someone answer the target for the cue), we get a quarter of a point increase in judgment. This means that slopes are FLAT - thus, participants are insensitivity to actual value.

That’s true in this dataset too (simple lm to demonstrate):

summary(lm(Judged.Value ~ FSG, data = full.dat2))

## 
## Call:
## lm(formula = Judged.Value ~ FSG, data = full.dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.461 -19.929   5.975  22.987  40.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 59.22297    0.32596  181.69   <2e-16 ***
## FSG          0.35312    0.01447   24.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.63 on 11527 degrees of freedom
## Multiple R-squared:  0.04913,    Adjusted R-squared:  0.04905 
## F-statistic: 595.6 on 1 and 11527 DF,  p-value: < 2.2e-16

So, maybe we should go with .10 scaled for his answer as the low end. I’ve not seen participants push much past .50 for slope, so that could be our “wow” coefficients.

library(flextable)

output <- summary(glm1_final)
coef <- as.data.frame(output[['coefficients']])
coef$Label <- rownames(coef)
coef <- coef[ , c(5, 1:4)]
coef[ , 2:5] <- round(coef[ , 2:5], 3)

flextable(coef)

Label	Estimate	StdErr	z.value	p.value
(Intercept)	91.647	0.875	104.691	0
COS	0.198	0.001	329.874	0
LSA	0.134	0.001	184.994	0
FSG	0.179	0.001	240.165	0
QSS.1	0.116	0.001	142.167	0
QCON.1	-0.238	0.002	-145.369	0
LogSub.1	0.000	NA	NA	NA
Length.1	-0.213	0.001	-143.787	0
as.factor(POS.1)NN	10.685	0.045	236.639	0
as.factor(POS.1)VB	2.288	0.089	25.739	0
Ortho.1	0.042	0.001	57.524	0
Phono.1	-0.138	0.001	-205.289	0
Morphemes.1	-0.229	0.001	-195.481	0
AOA.1	0.036	0.001	30.232	0
Valence.1	0.130	0.001	117.649	0
Imageability.1	0.227	0.002	95.730	0
Familiarity.1	0.424	0.006	69.831	0
QCON.2	0.240	0.002	138.408	0
LogSub.2	0.253	0.002	150.359	0
Length.2	0.065	0.001	47.859	0
Ortho.2	-0.111	0.001	-142.127	0
Phono.2	0.107	0.001	138.263	0
as.factor(POS.2)NN	23.547	0.052	450.900	0
as.factor(POS.2)VB	23.818	0.128	186.547	0
Morphemes.2	0.102	0.001	68.657	0
AOA.2	-0.315	0.001	-230.114	0
Valence.2	-0.421	0.001	-357.356	0
Imageability.2	-0.399	0.002	-191.577	0
Familiarity.2	-0.999	0.004	-233.136	0
TSS.2	-0.065	0.001	-88.005	0
FSS.1	-0.091	0.001	-117.161	0
FSS.2	0.019	0.001	21.635	0
COSC.1	0.014	0.001	19.705	0
COSC.2	-0.136	0.001	-218.234	0
as.factor(Judgment)semantic	3.794	0.020	191.672	0
as.factor(Judgment)thematic	9.434	0.026	368.773	0

Sooooooooooo, I learned this analysis to help remove variables, but here we are. How can they all possibly be useful?! The only one that drops out - gets “shrink”ed - is the frequency of the first word (LogSub). These z values appear especially crazy - and they are that large if I use z-scores as my scaling as well.

For the categorical variables, the larger numbers are simply the difference between the ratings for those types of words (part of speech) or those judgment conditions. These results are not that different than a normal mixed model - so, maybe just examining the coefficient size is going to be key here.

nlme_model <- lme(Judged.Value ~ COS + LSA + FSG + QSS.1 + QCON.1 + LogSub.1 + 
            Length.1 + as.factor(POS.1) + Ortho.1 + Phono.1 + Morphemes.1 + 
            AOA.1 + Valence.1 + Imageability.1 + Familiarity.1 + QCON.2 + 
            LogSub.2 + Length.2 + Ortho.2 + Phono.2 + as.factor(POS.2) + 
            Morphemes.2 + AOA.2 + Valence.2 + Imageability.2 + Familiarity.2 + 
            TSS.2 + FSS.1 + FSS.2 + COSC.1 + COSC.2 + as.factor(Judgment),
            data = full.dat2, 
            random = list(~1|Partno))
head(summary(nlme_model)$tTable)

##                  Value   Std.Error    DF   t-value      p-value
## (Intercept) 92.0909612 14.16709276 11311  6.500343 8.350859e-11
## COS          0.2111590  0.01485150 11311 14.218025 1.744498e-45
## LSA          0.1228498  0.01773113 11311  6.928478 4.484940e-12
## FSG          0.1722724  0.01819244 11311  9.469448 3.370492e-21
## QSS.1        0.1012572  0.02005057 11311  5.050092 4.484792e-07
## QCON.1      -0.2585684  0.04001545 11311 -6.461714 1.077764e-10

So, uh, everything matters in judgments? I will want to update this analysis with the direct versus indirect associations using the Small World of Words, rather than the older free association norms, but after several sandwiches of running, I thought to leave that for another day. The most interesting thing that we noted during his thesis was that the first and second words almost have a competition going - often when one of the words has a positive slope, the second has a negative. For example, the imageability of the first word (cue) is positively influencing judgments, while the imageability of the second word (target) negatively influences judgments. Maybe this explains some of the flat slopes is the contrasting influences of the networks of each word.

Thanks for reading! Questions welcome.

A list of variables:

Lexical properties:
- Length
- Frequency (LogSub)
- Morphemes
Rated Properties
- Age of Acquisition (AOA)
- Imageabilty
- Concreteness (QCON)
- Valence
- Familiarity
Neighborhood Connections
- Cue and target set size (QSS, TSS, number of words with FSG > 0 to that word)
- Feature set size (semantic richness, FSS)
- Cosine connectedness (COSC, number of words with COS > 0 to that word)
- Orthographic neighborhood (Ortho)
- Phonographic neighborhood (Phono)
Similarity
- Cosine (COS on feature production norms)
- Forward strength (FSG from Nelson et al. SWOW wasn’t out yet ok?)
- Latent semantic analysis cosine (LSA from Semantic Priming Project)