Power to the Stimuli: Not the Effect

Erin M. Buchanan

Harrisburg University

Power

  • Power is the probability of detecting a specific effect if it exists
  • Power is influenced by:
    • \(\alpha\)
    • Effect Size
    • Sample Size
    • Design Type

Study Planning

  • Generally, we use power analysis to estimate the sample size necessary to detect a specific effect, given other constraints
  • You must know the study design and statistical analysis
  • However, the mapping from study design to statistical analysis is not one-to-one
  • What if you don’t have a hypothesis test?
  • What if you want to use a multiverse analysis?

Traditional Power Analyses

  • Look it up in a book
  • Popular programs like G*Power
  • R packages like pwr and Shiny apps
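
For example, a lookup with the pwr package takes an effect size, alpha, and power; the values below are purely illustrative, not from the talk:

    library(pwr)
    # projected per-group N for an independent-samples t-test,
    # assuming d = 0.50, alpha = .05, power = .80 (illustrative)
    pwr.t.test(d = 0.50, sig.level = 0.05, power = 0.80,
               type = "two.sample", alternative = "two.sided")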

Newer Power Analyses

  • Simulation of proposed analyses with data generation
    • Shout out to faux and simr
  • Accuracy in Parameter Estimation (AIPE)
    • Estimate the necessary sample size to provide a “sufficiently narrow” confidence interval around a specific parameter
    • For example, specific r or d and confidence interval
    • However, confidence intervals are still tied to traditional null hypothesis testing
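
As a minimal sketch of the AIPE idea in R, assuming the target is a 95% CI around r no wider than .20 (all values illustrative, not from the talk):

    # AIPE by simulation: smallest N whose average 95% CI around r
    # is narrower than the target width
    set.seed(1)
    target_width <- 0.20
    true_r <- 0.30
    for (n in seq(50, 1000, by = 25)) {
      widths <- replicate(200, {
        x <- rnorm(n)
        y <- true_r * x + sqrt(1 - true_r^2) * rnorm(n)
        diff(cor.test(x, y)$conf.int)   # CI width for this sample
      })
      if (mean(widths) <= target_width) { print(n); break }
    }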

Today’s Demonstration

  • Use a combination of AIPE and simulation to accurately measure your parameters
  • Separate from the hypothesis test: simply measure your parameters well
    • Any hypothesis test turns out how it turns out
    • The two ideas converge, since “accurately measured” means “sufficiently narrow” confidence intervals

Design Considerations

  • What would this allow us to do?
    • Between-group means estimated separately
    • Repeated-measures items estimated individually
    • Sampling maximized by collecting data only where we need more information
    • Adaptive sampling and testing

Example Data

Proposed Steps

  • Use pilot data that closely resembles your target data collection
  • Calculate the standard error of each item and pick a cutoff
  • Sample your pilot data at different Ns
  • Calculate the percent of items that meet your cutoff
  • Find the minimum sample size at which 80% to 95% of the items meet the cutoff
  • Designate a minimum N, a stopping rule, and a maximum N
  • Note: this mostly applies to repeated-measures data but can be tweaked for between-subjects groups; a sketch of these steps follows
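
A minimal sketch of these steps in R, assuming pilot is a participants-by-items data frame (the name, the grid of Ns, and the 50% decile are illustrative choices):

    # standard error of each item from the pilot data
    item_se <- apply(pilot, 2,
                     function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x))))
    cutoff  <- quantile(item_se, probs = 0.50)    # e.g., the 50% decile

    # resample the pilot data at different Ns and track the percent
    # of items whose SE meets the cutoff
    ns <- seq(20, 2000, by = 20)
    percent_ok <- sapply(ns, function(n) {
      boot <- pilot[sample(nrow(pilot), n, replace = TRUE), ]
      se   <- apply(boot, 2,
                    function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x))))
      mean(se <= cutoff)
    })

    # smallest N at which 80% of the items meet the cutoff
    min(ns[percent_ok >= 0.80])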

Key Issues

  • This procedure should:
    • Show differences in projected N based on item heterogeneity
    • Show projected N “leveling off” as the pilot data sample N increases

Data Simulation

  • Population: 30 normally distributed items, each with 1000 data points
  • Scale of data:
    • Small data range (\(\mu\) = 4, \(\sigma\) = .25, Likert)
    • Medium data range (\(\mu\) = 50, \(\sigma\) = 10, Accuracy)
    • Large data range (\(\mu\) = 1000, \(\sigma\) = 150, Milliseconds)
  • Item heterogeneity:
    • Small data range (\(\sigma\) = 2, .2, .4, .8)
    • Medium data range (\(\sigma\) = 25, 4, 8, 16)
    • Large data range (\(\sigma\) = 500, 50, 100, 200)
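
A sketch of one population cell in R, reading item heterogeneity as the SD of the item means (my assumption; the medium scale is shown, and the other cells swap in the values above):

    set.seed(42)
    # 30 item means drawn with heterogeneity SD = 8 around a mean of 50,
    # then 1000 data points per item with SD = 10
    item_means <- rnorm(30, mean = 50, sd = 8)
    population <- sapply(item_means,
                         function(m) rnorm(1000, mean = m, sd = 10))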

Data Simulation

  • Samples:
    • Each population was sampled to mimic a researcher’s pilot study
    • Samples of 20, 30, 40 … 100
  • Cut off score criterion:
    • SE of each item was calculated
    • Deciles of SE were calculated (0% smallest SE, 10% … 90%)
    • Lower deciles would be strict, implying very narrow CIs
    • Higher deciles would be less strict, implying wider CIs
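
Continuing the sketch above, the item SEs and decile cutoffs for one pilot-sized sample might be computed as:

    samp    <- population[sample(nrow(population), 40), ]   # e.g., pilot N = 40
    item_se <- apply(samp, 2, function(x) sd(x) / sqrt(length(x)))
    # decile cutoffs from 0% (strictest, narrowest CIs) to 90% (most lenient)
    quantile(item_se, probs = seq(0, 0.9, by = 0.1))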

Researcher Sample Simulation

  • For each simulated pilot sample:
    • Simulate samples of 20 to 2000 from that sample
    • Calculate the SE of items
    • Calculate the point at which 80%, 85%, 90%, or 95% of items fall below the decile cutoffs
    • Repeat this 100 times
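
Continuing the same sketch, the 100 repetitions might look like this; resampling with replacement lets the simulated N exceed the pilot N:

    ns     <- seq(20, 2000, by = 20)
    cutoff <- quantile(item_se, probs = 0.50)   # one decile cutoff
    projected <- replicate(100, {
      ok <- sapply(ns, function(n) {
        boot <- samp[sample(nrow(samp), n, replace = TRUE), ]
        se   <- apply(boot, 2, function(x) sd(x) / sqrt(length(x)))
        mean(se <= cutoff) >= 0.80    # 80% of items below the cutoff
      })
      ns[which(ok)[1]]   # first N meeting the criterion
    })
    summary(projected)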

Goal Reminder

  • We should see increases in proposed sample size with increases in heterogeneity
  • The suggested sample size should “level off” as pilot sample size increases

Heterogeneity Results

Power = 80, Small Scale

Heterogeneity Results

Power = 80, Medium Scale

Pilot N Results

Small Scale, Small Variance

Pilot N Results

Large Scale, Large Variance

Need to Correct

  • Need to correct for the fact that this procedure is necessarily dependent on the original pilot N
  • Using a Hedges-style bias reduction mixed with an exponential decay formula

\[ 1 - \left( \sqrt{\frac{N_{Pilot} - \min(N_{Simulation})}{N_{Pilot}}} \right)^{\log_2(N_{Pilot})} \]
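
As an R function, this is a direct transcription of the correction factor above (the function name is mine):

    correction <- function(n_pilot, n_sim_min) {
      1 - sqrt((n_pilot - n_sim_min) / n_pilot)^log2(n_pilot)
    }
    correction(n_pilot = 40, n_sim_min = 20)   # illustrative values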

Correction in Action

Small Scale, Small Variance

Correction in Action

Large Scale, Large Variance

Recovering Projected Sample Size

  • Each researcher would only have one sample
  • So, can we figure out a correction formula for the researcher?
  • Predict the corrected sample size with:
    • The projected sample size
    • Heterogeneity (SD of items)
    • Original pilot sample size

Recovering Projected Sample Size

  • Each variable was important in its own hierarchical regression step
  • However, heterogeneity was unimportant in the final step
  • \(R^2\) > .96

Parameters for All Decile Cutoff Scores

  Term                     Estimate   \(SD\)     \(t\)      \(p\)
  Intercept                  34.549    0.425     81.264    < .001
  Projected Sample Size       0.621    0.003    247.039    < .001
  Item SD                     0.000    0.003      0.014      .989
  Pilot Sample Size          -0.483    0.008    -64.040    < .001
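
Plugging in the reported estimates gives a simple prediction sketch; Item SD drops out because its coefficient is 0 (names and example values here are mine):

    corrected_n <- function(projected_n, pilot_n) {
      34.549 + 0.621 * projected_n - 0.483 * pilot_n
    }
    corrected_n(projected_n = 200, pilot_n = 40)   # illustrative values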

Choosing an Appropriate Cutoff

  • We can create a correction formula that is easy for researchers to apply
  • What cutoff (decile) should we suggest?
    • 0%, 10%, and 20% are too restrictive
    • 50% appears to have the best fit (\(R^2\) = 0.968)
    • Also appears to meet our two goals:
      • Increases suggestion with heterogeneity and scale
      • Leveling off for larger pilot samples

Choosing an Appropriate Cutoff

Small Scale

Choosing an Appropriate Cutoff

Large Scale

Real Example

  • Concreteness Ratings:
    • Concrete: “refers to something that exists in reality”
    • Abstract: “something you cannot experience directly through your senses or actions”
    • Participants rate from 1 to 5
    • 2,008,857 ratings from participants across 63,039 concepts

Concreteness Effect

  • How do individual differences in concreteness ratings predict memory?
    • Generally, (average) item concreteness is positively correlated with later memory
    • In my proposed study, participants would rate item concreteness to activate their concepts and then be given a memory test
    • How many participants do I need to sufficiently measure concreteness ratings?

Simulation Example

  • Suggestion: it is probably best to pick the actual stimuli from this dataset that you would use
  • Randomly sampled 100 words for our example study
  • Pilot N averaged 27.97 (SD = 1.51)
    • Not every participant saw every word
    • Some data loss when participants said “do not know”
  • The 50% decile of the item standard errors was 0.255

Simulation Example

  • We would define our sampling as:
    • Minimum N (80% of items): 44
    • Stopping rule: item SE <= 0.255
    • Maximum N (90% of items): 48
    • Note that the percent choice is subjective
  • Accounting for data retention:
    • Minimum: 53
    • Maximum: 58
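
As a sketch, the stopping rule could be checked after each batch of participants, assuming ratings is a participants-by-words matrix (the function name is hypothetical):

    check_stop <- function(ratings, n_min = 53, n_max = 58, se_cut = 0.255) {
      n  <- nrow(ratings)
      se <- apply(ratings, 2,
                  function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x))))
      if (n < n_min) return("keep sampling")
      if (n >= n_max || all(se <= se_cut, na.rm = TRUE)) return("stop")
      "keep sampling"
    }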

Wrapping Up

  • Potentially most useful for:
    • Studies that aren’t tied to traditional hypothesis tests
    • Studies with many heterogeneous items that you want to average together or model with multilevel techniques
    • Studies where participants see a random subset of items
  • Can be combined with traditional power analyses
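
For example, the output below comes from the pwr package in R and can be reproduced with a call like:

    library(pwr)
    pwr.r.test(r = 0.37, sig.level = 0.05, power = 0.80,
               alternative = "two.sided")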

     approximate correlation power calculation (arctangh transformation) 

              n = 54.19491
              r = 0.37
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Thanks

  • Thanks for listening!
  • Suggestions and questions welcome
  • GitHub: doomlab
  • Twitter: @aggieerin
  • YouTube: Statistics of DOOM