Power to the Stimuli: Not the Effect
Erin M. Buchanan
Harrisburg University
Power
- Power is the probability of detecting a specific effect if it exists
- Power is influenced by:
  - \(\alpha\)
  - Effect Size
  - Sample Size
  - Design Type
Study Planning
- Generally, we use power to estimate the sample size necessary to detect a specific effect given other constraints
- You must know the study design and statistical analysis
- However, the mapping from study design to statistical analysis is not one-to-one
- What if you don’t have a hypothesis test?
- What if you want to use a multiverse analysis?
Traditional Power Analyses
- Look it up in a book
- Popular programs like G*Power
- R packages like pwr and Shiny apps
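As a quick sketch of the traditional approach, pwr can return the per-group n for a two-sample design; the effect size and power targets below are arbitrary illustrative assumptions:

```r
# Illustrative pwr call: per-group n to detect a medium effect
# (d = 0.50) at alpha = .05 with 80% power in a two-sample design
library(pwr)
pwr.t.test(d = 0.50, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "two.sided")
```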
Newer Power Analyses
- Simulation of proposed analyses with data generation
- Shout out to faux and simr (a base-R sketch follows this list)
- Accuracy in Parameter Estimation (AIPE)
  - Estimate the sample size necessary to provide a “sufficiently narrow” confidence interval around a specific parameter
  - For example, a specific r or d and its confidence interval
- However, confidence intervals are still tied to traditional null hypothesis testing
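A minimal simulation-based power sketch in base R; faux and simr offer much richer data generation and mixed-model support, and all values below (n, effect size, number of replicates) are illustrative assumptions:

```r
# Estimate power by simulating the proposed analysis many times
set.seed(42)
power_est <- mean(replicate(1000, {
  g1 <- rnorm(50, mean = 0.0, sd = 1)  # control group
  g2 <- rnorm(50, mean = 0.5, sd = 1)  # treatment group, d = 0.5
  t.test(g1, g2)$p.value < .05         # was the effect detected?
}))
power_est  # proportion of significant replicates = estimated power
```

For the AIPE flavor, swap the p-value check for a confidence-interval width check, e.g., `diff(t.test(g1, g2)$conf.int) < 0.5`.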
Today’s Demonstration
- Use a combination of AIPE and simulation to accurately measure your parameters
- Separate from hypothesis test - simply measure your parameters well
- Any hypothesis test turns out how it turns out
- This idea does converge, as “accurately measured” = “sufficiently narrow” confidence intervals
Design Considerations
- What would this allow us to do?
  - Estimate between-group means separately
  - Estimate repeated-measures items individually
  - Maximize sampling to collect data only where we need more information
  - Use adaptive sampling and testing
Example Data
Proposed Steps
- Use pilot data that closely resembles your target data collection
- Calculate the standard error of each item and pick a cutoff
- Sample your pilot data at different Ns
- Calculate the percent of items that meet your cutoff
- Find the minimum sample size at which 80%-95% of the items meet the cutoff
- Designate a minimum N, a stopping rule, and a maximum N
- Note: this mostly applies to repeated-measures data but can be tweaked for between-subjects groups; a sketch of these steps follows
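A minimal sketch of these steps, assuming a hypothetical long-format data frame `pilot` with columns `item` (item ID) and `score` (the response):

```r
se <- function(x) sd(x) / sqrt(length(x))

# Steps 1-2: item standard errors and a cutoff (here, the median SE)
item_se <- tapply(pilot$score, pilot$item, se)
cutoff  <- median(item_se)

# Steps 3-4: resample n responses per item at increasing Ns and track
# the percent of items whose SE falls below the cutoff
ns  <- seq(20, 200, by = 10)
pct <- sapply(ns, function(n) {
  mean(tapply(pilot$score, pilot$item,
              function(x) se(sample(x, n, replace = TRUE))) < cutoff)
})

# Step 5: smallest N where at least 80% of items meet the cutoff
ns[which(pct >= .80)[1]]
```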
Key Issues
- This procedure should:
  - Show differences in projected N based on item heterogeneity
  - Show projected Ns that “level off” as the pilot data sample N increases
Data Simulation
- Population: 30 normally distributed items with 1000 data points each
- Scale of data:
  - Small data range (\(\mu\) = 4, \(\sigma\) = .25, Likert)
  - Medium data range (\(\mu\) = 50, \(\sigma\) = 10, Accuracy)
  - Large data range (\(\mu\) = 1000, \(\sigma\) = 150, Milliseconds)
- Item heterogeneity (one condition is sketched below):
  - Small data range (\(\sigma\) = 2, .2, .4, .8)
  - Medium data range (\(\sigma\) = 25, 4, 8, 16)
  - Large data range (\(\sigma\) = 500, 50, 100, 200)
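A sketch of one simulated population, under the assumption that heterogeneity governs the spread of the item means while the scale \(\sigma\) governs the within-item spread (one plausible reading of the design); the condition shown is the small scale with heterogeneity \(\sigma\) = .2:

```r
set.seed(1)
n_items <- 30
item_means <- rnorm(n_items, mean = 4, sd = 0.2)  # item heterogeneity
population <- sapply(item_means,
                     function(m) rnorm(1000, mean = m, sd = 0.25))
dim(population)  # 1000 data points x 30 items
```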
Data Simulation
- Samples:
  - Each population was sampled to mimic a researcher’s pilot study
  - Samples of 20, 30, 40 … 100
- Cutoff score criterion (sketched below):
  - The SE of each item was calculated
  - Deciles of the SEs were calculated (0% smallest SE, 10% … 90%)
  - Lower deciles are stricter, implying very narrow CIs
  - Higher deciles are less strict, implying wider CIs
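Continuing the sketch above, a pilot sample and its decile cutoffs might be computed as:

```r
# Draw one pilot sample and compute decile cutoffs of the item SEs
pilot_sample <- population[sample(1000, 40), ]  # pilot N = 40
item_se <- apply(pilot_sample, 2, function(x) sd(x) / sqrt(length(x)))
quantile(item_se, probs = seq(0, .9, by = .1))  # 0% ... 90% cutoffs
```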
Researcher Sample Simulation
- For each simulated pilot sample:
  - Simulate samples of 20 to 2000 from that sample
  - Calculate the SE of the items
  - Calculate the point at which 80, 85, 90, or 95% of items fall below the decile cutoffs
  - Repeat this 100 times (see the sketch below)
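Continuing the sketch, one projection run can be wrapped in a function and repeated 100 times; `pilot_sample` and `item_se` come from the previous sketch, and the 50% decile cutoff is an example choice:

```r
project_n <- function(samp, cutoff, target = .80) {
  for (n in seq(20, 2000, by = 20)) {
    resamp <- samp[sample(nrow(samp), n, replace = TRUE), ]
    ses <- apply(resamp, 2, function(x) sd(x) / sqrt(length(x)))
    if (mean(ses < cutoff) >= target) return(n)  # target % reached
  }
  NA  # the target percent was never reached
}
projected <- replicate(100, project_n(pilot_sample,
                                      cutoff = quantile(item_se, .5)))
summary(projected)
```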
Goal Reminder
- We should see increases in proposed sample size with increases in heterogeneity
- Increased pilot sample sizes should “level off” in their suggested sample size
Heterogeneity Results
[Figure: Power = 80, Small Scale]
Heterogeneity Results
[Figure: Power = 80, Medium Scale]
Pilot N Results
[Figure: Small Scale, Small Variance]
Pilot N Results
[Figure: Large Scale, Large Variance]
Need to Correct
- We need to correct for the fact that this procedure is necessarily dependent on the original pilot N
- We use bias reduction (Hedges) mixed with an exponential decay formula
\[ 1 - \sqrt{\frac{N_{Pilot} - \min(N_{Simulation})}{N_{Pilot}}}^{\log_2(N_{Pilot})} \]
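As a small R function, a direct translation of the formula; how the resulting factor is applied to the projected N is not shown here, and the inputs are illustrative:

```r
# Correction factor depending on the pilot N and the smallest
# simulated N
correction <- function(n_pilot, n_sim_min) {
  1 - sqrt((n_pilot - n_sim_min) / n_pilot)^log2(n_pilot)
}
correction(n_pilot = 40, n_sim_min = 20)  # illustrative inputs
```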
Correction in Action
[Figure: Small Scale, Small Variance]
Correction in Action
[Figure: Large Scale, Large Variance]
Recovering Projected Sample Size
- Each researcher would only have one sample
- So, can we figure out a correction formula for the researcher?
- Predict the corrected sample size with:
  - The projected sample size
  - Heterogeneity (SD of the items)
  - The original pilot sample size
Recovering Projected Sample Size
- Each variable was important in its own hierarchical regression step
- However, heterogeneity was unimportant in the final step
- \(R^2\) > .96
Parameters for All Decile Cutoff Scores
| Term | b | SE | t | p |
|---|---|---|---|---|
| Intercept | 34.549 | 0.425 | 81.264 | < .001 |
| Projected Sample Size | 0.621 | 0.003 | 247.039 | < .001 |
| Item SD | 0.000 | 0.003 | 0.014 | .989 |
| Pilot Sample Size | -0.483 | 0.008 | -64.040 | < .001 |
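A hedged sketch of the recovery formula these coefficients imply: because the Item SD coefficient is essentially zero, the corrected N can be predicted from the projected N and the original pilot N alone. The inputs below are illustrative:

```r
# Predicted corrected sample size using the table's coefficients
corrected_n <- function(projected, pilot) {
  34.549 + 0.621 * projected - 0.483 * pilot
}
corrected_n(projected = 200, pilot = 30)  # illustrative inputs
```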
Choosing an Appropriate Cutoff
- We can create a correction formula that is easy for researchers to apply
- What cutoff (decile) should we suggest?
  - 0%, 10%, and 20% are too restrictive
  - 50% appears to have the best fit (\(R^2\) = 0.968)
- The 50% cutoff also appears to meet our two goals:
  - The suggested N increases with heterogeneity and scale
  - The suggested N levels off for larger pilot samples
Choosing an Appropriate Cutoff
[Figure: Small Scale]
Choosing an Appropriate Cutoff
[Figure: Large Scale]
Real Example
- Concreteness ratings:
  - Concrete: “refers to something that exists in reality”
  - Abstract: “something you cannot experience directly through your senses or actions”
  - Participants rate each concept from 1 to 5
- 2,008,857 ratings from participants across 63,039 concepts
Concreteness Effect
- How do individual differences in concreteness ratings predict memory?
- Generally, (average) item concreteness is positively correlated with later memory
- In my proposed study, participants would rate item concreteness to activate their concepts and then be given a memory test
- How many participants do I need to sufficiently measure concreteness ratings?
Simulation Example
- Suggestion: it is probably best to pick the actual stimuli from this dataset that you would use
- We randomly sampled 100 words for our example study
  - Pilot N averaged 27.97 (SD = 1.51)
  - Not every participant saw every word
  - Some data was lost when participants said “do not know”
- The 50% decile of the item standard errors was 0.255 (see the sketch below)
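A sketch of that cutoff computation, assuming a hypothetical long-format data frame `ratings` with columns `word` and `rating` (1 to 5), with “do not know” responses already removed:

```r
# Median item standard error = the 50% decile cutoff
item_se <- tapply(ratings$rating, ratings$word,
                  function(x) sd(x) / sqrt(length(x)))
median(item_se)  # 0.255 in the talk's sample
```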
Simulation Example
- We would define our sampling as:
  - Minimum (80% of items): 44 participants
  - Stopping rule: item SE <= 0.255
  - Maximum (90% of items): 48 participants
- Note that the percent choice is subjective
- Including data retention, these Ns should be inflated to cover the expected data loss (e.g., “do not know” responses)
Wrapping Up
- Potentially most useful for:
  - Studies that aren’t tied to traditional hypothesis tests
  - Studies with many heterogeneous items that you want to average together or model with multilevel techniques
  - Studies where participants see a random subset of items
- Can be combined with traditional power analyses, as in the example below
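The output below matches pwr’s correlation power function; a sketch of the call that would produce it:

```r
# Traditional power analysis for a correlation of r = .37
library(pwr)
pwr.r.test(r = 0.37, sig.level = 0.05, power = 0.80)
```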
```
     approximate correlation power calculation (arctangh transformation)

              n = 54.19491
              r = 0.37
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
```
Thanks
- Thanks for listening!
- Suggestions and questions welcome
- GitHub: doomlab
- Twitter: @aggieerin
- YouTube: Statistics of DOOM