Session 6
PMAP 8521: Program evaluation
Andrew Young School of Policy Studies
Construct validity
Statistical conclusion validity
Internal validity
External validity
A new program hopes to
improve student commitment to school
Participants score 200 points higher on the SAT and have a 0.3 higher GPA, on average
Success! Success?
The streetlight effect: a drunk guy looks for his keys in the light of the lamppost instead of over in the bushes where he lost them
Are you measuring what you want to measure?
Do test scores measure commitment to school?
Teacher performance? Principal skill?
Test scores measure how good kids are at taking tests
This is why we spend so much time
on outcome measurement construction!
Are your statistics correct?
Statistical power
Violated assumptions of statistical tests
Fishing and p-hacking
Spurious statistical significance
A training program causes incomes to rise by $40
Person | Group | Before | After | Difference |
---|---|---|---|---|
295 | Control | 122.09 | 229.04 | 106.95 |
126 | Treatment | 205.60 | 199.84 | -5.76 |
400 | Control | 133.25 | 130.40 | -2.85 |
94 | Treatment | 270.11 | 206.56 | -63.54 |
250 | Control | 344.37 | 222.89 | -121.49 |
59 | Treatment | 312.41 | 268.06 | -44.35 |
Survey 10 participants
Survey 200 participants
Use a statistical power calculator to
make sure you can potentially detect an effect
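As a rough illustration of what a power calculator does, here is a minimal sketch using a normal approximation for a two-group comparison of means. The $40 effect comes from the slide; the standard deviation of 100 and the reading of "10" and "200" as participants per group are assumptions made only for this example, and `approx_power` is a hypothetical helper, not a library function.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value, about 1.96 for alpha = 0.05
    se = sd * sqrt(2 / n_per_group)              # standard error of the difference in means
    shift = effect / se                          # true effect measured in standard errors
    # Probability that the test statistic lands in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Assumed values: effect of $40 (from the slide), SD of 100 (hypothetical)
power_small = approx_power(effect=40, sd=100, n_per_group=10)
power_large = approx_power(effect=40, sd=100, n_per_group=200)

print(f"Power with 10 per group:  {power_small:.2f}")
print(f"Power with 200 per group: {power_large:.2f}")
```

Under these assumptions the small survey would usually miss a real $40 effect, while the large one would almost always detect it.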
Every statistical test has certain assumptions
For instance, for OLS:
Linearity, homoscedasticity, independence, normality
Make sure you're doing the stats correctly
Wouldn't it be awesome to run thousands of models
with different combinations of variables
until you find coefficients that are statistically significant?
Don't!
If the p-value threshold is 0.05 and you measure 20 outcomes,
about 1 will likely look statistically significant by chance alone
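The arithmetic behind that claim can be sketched as follows, assuming the 20 tests are independent and that there is no true effect on any outcome (both assumptions are for illustration only).

```python
alpha = 0.05       # p-value threshold from the slide
n_outcomes = 20    # number of outcomes measured

# Expected number of false positives if no outcome has a real effect
expected_false_positives = alpha * n_outcomes

# Chance that at least one outcome looks significant purely by chance,
# assuming the tests are independent
p_at_least_one = 1 - (1 - alpha) ** n_outcomes

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {p_at_least_one:.2f}")
```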
Omitted variable bias: selection, attrition
Trends: maturation, secular trends, seasonality, testing, regression
Study calibration: measurement error, time frame
Contamination: Hawthorne, John Henry, spillovers, intervening events
If people can choose to enroll in a
program, those who enroll will be
different from those who do not
How to fix
Randomization into
treatment and control groups
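A minimal sketch of random assignment, assuming a hypothetical list of 20 participant IDs. The seed is only there to make the example reproducible.

```python
import random

# Hypothetical participant IDs (not from the slides)
participants = list(range(1, 21))

# Seeded generator so the sketch gives the same split every time it runs
rng = random.Random(8521)

shuffled = participants[:]   # copy so the original order is untouched
rng.shuffle(shuffled)

half = len(shuffled) // 2
treatment = sorted(shuffled[:half])   # first half of the shuffled list
control = sorted(shuffled[half:])     # second half of the shuffled list

print("Treatment:", treatment)
print("Control:  ", control)
```

Because chance decides who lands in each group, people cannot sort themselves into the program.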
If people can choose when to
enroll in a program, time might
influence the result
How to fix
Shift time around
Example: happier people are more likely to get married, so without randomly assigning marriage, how would you study the impact of marriage on happiness? One simple approach: since happiness varies over time, set marriage equal to time zero and build a pre-post design around it. You essentially leverage the within-group variance and iron out across-age differences, because people marry at different ages. The whole insight is to change the timeline from calendar years to program years.
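Shifting from calendar years to program years can be sketched like this. The two people, their marriage years, and their happiness scores are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (person, calendar year, happiness score)
records = [
    ("A", 2001, 5.0), ("A", 2002, 6.0), ("A", 2003, 7.0),
    ("B", 2009, 4.0), ("B", 2010, 5.0), ("B", 2011, 6.5),
]

# Hypothetical year each person got married (their "time zero")
marriage_year = {"A": 2002, "B": 2010}

# Re-index every observation relative to that person's marriage year
by_event_time = defaultdict(list)
for person, year, happiness in records:
    event_time = year - marriage_year[person]   # -1 = year before, 0 = year of, +1 = year after
    by_event_time[event_time].append(happiness)

# Average happiness at each point on the shifted timeline
event_time_means = {t: mean(scores) for t, scores in sorted(by_event_time.items())}
print(event_time_means)
```

People who married in different calendar years now line up on the same timeline, so the pre-post comparison is built around the event rather than the calendar.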
If the people who leave a program or
study are different than those who stay,
the effects will be biased
How to fix
Check characteristics of those
who stay and those who leave
ID | Increase in income | Remained in program |
---|---|---|
1 | $3.00 | Yes |
2 | $3.50 | Yes |
3 | $2.00 | Yes |
4 | $1.50 | No |
5 | $1.00 | No |
ATE with attriters = $2.20
ATE without attriters = $2.83
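The two averages can be reproduced from the five rows in the table above. This is a minimal sketch; the variable names are chosen for the example.

```python
from statistics import mean

# (ID, increase in income, remained in program), copied from the table
rows = [
    (1, 3.00, True),
    (2, 3.50, True),
    (3, 2.00, True),
    (4, 1.50, False),
    (5, 1.00, False),
]

# Average effect across everyone, including the people who left
ate_with_attriters = mean(increase for _, increase, _ in rows)

# Average effect across only the people who remained in the program
ate_without_attriters = mean(increase for _, increase, stayed in rows if stayed)

print(f"ATE with attriters:    ${ate_with_attriters:.2f}")
print(f"ATE without attriters: ${ate_without_attriters:.2f}")
```

The people who left had smaller gains, so dropping them makes the program look better than it was.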
Growth is expected naturally
e.g. programs targeted at childhood development
contend with the fact that children develop on their own too
How to fix
Use a comparison group to remove the trend
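One way to use a comparison group is to subtract its change from the treatment group's change. The sketch below uses made-up average scores and assumes both groups would have grown at the same rate without the program.

```python
# Hypothetical average outcome scores before and after the program
treatment_before, treatment_after = 50.0, 65.0
comparison_before, comparison_after = 50.0, 58.0

# Change in the treatment group = program effect + natural growth
treatment_change = treatment_after - treatment_before

# Change in the comparison group = natural growth only (by assumption)
comparison_change = comparison_after - comparison_before

# Removing the shared trend leaves the estimated program effect
estimated_effect = treatment_change - comparison_change

print(f"Naive pre-post change:      {treatment_change:.1f}")
print(f"Trend in comparison group:  {comparison_change:.1f}")
print(f"Estimated program effect:   {estimated_effect:.1f}")
```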
Patterns in data happen
because of larger global processes
Recessions, cultural shifts, marriage equality
How to fix
Use a comparison group to remove the trend
Patterns in data happen because of
regular time-based trends
How to fix
Compare observations from same time period
or use yearly/monthly averages
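Both fixes can be sketched with a small set of made-up monthly counts, where July is always busier than January.

```python
from statistics import mean

# Hypothetical monthly outcome counts keyed by (year, month)
monthly = {
    (2022, 1): 80, (2022, 7): 120,
    (2023, 1): 90, (2023, 7): 132,
}

# Misleading: comparing across seasons mixes the seasonal swing with the real change
across_seasons = monthly[(2023, 1)] - monthly[(2022, 7)]

# Fix 1: compare the same month in different years
same_month_change = {m: monthly[(2023, m)] - monthly[(2022, m)] for m in (1, 7)}

# Fix 2: compare yearly averages, which smooth over the seasonal pattern
yearly_average = {
    year: mean(count for (y, _), count in monthly.items() if y == year)
    for year in (2022, 2023)
}

print("January 2023 vs. July 2022:", across_seasons)
print("Same-month changes:", same_month_change)
print("Yearly averages:", yearly_average)
```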
Repeated exposure to questions or tasks
will make people improve naturally
How to fix
Change tests, maybe don't offer pre-tests,
use a control group that receives the test
People in the extreme have a tendency to
become less extreme over time
Luck, crime and terrorism, hot hand effect
How to fix
Don't select super high or
super low performers
This isn't because the universe trends toward some average. An extreme value comes from systematic factors and random luck both being extreme at once, which is rare. Luck goes away.
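A small simulation can illustrate the idea. It assumes each score is a stable skill plus independent luck, both drawn from standard normal distributions. The numbers and the seed are arbitrary choices for the sketch.

```python
import random
from statistics import mean

random.seed(1234)   # arbitrary seed so the sketch is reproducible
n = 10_000

# Each person has a stable skill; each test adds fresh, independent luck
skill = [random.gauss(0, 1) for _ in range(n)]
test1 = [s + random.gauss(0, 1) for s in skill]
test2 = [s + random.gauss(0, 1) for s in skill]

# Select the top 10% of performers on the first test
cutoff = sorted(test1)[int(0.9 * n)]
top = [i for i in range(n) if test1[i] >= cutoff]

# Compare the same people's averages on both tests
top_mean_1 = mean(test1[i] for i in top)
top_mean_2 = mean(test2[i] for i in top)

print(f"Top group, first test:  {top_mean_1:.2f}")
print(f"Top group, second test: {top_mean_2:.2f}")
```

The top group still scores above average the second time because their skill is real, but less extremely so, because their good luck on the first test does not carry over.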
Measuring the outcome incorrectly
will bias the effect
How to fix
Measure the outcome well
If the study is too short, the effect might not
be detectable yet; if the study is too long,
attrition becomes a problem
How to fix
Use prior knowledge about the thing
you're studying to choose the right length
Observing people makes them behave differently
How to fix
Hide? Use completely unobserved control groups
Experiments in 1924-1932 at Hawthorne Works
Control group works hard to prove
they're as good as the treatment group
How to fix
Keep two groups separate
Control groups naturally pick up
what the treatment group is getting
Externalities, social interaction, equilibrium effects
How to fix
Keep two groups separate;
use distant control groups
Something happens that affects one of
the groups and not the other
How to fix
🤷‍♂️
Omitted variable bias: selection, attrition
Trends: maturation, secular trends, seasonality, testing, regression
Study calibration: measurement error, time frame
Contamination: Hawthorne, John Henry, spillovers, intervening events
Randomization fixes a host of issues
Selection, maturation, regression to the mean
Randomization doesn't fix everything!
Attrition, contamination, measurement
Are your findings generalizable
to the whole population?
Study volunteers are weird
Western, educated, from industrialized,
rich, and democratic countries
Not everyone takes surveys
Online surveys, Amazon Mechanical Turk, random digit dialing
Does a study in one state
apply to other states?
Does the effect from a mosquito net trial
in Eritrea transfer to Bolivia?