'Planned Replication Analysis Using the 2015 YRBS National Study'
(AsPredicted #3603)


Author(s)
Joseph Cimpian (New York University) - joseph.cimpian@nyu.edu
Pre-registered on
April 2, 2017 | 07:49 AM (PT)

1) Have any data been collected for this study already?
No, no data have been collected for this study yet

2) What's the main question being asked or hypothesis being tested in this study?
***NOTE: This is a pre-registration of the ANALYSIS of a secondary dataset. The data HAVE been collected, but no analyses have been run. The site aspredicted.org does not allow pre-registration (or even printing this form) if "Yes" is selected for the data-collection question (question #1 above); thus, we had to select "No" in order to pre-register the ANALYSIS of this replication study on secondary data.***

This study will replicate, using the 2015 Youth Risk Behavior Survey (YRBS) National dataset, our original findings from the 2015 YRBS State and District dataset. While the 2015 YRBS National data are already collected and publicly available, we have conducted no analyses using these data. We expect the findings from the 2015 YRBS State and District sample to replicate in the National sample.
(1) We predict that likely "mischievous responders" can be identified from 7 screener items (height, asthma, dentist, carrots, potatoes, salad, and fruit).
(2) We expect to see significant reductions in the average LGBQ-heterosexual health disparity for males and females (analyses run separately by sex) once likely mischievous responders are removed.
(3) Finally, we expect the magnitude of the change in the disparity to be predicted by item extremity, with larger changes in disparities for items whose most extreme response option is chosen less frequently.
We found each of the above to be true in the State and District sample, which is why we expect these findings in the National sample.

3) Describe the key dependent variable(s) specifying how they will be measured.
Because the sample size for the National sample is one-tenth the size of the State and District sample, we will only examine AVERAGE patterns for this sample. Just as in the original State and District analysis, the average will be made up of the following 20 outcomes (survey question number in parentheses): Ride with drunk driver (q10); Skipped school (q11); Fought at school (q16); Forced into sex (q20); Partner violence (q21); Partner forced sex (q23); Bullied at school (q24); Felt sad/hopeless (q26); Considered suicide (q27); Planned suicide (q28); Attempted suicide (q29); Smoking days/mo (q33); Alcohol days/mo (q43); Cocaine use (life) (q50); Heroin use (life) (q52); Ecstasy use (life) (q54); Steroids use (life) (q56); No. of sex partners (q63); Phys. activ. past week (q80); TV watching hours/day (q81).

4) How many and which conditions will participants be assigned to?
N/A

5) Specify exactly which analyses you will conduct to examine the main question/hypothesis.
Analyses will be conducted separately by sex. We will fit boosted logistic regressions to predict "LGBQ" status from the 7 screener items (listed above), sampling weight, and stratum. (As we discuss in the paper, in theory LGBQ status should not be predictable from these items; thus, any predictive power is expected to be due to extreme-response patterns among mischievous responders.) After the boosted logistic regressions, a series of disparities will be estimated on the 20 outcomes using stratum fixed-effects regression models with LGBQ as the only predictor, and then aggregated to estimate the average LGBQ-heterosexual disparity. We will estimate the models on the full sample, then remove the top 1% of most likely mischievous responders (based on the boosted regressions) and re-estimate the disparities, and so on until 75% of the data remain. Standard errors will be obtained via 1,999 bootstrap replications of the entire process (including the boosted regressions and disparity estimation), with clustered resampling first at the stratum level and then at the primary sampling unit (PSU) level, to mimic the data-collection sampling design and adjust the standard errors accordingly.
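
The trim-and-re-estimate step described above can be sketched roughly as follows. This is an illustrative sketch only, not the registered analysis code. It assumes that each respondent already has a mischievousness score from the boosted regressions (the hypothetical field "score"), that the outcomes are on a common scale so that they can be averaged, that the regressions are unweighted, and that each percentage removed is a percentage of the original sample. The function names, field names, and toy numbers are all hypothetical. The fixed-effects slope is obtained by demeaning within stratum, which gives the same coefficient as including stratum indicator variables.

```python
from collections import defaultdict


def fe_disparity(rows):
    """Coefficient on LGBQ from a regression with stratum fixed effects.

    rows: iterable of (stratum, lgbq, outcome), with lgbq coded 0/1.
    Demeaning both variables within stratum and then regressing is
    equivalent to including a dummy variable for every stratum.
    """
    groups = defaultdict(list)
    for stratum, lgbq, y in rows:
        groups[stratum].append((lgbq, y))
    sxy = sxx = 0.0
    for members in groups.values():
        mean_x = sum(x for x, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        for x, y in members:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


def trimmed_disparities(records, max_trim_pct=25):
    """Average disparity after removing the top 0%, 1%, ..., max_trim_pct%.

    records: list of dicts with hypothetical keys 'score', 'stratum',
    'lgbq', and 'outcomes' (one value per outcome item).
    ASSUMPTION: percentages refer to the original sample size.
    """
    ranked = sorted(records, key=lambda r: r["score"])  # most suspicious last
    n = len(ranked)
    n_items = len(ranked[0]["outcomes"])
    results = {}
    for pct in range(max_trim_pct + 1):
        keep = ranked[: n - (n * pct) // 100]
        per_item = [
            fe_disparity((r["stratum"], r["lgbq"], r["outcomes"][j]) for r in keep)
            for j in range(n_items)
        ]
        results[pct] = sum(per_item) / n_items
    return results


def _rec(score, stratum, lgbq, y):
    return {"score": score, "stratum": stratum, "lgbq": lgbq, "outcomes": [y]}


# Toy data (invented for illustration): nine ordinary respondents and one
# high-scoring respondent who reports LGBQ status and an extreme outcome.
toy = (
    [_rec(0.1, "a", 0, 1.0), _rec(0.1, "a", 1, 3.0)] * 2
    + [_rec(0.1, "b", 0, 10.0), _rec(0.1, "b", 1, 12.0)] * 2
    + [_rec(0.1, "a", 0, 1.0), _rec(0.9, "a", 1, 23.0)]
)
result = trimmed_disparities(toy, max_trim_pct=10)
```

In this toy example the estimated disparity shrinks once the single high-scoring respondent is dropped, which is the pattern hypothesis (2) anticipates. The bootstrap for standard errors would wrap the whole procedure, including the boosted regressions, and is not shown here.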

6) Any secondary analyses?
We will predict the magnitude of the disparity change via item random-effects models, conditioning on model (i.e., how many observations were removed) and the natural log of the base response rate for the most extreme response option. This is the only item-level analysis we will attempt to replicate, because it is based on the point estimates of the reductions, not the precision of those estimates (which we expect to be greatly reduced in the National sample).
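
To make the item-level predictor concrete, the sketch below computes the natural log of a base response rate and relates it to the change in disparity. It is a deliberate simplification: it uses a pooled least-squares slope rather than the item random-effects model named above, and it ignores the conditioning on how many observations were removed. The function names and all numbers are hypothetical and for illustration only.

```python
import math


def log_base_rate(n_extreme, n_total):
    """Natural log of the share choosing the most extreme response option."""
    return math.log(n_extreme / n_total)


def ols_slope(xs, ys):
    """Simple least-squares slope of ys on xs (pooled; no random effects)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented toy items: (respondents choosing the extreme option, total
# respondents, change in disparity after trimming).
toy_items = [(500, 1000, -0.01), (50, 1000, -0.02), (5, 1000, -0.03)]
log_rates = [log_base_rate(k, n) for k, n, _ in toy_items]
changes = [change for _, _, change in toy_items]
slope = ols_slope(log_rates, changes)
```

A positive slope in this setup means that items with rarer extreme responses (more negative log base rates) show larger reductions in the disparity, which is the direction hypothesis (3) predicts.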

7) How many observations will be collected or what will determine sample size?
No need to justify decision, but be precise about exactly how the number will be determined.

As the data are publicly available from the CDC, they have already been collected. The original State and District final analytic sample size was 146,149. The National final analytic sample will be 14,612, about one-tenth the size of the original study sample.

8) Anything else you would like to pre-register?
(e.g., data exclusions, variables collected for exploratory purposes, unusual analyses planned?)

Because the sample size in this National replication is about one-tenth the size of the State and District sample, this replication has lower statistical power in all respects. Thus, we are not necessarily concerned with replicating the statistical significance of the original study; rather, our objective is to test whether the same PATTERNS of results hold. Because of the lower power, we expect the item-level analyses to be substantially underpowered, and so we only make predictions about the AVERAGE reductions and the PATTERNS among the item-level reductions (as described under the main hypotheses in question #2).
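
As a rough illustration of the loss of precision, the arithmetic below uses the two analytic sample sizes stated in this form. It assumes that standard errors scale with one over the square root of the sample size, which holds for simple random sampling with equal variances; the clustered, stratified YRBS design will not follow this exactly, so the figure is indicative only.

```python
import math

n_original = 146_149    # State and District final analytic sample
n_replication = 14_612  # National final analytic sample

# Share of the original sample size available in the replication.
size_ratio = n_replication / n_original

# ASSUMPTION: SE is proportional to 1 / sqrt(n), so the replication's
# standard errors are larger by the square root of the sample-size ratio.
se_inflation = math.sqrt(n_original / n_replication)

print(f"size ratio: {size_ratio:.3f}, SE inflation: {se_inflation:.2f}x")
```

Under that assumption, standard errors in the National sample would be a little over three times those in the State and District sample, which is why the predictions focus on patterns rather than on statistical significance.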

Version of AsPredicted Questions: 1.05