Reference Collection to push back against "Common Statistical Myths"

ADAlthousePhD · 2019-06-27T12:51:24.065Z

Note: This topic is a wiki, meaning that this main body of the topic can be edited by others. Use the Reply button only to post questions or comments about material contained in the body, or to suggest new statistical myths you’d like to see someone write about.

I am not claiming to be the leading authority on any or all of the things listed below, but several of us on Twitter have repeatedly floated the idea of creating a list of references that may be used to argue against some common statistical myths or no-nos.

By posting in thread format, I will start but others should feel free to chime in. However, it MAY be easier if the first post (or one of the first posts) is continually updated so all references on a particular topic are indexed at the top. I am not sure of the best way to handle this - either I will try to periodically edit the first post to keep the references updated, or maybe Frank will have a better idea for how to structure and update this content.

I was hoping to organize this into a few key myths/topics. While I am happy to add any topic that authors think is important, the intent here is not to recreate an entire statistical textbook. I’m hoping to provide an easy to navigate list of references so when we get one of the classic review comments like “the authors should add p-values to Table 1” we have some rebuttal evidence that’s easy to find.

I’ve listed a few below to start. Please feel free to email, Twitter DM, or comment below. This will be a living ‘document’ so if there’s something you think is missing, or if I cite a paper that you feel has a fatal flaw and does not support its stated purpose, let me know. We’ll see how this goes.

Reference collection on P value and confidence interval myths

PubMed Central (PMC)

Statistical tests, P values, confidence intervals, and power: a guide to...

Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive,...

https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108

https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913

Are confidence intervals better termed “uncertainty intervals”? - PubMed (nih.gov)

Reverse-Bayes analysis of two common misinterpretations of significance tests

https://amstat.tandfonline.com/doi/full/10.1080/00031305.2018.1529625

LWW

Living with P Values: Resurrecting a Bayesian Perspective... : Epidemiology

ull, when they are no such thing. This misuse may be lessened by recognizing correct Bayesian interpretations. For example, under weak priors, 95% confidence intervals approximate 95% posterior probability intervals, one-sided P values approximate...

P-Values in Table 1 of Randomized Trials

Rationale: In RCT’s, it is a common belief that one should always present a table with p-values comparing the baseline characteristics of the randomized treatment groups. This is not a good idea for the following reasons.

Altman DG, Dore CJ. Randomisation and baseline comparison in clinical trials. Lancet 1990; 335: 149-153.
Begg CB. Significance tests of covariate imbalance in clinical trials. Controlled Clin Trials 1990; 11: 223-225.
Senn SJ. Baseline comparisons in randomized clinical trials. Stat Med 1991; 10: 1157-1160
Senn SJ. Testing for baseline balance in clinical trials. Stat Med 1994; 13: 1715-1726.

Covariate Adjustment in RCT

Rationale: Somewhat related to the above, many consumers of randomized trials believe that there is no need for any covariate adjustment in RCT analyses. While it is true that there is no need for a valid RCT, there are benefits to adjusting for baseline covariates that have strong relationships with the study outcome, as explained by the references below. If a reader/reviewer questions why you have chosen to adjust, these may prove helpful.

Canner PL. Covariate adjustment of treatment effects in clinical trials. Controlled Clin Trials 1991; 12: 359-366.
Neuhaus JM. Estimation Efficiency with Omitted Covariates in Generalized Linear Models. J Am Stat Assoc 1998; 93: 1124-1129.
Hauck WW, Anderson S, Marcus SM. Should We Adjust for Covariates in Nonlinear Regression Analyses of Randomized Trials? Controlled Clin Trials 1998; 19: 249-256.
Steyerberg EW, Bossuyt PMM, Lee KL. Clinical trials in acute myocardial infarction: should we adjust for baseline characteristics? Am Heart J 2000; 139(5): 745-751.
Hernandez AV, Steyerberg EW, Habbema JDF. Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical power and reduces sample size requirements. J Clin Epi 2004; 57(5): 454-460.
Hernandez AV, Eijkemans MJC, Steyerberg EW. Randomized controlled trials with time-to-event outcomes: How much does prespecified covariate adjustment increase power? Ann Epi 2006; 16(1): 41-48.
Gray LJ, Bath P, Collier T. Should stroke trials adjust for functional outcome for baseline prognostic factors? Stroke 2009; 40: 888-894.
Kent DM, Trikalinos TA, Hill MD. Are unadjusted analyses of clinical trials inappropriately biased toward the null? Stroke 2009; 40(3): 672-673.
Lingsma H, Roozenbeek B, Steyerberg E. Covariate adjustment increases statistical power in randomized controlled trials. J Clin Epi 2010; 63(12): 1391.
Groenwold RHH, Moons KGN, Peelen LM, Knol MJ, Hoes AW. Reporting of treatment effects from randomized trials: A plea for multivariable risk ratios. Contemp Clin Trials 2011; 32(3): 399-402.
Ciolino JD, Martin RH, Zhao W, Jauch EC, Hill MD, Palesch YY. Covariate imbalance and adjustment for logistic regression analysis of clinical trial data. J Biopharm Stat 2013; 23(6): 1383-1402.

Analyzing “Change” Measures in RCT’s

Rationale: Many authors and pharmaceutical clinical trialists make the mistake of analyzing change from baseline instead of making the raw follow-up measurements the primary outcomes, covariate-adjusted for baseline. To compute change scores requires many assumptions to hold (for more detail, see Frank’s blog post on this: Statistical Thinking - Statistical Errors in the Medical Literature). It is generally better to analyze the “follow up” measurement as the outcome with a covariate adjustment for the baseline value, as this seems to better match the question of interest: for two patients with the same pre-trial value of the study outcome, one given treatment A and the other treatment B, will the patients tend to have different post-treatment values?

Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow up measurements. BMJ 2001; 323: 1123.
Vickers AJ. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol. 2001;1:6. doi: 10.1186/1471-2288-1-6. Epub 2001 Jun 28. PMID: 11459516; PMCID: PMC34605.
Bland JM, Altman DG. Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials. 2011 Dec 22;12:264. doi: 10.1186/1745-6215-12-264. PMID: 22192231; PMCID: PMC3286439.
Bland JM, Altman DG. Best (but oft forgotten) practices: testing for treatment effects in randomized trials by separate analyses of changes from baseline in each group is a misleading approach. Am J Clin Nutr. 2015 Nov;102(5):991-4. doi: 10.3945/ajcn.115.119768. Epub 2015 Sep 9. PMID: 26354536.

Using Within-Group Tests in Parallel-Group Randomized Trials

Rationale: Researchers often analyze randomized trials and other comparative studies by separate analysis of changes from baseline in each parallel group. Sometimes, they will incorrectly conclude that their study proves that a treatment effect exists if there is a “significant” p-value for the within-group test for the treatment group, although this ignores the presence of the control group (what’s the purpose of having the control group if you’re not going to compare the treated group against the control group?)

Bland JM, Altman DG. Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials 2011; 12: 264.
Bland JM, Altman DG. Best (but oft forgotten) practices: testing for treatment effects in randomized trials by separate analyses of changes from baseline in each group is a misleading approach. Am J Clin Nutr 2015; 102(5); 991-994.

Sample Size / Number of Variables for Regression Models

Rationale: It is common to see regression models with far too many variables included relative to the amount of data (as a reviewer, I’ll see papers that report a “risk score” that includes 20+ variables in a logistic regression model with ~200 patients and ~30 outcome events). A commonly cited rule of thumb is “10 events per variable” in logistic regression, but in fact the specific number is more complex than that, though it may function as a useful “BS test” at a first glance.

Courvoisier DS, Combescure C, Agoritsas T, Gayet-Ageron A, Perneger TV. Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure. J Clin Epi 2011; 64(9): 993-1000.
van Smeden M, de Groot JA, Moons KG, Collins GS, Altman DG, Eijkemans MJ, Reitsma JB. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Medical Research Methodology 2016; 16(1): 163.
Ogundimu EO, Altman DG, Collins GS. Adequate sample size for developing prediction models is not simply related to events per variable. J Clin Epi 2016; 76:175-82.
van Smeden M, Moons KG, de Groot JA, Collins GS, Altman DG, Eijkemans MJ, Reitsma JB. Sample size for binary logistic prediction models: beyond events per variable criteria. Stat Methods Med Res 2018 (epub).
Riley RD, Snell H, Ensor J, Burke DL, Harrell FE, Moons KG, Collins GS. Minimum sample size for developing a multivariable prediction model: PART II – binary and time-to-event outcomes. Stat Med 2019; 38(7): 1276-1296.

Stepwise Variable Selection (Don’t Do It!)

Rationale: Though stepwise selection procedures are taught in many introductory statistics courses as a way to make multivariable modeling easy and data-driven, statisticians generally dislike it for several reasons, many of which are explained in the reference below:

Smith G. Step away from stepwise. Journal of Big Data 2018; 5: 32

Medium – 11 Dec 18

Stopping stepwise: Why stepwise selection is bad and what you should use instead

This is crossposted from my statistics site: www.StatisticalAnalysisConsulting.com

Reading time: 13 min read

Screening covariates to include in multivariable models with bivariable tests

Rationale: People sometimes decided to include variables in multivariable models only if they are “significant” predictors of the outcome when included in the model by themselves (i.e. they are crudely associated with the outcome). This is a bad idea, partly due to the same reasons as stepwise regression (it is essentially a variant of stepwise regression done manually), partly due to the fact that it neglects the multivariate structure, and a variable’s effect on the outcome might be different when viewed in isolation and when several variables are considered simultaneously.

Sun GW, Shook TL, Kay GL. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. Journal of Clinical Epidemiology. 1996. 49, 8:907-16
Greenland S. Modeling and variable selection in epidemiologic analysis. American Journal of Public Health. 1989. 79, 3:340-9.
Ferenci T. Variable selection should be blinded to the outcome. International Journal of Epidemiology. 2017. 46, 3:1077-1079.

Post-Hoc Power (Is Not Really A Thing)

Rationale: In studies that fail to yield “statistically significant” results, it is common for reviewers, or even editors, to ask the authors to include a post hoc power calculation. In such situations, editors would like to distinguish between true negatives and false negatives (concluding there is no effect, when there actually is an effect, and the study was just too small to pick it up). However, reporting post-hoc power is nothing more than reporting the p-value a different way, and will therefore not answer the question editors want to know.

Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician 2001; 55 (PDF link)
Lenth RV. Post Hoc Power: Tables and Commentary. (PDF link)
Goodman SN, Berlin JA (1994) The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 121:200-206
Althouse AD (2021) Post Hoc Power: Not Empowering, Just Misleading. Journal of Surgical Research . 259:A3-A6

Misunderstood “Normality” Assumptions

Rationale: One of the pieces of information that many folks who have taken an introductory statistics class retain is that the Normal distribution is basically everything, and they often assume that the data need to be normally distributed for ALL statistical procedures to work correctly. However, in many of the procedures and tests that we use, the normality of the error terms (or residuals) matters, not the normality of the data points themselves.

psychometroscar – 12 Jul 18

Normality: residuals or dependent variable?

So… something interesting happened the other day. As part of an unrelated set of circumstances, my super awesome BFF Ed and I were discussing one of those interesting, perennial misconceptions with…

Absence of Evidence is Not Evidence of Absence

Rationale: It’s also common to break results into “significant” (p<0.05) and “not significant” (p>0.05); when the latter occurs, many interpret the phrase “no significant effect” as evidence that there is no effect, when this is not really true (thanks to @davidcnorrismd for adding another reference below).

Altman DG, Bland JM. Statistics Notes: Absence of Evidence is Not Evidence of Absence. BMJ 1995; 311; 485.
Braithwaite R. EBM’s six dangerous words. JAMA . 2013;310(20):2149-2150.
Gelman A, Stern H. The Difference Between “Significant” and “Not Significant” is Not Itself Statistically Significant. The American Statistician 2012.

Inappropriately Splitting Continuous Variables Into Categorical Ones

Rationale: People often choose to split a continuous variable into dichotomized groups or a few bins (e.g., using quartiles to divide the data into four groups, then comparing the highest versus the lowest quartile). There are select and limited reasons why one may choose to partition continuous variables into categories, but more often than not this is a bad idea and done simply because it’s believed to be “easier” to perform or understand.

Naggara, O. et al. Analysis by categorizing or dichotomizing continuous variables is inadvisable: an example from the natural history of unruptured aneurysms. American Journal of Neuroradiology 32, 437–40 (2011).
Royston, P., Altman, D. & Sauerbrei, W. Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine 25, 127–41 (2006).
Dawson, N. & Weiss, R. Dichotomizing continuous variables in statistical analysis: a practice to avoid. Medical decision making : an international journal of the Society for Medical Decision Making 32, 225–6 (2012).
Altman, D. Problems in dichotomizing continuous variables. American Journal of Epidemiology 139, 442–5 (1994).
Thoresen, M. Spurious interaction as a result of categorization. BMC Medical Research Methodology 19, 28 (2019).
Altman, D. & Royston, P. The cost of dichotomising continuous variables. BMJ (Clinical Research Ed.) 332, 1080 (2006).

Use of normality tests before t tests

Rationale: This is commonly recommended to researchers who are not statisticians.

Example Citation: Ghasemi A, Zahediasl S. Normality tests for statistical analysis: a guide for non-statisticians. Int J Endocrinol Metab . 2012;10(2):486–489. doi:10.5812/ijem.3505

The assumption of normality needs to be checked for many statistical procedures, namely parametric tests, because their validity depends on it.

Problem: The nominal size and power of the unconditional t test is changed with the combined procedure in unknown ways.

Rasch D, Kubinger KD, Moder K (2011): The two-sample t test: pre-testing its assumptions does not pay off. Statistical Papers 52(1): 219-231
Rochon J, Kieser M (2010): A closer look at the effect of preliminary goodness‐of‐fit testing for normality for the one‐sample t ‐test. Br J Math Stat Psychol 64: 410-426
Rochon J, Gondan M, Kieser M (2012): To test or not to test: Preliminary assessment of normality when comparing two independent samples. BMC Med Res Methodol 12: 81
Schoder V, Himmelmann A, Wilhelm KP (2006): Preliminary testing for normality: some statistical aspects of a common concept. Clin Exp Dermatol 31: 757-761

I² in meta-analysis doesn’t refer to an absolute measure of heterogeneity

Rationale: When reporting heterogeneity results in a meta-analysis the value of I² is often misinterpreted and treated like an absolute measure of heterogeneity when in fact is not.

Borenstein M, Higgins JPT, Hedges LV, Rothstein HR (2017) Basics of meta-analysis: I 2 is not an absolute measure of heterogeneity. Res Syn Meth 8:5–18.

Number Needed to Treat (NNT)

Andrade C (2015) The numbers needed to treat and harm (NNT, NNH) statistics: what they tell us and what they do not. The Journal of Clinical Psychiatry 76:e330-3
Citrome L, Ketter T (2013) When does a difference make a difference? Interpretation of number needed to treat, number needed to harm, and likelihood to be helped or harmed. International journal of clinical practice 67:407–11
Grieve, A.P. (2003). The number needed to treat: a useful clinical measure or a case of the Emperor’s new clothes? Pharmaceutical Statistics 2, 87–102.

Propensity-Score Matching - Not Always As Good As It Seems

Rationale: Conventional covariates adjustment is enough for most cases with adequate sample size, and propensity-score matching is not necessarily superior.

Elze MC, et al. (2017). Comparison of Propensity Score Methods and Covariate Adjustment: Evaluation in 4 Cardiovascular Studies. Journal of American College of Cardiology 69(3):345-357.
Gary King on “Why Propensity Scores Should Not Be Used for Matching” (YouTube video)
Brooks JM, Ohsfeldt RL. (2013) Squeezing the balloon: propensity scores and unmeasured covariate balance. Health Services Research. 48(4):1487-507.
Ali MS, Groenwold RH, Klungel OH. (2014) Propensity score methods and unobserved covariate imbalance: comments on “squeezing the balloon”. Health Services Research. 49(3):1074-82.

Responder Analysis

Rationale: In some cases, authors attempt to dichotomize a continuous primary efficacy measure into “responders” and “non-responders." This is discussed at length in another thread on this forum, but here are some scholarly references:

Snapinn SM, Jiang Q. Responder analyses and the assessment of a clinically relevant treatment effect. Trials 2007.

Significance testing in pilot studies

Rationale: Authors often perform null-hypothesis testing in pilot studies and report p-values. However, the purpose of pilot studies is to identify issues in all aspects of the study ranging from recruitment to data management and analysis. Pilot studies are not usually powered for inferential testing. If testing is done, p-values should not be emphasized and confidence intervals should be reported. Results on potential outcomes should be regarded as descriptive. A CONSORT extension for pilot and feasibility studies exist, and is a useful reference to include in submissions and cover letters. Editors may not be aware of this extension of CONSORT.

Eldridge SM et al. CONSORT 2010 statement: extension to randomised pilot and feasibility trials, BMJ 2016
Moore CG et al. Recommendations for Planning Pilot Studies in Clinical and Translational Research, Clin Transl Sci 2011

P-values do not “trend towards significance”

Rationale: It is common for investigators to observe a “non-significant” result and say things like the result was “trending towards significance”, suggesting that had they only been able to collect more data, surely the result would have been significant. This misunderstands the volatility of p-values when there is no effect of the treatment under test. Simply put, p-values don’t “trend”, and “almost significant” results are not guaranteed to become significant with more data - far from it.

Wood J et al. Trap of trends to statistical significance: likelihood of near significant P value becoming more significant with extra data, BMJ 2014
Hankins MC. Still Not Significant

Additional Requested Topics

Feel free to add your own suggestion here, we are happy to revisit and update whenever practical.

MaartenvSmeden · 2019-06-27T13:05:10.613Z

Really good initiative Andrew. Sorry for the shameless self-promotion but I do have a thread with some misconceptions and references that might be helpful for getting this list together

twitter.com

Maarten van Smeden (MaartenvSmeden)

Let's try something new: *7 days, 7 statistical misconceptions*. Over the next 7 days I'll post about 1 statistical misconception a day. Curious to hear your favorite #statsmisconceptions, so feel free to add you own to this thread

8:47 PM - 14 Oct 2018 650 260

f2harrell · 2019-06-27T18:06:00.531Z

Wow this is incredible Andrew. In a moment I’m goint to make my first attempt to converting a topic to a wiki topic that anyone can edit. Let’s see if that’s a good approach for growing this resource which you so nicely started.

Update: it’s now a wiki. Apparently you click on a small orange pencil symbol inside a small orange box to edit the topic. Then you’ll see another option to Edit Wiki. Perhaps others will reply here with more pointeres.

f2harrell · 2019-06-27T22:45:00.642Z

R_cubed · 2019-06-28T15:13:44.788Z

Would this be the appropriate thread to add references on the issues related to using parametric assumptions on ordinal data? This has always bothered my mathematical conscience.

Prof. Harrell had posted a great link to a recent paper in another thread:

Responder Analysis: Loser x 4 measurement

Much has been written on that. It might go all the way from a minor to a major breach. Why take the chance when we have wonderful ordinal regression models that make no assumptions about inter-category spacings?

A draft copy an be found here (I assume it is OK to post a link to the draft):

Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2692323

lobrowR · 2019-06-28T17:27:38.521Z

What is the scope of material that should be contributed? The theme of this site in general and this list so far is med stats / clinical trials / epi. Would materials from other disciplines be appreciated, or would they be redundant or off topic; e.g., there are publications on post hoc power in subfields of biology (ecology, evolution, and animal behavior) that present the issues from that field’s perspective, but have nothing to do with the core themes of this site.

ADAlthousePhD · 2019-06-28T17:37:38.667Z

lobrowR:

What is the scope of material that should be contributed? The theme of this site in general and this list so far is med stats / clinical trials / epi. Would materials from other disciplines be appreciated, or would they be redundant or off topic; e.g., there are publications on post hoc power in subfields of biology (ecology, evolution, and animal behavior) that present the issues from that field’s perspective, but have nothing to do with the core themes of this site.

I’ll leave that for Frank to answer, as this website is his brainchild. I don’t think it’s unreasonable to present a paper from a different field that still addresses a core statistical topic (the examples you cite are good ones). Here is the site description from the home page, so perhaps this can be our guide, though of course there is some judgement in what exactly fits into this:

This is a place where statisticians, epidemiologists, informaticists, machine learning practitioners, and other research methodologists communicate with themselves and with clinical, translational, and health services researchers to discuss issues related to data: research methods, quantitative methods, study design, measurement, statistical analysis, interpretation of data and statistical results, clinical trials, journal articles, statistical graphics, causal inference, medical decision making, and more.

ivivek87 · 2019-06-28T19:13:25.798Z

Amazing initiative. It will be really great to breakdown the misconception myths that have been been plaguing the field for a while, specially in applied contexts. Thanks for the effort of putting this wiki page up with references!

Elias_Eythorsson · 2019-06-29T12:33:30.319Z

What are your thoughts on also including informative and well written blogs and shiny apps?

ADAlthousePhD · 2019-06-29T15:07:34.171Z

I’ve got no problem with it. I was initially trying to prioritize scientific publications if only because when you’re appealing to an editor they might be more inclined to take that seriously versus a blog (fairly or not…) but there are certainly some excellent blog posts that may be useful here as well.

Stephen_Olivier · 2019-07-02T12:21:52.914Z

Would Gelman and Stern’s point in this paper be appropriate to add to the list?

The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant
https://www.tandfonline.com/doi/abs/10.1198/000313006X152649

Georgeta · 2019-07-03T00:03:03.769Z

This is a great resource. Thank you.

2 other prevalent myths come to mind:

Matching - or not all that has intuitive appeal is actually good (or worthy)
The unbearable lightness of NNTs

ADAlthousePhD · 2019-07-03T12:14:26.698Z

Some excellent suggestions in the last few posts - please feel free to add your favorites on those (after all, I want this to be crowd-sourced, not just my favorites!) I’ll also try to add a few when I get a chance.

Bajun · 2019-07-04T12:36:49.955Z

Great initiative. Look forward to seeing this build.

I have some quibbles about the advice on covariate adjustment in RCTs. A classic paper is Pocock et al, Subgroup analysis and other (mis)uses of baseline data in clinical trials. However, things have moved on a little since then. But I’ll start from the beginning.

The reason you do not adjust an RCT is because you have randomised. If you have not introduced any bias through the conduct of the trial, then any difference between the groups must be due to a treatment effect or chance. The p-value accounts for those instances where you got unlucky. If you’re going to start adjusting, why randomise?

It is fine, of course, to do exploratory analysis beyond the primary endpoint but emphasis should always be on the unadjusted result, because you randomised. You have the luxury of actually random samples, you don’t need to resort to tricks from other designs.

The big problem with adjusting results is the sheer scope for finding models that give you the answer you want. Measure two dozen baseline characteristics, chuck them all in and see what comes out in the wash (see also: subgroups, pre-specification of as an unreliable marker for biological plausibility).

Pre-specification could help, but why would you pre-specify a baseline imbalance? If you knew it was an important prognostic factor, why didn’t you stratify the randomisation? Why did you leave yourself scope to cheat when you could have designed out the problem?

And that brings us onto a more recent development on covariate adjustment. The key to an RCT is to “analyse as randomised” (see also: intention-to-treat). So if you stratified the randomisation, you should stratify the analysis by exactly the same factors. The unadjusted p-value is accounting for a lot of possible outcomes that you made impossible by design. So you are fully entitled to take account of that (but it may be wise to report the unadjusted results also, and/or stick in a reference for the statto reviewer).

There’s a nice empirical reference for that last point: Reporting and analysis of trials using stratified randomisation in leading medical journals: review and reanalysis

And the EMA guidline (broken because I can’t post 3 links HTTPS&c–www.ema.europa.eu/en/documents/scientific-guideline/guideline-adjustment-baseline-covariates-clinical-trials_en.pdf) may also be of interest (aimed at Pharma and device manufacturers).

Thanks again for this. It’s a really useful initiative.

f2harrell · 2019-07-04T12:53:42.465Z

I have to strongly disagree with that. Much has been written about this. Briefly, you have to covariate adjust in RCTs to make the most out of the data, i.e., to get the best power and precision. It’s all about explaining explainable outcome heterogeneity, and nothing to do with balance. And concerning stratification, Stephen Senn has shown repeatedly that the correct philosophy is to pose the model and then randomize consistent with that, not the other way around as you have suggested.

R_cubed · 2019-07-06T01:30:00.350Z

Does anyone think a section on using normality tests before doing a t-test is needed? I see it frequently in the rehabilitation literature.

Example:

ncbi.nlm.nih.gov

How do we stand? Variations during repeated standing phases of asymptomatic subjects and low back pain patients.

H Schmidt, M Bashkuev, J Weerts, F Graichen, J Altenscheidt, C Maier and S Reitmaier, Journal of biomechanics, 21 2018 03

An irreproducible standing posture can lead to mis-interpretation of radiological measurements, wrong diagnoses and possibly unnecessary treatment. This study aimed to evaluate the differences in lumbar lordosis and sacrum orientation in six repetitive upright standing postures of 353 asymptomatic subjects (including 332 non-athletes and 21 athletes - soccer players) and 83 low back pain (LBP) patients using a non-invasive back-shape measurement device. In the standing position, all investigated cohorts displayed a large inter-subject variability in sacrum orientation (∼40°) and lumbar lordosis (∼53°). In the asymptomatic cohort (non-athletes), 51% of the subjects showed variations in lumbar lordosis of 10-20% in six repeated standing phases and 29% showed variations of even more than 20%. In the sacrum orientation, 53% of all asymptomatic subjects revealed variations of >20% and 31% of even more than 30%. It can be concluded that standing is highly individual and poorly reproducible. The reproducibility was independent of age, gender, body height and weight. LBP patients and athletes showed a similar variability as the asymptomatic cohort. The number of standing phases performed showed no positive effect on the reproducibility. Therefore, the variability in standing is not predictable but random, and thus does not reflect an individual specific behavioral pattern which can be reduced, for example, by repeated standing phases.

I looked up the paper and this is what they did:

Blockquote
The Kolmogorov-Smirnov-Lilliefors test was applied to evaluate the normal distribution for each
investigated group. To detect the presence of outliers, the Grubb’s test was performed. Levene’s test was used to test for variance homogeneity. For normally distributed data and variance homogeneity, Student’s t-test was applied to access gender-specific differences and a one-way analysis of variance (ANOVA) followed by post hoc Scheffé’s test to analyze differences between cohorts. The subjects were later grouped based on the variability in the sacrum orientation and lumbar lordosis during different standing phases. Due to small size of the individual sub-groups, the non-parametric Friedman test was performed to assess the differences between repeated measurements in the subgroups, followed by post hoc Nemenyi test. Additionally, a regression analyses was applied and the coefficient of determination (R2) was calculated. P-values of <0.05 were considered statistically significant. The statistical analyzes were performed with R 3.2.5 (R-Core-Team, 2016).

In defense of the authors – their hypothesis was that there would be greater variance in the low back pain group vs asymptomatic participants, so some of these methods were understandable.

Their study found large amount of variability in sacral orientation and lordotic curvature.

I thought the following stack exchange threads were appropriate:

stats.stackexchange.com

Normality test before testing the difference between two groups. Is it necessary?

hypothesis-testing, normality-assumption

asked by juanmeque on 06:43PM - 22 Nov 16 UTC

stats.stackexchange.com

Is normality testing 'essentially useless'?

hypothesis-testing, normality-assumption, philosophical

asked by shabbychef on 05:47PM - 08 Sep 10 UTC

Anyone have more scholarly references?

COOLSerdash · 2019-07-06T11:06:39.822Z

I’m aware of at least ~~three~~ four papers on this topic:

Rasch D, Kubinger KD, Moder K (2011): The two-sample t test: pre-testing its assumptions does not pay off. Statistical Papers 52(1): 219-231 (link)
Rochon J, Kieser M (2010): A closer look at the effect of preliminary goodness‐of‐fit testing for normality for the one‐sample t‐test. Br J Math Stat Psychol 64: 410-426 (link)
Rochon J, Gondan M, Kieser M (2012): To test or not to test: Preliminary assessment of normality when comparing two independent samples. BMC Med Res Methodol 12: 81 (link)
Schoder V, Himmelmann A, Wilhelm KP (2006): Preliminary testing for normality: some statistical aspects of a common concept. Clin Exp Dermatol 31: 757-761 (link)

SameeraDaniels · 2019-07-06T16:32:26.798Z

Great thread. Thanks for listing the attendant articles.

ADAlthousePhD · 2019-07-08T11:54:00.347Z

R_cubed:

Does anyone think a section on using normality tests before doing a t-test is needed?

I think this would be a fine addition; though I think it is somewhat related to a topic already listed, Misunderstood “Normality” Assumptions. You and @COOLSerdash should feel free to edit and add things to this section, including the “Rationale” at the top as well as the references. As noted in a reply above, while scholarly references are preferred due to the goal of this resource, well-written blog posts also are welcome as they may provide additional useful ammunition for authors in their efforts to reply to reviewers and/or editors.

Mike_Babyak · 2019-07-09T16:32:20.477Z

IIRC, Box (or maybe Rozeboom?) said testing for some of these assumptions was like putting a rowboat out on the ocean to see if it’s calm enough for the Queen Mary.

557	Randomisation and baseline comparisons in clinical trials. - PubMed - NCBI nih.gov
516	Step away from stepwise \| Journal of Big Data \| Full Text springeropen.com
491	Statistical tests, P values, confidence intervals, and power: a guide to misinterpret... nih.gov
482	https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-wh...
391	Maarten van Smeden on Twitter: "Let's try something new: *7 days, 7 statistical misco... twitter.com

Reference Collection to push back against “Common Statistical Myths”

Reference collection on P value and confidence interval myths

P-Values in Table 1 of Randomized Trials

Covariate Adjustment in RCT

Analyzing “Change” Measures in RCT’s

Using Within-Group Tests in Parallel-Group Randomized Trials

Sample Size / Number of Variables for Regression Models

Stepwise Variable Selection (Don’t Do It!)

Screening covariates to include in multivariable models with bivariable tests

Post-Hoc Power (Is Not Really A Thing)

Misunderstood “Normality” Assumptions

Absence of Evidence is Not Evidence of Absence

Inappropriately Splitting Continuous Variables Into Categorical Ones

Use of normality tests before t tests

I² in meta-analysis doesn’t refer to an absolute measure of heterogeneity

Number Needed to Treat (NNT)

Propensity-Score Matching - Not Always As Good As It Seems

Responder Analysis

Significance testing in pilot studies

P-values do not “trend towards significance”

Additional Requested Topics

created

last reply

replies

views

users

likes

links

Frequent Posters

Popular Links

Reference collection on P value and confidence interval myths

P-Values in Table 1 of Randomized Trials

Covariate Adjustment in RCT

Analyzing “Change” Measures in RCT’s

Using Within-Group Tests in Parallel-Group Randomized Trials

Sample Size / Number of Variables for Regression Models

Stepwise Variable Selection (Don’t Do It!)

Screening covariates to include in multivariable models with bivariable tests

Post-Hoc Power (Is Not Really A Thing)

Misunderstood “Normality” Assumptions

Absence of Evidence is Not Evidence of Absence

Inappropriately Splitting Continuous Variables Into Categorical Ones

Use of normality tests before t tests

I2 in meta-analysis doesn’t refer to an absolute measure of heterogeneity

Number Needed to Treat (NNT)

Propensity-Score Matching - Not Always As Good As It Seems

Responder Analysis

Significance testing in pilot studies

P-values do not “trend towards significance”

Additional Requested Topics

created

last reply

replies

views

users

likes

links

Frequent Posters

Popular Links

I² in meta-analysis doesn’t refer to an absolute measure of heterogeneity