WHIPPING BOYS AND WITCH HUNTERS (ii)


At least as apt today as 3 years ago… HAPPY HALLOWEEN! Memory Lane, with new comments in blue.

In an earlier post I alleged that frequentist hypothesis tests often serve as whipping boys, by which I meant “scapegoats”, for the well-known misuses, abuses, and flagrant misinterpretations of tests (both simple Fisherian significance tests and Neyman-Pearson tests, although in different ways)—as well as for what really boils down to a field’s weaknesses in modeling, theorizing, experimentation, and data collection.  Checking the history of this term, however, there is a certain disanalogy with at least the original meaning of a “whipping boy,” namely, an innocent boy who was punished when a medieval prince misbehaved and was in need of discipline.  It was thought that seeing an innocent companion, often a friend, beaten for his own transgressions would be an effective way to ensure the prince would not repeat the same mistake. But significance test floggings, rather than serving as a tool for humbled self-improvement and a commitment to avoiding flagrant rule violations, have tended instead to yield declarations that it is the rules that are invalid! The violators are excused as not being able to help it! The situation is more akin to witch hunting, which in some places became an occupation in its own right.

Now some early literature, e.g., Morrison and Henkel’s The Significance Test Controversy (1970), performed an important service over fifty years ago.  They alerted social scientists to the fallacies of significance tests: conflating a statistically significant difference with one of substantive importance, interpreting insignificant results as evidence for the null hypothesis—especially problematic with insensitive tests—and the like. Chastising social scientists for applying significance tests in slavish and unthinking ways, the contributors call attention to a cluster of pitfalls and fallacies of testing.

The volume describes research studies conducted for the sole purpose of revealing these flaws. Rosenthal and Gaito (1963) document how it is not rare for scientists to mistakenly regard a statistically significant difference, say at level .05, as indicating a greater discrepancy from the null when arising from a large sample size rather than a smaller sample size—even though a correct interpretation of tests indicates the reverse. By and large, these critics are not espousing a Bayesian line but rather see themselves as offering “reforms”: e.g., supplementing simple significance tests with power (e.g., Jacob Cohen’s “power analytic” movement) and, most especially, replacing tests with confidence interval estimates of the size of discrepancy (from the null) indicated by the data.  (Of course, the use of power is central for (frequentist) Neyman-Pearson tests, and (frequentist) confidence interval estimation even has a duality with hypothesis tests!)  But see reforming the reformers on CIs: “Anything tests can do CIs do better” and a follow-up.

But rather than taking on a temporary job of pointing up some understandable fallacies in the use of newly adopted statistical tools by social scientific practitioners, or leading by example with right-headed statistical analyses, the New Reformers have seemed to settle into a permanent career of showing the same fallacies.  Yes, they advocate “alternative” methods, e.g., “effect size” analysis, power analysis, confidence intervals, meta-analysis.  But never having adequately unearthed the essential reasoning and rationale of significance tests—admittedly something that goes beyond many typical expositions—their supplements and reforms often betray the same confusions and pitfalls that underlie the methods they seek to supplement or replace! (I will give readers a chance to demonstrate this in later posts.)

I think we all reject the highly lampooned, recipe-like uses of significance tests; I and others insist on interpreting tests to reflect the extent of discrepancy indicated or not (as far back as when I was writing my doctoral dissertation!).  I never imagined that hypothesis tests (of all stripes) would continue to be flogged again and again, in the same ways!

Frustrated with the limited progress in psychology, apparently inconsistent results, and lack of replication, critics blame an imagined malign conspiracy of significance tests. Traditional reliance on statistical significance testing, we hear,

“has a debilitating effect on the general research effort to develop cumulative theoretical knowledge and understanding. However, it is also important to note that it destroys the usefulness of psychological research as a means for solving practical problems in society” (Schmidt 1996, 122).

Meta-analysis was to be the cure that would finally provide cumulative knowledge to psychology:

“It means that the behavioral and social sciences can attain the status of true sciences: they are not doomed forever to the status of quasi-sciences or pseudoscience. … [T]he gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting” (Schmidt, p.123).

Unsurprisingly, meta-analysis was no panacea.

Lest enthusiasm for revisiting the same cluster of elementary fallacies of tests begin to lose steam, the warnings of the dangers posed become ever shriller: just as the witch is scapegoated for whatever ails a community, the significance test is portrayed as so powerful as to be responsible for blocking scientific progress. To keep the gig alive, a certain level of breathless hysteria is common: “statistical significance is hurting people, indeed killing them” (Ziliak and McCloskey 2008, 186); significance testers are members of a “cult” led by R.A. Fisher, whom they call “The Wasp”. (See post on NHST Task Forces.)

[Image: normal curve pumpkin]

What’s really spooky is that the “reformers” are often the ones in need of reforms when it comes to interpreting tests correctly. For example, Ziliak and McCloskey claim: 

“If the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct” (Z & M, pp. 132-3).

But this is not so.  Perhaps they are slipping into the cardinal error of mistaking the power of a test to detect alternative H’ for a posterior probability in H’, given a rejection of the null.

(11/1/15) It’s not just that a posterior probability in H’ isn’t warranted from a rejection of the null; the problem is actually more serious. Say that test T rejects H0 when the test statistic S exceeds a cut-off S*, that is, whenever S > S*. Suppose the observed outcome S0 just rejects the null, i.e., S0 = S*. If we’re given that the power to detect H’ is high, then this is poor evidence for alternative H’. That’s because if H’ were the case, a larger value of S (than was observed) would have occurred with high probability: Pr(S > S*; H’) is high. That’s what high power to detect H’ means. See recent posts on power howlers [i].
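To make this concrete, here is a minimal sketch (my illustration, not from the post) using a one-sided Normal (z) test; the sample size, sigma, and the alternative mu1 are hypothetical choices. It simply computes Pr(S > S*; H’), the power against H’, and shows that when that power is high, an outcome that barely makes the cut-off S* is far below what H’ would almost always produce, and so is poor evidence for H’:

```python
# A minimal sketch (hypothetical numbers) of the point above: a one-sided z-test
# of H0: mu = 0 vs. H': mu = mu1, with known sigma and sample size n.
from scipy.stats import norm

alpha = 0.05
n, sigma = 100, 1.0
mu1 = 0.5                                # hypothetical alternative H': mu = 0.5

s_star = norm.ppf(1 - alpha)             # cut-off S* for the standardized statistic S
shift = mu1 * n**0.5 / sigma             # mean of S under H'
power = 1 - norm.cdf(s_star - shift)     # Pr(S > S*; H'): the power against H'

print(f"power against H': {power:.4f}")  # ~0.9996 -- very high
# An observed S0 = S* (a just-significant result) is far below what H' would
# almost always yield, so it is poor evidence for a discrepancy as large as mu1.
print(f"Pr(S > S0; H') = {power:.4f} when S0 = S*")
```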

Other problematic assertions:

“If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low” (Z & M, p. 133).

No, higher power = lower hurdle. For some reason, they keep saying Fisher is guilty of requiring very low hurdles (for rejection) because his tests have low power, but it is high power that translates into low hurdles (see the numerical sketch after Spanos’s comment below).

“What is relevant here for the statistical case is that refutations of the null are trivially easy to achieve if power is low enough or the sample size is large enough” (Z & M, p. 152).

As Aris Spanos comments, “their two instances of ‘easy rejection’ separated by ‘or’ contradict each other! Rejections of the null are not easy to achieve when the power is ‘low enough’. They are more difficult exactly because the test does not have adequate power (generic capacity) to detect discrepancies from the null; that stems from the very definition of power and optimal tests. [Their second claim] is correct for the wrong reason. Rejections are easy to achieve when the sample size n is large enough due to high not low power. This is because the power of a ‘decent’ (consistent) frequentist test increases monotonically with n!” (See Spanos’ review of Z & M, Aris Spanos 2008.)
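To see both points numerically, here is a minimal sketch (my illustration, not from Spanos’s review) for a one-sided z-test of H0: mu = 0 at level .05; sigma, the alternative mu1, and the sample sizes are hypothetical choices. As n grows, the power against mu1 rises monotonically while the hurdle (the smallest observed mean that rejects) falls:

```python
# A minimal sketch (hypothetical numbers): raising the sample size n raises the
# power of a one-sided z-test and, at the same time, lowers the hurdle -- the
# smallest observed mean x_bar that yields a rejection of H0: mu = 0.
from scipy.stats import norm

alpha, sigma = 0.05, 1.0
mu1 = 0.2                                    # hypothetical alternative of interest
z_alpha = norm.ppf(1 - alpha)                # approximately 1.645

for n in (25, 100, 400, 1600):
    hurdle = z_alpha * sigma / n**0.5        # reject whenever x_bar > hurdle
    power = 1 - norm.cdf(z_alpha - mu1 * n**0.5 / sigma)  # power against mu = mu1
    print(f"n = {n:4d}   hurdle = {hurdle:.3f}   power = {power:.3f}")

# The output shows power climbing toward 1 while the hurdle shrinks toward 0:
# rejections become easy because power is HIGH, not low.
```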

I’m still hopeful that they will want to correct these howlers. I did try. (See “No headache power: for Deirdre”, and “To raise the power of a test is to lower the hurdle for rejecting the null”.)

[i] “Get empowered to detect power howlers”.

“How to avoid making mountains out of molehills”

“Telling what’s true about power”

Morrison, D. and Henkel, R. (eds.) (1970), The Significance Test Controversy, Aldine, Chicago.

Rosenthal, R. and Gaito, J. (1963), “The Interpretation of Levels of Significance by Psychological Researchers,” Journal of Psychology 55:33-38.

Schmidt, F. (1996), “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers,” Psychological Methods, Vol. 1, No. 2: 115-129.

Spanos, A. (2008), “Review of S. Ziliak and D. McCloskey’s The Cult of Statistical Significance,” Erasmus Journal for Philosophy and Economics, Volume 1, Issue 1: 154-164.

Ziliak, S. T. and McCloskey, D. N. (2008), The Cult of Statistical Significance, University of Michigan Press.


