Sunday, March 01, 2015

Nominating frequentism as the biggest fail

I posted a rant against nutrition science, and got this comment:
This is all true, but it should be noted nutrition science failed because of the Frequentist statistical methods that were used in the studies.

Frequentist statistical methods have screwed up a lot of other subjects, pretty much everything they've touched in fact, from economics to psychology, neither of which can be said to have increased its predictive capabilities within living memory.

I nominate Frequentist statistics as the biggest fail. ...

Already by the 50's and 60's statisticians had discovered a mass of (theoretical) problems with p-values and Confidence Intervals (the primary tools used in all those 'scientific' papers which turn out to be wrong far more than they're right). They only really work in very simple cases where they happen to give answers operationally identical to the Bayesian answer.

These problems aren't merely faulty application. They are problems of principle and are inherent in Frequentist Statistics even if performed correctly. ...

The only difference between now and 50 years ago is that back then people could only point out theoretical problems with Frequentist methods. Since then, it's become clear to everyone that Frequentist statistics is a massive practical failure as well.

Most heavily statistics-laden research papers are wrong.

Every branch of science that relies on classical statistics as its main tool has stagnated. Just like Economics and Psychology, their predictive ability hasn't improved in half a century despite hundreds of thousands of peer reviewed research papers, and massive research spending that dwarfs everything that came before.
He refers to this 1976 E.T. Jaynes article demonstrating how the frequentist gets wrong answers.

This seemed a little extreme to me, but now I see that a reputable journal is banning frequentism:
The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would be not required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP.

With the banning of the NHSTP from BASP, what are the implications for authors? The following are anticipated questions and their corresponding answers.

Question 1. Will manuscripts with p-values be desk rejected automatically?

Answer to Question 1. No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).

Question 2. What about other types of inferential statistics such as confidence intervals or Bayesian methods?

Answer to Question 2. Confidence intervals suffer from an inverse inference problem that is not very different from that suffered by the NHSTP. In the NHSTP, the problem is in traversing the distance from the probability of the finding, given the null hypothesis, to the probability of the null hypothesis, given the finding. Regarding confidence intervals, the problem is that, for example, a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval. Rather, it means merely that if an infinite number of samples were taken and confidence intervals computed, 95% of the confidence intervals would capture the population parameter. Analogous to how the NHSTP fails to provide the probability of the null hypothesis, which is needed to provide a strong case for rejecting it, confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval. Therefore, confidence intervals also are banned from BASP. ...

The NHSTP has dominated psychology for decades; we hope that by instituting the first NHSTP ban, we demonstrate that psychology does not need the crutch of the NHSTP, and that other journals follow suit.
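The editorial's definition of coverage is easy to check by simulation. Here is a minimal sketch (my own, with an assumed Normal model and known sigma = 1, not anything from the editorial): repeated 95% intervals capture the true mean about 95% of the time, which is all the procedure guarantees.

```python
import math
import random

def normal_ci(sample, z=1.96):
    """95% confidence interval for the mean, assuming known sigma = 1."""
    n = len(sample)
    m = sum(sample) / n
    half = z / math.sqrt(n)
    return (m - half, m + half)

random.seed(0)
true_mu = 3.0          # made-up "population" mean for the demo
trials = 10_000
hits = 0
for _ in range(trials):
    sample = [random.gauss(true_mu, 1.0) for _ in range(20)]
    lo, hi = normal_ci(sample)
    if lo <= true_mu <= hi:
        hits += 1

coverage = hits / trials
print(f"empirical coverage: {coverage:.3f}")   # close to 0.95
```

The 95% is a property of the long-run procedure, exactly as the editorial says; nothing in the simulation assigns a probability to any one interval containing the parameter.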
The beauty of the NHSTP is that it reduces an experiment to a single number to decide whether it is publishable or not. As far as I know, there is no other single statistic that is a suitable substitute.

The original post was attacked by professional skeptics, and defended by Dilbert. The skeptics have the attitude that if you say that scientists were wrong, then you do not understand science. Science works by correcting previous errors, they say.

I think that the skeptic-atheists hate Dilbert because he has posted some criticism of how evolution is taught and explained. That makes him an enemy of the atheist-evolutionists, who then call him a creationist. I do not think that he is religious at all, but that is the way leftist ideologues are.

The Skeptic radio podcast that attacked Dilbert is best known for co-host Rebecca Watson and Elevatorgate. She told some crazy and probably made-up story about a fellow atheist flirting with her in an elevator at an atheist convention. The details are unimportant, except that these folks get very upset about this sort of thing.


Anonymous said...

Ha! I wrote that comment, but as it happens I'm dead set against the p-value and NHST ban.

Although I believe the entire world would be better off ditching p-values and NHST, the only legitimate way to achieve that happy state is if people are persuaded to drop them.  Banning them is a bad shortcut for winning this argument.  In practice anyone using a Frequentist foundation for statistics (including most nominal “Bayesians”) will continue to do so under such a ban, so they'll make the same old mistakes just without those banned tools.

The only legitimate way for Bayesians to win is the old fashioned one of reasoned arguments.  If Bayesians aren't immediately convincing currently then their time would be better spent getting a deeper understanding of their subject in order to make better arguments.

Or to put it another way, I believe any trend toward banning methods will hurt Bayesian statistics in the long run even if initial banning efforts are neutral toward or favor Bayesian statistics.  Bayesians shouldn't do it and shouldn't advocate for it.

Note Frequentists either openly or unofficially banned Bayesian methods for a very long time (sometimes entire fields had such bans--good luck trying to publish in many parts of biostatistics during certain decades without using Fisher's or Neyman/Pearson's prescribed methods).  How did that work out for Frequentism or science?

Well, in a now-famous attempt to reproduce 50 or so of the most important and most cited cancer research findings, they got negative results in all but a handful of cases (I believe the numbers were 52 attempted, with successful replications in only 4).

Anonymous said...

What it all boils down to is that there are three major interpretations of probability distributions.

(1) Subjective: probabilities represent in some way people's beliefs.

(2) Frequentist: probabilities model objective physical frequencies.

(3) Objective Bayes: probability distributions like P(mu |K) model the objective uncertainty in the true value of mu implied by the knowledge K. The high probability region (HPR) of P(mu |K) provides a kind of 'uncertainty bubble' around the true value of mu (assuming you did your modeling right!). Any 'highly probable' consequences of P(mu | K) will hold true for the vast majority of mu's in the HPR. That's the basis for thinking 'highly probably' consequences will hold for the true mu as well when K is true. All probability distributions have this interpretation, and if you are genuinely concerned about real physical frequencies in any way then you work with P( frequencies |K).

The latter is by far the oldest of the three and is represented by the Bernoulli, Laplace, Jeffreys, Jaynes (all physicists) line of the subject. (3) is also vastly more general. It subsumes all cases of (1) and (2) where they actually make sense, but is applicable in a far wider range of problems.

For some reason I cannot comprehend, most people cannot wrap their brains around (3). For most statistically aware minions only (1) and (2) exist, and no matter how many times you try to explain (3) or point them to published references, they never get it. They never even get a vague hint of it. The next time you bring the subject up with them, they will insist (1) and (2) are the only possibilities in an endless cycle of stupidity.
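Interpretation (3) can be made concrete with a small sketch (my own toy model, not the commenter's): assuming the data are Normal with known sigma = 1 and a flat prior on mu, the posterior P(mu | data) is Normal(xbar, 1/n), and its 95% high-probability region is the "uncertainty bubble" around the true value.

```python
import math
import random

# Toy model (my assumptions): x_i ~ Normal(mu, sigma=1), flat prior on mu,
# which makes the posterior P(mu | data) a Normal(xbar, 1/n) distribution.
random.seed(1)
true_mu = 2.5                                   # made up for the demo
data = [random.gauss(true_mu, 1.0) for _ in range(50)]

n = len(data)
xbar = sum(data) / n
sd_post = 1.0 / math.sqrt(n)                    # posterior standard deviation

# The 95% high-probability region (HPR): the "uncertainty bubble" around mu.
hpr = (xbar - 1.96 * sd_post, xbar + 1.96 * sd_post)
print(f"posterior: Normal({xbar:.3f}, sd={sd_post:.3f})")
print(f"95% HPR: ({hpr[0]:.3f}, {hpr[1]:.3f})")
```

The HPR here is a statement about the uncertainty in mu given this one data set and the model K, not about what would happen over infinitely repeated samples.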

Roger said...

Thanks for educating me, but tell me this. The FDA decides whether to license new drugs based on p-values. If it stops using p-values, won't it have a more difficult job, and be more easily manipulated by the drug companies?

Anonymous said...

The short answer is it's not my problem. My problem is to get the foundations of statistical inference right and then to apply it successfully.

In practice p-values are trivial to manipulate (unless the effect is so large you could have seen the truth without statistics), which is part of the reason why the connection between p-value-laden papers and reality is nearly random.
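The manipulability is easy to demonstrate with the well-known optional-stopping trick (a sketch of mine, assuming Normal data and a simple z-test): keep collecting data and re-testing until p < 0.05, and the false positive rate climbs far above the nominal 5% even though the null hypothesis is exactly true.

```python
import math
import random

def p_value(sample):
    """Two-sided z-test of mu = 0, assuming known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(2)
trials = 2000
false_positives = 0
for _ in range(trials):
    sample = []
    for _ in range(10):                      # peek after every batch of 10
        sample += [random.gauss(0.0, 1.0) for _ in range(10)]
        if p_value(sample) < 0.05:           # "significant" though mu is 0
            false_positives += 1
            break

rate = false_positives / trials
print(f"false positive rate with optional stopping: {rate:.3f}")
```

Each individual test is performed correctly; it is only the researcher's freedom to stop when the number looks good that wrecks the error guarantee.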

Note, p-values, like everything else ever done by any statistician, are just an assumption-based function of the data. As such they have no special status which makes them un-manipulable.

You could argue the opposite. Bayesian methods are based on the sum and product rules of probability theory. There are a mass of theorems saying that under various types of natural conditions/assumptions the sum and product rules are exactly what you should use. Unsurprisingly, they are the only two facts all statisticians can agree on.

Bayes thus has an impeccable theoretical foundation. If you reject the Bayesian methods, you are necessarily rejecting the sum and product rules at least sometimes and are violating at least one or more of the assumptions in every one of those theorems.

Indeed, the sum and product rules are so strong in practice that, in the right hands, they are a far more reliable guide to inference than any one scientist's or philosopher's intuition about inference.

P-values, on the other hand, have no theoretical basis. They were ad-hoc methods chosen because they seemed (sorta) intuitive and were initially tried on very simple kinds of problems where they give answers identical to the Bayesian ones.

Outside of those examples, you get all kinds of horrendous behavior. Like Confidence Intervals which you can prove (using the same assumptions) cannot contain the true value.
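That claim sounds impossible, but the Jaynes article linked above has exactly such an example. Here is a rough Monte Carlo reconstruction (my own sketch of his truncated-exponential problem; the specific observations are as I recall them from the article): a device has lifetime x = theta + e with e ~ Exponential(1), so logically theta <= min(observations), yet the shortest 90% confidence interval built from the unbiased estimator lies entirely above that bound.

```python
import random

# Sketch (mine) of Jaynes's truncated-exponential example: lifetime
# x = theta + e with e ~ Exponential(1), so no device can fail before
# theta, and logically theta <= min(observations).
data = [12.0, 14.0, 16.0]            # observed lifetimes, as I recall them
n = len(data)
theta_star = sum(data) / n - 1.0     # unbiased estimator of theta

# Monte Carlo sampling distribution of the deviation (theta_star - theta).
random.seed(3)
devs = sorted(sum(random.expovariate(1.0) for _ in range(n)) / n - 1.0
              for _ in range(100_000))

# Shortest interval holding 90% of the deviations -> 90% confidence interval.
k = int(0.90 * len(devs))
best = min(range(len(devs) - k), key=lambda i: devs[i + k] - devs[i])
ci = (theta_star - devs[best + k], theta_star - devs[best])

print(f"90% CI for theta: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"but logically theta <= {min(data)}")   # CI lies above this bound
```

The confidence procedure is constructed perfectly correctly, yet for this data set every value inside the interval is logically impossible.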

Unsurprisingly, this cannot happen with a full Bayesian analysis, because using the sum and product rules forces all Bayesian conclusions to be consistent with whatever can be proved deductively. That's one of the practical benefits of that "theoretical foundation" for Bayesian methods which I spoke about.

Anonymous said...

"The beauty of the NHSTP is that it reduces an experiment to a single number to decide whether it is publishable or not. As far as I know, there is no other single statistic that is a suitable substitute."

By the way, this is untrue. If you really wanted such a thing, which most experienced statisticians do not, you could just use posterior-derived probabilities.

So if H0: "mu is less than mu_0" and H1: "mu is greater than mu_0", then instead of using the p-value you could use the single number Pr(H0 | data, evidence).

Again, the best and fully Bayesian thing to do is to report the posterior, but using the probability of H0 given the data and model is at least as good as, or better than, using p-values in every way.
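In the simple Normal-mean case (my assumptions: flat prior, known sigma = 1, made-up data), that single number is a one-line computation, and it even coincides numerically with the one-sided p-value, one of those simple cases where the two approaches happen to agree.

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Made-up data: x_i ~ Normal(mu, sigma=1) with a flat prior on mu,
# so the posterior for mu is Normal(xbar, 1/n).
random.seed(4)
mu_0 = 0.0
data = [random.gauss(0.4, 1.0) for _ in range(25)]
n = len(data)
xbar = sum(data) / n

# The single-number summary the commenter describes:
pr_H0 = phi((mu_0 - xbar) * math.sqrt(n))        # Pr(mu <= mu_0 | data)
print(f"Pr(H0 | data) = {pr_H0:.4f}")

# In this simple model it coincides with the one-sided p-value.
p_one_sided = 1 - phi((xbar - mu_0) * math.sqrt(n))
print(f"one-sided p-value = {p_one_sided:.4f}")
```

Unlike the p-value, though, Pr(H0 | data) is directly the quantity a journal or regulator actually wants: the probability that the hypothesis is true given the evidence and the model.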