There’s an interesting article by Joel Katzav looking at climate model testing from a sort of Poppery “severe-testy” point of view. It’s here in EOS. It pops up from time to time in various places.

The interest, at least for me, derives from its being almost completely at odds with how I understood the relationship between Popper’s idea of severe tests and Bayesian reasoning. As I understood it, a severe test is one that distinguishes a particular hypothesis or theory from all the alternative hypotheses. This would be something that the candidate theory predicts will happen, but the incumbent alternative(s) absolutely forbids.

To get Bayesian for a moment: if you have an existing theory H1 and a new theory Hz that you want to test severely, then you want to look for a phenomenon, G, such that

P(G|H1) = 0


P(G|Hz) = 1

You can see why by thinking about what the prior probability of Hz is and what you have to do to generate any belief in it. My understanding of Popper is that he argued more or less (though not in those exact terms) that the prior probability of any particular new hypothesis is infinitesimal, if not actually zero. Therefore, for the posterior probability given by:

P(Hz|G) = P(G|Hz)P(Hz) / ( P(G|Hz)P(Hz) + P(G|H1)P(H1) )

to be anything other than infinitesimal, the conditional probabilities of G have to be as given above. Most crucially, P(G|H1) has to be zero, or at least so small that the P(G|H1)P(H1) term does not swamp the tiny P(G|Hz)P(Hz) term. It’s a bootstrap process whereby new hypotheses have to pull themselves out of the void by grabbing onto the bootstraps of other hypotheses and tugging wildly, or something: it’s a messy process.
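To make the arithmetic concrete, here is a minimal sketch of the posterior formula above. It assumes, as the formula does, that H1 and Hz are the only two hypotheses in play, so P(H1) = 1 - P(Hz). The function name and the numbers are mine, chosen purely for illustration.

```python
def posterior_new(prior_new, like_new, like_old):
    """P(Hz|G) when H1 and Hz are the only hypotheses (an assumption).

    prior_new: P(Hz), like_new: P(G|Hz), like_old: P(G|H1).
    """
    prior_old = 1.0 - prior_new           # P(H1)
    numerator = like_new * prior_new      # P(G|Hz) P(Hz)
    evidence = numerator + like_old * prior_old
    return numerator / evidence


tiny_prior = 1e-9  # illustrative stand-in for an "infinitesimal" prior

# Severe test: the incumbent forbids G outright.
severe = posterior_new(tiny_prior, like_new=1.0, like_old=0.0)

# Not severe: the incumbent allows G half the time.
weak = posterior_new(tiny_prior, like_new=1.0, like_old=0.5)

# Nearly severe: P(G|H1) is much smaller than the prior on Hz.
nearly = posterior_new(tiny_prior, like_new=1.0, like_old=1e-12)

print(severe, weak, nearly)
```

With these made-up numbers the severe case lifts Hz all the way to a posterior of 1, the weak case leaves it about as negligible as it started, and the nearly severe case still gets it most of the way there.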

The Bayesian approach as I’ve outlined it above is, to my mind, very close to how I’d understood severe tests to work.

I think Katzav has confused the probability of the observed consequences (a particular change in ocean heat content, say) of a given climate sensitivity, P(data|F) as he calls it, with the probability of a given value of the sensitivity itself, P(F). To put it another way, he confuses the probability of the observed consequence of a hypothesis with the probability of the hypothesis itself. He claims that a severe test is one in which P(F) is small, but actually it’s one in which P(data|F) is small for all values of F except the one being tested.

A second confusion seems to concern the interpretation of P(data), which he states “cannot … be an indicator of the severity with which data test different estimates of F” because it is constant. The point of a severe test is that the part of P(data) contributed “in light of background knowledge”, i.e. by all currently-favoured hypotheses excluding the one that is being severely tested, is zero, or very, very small.
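The point about P(data) can be sketched by splitting it into the term from the hypothesis under test and the term from everything else. This is only an illustration: the three hypotheses, their priors and their likelihoods are invented, and the extra hypothesis H2 is hypothetical.

```python
def evidence_split(priors, likelihoods, tested):
    """Split P(data) into the tested hypothesis's term and the background term."""
    terms = {name: priors[name] * likelihoods[name] for name in priors}
    tested_term = terms[tested]
    background_term = sum(v for name, v in terms.items() if name != tested)
    return tested_term, background_term


# Invented numbers: two favoured incumbents and one long-shot newcomer.
priors = {"H1": 0.6, "H2": 0.399, "Hz": 0.001}
# P(data | hypothesis): the incumbents all but forbid the data.
likelihoods = {"H1": 0.0, "H2": 1e-6, "Hz": 1.0}

tested_term, background_term = evidence_split(priors, likelihoods, "Hz")
posterior_hz = tested_term / (tested_term + background_term)

print(background_term, posterior_hz)
```

P(data) is indeed a single constant once the data are in, but how much of it comes from the background hypotheses is exactly what makes the test severe or not. Here the background term is tiny, so nearly all of P(data) is accounted for by Hz.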

A third difficulty with the argument is that the example given doesn’t even constitute a severe test if one dumps Bayesian reasoning altogether. Values of sensitivity inhabit a continuum, with nearby values being indistinguishable by reason of having very similar observational consequences. Choosing paleodata exacerbates the problem, because the uncertainties in the observed values are typically large, which increases the overlap. A wide range of climate sensitivities is consistent with a given measured isotope ratio (to pick an example out of a hat), so the observed value cannot decide between different sensitivities outright; it can only change the weight we should give to each one.
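A toy calculation shows what "only change the weight" looks like. Everything here is an assumption made for illustration: the grid of sensitivity values, the linear forward model linking sensitivity to a proxy value, the Gaussian error model and the numbers themselves. None of it comes from Katzav's article or from real paleodata.

```python
import math


def gaussian_likelihood(observed, predicted, sigma):
    """P(observed | predicted) under an assumed Gaussian measurement error."""
    z = (observed - predicted) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def update(prior, likelihoods):
    """Bayes' rule on a discrete grid: reweight the prior and renormalise."""
    weighted = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]


def predicted_proxy(sensitivity):
    """Hypothetical forward model: proxy value proportional to sensitivity."""
    return 2.0 * sensitivity


# Hypothetical grid of sensitivities, 1.5 to 5.0 in steps of 0.5.
sensitivities = [1.5 + 0.5 * i for i in range(8)]
prior = [1.0 / len(sensitivities)] * len(sensitivities)
observed = 6.0  # made-up proxy measurement


def posterior_for(sigma):
    likes = [gaussian_likelihood(observed, predicted_proxy(s), sigma)
             for s in sensitivities]
    return update(prior, likes)


noisy = posterior_for(sigma=3.0)   # large uncertainty, paleodata-like
sharp = posterior_for(sigma=0.1)   # small uncertainty, close to a severe test

for s, p_noisy, p_sharp in zip(sensitivities, noisy, sharp):
    print(f"{s:.1f}  noisy={p_noisy:.3f}  sharp={p_sharp:.3f}")
```

With the large uncertainty, every value on the grid keeps a non-negligible share of the probability and the data merely nudge the weights towards the middle. With the small uncertainty, the same update rule piles essentially all the weight onto one value, which is the severe-test limit.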

The power of the Bayesian approach is that it deals with this situation AND it deals with the classic “severe” test. It provides a consistent approach across a range of different types of evidence.