When you deal with scientific data, you need to understand the concept of uncertainty. Measurement uncertainty, for example, tells you something about how close the person who made the measurement thinks the measurement is to the true value. You also need to know how to pass that uncertainty information through calculations appropriately.

Almost everything you could ever want to know about this topic can be found in the GUM – the Guide to the Expression of Uncertainty in Measurement. It is the most readable document with sub-sub-sections that I have ever read. That a document on uncertainty should be so clearly written is one of life’s delightful ironies.

Sadly, there’s some bad advice out there too. Some of that bad advice is propagated by perfectly sensible people. Go figure. A non-zero percentage of people reading this will conclude that I’m giving out bad advice. I welcome feedback from you all, but with that in mind…

**Myth 0**: it’s simple

It’s not. NEXT!

**Myth 1**: If you average together lots of measurements the uncertainty gets smaller and smaller

It is sometimes said that the uncertainty on an average of N measurements is sqrt(N) times smaller than the uncertainty on a single measurement. For something like the global mean temperature, which is aggregated across tens or hundreds of thousands of measurements, the uncertainty would therefore be teeny-tiny.

While it is true that certain kinds of error average out in this way, not all of them do and certainly not all of the errors associated with the measurements that go into the global average temperature. Averaging does reduce the effect of independent errors, which helps to detect the presence of errors which have a large systematic component, but it does not reduce systematic errors that are common to all measurements.
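Here is a minimal simulation sketch of that distinction. It assumes each measurement error is the sum of an independent part and a part shared by every measurement in the average. The function name, the Gaussian error model and the sizes chosen (1.0 and 0.3) are illustrative assumptions, not taken from any real data set.

```python
import random
import statistics


def simulate_mean_errors(n_obs, n_trials, sigma_independent, sigma_common, seed=0):
    """Return the error in an n_obs-average for each of n_trials simulated trials."""
    rng = random.Random(seed)
    errors_of_mean = []
    for _ in range(n_trials):
        # One systematic error shared by every measurement in this trial
        common = rng.gauss(0.0, sigma_common)
        # A separate, independent error for each measurement
        independent = [rng.gauss(0.0, sigma_independent) for _ in range(n_obs)]
        # Averaging shrinks the independent part but leaves the shared part intact
        errors_of_mean.append(common + statistics.fmean(independent))
    return errors_of_mean


for n_obs in (1, 10, 100, 1000):
    spread = statistics.pstdev(simulate_mean_errors(n_obs, 500, 1.0, 0.3))
    expected = (1.0 ** 2 / n_obs + 0.3 ** 2) ** 0.5
    print(f"n={n_obs:5d}  simulated={spread:.3f}  expected={expected:.3f}")
```

As the number of measurements grows, the spread of the error in the average should level off near the size of the shared error rather than carrying on towards zero.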

**Myth 2**: You can only reduce the uncertainty by averaging repeated measurements of the same thing

In many textbooks, the process of reducing uncertainty by averaging across multiple measurements is illustrated with the example where one has multiple measurements of the same thing. It’s common when doing an experiment to make multiple measurements of the same thing and take an average, thus reducing the uncertainty. Some people claim that this is the only situation in which averaging can help to reduce uncertainty, but this is not the case. One has to be careful though to understand what it is you are reducing the uncertainty of.

The formula for the propagation of uncertainty can be used to work out the uncertainty in an average of a set of values very easily. In the situation that the uncertainties in the individual values are independent and all the same (a situation that is by no means universal) then the uncertainty **of the average** is smaller than the uncertainties of the individual values. Note though that the uncertainties in the individual values are unchanged; it is only the uncertainty of their average that is smaller. Depending on what that average is intended to represent there may be other uncertainties to take into account, but that is another story.
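Written out, the calculation described above looks roughly like this, assuming the errors in the individual values are independent. The symbols are just notation chosen for this sketch: the values are x_i, their standard uncertainties are u(x_i), and their average is x-bar.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
u^2(\bar{x}) &= \sum_{i=1}^{n}\left(\frac{\partial \bar{x}}{\partial x_i}\right)^{2} u^2(x_i)
              = \frac{1}{n^2}\sum_{i=1}^{n} u^2(x_i) \\
u(\bar{x}) &= \frac{u}{\sqrt{n}} \quad \text{if } u(x_i) = u \text{ for all } i
\end{aligned}
```

Nothing in these steps requires the x_i to be repeated measurements of the same quantity; the only assumption used is independence.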

A longer post on the topic of whether this is the case.

**Myth 3**: You can only reduce the uncertainty by averaging repeated measurements if the uncertainties on the measurements are all the same

This is an extension and even stricter version of Myth 2 and states that you can only reduce the uncertainty in an average if the errors are i.i.d. – independent and identically distributed. The argument is: because different instruments have different error characteristics, one cannot benefit from the error-reducing effects of averaging.

However, the simple formula for the propagation of uncertainty shows this is not necessarily true. The uncertainty in an average of values with different uncertainties is easy to calculate (again assuming they’re independent): it is the square root of the sum of the squared uncertainties, divided by the number of values. If you have *n* measurements with an uncertainty of *sigma* and one more measurement with an uncertainty of *delta* then the average of the *n+1* measurements will have a smaller uncertainty than an average of the *n* measurements if *delta/sigma* is less than *sqrt(2+1/n)*. If *delta* is larger than this limit then (all else being equal) adding that extra value to the average won’t reduce the uncertainty, which is worth bearing in mind.
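A quick numerical check of that threshold, as a sketch. It assumes independent errors and a plain, unweighted average; the function name and the example values of *n*, *sigma* and *delta* are made up for illustration.

```python
import math


def uncertainty_of_mean(uncertainties):
    """Standard uncertainty of an unweighted average, assuming independent errors."""
    n = len(uncertainties)
    return math.sqrt(sum(u ** 2 for u in uncertainties)) / n


n, sigma = 10, 1.0
threshold = sigma * math.sqrt(2 + 1 / n)  # largest delta that still helps
baseline = uncertainty_of_mean([sigma] * n)
print(f"threshold for delta: {threshold:.3f}")

for delta in (0.5, 1.0, 1.4, 1.5, 3.0):
    combined = uncertainty_of_mean([sigma] * n + [delta])
    verdict = "smaller" if combined < baseline else "not smaller"
    print(f"delta={delta:.1f}  u(mean of n+1)={combined:.4f}  ({verdict} than {baseline:.4f})")
```

Values of *delta* just below the threshold still reduce the uncertainty of the average and values just above it do not. Note that this limit applies to an unweighted average; weighting each value by its uncertainty is a different calculation.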

**Myth 4**: You should only retain as many significant figures in a calculation as there are in the inputs

A widely used (but incorrect) rule of thumb for estimating the uncertainty in a calculated figure is to keep only as many significant figures in the result as you had in the inputs. If the inputs have different numbers of significant figures, then go with the smaller number of significant figures. This is a quick and dirty approximation that works OK in some simple situations, but is usually wrong and is no replacement for a proper uncertainty calculation.

Some people insist that this rule of thumb actually places a fundamental limit on the accuracy of a calculation, but this is not the case and it’s easy to construct examples which show that it is not.
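One such example, as a sketch: take a large set of values, record each one to the nearest whole number (two significant figures here), and compare the average of the recorded values with the average of the originals. The numbers are synthetic and the set-up assumes the rounding errors behave independently, which is reasonable here because the values vary by much more than the rounding step.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic "true" values spread between 10 and 30
true_values = [rng.uniform(10.0, 30.0) for _ in range(10_000)]
# Each value recorded to the nearest whole number: two significant figures
recorded = [round(v) for v in true_values]

mean_true = statistics.fmean(true_values)
mean_recorded = statistics.fmean(recorded)

print(f"mean of exact values:    {mean_true:.4f}")
print(f"mean of recorded values: {mean_recorded:.4f}")
print(f"difference:              {mean_recorded - mean_true:+.4f}")
```

The significant figures rule would have you quote the average to the nearest whole number, yet the two averages agree far more closely than that.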

A longer post on the murky origins of the significant figures approximation.

A longer post with code snippets showing it’s wrong.

**Myth 5**: statements about uncertainties are unverifiable and errors are completely unknowable

In the Guide to the Expression of Uncertainty in Measurement (all praise its name) “*error*” is defined as the “*result of a measurement minus a true value of the measurand*”. The unknowability of the true value implies that the error is also unknowable. Some have taken this to mean that the uncertainty is unknowable and that we should just accept whatever value they conjure up. However, although we don’t know what the individual errors are, hypotheses about their form – magnitude, correlation structure, whether they are independent or systematic – are testable and can have observable, real-world consequences. For example, the assumptions will tell us something about the expected variance of multiple subsets of the data, which is relatively easy to test.

We also know that assumptions about errors must be logically and physically consistent. One cannot assume that errors are completely independent at one point in a calculation and perfectly dependent in another.

It’s also possible to make statistical inferences about the values of errors in measurements, particularly where we have multiple measurements and a good understanding of how they relate to each other.
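The subset-variance check mentioned above can be sketched with simulated data. The hypothesis under test is "errors are independent with a standard deviation of 1", which predicts how much the averages of subsets should scatter. The set-up (subsets standing in for, say, individual instruments, each with an optional shared error) and all names and numbers are assumptions for illustration only.

```python
import random
import statistics


def subset_mean_spread(n_subsets, subset_size, sigma_independent, sigma_shared, seed=0):
    """Standard deviation of subset means for simulated measurement errors.

    Each subset gets one shared error plus an independent error per
    measurement. The true value is taken to be zero throughout.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_subsets):
        shared = rng.gauss(0.0, sigma_shared)
        values = [shared + rng.gauss(0.0, sigma_independent) for _ in range(subset_size)]
        means.append(statistics.fmean(values))
    return statistics.stdev(means)


subset_size = 50
# Scatter of subset means implied by "independent errors, sigma = 1"
predicted = 1.0 / subset_size ** 0.5

for sigma_shared in (0.0, 0.5):
    observed = subset_mean_spread(400, subset_size, 1.0, sigma_shared)
    print(f"shared error={sigma_shared:.1f}  observed={observed:.3f}  predicted={predicted:.3f}")
```

When the data really do have only independent errors, the observed scatter should sit close to the prediction. When a shared error is present, the scatter comes out much larger and the independence hypothesis can be rejected, even though no individual error is ever known.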

**Myth 6**: we just don’t know how uncertain historical measurements are

While it is true that historical measurements do not usually come with calibration certificates and ISO-compliant uncertainty quantifications (sometimes they do, but rarely), that does not mean that we don’t know anything or that we can’t know anything. As scientists, we can form and test hypotheses about data and, as I note above, those hypotheses can include hypotheses about the size and nature of the errors and uncertainties in that data. Statistical models which encapsulate these hypotheses (formally and less formally) can be used to make inferences about the nature and magnitude of the uncertainties and much besides.

It’s likely true that we will never know for certain how accurate a particular measurement was, but that doesn’t mean that we can’t learn anything about it.

**Myth 7**: Measurements are either reliable or they are not

Often, someone will ask whether historical measurements are “reliable”. It’s a fair question, perhaps, but not quite the right one. The pertinent question is “how reliable are historical measurements?” followed up by “how reliable do they need to be for a particular purpose?”

Old measurements nearly always contain some useful information; the key is working out what that is. Sea surface temperature (SST) measurements made by a ship in the tropical Pacific in 1865 might only be within a degree or so of the true SSTs, but they might still usefully pin down the state of El Niño – where variations span several degrees between an El Niño state and La Niña – and, together with many others, could contribute to a usefully accurate global mean. However, if you wanted to know what the SST was at a particular place and time to a hundredth of a degree, they wouldn’t be a great help.

**Myth 8**: A single measurement does not have a variance since there is only one data point

There is a misunderstanding about what an uncertainty on a single measurement represents and what variance means. One way of thinking about the uncertainty is to say that the value of the measurement and the uncertainty define the mean and variance of a probability distribution (of some shape, often assumed to be Gaussian though it needn’t be). The probability distribution is a way of describing uncertainty in the true value and while there is a single measured value and, one assumes, a single true value, we don’t know what the single true value is. Consequently, we must consider a range of possibilities. The single measurement can have a variance in that sense and the variance need not be based on multiple measurements.

**Myth 9**: Things that have happened don’t have a probability distribution because they either happened (probability = 1) or they didn’t (probability = 0)

This is related to Myth 8, but interestingly different: errors in a measurement can’t have a “distribution” because we know that only one thing happened and so that must have probability equal to one. All other things – the things that didn’t happen – must have probability equal to zero.

When we talk about uncertainty, a key point is that we don’t know what the errors are (if we did know what they were then we could subtract the error from the measurement and get a better measurement). We know that something happened, but we don’t know exactly what. A probability between zero and one is a way of indicating that fuzziness. Less colloquially, probabilities can reflect our beliefs regarding events. Zero indicates we think a thing definitely did not happen. One indicates that a thing definitely did happen. A number in between the two limits suggests we think it might or might not have happened, with higher numbers suggesting we think it more likely that a thing happened.

For a measurement, we could (maybe) write down a list of possible measurement errors and assign a probability to each one. That’s what the distribution is effectively – an exhaustive list of possibilities and their likelihoods. The individual probabilities must add up to one and this means that *something* definitely happened. That no individual value has a probability of one indicates that we don’t know exactly what.

Typically, there are an infinite number of possible errors, so a mathematical description is used to describe the distribution. Probability and statistics give us tools for reasoning quantitatively about these kinds of uncertain events.

**Myth 10**: It all depends on Gaussians

Some claim that cancellation of errors can only occur if the distributions are Gaussian (often with one or more additional stipulations as detailed in the other myths). This is not the case. The propagation of uncertainty formula doesn’t care what shape the distributions are, only that they have a mean and variance. If you are using a Monte Carlo method to calculate the uncertainties then you can propagate any input distributions you like, assuming you know what they look like.
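A small Monte Carlo sketch makes the same point. It draws errors from two decidedly non-Gaussian distributions, each scaled to have a mean of zero and a standard deviation of one, and compares the spread of the average with the value the propagation formula gives for independent errors. The two error models and all the names are illustrative assumptions.

```python
import math
import random
import statistics


def spread_of_mean(draw_error, n_obs, n_trials, seed=0):
    """Monte Carlo estimate of the standard deviation of an n_obs-average."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([draw_error(rng) for _ in range(n_obs)])
        for _ in range(n_trials)
    ]
    return statistics.pstdev(means)


# Two non-Gaussian error models, each with mean 0 and standard deviation 1
error_models = {
    "uniform": lambda rng: rng.uniform(-math.sqrt(3), math.sqrt(3)),
    "skewed (shifted exponential)": lambda rng: rng.expovariate(1.0) - 1.0,
}

n_obs = 100
for name, draw_error in error_models.items():
    simulated = spread_of_mean(draw_error, n_obs, 2000)
    print(f"{name}: simulated={simulated:.3f}  formula={1 / math.sqrt(n_obs):.3f}")
```

Both error models should give a spread close to the 1/sqrt(n) value from the formula, flat or lopsided as the input distributions are.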

Off-the-shelf statistical methods do often assume that errors are Gaussian (and also that they are uncorrelated, independent etc.). This could be a problem, depending on the types of errors. There are various solutions. One could develop methods more appropriate to the data. Or, one could test to see if the methods are actually sensitive to the expected distribution shapes.

**Myth 11**: Uncertainty in global temperature is like measuring two planks of wood or whatever

Analogies are like analogies, they’re great when they work and not great when they don’t. While the task of measuring planks of wood and measuring temperatures (later used to calculate a global temperature) are ostensibly similar, they are also different. When measuring two planks of wood, you’ll likely be in your workshop (or whatever passes for a workshop round yours) and measure them both with the same tape measure or ruler. A bag of resistors (if electronics is your thing) could conceivably be from a single bad batch. It is not difficult to conceive of, or accidentally introduce, systematic errors in such situations. How illuminating such small scale examples might be for global temperature is an interesting question. We know that the temperature at a station in Siberia is always measured with the same thermometer, but that the temperature at a station in Thailand is likely measured using a different one. While all measurements made at the Siberian station could, conceivably, have the same systematic error, the chances of that error being the same as any affecting the station in Thailand are slim. One person’s systematic error is another’s random error. On the other hand, if both thermometers are housed in the same type of screen, then they could share weather-dependent errors that would be correlated to the degree that weather at the two stations is correlated. Siberia and Thailand are sufficiently far away that any correlation is liable to be small, but for stations within a few hundred kilometres of each other, the correlation could be meaningful.
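To put rough numbers on how much that correlation matters, here is a sketch using the propagation formula with a correlation term for the average of two stations. The function name and the uncertainty and correlation values are invented for illustration and do not describe any real pair of stations.

```python
import math


def uncertainty_of_pair_mean(u1, u2, correlation):
    """Standard uncertainty of the average of two values with correlated errors."""
    # Variance of (x1 + x2) / 2, including the covariance term
    variance = (u1 ** 2 + u2 ** 2 + 2 * correlation * u1 * u2) / 4
    return math.sqrt(variance)


for correlation in (0.0, 0.5, 1.0):
    u_mean = uncertainty_of_pair_mean(1.0, 1.0, correlation)
    print(f"correlation={correlation:.1f}  u(mean)={u_mean:.3f}")
```

With no correlation, the pair average gets the full benefit of averaging. With perfect correlation, as for two planks measured with the same faulty tape measure, averaging gains nothing. Real stations are liable to sit somewhere in between, depending on how far apart they are.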

**Myth n**: we know it all

All of the above myths have been raised at various times with regard to estimating uncertainty in global temperature. The papers describing global temperature data set development go to some lengths to estimate the uncertainties inherent in their estimates and dataset producers are well aware that historical measurements are uncertain and have sometimes severe limitations. They’re also aware that the process of understanding uncertainty in historical data is ongoing and that there is still much to learn about the detailed nature of observations. But none of this means that we know nothing or that nothing is to be known.