I’ve been staring intently at this diagram (from Evidence for a limit to human lifespan) for a while now, particularly panel b. Panel b shows linear regressions of the logarithm of the number of survivors per 100,000 people at different ages from 70 up to 110. Each coloured line represents a linear regression for a particular age, with the colours indicating roughly what age that is.
I’ve been trying to make my own version of this diagram from data in the Human Mortality Database. I’m not sure
(a) whether I have the correct file (fltper_1x1.txt),
(b) whether I have the correct column in that file (lx), or
(c) even assuming I do have the correct file and column, exactly how to do the linear regressions, because for some of the more advanced ages there are no survivors per 100,000 people early in the 20th century and zero doesn’t log very well.
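One way round the zero problem is simply to leave the zero years out of the fit. I don’t know whether that is what the paper did, so treat the following as a minimal sketch under that assumption. The function name fit_log_survivors and the numbers fed to it are made up for illustration; only np.polyfit and np.log10 are real NumPy calls.

```python
import numpy as np

def fit_log_survivors(years, lx):
    """Least-squares line through log10(lx) against year, skipping zeros.

    Assumption: years with lx == 0 are dropped before fitting. This is one
    possible choice, not necessarily the one made in the paper.
    """
    years = np.asarray(years, dtype=float)
    lx = np.asarray(lx, dtype=float)
    keep = lx > 0  # log10(0) is undefined, so those years are left out
    if keep.sum() < 2:
        raise ValueError("need at least two non-zero years to fit a line")
    slope, intercept = np.polyfit(years[keep], np.log10(lx[keep]), 1)
    return slope, intercept, int(keep.sum())

# Made-up numbers for illustration only, not values from the database.
years = [1900, 1910, 1920, 1930, 1940]
lx = [0, 0, 1, 10, 100]
slope, intercept, n_used = fit_log_survivors(years, lx)
```

Dropping the zeros is not neutral: it throws away exactly the years in which nobody reached that age, so the fitted line for the oldest ages rests on fewer, later points than the lines for younger ages.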
However, in mulling and intently staring, I noticed that something odd is happening in panel (b): the lines cross. One might interpret that as meaning that at some point before 1920, you were more likely to live out the year if you were 110 than if you were 105, which seems unlikely. Another way to think about it is that linear regression might not be the best way to represent these data.
A simpler problem is to look only at the period from 1980, during which the data are non-zero in all age ranges given in the data set. With the caveat that I’m not sure if I’m looking in the right file, the pattern I see looks like this.
There are two lines for each age. One is wiggly and that’s the log of the data values that came out of the file. The other is straight and is the linear regression of that same logged data. The gradient of the lines increases with increasing age over this period, which, on the face of it, would contradict the claim made in the paper that the rate of change shows diminishing gains at the higher end of the age range.
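For anyone who wants to repeat the comparison, here is a rough sketch of how the per-age gradients could be computed. It assumes the file is whitespace-separated, has a header row beginning with "Year" that names the columns (including "Age" and "lx"), and marks the open-ended top age with a trailing "+"; check that against your own copy of the file. The function names read_lx and slopes_since are mine, and the sample table is invented so the block runs on its own.

```python
import numpy as np

def read_lx(lines):
    """Return {age: (years, lx_values)} from the lines of a life table.

    Assumed layout: title lines, then a header row starting with "Year",
    then one whitespace-separated row per year and age.
    """
    table = {}
    columns = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        if fields[0] == "Year":
            columns = fields  # header row gives the column names
            continue
        if columns is None:
            continue  # still in the title lines above the header
        row = dict(zip(columns, fields))
        age = int(row["Age"].rstrip("+"))  # "110+" is stored as 110
        years, values = table.setdefault(age, ([], []))
        years.append(int(row["Year"]))
        values.append(float(row["lx"]))
    return table

def slopes_since(table, start_year, ages):
    """Gradient of log10(lx) against year for each age, from start_year on."""
    slopes = {}
    for age in ages:
        years = np.array(table[age][0], dtype=float)
        lx = np.array(table[age][1], dtype=float)
        keep = (years >= start_year) & (lx > 0)  # zeros cannot be logged
        slopes[age] = np.polyfit(years[keep], np.log10(lx[keep]), 1)[0]
    return slopes

# Invented table for illustration only, not values from the database.
sample = ["Made-up life table", "", "Year Age lx"]
for year in range(1980, 1985):
    sample.append(f"{year} 100 {1000 * 2 ** (year - 1980)}")
    sample.append(f"{year} 110+ {10 * 4 ** (year - 1980)}")
slopes = slopes_since(read_lx(sample), 1980, [100, 110])
```

With a real file, the lines could come from an open file handle passed to read_lx in place of sample. A larger gradient at a higher age, as in the invented table, is the pattern described above.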
Behind the face of it, there are some caveats that need to be considered. First, the data are heterogeneous. Since 2005, the input data on numbers of deaths lump all deaths at ages of 105 and up together. Before 2005, deaths are recorded at each age all the way up to 124. So there’s a change in the way the input data are presented at that point.
Second, a series of calculations (making a range of necessary assumptions) is performed on the data to convert the reports of births, deaths and censuses into a consistent format and to derive the statistic I plotted: survivors per 100,000 at age x. What effect these assumptions and calculations have, particularly at the very upper end of the age range where individual deaths can make quite a difference, isn’t clear to me.
What this means for the analysis in the paper, I don’t know. It might, of course, mean nothing. This process of learning about how the data were gathered and processed, and just what exactly they mean, is always an interesting aspect of exploring new datasets. It does, however, confirm my initial concern about the heterogeneous nature of the data, and it’s the kind of detail I’d like to have seen explored in the manuscript.