A Sixty-Year-Old Program for Predicting the Future

Posted by Cleve Moler, May 19, 2024

The graphics in my post about R^2 were produced by an updated version of a sixty-year-old program involving the U.S. census. Originally, the program was based on census data from 1900 to 1960 and sought to predict the population in 1970. The software back then was written in Fortran, the predominant technical programming language a half century ago. I have updated the MATLAB version of the program so that it now uses census data from 1900 to 2020.

censusapp2024
Risky Business
Splines
Exponentials
Predictions
Conclusion
Blogs
FMM
Software

`censusapp2024`

The latest version of the census application is now available at censusapp2024. Here are the data and the opening screenshot.

[t,p] = UScensus;
fprintf('%12d%12.3f\n',[t,p]')

        1900      75.995
        1910      91.972
        1920     105.711
        1930     123.203
        1940     131.669
        1950     150.697
        1960     179.323
        1970     203.212
        1980     226.505
        1990     249.633
        2000     281.422
        2010     308.746
        2020     331.449

Risky Business

Today, MATLAB makes it easier to vary parameters and visualize results, but the underlying mathematical principles are unchanged:

Using polynomials to predict the future by extrapolating data is a risky business.

One new observation is added to the data every 10 years, when the United States does the decennial census. Originally there were only 7 observations; today there are 13. The program now allows you to fit the data exactly by interpolation with a polynomial of degree 12 or fit it approximately by polynomials of degree less than 12.

Here are the least-squares fits with linear, cubic, and degree seven polynomials and the interpolating polynomial. As the polynomial degree increases, so does R^2, until R^2 reaches one with the exact fit.
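For readers who want to experiment outside the app, here is a minimal sketch of these fits using polyfit on the census values listed above. The centering and scaling of the years, and the formula R^2 = 1 - SSres/SStot, are assumptions made for this illustration; censusapp2024 may do things differently.

```matlab
% Census data copied from the table in this post (population in millions).
t = (1900:10:2020)';
p = [75.995 91.972 105.711 123.203 131.669 150.697 179.323 ...
     203.212 226.505 249.633 281.422 308.746 331.449]';

% Center and scale the years so the high-degree fits stay well conditioned.
s = (t - 1960)/60;

degrees = [1 3 7 12];          % linear, cubic, degree seven, interpolating
R2 = zeros(size(degrees));
for k = 1:numel(degrees)
    c = polyfit(s, p, degrees(k));          % least-squares coefficients
    r = p - polyval(c, s);                  % residuals at the data points
    R2(k) = 1 - sum(r.^2)/sum((p - mean(p)).^2);
end
disp([degrees; R2]')
```

Because each lower-degree polynomial is a special case of the higher-degree ones, the residual sum of squares cannot increase with the degree, which is why R^2 never goes down.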

Do any of these fits look like they could be used to predict future population growth?

Splines

In addition to polynomials, you can choose interpolation by three different piecewise Hermite cubics.

spline Continuous second derivative, "not-a-knot" end condition.
pchip Continuous first derivative, strictly shape-preserving.
makima Continuous first derivative, relaxed shape-preserving.

Since these fits interpolate the data, all their R^2 values are one. But before 1900 and after 2020 these functions are cubic polynomials that are not designed for extrapolation.
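The three interpolants can be compared with a few lines like the following. This is only a sketch: the query year 2030 is a hypothetical choice, and the code relies on the default behavior of spline, pchip, and makima, which evaluate the end cubic when asked for a point outside the data.

```matlab
% Census data copied from the table in this post (population in millions).
t = (1900:10:2020)';
p = [75.995 91.972 105.711 123.203 131.669 150.697 179.323 ...
     203.212 226.505 249.633 281.422 308.746 331.449]';

% All three interpolate, so the misfit at the data points is roundoff.
err = [norm(spline(t,p,t) - p), norm(pchip(t,p,t) - p), norm(makima(t,p,t) - p)];

% Outside 1900 to 2020 each function simply continues its last cubic piece.
tq = 2030;                                  % hypothetical query year
pq = [spline(t,p,tq), pchip(t,p,tq), makima(t,p,tq)];
disp(pq)
```

Any disagreement among the three values at tq comes entirely from how each method chooses slopes at the data points, not from any model of population growth.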

Exponentials

It is also possible to do nonlinear least squares fits by an exponential, a logistic sigmoid, and an exponential of an exponential, known as the Gompertz model.

exponential exp(b*t+c)
logistic a./(1+exp(-b*(t-c)))
gompertz a*exp(-b*exp(-c*t))
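As one way to carry out such a fit, here is a sketch for the logistic model. It assumes a separable approach: for fixed b and c the best a comes from a linear least-squares problem, so fminsearch only has to search over b and c. The starting guess is hypothetical, and this is not necessarily the method used inside censusapp2024, so the parameters it returns may differ from those reported by the app.

```matlab
% Census data copied from the table in this post (population in millions).
t = (1900:10:2020)';
p = [75.995 91.972 105.711 123.203 131.669 150.697 179.323 ...
     203.212 226.505 249.633 281.422 308.746 331.449]';

% Logistic model p ~ a./(1+exp(-b*(t-c))) with x = [b c].
g = @(x) 1./(1 + exp(-x(1)*(t - x(2))));    % shape of the curve for given b, c
afit = @(x) (g(x)'*p)/(g(x)'*g(x));         % best a for given b, c (linear LS)
resnorm = @(x) norm(p - afit(x)*g(x));      % residual norm to be minimized

x0 = [0.02 2000];                           % rough starting guess (assumed)
x = fminsearch(resnorm, x0);
a = afit(x);  b = x(1);  c = x(2);
fprintf('a = %.1f  b = %.4f  c = %.1f\n', a, b, c)
```

The same pattern should carry over to the Gompertz model by replacing g with exp(-b*exp(-c*t)), although the years would probably need to be shifted first to keep the inner exponential from underflowing.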

An article by Kathleen and Even Tjørve, from the Inland Norway University of Applied Sciences in Elverum, Norway, in the journal PLOS ONE has this to say about Gompertz. "The Gompertz model has been in use as a growth model even longer than its better known relative, the logistic model. The model, referred to at the time as the Gompertz theoretical law of mortality, was first suggested and first applied by Mr. Benjamin Gompertz in 1825. He fitted it to the relationship between increasing death rate and age, what he referred to as 'the average exhaustions of a man's power to avoid death' or the 'portion of his remaining power to oppose destruction.'"

Predictions

Which fits are suitable for predicting future population size?

Despite their large R^2 values, polynomials of any degree are not suitable because outside of the time interval they behave like polynomials and do not provide realistic predictions.

Splines were never intended for extrapolation.

That leaves the exponentials. The simple exponential model grows exponentially and is not suitable. The Gompertz fit does approach a finite asymptotic limit, but the value is an astronomical a = 2101. Since the data are in millions, that corresponds to $2.1 \times 10^9$ inhabitants. Hopefully, that is out of the question.

The logistic fit has an asymptotic limit of a = 655.7. We recently passed the value of t where p(t) reaches a/2, namely c = 2018. So, the logistic model predicts that the long-term size of the U.S. population will be about twice its current value. Is that realistic? Probably not.
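The arithmetic behind that statement can be checked directly. In this sketch a and c are the fitted values quoted above, while b is a hypothetical placeholder, since the value of the logistic at t = c is a/2 for any b.

```matlab
a = 655.7;  c = 2018;          % fitted values quoted in the text
b = 0.03;                      % hypothetical; does not affect the value at t = c
logistic = @(t) a./(1 + exp(-b*(t - c)));

half  = logistic(c);           % equals a/2 = 327.85
p2020 = 331.449;               % 2020 census value from the table (millions)
ratio = a/p2020;               % about 1.98, so the limit is roughly twice 2020
fprintf('p(c) = %.2f   a/p(2020) = %.2f\n', half, ratio)
```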

Conclusion

The British statistician George Box once said, "all models are wrong, some are useful." This is true of the models of the U.S. census that I have discussed over the past sixty years.

Here is censusapp2024 after all its buttons have been pushed. The extrapolation date is set to 2040. White noise has been added to the data. The model is a fourth-degree polynomial with R^2 = 0.99. The R^2 value and the error estimates produced by errs account for errors in the data, but not in the model.
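A similar experiment can be sketched with polyfit and polyval alone. The noise level sigma and the random seed are hypothetical choices, and the error estimate here is the delta output of polyval, which is not necessarily what errs computes in the app.

```matlab
rng(0)                                      % fixed seed so the run is repeatable
t = (1900:10:2020)';
p = [75.995 91.972 105.711 123.203 131.669 150.697 179.323 ...
     203.212 226.505 249.633 281.422 308.746 331.449]';

sigma = 2;                                  % hypothetical noise level (millions)
pn = p + sigma*randn(size(p));              % data with white noise added

% Fourth-degree fit; mu centers and scales t, S carries error information.
[c, S, mu] = polyfit(t, pn, 4);
[p2040, delta] = polyval(c, 2040, S, mu);   % prediction and its error estimate

r  = pn - polyval(c, t, [], mu);            % residuals at the data points
R2 = 1 - sum(r.^2)/sum((pn - mean(pn)).^2);
fprintf('2040: %.1f +/- %.1f million,  R^2 = %.3f\n', p2040, delta, R2)
```

The delta value reflects only the scatter of the data about the fitted quartic. It says nothing about whether a quartic is a sensible model twenty years beyond the data.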

This particular model does a lousy job of predicting even twenty years in the future. Some of the other models are better, many are worse. Hopefully, their study is worthwhile.

Blogs

I have made blog posts about the census before, in 2020 and in 2017.

FMM

Predicting population growth is featured in Computer Methods for Mathematical Computations, by George Forsythe, Mike Malcolm and me, published by Prentice-Hall in 1977. That textbook is now available from an interesting smorgasbord of sources, including Google Scholar, Amazon, dizhasneatstuff, Abe Books, Internet Archive, PDAS, and WorldCat (Chinese).