Over the past two weeks, what began as a simple new usage example to include with my free, open-source evolutionary parameter-finding software wound up turning into something of a ghoulish personal challenge: to identify a statistically plausible mathematical model that would fit decently with the exploding numbers of reported Covid-19 cases in my beloved but increasingly broken country, the United States of America.
I’ve made lots of refinements to the code and its underlying modeling over that time, and discussed them, along with some projections (accompanied by important disclaimers, which also apply here), in my blog posts of March 19, March 22, and March 25. Along the way, I tried several different approaches to modeling this thing: the logistic growth model, the power law with exponential decay, and even a linear combination of the two, always accompanied by a small linear term.
What increasingly intrigued and frustrated me, though, was the
inability of any of those models to deal with a slight “flattening of
the curve” that has become apparent in the U.S. national numbers of
recent days. Fitting the logistic growth model xd(t,x)=x*r*(1-x/L)
to the data resulted in an unrealistically low value of L, because
that’s the best such a simple model could do with the recent reduction
in growth. The power-law model with exponential decay1
xd(t)=a*(t-ts)^n*exp(-(t-ts)/t0)
provided interesting results that
didn’t seem unrealistic and that proved pretty accurate at predicting
the next few days’ reported case numbers. But there was no evidence of
any realistic value being found for ts to implement the “decay” part
of the model, and so the values beyond just a few days out had a very
wide span of uncertainty.
I tried a linear combination of the two models, but that meant a lot of degrees of freedom. The combined model valiantly tried fitting itself with all its parameters to an anomaly in the US reported case data from 3/16 to 3/18, and offered me even less confidence than usual for making extrapolations.
But today an idea occurred to me, I tried it, and it worked. The result is a logistic growth model that makes a smooth transition from an earlier growth rate to a later, lower one:
flatten(t) = 0.5*rf*(1-tanh(1.1*(t-t0)/th))-rf+1
There are six parameters:
L is the maximum number of possible cases, which in the worst case is limited by the country’s population.
r is the initial exponential growth rate, before any curve-flattening becomes apparent.
rf is the fractional reduction in growth rate at the conclusion of the lower-growth (flattened curve) regime.
th is the time interval (in days) over which the middle half of a smooth transition occurs from the original higher-growth behavior to a fully flattened curve with the lower “flattened” growth rate r*(1-rf).
t0 is the time (in days after 1/22/20) at the middle of the transition from regular logistic-growth behavior to a fully flattened curve.
b is a very small constant number of new cases per day. This is mostly there to zero out the mean of the residuals so that I could get decent goodness-of-fit indicators for the model.[2]
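For the Python-inclined, here is a minimal sketch of how those six parameters fit together, assuming (as the parameter descriptions above imply) that flatten(t) multiplies the growth rate r in the logistic differential equation and that b adds a constant number of new cases per day. The parameter values are illustrative placeholders, not the evolved best-fit ones, and solve_ivp stands in for whatever integrator covid19.py actually uses:

```python
import numpy as np
from scipy.integrate import solve_ivp

def flatten(t, rf, t0, th):
    # Drops smoothly from 1.0 (early) to 1 - rf (late). The 1.1
    # constant is ~2*atanh(0.5), so the middle half of the
    # transition spans th days centered on t0.
    return 0.5*rf*(1 - np.tanh(1.1*(t - t0)/th)) - rf + 1

def xd(t, x, L, r, rf, t0, th, b):
    # New cases per day: logistic growth with a time-varying
    # rate, plus the small constant term b.
    return r*flatten(t, rf, t0, th)*x*(1 - x/L) + b

# Illustrative values only, not the evolved best-fit parameters
args = (3e7, 0.35, 0.32, 63.0, 6.0, 0.0)   # L, r, rf, t0, th, b
days = np.arange(0, 121)
sol = solve_ivp(xd, (0, 120), [1.0], t_eval=days, args=args)
print(int(sol.y[0][-1]))   # modeled cumulative cases at day 120
```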
Here is the result, my latest (and perhaps last?) revision to what I’ve been calling the Nightmare Plot. The image was produced by the latest commit of the covid19.py Python script, using data released this evening by Johns Hopkins University. It’s an important image, so go ahead and click on it to see it with full resolution, or even download it if you want.
Extrapolating forward (admittedly a perilous enterprise, but I’m curious and so are you), the model is projecting half a million cases one week from today, and over a million the week after that. The end of April is looking pretty scary indeed, with the potential for one in every hundred Americans to be reporting infection with Covid-19[3] and the numbers still continuing to climb.
Logistic Growth with Growth-Regime Transition
If it weren’t for the terrifying reality of what is represented by the numbers it spits out, I would be delighted with the beautiful simplicity and performance of this new model. It adds an important feature to the logistic growth model that is well accepted for population-limited reproduction. And it’s a feature uniquely suited to what’s happening right now with the Coronavirus.
This new feature I’m calling “growth-regime transition.” It smoothly reduces the growth rate from its initial unrestrained amount to a lower one. Thus it can account for recent efforts of non-idiot governors and their less ignorant citizenry to “flatten the curve” by staying home and being careful and apart when they have to go out.
And that’s exactly what it’s doing, to a remarkable degree of precision. Even as the numbers have multiplied fivefold since a week ago, the model tracks each day’s cumulative numbers to within 5%. Going back a week further to when there were about 1/70 as many cases (3,499) as today, the model is no more than 13% from each day’s actual numbers. Getting that close that many times in a row with just six parameters, and with the numbers doubling six times, feels a little spooky to me.
It’s not just my gut feeling that indicates this model is really onto something. The residuals (a fancy name for the difference between what the model said a day’s numbers should have been and what they actually were) are shown in the middle of the Nightmare Plot with some information about them. There is no statistically significant evidence that the errors are anything but normal random variation. There may be a little heteroskedasticity to them, but I think they pretty much look like residuals should. The square-root transform I’m applying to the modeled and actual data before calculating the sum of squared errors mostly seems to be dealing with the huge change in magnitude of the numbers over time.
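If you want to poke at residuals like these yourself, here’s the gist in a few lines of Python. The numbers below are made up purely for illustration (they are not the Johns Hopkins data), and the Shapiro-Wilk test is my stand-in for a normality indicator; covid19.py may compute different ones:

```python
import numpy as np
from scipy import stats

# Made-up illustrative numbers, NOT the JHU data
actual = np.array([4.0, 9.0, 25.0, 64.0, 144.0, 256.0, 529.0])
modeled = np.array([4.2, 8.7, 25.9, 63.1, 146.0, 252.0, 535.0])

# Residuals after the square-root transform described above
resid = np.sqrt(actual) - np.sqrt(modeled)
print("mean of residuals:", resid.mean())

# With only a handful of points, Shapiro-Wilk is a reasonable
# normality test; a large p-value means no evidence of
# non-normality in the residuals.
w, p = stats.shapiro(resid)
print("Shapiro-Wilk p =", round(p, 3))
```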
The middle plot includes a computation of the model’s corrected Akaike information criterion (AICc) score. It’s -8.62 for this “logistic growth with growth-regime transition” model, lower and thus better than for any of the other models I’ve tried. Operating on the same dataset, the AICc was around zero for the 7-parameter linear combination of logistic growth, power law, and linear terms (projecting 460,791 cases in one week). It was +3.60 for just the logistic growth plus linear model (projecting 384,264 cases in a week).
The power-law model with time-shifting, exponential decay, and a linear term did pretty well as runner-up to the new one. Its AICc was -4.58 with today’s Johns Hopkins data, and it is no more than 8% from the actual numbers going back to 3/25. It has nice random-seeming residuals with no evidence whatsoever of non-normality. But it only tracks recent data, quickly deviating from what actually happened when you go more than a week back. For what it’s worth, it projects 503,435 cases a week from now.
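For reference, the AICc computation behind those comparisons looks something like the sketch below. I’m showing the standard least-squares formulation; an implementation may differ by an additive constant, which cancels when comparing models fitted to the same dataset:

```python
import numpy as np

def aicc(sse, n, k):
    # AIC for a least-squares fit with n residuals and k fitted
    # parameters (up to an additive constant), plus the
    # small-sample correction term.
    aic = n*np.log(sse/n) + 2*k
    return aic + 2*k*(k + 1)/(n - k - 1)

# Lower is better; only differences between models fitted to the
# same dataset are meaningful. With equal SSE (made-up here), the
# 7-parameter model pays a penalty relative to the 6-parameter one:
print(aicc(sse=2.5, n=70, k=6) - aicc(sse=2.5, n=70, k=7))
```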
Another promising aspect of the model: Its parameter values lie in nice rounded distributions with respect to the sum of squared error, indicating that they (or most of them, L being a stubbornly hard one to identify at this still-early stage) are all earning their keep.
Here is the distribution of both parameters L and r for the logistic-growth aspect of the model, plotted against SSE (sum of squared error, after the square-root transform is applied to modeled and actual data values). The black dots represent parameter values that challenged some member of the population at some point and failed. The blue dots are values that were part of the population at some point in the 50-200 generations that the differential evolution algorithm ran, but got replaced by somebody better. And the red dots are members of the final population when the evolution algorithm terminated.[4]
It’s hard to make much of L except that it’s probably a bigger number than any of us would like. The initial growth rate r is very well defined, though, at an increase of around 35% per day.
The parameters rf (fractional rate reduction), th (middle-half transition interval), and t0 (transition midpoint) are all very nicely defined. The growth rate transitions to about 68% of its original value, with the middle (non-tail) half of the transition taking place over the course of a little over six days centered around 63 days after 1/22/20, or around March 24. The linear term b (actually, a constant number of new cases per day) is tiny, tending toward zero, and would be omitted except that it allows the residuals to be centered around zero without impacting the other parameters.
In Praise of the Hyperbolic Tangent Function
As I’ve explained in previous posts, I’m a retired electrical engineer and patent agent turned hobbyist open-source software developer. My programming work in regular life about a month ago (it seems more like a year) was wrapping up a two-year-long project that does advanced simulation of power semiconductor circuits. I’ve developed a model for power MOSFETs that evolves optimal values of its 40-odd parameters using an asynchronous version of the differential evolution algorithm that I developed for that project.
A function I have grown to know and love from that work is the hyperbolic tangent. It plays an important role in electronic circuit simulation software by providing smooth transitions from one state to another. You can’t expect software that does stepwise integration to behave well when you suddenly change a zero to a one. The integrator wants to follow a smooth path. If nature abhors a vacuum, then electronic circuits, real or simulated, abhor discontinuity. You have to go gradually from one voltage or number to another, even if “gradually” involves a transition time measured in the billionths of a second.
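Here’s a toy example of the kind of tanh-based smooth step I mean. The 1.1 constant is roughly 2*atanh(0.5), the same trick flatten(t) uses above, so the middle half of the change spans one width parameter:

```python
import numpy as np

def smooth_step(t, t_switch, width):
    # Rises smoothly from ~0 to ~1; infinitely differentiable, so
    # a stepwise integrator never hits a corner. The middle half
    # of the change happens over `width` time units, because
    # 1.1 ~= 2*atanh(0.5).
    return 0.5*(1 + np.tanh(1.1*(t - t_switch)/width))

print(smooth_step(np.linspace(0, 10, 6), t_switch=5.0, width=2.0).round(3))
```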
When it comes to “flattening the curve” of a deadly pandemic by incarcerating ourselves for weeks on end, my layman’s hunch was that the transition time would be in days. Well, it was.
According to what evolved as the best-fitting combination of the six model parameters, the rate of growth in reported U.S. Covid-19 cases makes the middle half of its transition, from a 35% daily increase down to around 23%, over the course of six days. The middle of that transition was on March 24. Look at the closeness of fit around that time, and afterward. There’s not even a tenth of a percent of error in what the model says yesterday’s numbers should have been. The model’s worst error in the past four days, as the numbers have doubled, has been 1%.
Previous Projections
On March 19, I made my first public post about my modeling efforts. That day, the cumulative number of reported cases of Covid-19 in the United States was 13,677. That is according to data released each day by Johns Hopkins University, on which I have been basing all of my modeling work.
When I refer to “cases” below, it is from this same data set, provided as a CSV file from Johns Hopkins. They’ve done the public a great service by making such a valuable resource freely and openly available. I’m releasing everything from my own 70+ hours of amateur work to the public as well, as a small way of following that spirit.[5] The Nightmare Plot is dedicated to the public domain; you can copy it without restriction, though I hope you will leave the identifier there that urges reference to my disclaimer language.
I was using a non-differential form of the logistic growth model, with four parameters: an upper reproduction limit L, the growth exponent k (I now call it r, because the programmer in me hates using k for anything but an index variable), plus two of my own, a time-shifting parameter t0 and a linear term a (I now call it b). The function for the modeled number of total cases, cumulative, on a given day t (units are in the number of days after 1/22/20) was f(t)=L/(1+exp(-k*(t-t0)))+a*t.
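In Python terms, that earlier model was just this (a direct transcription of the formula, keeping the original k and a names):

```python
import numpy as np

def f(t, L, k, t0, a):
    # Modeled cumulative reported cases at day t (days after
    # 1/22/20): logistic growth plus a small linear term.
    return L/(1 + np.exp(-k*(t - t0))) + a*t
```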
With those parameters evolved for a best fit with the 3/19 data, using the well-known differential evolution algorithm implemented by my ade Python package, that model predicted that there would be 429,414 cases today. It was 76% too pessimistic; we “only” have about 18 times as many cases as then, not 31.
On March 22, a week and a half ago when there were 33,272 U.S. reported cases, I posted again, this time about results from modeling with a linear combination of the regular logistic growth component, a power-law component with exponential decay, and the linear term (constant differential) b fitted to that day’s Johns Hopkins data. The combination model projected that there would be around 350,000 cases today. That was around 40% too high.
Thankfully, there has been some flattening of the curve since then.[6] I wasn’t really seeing it show up in my other models, though. My most recent blog post before the one you are now reading, from March 26, was still pessimistic, projecting 359,169 cases for today.
I was waiting and hoping for the model to keep being proven pessimistic as “fancy theoretical factors that I didn’t even bother trying to understand finally emerge in a few days (hell, how about tomorrow?),” and for that curve to finally get “its long-awaited downward bend.” That has happened to some extent, and hopefully my latest (and hopefully last!) attempt at a model addressing it will keep evolving toward an even lower growth rate as the open-source code is re-run on new data. Making pretty plots of that happy circumstance is left as an exercise for the Python-savvy reader.
A Final Note
Awful implications aside in a time of global crisis, there is a certain illicit thrill to doing my own naive research and thinking in a field supposedly best left to the experts, enjoying the faint flicker of possibility that I might be seeing the equations slide into place in a way that all their formal training somehow missed.
But, this is as good a time as any to repeat, yet again, that I have no expertise in biology, medicine, or infectious disease. See all the disclaimer stuff in my previous posts. I’m just a retired engineer who has always had a passion for modeling interesting data. I’m really looking forward to releasing a free, open-source Python package that does some really cool modeling of circuits using power MOSFETs, and I hope that makes the free Ngspice circuit simulator a lot more interesting to people.
Also worth repeating:
The model is just the integration of a first-order differential equation that can be extrapolated to predict reported U.S. cases of Covid-19 if the reported cases continue as they have for the past two weeks.
Be well. And, please, as a fellow citizen who is more worried right now about the health of our Republic than that of my body, remember when you vote in November how much your President fucked things up and how he will have doubtless continued to behave like an idiot wannabe dictator in the months ahead.[7] Personally, as I contemplate a near future consisting mostly of hiding away from friends, family, and neighbors to avoid a disease that otherwise could very well drown me in my own lungs, I am not quite yet sick of winning.
Notes
1. Modified by me with an additional ts parameter. It shifts the start of both growth and decay to some best-fit date well after the first reported case in the Johns Hopkins data set, on 1/22/20.
2. With every single model I’ve tried thus far, including this latest and hopefully final one, the parameter evolution favors smaller and smaller values of b tending toward zero.
3. The careful reader may notice that I always refer to “reported U.S. cases” or some variation of that wording. I will repeat yet again that I have no expertise in biology, medicine, or the spread of infectious disease, and so I try not to speculate on how many of our fellow citizens have actually gotten infected with this thing without it ever being reported. Again, I’m just a retired engineer who has spent years constructing nonlinear models of mostly electronic things and believes this one is pretty well grounded for doing the following, and only the following: predicting what will happen if the data continues as it has recently, especially as it has in the past week or two.
4. You can create these plots for yourself from the .dat files in the ade/examples directory, or from a .dat file you generate yourself. Just run the pv command that gets installed when you run pip install ade. There are some important command-line options; run pv -h for details.
5. My ade Python package, of which the covid19.py file is a part, is free to download, run, modify, and incorporate into other work under the terms of the Apache license. The terms of use for the Johns Hopkins data (and I expressly incorporate by reference their disclaimer language into this and all my other Covid-19 related blog posts, obviously with my name substituted for theirs) are as follows:
“This GitHub repo and its contents herein, including all data, mapping, and analysis, copyright 2020 Johns Hopkins University, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources, that do not always agree. The Johns Hopkins University hereby disclaims any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.”
6. At that point, I had not transitioned to modeling the number of new cases each day and then integrating the model’s first-order differential equation into a plot of cumulative cases each day. Instead, a nonlinear regression version of the logistic growth model was fitted to cumulative case numbers.
7. To say nothing of his pathological narcissism being the driving force for his every decision, from downplaying the significance of the pandemic just weeks ago and unleashing the wave of infection now emerging in MAGA country, to his demands that governors be nice to him before getting much-needed (and taxpayer-funded) federal assistance.