Friday, April 10, 2020

Marching Towards a Million

[T]he feelings of knowing, correctness, conviction, and certainty aren’t deliberate conclusions and conscious choices. They are mental sensations that happen to us.
—Robert Burton, On Being Certain
Update, 4/12: The curve is continuing to flatten a bit more than the model extrapolates from having its six parameters fitted to each day’s data. (I certainly have no objection to things coming in on the low side, even if it reveals some room for further improvement in the model, or just some random-walk stochastic behavior that simply can’t be chased via curve fitting and extrapolation, even with more model parameters.) There’s not been enough of a difference to go through this post and do a bunch of edits, so I’ve just added a few footnotes and this link to the latest version of the Nightmare Plot. Don’t get complacent now, OK?

For just over a week now, I’ve been running and re-running the “logistic growth with growth-regime transition” model I’ve developed for reported cases of Covid-19, evolving populations of individuals whose digital DNA consists of six parameters for that model:

xd(t, x) = r*x*(1-x/L) * flatten(t) + b
flatten(t) = 0.5*rf*(1-tanh(1.1*(t-t0)/th))-rf+1

Two of the parameters, L and r, are for the conventional logistic growth component of the model. L is the maximum number of possible cases, which in the worst case is limited by the country’s population. The other parameter r is the initial exponential growth rate, before any curve-flattening becomes apparent.

The “growth-regime transition” component of the model, implemented by flatten(t), has three parameters of its own. There is a fractional reduction rf in growth rate by the conclusion of the transition to a lower-growth (flattened-curve) regime. My modification to the conventional logistic-growth model uses the tanh function to smoothly transition from the original growth rate r to a lower “flattened” growth rate r*(1-rf) over a time interval th (in days). The transition interval is defined as the middle half of the full transition between growth rates, i.e., from the point where the old rate still carries 75% of the weight down to 25%. The midpoint of the transition is defined by the third parameter t0, in days after 1/22/20.

Finally, a very small constant number of new cases per day is included as the sixth parameter, b. It does manage to “earn its keep” in the model, by allowing the mean of the residuals to be zeroed out more flexibly.
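To make that concrete, here is a minimal Python sketch of the model, with the parameter names (L, r, rf, t0, th, b) taken straight from the equations above. The function signatures are just my illustration, not the actual code from covid19.py:

import numpy as np

def flatten(t, rf, t0, th):
    # Smooth transition from 1.0 (full growth rate) down to 1 - rf,
    # centered at day t0; th spans the middle half of the transition.
    return 0.5*rf*(1 - np.tanh(1.1*(t - t0)/th)) - rf + 1

def xd(t, x, L, r, rf, t0, th, b):
    # New reported cases per day: logistic growth damped by the
    # growth-regime transition, plus a small constant background b.
    return r*x*(1 - x/L)*flatten(t, rf, t0, th) + b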

Here is what the model is saying today, with Johns Hopkins data from this evening (4/10, click here for the latest one with 4/12 data).

April 10, 2020: Reported U.S. Covid-19 cases vs days since 1/22

Extrapolating forward (admittedly a perilous enterprise, as I’ve said before with endlessly repeated disclaimers and caveats, applicable yet again here), the model is projecting three quarters of a million cases[1] one week from today, and over a million the week after that.[2] The model’s projection, both now and as it stood on 4/5, is that there will be around a million and a half Americans, one in every two hundred, reporting infection with Covid-19 in early May, with the number still climbing faster every day.[3]

The curve is bending downward a bit, yes, but things still look pretty grim.

Past Performance

You’ve probably heard the phrase “past performance does not indicate future results,” and that’s true if something happens (as it often does) that’s not accounted for by the model. Life is messy and complicated, and that includes pandemics. But shitty performance does tell you not to bother looking any further. And that’s definitely not what’s been happening with my little model.

With parameters evolved to yesterday’s Johns Hopkins data (4/9), it had projected today’s increase in the number of cumulative cases to be 32,920 instead of the 35,098 new cases that actually got reported today. That was off by about 6%. (I calculate the error from the projected vs actual increase from the most recent known value, because the model is anchored to that point and can only be given credit or blame for what it projects from that last known point.)
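For concreteness, here is that error calculation spelled out with today’s numbers; the little helper function is mine, for illustration only:

def increase_error(projected, actual, last_known):
    # Compare the projected vs. actual increase from the last known
    # cumulative count, since the model is anchored to that point.
    return abs(projected - actual) / (actual - last_known)

# 4/9's last known cumulative count was 496,535 - 35,098 = 461,437
print(increase_error(projected=461437 + 32920,
                     actual=461437 + 35098,
                     last_known=461437))   # 0.062, i.e. about 6%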

With data from the day before yesterday (4/8), it projected yesterday’s number of new cases at 32,385. There were 32,305 new cases yesterday, an error of 0.2%.

And with data from 4/7, the model evolved to parameters that had it projecting 30,849 new cases the next day. There were 32,829 new cases on 4/8, a 6% error.

On 4/5, the model projected 32,346 vs the 32,133 that there were on 4/6, an error of just 0.7%.

The model of course does a little worse when it looks further ahead. You’d expect that of any naive model (i.e., one not theoretically informed beyond “the growth rate is going down”) extrapolating exponential growth from six empirically optimized parameters. And it’s certainly not anything I’m ashamed of.

On 4/3, the projection was for there to be just over 600,000 cases today, compared to the 496,535 we have had reported at this point in the U.S. Quite a bit off, but remember, that was looking forward a full week.

On 4/5, it was for there to be 510,425 cases today, a total error of about 8%, again with the error measured from the projected vs actual increase, not the absolute number.[4] On 4/7, the model projected there would be 486,367 cases as of today, and that was off (in the increase) by 10%.[5]

Evolution in Action

(In more ways than one, unfortunately.)

I’ve been wanting to write a little about the whole process of computer evolution that I use to fit the model’s six parameters to the time-series data. It begins with a population of 240 simulated organisms (not a virus!), digital “individuals” whose randomly chosen[6] DNA consists of the values of those six parameters, within predefined bounds.

After working on this model for over a week, I’ve refined each of those bounds to a reasonable range of possible values. Updating those ranges when the model sometimes fails to find a convincing best fit is my sole remaining human-engineering activity, now that the model is designed and the code implementing it is working.
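As footnote 6 explains, the initial population isn’t chosen with uniform randomness; it’s laid out with a Latin hypercube so the six-dimensional search space gets covered with minimal clustering. Here is a rough sketch of that kind of setup, using NumPy directly rather than ade’s actual code, and with placeholder bounds rather than the ones I actually use:

import numpy as np

def latin_hypercube_population(bounds, Np=240, rng=None):
    # bounds: one (low, high) tuple per parameter. Each parameter's
    # range is cut into Np equal slices; every individual draws its
    # value for that parameter from a different slice, so the initial
    # population spreads out with minimal clustering.
    rng = rng or np.random.default_rng()
    pop = np.empty((Np, len(bounds)))
    for k, (lo, hi) in enumerate(bounds):
        edges = np.linspace(lo, hi, Np + 1)
        samples = rng.uniform(edges[:-1], edges[1:])
        pop[:, k] = rng.permutation(samples)
    return pop

# Placeholder bounds for (L, r, rf, t0, th, b), for illustration only
population = latin_hypercube_population(
    [(1e6, 3e7), (0.1, 0.5), (0.2, 0.9), (50, 90), (5, 40), (0, 50)])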

Each of those 240 individuals is given a chance to show how fit its DNA is by running a backwards integration of the model from the last known data point. The model, you may recall from reading my previous blog post, is for the number of new cases each day, not the cumulative number of total cases ever.

The model is a differential equation: xd(t, x), not x(t). So, to have it project the cumulative number of cases x(t), I integrate the differential equation forward or backward from a fixed point represented by the most recent known number of cases, a point to which it is anchored.[7]
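A minimal sketch of that anchored integration, using SciPy’s general-purpose ODE solver rather than whatever ade does internally, and reusing the hypothetical xd() sketch from above (the function name and arguments here are mine):

import numpy as np
from scipy.integrate import solve_ivp

def integrate_cases(params, t_anchor, x_anchor, t_eval):
    # Anchor at the most recent known point (t_anchor, x_anchor) and
    # integrate xd(t, x) out to the requested days: backwards over the
    # fitting window, or forwards to extrapolate projections. t_eval
    # must be sorted in the direction of integration (decreasing for a
    # backwards run, increasing for a forwards run).
    L, r, rf, t0, th, b = params
    sol = solve_ivp(lambda t, x: [xd(t, x[0], L, r, rf, t0, th, b)],
                    t_span=(t_anchor, t_eval[-1]),
                    y0=[x_anchor], t_eval=t_eval)
    return sol.y[0]

# E.g., fit window backwards (4/10 is day 79 since 1/22), then a week ahead:
# past = integrate_cases(params, 79, 496535.0, np.arange(79, 43, -1))
# future = integrate_cases(params, 79, 496535.0, np.arange(79, 87))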

The modeled number of reported cases is compared to the actual number reported, for each day going back what is now a full five weeks’ worth of data.[8]

The fitness of each individual is measured as the sum of the squared errors (SSE) between each day’s modeled number of reported cases (the value the model would expect, integrating backwards) and the number of cases there actually were as of that date. The two figures are compared only after a square-root transform has been applied to each. This limits how heavily the more recent, larger daily case counts weigh in the fitness calculation relative to earlier ones.
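In code form, that fitness evaluation might look something like this; again a sketch of my reading of it, not ade’s actual implementation, and it reuses the hypothetical integrate_cases() from above:

import numpy as np

def fitness_sse(params, t_days, reported_cases, t_anchor, x_anchor):
    # t_days and reported_cases: the fitting window, in increasing
    # date order. Integrate the model backwards from the anchor point,
    # then sum the squared differences between modeled and actual
    # reported cases for each day. The square-root transform keeps the
    # recent, larger counts from dominating the fit; lower is fitter.
    modeled = integrate_cases(params, t_anchor, x_anchor, t_days[::-1])[::-1]
    modeled = np.clip(modeled, 0.0, None)   # guard against unphysical negatives
    residuals = np.sqrt(modeled) - np.sqrt(reported_cases)
    return float(np.sum(residuals**2))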

Then evolution gets underway, with each population member being challenged in turn by a new individual spawned from a combination of not two but four population members. Each of these mutant offspring gets the model run with its parameters through the same integration and SSE calculation. If the challenger is better (lower SSE) than the population member it is challenging, it replaces that member. When all members of the population have received their challenge, possibly having been replaced, evolution proceeds to the next generation.

The whole spawning process is worth a moment of technical discussion. Differential evolution uses an interesting[9] sort of mathematical equivalent to some kind of alien four-way sex to create challengers. The trial individual is formed from “crossover” between a “target” (a rough DE equivalent of a mother) and a “donor” individual (the closest thing it has to a father). The donor is formed from the vector sum of a base individual and a scaled vector difference between two randomly chosen other individuals, all distinct from each other and from both the target and the base.

That is hard to follow in print, but these equations might help, with F being the factor that scales the difference vector. The many-tentacled alien infant we want from this tangled act is ic, the individual challenging:

id = ib + F*(i0 - i1)
ic = crossover(it, id)

The crossover consists of giving each parameter of the donor individual a chance (usually a very good chance) to appear in the challenger, as opposed to using the target’s parameter. Basically, think of a higher crossover rate as favoring paternal rather than maternal inheritance. The default value used by my program (apologies for what is becoming an awkward analogy) is 0.7.
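Here is roughly what that spawning step looks like as a generic differential-evolution sketch, not ade’s actual code. CR is the crossover rate just mentioned; F, the factor that scales the difference vector, is a typical value I’m assuming for illustration:

import numpy as np

def spawn_challenger(population, k_target, F=0.8, CR=0.7, rng=None):
    # Pick a base individual and two others, all distinct from each
    # other and from the target. The donor is the base plus the scaled
    # difference vector; crossover with the target yields the challenger.
    rng = rng or np.random.default_rng()
    Np, Nd = population.shape
    others = [k for k in range(Np) if k != k_target]
    k_base, k0, k1 = rng.choice(others, size=3, replace=False)
    donor = population[k_base] + F*(population[k0] - population[k1])
    # Each parameter comes from the donor with probability CR (paternal),
    # otherwise from the target (maternal); one parameter is forced to
    # come from the donor so the challenger is never a clone of the target.
    from_donor = rng.random(Nd) < CR
    from_donor[rng.integers(Nd)] = True
    return np.where(from_donor, donor, population[k_target])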

I refine the parameter bounds as needed and sit back while my ade Python package (of which this whole Covid-19 modeling is a single example file, covid19.py) has 75 generations of these simulated individuals spawn and fight it out on a virtual Darwinian savanna. The software dispatches SSE-calculating jobs asynchronously to worker Python interpreters across the cores of my six-core CPU. It takes about two minutes. The software produces a plot with the model’s curves in red and what actually happened in blue. It’s what I’ve been calling the Nightmare Plot.

Singing in the Apocalypse

The Nightmare Plot really is horrifying when you think about what those numbers represent. Their relentless upward march, slowing but by no means stopping, is making everyone’s life suck, including my own. The novelty of this whole apocalyptic survival thing is starting to wear off just a bit, even for me.

Maybe I’m a terrible person, but dammit I can’t help experiencing a bit of pride, too. This “little model that could” is proving to be the no-pay, no-fame, no-acceptance academic-mathematical equivalent of, say, some college undergraduate inventing an optimal radio receiver frequency arrangement, now in use by the circuitry of your smartphone, as part of an independent senior project he decided to work on weekend after weekend a quarter century ago. (This hypothetical individual never was much for working in groups.)

So, take it or leave it, folks, you’ve got yourself a six-parameter mathematical model for the number of reported cases of Covid-19. It was never “published” in some elite-accepted overpriced package of specialty information. It wasn’t part of any network of peer reviewers. (Like the aforementioned loner radio geek, I’ve never been one for playing in groups.)

But it does appear to work.

———

A friend of mine told me today that he is selfishly rather enjoying this whole situation. He now has lots of time to learn things on his own that he’s been wanting to work on, time for “driving out to pretty places to take pictures and go for walks in the woods.”

He’d rather be back doing what he does in person. But, he admits, there are upsides.

I assured him that it’s 100% OK to enjoy those upsides, even as I admitted that I find it a little weird to derive satisfaction from successfully modeling these awful numbers. But I had the benefit of some reassurance in this area yesterday, while talking about this very topic with a friend of mine whose life’s work revolves around how people think and feel about things.

He told me he can’t be as helpful to others in his profession if he isn’t taking care of himself, and that means enjoying life in spite of or even at times because of what is otherwise a horrible situation. Of course he’d rather not be in it, nor would I or you, dear reader with a delicate pair of lungs of your own. But he is, and you are, and I am, and so let’s take what good there is to be had.

Smile and sing and laugh, and take pride in the work that you now have. Even through the Apocalypse.

Notes


  1. With data updated as of 4/12, the projection is now 700,000 by 4/17. A bit lower, and the curve is continuing to flatten, but not by much.

    The careful reader may notice that I always refer to “reported U.S. cases” or some variation of that wording. I will repeat yet again that I have no expertise in biology, medicine, or the spread of infectious disease, and so I try not to speculate on how many of our fellow citizens have actually gotten infected with this thing without it ever being reported. Again, I’m just a retired engineer who has spent years constructing nonlinear models of mostly electronic things and believes this one is pretty well grounded for doing the following, and only the following: predicting what will happen if the data continues as it has recently, especially as it has in the past two weeks.

  2. With data updated as of 4/12, the projection is now for a bit less than 900,000 cases by 4/24. One redditor cleverly observed that my modeling’s million-case projections have been like those for fusion energy. (The saying is that fusion is 30 years away and always will be.) I won’t dispute that observation at this point; each day’s new data for the past week or so has pushed that projection outward a bit, though never making it look any more implausible to reach eventually. 

  3. Let’s call it mid-May now, given the recent additional curve flattening (4/12 data). 

  4. Although the difference between the two ways of computing the error (from the increase vs. the absolute number) matters less with such a large increase from the 4/5 last-known number of 337,072. In case you really want to know, the absolute error from the model projecting forward five days was 2.8%. 

  5. The actual number of cases as of 4/12 from Johns Hopkins’ daily data was 555,313, an increase of 58,778. On 4/10, the model was projecting 567,306 cases, or a projected increase of 70,771. The error in the increase was 20% over the two-day interval, or 10% per day. Not as good as previous days’ next-day predictive performance, but not terrible, either. And since there will always be an error when extrapolating from a curve fit to data having a random component, I’m happy the data is lower than projected and not higher, because I have my own delicate pair of lungs that I’ve grown fond of, too. 

  6. Actually, they’re not quite chosen with uniform randomness: The default is to set up the initial population using a Latin hypercube. That way you get initial pseudorandom parameter values, but with minimal clustering. You want the six-dimensional search space to be explored as fully as possible. 

  7. This is known as an “initial value problem,” the initial value here being the last known number of cases. You can go either direction in time from the initial value. For fitting the model parameters, my algorithm goes backwards from the most recent data. To extrapolate forward and make projections, it goes forward from the same “initial value.” 

  8. Reported case numbers before March 5 are omitted from both the curve fitting and the plots. The modeled (red) curve deviates from the historical data when you go back earlier than that, and I’m not sure a good fit that far back in the history of this thing is relevant to what’s happening now. 

  9. You claim you wouldn’t find alien four-way procreative sex interesting? Well, I don’t believe you.