Thursday, March 19, 2020

Applying the Logistic Growth Model to Covid-19

The chief task in life is simply this: to identify and separate matters so that I can say clearly to myself which are externals not under my control, and which have to do with the choices I actually control.
—Epictetus, Discourses.
Dad, you’re just some guy who knows how to obsess over numbers. We have actual people who are experts at this stuff. Go and write it if you want, but don’t feel like you have to!
—Daughter of Ed Suominen, March 2020.
TL;DR: A very good fit between data obtained on March 19 from Johns Hopkins University and a logistic+linear growth model indicates there there will be over 50,000 reported cases of Covid-19 in the United States on March 25, over 300,000 cases one week after that (4/1), and several million cases by early April. See the Nightmare Plot and the Disclaimer below.
Update, March 20: There was a significant uptick in U.S. cases today, bringing us to a total of 13,677 according to Johns Hopkins data provided this evening. The increase is more than was expected this time, and the jump is significant enough that I would rather not publish results with the logistic growth model fitted to the latest data. Doing so results in projections that are considerably higher than what the rest of this blog post discusses. I will wait for tomorrow’s data and then perhaps consider a modification to the model if we get unexpectedly high numbers again, one possibility being to change the linear term to a power-law one of the form a*t^b. That might reflect the effect of better testing without forcing an artificially high value for the exponential model parameter k.1
———

On March 15, I wrote to some friends on Facebook about the latest results of putting my computer evolution code and skills to the task of finding parameters for the logistic growth model as applied to the number of U.S. reported cases of Covid-19. My belief–speaking as someone with expertise in fitting nonlinear models to data but not any kind of expert in the fields of biology, medicine, or infection disease–was that we would reach 10,000 cases on March 17, and would have reported cases numbering in the hundreds of thousands by March 29.2 By the end of April, I believed there would likely be millions of Americans being reported as having this virus. The rate of growth I thought would be unlikely to even start slowing down before April.

The prediction for the one date that has come and gone was not quite accurate. On March 17, there were 6,421 cases reported in the U.S., a little less than two-thirds of what the model said was most likely. But, in my defense, I ask you to look back at yourself enjoying a Sunday in mid-March. Would you have been truly untroubled just two days ago by hearing that six thousand of your fellow Americans would have a deadly respiratory infection that puts a fifth of its hosts into hospitalization? The model was pessimistic but not ridiculous.

Two days earlier, on March 13, I had introduced the project to my Facebook friends, prefacing the discussion with the acknowledgment that I’m not a biologist, or a doctor, or an infectious disease expert of any kind. Just a retired engineer and inventor who knows how to write Python code and has been working on modeling and simulation for over a year now.

After seeing predictions ranging from dismissive to hysterical about the Coronavirus, I saw a useful if sobering application example of a tool that I’d written specifically for my electronic simulation work, ADE. This new example would apply the fairly well-known “logistic growth model”3 to what has now bloomed into a pandemic.

Writing and running covid19.py forced me to some stark conclusions: In one week (3/20), I said, we would be likely to have over 20,000 U.S. cases. A week later (3/27), around 10 times that. “By the first few days of April we could very plausibly hit the one million mark. There will certainly be nobody saying this is just like the flu by then.”

The most I was–and am–willing to guarantee about those predictions, however, is that the red line in the plot I included with the post is a nearly optimal fit of the function f(t)=L/(1+exp(-k*(t-t0)))+at to the number of cases versus time provided by Johns Hopkins university, including an update made that evening.

So how did that prediction fare? With data updated Thursday evening (3/19), it now appears that there will be around 13,500 reported cases on March 20. Again the model is pessimistic, with 68% as many cases reported as expected to be. But, again, even the lower one is a huge number of people getting very sick all of a sudden. Were you expecting anything like that just five days ago?

If not, don’t feel bad. There was an excellent reason why you might have been surprised then at what is now clearly plausible to anyone looking at the plot below: Your President was telling you that it was no big deal.

With the data available on March 13, the logistic+linear growth model predicted there would be around 200,000 U.S. cases on March 27. That is a little more than double what the model’s current best fit says is most likely. Again, the model was and still may be pessimistic, predicting too many cases. But again, even 90,000 or so infected Americans–with probably 20,000 of them very sick and at least a thousand of those dying and thousands left with permanent lung damage–is a very big deal. And the virus will still just be getting started.

Yesterday, March 18, I released the first version of this blog post with the projection that there would be nearly 15,000 cases tomorrow (3/20). (Actually 14,538.) Once again, the model was a bit pessimistic; the current projection is for 13,549 cases, or 93% as much.4 And the longer-term projections are slightly lower, which is moving in the direction we all want, though only by a little bit.5

———

So, on March 19, here is what the admittedly pessimistic logistic+linear growth model now says, based on Johns Hopkins data updated this evening. The numbers are all in reported U.S. cases:

  • The day after tomorrow (3/21), there will be nearly 18,000 cases.

  • In one week (3/26), there will be more than 60,000 cases.

  • In two weeks (4/2), there will be over 400,000 cases.

  • We will reach the million mark between April 4 and 6.6

  • On April 11, there will be five million cases.7

  • The U.S. outbreak won’t even begin slowing down until mid-April at the earliest. In other words, there will be increasing numbers of new cases until probably around April 24 when there are finally fewer new cases one day than there were the day before.

  • The ultimate number of Americans being reported as being infected by the novel Coronavirus will ultimately reach several tens of millions.

This is some scary shit. And it may be even worse than it looks right now. What the data show, and the model is fitted to, is the number of reported cases; several days ago, some experts in the Seattle area were saying that the number of true cases in Washington State to be several times the number being tested and reported.8 Isn’t it reasonable to expect that to remain largely true? Our medical system will almost certainly become overloaded and the focus will simply turn to saving those lives that can be saved, as it has already in Italy.

But all this is just me talking, not the model. It makes no assumptions or judgments about the data. It doesn’t care if some political situation has caused fewer tests, or suddenly more tests. It doesn’t care about an idiot chief executive downplaying the danger and thus encouraging its spread (at least among his cult following), then abruptly deciding to join the adults in the room.9

The model simply predicts what will happen if the data continues as it has recently, especially as it has in the past few days.

That’s it. The interpretation and explanation is up to you.

The Nightmare Plot

Returning to the model and its neat little world of reported cases, here is a plot from a simulation I ran this evening, whose results I summarized in the bullet points above. It should make you listen very carefully to what you are being told by medical experts about social distancing, washing your hands, not touching your face, and staying the fuck home.

Now, this is one really important plot. It shows up way too small in this blog post for you to be able to see its important details. So please click or tap on it to open it as an image by itself.

Reported U.S. Covid-19 cases vs days since Jan. 22, 2020

You can also click here to see the plot with data from yesterday, 3/18. Open them in two tabs of your browser and then switch between to see how the model is holding up.

The upper subplot shows the best-fit logistic growth model in red, with the actual number of cumulative reported cases of Covid-19 in blue. The error between the model and the data is shown with each annotation. Look how small the residuals are compared to the exponentially rising numbers of cases. It’s a scarily impressive fit, even if the model has proved a bit pessimistic thus far.

The lower subplot shows the number of cases expected to be reported over time, slightly in the past and then extrapolating to the future. Fifty generations of running a differential evolution algorithm10 resulted in a 120-member population of combinations of parameters for the model. I deliberately terminated the algorithm sooner than I would otherwise so that there would be some visible variation in the extrapolations. The black dots show expected reported-case values with parameters from each member of the population, plotted at a bunch of random times from 3/12 to early April.

Significantly, the subplots both have a logarithmic y-axis. Exponential growth is linear when viewed through a logarithmic lense. When you see that straight line marching steadily upward toward those massive numbers, you really want all your modeling to wind up an embarrassing public failure.

Covering my Posterior

A better way to model this might have been to use a Monte Carlo analysis (e.g., with the Metropolis-Hastings algorithm) to obtain posterior probability distributions for the parameters, and then run a bunch of extrapolations based on parameters drawn from the distributions. But I had the tools handy for using ADE instead; I’ve been wrapping up a year-long project modeling power semiconductor devices using it with the free Ngspice simulation software. So this is what I have to offer, and it seems plenty illuminating to me.

But even without having posterior distributions to draw random variates from, what I am seeing in the scatter plots of value vs SSE for parameter L is not reasssuring. That parameter represents the total number of cases expected to ever be reported. And the data we have, with its steady logarithmic-scale march upward, is not satisfying my computer evolution algorithm that there is any upper limit before the nation’s entire population is infected.

SSE vs value: Parameter L (3/18 data)

Simply put, this thing is currently showing no signs of slowing down anytime soon. It is very possible, even likely, that these values of L are due more to genetic drift than any optimality-of-fit of the modeling they represent.11

A word of explanation of this scatter plot: The red dots hugging the left side of the plot are values (y-axis) of L in the final population of parameter combinations, plotted against the sum of squared error (x-axis) that those combinations had vs the data. The distribution of values seems to indicate that we shouldn’t hope for less than several million U.S. cases, and that we can’t count on any upper limit before the virus runs out of hosts to infect.12

There is a fair amount of correlation between the model parameter t0 and two other parameters, k and L. The parameter k represents how drastic the exponential behavior is; higher values cause things to blow up faster and thus start to reach limits sooner. Thus the highest values of k in the final population are associated with somewhat lower values of t0. The time when the number of new daily cases reaches its maximum happens a few days earlier.

Regarding the correlation between parameters t0 and L, a positive-valued one this time, it simply makes sense to realize that increasing new cases longer before you finally start to slow down the increases is associated with having more people ultimately infected.

Reasons Why Things Might Not Be So Bad

I want to emphasize that there is also the distinct possibility of L coming down by a lot within the next couple of days. (Unfortunately, I thought it would do that a couple days ago already, but it’s done the opposite.) It could still happen for a couple of reasons I can think of:

  • A curtailing effect becoming apparent soon from containment measures that just aren’t being noticed quite yet due to the incubation period.

  • A sudden recent increase in the number of reported cases due to testing finally being available. The rate of tested vs actual may be increasing, not just the absolute number of people testing positive. This would mean that the model is currently getting fitted to an overly dire set of parameters (especially L) due more to recent dramatic increases in reported cases from better testing than exponential spread of the virus.

And there are probably many more reasons I haven’t even imagined why that curve might start bending down sooner than in this simulation. Again, I need to emphasize my lack of biological or medical expertise. And this leads to . . .

The Disclaimer

First, I disclaim everything that John Hopkins does when offering the data on which this analysis is based.13 I’m pretty sure their lawyers had good reason for putting that stuff in there, so I’m going to repeat it. Except think “Ed Suominen” when you are reading “The Johns Hopkins University”, and this blog post when you read “the Website.”

This GitHub repo and its contents herein, including all data, mapping, and analysis, copyright 2020 Johns Hopkins University, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources, that do not always agree. The Johns Hopkins University hereby disclaims any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.

Second, I know very little about biology, beyond a layman’s fascination with it and the way everything evolved. (Including this virus!) I do have some experience with modeling, including using my ADE Python package to develop some really cool power semiconductor simulation software that I’ll be releasing in a month or so from when I’m doing the GitHub commit with this COVID-19 example. The software (also to be free and open-source!) has a sophisticated subcircuit model for power MOSFETs that evolves 40+ parameters (an unfathomably huge search space). It uses the same principle–differential evolution of nonlinear model parameters–as this unfortunate example we find ourselves in.

The model I’m using for the number of reported cases of COVID-19 follows the logistic growth model, with a small (and not terribly significant) linear term added. It has just 4 parameters, and finding the best combination of those parameters is no problem at all for ADE.

Remember, I am not an expert in any of the actual realms of medicine, biology, etc. that we rely on for telling us what’s going on with this virus. I just know how to fit models to data, in this case a model that is well understood to apply to biological populations.

Don’t even think of relying on this analysis or the results of it for any substantive action. If you really find it that important, then investigate for yourself the math, programming, and the theory behind my use of the math for this situation. Run the code, play with it, critique it, consider how well the model does or does not apply. Consider whether the limiting part of the curve might occur more drastically or sooner, thus making this not as big a deal. Listen to experts and the very real reasoning they may have for their own projections about how bad this could get.

It’s on you. I neither can nor will take any responsibility for what you do. I will say this, though: If you haven’t been sitting at home for a week straight already, wash your hands a lot and don’t itch that nose unless you really have to and you just got done with one of those hand washings. It’s a hot zone out there already. You don’t need my fancy modeling to see that.

Finally, if this is getting you down, please think of all the people who were living and loving and looking up at the blue sky even during the fall of Rome and the Black Death. We have a front-row seat on history being made. Yes, it is a worldwide biological cataclysm not seen since the days of polio, smallpox, and the Spanish Flu.

Yes, this really sucks. But you are alive, and there is so much left to see. A world in crisis can sometimes be an exhilarating world to live in, like a sharp fresh breeze tickling your face on a clear winter’s day. Your grandparents saw cold bracing days like these, and were called the Greatest Generation for the way they responded.

To anyone in despair: Leaving the show early would be a sad waste of the seat that was reserved for you. Stick around. Do what you can to make your life a little better, and the lives of those who love you and whom you love. Allow your worries and fears and sadness to seep into the gentle awareness that an entire world now worries with you.

And there is a bit of good news to share, though it may be cold comfort for my fellow citizens in the U.S.

South Korea is fully in its containment phase, well past its t0 that took place over two weeks ago. They followed the logistic growth model all the way to the containment phase. Look at the two curves and annotated +/- numbers in the upper subplot! The lower subplot zooms in on a narrow range of case numbers around 8,000, where it is unlikely to increase much further.

Reported Covid-19 cases in South Korea vs days since Jan. 22, 2020

Italy’s numbers should start leveling off significantly in the next week. They reached t0 yesterday, according to my best fit of the logistic+linear model with this evening’s data. They appear to be headed for around 70,000-80,000 cases, or about 1% of their population. Even that doesn’t sound too bad.

Reported Covid-19 cases in Italy vs days since Jan. 22, 2020

Be well. And stay home.

Notes


  1. Ng Yi Kai Aaron pointed out an article referencing a paper (Ziff, Anna L. and Ziff, Robert M., “Fractal kinetics of COVID-19 pandemic,” preprint available online) suggesting that the data from China’s experience with the virus “are very well fit by assuming a power-law behavior with an exponent somewhat greater than two.” 

  2. See the important section entitled Disclaimer

  3. See, e.g., https://services.math.duke.edu/​education/ccp/​materials/diffeq/​logistic/logi1.html

  4. All these significant figures are only used for comparison purposes. It is of course silly to put more than a couple of significant digits on extrapolations this uncertain. 

  5. You may think that’s progress, but I consider it disappointing (as a human being with a pair of lungs, not as a data modeler) that the model is tracking the model’s exponential growth phase so closely, and that t0 seems to remain far in the future. 

  6. This projection remains unchanged from the one done with data from yesterday (3/18). 

  7. With yesterday’s data, I thought we would reach the five million mark a day earlier, 4/10. 

  8. Trevor Bedford, for example, a scientist at the Fred Hutchinson Cancer Institute in Seattle “studying viruses, evolution, and immunity,” has mentioned a 10:1 true vs reported cases ratio. https://twitter.com/​trvrb/status/​1238643292197150720?s=20.

    “I could easily be off 2-fold in either direction,” he Tweeted on March 13, when there were just over 2,000 cases being reported in the U.S., “but my best guess is that we’re currently in the 10,000 to 40,000 range nationally.” 

  9. Those who follow me on Facebook know how much contempt I have for the incompetent, malicious, destructive asshole who found enough bigots and morons in a key combination of states to make it past the Electoral College. No, I will not mince words. If you still support Donald Trump– knowing that he dismantled the office that Obama had set up to address pandemics, that he fired people with expertise to deal with this, that he downplayed and denied the reality of the problem until just days ago–then I think there is something deeply wrong with you.

    In my previous post, I asked, “Do many of his supporters even realize how much they’ve been played?” I quoted the self-confessed narcissist Sam Vaknin, who wrote that “the narcissist abuses people. He misleads them into believing that they mean something to him, that they are special and dear to him, and that he cares about them. When they discover that it was all a sham and a charade, they are devastated” (Malignant Self-love: Narcissism Revisited, Narcissus Publications, 2015, p. 69.)

    So far the deranged narcissist’s base of support has proven remarkably resilient to plain facts about how much of a sham it really is. I hope that changes very soon. 

  10. Using my free, open-source Python package ade, Asynchronous Differential Evolution. 

  11. Genetic drift is an evolutionary phenomenon where a population “drifts” certain bits of its genetic code toward what appears to be an optimal range when in reality it is just the survivors propagating a consensus that has no actual selection value. I’ve seen it happen with my computer evolution of simulation model parameters just like it happens in nature.

    The final population of L with 3/19 data ranges from around 20,000,000 to more than the population of the U.S., where the logistic model would obviously run into a stark limitation. Not different enough to show an updated plot. 

  12. This scatter plot doesn’t show a real probability distribution, as a Monte Carlo analysis would. But it does seem instructive, to represent a confidence interval of sorts. I’m guessing that it is no narrower than a posterior distribution obtained from a random walk with well-informed priors. On this question, however, my modeling knowledge reaches its current limits. 

  13. The GitHub repo is at https://github.com/​CSSEGISandData/COVID-19