
Sunday, July 19, 2020

Balancing Harms

Why our teenagers won’t be on their high school campus this fall.

Since early March, I’ve been watching the relentless spread of SARS-CoV-2 throughout the United States with morbid fascination and a grim determination not to expose myself or my family to this novel virus, which looked plenty dangerous even before its full effects were understood. Now that we are learning more about the long-term suffering and damage COVID-19, the disease it causes, can do–in younger people as well as those my age and older–this determination has not wavered.1 There is nothing more important in my life right now than protecting my family and my own body from this virus.

We’ve been holed up on our rural property with only occasional trips to get curbside pickup or have open-air driveway visits with friends. During these long months of quarantine, I’ve combined my technical background in signal processing and programming with a long-standing interest in math and data modeling to get myself a uniquely clear view into the situation with the COVID-19 pandemic.2 Yes, that’s a bold claim, but there is a lot of work and I think some pretty informative results to back it up.

Modeling the Spread of COVID-19

This work was to develop a nonlinear mathematical model for the number of reported cases of the disease that fits remarkably closely to what we’ve seen for the past eleven weeks in the U.S. as the curve flattened and then started heading upward again.3 For my own personal interests, I’ve applied the model to reported case data from my own Washington State and the most populous county out here in Eastern Washington where I live. For some context, I’ve also used it to consider how badly things are going in neighboring Idaho, and to stare at the dumpster fires raging in states like Arizona, California, Florida, and Texas.

This plot shows what’s happened during May, June, and now half of July in the U.S. overall, and how well the model fits with the cases reported for each of the past 79 days. The data is generously provided to the public by the New York Times, based on reports from state and local health agencies.4 I hope you will understand from these plots why I dared to see myself as having a uniquely clear view of our pandemic-documentary version of American Horror Story.

U.S. Reported COVID-19 cases vs days elapsed since 1/22

The upper subplot shows the number of reported cases for the entire U.S. since the first of May. There are two curves plotting cumulative case numbers on a logarithmic scale vs the number of days that have elapsed since the first case was reported in this country back on January 22.

The blue curve shows what was actually reported, according to the New York Times data. The red one shows what my model expects was reported for each day in the past, looking backwards from the most recent date (to which it is fixed as an anchor point) with its ten model parameters fitted to the data. The largest error between the data and the model’s fit to it is just 2.2%, way back on May 11 when there were just over a third as many cases as now.5

Under the Hood

You can skip this section if you don’t love math. Your loss.

The skinny subplot below the big one shows the error between the model’s expectations and reality.6 What you want to see in such a residual plot is a relatively even distribution of modeling error vs the amount being modeled. This one looks about as good as you could ask for, especially when you consider the normality test’s p value of 0.38.

That means that the “leftovers” from modeling the past data are not much different from what you would get from normally-distributed random noise.7 Because noise itself can’t be modeled, you can have some confidence that the model is accounting for nearly everything except randomness when it is about as probable as not that residuals this random would appear if the error really were just noise.

It may be more instructive to consider the opposite case, where there is a low p value for the normality statistic. Say, 0.02 instead of the actual 0.38. That low p value would indicate that re-running 50 experiments (obviously not possible with a natural experiment like the one we are running with the worst pandemic in a century) would get you residuals that far from normal random error only once on average. That would be a pretty good indication that your model isn’t accounting for some noteworthy phenomenon.8 But that’s definitely not what’s happening here with the fit of this model to reported cases of COVID-19 in the United States, at least not as of July 16.
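
If you’d like to see what that kind of normality check looks like in code, here is an illustration using SciPy’s D’Agostino-Pearson test on stand-in residuals; the actual test covid19.py runs may differ, and the residuals array here is just fabricated noise:

import numpy as np
from scipy import stats

# Stand-in residuals: what the "leftovers" would look like if they really
# were nothing but normally distributed random noise (79 days' worth).
residuals = np.random.normal(size=79)

# Null hypothesis: the residuals were drawn from a normal distribution.
# A large p value means there is no evidence of non-normality.
statistic, p = stats.normaltest(residuals)
print(f"normality p = {p:.2f}")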

So my “logistic growth with multiple growth regimes” model is accounting for what we see in the data. It is a naive curve-fitting model (sketched in code just after the list below) that does not assume anything beyond the following:

  1. The number of reported cases of COVID-19 in the U.S. is following a logistic growth model with L (the ultimate upper limit) fixed at 1/4 of the U.S. population,9

  2. but with three separate growth regimes (3 parameters) having smooth transitions between them (4 parameters),

  3. and with a sinusoidal component that imposes a weekly variation (2 parameters) on the current growth rate for each day,

  4. plus, finally, with a fixed number of new cases per day (1 parameter), to allow the model to only account for reported cases on or after May 1.
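
Here is that sketch: a simplified Python rendering of the sort of differential equation those ten parameters define. The parameter names and the exact tanh form of the regime transitions are shorthand for this essay, not a copy of covid19.py, which differs in its details:

import numpy as np

def growth_rate(t, r1, r2, r3, t1, s1, t2, s2):
    # Blend three growth regimes with smooth tanh transitions; t1 and t2 are
    # the transition midpoints (days since 1/22), s1 and s2 their widths.
    r = r1 + 0.5 * (r2 - r1) * (1 + np.tanh((t - t1) / s1))
    r += 0.5 * (r3 - r2) * (1 + np.tanh((t - t2) / s2))
    return r

def xd(t, x, params, L):
    # Daily new reported cases: logistic growth toward the fixed ceiling L,
    # with the growth rate modulated by a weekly sinusoid of amplitude aw and
    # phase tw, plus a constant trickle b of new cases per day.
    r1, r2, r3, t1, s1, t2, s2, aw, tw, b = params
    r = growth_rate(t, r1, r2, r3, t1, s1, t2, s2)
    weekly = 1.0 + aw * np.sin(2 * np.pi * (t - tw) / 7.0)
    return r * weekly * x * (1.0 - x / L) + b

With L fixed at a quarter of the population and the most recent day’s count serving as the anchor, those ten values are everything the differential evolution algorithm has to find.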

The best-fit curve has an artificially high initial growth rate r1 of 4.8966 (nearly 500% per day!), which the differential evolution algorithm arrives at because it isn’t actually looking at numbers before May 1. It just wants to fit the data between May 1 and now as closely as possible, and it found the way to do that was to jack up the growth rate for all those unseen days. It’s doing its job, and that’s fine; we don’t care about that earlier growth rate for this analysis, just what is happening now.

Following the model forward,10 we soon transition to a relatively sedate growth rate r2 of 1%. The transition occurs over a 32-day window (i.e., the time it takes for half the smooth transition between growth rates) defined by s1 that is centered on Feb. 6 (16 days after the first U.S. case), defined by t1. Then the red-state governors reopen things, followed by people living it up on Memorial Day weekend. And we wind up with a 5/30 midpoint between the second growth regime, where the curve had been flattening nicely, and our current scarier one r3, where SARS-CoV-2 reminds us who it is with a 1.9% increase in cases per day.

Fortune Telling with Function Fitting

This model of mine is a naive empirical one, using a cool evolutionary algorithm to fit a curve to data. It’s a very elegant curve, constructed from a first-order differential equation with multiple growth rates and smooth transitions and weekly variation, though that doesn’t make it a basis for much extrapolation.

But it certainly does match up to what’s been happening. The error between modeled and actual values is 1% or less going back four weeks, and 2.1% or less for the eight weeks before that–forever in pandemic time. That’s 79 data points fitted with just ten free parameters. As with a shoe, if the model fits, wear it.

It fits well enough that I will try out some extrapolation anyhow, despite having just acknowledged the limitations of the model for prophetical purposes. The following plot shows what the model is projecting for U.S. reported cases two weeks into the future.

U.S. Reported COVID-19 cases vs days elapsed since 1/22

The upper subplot of this plot shows how many reported U.S. cases the model projects under the assumption that the data will continue to reflect a 1.9% daily growth rate, with an 11% weekly variation imposed by reporting limitations over the weekends.11 That is of course not an entirely safe assumption to make, no matter how closely the model fits to past data, and I have mostly limited my plots to just a couple weeks of extrapolation.12

The bottom two subplots show the new cases being reported each day, first as a percentage of the cumulative cases already reported as of that day and then (bottom subplot) as a simple number of new daily cases.

We have gotten jaded to the horror of this pandemic over the past several months, but take another look at that number in the upper right. It’s a big one: nearly five million people testing positive in August. And it’s increasing fast.

In the middle plot, we are seeing one fiftieth more being infected with each passing day, with weekly variation due to testing and people being off on weekends. And on the bottom plot, sixty thousand or so new cases each and every day, all carrying the risk of a person losing their health and vigor for weeks if not months, in some cases possibly for a lifetime. Occasionally it’s a lifetime that the virus cuts short.

Yet people are complaining about wearing masks when they go in a store.

Dr. Anthony Fauci expressed the belief (or perhaps just hope) that the number of daily reported cases would never reach 100,000.13 I fully accept that Dr. Fauci has a wealth of knowledge and insight that is not reflected in my naive curve-fitting to the time-series data. But from what I’m seeing on that bottom red curve, it’s hard for me to see how we can avoid that grim milestone.

To do so, we would need a significantly lower daily growth rate during the coming weeks. It would have to go down enough to cause the value of parameter r3 to decrease, perhaps enough to justify yet another growth regime in the model and an additional three parameters. There is of course no way a naive model builder can know that in advance; this is nonlinear curve fitting with a modest and limited amount of extrapolation, not prophecy.

Track Record

This essay will not dwell on the model’s track record; I’ve done plenty of that in previous blog posts. I’ll just offer a couple of observations from back-testing the model, along with a long footnote about some reddit critics, and leave it at that.14

With data from the day before yesterday (7/16), the model projected that today’s New York Times cumulative cases number would have been 3,738,827, an increase of 149,338 over the two days. It was 3,719,110, an actual increase of 129,617. The model was pessimistic by 15%, or 7% off per day. It’s a naive curve-fitting model, and does not inform us whether this is because the weekly variation is increasing or the growth rate is settling back down, or there was just quite a bit of random variation in one direction.
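
To be clear about how I compute that figure, the error is measured on the projected vs actual increase from the anchor point, not on the cumulative totals:

projected_increase = 149_338   # model's projected two-day increase, fitted to 7/16 data
actual_increase = 129_617      # increase the New York Times data actually showed
error = projected_increase / actual_increase - 1.0
print(f"{error:.0%}")          # about 15% pessimistic, roughly 7% per day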

With data from seven days ago (7/11), the model projected that today’s New York Times cumulative cases number would have been 3,703,746, an increase of 443,181 over the week. Again, it was 3,719,110 today. That error is too small to be worth calculating. The projection was essentially perfect one week out.

State and Local

My own state of Washington was doing pretty well with this virus up until early June. Now things aren’t looking great at all, with not just the number of daily cases increasing each day, but also the rate of growth in daily cases. Here is what I’ve been calling the “Nightmare Plot” with the full set of information about the model’s fit to Washington’s reported case data along with a couple weeks of extrapolation into the end of July.

Washington State cases vs days elapsed since 1/22

I don’t actually expect that the long-term growth rate for reported COVID-19 cases in Washington State will settle at an absurd 33% per day, despite the model’s best fit assigning that value (0.32583) to the parameter r2. Something’s gotta give long before that happens, because no society can sustain having its population infected with a deadly virus at a reported rate that increases each day by a third of the total number of cases reported thus far.

The reason the curve fit can get away with such a high estimation of r2 is that it paired it with a very large value of t1, 271.65. That corresponds to an interim point of 272 days after January 22 that lies midway between the initial growth rate (nearly zero) and that crazy high final growth rate of nearly 33%. That’s October 20. I don’t believe things will continue increasing in Washington state until then, and neither should you. But it is useful to know that the best fit for the model parameters right now is one that projects a lot more cases, and a continued increase in how fast those cases are coming, for months ahead. That’s what would happen if the virus were allowed to progress as it has.

It can’t, of course. We won’t allow it to, whatever our politics or petty objections to wearing something on our face to protect other people. You can see why that growth rate will have to go back down–one way or the other–by humoring me with an extrapolation of the Washington State model through the end of August, when school is scheduled to start. This plot shows the percentage of Washington’s population that the model expects will have tested positive for COVID-19 on each day.

Percentage testing positive in Washington State vs days since 1/22

According to the model, cranking away on the data in its ignorance about anything people might do in response to the situation, or about whatever limitations there are on how many tests can be processed in a given day or week, we would be seeing 2% of the people in the entire state testing positive on August 25. That’s right around the time the kids would be heading back to campus.15

Assuming the continued accuracy of CDC Director Robert Redfield’s June 25 assessment that there are ten times as many actual cases as reported ones, one in every five citizens of Washington would have actually contracted COVID-19. Long before that, though, you would see masks everywhere even in the rural red eastern half of the state and Karen would finally shut the fuck up already. It’s actually started happening out here now, after months of opposition and denial.

The curve will flatten again, inevitably. This is the one bit of armchair epidemiology I will dare to offer, if you don’t also count my assigning 1/4 of the region’s population to model parameter L. The rest of my work is just looking carefully at data with a highly refined non-linear model that has reflected that data really well thus far and is pretty good at looking into the future a few days, perhaps even a week or two.16

Here’s the Nightmare Plot for Spokane County, by far the most populous one in Eastern Washington:

Spokane County (Eastern WA) cases vs days elapsed since 2/25

In this instance, we have the extreme periodic behavior of zero cases being reported each of the past two Saturdays (including today). But the model isn’t fooled; it mostly accounts for the periodicity by evolving its parameter aw to an unusually high value of 83% (0.83329). The residuals are fine, as normally distributed as would be expected from random Gaussian errors nearly half the time. And thus it can project with some confidence there will be well over a hundred newly reported cases this coming Tuesday and also on Wednesday. I am pretty confident that we (I live in a surrounding county, but Spokane County looms large) will be seeing four thousand reported cases (in a county of just over half a million) by early August.17

Finally, before going into the heavy parental decision that all this data was in service of, I will offer a Nightmare Plot for neighboring Kootenai County, Idaho. I expect the number of reported cases there to double in the next two weeks.

Kootenai County (North Idaho) cases vs days elapsed since 3/21

Hell No, They Won’t Go

I’ll repeat again that my only relevant expertise is in applying nonlinear mathematical models to time-series data.18 But, alongside my wife, I am still entrusted with a decision that will affect some young people’s lives. Using this limited area of expertise along with a very comprehensive collection of data, we have decided that our high school kids will not be attending campus in person this fall, regardless of what precautions the school administrators take or what requirements they have. We just won’t do it.

How Many is Too Much?

The metric I have been using to assess how serious things are is 1% of the local or national population testing positive for COVID-19. Again relying on the view of CDC Director Robert Redfield (whose agency is now being shunned by the deranged racist narcissist in the White House) that there are ten times as many actual cases as reported,19 this equates to one tenth of the population having thus far been infected.20

This 1% threshold has already been reached nationally, around July 12. My modeling of Washington State, Spokane County (WA), and Kootenai County (ID) for the weeks ahead makes me believe that it will also be reached locally by the time my kids would be asked to return to campus this fall. Using Dr. Redfield’s 10:1 actual vs reported estimate, every tenth person will have been infected in the region around my kids’ high school.

Think about that number for a moment. One out of every ten residents of the most populous county in Eastern Washington–or indeed the entire state–will have had that virus growing inside their bodies. Imagine one finger on your two hands held up in front of you being a random person in your community who is or has been a host for SARS-CoV-2.21

How Many Contagious People Around?

The thought of 10% of the national or local population having contracted COVID-19 is pretty scary, but how many of them will actually pose a danger to my kids and my wife and me? This is a tough question to answer, and the weakest link in my analysis. My wife and I had an important decision to make, we’re pretty much on our own in the failed state that America has become, and all I have to work with is the reported cases plus whatever assumptions I put on top of them.

Let’s first consider the time window of when people can spread the virus to others. In an article published in March and updated July 2, The Harvard Medical School says the

time from exposure to symptom onset (known as the incubation period) is thought to be three to 14 days, though symptoms typically appear within four or five days after exposure.

We know that a person with COVID-19 may be contagious 48 to 72 hours before starting to experience symptoms. Emerging research suggests that people may actually be most likely to spread the virus to others during the 48 hours before they start to experience symptoms.

The article provides little guidance about when infectiousness might end. That most symptomatic carriers “will no longer be contagious by 10 days after symptoms resolve” is the best I can find, and it leaves me wondering whether this also applies to people whose symptoms last for weeks.

Assembling all these vague numbers together, I wind up with the assumption of a fuzzy, ill-defined “contagion window” extending out to ten days after exposure. I have no idea what the shape of that distribution would look like. Are there lots of people who you still wouldn’t want to be around more than ten days after they were exposed, or just a few? But to keep things simple, to limit the number of people testing positive who I will consider infectious, and to not be quite so alarmist, I’ll assume that the window extends from the exposure date (realistically, it’s probably the day after exposure at the earliest) to ten days after.

So when did all the thousands of people who tested positive on a given day actually get exposed to the virus and start that (highly uncertain) ten-day contagion window?

Unfortunately, the delay from when a person gets exposed until their exposure results in a reported case is variable, long, and may be getting longer as backlogs of tests pile up. In a scathing April 4 critique of the very kind of analysis I’ve attempted to do–by a person who knows a thing or two about modeling data–Nate Silver said he assumes

that there’s a delay of 15 days . . . between infection and the test results showing up in the data though if anything I suspect this is too generous, given the huge testing bottlenecks in places such as California.

I’ll go with his 15 day estimate. In doing so, I am mindful of his warning that the “number of reported COVID-19 cases is not a very useful indicator of anything unless you also know something about how tests are being conducted.” But I will go ahead and make the terribly simplistic yet perhaps still useful assumption that (1) the exposure resulting in a given day’s new daily cases occurred 15 days before that day, and (2) the window in which all those infected people were infectious to others was from 5-15 days before they became a reported case.

This means that you have to look forward 5-15 days along the red projected-cases curve to see how many people around you are infectious on any given day. What the projection does is to give you an idea as to how many of those as-yet uncounted people are out there being contagious right now.

In the U.S. overall, my model is projecting that we will go from yesterday’s 3.7 million reported cases (none of whom are still in that 5-15 day window) to around 4.8 million 15 days from now.22 That’s 1.1 million additional people testing positive, with most of them in that contagion window right now. I’m further assuming, with Dr. Redfield, that there are ten times as many people actually infected on each given day as what the reported case data shows, which means there are maybe 8-10 million Americans you really don’t want to be around at this time.

Nationally, with our current growth rate and my shaky assumptions, it appears that there are now three infectious people for every reported case. That’s a lot of virus walking around.
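
Here is that back-of-envelope arithmetic spelled out, using the same rounded figures and assumptions from above; nothing new, just the calculation made explicit:

reported_now = 3.7e6          # cumulative reported U.S. cases today
reported_in_15_days = 4.8e6   # model's projection 15 days from now
actual_per_reported = 10      # Dr. Redfield's actual-to-reported ratio

newly_reported = reported_in_15_days - reported_now     # ~1.1 million new reported cases
infectious_now = newly_reported * actual_per_reported   # on the order of ten million people
print(infectious_now / reported_now)                    # ~3 infectious people per reported case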

How about for Washington State? Simply multiplying by three the 2.2% figure I dared to extrapolate above for Washington on 8/27 would result in around 7% of the population being infectious then. That’s a lot. I’m skeptical of it, too.23 So let’s (irresponsibly) project out the actual number of reported cases and then do the math like we just did for the U.S.

Ignorant of the likelihood that the curve will flatten between now and then, the model projects that 1% of Washington’s population will be testing positive on August 7, which is comparable to the situation now in the U.S. overall.24 See the “Percentage testing positive” plot in the section above. Re-doing the plot with cumulative case numbers rather than percentages looks like this:

“No-flattening” long-term projection for reported cases in Washington State

To do a SWAG25 for the percentage of Washington that is infectious on 8/7 with its projected 78,000 reported cases or so, let’s look forward 15 days to when the model has a business-as-usual projection of around 133,000 reported cases. That’s 55,000 additional cases. If most of those new cases appearing on 8/22 are infectious on 8/7, and if the 10x multiplier for actual vs reported cases holds true, that’s perhaps nearly half a million Washingtonians capable of giving us COVID-19.

So, when it projects that 1% of my state’s citizens will be testing positive, assuming the virus is still on the rampage, my model tells me that around 6% of the state population will be capable of infecting me if I get too close. As I said before, I doubt if it will be growing as fast then as the model currently projects, but even half that would be too much.26

And of course I could not resist doing the same irresponsible extrapolation for Spokane County:

“No-flattening” long-term projection for reported cases in Spokane County

The model projects around 2,200 new reported cases from 8/7 to 8/22. Ten times that would be around 4% of the county population actually infected, with most of them infectious to others.27

What’s So Bad About 3%?

Imagine that there is indeed a 3% probability of any randomly selected person from the population around you being able to infect you with COVID-19 by coughing, talking, or even just breathing.

If you (or your school kid) encounter just 30 people randomly chosen from the population in a given day, you have only a 40% probability of avoiding any encounters with an infectious COVID-19 carrier. If you mingle in society and encounter a different set of people each day, your probability of avoiding proximity to a COVID-19 carrier goes down dramatically each day. After a week, it’s practically impossible to remain free of any such encounters.
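
Those percentages come from treating each encounter as an independent 3% chance, which is itself a simplification, since real contacts are not independent random draws from the population:

p_infectious = 0.03        # assumed fraction of the population currently infectious
encounters_per_day = 30

p_clear_day = (1.0 - p_infectious) ** encounters_per_day   # ~0.40
p_clear_week = p_clear_day ** 7                            # ~0.002
print(f"{p_clear_day:.2f} per day, {p_clear_week:.3f} per week")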

Hopefully those encounters are brief and separated by at least six feet (more space is better), with masks on everybody unless outdoors. That doesn’t sound like any sort of high school experience my kids would have this fall, whether attending under reasonable conditions or staying home. No dances, hugs, high fives, being the class clown or acerbic wit, band, etc.

So they will just stay home regardless and wait for our national folly to play itself out. Hopefully before, or at least by the time, we reach 5% of Americans reported as infected (and probably half actually infected), we will have some herd immunity going and Spring Quarter will look safe.28

If it reopens, each of the students attending our kids’ high school will have a terrifyingly high probability of being exposed at some point through their day to someone infected by COVID-19. This presents an individual and family risk that my wife and I will not ask our kids to bear.

The Ethical Dimension

There is an ethical dimension to this as well. Admittedly, it is a small and secondary part of my considerations because, like you, I consider my health and that of my own family of paramount importance.

But here it is: Do I want to participate in an activity whose existence poses a serious health risk to a teacher, janitor, teacher’s assistant, or administrative employee? A person already underpaid and unappreciated who probably feels compelled to enter that building full of people whose skills in risk assessment and decision-making will not fully develop for a few years yet?

That person may be decades older than the students talking loudly in their classrooms29 or the kids whose bathrooms they are tasked with cleaning. And they may have little choice but to put their bodies at risk for simple economic survival, including, ironically, being able to keep the very health insurance they rely on to keep from going bankrupt if they get sick. They don’t have the luxury my kids have (though they hate it) of sitting home with both parents there.

It is an upside of having older parents, one that perhaps balances the increased risk we carry, but it’s worthless if we don’t make use of it. So we will continue staying home and staying vigilant, and thus deny this virus one small set of hosts to travel with.

Ironically, the parents this affects most may be teachers with kids in school themselves.30 People in other professions and trades have to figure something out for their kids being home two months every summer in most places in the U.S. But teachers have long enjoyed the perk of summers off, and so haven’t needed to make child care plans for the summers. When they go back to work, their kids go back to school. Well, maybe not this time, if they are in a reopening school district and decide not to subject their kids (and thus, indirectly, themselves) to the risk of infection.

The Decision: A Balance of Harms

Our decision will affect our kids’ future. There are serious consequences to them either way. On one side, keeping them home avoids subjecting them to an unacceptably high risk of catching a virus whose likelihood of causing months of illness and disability, long-term damage, and even death is only now being appreciated.31 Or (especially regarding the possibility of death) of passing it on to the two people they love most on earth.

On the other side is the sober realization that these teenagers of ours are going to miss out on much of the activities and flirtations and friendship intimacies of the years we remember from when we were them.32 There’s just no substitute for that experience in “remote learning,” whose name it seems to me refers as much to the likelihood of learning occurring as to the physical distance.

This decision is a balance of harms. It was difficult to make, but allowing our kids back onto their high school campus this fall imposes an individual and family risk on them that they cannot be asked to bear.

There are three times as many people who have been infected across the U.S. as there were at the beginning of May. Now the spread of the virus is growing twice as fast as it was then. Yet the general public and its elected officials (far more so in the GOP and its Dear Leader, but to some extent true of both parties) have been acting like re-opening everything is simply inevitable no matter how many refrigerated morgue trucks a hospital needs to have stationed outside, no matter how many millions of people young and old wind up with permanent damage to their lungs, kidneys, hearts, and even their brains.33 No matter how many weeks of suffering many millions have to endure, each successive day bringing more fear and dread of being one of the unlucky long-term sick. Nope, we have to get things re-opened again, no matter what.

Again, I recognize that my wife and I are privileged in not having to leave the house every day for work. But the fact that in-person reopenings of various types are pretty much economic necessities in our dog-eat-dog unfettered capitalism does not make me want to participate in them if I can help it. That’s mostly from a desire for preservation of self and family, but to a small extent to not participate in the mass delusion that everything will be OK.

Notes


  1. This essay isn’t intended to be a collection of scary links, but it’s worthwhile to consider the view of an ICU doctor last week that his patients have gone from having an average age of around 65 to “between 25 to 35, 45 years old” (“Miami Hospital ICU Doctor: New Influx Of Patients Is Younger Than Before,” NPR, July 13).

  2. The technical details of the modeling are covered later in this essay. I had spent much of my Python coding time in the past two years working on a free, open-source software project that models power MOSFET devices using Python and a freely available underlying simulation tool called Ngspice. Then a deadly pandemic happened, my family went into a no-nonsense quarantine for months, and frankly it became a bit difficult to concentrate on something as removed from practical life as that. Instead, I took the same nonlinear modeling tools I developed for the simulation project and applied them to time-series data on reported COVID-19 cases. This essay with its plots is the culmination of that effort. 

  3. The model is implemented in an example file covid19.py that is part of my ADE Python package. It’s free software; with the right software skills, you can install it and try it out for yourself. 

  4. Earlier I was using data from The Johns Hopkins University, but have switched to the New York Times data both for the detail it offers as well as its simple and open licensing terms, which are co-extensive with the Creative Commons Attribution-NonCommercial 4.0 International license, plus this:

    “In general, we are making this data publicly available for broad, noncommercial public use including by medical and public health researchers, policymakers, analysts and local news media.”

    “If you use this data, you must attribute it to ‘The New York Times’ in any publication. If you would like a more expanded description of the data, you could say ‘Data from The New York Times, based on reports from state and local health agencies.’” 

  5. Some of the states and counties I’ve looked at fit well enough to their datasets without all ten parameters used in the national model, and thus have simpler models. Washington State, Spokane County, and Kootenai County have only two growth regimes, seven parameters instead of ten. The AICc metric was used to determine which parameters “earned their keep” and remained in the model; a parameter stays only if it results in a lower (better) AICc score.

  6. A square-root transform is applied to both the modeled and actual new-case numbers to arrive at the residuals. Then the sum of squared error (SSE) is computed by squaring each residual value and adding them up. The purpose of the transform is to lessen the disproportionate impact of later, higher numbers on the curve fit. 

  7. The “normal” or Gaussian probability distribution describes many events in nature, from the noise you hear on a radio to the variation in people’s heights. It is what you eventually wind up with when you look at enough related phenomena together, even if the underlying probability distributions of each one are not normal. 

  8. For some reason, a p of 0.05 is the standard for many statistical tests. So, on that basis, the residuals are well within the acceptable range of normality. 

  9. This value of L is the largest number of people in the state/county population expected to be reported as infected. According to Wikipedia, the lowest estimate for COVID-19 herd immunity of actual (not just reported) cases is 50%.

    But there will always be more actual cases than reported. As mentioned in the essay, on June 25 CDC director Robert Redfield indicated there are 10 times as many actual cases as reported. That would reduce a 50% herd-immunity value of L to 0.05 times population. But the 10x ratio could change as the number of cases increases and testing could increase to the point where nearly every American has been tested.

    To be conservative about the modeling and hopefully avoid inserting too many shaky assumptions into it, I’ve fixed L at 0.25 times the region’s population. This limits the increase in reported cases (regardless of the growth regime) as the number approaches 1/4 of the state or county’s total population.

    It actually doesn’t matter much. At this point in the pandemic, L has almost no impact yet; all of the regions I’ve looked at still have reported case numbers that are single-digit percentages of their populations. 

  10. The curve fitting algorithm actually follows the model backward from the latest day’s data, to which its cumulative case result is fixed. It does this by backwards-integrating a first-order differential equation xd(t,x) whose gory details are shown in the upper left part of the first plot. Applied mathematics in all its glory, and it works.

  11. The exact parameters are shown in the lower right of the first set of subplots, with the third growth regime’s daily rate r3 equal to 0.18863 and the weekly variation aw equal to 0.11354. 

  12. Extrapolating a nonlinear model is not something to be done lightly even if the underlying phenomena are well understood and predictable. In the case of America’s COVID-19 pandemic, there are a couple of known unknowns (to paraphrase Donald Rumsfeld) worth mentioning in addition to all the unknown unknowns.

    The curve could flatten again if those states whose citizens and governors have not been taking this virus seriously finally start to do so when enough of the people around them get sick and they see a lot of those people staying sick for a long time, or even dying. A second, less comforting possibility is that we may soon see the effects of Trump’s stated desire not to have the reported numbers increase quite so fast (“Slow the testing down, please!”), and of his latest move to keep data from reaching the CDC. Doesn’t it seem that a federal agency named “The Center for Disease Control and Prevention” should be informed about just how many cases there currently are of a pandemic affecting millions of Americans?

  13. Norah O’Donnell: “Dr. Fauci, do you still think that we could reach 100,000 infections a day?” Fauci: “You know, Norah, I don’t think we will. I hope not. It is conceivable that if we don’t get good control over the current outbreak and we keep spreading into other regions of the country, we could reach 100,000” (CBS News, “Fauci says he doesn’t like ‘to be pitted against the president’ after multiple attacks from the White House,” July 16).

  14. Posting about this modeling work of mine on the r/Coronavirus subreddit has resulted in a lot of page views, which has been appreciated since the goal of the writer is after all to be read, and this seems like something important to share. It also results in a few critical comments each time, some of which have actually been helpful in guiding me toward more statistically rigorous modeling. I am finally reminded of the public health official in Argentina who took me to task back in April about not providing any statistical tests about the model’s residuals. It was a valid criticism which I accepted and responded to, and the official gracefully acknowledged frustration and fears about the virus in their country.

    Some of the critical comments have just been exasperating snark from people who consider themselves far above something so naive as fitting data to a curve and daring to claim that the curve means something when it fits very well. A recent interloper gazed down from his high horse to proclaim that I am trying to model sampling error with my weekly-variation component. I tried unsuccessfully to point out that my effort here is to model reported cases, not to imagine some expertise about an unknown number of actual cases of infection out there, hidden in the unreported masses. If the time series of case reports has a periodicity to it, as it obviously does for the U.S. overall and also more than half the states and counties I’ve examined, then that is what I’m modeling more faithfully by adding a periodic component. I predict the reported cases will increase by not much more tomorrow–perhaps even less–than they did today, because the periodic component to the rate at which cases are reported reaches the trough of its wave tomorrow. That says nothing about how cases are “actually” increasing, nor do I ask it to.

    The same inquisitor muttered something about basis functions and suggested I look into learning about the Fourier series. (I’m an electrical engineer with expertise and a dozen patents related to signal processing; I’ve heard of the Fourier series, thanks.) How silly of me to think that there is anything special about fitting a ten-parameter combination of any functions to within 2.1% over 70+ data points. Why, he (we can all guess it was probably a “he,” can’t we?) could pull off that same trick with just about anything. I suggested it was absurd for him to claim that ten free parameters of a Fourier series (five terms with a real and imaginary component apiece, though I’d forgotten about the DC component) would accomplish the same thing, and he made some assertion that he managed it with seven, presumably seven components. No plots or code or anything was provided, of course. And even if he could, he would find the innate periodicity of the Fourier series bringing his case numbers right back to the simple linear increase of the zero-frequency DC component, back and forth in an absurd roller coaster totally unrelated to the problem at hand as his modeled cases come in faster and then slower and then faster again.

    I’ll post this on reddit like my other essays about this model. But this time I think I’ll pass on any further schooling from the boy geniuses there who remember a few textbook cases and fancy terminology but offer no relevant work of their own for comparison. 

  15. Update 7/23: If I kept updating these blog posts with each new day’s data, I’d never stop. But given how sensitive such a long extrapolation is to changes in recent data, I do feel compelled to offer this link to a plot updated with 7/22 data that shows 1.3% of the state testing positive on August 25 instead of 2%. There is just a hint of flattening showing up in the curve. 

  16. For the Washington State data, the model has been a bit less accurate than for the U.S. as a whole. With 7/16 data, it projected 48,001 reported cases today when there were actually 47,563, an error of 34% (16% per day) in the projected vs actual increase from 46,268.

    With data from seven days ago, it projected there would be 46,950 reported cases today. Again, there were 47,563. That’s an actual increase of 6,237, compared to a projected increase of 5,624. The error was around 11% for the week, or around 1.5% per day. 

  17. Admittedly, the model starts to deviate quite a bit from what happened when it looks back more than three weeks. It goes from a worst-case modeled vs actual error of 6.9% on 7/11 to 12.8% on 6/28, 14.5% a week earlier, and 21.9% a week before that. There was an obviously artificial glitch in test results being released around 6/27-6/28, which the model will only approach as it smooths its errors over the whole interval of interest.

    Update 7/23: A Nightmare Plot for Spokane County updated with 7/22 data is here. The model projects 4,000 reported cases on August 4. 

  18. I hope this modeling work gives people some insights about the situation we face here in the U.S. But please note this critical disclaimer: First, I disclaim everything that the New York Times does. I’m pretty sure their lawyers had good reason for putting that stuff in there, so I’m going to repeat it. Except think “Ed Suominen” when you are reading “The New York Times”:

    “The New York Times has made every effort to ensure the accuracy of the information. However, the database may contain typographic errors or inaccuracies and may not be complete or current at any given time. Licensees further agree to assume all liability for any claims that may arise from or relate in any way to their use of the database and to hold The New York Times Company harmless from any such claims.”

    Second, I know very little about biology, beyond a layman’s fascination with it and the way everything evolved. (Including this virus!) I do have some experience with modeling, including using the ADE package of which this COVID-19 modeling is a demonstration file to develop some really cool power semiconductor simulation software that I’ll be releasing in a month or so from when I’m doing the GitHub commit with this COVID-19 example. The software (also to be free and open-source!) has a sophisticated subcircuit model for power MOSFETs that evolves 40+ parameters (an unfathomably huge search space). And it does so with this very package whose example you are now reading.

    So–yes, this is still a disclaimer–I am not an expert in any of the actual realms of medicine, biology, etc. that we rely on for telling us what’s going on with this virus. I just know how to fit models to data, in this case a model that is well understood to apply to biological populations.

    Don’t even think of relying on this analysis or the results of it for any substantive action. If you really find it that important, then investigate for yourself the math, programming, and the theory behind my use of the math for this situation. Run the code, play with it, critique it, consider how well the model does or does not apply. Consider whether the limiting part of the curve might occur more drastically or sooner, thus making this not as big a deal. Listen to experts and the very real reasoning they may have for their own projections about how bad this could get.

    Seriously, experts know stuff. That’s what they do. One of them I recommend paying attention to is Dr. Osterholm at the University of Minnesota’s Center for Infectious Disease Research and Policy (CIDRAP). His interview in the July 1 podcast episode Viral Gravity is sobering but informative and, as we’ve seen in the nearly three weeks since, has been quite accurate about how serious the situation is. 

  19. Washington Post, “CDC chief says coronavirus cases may be 10 times higher than reported,” June 25.

  20. There’s nothing magic about my 1% threshold for reported cases, i.e., 10% of the population actually infected. It is just easy to picture 1/10 of a population. That’s one of your digits from both hands. The Romans understood the power of that concept; “decimation” referred to the Roman Army’s practice of brutally killing every tenth man in a legion to terrorize and punish insubordination. 

  21. I have just the finger in mind when it comes to the “leadership” of our current president in this crisis, but that’s another blog post entirely. 

  22. Update 7/23: This projection is unchanged. Here is the Nightmare Plot for the U.S. overall, updated with data through 7/22. 

  23. Update 7/23: With today’s updated data and the 1.3% figure it now projects for 8/27, the 3:1 multiple would yield 4.2% of the population being infectious. Still more than the 3% figure discussed in the main text. 

  24. Update 7/23: Now projected as 0.92%. Close enough. 

  25. Acronym for “scientific wild-ass guess,” an old term of endearment used by engineers who know full well how many shaky assumptions are involved with many a technical endeavor. 

  26. Update 7/23: The projection updated with 7/22 data now looks like this: around 70,000 cases on 8/7, and around 94,000 on 8/22. The increase is projected to be 24,000 rather than 55,000. So, with all the other assumptions kept the same, we would be looking at 3.1% of Washingtonians capable of infecting you on 8/22. Not as much, but the discussion of 3% still applies. 

  27. Update 7/23: With 7/22 data, the “no-flattening” long-term projection for Spokane County is for around 2,000 new reported cases, just a bit less than previously projected. 

  28. There will hopefully be a vaccine early in 2021 but when and for whom? At the rate things are going now, I wouldn’t be surprised if we get to 50% infected before one becomes available for everyone in my family. 

  29. “[L]oud speech can emit thousands of oral fluid droplets per second . . . there is a substantial probability that normal speaking causes airborne virus transmission in confined environments.” Stadnytskyi, Bax, et al., “The airborne lifetime of small speech droplets and their potential importance in SARS-CoV-2 transmission,” Proceedings of the National Academy of Sciences Jun 2020

  30. Rebecca Martinson, a public school teacher in Washington State, explains her decision to not return to class, no matter what, in this thoughtful essay, “Please Don’t Make Me Risk Getting Covid-19 to Teach Your Child,” New York Times, July 18.

    She writes, “My school district and school haven’t ruled out asking us to return to in-person teaching in the fall. As careful and proactive as the administration has been when it comes to exploring plans to return to the classroom, nothing I have heard reassures me that I can safely teach in person.” After listing off all the sacrifices teachers have had to make for their students and careers, she says that it “isn’t fair to ask me to be part of a massive, unnecessary science experiment. I am not a human research subject. I will not do it.”

  31. The danger of this virus to young people has been downplayed. It is frustrating how little statistical data seems to be available on this, but there is a significant minority of even younger people who suffer for months with the aftermath of a COVID-19 infection. A figure as high as 5% has been tossed around, but with little statistical support. That is understandable, given that 2/3 of the Americans reported as infected have lived through less than six weeks since their positive test result. The majority of “long haulers” have yet to be counted.

  32. Admittedly, we never were quite like them. Unlike these poor kids, we never had to suffer the downers of a deranged narcissist bigot wannabe authoritarian as president, a deadly pandemic, and an emerging economic depression all at once. 

  33. “COVID-19: Severe brain damage possible even with mild symptoms,” Deutsche Welle

Friday, April 10, 2020

Marching Towards a Million

[T]he feelings of knowing, correctness, conviction, and certainty aren’t deliberate conclusions and conscious choices. They are mental sensations that happen to us.
—Robert Burton, On Being Certain
Update, 4/12: The curve is continuing to flatten a bit more than the model is able to extrapolate from having its six parameters fitted to the data as it stands each day. (I certainly have no objection to things coming in on the low side, even if it reveals some room for further possible improvement in the model, or just some random-walk stochastic behavior that simply can’t be chased via curve fitting and extrapolation, even with more model parameters.) There’s not been enough of a difference to go through this post and do a bunch of edits, so I’ve just added a few footnotes and this link to the latest version of the Nightmare Plot. Don’t get complacent now, OK?

For just over a week now, I’ve been running and re-running the “logistic growth with growth-regime transition” model I’ve developed for reported cases of Covid-19, evolving populations of individuals whose digital DNA consists of six parameters for that model:

xd(t, x) = r*x*(1-x/L) * flatten(t) + b
flatten(t) = 0.5*rf*(1-tanh(1.1*(t-t0)/th))-rf+1

Two of the parameters, L and r, are for the conventional logistic growth component of the model. L is the maximum number of possible cases, which at worst case is limited by the country’s population. The other parameter r is the initial exponential growth rate, before any curve-flattening becomes apparent.

The “growth-regime transition” component of the model, implemented by flatten(t), has three parameters of its own: There is a fractional reduction rf in growth rate at the conclusion of a lower-growth (flattened curve) regime. My modification to the conventional logistic-growth model uses the tanh function to smoothly transition from the original growth rate r to a lower “flattened” growth rate r*(1-rf) over a time interval th (in days). The transition interval is defined as the middle half of the full transition between growth rates, i.e., from still 75% of old vs new down to 25%. The midpoint of the transition is defined by the third parameter t0, in days after 1/22/20.

Finally, a very small constant number of new cases per day is included as a sixth parameter b. This does manage to “earn its keep” in the model by more flexibly allowing the mean of the residuals to be zeroed out.
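
For anyone who wants to paste the model into a Python session, here are those two lines as runnable code, assuming NumPy; it is the same math, just spelled out with the parameters described above:

import numpy as np

def flatten(t, rf, t0, th):
    # Smooth drop from 1.0 down to (1 - rf), centered at t0 over roughly th days
    return 0.5 * rf * (1 - np.tanh(1.1 * (t - t0) / th)) - rf + 1

def xd(t, x, r, L, rf, t0, th, b):
    # Daily new reported cases: logistic growth times the flattening factor,
    # plus a small constant number b of new cases per day
    return r * x * (1 - x / L) * flatten(t, rf, t0, th) + b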

Here is what the model is saying today, with Johns Hopkins data from this evening (4/10, click here for the latest one with 4/12 data).

April 10, 2020: Reported U.S. Covid-19 cases vs days since 1/22

Extrapolating forward (admittedly a perilous enterprise, as I’ve said before with endlessly repeated disclaimers and caveats, applicable yet again here), the model is projecting three quarters of a million cases1 one week from today, and over a million the week after that.2 The model’s projection–both now and as it stood on 4/5–is that there will be around a million and a half Americans–one in every two hundred–reporting infection with Covid-19 in early May, with the number still climbing faster every day.3

The curve is bending downward a bit, yes, but things still look pretty grim.

Past Performance

You’ve probably heard the phrase “past performance does not indicate future results,” and that’s true if something happens (as it often does) that’s not accounted for by the model. Life is messy and complicated, and that includes pandemics. But shitty performance does tell you not to bother looking any further. And that’s definitely not what’s been happening with my little model.

With parameters evolved to yesterday’s Johns Hopkins data (4/9), it had projected today’s increase in the number of cumulative cases to be 32,920 instead of the 35,098 new cases that actually got reported today. That was off by about 6%. (I calculate the error from the projected vs actual increase from the most recent known value, because the model is anchored to that point and can only be given credit or blame for what it projects from that last known point.)

With data from the day before yesterday (4/8), it projected yesterday’s number of new cases at 32,385. There were 32,305 new cases yesterday, an error of 0.2%.

And with data from 4/7, the model evolved to parameters that had it projecting 30,849 new cases the next day. There were 30,849 new cases on 4/8, a 6% error.

On 4/5, the model projected 32,346 vs the 32,133 that there were on 4/6, an error of just 0.7%.

The model of course does a little worse when it looks further ahead. You’d expect that of any naive model (i.e., not theoretically informed beyond “the growth rate is going down”) with six empirically optimized parameters being extrapolated out along exponential growth. And it’s certainly not anything I’m ashamed of.

On 4/3, the projection was for there to be just over 600,000 cases today, compared to the 496,535 we have had reported at this point in the U.S. Quite a bit off, but remember, that was looking forward a full week.

On 4/5, it was for there to be 510,425 cases today, a total error of less than 8%, again with the error measured from the projected vs actual increase, not the absolute number.4 On 4/7, the model projected there would be 486,367 cases as of today, and that was off (in the increase) by 10%.5

Evolution in Action

(In more ways than one, unfortunately.)

I’ve been wanting to write a little about the whole process of computer evolution that I use to fit the model’s six parameters to the time-series data. It begins with a population of 240 simulated organisms (not a virus!), digital “individuals” whose randomly chosen6 DNA consists of the values of those six parameters, within predefined bounds.

After working on this model for over a week, I’ve refined each of those bounds to a reasonable range of possible values. Updating those ranges as the model makes sometimes failed attempts to find a convincing best fit is my sole remaining human-engineering activity, now that the model is designed and the code implementing it is working.

Each of those 240 individuals is given a chance to show how fit its DNA is by running a backwards integration of the model from the last known data point. The model, you may recall from reading my previous blog post, is for the number of new cases each day, not the cumulative number of total cases ever.

The model is a differential equation; xd(t,x), not x(t). So, to have it project the cumulative number of cases x(t), I integrate the differential equation forward or backward from a fixed point represented by the most recent known number of cases, a point to which it is anchored.7
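
Here is a stripped-down sketch of that anchored, backward integration, using SciPy’s solve_ivp in place of ade’s own integrator; it assumes the xd and flatten functions sketched above, and the day numbers and parameter values are placeholders:

import numpy as np
from scipy.integrate import solve_ivp

t_last, x_last = 79.0, 496_535.0   # days since 1/22 and the latest known cumulative count
t_first = 44.0                     # first day of the five-week fitting interval

# One individual's candidate values for (r, L, rf, t0, th, b); placeholders only.
params = (0.25, 330e6, 0.9, 60.0, 20.0, 100.0)

sol = solve_ivp(lambda t, x: [xd(t, x[0], *params)],
                (t_last, t_first),   # t_span runs backward in time
                [x_last],            # anchored to the most recent known value
                dense_output=True)

days = np.arange(t_first, t_last + 1)
modeled_cases = sol.sol(days)[0]     # modeled cumulative cases for each day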

The modeled number of reported cases is compared to the actual number that were reported, for each day going back what is now a full five weeks’ worth of data.8

The fitness of each individual is measured as the sum of the squared errors (SSE) between each day’s modeled number of reported cases (the value that the model would expect, being integrated backwards) vs. the number of cases there actually were as of that date. The two figures are compared only after they have had a square-root transform applied to them. This limits how much more recent, larger numbers of new daily cases weigh in the fitness calculation vs earlier ones.
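
In code, that fitness calculation amounts to something like this (a sketch, not a paste from ade):

import numpy as np

def fitness_sse(modeled_daily, actual_daily):
    # Square-root transform both series so the recent, larger daily counts
    # don't completely dominate the fit, then sum the squared differences.
    residuals = np.sqrt(modeled_daily) - np.sqrt(actual_daily)
    return np.sum(residuals ** 2)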

Then evolution gets underway, with each population member getting challenged by an individual, which gets spawned from combinations of not two but four population members. These mutant offspring have the model run with their parameters through the integration and SSE calculation. If they are better (lower SSE) than whichever population member is being challenged in its turn, they replace it. When all members of the population have received their challenge, possibly having been replaced, evolution proceeds to the next generation.

The whole spawning process is worth a moment of technical discussion. Differential evolution uses an interesting9 sort of mathematical equivalent to some kind of alien four-way sex to create challengers. The trial individual is formed from “crossover” between a “target” (a rough DE equivalent of a mother) and a “donor” individual (closest thing it has to a father). The donor is formed from the vector sum of a base individual and a scaled vector difference between two randomly chosen other individuals that are distinct from each other and from both the target and base individuals.

That is hard to follow in print, but these equations might help. The many-tentacled alien infant we want from this tangled act is ic, the individual challenging:

id = ib + F*(i0 - i1)
ic = crossover(it, id)

The crossover consists of giving each parameter of the donor individual a chance (usually a very good chance) to appear in the challenger, as opposed to using the target’s parameter. Basically, think of a higher number as being more for paternal rather than maternal inheritance. The default value used by my program (apologies for what is becoming an awkward analogy) is 0.7.
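
Here is the spawning step as a small NumPy sketch. The crossover rate of 0.7 matches the default mentioned above; the value of F and everything else ade does around this (bounds handling, adaptive parameters) are illustrative stand-ins:

import numpy as np

rng = np.random.default_rng()

def spawn_challenger(i_target, i_base, i_0, i_1, F=0.8, CR=0.7):
    # Donor: base individual plus the scaled difference of two random others
    i_donor = i_base + F * (i_0 - i_1)
    # Binomial crossover: each parameter comes from the donor with
    # probability CR ("paternal"), otherwise from the target ("maternal")
    take_donor = rng.random(i_target.shape) < CR
    # Guarantee at least one donor gene so the challenger isn't a clone
    take_donor[rng.integers(i_target.size)] = True
    return np.where(take_donor, i_donor, i_target)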

I refine the parameter bounds as needed and sit back while my ade Python package (of which this whole Covid-19 modeling is a single example file, covid19.py) has 75 generations of these simulated individuals spawn and fight it out on a virtual Darwinian savanna. The software dispatches SSE-calculating jobs asynchronously to worker Python interpreters across the timeslots of my six-core CPU. It takes about two minutes. The software produces a plot with the model’s curves in red and what actually happened in blue. It’s what I’ve been calling the Nightmare Plot.

Singing in the Apocalypse

The Nightmare Plot really is horrifying when you think about what those numbers represent. Their relentless upward march–slowing but by no means stopping–is making everyone’s life suck, including my own. The novelty of this whole apocalyptic survival thing is starting to wear off just a bit, even for me.

Maybe I’m a terrible person, but dammit I can’t help experiencing a bit of pride, too. This “little model that could” is proving to be the no-pay, no-fame, no-acceptance academic-mathematical equivalent of, say, some college undergraduate inventing an optimal radio receiver frequency arrangement, now in use by the circuitry of your smartphone, as part of an independent senior project he decided to work on weekend after weekend a quarter century ago. (This hypothetical individual never was much for working in groups.)

So, take it or leave it, folks, you’ve got yourself a six-parameter mathematical model for the number of reported cases of Covid-19. It was never “published” in some elite-accepted overpriced package of specialty information. It wasn’t part of any network of peer reviewers. (Like the aforementioned loner radio geek, I’ve never been one for playing in groups.)

But it does appear to work.

———

A friend of mine told me today that he is selfishly rather enjoying this whole situation. He now has lots of time to learn things on his own that he’s been wanting to work on, time for “driving out to pretty places to take pictures and go for walks in the woods.”

He’d rather be back doing what he does in person. But, he admits, there are upsides.

I assured him that it’s 100% OK to enjoy those upsides, even as I admitted my own feeling that it’s a little weird to derive satisfaction from successfully modeling these awful numbers. But I had the benefit of receiving some reassurance in this area yesterday, as I was talking about this very topic with a friend of mine whose life’s work revolves around how people think and feel about things.

He told me he can’t be as helpful to others in his profession if he isn’t taking care of himself, and that means enjoying life in spite of or even at times because of what is otherwise a horrible situation. Of course he’d rather not be in it, nor would I or you, dear reader with a delicate pair of lungs of your own. But he is, and you are, and I am, and so let’s take what good there is to be had.

Smile and sing and laugh, and take pride in the work that you now have. Even through the Apocalypse.

Notes


  1. With data updated as of 4/12, the projection is now 700,000 by 4/17. A bit lower, and the curve is continuing to flatten, but not by much.

    The careful reader may notice that I always refer to “reported U.S. cases” or some alternate variation and wording. I will repeat yet again that I have no expertise in biology, medicine, or the spread of infectious disease, and so I try not to speculate on how many of our fellow citizens have actually gotten infected with this thing without it ever being reported. Again, I’m just a retired engineer who has spent years constructing nonlinear models of mostly electronic things and believes this one is pretty well grounded for doing the following, and only the following: predicting what will happen if the data continues as it has recently, especially as it has in the past two weeks.

  2. With data updated as of 4/12, the projection is now for a bit less than 900,000 cases by 4/24. One redditor cleverly observed that my modeling’s million-case projections have been like those for fusion energy. (The saying is that fusion is 30 years away and always will be.) I won’t dispute that observation at this point; each day’s new data for the past week or so has pushed that projection outward a bit, though never making it look any more implausible to reach eventually. 

  3. Let’s call it mid-May now, given the recent additional curve flattening (4/12 data). 

  4. Though the difference in how the error is computed matters less with such a large increase from the 4/5 last-known number of 337,072. In case you really want to know, the absolute error from the model projecting forward five days was 2.8%.

  5. The actual number of cases as of 4/12 from Johns Hopkins’ daily data was 555,313, an increase of 58,778. On 4/10, the model was projecting 567,306 cases, or a projected increase of 70,771. The error in the increase was 20% over the two-day interval, or 10% per day. Not as good as previous days’ next-day predictive performance, but not terrible, either. And since there will always be an error when extrapolating from a curve fit to data having a random component, I’m happy the data is lower than projected and not higher, because I have my own delicate pair of lungs that I’ve grown fond of, too. 

  6. Actually, they’re not quite chosen with uniform randomness: The default is to set up the initial population using a Latin hypercube. That way you get pseudorandom initial parameter values, but with minimal clustering. You want the six-dimensional search space to be explored as fully as possible.

  7. This is known as an “initial value problem,” the initial value here being the last known number of cases. You can go either direction in time from the initial value. For fitting the model parameters, my algorithm goes backwards from the most recent data. To extrapolate forward and make projections, it goes forward from the same “initial value.” 

  8. Reported case numbers before March 5 are omitted from both the curve fitting and the plots. The modeled (red) curve deviates from the historical data when you go earlier, and I’m not sure a good fit that far back in the history of this thing is relevant to what’s happening now.

  9. You claim you wouldn’t find alien four-way procreative sex interesting? Well, I don’t believe you. 

Thursday, April 2, 2020

Modeling a Slightly Flattening COVID-19 Curve

So last year 37,000 Americans died from the common Flu. It averages between 27,000 and 70,000 per year. Nothing is shut down, life & the economy go on. At this moment there are 546 confirmed cases of CoronaVirus, with 22 deaths. Think about that!
—Donald J. Trump (March 9, 2020)
And it hit the world. And we’re prepared, and we’re doing a great job with it. And it will go away. Just stay calm. It will go away.
—Donald J. Trump (March 10, 2020)
TL;DR: A modification to the logistic growth model that provides a smooth transition to a lower-growth regime is fitting the data for U.S. reported cases obtained on April 2 from Johns Hopkins University with errors of less than 5% each day for the past week, and less than 14% for the week earlier. It indicates that there will be around 275,000 U.S. reported cases tomorrow, around 311,000 the day after that, and half a million in a week from now (4/9). According to the model’s projections (which must be treated with caution as for any extrapolation), with exponential growth tempered by “flattening of the curve” proceeding as it has been, the million-case mark will be reached in mid-April. The projection is for a couple million reported U.S. cases by the end of the month. See the Nightmare Plot and all my Covid-19 blog posts for some important development history, context, disclaimer, and even a few thoughts about putting this thing into perspective.
Update, 4/3/20: The exact projection was that there would be 275,799 U.S. reported cases today if the numbers continued as they have for the past two weeks. Today’s number was 275,586. I see no need to update anything else in this blog post, but if you really want to see the latest version of the Nightmare plot, it is here. There has not been even a 0.1% error between the model and the actual numbers for the past three days.

Over the past two weeks, what began as a simple new usage example to include with my free, open-source evolutionary parameter-finding software wound up turning into something of a ghoulish personal challenge: to identify a statistically plausible mathematical model that would fit decently with the exploding numbers of reported Covid-19 cases in my beloved but increasingly broken country, the United States of America.

I’ve made lots of refinements to the code and its underlying modeling over that time, and discussed them, along with some projections (accompanied by important disclaimers, which also apply here), in my blog posts of March 19, March 22, and March 25. Along the way, I tried several different approaches to modeling this thing: the logistic growth model, the power law with exponential decay, and even a linear combination of the two, always accompanied by a small linear term.

What increasingly intrigued and frustrated me, though, was the inability of any of those models to deal with a slight “flattening of the curve” that has become apparent in the U.S. national numbers of recent days. Fitting the logistic growth model xd(t,x)=x*r*(1-x/L) to the data resulted in an unrealistically low value of L, because that’s the best such a simple model could do with the recent reduction in growth. The power-law model with exponential decay1 xd(t)=a*(t-ts)^n*exp(-(t-ts)/t0) provided interesting results that didn’t seem unrealistic and that proved pretty accurate at predicting the next few days’ reported case numbers. But there was no evidence of any realistic value being found for ts to implement the “decay” part of the model, and so the values beyond just a few days out had a very wide span of uncertainty.

I tried a linear combination of the two models, but that meant a lot of degrees of freedom. The combined model valiantly tried fitting itself with all its parameters to an anomaly in the US reported case data from 3/16 to 3/18, and offered me even less confidence than usual for making extrapolations.

But today an idea occurred to me to try, and it worked. The result is a logistic growth model that does a smooth transition from an earlier growth rate to a later, lower one:

xd(t, x) = r*x*(1-x/L) * flatten(t) + b
flatten(t) = 0.5*rf*(1-tanh(1.1*(t-t0)/th))-rf+1

There are six parameters:

L is the maximum number of possible cases, which in the worst case is limited by the country’s population.

r is the initial exponential growth rate, before any curve-flattening becomes apparent.

rf is the fractional reduction in growth rate at the conclusion of the lower-growth (flattened curve) regime.

th is the time interval (in days) over which the middle half of a smooth transition occurs from the original higher-growth behavior to a fully flattened curve with the lower “flattened” growth rate r*(1-rf).

t0 is the time (in days after 1/22/20) at the middle of the transition from regular logistic-growth behavior to a fully flattened curve.

b is a very small constant number of new cases per day. This is mostly there to zero out the mean of the residuals so that I could get decent goodness-of-fit indicators for the model.2
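Putting the two equations above into code gives something like this minimal sketch (my own transcription of the math, not a copy-and-paste from covid19.py):

import numpy as np

def flatten(t, rf, th, t0):
    # Smoothly drops from 1 (no flattening, early on) to 1 - rf
    # (fully flattened), with the middle half of the transition
    # spanning about th days centered on t0.
    return 0.5 * rf * (1 - np.tanh(1.1 * (t - t0) / th)) - rf + 1

def xd(t, x, L, r, rf, th, t0, b):
    # New reported cases per day: logistic growth, throttled by the
    # flattening factor, plus the tiny constant term b.
    return r * x * (1 - x / L) * flatten(t, rf, th, t0) + b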

———

Here is the result, my latest (and perhaps last?) revision to what I’ve been calling the Nightmare Plot. The image was produced by the latest commit of the covid19.py Python script, using data released this evening from Johns Hopkins University. It’s an important image, so go ahead and click on it to see it with full resolution, or even download it if you want.

April 2, 2020: Reported U.S. Covid-19 cases vs days since 1/22

Extrapolating forward (admittedly a perilous enterprise, but I’m curious and so are you), the model is projecting half a million cases one week from today, and over a million the week after that. The end of April is looking pretty scary indeed, with the potential for one in every hundred Americans to be reporting infection with Covid-193 and the numbers still continuing to climb.

Logistic Growth with Growth-Regime Transition

If it weren’t for the terrifying reality of what is represented by the numbers it spits out, I would be delighted with the beautiful simplicity and performance of this new model. It adds an important feature to the logistic growth model that is well accepted for population-limited reproduction. And it’s a feature uniquely suited to what’s happening right now with the Coronavirus.

This new feature I’m calling “growth-regime transition.” It smoothly reduces the growth rate from its initial unrestrained amount to a lower one. Thus it can account for recent efforts of non-idiot governors and their less ignorant citizenry to “flatten the curve” by staying home and being careful and apart when they have to go out.

And that’s exactly what it’s doing, to a remarkable degree of precision. Even as the numbers have multiplied fivefold since a week ago, the model tracks each day’s cumulative numbers to within 5%. Going back a week further to when there were about 1/70 as many cases (3,499) as today, the model is no more than 13% from each day’s actual numbers. Getting that close that many times in a row with just six parameters, and with the numbers doubling six times, feels a little spooky to me.

It’s not just my gut feeling that indicates this model is really onto something. The residuals (a fancy name for the difference between what the model said a day’s numbers should have been and what they actually were) are shown in the middle of the Nightmare Plot with some information about them. There is no statistically significant evidence that the errors are anything but normal random variation. There may be a little heteroskedasticity to them, but I think they pretty much look like residuals should. The square-root transform I’m applying to the modeled and actual data before calculating the sum of squared errors mostly seems to be dealing with the huge change in magnitude of the numbers over time.

The middle plot includes the model’s corrected Akaike information criterion (AICc) score. It’s -8.62 for this “logistic growth with growth-regime transition” model, lower and thus better than for any of the other models I’ve tried. Operating on the same dataset, the AICc was around zero for the 7-parameter linear combination of logistic growth, power law, and linear (projecting 460,791 cases in one week). It was +3.60 for just the logistic growth plus linear model (projecting 384,264 cases in a week).
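For anyone wanting to reproduce that kind of comparison, the standard least-squares form of AICc looks like the sketch below. I won’t claim this is exactly the bookkeeping my code uses, but it captures the idea: the 2*k term penalizes extra parameters, and the correction term penalizes them harder when there aren’t many data points.

import numpy as np

def aicc(SSE, N, k):
    # Corrected Akaike information criterion for a least-squares fit
    # with N residuals and k fitted parameters (textbook form).
    aic = N * np.log(SSE / N) + 2 * k
    return aic + 2 * k * (k + 1) / (N - k - 1)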

The power-law model with time-shifting, exponential decay, and a linear term did pretty well as the runner-up to the new one. Its AICc was -4.58 with today’s Johns Hopkins data, and it is no more than 8% from the actual numbers going back to 3/25. It has nice random-seeming residuals with no evidence whatsoever of non-normality. But it only tracks recent data, quickly deviating from what actually happened when you go more than a week back. For what it’s worth, it projects 503,435 cases a week from now.

———

Another promising aspect of the model: Its parameter values lie in nice rounded distributions with respect to the sum of squared error, indicating that they (or most of them, L being a stubbornly hard one to identify at this still-early stage) are all earning their keep.

Here is the distribution of both parameters L and r for the logistic-growth aspect of the model, plotted against SSE (sum of squared error, after the square-root transform is applied to modeled and actual data values). The black dots represent parameter values that challenged some member of the population at some point and failed. The blue dots are values that were part of the population at some point in the 50-200 generations that the differential evolution algorithm ran, but got replaced by somebody better. And the red dots are members of the final population when the evolution algorithm terminated.4

It’s hard to make much of L except that it’s probably a bigger number than any of us would like. The initial growth rate r is very well defined, though, at an increase of around 35% per day.

Logistic growth parameters for April 2 data

The parameters rf (r-flattened), th (half-transition time), and t0 (transition midpoint) are all very nicely defined. The growth rate transitions to about 68% of its original value, with the middle (non-tail) half of the transition taking place over the course of a little over six days centered around 63 days after 1/22/20, or around March 24. The linear term b (actually, a constant number of new cases per day) is tiny–tending toward zero–and would be omitted except that it allows the residuals to be centered around zero without impacting the other parameters.

Growth-rate reduction and linear parameters for April 2 data

In Praise of the Hyperbolic Tangent Function

As I’ve explained in previous posts, my programming work in regular life about a month ago (it seems more like a year) was wrapping up a two-year-long project that does advanced simulation of power semiconductor circuits. (I’m a retired electrical engineer and patent agent turned hobbyist open-source software developer.) I’ve developed a model for power MOSFETs that evolves optimal values of its 40-odd parameters using an asynchronous version of the differential evolution algorithm that I developed for that project.

A function I have grown to know and love from that work is the hyperbolic tangent. It plays an important role in electronic circuit simulation software to provide smooth transitions from one state to another. You can’t expect software that does stepwise integration to behave well when you suddenly change a zero to a one. The integrator wants to follow a smooth path. If nature abhors a vacuum, then electronic circuits–real or simulated–abhor discontinuity. You have to go gradually from one voltage or number to another, even if “gradually” involves a transition time measured in the billionths of a second.

When it comes to “flattening the curve” of a deadly pandemic by incarcerating ourselves for weeks on end, my layman’s hunch was that the transition time would be in days. Well, it was.

According to what evolved as the best-fitting combination of the six model parameters, the rate of growth in reported U.S. Covid-19 cases transitions halfway from a 35% daily increase to around 23% over the course of six days. The middle of that transition was on March 24. Look at the closeness of fit around that time, and afterward. There’s not even a tenth of a percent error in what the model says yesterday’s numbers should have been. The model’s worst error in the past four days, as the numbers have doubled, has been 1%.

Previous Projections

On March 19, I made my first public post about my modeling efforts. That day, the cumulative number of reported cases of Covid-19 in the United States was 13,677. That is according to data released each day by Johns Hopkins University, on which I have been basing all of my modeling work.

When I refer to “cases” below, it is from this same data set, provided as a CSV file from Johns Hopkins. They’ve done the public a great service by making such a valuable resource freely and openly available. I’m releasing everything from my own 70+ hours of amateur work to the public as well, in a small way of following that spirit.5 The Nightmare Plot is dedicated to the public domain; you can copy it without restriction, though I hope you will leave the identifier there that urges reference to my disclaimer language.

I was using a non-differential form of the logistic growth model, with four parameters: an upper reproduction limit L, the growth exponent k (I now call it r, because the programmer in me hates using k for anything but an index variable), plus two of my own, a time-shifting parameter t0 and a linear term a (I now call it b). The function for the modeled number of total cases, cumulative, on a given day t (units are in the number of days after 1/22/20) was f(t)=L/(1+exp(-k*(t-t0)))+a*t.

With those parameters evolved for a best fit with the 3/19 data, using the well-known differential evolution algorithm implemented by my ade Python package, that model predicted that there would be 429,414 cases today. It was 76% too pessimistic; we “only” have about 18 times as many cases as then, not 31.

On March 22, a week and a half ago when there were 33,272 U.S. reported cases, I posted again, this time about results from modeling with a linear combination of the regular logistic growth component, a power-law component with exponential decay, and the linear term (constant differential) b fitted to that day’s Johns Hopkins data. The combination model projected that there would be around 350,000 cases today. That was around 40% too high.

Thankfully, there has been some flattening of the curve since then.6 I wasn’t really seeing it show up in my other models, though. My most recent blog post before the one you are now reading, from March 26, was still pessimistic, with a projection of 359,169 cases for today.

I was waiting and hoping for the model to continue to be shown pessimistic due to some “fancy theoretical factors that I didn’t even bother trying to understand finally emerge in a few days (hell, how about tomorrow?),” and for that curve to finally get “its long-awaited downward bend.” That has happened to some extent, and hopefully the open-source code that runs my latest (and hopefully last!) model will keep updating to an even lower growth rate. Making pretty plots of that happy circumstance is left as an exercise to the Python-savvy reader.

A Final Note

Awful implications aside in a time of global crisis, there is a certain illicit thrill to doing my own naive research and thinking in a field supposedly best left to the experts, enjoying the faint flicker of possibility that I might be seeing the equations slide into place in a way that all their formal training somehow missed.

But, this is as good a time as any to repeat, yet again, that I have no expertise in biology, medicine, or infectious disease. See all the disclaimer stuff in my previous posts. I’m just a retired engineer who has always had a passion for modeling interesting data. (I’m really looking forward to releasing a free, open-source Python package that does some really cool modeling of circuits using power MOSFETs, and I hope that makes the free Ngspice circuit simulator a lot more interesting to people.)

Also worth repeating:

The model is just the integration of a first-order differential equation that can be extrapolated to predict reported U.S. cases of Covid-19 if the reported cases continue as they have for the past two weeks.

Be well. And, please, as a fellow citizen who is more worried right now about the health of our Republic than that of my body, remember when you vote in November how much your President fucked things up and how he will have doubtless continued to behave like an idiot wannabe dictator in the months ahead.7 Personally, as I contemplate a near future consisting mostly of hiding away from friends, family, and neighbors to avoid a disease that otherwise could very well drown me in my own lungs, I am not quite yet sick of winning.

Notes


  1. Modified by me with an additional ts parameter. It shifts the start of both growth and decay to some best-fit date well after the first reported case in the Johns Hopkins data set, on 1/22/20.

  2. With every single model I’ve tried thus far, including this latest and hopefully final one, the parameter evolution favors smaller and smaller values of b tending toward zero. 

  3. The careful reader may notice that I always refer to “reported U.S. cases” or some alternate variation and wording. I will repeat yet again that I have no expertise in biology, medicine, or the spread of infectious disease, and so I try not to speculate on how many of our fellow citizens have actually gotten infected with this thing without it ever being reported. Again, I’m just a retired engineer who has spent years constructing nonlinear models of mostly electronic things and believes this one is pretty well grounded for doing the following, and only the following: predicting what will happen if the data continues as it has recently, especially as it has in the past week or two.

  4. You can create these plots for yourself from the .dat files in the ade/examples directory, or from a .dat file you generate yourself. Just run the pv command that gets installed when you run pip install ade. There are some important command-line options; run pv -h for details.

  5. My ade Python package, of which the covid19.py file is a part, is free to download, run, modify, and incorporate into other work under the terms of the Apache license.

    The terms of use for the Johns Hopkins data (and I expressly incorporate by reference their disclaimer language into this and all my other Covid-19 related blog posts, obviously with my name substituted for theirs) are as follows:

    “This GitHub repo and its contents herein, including all data, mapping, and analysis, copyright 2020 Johns Hopkins University, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources, that do not always agree. The Johns Hopkins University hereby disclaims any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.” 

  6. At that point, I had not transitioned to modeling the number of new cases each day and then integrating the model’s first-order differential equation into a plot of cumulative cases each day. Instead, a nonlinear regression version of the logistic growth model was fitted to cumulative case numbers. 

  7. To say nothing of his pathological narcissism being the driving force for his every decision, from downplaying the significance of the pandemic just weeks ago and unleashing the wave of infection now emerging in MAGA country, to his demands that governors be nice to him before getting much-needed (and taxpayer-funded) federal assistance.