Science and Data and Freedoms
There are millions of these rants around, so you are under no obligation to read mine. In fact, what I am about to say here should not be taken as anything more than one person’s opinion. OPINION. I have several qualifications (I will get to those in a second), but still, this blog is primarily concerned with, as the title suggests, wrong thought. And yes, thoughts can be flat-out wrong, but that’s another topic for another time, yes? I primarily abandoned this blog when tumblr decided to advocate for censorship, and well, if you don’t think that was very bad thinking, then I can’t help you and you certainly should stop reading now. But mostly, I find myself needing a little bit of a platform to rant, so here it is. This is not for you. This is for me. But maybe, if you read it, and you learn something, then it was a little bit more than that, and that’s entirely unnecessary but I’ll be fine with it. Don’t worry, I’ll keep it a secret.
My qualifications
1) I have a Ph.D. from a major research institution in America. What that means, most importantly, is actual training in how to read and understand academic writing.
2) I teach statistics, among other things, and I teach in a public health college at another major research institution in America.
3) I work with epidemiologists, though I don’t claim that title myself (I describe myself as a psychometrician with an expertise in educational measurement), and I am currently working on several projects using epidemiological methods.
4) A portion of my work in educational measurement focuses on critical thinking, particularly the development of critical thinking and problem-solving skills.
Premises
So, let’s organize this in a logical manner. To do so, we generally start with a series of premises. Here are some of mine.
1) Most people are afraid of dying.
2) The fear of dying plays some part in how people live their lives.
3) People are willing to make some tradeoffs between Safety and Liberty.
4) There is an inverse relationship between Safety and Liberty. The more liberty, the less safety. This is only a unidirectional inverse relationship (as liberty ascends, safety decreases), and NOT true in the opposite direction (as safety ascends, liberty must decrease). This is VERY IMPORTANT.
5) People are poor estimators of their own odds of death, and especially how certain events (say, getting drunk at a party or smoking a hallucinogenic drug or driving recklessly) contribute to their risk of death.
6) There is much unknown about the “novel coronavirus” or SARS-CoV-2 or Covid-19 (use whatever term you are comfortable with, the distinction between all of these is arbitrary and unimportant... the root of communication is exchange of messaging between two parties, and all these terms work fine in most cases, since we’re hardly in a lab where it is very important to separate out disease, virus, symptoms, and classifications).
7) Action has been taken by governments and individuals exceeding their statutory authorities.
8) Some of the actions taken by governments and individuals make no difference in the ability of people to live disease-free, but do have other impacts.
9) The “other impacts” in Premise 8 can directly cause loss of life, as well as other ramifications (lack of social mobility, inability to secure safe food supplies, increase in spousal/partner/child abuse, lack of ability to achieve an education, etc.) that have social and personal consequences for potentially many years, if not generations. This is the most controversial premise, because it has a tendency to operate on some slippery-slope type logic, which is exactly what I am going to rant against in a second. Be wary of this one! But it is important too.
Statistical Problem #1: Never Believe a Point Estimate
If you take (my) Stats101 class, and hopefully anybody else’s similar course, one thing that should be a key takeaway is “NEVER BELIEVE A POINT ESTIMATE.” That’s huge. Never. Believe. A. Point. Estimate.
So, for the people who haven’t had a Stats class recently, what is a point estimate?
When you see something like “an estimated 2.2 million Americans will die from the coronavirus if action is not taken,” that “2.2 million” is a point estimate. It is a single point. And point estimates are a hallmark of bad reporting of often bad science. In statistics, any time we make an estimate, we generate a confidence interval: that is, the range around which we believe that estimate to be actually correct. This is because we don’t measure everybody; we measure a small sample, and use math to make estimates. Since we didn’t measure everybody, there is some degree of uncertainty, and so we calculate a range that we think is very likely to contain the actual number. This is called a confidence interval. The wider the confidence interval, the LESS confident you are. The narrower the confidence interval, the more confident you are.
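To make the "wider means less confident" idea concrete, here is a minimal sketch of a confidence interval for a simple proportion. The sample sizes are hypothetical, and the formula is the plain textbook normal-approximation (Wald) interval, which is an assumption on my part and not what any particular model or paper used.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence. This is a classroom
    formula chosen for illustration only.
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical samples: same observed rate (3%), very different sizes.
small = proportion_ci(3, 100)
large = proportion_ci(300, 10_000)

small_width = small[1] - small[0]
large_width = large[1] - large[0]

print(f"n=100:    {small[0]:.4f} to {small[1]:.4f} (width {small_width:.4f})")
print(f"n=10,000: {large[0]:.4f} to {large[1]:.4f} (width {large_width:.4f})")
```

Same point estimate both times, but the small sample gives a much wider range. That width is exactly the information a bare point estimate throws away.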
An example. The New York Yankees hit 306 home runs last year, and had 5561 at-bats over 162 games, meaning they hit a home run about once every 18 at-bats. Let’s say I believe the season will be cut in half (so, 81 games instead of 162). So, I want to know how many home runs the Yankees will hit in this shortened season. Let’s work through several examples.
The worst example (okay, not actually the absolute worst, because I could just guess, but pretty bad).
In half the games, the Yankees will hit half the home runs. So that’s 306/2, so that’s 153.
Here’s another BAD example, but it does look legit, doesn’t it?
Half of 162 is 81. So in half the games, they will have half the at-bats, so that’s 2780.5 at-bats. They hit a home run previously in 5.5026% of their at-bats, and 5.5026% of 2780.5 is 153. The Yankees will hit 153 home runs next year.
A much better example
The Yankees averaged 1.8888 home runs a game (306 / 162) last season. If we take the low-end of 1.5 home runs per game (or three home runs every two games), and a high end of 2.25 home runs per game (or 9 home runs every 4 games), we expect the Yankees to hit between 121.5 and 182.25 home runs in the shortened 81 game season.
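Here is that arithmetic as a short sketch. The 1.5 and 2.25 bounds are the judgment-call rates from the paragraph above, not a computed confidence interval, and the variable names are just mine.

```python
HOME_RUNS_LAST_SEASON = 306
GAMES_LAST_SEASON = 162
SHORT_SEASON_GAMES = 81

# The point estimate: last year's per-game rate times the shorter schedule.
rate = HOME_RUNS_LAST_SEASON / GAMES_LAST_SEASON
point_estimate = rate * SHORT_SEASON_GAMES

# The range: judgment-call low and high per-game rates from the text.
LOW_RATE, HIGH_RATE = 1.5, 2.25
low = LOW_RATE * SHORT_SEASON_GAMES
high = HIGH_RATE * SHORT_SEASON_GAMES

print(f"rate per game: {rate:.4f}")
print(f"point estimate: {point_estimate:.1f}")
print(f"range: {low} to {high} (width {high - low})")
```

The point estimate sits inside the range, but only the range tells you how much room there is to be wrong.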
Is there a perfect example?
No. This is a great question. Introductory statistics students will start to add all sorts of great considerations to this question: in the shortened season, won’t pitchers have less time to get warmed up, so home runs will go up? But the same is true for batters, so home runs might go down? If the shortened season starts later, and is played in colder weather, are there fewer home runs? How did the Yankees roster change? Are they playing against more fly-ball or ground-ball pitchers? Who changed in the rotations of the teams they will play most? Will the rule change about facing three batters or the end of an inning increase the number of home runs? What about conditioning of athletes who are homebound? No statistical estimate can take into account all factors. And we don’t try to. We just play the games and then call it history.
So, what are the problems with the “much better example” besides not adding in all those other things?
There is nothing wrong with it, it is just not very precise. A range between 121.5 and 182.25 is more than 60, which is basically half of the low-end. We could be like, 50% wrong from our low end and still be in the range! That’s not very precise!
So, what does this have to do with the current issues?
Mostly, I want you to very carefully consider any number you hear without a confidence interval. If you hear a number like “2.2 million,” realize that without a stated confidence interval, the interval could be ANYTHING. Something like, oh, I don’t know... 2.199 million. Yep. In other words, the only thing you could take away from that number is “anywhere between 1 person and 5 million people.” And how much are you willing to give up for that particular risk?
Statistical Problem #2: Confidence Intervals WITHIN models
So, to this point, hopefully I’ve described all the things that can go wrong if you don’t use a confidence interval in your ANSWER. But what about in the MODEL (or the prediction) itself? Let’s say that, in the above example, we wanted to know how many home runs the Yankees will hit, and we know that MLB will shorten the season. But we don’t know by how much.
So, let’s say that I estimate the season will be between 60 and 100 games. That’s a pretty big margin. Using my earlier estimates, now my confidence interval expands again: 1.5 x 60 for the low end is only 90 home runs, and 2.25 x 100 is 225 home runs! Now my range is [90:225]. That is VERY imprecise!
The important part is that this problem compounds each time we don’t know something. You get a wider and wider range, the less you know. So, the more you want to put into a formula, the more you need to know... and the less you know, the wider your estimate.
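A minimal sketch of that compounding, using the home-run numbers from this post. The function name is hypothetical, and multiplying low by low and high by high is the crude bounding approach used in the text, not a formal way to propagate uncertainty.

```python
def home_run_range(rate_range, games_range):
    """Crude bounds: lowest rate x fewest games, highest rate x most games."""
    low = rate_range[0] * games_range[0]
    high = rate_range[1] * games_range[1]
    return low, high

RATE_RANGE = (1.5, 2.25)  # home runs per game, judgment-call bounds

# Season length known exactly (81 games) versus only known as a range.
known_length = home_run_range(RATE_RANGE, (81, 81))
unknown_length = home_run_range(RATE_RANGE, (60, 100))

known_width = known_length[1] - known_length[0]
unknown_width = unknown_length[1] - unknown_length[0]

print(f"81 games:        {known_length}, width {known_width}")
print(f"60 to 100 games: {unknown_length}, width {unknown_width}")
```

One extra unknown more than doubles the width of the range. Every additional uncertain input does the same kind of thing.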
Statistical Problem #3: The Missing Denominator
None of the math here is particularly difficult, especially with the aid of computers and a bit of training. So, if somebody is presenting it to you like it is super complex, think of them like a stage magician: distract, watch the glitter, and you will never notice my hand pulling the pigeon out of my coat pocket and putting it into my hat.
So, what have models been hiding from you?
The big missing piece is the denominator, or in this case, “how many people have the virus.” That’s a VERY important number. We need several things to build an epidemiological model, and without even an estimate of “how many people have it,” then all the rest of this is pretty much pointless. This is because “how many people have it” is needed for at least the following:
1) Transmission Rate
2) Infection Rate
3) Fatality Rate
Luckily... we’re actually getting close to having that number! Or at least, a confidence interval for that number.
Understanding recent data
https://www.medrxiv.org/content/10.1101/2020.04.14.20062463v1.full.pdf
Basically, that paper says that in one county with a lot of cases, they estimate there are somewhere between 2.49% and 4.16% of the population infected, and they wouldn’t be surprised if those numbers are between 1.80% and 5.70%. There are about 1.93 MILLION people in Santa Clara county. 1,930,000, and between 2.49% and 4.16% are ALREADY infected. So, let’s math that out, and I’m using their narrower confidence interval here.
Low End (2.49%): 48057 already infected
High End (4.16%): 80288 already infected.
So, now we have an actual denominator! Or at least, RANGES of one. They’re pretty confident the actual number is somewhere between those.
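If you want to check the multiplication yourself, here is a short sketch. It uses the population figure and the prevalence percentages quoted above; the helper name is mine, and the wider interval is included only for comparison.

```python
SANTA_CLARA_POPULATION = 1_930_000

def infected_count(prevalence_percent, population=SANTA_CLARA_POPULATION):
    """Convert a prevalence percentage into a head count."""
    return round(population * prevalence_percent / 100)

# Narrower interval quoted from the paper.
narrow = (infected_count(2.49), infected_count(4.16))
# Wider interval quoted from the paper, shown for comparison.
wide = (infected_count(1.80), infected_count(5.70))

print(f"narrow interval: {narrow[0]:,} to {narrow[1]:,}")
print(f"wide interval:   {wide[0]:,} to {wide[1]:,}")
```

The wider interval stretches the denominator quite a bit in both directions, which is worth keeping in mind for everything computed from it.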
The date is important here. The data here is April 1. That range (48000-80000) is the number of infected people as of April 1. As of April 17th (over two weeks later), Santa Clara had reported 73 deaths. 63 of those had one comorbidity, and only 5 had no comorbidities. Here’s the source.
https://www.sccgov.org/sites/covid19/Pages/dashboard.aspx
So, what’s the fatality rate?
LOW pop prev: No comorbidities: 5 / 48000 = .0001041666.
LOW: One or no comorbidities: 68 / 48000 = .00141666
HIGH pop prev: No comorbidities: 5 / 80200 = .000062344
HIGH: One or no comorbidities: 68 / 80200 = .00084788
We’ll go broad here, and assume one comorbidity. Hey, a lot of us have something that is an issue, right? But let’s apply those number to the American Population of approximately 330,000,000 people.
LOW (zero or one comorbidity) pop prev: 330mil * .00141666 = 467,497.8
HIGH (zero or one comorbidity) pop prev: 330mil * .00084788 = 279,800.4
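The same calculation as a sketch, so you can swap in your own numbers. It uses the rounded infected counts and the death counts quoted above. Note the big assumption baked into the last step: multiplying by 330 million treats the Santa Clara ratio as applying to every single American, which is how the figures above were produced.

```python
US_POPULATION = 330_000_000

# Santa Clara County deaths as of April 17, from the figures above.
DEATHS_ZERO_OR_ONE_COMORBIDITY = 68  # 5 with none + 63 with one

# Rounded infected counts used in the text (narrower interval).
INFECTED = {"low_prevalence": 48_000, "high_prevalence": 80_200}

def fatality_rate(deaths, infected):
    """Deaths divided by estimated infections (the denominator)."""
    return deaths / infected

rates = {
    name: fatality_rate(DEATHS_ZERO_OR_ONE_COMORBIDITY, count)
    for name, count in INFECTED.items()
}

# Assumption: the county-level ratio is applied to the whole US population.
projected = {name: rate * US_POPULATION for name, rate in rates.items()}

for name in INFECTED:
    print(f"{name}: rate={rates[name]:.8f}, projected={projected[name]:,.0f}")
```

Run without truncating the rates, this lands at roughly 467,500 and 279,800, within a few deaths of the hand-calculated values. Notice that the smaller denominator produces the bigger death estimate, which is why the denominator matters so much.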
There’s your number. WOW, you say! Wow! A QUARTER TO HALF A MILLION PEOPLE MIGHT DIE! That seems shocking!
It is, super shocking. Remember, that’s the zero-case scenario. The scenario where we do nothing. Worst-case. No vaccine, no medication, no treatment, no social distancing, nada.
Oh, let’s go ahead and go over some other numbers. Not scenarios, actual data.
Motor Vehicle Deaths (2018): 36,560
Medical Error Deaths (2011): Between 210,000 and 400,000
https://journals.lww.com/journalpatientsafety/Fulltext/2013/09000/A_New,_Evidence_based_Estimate_of_Patient_Harms.2.aspx
Accidents (2017): 169,936
Diabetes (2017): 83,564
Influenza/Pneumonia (2017): 55,672
Suicide/Self-harm complications: 47,173
https://www.cdc.gov/nchs/fastats/deaths.htm
((Note, because somebody will inevitably ask: The “Death by Guns” rate is a tough one to count, because the majority of gun deaths are also suicides. The Gun Homicide+Accident fatality rate is likely between about 10,000 and 13,000 per year (about a third of the car accident fatality rate). If you’re interested in that number, be sure to look at the data split by category, or if you are interpreting suicides with guns in your gun death count, just be explicit about it, don’t be a pigeon-holding magician.))
Interpretation: Doing nothing at all, we would expect Covid to jump the rates of Influenza/Pneumonia deaths from 7th to 3rd in America, with somewhere between about 340,000 and 530,000 deaths. I arrive at that number by adding 60,000 to the estimates above, for other non-Covid related Influenza/Pneumonia deaths. That would put Influenza/Pneumonia above the estimates of death due to medical errors, and well behind the two leading causes of death in the US (CVD and Cancer). This is provided that there is no emergent medical option.
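For anyone following along with a calculator, here is that last step as a sketch. It takes the two projected figures from earlier in the post, adds the 60,000 allowance, and compares the low end against a few of the other causes listed above. The dictionary is just my transcription of those listed figures.

```python
# Projected Covid deaths from the calculation earlier in the post.
COVID_LOW, COVID_HIGH = 279_800.4, 467_497.8

# Allowance used in the text for other Influenza/Pneumonia deaths.
OTHER_FLU_PNEUMONIA = 60_000

total_low = COVID_LOW + OTHER_FLU_PNEUMONIA
total_high = COVID_HIGH + OTHER_FLU_PNEUMONIA

# A few of the single-number causes listed above, for comparison.
LISTED_CAUSES = {
    "Motor Vehicle Deaths (2018)": 36_560,
    "Accidents (2017)": 169_936,
    "Diabetes (2017)": 83_564,
    "Suicide/Self-harm complications": 47_173,
}

exceeded = [name for name, deaths in LISTED_CAUSES.items() if total_low > deaths]

print(f"combined range: {total_low:,.0f} to {total_high:,.0f}")
print("low end exceeds:", ", ".join(exceeded))
```

Rounded off, that is the "about 340,000 to 530,000" range quoted above.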
So, what’s the downside? Why not do all these drastic things (like shelter-in-place orders and be forced to shut down your business) if it prevents between 1/4 and 1/2 of a million deaths?
That’s a good question! The point here is that orders have consequences, and most of them are unknown at the time of the order. For example, let’s take a pretty simple policy: requiring every driver to carry car insurance. Seems like a fundamental thing, right? Well, now you’ve also driven the price of car ownership up. More rural areas (which are often poorer) now have an additional cost burden, that is not shared by people who live in major cities with large public transportation networks. And you’ve created a secondary market (insurance agents) who now have incentives to raise prices, and huge potential for collusion. And what about people who defy that order? Well, that’s tricky-- in some places there are additional policies for covering wrecks involving uninsured drivers, and in those places, car insurance costs more. So you’re paying more, out of your pocket, because somebody else didn’t follow a policy. And that means you have less money to go shopping or go out to eat, which means fewer people at stores have jobs. All of this ties together.
So, what are the unintentional consequences of the shelter-in-place and business-shuttering orders? The most obvious ones are the losses of income, including jobs, and the 10 million accompanying jobless claims. But is that such a big problem? Think about what is happening in homes without jobs... and remember, you are still legally required to pay car insurance. So that’s the direct one.
But there are multitudes of indirect ones. For example, this is not an academic article, but...
https://www.usatoday.com/story/news/investigations/2020/03/21/coronavirus-pandemic-could-become-child-abuse-pandemic-experts-warn/2892923001/
And remember, a lot of children who are subjects of abuse are from low-income families. And what did they normally get? Free and reduced-price lunch at schools. Now, they aren’t getting those. Sure, in a few places here and there, some schools are delivering similar meals. But the vast, vast majority of elementary and high-school aged students on free/reduced lunches are not getting them. So that leaves parents (or caretakers) to pick up the burden. Those same parents and caretakers who are filing the 10 million unemployment claims. Uh-oh. Sounds stressful.
Guess what stress does to people? It makes them sick. And you know what happens when you get an ulcer? Hopefully not much, but bad ones can end you up in a hospital. Where there are many procedures, but most of them minor. Unfortunately, hospitals right now are being forbidden from doing elective surgeries. And elective surgeries helped pay for other services, like necessary surgeries and emergency care. So, the ER is literally understaffed, even in regions where there are no COVID patients, because the state has forbidden the tummy tucks that pay the salaries of ER nurses.
You see the tumble here? This is where I cautioned earlier about the slippery slope argument, and it is an absolutely valid critique of what I’m putting here. But we’ve gone past speculation territory and are now in data territory. And (again, I work in health care education), I know some people who are starting to see these effects. One of the faculty at my school (teaches our Law course) is a lawyer for a rural hospital service. He has watched them lay off or furlough over 60% of workers. And they have had... wait for it... 0 covid cases. The few that were suspected, they flew down to a much larger hospital. At high cost, because they can’t charge for COVID services.
Meanwhile, you’re talking a rural system that was one of the top employers in four different counties. Laying off or furloughing 60% of workers. The guy was so upset telling me about this that he almost cried, especially because he knew the families of so many of the people his board had just let go.
Any caveats to add?
The big caveat that I place on the interpretation here (basically, that we’ve VASTLY oversold the risk of this thing) is that we don’t know about secondary infections. If you can get infected twice, and that second infection is harmful or makes you able to spread the disease to others who are then harmed, then all these numbers are too low.
Bottom-line it for me, WT.
Fear leads to the dark side, where you have no freedoms. Don’t give up things because you were scared and because somebody showed you a point of data that you should not believe.