This short note makes one simple point. If you are interested in estimating the proportion of Corona-infected people in some country or region, there is a simple estimator that is better (more precise) than the sample proportion. You can also read this in German here (and here).
Setup
Consider taking a (completely) random sample of $n$ individuals in some population in order to estimate the proportion of people in this population who have the Corona virus. Let $p$ denote this true proportion. I here assume that we already know, through potentially non-random medical testing, that there is a certain fraction $q$ of the population who definitely have the virus (or have had it). I will refer to these people as those that were declared to have the virus. I assume that whatever medical test was used to obtain this number was perfect, at least in one direction: anyone who has been declared to have the virus this way also actually has it. As, thus, necessarily $q \le p$, we can write $p = xq$, where we interpret $x \ge 1$ as the multiplier or ratio of actual virus cases relative to the declared virus cases. I am here interested in estimating $x$ from the random sample knowing $q$. If we have an estimate for $x$, we get one for $p$ by multiplying the $x$-estimate with $q$.
When we take the random sample, we collect two pieces of information from each person. One, we check (again, for the sake of simplicity, with a perfect medical test) whether or not they have the virus. Two, we ask them (and the subject answers truthfully) whether they have already been declared as having the virus. I will call $X$ the total number of virus cases in the sample and $Y$ the total number of already declared virus cases in the sample.
Estimator
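Many people would probably be tempted to use the sample proportion $X/n$ as the standard estimator for the true proportion $p$ and, thus, indirectly $\hat{x}_S = X/(nq)$ as the standard estimator for the multiplier $x$. It turns out that there is a better estimator that uses all available information. Let me call it the alternative estimator $\hat{x}_A$. It is given by
$$\hat{x}_A = 1 + \frac{X - Y}{nq}.$$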
In the Appendix below I derive (in a few simple steps) this estimator as an approximation of the maximum-likelihood estimator for the present problem. It, therefore, does have all the nice properties that maximum likelihood estimators have. But even if you are a maximum likelihood skeptic, we can actually just directly compare the precision (for all sample sizes) of the two estimators, by looking at their variances.
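As a minimal sketch, the two estimators can be written as short Python functions. The function names and the numbers in the example are hypothetical and chosen only for illustration; `X` is the number of virus cases in the sample, `Y` the number of already declared cases in the sample, `n` the sample size, and `q` the known declared fraction.

```python
def standard_estimator(X, n, q):
    """Standard estimator of the multiplier x: sample proportion X/n divided by q."""
    return X / (n * q)


def alternative_estimator(X, Y, n, q):
    """Alternative estimator 1 + (X - Y) / (n q); it cannot fall below 1 since Y <= X."""
    if not 0 <= Y <= X <= n:
        raise ValueError("need 0 <= Y <= X <= n")
    return 1 + (X - Y) / (n * q)


# Hypothetical illustration (not data from the text).
n, q = 1000, 0.005   # sample size and known declared fraction
X, Y = 10, 4         # virus cases and declared cases found in the sample
print(standard_estimator(X, n, q))
print(alternative_estimator(X, Y, n, q))
```

The alternative estimator only counts the undeclared cases $X - Y$ in the sample and adds the declared ones back in as the known constant 1.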
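First note that, like the standard estimator $\hat{x}_S = X/(nq)$, the alternative estimator $\hat{x}_A = 1 + (X-Y)/(nq)$ is unbiased, as $E[X] = nxq$ and $E[Y] = nq$ give
$$E[\hat{x}_A] = 1 + \frac{nxq - nq}{nq} = x.$$
The variances of the two estimators are
$$\mathrm{Var}(\hat{x}_S) = \frac{\mathrm{Var}(X)}{n^2 q^2} = \frac{x(1 - xq)}{nq} \approx \frac{x}{nq}$$
and, as $X - Y$ is binomially distributed with number of trials $n$ and success probability $(x-1)q$,
$$\mathrm{Var}(\hat{x}_A) = \frac{\mathrm{Var}(X - Y)}{n^2 q^2} = \frac{(x-1)\left(1 - (x-1)q\right)}{nq} \approx \frac{x-1}{nq},$$
where the approximation is good when $p = xq$ and $q$ are sufficiently small.
In this case the ratio of the two variances is given by
$$\frac{\mathrm{Var}(\hat{x}_A)}{\mathrm{Var}(\hat{x}_S)} \approx \frac{x-1}{x}.$$
Thus, especially if $x$ is not much larger than 1, the alternative estimator is quite a bit more precise. Note also that the alternative estimator can never be below 1.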
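The variance comparison can be checked with a small Monte Carlo simulation. The sketch below assumes each sampled person is independently declared with probability $q$ and infected but undeclared with probability $(x-1)q$. The parameter values ($n = 1000$, $q = 0.01$, $x = 1.5$) are made up for illustration and are not taken from any data.

```python
import random
from statistics import mean, pvariance


def simulate(n, q, x, reps, seed=0):
    """Draw `reps` random samples of size n and return both estimates for each."""
    rng = random.Random(seed)
    p = x * q  # true proportion of infected people
    std, alt = [], []
    for _ in range(reps):
        X = Y = 0
        for _ in range(n):
            u = rng.random()
            if u < q:      # infected and already declared
                X += 1
                Y += 1
            elif u < p:    # infected but not declared
                X += 1
        std.append(X / (n * q))            # standard estimator
        alt.append(1 + (X - Y) / (n * q))  # alternative estimator
    return std, alt


# Hypothetical parameters, for illustration only.
n, q, x = 1000, 0.01, 1.5
std, alt = simulate(n, q, x, reps=2000)
mean_std, mean_alt = mean(std), mean(alt)
var_std, var_alt = pvariance(std), pvariance(alt)
print("means:", mean_std, mean_alt)
print("variance ratio:", var_alt / var_std, "vs (x-1)/x =", (x - 1) / x)
```

Both sample means should land near the true $x$, and the simulated variance ratio should be close to $(x-1)/x$, up to simulation noise.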
Austrian Corona cases
In Austria, from 1st to 6th of April, a random sample of $n = 1544$ people was checked for the Corona virus. I will here ignore the disturbing sample selection problem that actually 2000 people were supposed to participate and 456 did not participate. Of those who participated the number of cases found, $X$, was 5 and the number of already declared cases among them, $Y$, was either 2 or 3. There was some weighting in these numbers which I am not fully informed about. I will ignore these issues here, but at least will look at both cases for $Y$. On the same day the proportion $q$ was $11383/8636364 \approx 0.0013$ (11383 declared cases among 8,636,364 people in Austria).
Using the, here also easily applicable, Clopper-Pearson method to compute 95% confidence bounds, we get the following estimates and bounds derived from the two different estimators.
As you can see, the confidence bounds are much narrower for the alternative estimator than for the standard estimator.
A Thought
If we could assume, which sadly we often probably cannot, that the proportionality factor $x$ is the same in all regions of interest, while the declared fraction $q$ is observably not, then one could take a specific random sample that would be even better than a random sample of all people. In Austria, for instance, the $q$ for Landeck in Tirol is about 20 times as large as the $q$ for Neusiedl am See in Burgenland. A random sample of $n$ people in Landeck would then produce a much more precise estimate for $x$ than a random sample of $n$ people in Neusiedl. The variance for the Neusiedl estimator would be about 20 (the ratio of the two values of $q$) times as large as that for Landeck.
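To see where the factor comes from, recall that for small proportions the variance of the alternative estimator is approximately $(x-1)/(nq)$. Writing $q_L$ and $q_N$ for the declared fractions in Landeck and Neusiedl (labels introduced here only for this illustration) and using the same $x$ and the same sample size $n$ in both regions:

```latex
\frac{\mathrm{Var}\left(\hat{x}_A^{\,N}\right)}{\mathrm{Var}\left(\hat{x}_A^{\,L}\right)}
  \approx \frac{(x-1)/(n q_N)}{(x-1)/(n q_L)}
  = \frac{q_L}{q_N}
```

So, under these assumptions, the variance ratio equals the ratio of the declared fractions, whatever the common value of $x$ is.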
Another Thought
Of course, there is nothing specific about the setup here that makes it only applicable to counting virus cases. This estimator could be used in all cases in which we are interested in the true proportion of some attribute A in some population, when we know that only A’s can also have attribute B and we know how many B’s there are. Looking at it like that I am sure this estimator is known. So I am here just reminding you all about it.
Appendix
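We here derive the alternative estimator as an approximation to the maximum likelihood estimator. Taking a truly random sample, we know that $X$ is binomially distributed with number of trials $n$ and success probability $p = xq$. Conditional on $X$, we know that $Y$ is binomially distributed with number of trials $X$ and success probability $q/p = 1/x$. The likelihood function is, therefore, given by
$$L(x) = \binom{n}{X} (xq)^{X} (1 - xq)^{n - X} \binom{X}{Y} \left(\frac{1}{x}\right)^{Y} \left(1 - \frac{1}{x}\right)^{X - Y}.$$
The log-likelihood function is then, up to an additive constant, equal to
$$X \ln(xq) + (n - X)\ln(1 - xq) - Y \ln x + (X - Y)\ln\left(1 - \frac{1}{x}\right).$$
Setting its derivative with respect to $x$ to zero and collecting terms, the maximum likelihood estimator, thus, has to satisfy
$$\frac{X - Y}{x - 1} = \frac{(n - X)\,q}{1 - xq}.$$
If $p = xq$ is small, we can approximate $1 - xq$ by 1. We then get
$$\hat{x} = 1 + \frac{X - Y}{(n - X)\,q}.$$
If $X$ is, in expectation, much smaller than $n$, we can approximate this further to get
$$\hat{x}_A = 1 + \frac{X - Y}{nq}.$$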
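As a numerical sanity check on these approximations, note that the first-order condition $(X-Y)/(x-1) = (n-X)q/(1-xq)$ becomes linear in $x$ after cross-multiplying, which (if my algebra is right) gives $x = \left(X - Y + (n-X)q\right)/\left(q(n-Y)\right)$ exactly. The sketch below compares this solution with the two approximations, using the numbers quoted for the Austrian sample; the function names are hypothetical.

```python
def mle_exact(X, Y, n, q):
    """Solve (X - Y)/(x - 1) = (n - X) q / (1 - x q) for x (linear after cross-multiplying)."""
    return (X - Y + (n - X) * q) / (q * (n - Y))


def mle_small_p(X, Y, n, q):
    """First approximation: replace 1 - x q by 1."""
    return 1 + (X - Y) / ((n - X) * q)


def alternative_estimator(X, Y, n, q):
    """Second approximation: additionally replace n - X by n."""
    return 1 + (X - Y) / (n * q)


# Numbers quoted in the text for the Austrian sample.
n, X = 1544, 5
q = 11383 / 8636364
for Y in (2, 3):
    print(Y, mle_exact(X, Y, n, q), mle_small_p(X, Y, n, q),
          alternative_estimator(X, Y, n, q))
```

For samples in which both $X/n$ and $q$ are small, the three values should agree to within a small fraction of the estimate itself.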