This short note makes one simple point. If you are interested in estimating the proportion of Corona-infected people in some country or region, there is a simple estimator that is better (more precise) than the sample proportion. You can also read this in German here (and here).

**Setup**

Consider taking a (completely) random sample of $n$ individuals from some population in order to estimate the proportion of people in this population who have the Corona virus. Let $p$ denote this true proportion. I here assume that we already know, through potentially non-random medical testing, that a certain fraction $q$ of the population definitely have the virus (or have had it). I will refer to these people as those that were declared to have the virus. I assume that whatever medical test was used to obtain this number was perfect, at least in one direction: anyone who has been declared to have the virus this way also actually has it. As, thus, necessarily $p \ge q$, we can write $p = xq$, where we interpret $x \ge 1$ as the multiplier or ratio of actual virus cases relative to the declared virus cases. I am here interested in estimating $x$ from the random sample, knowing $q$. If we have an estimate for $x$, we get one for $p$ by multiplying the $x$-estimate by $q$.

When we take the random sample, we collect two pieces of information from each person. One, we check (again, for the sake of simplicity, with a perfect medical test) whether or not they have the virus. Two, we ask them (and the subject answers truthfully) whether they have already been declared as having the virus. I will call $X$ the total number of virus cases in the sample and $Y$ the total number of already declared virus cases in the sample.

**Estimator**

Many people would probably be tempted to use $X/n$ as the standard estimator for $p$ and, thus, indirectly $\hat{x}_S = X/(nq)$ as the standard estimator for $x$. It turns out that there is a better estimator that uses all available information. Let me call it the alternative estimator $\hat{x}_A$. It is given by

$$\hat{x}_A = 1 + \frac{X - Y}{nq}.$$

In the Appendix below I derive (in a few simple steps) this estimator as an approximation of the maximum-likelihood estimator for the present problem. It, therefore, does have all the nice properties that maximum likelihood estimators have. But even if you are a maximum likelihood skeptic, we can actually just directly compare the precision (for all sample sizes) of the two estimators, by looking at their variances.

First note that, like the standard estimator, the alternative estimator is unbiased, as

$$E[\hat{x}_A] = 1 + \frac{E[X] - E[Y]}{nq} = 1 + \frac{nxq - nq}{nq} = x.$$

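The variances of the two estimators are

$$\mathrm{Var}(\hat{x}_S) = \frac{\mathrm{Var}(X)}{n^2 q^2} = \frac{x(1 - xq)}{nq} \approx \frac{x}{nq}$$

and, as $X - Y$ is binomially distributed with number of trials $n$ and success probability $p - q = (x-1)q$,

$$\mathrm{Var}(\hat{x}_A) = \frac{\mathrm{Var}(X - Y)}{n^2 q^2} = \frac{(x-1)\left(1 - (x-1)q\right)}{nq} \approx \frac{x-1}{nq},$$

where the approximation is good when $q$ and $p = xq$ are sufficiently small.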

In this case the ratio of the two variances is given by

$$\frac{\mathrm{Var}(\hat{x}_A)}{\mathrm{Var}(\hat{x}_S)} \approx \frac{x-1}{x}.$$

Thus, especially if $x$ is not much larger than 1, the alternative estimator is quite a bit more precise. Note also that the alternative estimator can never be below 1, because $Y \le X$.
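The variance comparison can also be checked by simulation. The following is a minimal sketch, not part of the original analysis: it assumes the standard estimator is $X/(nq)$ and the alternative estimator is $1 + (X-Y)/(nq)$, and the parameter values (`n`, `q`, `x`, number of repetitions) are made up purely for illustration. The function name `simulate_estimators` is hypothetical.

```python
import random
from statistics import mean, pvariance


def simulate_estimators(n, q, x, reps, seed=0):
    """Draw `reps` random samples of size n; return both lists of estimates of x."""
    rng = random.Random(seed)
    p = x * q  # true proportion of infected people
    standard, alternative = [], []
    for _ in range(reps):
        infected = 0  # X: virus cases in the sample
        declared = 0  # Y: already declared cases in the sample
        for _ in range(n):
            u = rng.random()
            if u < q:        # declared, and therefore also infected
                infected += 1
                declared += 1
            elif u < p:      # infected but not declared
                infected += 1
        standard.append(infected / (n * q))
        alternative.append(1 + (infected - declared) / (n * q))
    return standard, alternative


# Illustrative (made-up) parameter values, not taken from any data set.
n, q, x = 200, 0.05, 2.0
standard, alternative = simulate_estimators(n, q, x, reps=5000)

print("mean of standard estimates:   ", mean(standard))
print("mean of alternative estimates:", mean(alternative))
print("simulated variance ratio (alternative / standard):",
      pvariance(alternative) / pvariance(standard))
# Exact ratio of the two binomial variances, before the small-q approximation.
print("exact variance ratio:",
      (x - 1) * (1 - (x - 1) * q) / (x * (1 - x * q)))
```

Both means should come out close to the true `x`, and the simulated variance ratio should be close to the exact one, which in turn approaches $(x-1)/x$ as $q$ gets small.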

**Austrian Corona cases**

In Austria, from the 1st to the 6th of April, a random sample of $n = 1544$ people was checked for the Corona virus. I will here ignore the disturbing sample selection problem that actually 2000 people were supposed to participate and 456 did not participate. Of those who participated, the number of cases found, $X$, was 5 and the number of already declared cases among them, $Y$, was either 2 or 3. There was some weighting in these numbers which I am not fully informed about. I will ignore these issues here, but will at least look at both cases for $Y$. At that time the proportion of declared cases was $q = 11383/8{,}636{,}364 \approx 0.0013$ (11383 declared cases among 8,636,364 people in Austria).

Using the, here also easily applicable, Clopper-Pearson method to compute 95% confidence bounds, we get the following estimates and bounds derived from the two different estimators.

As you can see, the confidence bounds are much narrower for the alternative estimator than for the standard estimator.
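For readers who want to recompute such numbers, here is a hedged sketch using only the Python standard library. It rests on assumptions of mine rather than on the original computation: for the standard estimator the Clopper-Pearson bounds are computed for the binomial count $X$ and divided by $q$; for the alternative estimator they are computed for the binomial count $X - Y$, divided by $q$, and shifted by 1. The helper names are hypothetical, and the weighting issues mentioned above are ignored.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(K <= k) for K ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided confidence bounds for a binomial proportion."""
    def solve(g):
        # Bisection for the root of a function g that is increasing on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(K >= k | p) = alpha / 2; upper bound: P(K <= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    upper = 1.0 if k == n else solve(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


def standard_estimate(X, n, q):
    """Point estimate and bounds for x based on X alone (assumed construction)."""
    lo, hi = clopper_pearson(X, n)
    return X / (n * q), lo / q, hi / q


def alternative_estimate(X, Y, n, q):
    """Point estimate and bounds for x based on X - Y (assumed construction)."""
    lo, hi = clopper_pearson(X - Y, n)
    return 1 + (X - Y) / (n * q), 1 + lo / q, 1 + hi / q


# Numbers as stated in the text: 1544 participants, 5 cases, 2 or 3 declared.
n, X = 1544, 5
q = 11383 / 8636364

print("standard:", standard_estimate(X, n, q))
for Y in (2, 3):
    print(f"alternative (Y={Y}):", alternative_estimate(X, Y, n, q))
```

Each printed tuple is (estimate, lower bound, upper bound) for the multiplier; multiplying by `q` gives the corresponding values for the proportion of infected people.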

**A Thought**

If we could assume, which sadly we often probably cannot, that the proportionality factor $x$ is the same in all regions of interest, while $q$ is observably not, then one could take a specific random sample that would be much better even than a random sample of all people. In Austria, for instance, the $q$ for Landeck in Tirol is about 20 times the $q$ for Neusiedl am See in Burgenland.

Then a random sample of people in Landeck would produce a much more precise estimate for $x$ than a random sample of the same size in Neusiedl. The variance for the Neusiedl estimator would be 20 (the ratio of the two values of $q$) times as large as that for Landeck.
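To spell out where the factor comes from: under the small-$q$ approximation the variance of the alternative estimator is $(x-1)/(nq)$. Assuming the same sample size $n$ and the same multiplier $x$ in both regions (the subscripts $L$ for Landeck and $N$ for Neusiedl are labels I introduce here), everything except $q$ cancels:

$$\frac{\mathrm{Var}(\hat{x}_A^{N})}{\mathrm{Var}(\hat{x}_A^{L})} \approx \frac{(x-1)/(n q_N)}{(x-1)/(n q_L)} = \frac{q_L}{q_N} \approx 20.$$

Put differently, under these assumptions a sample in Neusiedl would have to be about 20 times as large as one in Landeck to reach the same precision.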

**Another Thought**

Of course, there is nothing specific about the setup here that makes it only applicable to counting virus cases. This estimator could be used in all cases in which we are interested in the true proportion of some attribute A in some population, when we know that only A’s can also have attribute B and we know how many B’s there are. Looking at it like that I am sure this estimator is known. So I am here just reminding you all about it.

**Appendix**

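We here derive the alternative estimator as an approximation to the maximum likelihood estimator. Taking a truly random sample, we know that $X$ is binomially distributed with number of trials $n$ and success probability $p = xq$. Conditional on $X$, we know that $Y$ is binomially distributed with number of trials $X$ and success probability $q/p = 1/x$. The likelihood function is, therefore, given by

$$L(x) = \binom{n}{X} (xq)^{X} (1 - xq)^{n - X} \binom{X}{Y} \left(\frac{1}{x}\right)^{Y} \left(1 - \frac{1}{x}\right)^{X - Y}.$$

Up to an additive constant that does not depend on $x$, the log-likelihood function is then

$$\ell(x) = X \log x + (n - X) \log(1 - xq) - Y \log x + (X - Y) \log\left(1 - \frac{1}{x}\right) = (n - X) \log(1 - xq) + (X - Y) \log(x - 1).$$

The maximum likelihood estimator $\hat{x}$, thus, has to satisfy

$$-\frac{q(n - X)}{1 - \hat{x} q} + \frac{X - Y}{\hat{x} - 1} = 0.$$

If $xq$ is small, we can approximate $1 - \hat{x}q$ by 1. We then get

$$\hat{x} \approx 1 + \frac{X - Y}{q(n - X)}.$$

If $X$ is, in expectation, much smaller than $n$, we can approximate this further to get

$$\hat{x}_A = 1 + \frac{X - Y}{nq}.$$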
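As a sanity check on these approximations, one can maximize the log-likelihood numerically and compare the result with the approximate estimators. The sketch below is illustrative only and makes several assumptions: it takes the log-likelihood to be $(n-X)\log(1-xq) + (X-Y)\log(x-1)$ up to a constant, it uses the Austrian numbers with $Y = 2$, and it includes a closed-form solution of the first-order condition that comes from my own algebra, not from the derivation above. The function names are hypothetical.

```python
from math import log, log1p


def log_likelihood(x, n, X, Y, q):
    """Log-likelihood in x, up to an additive constant (requires 1 < x < 1/q)."""
    return (n - X) * log1p(-x * q) + (X - Y) * log(x - 1)


def maximize_concave(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a concave function on (lo, hi)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# Austrian numbers from the text, taking Y = 2.
n, X, Y = 1544, 5, 2
q = 11383 / 8636364

x_numeric = maximize_concave(
    lambda x: log_likelihood(x, n, X, Y, q), 1 + 1e-9, 1 / q - 1e-9)

# Solving the first-order condition exactly (my own rearrangement).
x_closed = ((X - Y) + q * (n - X)) / (q * (n - Y))
# First approximation: replace 1 - x*q by 1.
x_step1 = 1 + (X - Y) / (q * (n - X))
# Second approximation: replace n - X by n, giving the alternative estimator.
x_alt = 1 + (X - Y) / (n * q)

print("numerical maximizer:    ", x_numeric)
print("closed-form FOC solution:", x_closed)
print("first approximation:     ", x_step1)
print("alternative estimator:   ", x_alt)
```

With numbers of this magnitude the four values agree to about two decimal places, which is what the two approximation steps require.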