# Estimating the proportion of Corona cases

This short note makes one simple point. If you are interested in estimating the proportion of Corona infected people in some country or region, there is a simple and better (more precise) estimate than the one you obtain by computing the sample proportion. You can also read this in German here (and here).

Setup

Consider taking a (completely) random sample of $n$ individuals in some population in order to estimate the proportion of people in this population who have the Corona virus. Let $p$ denote this true proportion. I here assume that we already know, through potentially non-random medical testing, that there is a certain fraction $q$ of the population who definitely have the virus (or have had it). I will refer to these people as those that were declared to have the virus. I assume that whatever medical test was used to obtain this number was perfect, at least in one direction: anyone who has been declared to have the virus this way also actually has it. As, thus, necessarily $q \le p,$ we can write $p=\mu q,$ where we interpret $\mu \ge 1$ as the multiplier or ratio of actual virus cases relative to the declared virus cases. I am here interested in estimating $\mu$ from the random sample knowing $q.$ If we have an estimate for $\mu$ we get one for $p$ by multiplying the $\mu$-estimate with $q$.

When we take the random sample, we collect two pieces of information from each person. One, we check (again, for the sake of simplicity, with a perfect medical test) whether or not they have the virus. Two, we ask them (and the subject answers truthfully) whether they have already been declared as having the virus. I will call $X$ the total number of virus cases in the sample and $Y \le X$ the total number of already declared virus cases in the sample.

Estimator

Many people would probably be tempted to use $\hat{p}=\frac{X}{n}$ as the standard estimator for $p,$ and, thus, indirectly $\hat{\mu}_S=\frac{X}{qn}$ as the standard estimator for $\mu$. It turns out that there is a better estimator that uses all available information. Let me call it the alternative estimator $\hat{\mu}_A$. It is given by $\hat{\mu}_A=1+\frac{X-Y}{qn}.$

In the Appendix below I derive (in a few simple steps) this estimator as an approximation of the maximum-likelihood estimator for the present problem. It, therefore, does have all the nice properties that maximum likelihood estimators have. But even if you are a maximum likelihood skeptic, we can actually just directly compare the precision (for all sample sizes) of the two estimators, by looking at their variances.

First note that, like the standard estimator, the alternative estimator is unbiased as $\mathbb{E}\left[\hat{\mu}_A\right]=1+\frac{\mu qn - qn}{qn} = \mu.$

The variance of the two estimators are $\mathbb{V}\left[\hat{\mu}_S\right] = \frac{\mu(1-\mu q)}{qn} \approx \frac{\mu}{qn},$
and, as $X-Y$ is binomially distributed with number of trials $n$ and success probability $\mu q \left(1-\frac{1}{\mu}\right),$ $\mathbb{V}\left[\hat{\mu}_A\right] = \frac{(\mu-1)(1-q(\mu-1))}{qn} \approx \frac{\mu-1}{qn},$
where the approximation is good when $q$ and $\mu$ are sufficiently small.

In this case the ratio of the two variances is given by $\frac{\mathbb{V}\left[\hat{\mu}_A\right]}{\mathbb{V}\left[\hat{\mu}_S\right]} = \frac{\mu-1}{\mu} < 1.$

Thus, especially, if $\mu$ is not much larger than 1, the alternative estimator is quite a bit more precise. Note also, that the alternative estimator can never be below 1.

Austrian Corona cases

In Austria, from 1st to 6th of April, a random sample of $n=1544$ was checked for the Corona virus. I will here ignore the disturbing sample selection problem that actually 2000 people were supposed to participate and 456 did not participate. Of those who participated the number of cases found, $X,$ was 5 and the number of already declared cases among them, $Y,$ was either 2 or 3. There was some weighting in these numbers which I am not fully informed about. I will ignore these issues here, but at least will look at both cases for $Y.$ At the same day the proportion $q=1/758$ (11383 declared cases among 8,636.364 people in Austria).

Using the, here also easily applicable, Clopper-Pearson method to compute 95\% confidence bounds, we get the following estimates and bounds derived from the two different estimators. $\begin{array}{c|ccc} & \hat{\mu}_S & \hat{\mu}_A (Y=3) & \hat{\mu}_A (Y=2) \\ \hline \mbox{estimate } \mu & 2,46 & 1,98 & 2,47 \\ \mbox{lower bound } \mu & 0,87 & 1,12 & 1,30 \\ \mbox{upper bound } \mu & 5,72 & 4,54 & 5,30 \\ \mbox{lower bound cases } & 9866 & 12738 & 14845 \\ \mbox{estimated cases } & 27.968 & 22570 & 28164 \\ \mbox{upper bound cases } & 65126 & 51726 & 60331 \\ \end{array}$

As you can see, the confidence bounds are much narrower for the alternative estimator than for the standard estimator.

A Thought

If we could assume, which sadly we often probably cannot, that the proportionality factor $\mu$ is the same in all regions of interest, while $q$ is observably not, then one could take a specific random sample that would even be much better than a random sample of all people. In Austria, for instance, the $q$ for Landeck in Tirol is about $q_L=1/50,$ while in Neusiedl am See in Burgenland it is about $q_N=1/1000.$

Then a random sample of people in Landeck would produce a much more precise estimate for $\mu$ than a random sample of people in Neusiedl. The variance for the Neusiedl estimator would be 20 (the ratio of $q_L/q_N$) times as large as that for Landeck.

Another Thought

Of course, there is nothing specific about the setup here that makes it only applicable to counting virus cases. This estimator could be used in all cases in which we are interested in the true proportion of some attribute A in some population, when we know that only A’s can also have attribute B and we know how many B’s there are. Looking at it like that I am sure this estimator is known. So I am here just reminding you all about it.

Appendix

We here derive the alternative estimator as an approximation to the maximum likelihood estimator. Taking a truly random sample, we know that $X$ is binomially distributed with number of trials $n$ and success probability $\mu q.$ Conditional on $X$ we know that $Y$ is binomially distributed with number of trials $X$ and success probability $\frac{1}{\mu}.$ The likelihood function is, therefore, given by $\mathcal{L}(\mu;X,Y) = {n \choose X} (\mu q)^X\left(1-\mu q\right)^{(n-X)} {X \choose Y} \left(\frac{1}{\mu}\right)^Y \left(1-\frac{1}{\mu}\right)^{(X-Y)}.$

The log-likelihood function is then proportional to $\ell(\mu;X,Y)=X \ln(\mu q) + (n-X) \ln\left(1-\mu q\right) - Y\ln(\mu) + (X-Y) \ln\left(1-\frac{1}{\mu}\right).$

The maximum likelihood estimator, thus, has to satisfy $\begin{array}{lll} \frac{X}{\mu q}q + \frac{n-X}{1-\mu q} (-q) - \frac{Y}{\mu} + \frac{X-Y}{1-\frac{1}{\mu}} \frac{1}{\mu^2} & = & 0 \\ \frac{X-Y}{\mu} - \frac{q(n-X)}{1-\mu q} + \frac{X-Y}{\mu(\mu-1)} & = & 0. \end{array}$

If $\mu q$ is small, we can approximate $1-\mu q$ by 1. We then get $\mu=1+\frac{X-Y}{q(n-X)}.$
If $X$ is, in expectation, much smaller than $n,$ we can approximate this further to get $\hat{\mu}_A=1+\frac{X-Y}{qn}.$