Next article: Estimation Crash Course II: Fisher information
Statistics and Estimators: Definitions
Here we focus on the problem of estimating a deterministic parameter, $\theta$ (which can be scalar or vector-valued), given a collection of observations. Since $\theta$ is assumed to be deterministic, we will not assume a prior distribution on it. In other words, we shall follow a frequentist rather than a Bayesian approach here.
We assume that we measure the values of $N$ random variables, $X_1,\ldots, X_N$, taking values in a set $E$ (e.g., $E\subseteq{\rm I\!R}$ or $E\subseteq{\rm I\!R}^n$), which are typically independent and identically distributed (iid) according to a parametric distribution with pdf $p_\theta$. This collection of random variables is referred to as our (random) sample.
The parameter $\theta$ is assumed to be drawn from a set $\Theta$ called the parameter set. The collection $\{p_\theta; \theta\in\Theta\}$ is known as the statistical model. A desirable property of the statistical model is that different parameter values correspond to different distributions, that is, $p_{\theta} = p_{\theta'} \Rightarrow \theta = \theta'.$ If this holds we say that the statistical model is identifiable.
As an example, consider the case of $\R$-valued random variables, $X_1,\ldots, X_N{}{}\overset{\text{iid}}{\sim}{}{}\mathcal{N}(\mu, \sigma^2)$, where $\theta=(\mu,\sigma^2)$ and $\Theta=\R\times (0, \infty)$. It can be easily checked that this statistical model is identifiable.
We define a statistic as a (measurable) function of our sample, that is, $T=r(X_1,\ldots, X_N)$. Note that since $X_1,\ldots, X_N$ are random variables, $T$ is a random variable as well. A well known and widely used statistic is the sample mean
$$\begin{aligned}\bar{X}_N = \tfrac{1}{N}\sum_{i=1}^{N}X_i.\tag{1}\end{aligned}$$
Another popular statistic is the sample variance defined as
$$\begin{aligned}s^2 = \tfrac{1}{N}\sum_{i=1}^{N}(X_i-\bar{X}_N)^2.\tag{2}\end{aligned}$$
Other statistics are $T=\max\{X_1,\ldots, X_N\}$, $T=\min\{X_1,\ldots, X_N\}$, $T=X_1$ or even the constant $T=100$; there are infinitely many statistics we can construct from a given sample.
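To make these definitions concrete, here is a minimal NumPy sketch that draws a sample and evaluates some of the statistics above; the distribution, sample size and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=42)          # seed chosen arbitrarily for reproducibility
x = rng.normal(loc=1.0, scale=2.0, size=500)  # illustrative sample: N = 500 draws from N(1, 4)

x_bar = x.mean()                  # sample mean, Equation (1)
s2 = np.mean((x - x_bar) ** 2)    # sample variance, Equation (2) (divides by N)
t_max, t_min = x.max(), x.min()   # two more statistics: maximum and minimum

print(f"sample mean     : {x_bar:.4f}")
print(f"sample variance : {s2:.4f}")
print(f"max / min       : {t_max:.4f} / {t_min:.4f}")
```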
A statistic that is used to estimate a parameter $\theta$ is referred to as an estimator of $\theta$; an estimator is a statistic that is "close to" $\theta$ in some sense.
We define the bias of an estimator $\widehat{\theta} = \widehat{\theta}(X_1, \ldots, X_N)$ to be the expectation of the estimation error, $\widehat{\theta}-\theta$, given $\theta$, that is
$$\begin{aligned} {\rm bias}(\widehat{\theta}) {}={} {\rm I\!E}[\widehat{\theta} - \theta {}\mid{} \theta],\tag{3} \end{aligned}$$
if the expectation exists.
Note: Strictly speaking, in this context, the bias is not a conditional expectation since $\theta$ is not a random variable. We write it like this to underline the fact that the bias is a function of $\theta$. We could equally write ${\rm bias}(\widehat{\theta}){}={}{\rm I\!E}[\widehat{\theta} - \theta]$.
In principle the bias of $\widehat{\theta}$ can depend on $\theta$ - our estimator can have a different bias for different values of the (unknown) parameter $\theta$. If ${\rm bias}(\widehat{\theta})=0$ for all $\theta$ we say that the estimator is unbiased.
The variance of an estimator $\widehat{\theta}$ is defined as
$$\begin{aligned} {\rm var}(\widehat{\theta}) {}={} {\rm Var}\left[{}\widehat{\theta} {}\mid{} \theta{}\right] {}={} {\rm I\!E}\left[{}(\widehat{\theta}-{\rm I\!E}[\widehat{\theta} {}\mid{} \theta])^2{}\mid{}\theta{}\right],\tag{4} \end{aligned}$$
if the expectation exists.
We also define the mean square error (MSE) of $\widehat{\theta}$ as
$$\begin{aligned} {\rm mse}(\widehat{\theta}) {}={} {\rm I\!E}[(\widehat{\theta}-\theta)^2 {}\mid{} \theta],\tag{5} \end{aligned}$$
if the above expectation exists.
Theorem 1 (MSE, Variance and Bias). The mean square error of an estimator is given by
$$\begin{aligned} {\rm mse}(\widehat{\theta}) {}={} {\rm var}(\widehat{\theta}) {}+{} {\rm bias}(\widehat{\theta})^2.\tag{6} \end{aligned}$$
Proof. The proof relies on the definition of the MSE in Equation (5) and the formula ${\rm Var}[X{}\mid{}Y]={\rm I\!E}[X^2{}\mid{}Y]-{\rm I\!E}[X{}\mid{}Y]^2$. The details are left to the reader as an exercise. $\Box$
Note that according to Equation (6), if an estimator $\widehat{\theta}$ is unbiased, its MSE is equal to its variance.
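Equation (6) can also be checked numerically. The sketch below uses a small Monte Carlo helper (a hypothetical function written for this post, not part of any library) to estimate the bias, variance and MSE of an estimator, and verifies that the MSE matches ${\rm var} + {\rm bias}^2$; as a demonstration it uses the sample variance of Equation (2), whose bias is derived analytically in Example 3 below.

```python
import numpy as np

def mc_bias_var_mse(estimator, sampler, theta, n_rep=100_000):
    """Monte Carlo estimates of the bias, variance and MSE of `estimator`
    when the true parameter value is `theta`."""
    estimates = np.array([estimator(sampler()) for _ in range(n_rep)])
    bias = estimates.mean() - theta
    var = estimates.var()
    mse = np.mean((estimates - theta) ** 2)
    return bias, var, mse

rng = np.random.default_rng(0)
N, mu, sigma2 = 10, 0.0, 1.0                   # illustrative parameter values

sampler = lambda: rng.normal(mu, np.sqrt(sigma2), size=N)
s2 = lambda x: np.mean((x - x.mean()) ** 2)    # sample variance of Equation (2)

bias, var, mse = mc_bias_var_mse(s2, sampler, sigma2)
print(f"bias = {bias:+.4f}, var = {var:.4f}, mse = {mse:.4f}")
print(f"var + bias^2 = {var + bias**2:.4f}  (should match mse)")
```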
In practice it is typically desirable for an estimator to exhibit minimum MSE; at the same time, we want it to be unbiased. These requirements cannot always be reconciled, and in some cases no unbiased estimator exists at all. Let us give a few examples to illustrate the above concepts.
Example 1 (Sample mean is unbiased). Suppose that $X_1, \ldots, X_N$ are iid samples with mean $\mu$. The sample mean is given in Equation (1) and is an estimator of $\mu$. We can easily see that the bias of $\bar{X}_N$ is
$$\begin{aligned} {\rm bias}(\bar{X}_N) {}={} {\rm I\!E}[\bar{X}_N - \mu {}\mid{} \mu] {}={} {\rm I\!E}\left[\left. \tfrac{1}{N}\sum_{i=1}^{N}X_i\right|\mu\right] - \mu {}={} \tfrac{1}{N}\sum_{i=1}^{N}{\rm I\!E}[X_i{}\mid{}\mu] - \mu {}={} \mu - \mu {}={} 0. \end{aligned}$$
This proves that $\bar{X}_N$ is an unbiased estimator of $\mu$. $\heartsuit$
Example 2 (MSE of sample mean). Assume that the variance of $X_i$ is known and equal to $\sigma^2$ for all $i\in\N_{[1,N]}$. According to Equation (4), the variance of the sample mean is
$$\begin{aligned} {\rm var}(\bar{X}_N) {}={} & {\rm Var}[\bar{X}_N {}\mid{} \mu] \\ {}={}& {\rm Var}\left[\tfrac{1}{N}\sum_{i=1}^{N}X_i\right] \\ {}={} & \tfrac{1}{N^2}{\rm Var}\left[\sum_{i=1}^{N}X_i\right] {}\overset{\text{indep.}}{=}{} \tfrac{1}{N^2}\sum_{i=1}^{N}{\rm Var}[X_i] {}={} \frac{\sigma^2}{N}. \end{aligned}$$
Since the estimator is unbiased, according to Theorem 1, its MSE is equal to its variance. We see that as $N\to\infty$, the MSE of this estimator converges to zero (at a rate of $\mathcal{O}(1/N)$). $\heartsuit$
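As a quick illustration of this rate, the following sketch (with an arbitrary choice of $\mu$, $\sigma^2$ and sample sizes) estimates the MSE of $\bar{X}_N$ by simulation and compares it to $\sigma^2/N$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n_rep = 0.0, 4.0, 10_000   # illustrative parameter values

# Empirical MSE of the sample mean for increasing N; it should track sigma^2 / N.
for N in (10, 100, 1000):
    means = rng.normal(mu, np.sqrt(sigma2), size=(n_rep, N)).mean(axis=1)
    mse = np.mean((means - mu) ** 2)
    print(f"N = {N:5d}: empirical MSE = {mse:.5f}, sigma^2/N = {sigma2 / N:.5f}")
```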
Example 3 (Sample variance is biased). The sample variance given in Equation (2) is a biased estimator of $\sigma^2$. Indeed,
$$\begin{aligned} {\rm I\!E}[s^2 {}\mid{} \sigma^2] {}={} & {\rm I\!E}\left[\tfrac{1}{N}\sum_{i=1}^{N}(X_i-\bar{X}_N)^2\right] \\ {}={} & {\rm I\!E}\left[\tfrac{1}{N}\sum_{i=1}^{N}\left((X_i-\mu)^2-(\bar{X}_N-\mu)^2\right)\right] \\ {}={} & \tfrac{1}{N}\sum_{i=1}^{N}\underbrace{{\rm I\!E}[(X_i-\mu)^2]}_{{\rm Var}[X_i]} - \tfrac{1}{N}\sum_{i=1}^{N}\underbrace{{\rm I\!E}[(\bar{X}_N-\mu)^2]}_{{\rm Var}[\bar{X}_N]} {}={} \sigma^2 - \frac{\sigma^2}{N}, \end{aligned}$$
where for the second equality we used the identity $\sum_{i=1}^{N}(X_i-\bar{X}_N)^2 {}={} \sum_{i=1}^{N}(X_i-\mu)^2 - N(\bar{X}_N-\mu)^2$;
therefore, the bias of $s^2$ is
$$\begin{aligned} {\rm bias}(s^2) {}={} {\rm I\!E}[s^2 {}\mid{} \sigma^2] - \sigma^2 = -\frac{\sigma^2}{N}. \end{aligned}$$
We see that $s^2$ is a biased estimator, but the bias converges to $0$ as $N{}\to{}\infty$. We say that $s^2$ is an asymptotically unbiased estimator of $\sigma^2$. $\heartsuit$
Example 4 (Unbiased estimator of the variance). Following the same procedure as above, we can show that the following estimator of $\sigma^2$ is unbiased
$$\begin{aligned} s^2_{\rm corr} = \frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X}_N)^2. \end{aligned}$$
The use of $N-1$ instead of $N$ in the denominator is known as Bessel's correction. $\heartsuit$
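The two variance estimators can be compared by simulation. The sketch below (Gaussian data with arbitrary illustrative parameters) averages $s^2$ and $s^2_{\rm corr}$ over many replications; the former should concentrate around $\sigma^2(N-1)/N$ and the latter around $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, mu, sigma2, n_rep = 5, 0.0, 1.0, 500_000    # illustrative parameter values

samples = rng.normal(mu, np.sqrt(sigma2), size=(n_rep, N))
x_bar = samples.mean(axis=1, keepdims=True)

s2      = np.mean((samples - x_bar) ** 2, axis=1)           # divide by N     (Equation (2))
s2_corr = np.sum((samples - x_bar) ** 2, axis=1) / (N - 1)  # divide by N - 1 (Bessel-corrected)

print(f"mean of s^2      = {s2.mean():.4f}  (theory: {sigma2 * (N - 1) / N:.4f})")
print(f"mean of s^2_corr = {s2_corr.mean():.4f}  (theory: {sigma2:.4f})")
```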
Example 5 (Lack of unbiased estimator). There are cases where no unbiased estimator exists. For example, suppose that we observe a single random variable $X{}\sim{}{\rm Exp}(\lambda)$, $\lambda>0$. Recall that the pdf of the exponential distribution, ${\rm Exp}(\lambda)$, is
$$\begin{aligned} p_X(x; \lambda) = \lambda e^{-\lambda x}, x \geq 0. \end{aligned}$$
Suppose that $\widehat{\lambda}=\widehat{\lambda}(X)$ is an unbiased estimator of $\lambda$. Then, by definition, the following must hold for all $\lambda>0$
$$\begin{aligned} {\rm bias}(\widehat{\lambda}) = 0 {}\Leftrightarrow{} & {\rm I\!E}[\widehat{\lambda} {}\mid{} \lambda] = \lambda \\ {}\Leftrightarrow{} & \int_0^\infty \widehat{\lambda}(x)p_X(x; \lambda){\rm d} x = \lambda \\ {}\Leftrightarrow{} & \int_0^\infty \widehat{\lambda}(x)\lambda e^{-\lambda x}{\rm d} x = \lambda \\ {}\Leftrightarrow{} & \int_0^\infty \widehat{\lambda}(x) e^{-\lambda x}{\rm d} x = 1 \\ {}\Leftrightarrow{} & \{\mathscr{L}\widehat{\lambda}\}(\lambda) = 1, \end{aligned}$$
where the last equivalence is due to the fact that $\int_0^\infty \widehat{\lambda}(x) e^{-\lambda x}{\rm d} x$ is the Laplace transform of $\widehat{\lambda}$ evaluated at $\lambda$. However, such a function $\widehat{\lambda}$ cannot exist: whenever the Laplace transform exists, it satisfies $\lim_{\lambda\to\infty}\{\mathscr{L}\widehat{\lambda}\}(\lambda)=0$, so it cannot be identically equal to $1$. $\heartsuit$
Lastly, we define a uniformly minimum-variance unbiased estimator (UMVUE) to be an unbiased estimator such that there is no other unbiased estimator with a lower variance. Formally, let $X_1,\ldots, X_N$ be a random sample and let $\widehat{\theta}(X_1,\ldots, X_N)$ be an unbiased estimator of a parameter $\theta$. Then, $\widehat{\theta}$ is a UMVUE if
$${\rm var}[\widehat{\theta}] \leq {\rm var}[\widetilde{\theta}],$$
for all $\theta$ and for all unbiased estimators $\widetilde{\theta}$.
Bias-Variance Tradeoff: UMVUEs are considered good estimators. Sometimes, however, we may prefer an estimator with a small nonzero bias if it comes with a sufficiently lower variance, and therefore a lower MSE. This is known as the bias-variance tradeoff.
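As a classical illustration, consider estimating $\sigma^2$ from Gaussian data by dividing the sum of squared deviations by $N-1$, $N$, or $N+1$. For Gaussian samples, the $N+1$ divisor is known to yield the lowest MSE among estimators of this form, even though it is biased. The sketch below (arbitrary illustrative parameters) compares the three by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
N, mu, sigma2, n_rep = 5, 0.0, 1.0, 500_000    # illustrative parameter values

samples = rng.normal(mu, np.sqrt(sigma2), size=(n_rep, N))
ss = np.sum((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)

# Estimators of sigma^2 of the form ss / c for c in {N-1, N, N+1}.
for c in (N - 1, N, N + 1):
    est = ss / c
    bias = est.mean() - sigma2
    mse = np.mean((est - sigma2) ** 2)
    print(f"divide by {c}: bias = {bias:+.4f}, MSE = {mse:.4f}")
```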