Read first: Kalman Filter IV: Application to position estimation
Read next: Kalman Filter VI: Further examples
Minimum variance estimators
Firstly, we will show that the conditional expectation is a minimum variance estimator.
Theorem V.1. Suppose $X$ and $Y$ are jointly distributed continuous real-valued random variables and we measure $Y=y$. Let $\hat{x}=\hat{x}(y)$ be a minimiser of the problem.
$$\operatorname*{minimise}_{z}{\rm I\!E}[\|X-z\|^2 {}\mid{} Y=y],$$
that is, $\hat{x}$ is a minimum variance estimate of $X$ given $Y=y$. Then, $\hat{x}$ is unique and
$$\hat{x} = {\rm I\!E}[X {}\mid{} Y=y].$$
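As a quick sanity check (not a proof), here is a small NumPy sketch: we draw samples from a jointly Gaussian pair with known moments, approximate the conditioning on $Y=y$ by keeping the samples whose $Y$ falls in a narrow window around $y$, and verify that the empirical mean square error is (approximately) minimised at the conditional mean. All numerical values, the window width included, are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Jointly Gaussian pair (X, Y) with known mean and covariance
m = np.array([1.0, -2.0])                  # [m_x, m_y]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])             # [[S_xx, S_xy], [S_yx, S_yy]]
X, Y = rng.multivariate_normal(m, Sigma, size=2_000_000).T

y = -1.5                                   # measured value of Y
Xc = X[np.abs(Y - y) < 0.01]               # crude stand-in for conditioning on Y = y

# Conditional mean for a Gaussian pair: m_x + S_xy / S_yy * (y - m_y)
x_hat = m[0] + Sigma[0, 1] / Sigma[1, 1] * (y - m[1])

# Empirical conditional MSE over a grid of candidate estimates z
zs = np.linspace(x_hat - 1.0, x_hat + 1.0, 201)
mse = [np.mean((Xc - z) ** 2) for z in zs]

print("conditional mean:    ", x_hat)
print("empirical minimiser: ", zs[int(np.argmin(mse))])  # close to x_hat
```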
Does this theorem look too good to be true? Well... it is. We know that if $(X, Y)$ are jointly normally distributed and we know their mean and variance, we can compute the conditional expectation of $X$ given $Y=y$. However, in practice we often encounter cases where the random variables are not normally distributed. But even when they are approximately jointly normal, we won't know their means and variances.
In the context of the Kalman filter, it makes sense that if the noises ($w$ and $v$) are not normally distributed, the Kalman filter will not be the best estimation methodology. The good news is that the Kalman filter remains the best linear filter in any case!
KF is BLUE
In this section we will drop the normality assumptions. The Kalman filter is a linear (affine) estimator. By combining the measurement and time updates of the Kalman filter, we can see that
$$ \hat{x}_{t+1{}\mid{}t} {}={} \underbrace{A_t \hat{x}_{t{}\mid{}t-1}}_{\text{system dynamics}} {}+{} K_t\underbrace{(y_t - C_t\hat{x}_{t{}\mid{}t-1})}_{\text{prediction error}},$$
where $K_t{}={}A_t\Sigma_{t{}\mid{}t-1}C_t^\intercal (C_t\Sigma_{t{}\mid{}t-1}C_t^\intercal + R_t)^{-1}.$ The Kalman filter is thus an affine filter; in fact, it is the best affine filter, as we will explain below. Recall that, by definition, $\hat{x}_{t+1{}\mid{}t} {}={} {\rm I\!E}[x_{t+1} {}\mid{} y_0,\ldots,y_t]$; therefore, the KF is unbiased, i.e.,
$${\rm I\!E}[\hat{x}_{t+1{}\mid{}t}{}-{}x_{t+1}] {}={} {\rm I\!E}[{\rm I\!E}[x_{t+1} {}\mid{} y_0, \ldots, y_t]-x_{t+1}] {}={} 0.$$
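For concreteness, here is a minimal NumPy sketch of one step of this predictor form of the filter. The state update and the gain $K_t$ are exactly the formulas above; the covariance update is the standard Riccati recursion obtained by combining the measurement and time updates from the earlier parts of this series, and all matrices are placeholders to be supplied by the user.

```python
import numpy as np

def kf_predictor_step(x_pred, Sigma_pred, y, A, C, Q, R):
    """One step of the predictor form: map (x_{t|t-1}, Sigma_{t|t-1}) and the
    measurement y_t to (x_{t+1|t}, Sigma_{t+1|t})."""
    S = C @ Sigma_pred @ C.T + R                     # innovation covariance
    K = A @ Sigma_pred @ C.T @ np.linalg.inv(S)      # gain K_t
    x_next = A @ x_pred + K @ (y - C @ x_pred)       # dynamics + correction
    Sigma_next = A @ (Sigma_pred
                      - Sigma_pred @ C.T @ np.linalg.inv(S) @ C @ Sigma_pred) @ A.T + Q
    return x_next, Sigma_next
```

In practice one would use `np.linalg.solve` rather than forming the inverse explicitly, but the direct transcription above mirrors the formulas.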
The conditional expectation of a random variable $X$ given $Y=y$ minimises the following function
$$f(z; y) {}={} {\rm I\!E}\left[\|X - z\|^2 {}\mid{} Y=y\right].$$
This means that
$$f\left({{\rm I\!E}[X{}\mid{}Y=y]}; y\right) {}\leq{} f(z; y),$$
for all estimators $z(y)$.
Moreover, define the function $F(z; y) {}={} {\rm I\!E}[(X-z)(X-z)^\intercal {}\mid{}Y=y].$ Then,
$$F\left({{\rm I\!E}[X{}\mid{}Y=y]}; y\right) {}\preccurlyeq{} F(z; y),$$
for all $z$.
However, ${\rm I\!E}[X{}\mid{}Y=y]$ can be difficult to determine (without the normality assumption); in general, it is a nonlinear function of $y$. For example, if $X = Y^3$ and $Y$ is a standard normal random variable, then ${\rm I\!E}[X{}\mid{}Y=y] = y^3$, which is nonlinear in $y$.
Question: What is the best linear (affine) estimator we can construct?
Suppose that $X$ and $Y$ are jointly distributed. Then the best (minimum variance) estimator of $X$ given $Y=y$ is ${\rm I\!E}[X{}\mid{}Y=y]$. What is the best affine estimator?
Problem statement: Without assuming that $X$ and $Y$ are (jointly) normally distributed, suppose we are looking for estimators of the form $\widehat{X}(y)=A y+b$, i.e., affine estimators. We seek to determine $A=A^\star$ and $b=b^\star$ so that $\widehat{X}(y)=A^\star y+b^\star$ is the best among all affine estimators, that is,
$$ {\rm I\!E}\left[\|X - \left(A^\star Y + b^\star\right)\|^2\right] {}\leq{} {\rm I\!E}\left[\|X - \left(AY + b\right)\|^2\right],$$
for any $A$ and $b$, and its conditional counterpart
$${\rm I\!E}\left[\|X - \left(A^\star y + b^\star\right)\|^2 {}\mid{} Y {}={} y\right] {}\leq{} {\rm I\!E}\left[\|X - \left(A y + b\right)\|^2 {}\mid{} Y {}={} y\right],$$
also holds for any $A$ and $b$.
Theorem V.2 (KF is BLUE). Suppose that $(X, Y)$ are jointly distributed with means ${\rm I\!E}[X]=m_x$, ${\rm I\!E}[Y]=m_y$ and covariance matrix
$${\rm Var}\begin{bmatrix}X\\Y\end{bmatrix} {}={} \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix},$$
with $\Sigma_{yy} {}\succ{} 0$. The best affine estimator of $X$ given $Y$ is $\widehat{X}(Y) = A^\star Y + b^\star$ with
$${A^\star {}={} \Sigma_{xy}\Sigma_{yy}^{-1}}, \text{ and } {b^\star = m_x - A^\star m_y}.$$
In particular, ${\rm I\!E}\left[\|X-\widehat{X}(Y)\|^2\right] {}\leq{} {\rm I\!E}\left[\|X-(AY+b)\|^2\right],$ for any parameters $A$ and $b$.
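To see the theorem at work without any normality assumption, the sketch below builds a deliberately non-Gaussian pair $(X, Y)$, computes $A^\star$ and $b^\star$ from the (empirical) moments, and checks that perturbing $(A^\star, b^\star)$ only increases the empirical mean square error. The particular construction of $(X, Y)$ is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# A non-Gaussian jointly distributed pair (X, Y)
Y = rng.exponential(scale=1.0, size=n)
X = np.sin(Y) + 0.3 * rng.standard_normal(n)

# Empirical moments
m_x, m_y = X.mean(), Y.mean()
C = np.cov(X, Y)                    # 2x2 joint covariance
S_xy, S_yy = C[0, 1], C[1, 1]

# Best affine estimator (scalar case)
A_star = S_xy / S_yy
b_star = m_x - A_star * m_y

def mse(A, b):
    return np.mean((X - (A * Y + b)) ** 2)

print("MSE at (A*, b*):        ", mse(A_star, b_star))
print("MSE at perturbed (A, b):", mse(A_star + 0.1, b_star - 0.1))  # larger
```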
Remarks: The estimator can be written as $\widehat{X}(Y) = m_x + A^\star(Y-m_y)$. The Kalman filter is the best affine filter in the sense that it minimises the mean square error, ${\rm I\!E}[\|X-\widehat{X}\|^2].$ Theorem V.2 does not require that $X$ or $Y$ be Gaussian. We can prove a similar result for the covariance matrix ${\rm I\!E}[(X-\widehat{X})(X-\widehat{X})^\intercal]$ (guess what...). Later we will show that the KF is the best affine filter.
Before the Proof. We will use the following observations: for an $n$-dimensional random variable $Z$:
$${\rm I\!E}[\|Z\|^2] {}={} {\rm I\!E}[Z^\intercal Z] {}={} {\rm I\!E}[{\rm trace}(ZZ^\intercal)] {}={} {\rm trace} {\rm I\!E}[ZZ^\intercal].$$
Secondly, ${\rm Var}[Z] = {\rm I\!E}[ZZ^\intercal] - {\rm I\!E}[Z]{\rm I\!E}[Z]^\intercal$, so ${\rm I\!E}[ZZ^\intercal] = {\rm Var}[Z] + {\rm I\!E}[Z]{\rm I\!E}[Z]^\intercal;$ therefore,
$${\rm I\!E}[\|Z\|^2] {}={} {\rm trace} {\rm I\!E}[ZZ^\intercal] {}={} {\rm trace} {\rm Var}[Z] + {\rm trace} \left({\rm I\!E}[Z]{\rm I\!E}[Z]^\intercal\right).$$
Note also that ${\rm I\!E}[X-AY-b] = m_x - Am_y - b.$
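These identities are straightforward to verify numerically; the short sketch below (with an arbitrarily chosen random vector $Z$) checks that ${\rm I\!E}[\|Z\|^2] = {\rm trace}\,{\rm Var}[Z] + {\rm trace}({\rm I\!E}[Z]{\rm I\!E}[Z]^\intercal)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary 3-dimensional random vector with nonzero mean
Z = rng.standard_normal((1_000_000, 3)) @ np.diag([1.0, 2.0, 0.5]) + np.array([1.0, -1.0, 2.0])

lhs = np.mean(np.sum(Z ** 2, axis=1))                    # E[||Z||^2]
mean_Z = Z.mean(axis=0)
rhs = np.trace(np.cov(Z, rowvar=False)) + np.trace(np.outer(mean_Z, mean_Z))

print(lhs, rhs)   # the two values should agree to a few decimals
```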
Proof. We have
$$\begin{aligned} {\rm I\!E}[\|X-AY-b\|^2] & \quad {}={} {\rm trace} {\rm Var} [X-AY-b] + {\rm trace}({\rm I\!E}[X-AY-b]({}\cdot{})^\intercal) \\ & \quad {}={} {\rm trace} {\rm Var} [X-AY-b] + \|m_x - Am_y - b\|^2, \end{aligned}$$
where
$$\begin{aligned} {\rm Var}[X-AY-b] {}={} & {\rm I\!E}[(X-m_x-A(Y-m_y))({}\cdot{})^\intercal] \\ {}={} & {\rm I\!E}[(X-m_x)(X-m_x)^\intercal] {}+{} A\,{\rm I\!E}[(Y-m_y)(Y-m_y)^\intercal]A^\intercal \\ & \qquad {}-{} A\,{\rm I\!E}[(Y-m_y)(X-m_x)^\intercal] {}-{} {\rm I\!E}[(X-m_x)(Y-m_y)^\intercal] A^\intercal \\ {}={} & \Sigma_{xx} + A\Sigma_{yy}A^\intercal - A\Sigma_{yx} - \Sigma_{xy}A^\intercal, \end{aligned}$$
therefore,
$${\rm I\!E}[\|X-AY-b\|^2] {}={} {\rm trace} \left[ \Sigma_{xx} {}+{} A\Sigma_{yy}A^\intercal - A\Sigma_{yx} - \Sigma_{xy}A^\intercal \right] {}+{} \|m_x-Am_y-b\|^2.$$
Now observe that
$$(A-\Sigma_{xy}\Sigma_{yy}^{-1}) \Sigma_{yy} (A-\Sigma_{xy}\Sigma_{yy}^{-1})^\intercal {}={} { A\Sigma_{yy}A^\intercal -\Sigma_{xy}A^\intercal -A\Sigma_{yx} } {}+{} \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}.$$
The mean square error can be written as
$$\begin{aligned} {\rm I\!E}[\|X-AY-b\|^2] {}={}& \underbrace{ {\rm trace}[\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}] }_{\text{independent of } A \text{ and } b} \\ &\quad{}+{} {\rm trace}[(A-\Sigma_{xy}\Sigma_{yy}^{-1}) \Sigma_{yy} (A-\Sigma_{xy}\Sigma_{yy}^{-1})^\intercal] \\ &\qquad{}+{} \|m_x - Am_y - b\|^2.\end{aligned}$$
All terms are nonnegative. The matrix $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ is the Schur complement of $\Sigma_{yy}$ in the joint covariance matrix, so it is positive semidefinite and its trace is nonnegative; likewise, $(A-\Sigma_{xy}\Sigma_{yy}^{-1}) \Sigma_{yy} (A-\Sigma_{xy}\Sigma_{yy}^{-1})^\intercal$ is positive semidefinite. The first term is independent of $A$ and $b$. The second term can be made $0$ by taking $A = \Sigma_{xy}\Sigma_{yy}^{-1}$ and the third term vanishes if we take $b = m_x - Am_y$. $\Box$
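The three-term decomposition above can also be checked numerically. In the sketch below we draw a vector-valued, non-Gaussian pair $(X, Y)$, estimate the moments from samples, pick an arbitrary $(A, b)$, and compare the empirical mean square error with the decomposition; the distributions and dimensions are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# A non-Gaussian pair with X and Y both in R^2
Y = rng.uniform(-1.0, 1.0, size=(n, 2))
X = np.tanh(Y @ np.array([[1.0, 0.5], [-0.3, 2.0]])) + 0.1 * rng.standard_normal((n, 2))

m_x, m_y = X.mean(axis=0), Y.mean(axis=0)
C = np.cov(np.hstack([X, Y]), rowvar=False)          # joint 4x4 covariance
Sxx, Sxy = C[:2, :2], C[:2, 2:]
Syx, Syy = C[2:, :2], C[2:, 2:]

A = np.array([[0.2, -0.4], [1.0, 0.3]])              # arbitrary parameters
b = np.array([0.1, -0.2])

lhs = np.mean(np.sum((X - Y @ A.T - b) ** 2, axis=1))     # E[||X - AY - b||^2]

A_star = Sxy @ np.linalg.inv(Syy)
rhs = (np.trace(Sxx - A_star @ Syx)
       + np.trace((A - A_star) @ Syy @ (A - A_star).T)
       + np.sum((m_x - A @ m_y - b) ** 2))

print(lhs, rhs)   # the two values should agree closely
```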
Exercises
Exercise V.1. Suppose that $(X, Y)$ are jointly distributed with means ${\rm I\!E}[X]=m_x$, ${\rm I\!E}[Y]=m_y$ and covariance matrix
$${\rm Var}\begin{bmatrix}X\\Y\end{bmatrix} {}={} \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix},$$
with $\Sigma_{yy} {}\succ{} 0$. Show that the best affine estimator of $X$ given $Y$ is $\widehat{X}(Y) = A^\star Y + b^\star$ with
$${A^\star {}={} \Sigma_{xy}\Sigma_{yy}^{-1}}, \text{ and } {b^\star = m_x - A^\star m_y},$$
in the sense that
$${\rm I\!E}\left[(X-\widehat{X}(Y))(X-\widehat{X}(Y))^\intercal\right] {}\preccurlyeq{} {\rm I\!E}\left[(X-AY-b)(X-AY-b)^\intercal\right],$$
for any parameters $A$ and $b$.
Hint: follow the steps of the proof of the theorem we just stated, omitting the trace.
Exercise V.2 (Estimator bias and variance). Show that the best linear (affine) estimator, $\widehat{X}(Y)= A^\star{}Y + b^\star$, with
$$\begin{aligned} A^\star {}={} & \Sigma_{xy}\Sigma_{yy}^{-1}, \\ b^\star {}={} & m_x - A^\star m_y, \end{aligned}$$
is unbiased, i.e., ${\rm I\!E}[X-\widehat{X}(Y)]=0$ and its variance (the variance of the estimator error, $X-\widehat{X}(Y)$) is
$${\rm Var}[X-\widehat{X}(Y)] {}={} \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}.$$
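A quick Monte Carlo check (not a substitute for the proof) of both claims: the empirical bias of $X-\widehat{X}(Y)$ should be close to zero and its empirical covariance close to $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$. The distribution below is an arbitrary non-Gaussian example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# An arbitrary non-Gaussian pair with X and Y both in R^2
Y = rng.laplace(size=(n, 2))
X = Y @ np.array([[0.7, 0.1], [-0.2, 1.1]]) + rng.standard_normal((n, 2))

m_x, m_y = X.mean(axis=0), Y.mean(axis=0)
C = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Sxy, Syx, Syy = C[:2, :2], C[:2, 2:], C[2:, :2], C[2:, 2:]

A_star = Sxy @ np.linalg.inv(Syy)
b_star = m_x - A_star @ m_y

err = X - (Y @ A_star.T + b_star)
print("empirical bias:", err.mean(axis=0))                        # ~ 0
print("empirical error covariance:\n", np.cov(err, rowvar=False))
print("Sxx - Sxy Syy^{-1} Syx:\n", Sxx - A_star @ Syx)            # should match
```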
Recovering the Kalman Filter
Assumptions: $x_0$, $(w_t)_t$ and $(v_t)_t$ are mutually independent random variables (not necessarily Gaussian) with ${\rm I\!E}[w_t]=0$, ${\rm I\!E}[v_t]=0$, ${\rm I\!E}[x_0] = \tilde{x}_0$, and ${\rm Var}[w_t] = Q_t$, ${\rm Var}[v_t] = R_t$, ${\rm Var}[x_0] = P_0$. Then \((x_0, y_0)\) is jointly distributed with mean
$${\rm I\!E} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} {}={} \begin{bmatrix} {\tilde{x}_0} \\ {C_0\tilde{x}_0} \end{bmatrix},$$
and variance-covariance matrix
$${\rm Var} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} {}={} \begin{bmatrix} {P_0} & {P_0C_0^\intercal} \\ {C_0P_0} & {C_0P_0C_0^\intercal + R_0} \end{bmatrix}.$$
By Theorem V.2, the best affine estimator of $x_0$ given $y_0$ is
$$\hat{x}_{0}(y_0) {}={} {\tilde{x}_0} {}+{} {P_0C_0^\intercal} ({C_0P_0C_0^\intercal + R_0})^{-1} (y_0 - {C_0\tilde{x}_0}),$$
and the error covariance is
$${\rm Var}[x_0 - \hat{x}_{0}(y_0)] {}={} P_0 {}-{} {P_0C_0^\intercal} ({C_0P_0C_0^\intercal + R_0})^{-1} {C_0P_0}.$$
These are the same formulas as in the Kalman filter!
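For completeness, here are the two formulas above written in NumPy for some illustrative (placeholder) values of $\tilde{x}_0$, $P_0$, $C_0$ and $R_0$; the matrix multiplying the innovation $y_0 - C_0\tilde{x}_0$ is exactly the Kalman measurement-update gain.

```python
import numpy as np

# Illustrative problem data (placeholders)
x_tilde0 = np.array([0.0, 1.0])                 # E[x_0]
P0 = np.array([[1.0, 0.2],
               [0.2, 2.0]])                     # Var[x_0]
C0 = np.array([[1.0, 0.0]])                     # output matrix
R0 = np.array([[0.1]])                          # Var[v_0]
y0 = np.array([0.3])                            # measurement

S0 = C0 @ P0 @ C0.T + R0                        # innovation covariance
W0 = P0 @ C0.T @ np.linalg.inv(S0)              # Kalman (measurement-update) gain

x_hat0 = x_tilde0 + W0 @ (y0 - C0 @ x_tilde0)   # best affine estimate of x_0 given y_0
Sigma0 = P0 - W0 @ C0 @ P0                      # error covariance

print(x_hat0)
print(Sigma0)
```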
We can easily show, recursively, that the Kalman filter is the Best (minimum variance, minimal covariance matrix) Linear (actually affine) Unbiased Estimator (BLUE). "Best linear" means that it is the best among all linear estimators; there may, however, be a nonlinear estimator that leads to a lower variance. Without the normality assumption, the Kalman filter is not necessarily a minimum variance estimator.
Read next: Kalman Filter VI: Further examples