In this introductory post, the first in a series, we present some fundamental results that pave the way towards polynomial chaos expansions.

## 0. Introduction

### 0.1. Preliminaries

Mercer's theorem is a fundamental result in the theory of reproducing kernel Hilbert spaces^{[ref]}. Before we proceed we need to state a few definitions; note that certain authors use different notation and definitions. We start with the definition of a symmetric function.

**Definition 1 (Symmetric).** A function \(K:[a,b]\times[a,b] \to {\rm I\!R}\) is called *symmetric* if $K(x, y) = K(y, x)$ for all $x,y\in[a,b]$.

Next we give the definition of a nonnegative definite function \(K:[a,b]\times[a,b] \to {\rm I\!R}\).

**Definition 2 (Nonnegative definite function).** A function \(K:[a,b]\times[a,b] \to {\rm I\!R}\) is called *nonnegative definite* if \(\sum_{i,j=1}^{n} f(x_i) f(x_j) K(x_i,x_j) \geq 0\) for all points \((x_i)_{i=1}^{n}\) in \([a,b]\), all \(n\in\mathbb{N}\), and all functions \(f:[a, b] \to {\rm I\!R}\).

We can now give the definition of a kernel.

**Definition 3 (Kernel).** A function \(K:[a,b]\times[a,b] \to {\rm I\!R}\) is called a *kernel* if it is (i) continuous, (ii) symmetric, and (iii) nonnegative definite.

Some examples of kernels are (for $x,y\in [a, b]$)

- Linear, $K(x, y) = x y$
- Polynomial, $K(x, y) = (x y + c)^d$, with $c\geq 0$
- Radial basis function, $K(x, y) = \exp(-\gamma (x-y)^2)$, with $\gamma > 0$
- Hyperbolic tangent, $K(x, y) = \tanh(xy + c)$
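As a quick numerical sanity check (not a proof), we can verify that the Gram matrix $K_{ij} = K(x_i, x_j)$ of one of these kernels, at arbitrary points, is symmetric with nonnegative eigenvalues. The sketch below uses the radial basis function kernel and assumes numpy is available.

```python
import numpy as np

# Sanity check: for the RBF kernel, the Gram matrix K_ij = K(x_i, x_j)
# should be symmetric and have nonnegative eigenvalues for any points
# x_1, ..., x_n in the interval.
def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * (x - y) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)        # arbitrary points in [-1, 1]
K = rbf_kernel(x[:, None], x[None, :])     # Gram matrix

assert np.allclose(K, K.T)                 # symmetry
assert np.linalg.eigvalsh(K).min() > -1e-8 # nonnegative definiteness
```

The same check can be repeated for the other kernels above; for the hyperbolic tangent kernel it can actually fail for some parameter choices, which is why some authors treat it separately.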

### 0.2. Hilbert-Schmidt integral operators

We give the definition of the Hilbert-Schmidt integral operator of a kernel.

**Definition 4 (HS integral operator).** Let \(K\) be a kernel. The *Hilbert-Schmidt integral operator* of \(K\) is the operator $T_K : \mathcal{L}^2([a,b]) \to \mathcal{L}^2([a,b])$ that maps a function $\phi \in \mathcal{L}^2([a,b])$ to the function

$$T_K \phi = \int_{a}^{b}K({}\cdot{}, s)\phi(s)\mathrm{d} s,$$

which is itself an $\mathcal{L}^2([a,b])$-function.

The Hilbert-Schmidt integral operator of a kernel is linear, bounded, and compact. Moreover, $T_K\phi$ is continuous for all $\phi\in\mathcal{L}^2([a,b])$. We also have that $T_K$ is *self-adjoint*, that is, for all $\phi, g\in\mathcal{L}^2([a,b])$,

$$\langle T_K \phi, g\rangle = \langle \phi, T_K g\rangle.$$

Recall that given a linear map $A:U\to U$ on a real linear space $U$, an eigenvalue-eigenvector pair $(\lambda, \phi)$, with $\lambda\in\mathbb{C}$ and $\phi\in U$, $\phi\neq 0$, is one that satisfies $A\phi = \lambda \phi$. For the Hilbert-Schmidt integral operator of a kernel:

- There is at least one eigenvalue-eigenfunction pair
- The set of its eigenvalues is countable
- Its eigenvalues are (real and) nonnegative
- Its eigenfunctions are continuous
- Its eigenfunctions span $\mathcal{L}^2([a,b])$

From the spectral theorem for linear, compact, self-adjoint operators, there is an orthonormal basis of $\mathcal{L}^2([a,b])$ consisting of eigenvectors of $T_K$.
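These properties can be illustrated numerically by discretising $T_K$ on a grid (a Nyström-type approximation; numpy assumed). For the kernel $K(s,t)=\min(s,t)$ on $[0,1]$ — the covariance of Brownian motion — the eigenvalues are known in closed form, $\lambda_k = 1/((k-\tfrac{1}{2})\pi)^2$, so we can compare:

```python
import numpy as np

# Nystrom-type discretisation of T_K for K(s, t) = min(s, t) on [0, 1]:
# the eigenvalues of the matrix K/n approximate those of the integral
# operator, which are known: lambda_k = 1 / ((k - 1/2) pi)^2.
n = 800
s = (np.arange(n) + 0.5) / n               # midpoint grid on [0, 1]
K = np.minimum(s[:, None], s[None, :])
eigvals = np.linalg.eigvalsh(K / n)[::-1]  # largest first

assert eigvals.min() > -1e-8               # eigenvalues are nonnegative
exact = 1.0 / (((np.arange(1, 6) - 0.5) * np.pi) ** 2)
assert np.allclose(eigvals[:5], exact, rtol=1e-2)
```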

### 0.3. Mercer’s Theorem

We can now state Mercer's theorem.

**Theorem 5 (Mercer's theorem).** Let \(K\) be a kernel. Then, there is an orthonormal basis \((e_i)_{i}\) of $\mathcal{L}^2([a,b])$ of eigenfunctions of $T_K$, and corresponding (nonnegative) eigenvalues \((\lambda_i)_i\) so that

$$K(s,t) = \sum_{j=1}^{\infty} \lambda_j e_j(s)e_j(t),$$

where convergence is absolute and uniform in \([a,b]\times [a,b]\).

The fact that $(\lambda_i, e_i)$ is an eigenvalue-eigenfunction pair of $T_K$ means that $T_K e_i = \lambda_i e_i$, that is,

$$\int_a^b K(t, s)e_i(s)\mathrm{d}s = \lambda_i e_i(t).$$

This is a Fredholm integral equation of the second kind.
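As a concrete check of Mercer's expansion (a sketch, assuming numpy): for $K(s,t)=\min(s,t)$ on $[0,1]$ the eigenpairs of $T_K$ are known in closed form, $e_k(t) = \sqrt{2}\sin((k-\tfrac{1}{2})\pi t)$ with $\lambda_k = 1/((k-\tfrac{1}{2})\pi)^2$, so a truncated Mercer sum should reproduce the kernel:

```python
import numpy as np

# Truncated Mercer expansion for K(s, t) = min(s, t) on [0, 1], using
# the closed-form eigenpairs e_k(t) = sqrt(2) sin((k - 1/2) pi t) and
# lambda_k = 1 / ((k - 1/2) pi)^2.
def mercer_partial_sum(s, t, m=1000):
    k = np.arange(1, m + 1)
    lam = 1.0 / (((k - 0.5) * np.pi) ** 2)
    e_s = np.sqrt(2.0) * np.sin((k - 0.5) * np.pi * s)
    e_t = np.sqrt(2.0) * np.sin((k - 0.5) * np.pi * t)
    return np.sum(lam * e_s * e_t)

for s, t in [(0.3, 0.7), (0.5, 0.5), (0.9, 0.2)]:
    assert abs(mercer_partial_sum(s, t) - min(s, t)) < 1e-3
```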

## 1. The Kosambi–Karhunen–Loève Theorem

We can now state the Kosambi–Karhunen–Loève (KKL) theorem^{[3]}, a special case of the class of polynomial chaos expansions we will discuss later. It is a consequence of Mercer's theorem.

**Theorem 6 (KKL Theorem).**^{[3]} Let \((X_t)_{t\in [a,b]}\) be a centred, mean-square continuous stochastic process on \((\Omega, \mathcal{F}, \mathrm{P})\) with covariance function $R_X(s,t) = {\rm I\!E}[X_s X_t]$. Then, for $t\in[a, b]$,

\[ X_t(\omega) = \sum_{i=1}^{\infty}Z_i(\omega) e_i(t),\tag{1} \]

with

\[ Z_i(\omega) = \int_{a}^{b}X_t(\omega)e_i(t) \mathrm{d}t, \tag{2} \]

where $e_i$ are eigenfunctions of the Hilbert-Schmidt integral operator of $R_X$ and form an orthonormal basis of $\mathcal{L}^2([a,b])$.

Since the $e_i$ are orthonormal, the random variables $Z_i$ are mutually uncorrelated, that is, \({\rm I\!E}[Z_i Z_j] = 0\) for \(i\neq j\). Moreover, \({\rm I\!E}[Z_i]=0\). The variance of $Z_i$ is ${\rm I\!E}[Z_i^2] = \lambda_i$ — the eigenvalue corresponding to $e_i$.

In Equation (1), the series converges in the mean-square sense to $X_t$, uniformly in $t\in[a,b]$.

What is more, \((e_i)_i\) and \((\lambda_i)_i\) are eigenfunctions and eigenvalues of the Hilbert-Schmidt integral operator \(T_{R_X}\) of the covariance function \(R_X\) of the random process, that is, they satisfy the Fredholm integral equation of the second kind

\[ T_{R_X}e_i = \lambda_i e_i \]

or equivalently

\[ \int_{a}^{b} R_X(s,t) e_i(s) \mathrm{d} s = \lambda_i e_i(t), \]

for all $i\in\mathbb{N}$.

Let \((X_t)_{t\in T}\), \(T=[a,b]\), be a stochastic process which satisfies the requirements of the Kosambi-Karhunen-Loève theorem. Then, there exists a basis \((e_i)_i\) of \(\mathcal{L}^2(T)\) such that for all \(t\in T\)

\[ X_t(\omega) = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\xi_i(\omega)e_i(t), \]

in \(\mathcal{L}^2(\Omega, \mathcal{F}, \mathrm{P})\), where \(\xi_i\) are centred, mutually uncorrelated random variables with unit variance, given by

\[\xi_i(\omega) = \tfrac{1}{\sqrt{\lambda_i}}\int_{a}^{b} X_{\tau}(\omega) e_i(\tau) \mathrm{d} \tau.\]
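As a concrete illustration (a sketch, assuming numpy): for standard Brownian motion on $[0,1]$ we have $R_X(s,t)=\min(s,t)$, the eigenpairs are $e_i(t)=\sqrt{2}\sin((i-\tfrac{1}{2})\pi t)$ with $\lambda_i = 1/((i-\tfrac{1}{2})\pi)^2$, and the $\xi_i$ are i.i.d. standard normal. Truncating the series gives approximate sample paths whose moments we can check:

```python
import numpy as np

# Truncated KKL expansion of standard Brownian motion on [0, 1]:
# X_t ~ sum_{i<=m} sqrt(lambda_i) xi_i e_i(t), with xi_i i.i.d. N(0, 1).
rng = np.random.default_rng(42)
m, n_paths = 200, 20000
k = np.arange(1, m + 1)
lam = 1.0 / (((k - 0.5) * np.pi) ** 2)

def sample_at(t):
    xi = rng.standard_normal((n_paths, m))     # centred, unit variance
    e = np.sqrt(2.0) * np.sin((k - 0.5) * np.pi * t)
    return xi @ (np.sqrt(lam) * e)             # truncated KL sum

x_half = sample_at(0.5)
assert abs(x_half.mean()) < 0.05               # centred process
assert abs(x_half.var() - 0.5) < 0.05          # Var(X_t) = t at t = 0.5
```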

## 2. Orthogonal polynomials

Let \(\mathsf{P}_N([a,b])\) denote the space of polynomials \(\psi:[a,b]\to{\rm I\!R}\) of degree no larger than \(N\in\mathbb{N}\).

Consider the space \(\mathcal{L}^2([a,b])\) and a weight function $w:[a, b] \to [0, \infty)$ (for instance, the pdf of a continuous random variable supported on $[a,b]$). For two functions \(\psi_1,\psi_2: [a,b] \to {\rm I\!R}\), we define the scalar product

\[ \left\langle\psi_1, \psi_2\right\rangle_{w} = \int_{a}^{b} w(\tau) \psi_1(\tau) \psi_2(\tau) \mathrm{d} \tau, \]

and the corresponding norm \( \|f\|_w = \sqrt{\left\langle f,f\right\rangle_w}\). Define the space

\[ \mathcal{L}^2_w([a,b]) = \{f:[a,b]\to{\rm I\!R} {}:{} \|f\|_w < \infty\}. \]

For a random variable \(\Xi \in \mathcal{L}^2(\Omega, \mathcal{F}, \mathrm{P}; {\rm I\!R})\) with pdf \(p_\Xi>0\), we define

\[ \left\langle\psi_1, \psi_2\right\rangle_{\Xi} {}\coloneqq{} \left\langle\psi_1, \psi_2\right\rangle_{p_{\Xi}}. \]

If the random variable $\Xi$ is not continuous (so, it does not have a pdf), then given its distribution $F_{\Xi}(\xi) = \mathrm{P}[\Xi \leq \xi]$, we can define the inner product as

$$\langle \psi_1, \psi_2\rangle_{\Xi} {}={} \int_{a}^{b}\psi_1 \psi_2 \mathrm{d}F_{\Xi}.$$

Let $\Xi$ be a real-valued random variable with probability density function $p_\Xi$. Let \(\psi_1,\psi_2:{\rm I\!R}\to{\rm I\!R}\) be two polynomials. We say that \(\psi_1,\psi_2\) are orthogonal with respect to (the pdf of) \(\Xi\) if \(\left\langle\psi_1, \psi_2\right\rangle_{\Xi} = 0\).

The random variable $\Xi$ that induces the above inner product and norm is often referred to as the **germ**.

Let \(\psi_0, \psi_1,\ldots\), with \(\psi_0=1\), be a sequence of orthogonal polynomials with respect to the above inner product. Then,

\[ 0 = \left\langle \psi_0, \psi_1 \right\rangle_\Xi = \int_{-\infty}^{\infty} \psi_{1}(s) p_{\Xi}(s) \mathrm{d} s = {\rm I\!E}[\psi_1(\Xi)], \]

by virtue of the law of the unconscious statistician (LOTUS). Likewise, \({\rm I\!E}[\psi_i(\Xi)] = 0\) for all \(i\geq 1\).

### 2.1. Popular polynomial bases

(**Hermite polynomials**). Let \(\Xi\) be distributed as \(\mathcal{N}(0,1)\). Then, its pdf is

$$p_{\Xi}(s) = \frac{1}{\sqrt{2\pi}}e^{-\frac{s^2}{2}},$$

and the polynomials \(\psi_0,\psi_1,\ldots\) are the (probabilists') Hermite polynomials, the first few of which are \(H_0(x)=1\), \(H_1(x)=x\), \(H_2(x)=x^2-1\), \(H_3(x) = x^3 - 3x\). These are orthogonal with respect to \(\Xi\), that is,

\[ \left\langle{}H_i, H_j\right\rangle_\Xi = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}H_i(s) H_j(s) e^{-\frac{s^2}{2}}\mathrm{d} s = 0, \]

for \(i\neq j\).
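We can check these orthogonality relations numerically with numpy's `hermite_e` module (the probabilists' Hermite polynomials, which match the $H_n$ listed above), using Gauss–Hermite quadrature for the weight $e^{-s^2/2}$:

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# Orthogonality of the probabilists' Hermite polynomials against the
# N(0, 1) pdf; hermegauss gives nodes/weights for the weight e^{-s^2/2},
# so we normalise by 1/sqrt(2 pi). It also holds that <H_n, H_n> = n!.
x, w = He.hermegauss(30)
norm = 1.0 / np.sqrt(2.0 * np.pi)

def inner(i, j):
    ci = np.zeros(i + 1); ci[i] = 1.0   # coefficient vector of H_i
    cj = np.zeros(j + 1); cj[j] = 1.0
    return norm * np.sum(w * He.hermeval(x, ci) * He.hermeval(x, cj))

assert abs(inner(1, 2)) < 1e-12         # orthogonal for i != j
assert abs(inner(0, 3)) < 1e-12         # hence E[H_3(Xi)] = 0
assert abs(inner(2, 2) - 2.0) < 1e-10   # <H_2, H_2> = 2! = 2
```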

(**Legendre polynomials**). If \(\Xi\sim U([-1,1])\), then \(\psi_0,\psi_1,\ldots\) are the Legendre polynomials. If, instead, \(\Xi\) is uniform on a general interval \([a,b]\), the Legendre polynomials can be shifted and scaled accordingly.

(**Laguerre polynomials**). If the germ is an exponential random variable on \([0,\infty)\),
then \(\psi_0,\psi_1,\ldots\) are the Laguerre polynomials.

### 2.2. Projection and Best Approximation

On an inner product space $X$, let us recall that the projection of a point $x \in X$ on a nonempty closed convex set $H\subseteq X$ is defined as

$$\Pi_H(x) = \argmin_{y\in H}\|y-x\|.$$

In other words,

$$\|\Pi_H(x) - x\| \leq \|y-x\|, \text{ for all } y\in H.$$

**Proposition 7 (Projection on subspace).** Let $X$ be an inner product space and let $H \subseteq X$ be a (closed) linear subspace of $X$. Let $x\in X$. The following are equivalent:

- $x^\star = \Pi_{H}(x)$
- $\|x^\star - x\| \leq \|y-x\|$, for all $y\in H$
- $x^\star - x$ is orthogonal to $y$ for all $y\in H$

*Proof.* (1 $\Leftrightarrow$ 2) is immediate from the definition of the projection.

(1 $\Leftrightarrow$ 3) The optimisation problem that defines the projection can be written equivalently as

$$\operatorname*{Minimise}_{y}\tfrac{1}{2}\|y-x\|^2 + \delta_H(y).$$

So the optimality condition is $0 \in \partial\left(\tfrac{1}{2}\|x^{\star}-x\|^2 + \delta_H(x^{\star})\right)$, where $\delta_H$ is the indicator function of $H$. Equivalently,

$$0 \in x^{\star}-x + N_H(x^{\star}) \Leftrightarrow x^{\star}-x \in H^{\perp},$$

where $N_H(x^{\star})$ is the normal cone of $H$ at $x^{\star}$, which for a linear subspace is $H^{\perp}$. This completes the proof. $\Box$

**Proposition 8 (Projection on subspace).** Let $X$ be an inner product space and let $H \subseteq X$ be a finite-dimensional linear subspace of $X$ with an orthogonal basis $\{e_1, \ldots, e_n\}$*. Then,

$$\Pi_H(x) = \sum_{i=1}^{n}c_i e_i,$$

where

$$c_i = \frac{\langle x, e_i \rangle}{\|e_i\|^2},$$

for all $i=1,\ldots, n$.

* If the basis is not orthogonal, we can produce an orthogonal basis using the Gram-Schmidt orthogonalisation procedure^{[ref]}

*Proof.* It suffices to show that Condition 3 of Proposition 7 is satisfied. Equivalently, it suffices to show that

$$\left\langle \sum_{i=1}^{n} \frac{\langle x, e_i \rangle}{\|e_i\|^2}e_i - x, e_j\right\rangle=0,$$

for all $j=1,\ldots,n$. This is true because $\langle e_i, e_j\rangle = \delta_{ij}\|e_j\|^2$. $\Box$
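A small numerical instance of Proposition 8 in $\mathbb{R}^3$ (with numpy): we project onto the span of an orthogonal but non-normalised basis and verify the orthogonality condition of Proposition 7.

```python
import numpy as np

# Proposition 8 in R^3: project x onto span{e1, e2}, where the basis is
# orthogonal but not orthonormal, and check that the residual x* - x is
# orthogonal to the subspace (Condition 3 of Proposition 7).
x = np.array([1.0, 2.0, 3.0])
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 2.0, 0.0])      # orthogonal to e1, not unit norm

proj = sum((x @ e) / (e @ e) * e for e in (e1, e2))
residual = proj - x

assert np.allclose(proj, [1.0, 2.0, 0.0])
assert abs(residual @ e1) < 1e-12 and abs(residual @ e2) < 1e-12
```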

Let $\{\psi_0, \ldots, \psi_N\}$ be an orthogonal basis of $\mathsf{P}_N([a, b])$, the space of polynomials on $[a, b]$ of degree up to $N$, where orthogonality is meant with respect to an inner product \(\left\langle{}\cdot{}, {}\cdot{}\right\rangle_w\). Suppose, moreover, that the $\psi_i$ are normalised, i.e., $\|\psi_i\|_w = 1$.

Following Proposition 8, the projection operator onto \(\mathsf{P}_N([a, b])\) is

\[ \Pi_N:\mathcal{L}^2_w([a, b]) \ni f \mapsto \Pi_N f \coloneqq \sum_{j=0}^{N}\hat{f}_j \psi_j \in \mathsf{P}_N([a, b]), \]

where $\hat{f}_j = \left\langle{}f, \psi_j\right\rangle_w.$

For \(f\in\mathcal{L}^2_w([a, b])\), define \(f_N = \Pi_N f\). Then,

\[ \lim_{N\to\infty}\|f - f_N\|_{w} = 0. \]

### 2.3. Example

Here we will approximate the function

$$f(x) = \sin(5x) \exp(-2x^2),$$

over $[-1, 1]$ using Legendre polynomials^{[ref]}. The Legendre polynomials, $L_0, L_1, \ldots$ are orthogonal with respect to the standard inner product of $\mathcal{L}^2([-1, 1])$. The first four Legendre polynomials are

$$\begin{aligned} L_0(x) {}={}& 1, \\ L_1(x) {}={}& x, \\ L_2(x) {}={}& \tfrac{1}{2}(3x^2 - 1), \\ L_3(x) {}={}& \tfrac{1}{2}(5x^3 - 3x). \end{aligned}$$

Using Proposition 8, we can compute the coefficients of the approximation $f \approx \Pi_N f$. In Figure 1 we see a comparison of $f_5$, $f_7$ and $f_9$ with $f$. We see that, as $N$ increases, the approximation comes closer to $f$.

**Figure 1.** Approximation of $f(x) = \sin(5x) \exp(-2x^2)$ using Legendre polynomials.

**Figure 2.** Plot of $\|f-f_N\|_{\mathcal{L}^2([-1, 1])}$ against $N$. As $N\to\infty$, the approximation error goes to zero.
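The computation behind these figures can be sketched as follows (assuming numpy; the quadrature order of 60 is an arbitrary choice). Since the Legendre polynomials are orthogonal but not normalised, we use the coefficient formula of Proposition 8 with $\|L_j\|^2 = 2/(2j+1)$:

```python
import numpy as np
from numpy.polynomial import legendre as L

# Project f(x) = sin(5x) exp(-2 x^2) onto P_N([-1, 1]) in the Legendre
# basis, with coefficients c_j = <f, L_j> / ||L_j||^2 and the integrals
# computed by Gauss-Legendre quadrature.
def f(x):
    return np.sin(5 * x) * np.exp(-2 * x ** 2)

x, w = L.leggauss(60)                      # quadrature nodes and weights

def l2_error(N):
    coeffs = np.zeros(N + 1)
    for j in range(N + 1):
        cj = np.zeros(j + 1); cj[j] = 1.0  # coefficient vector of L_j
        Lj = L.legval(x, cj)
        coeffs[j] = np.sum(w * f(x) * Lj) / (2.0 / (2 * j + 1))
    fN = L.legval(x, coeffs)
    return np.sqrt(np.sum(w * (f(x) - fN) ** 2))

errs = [l2_error(N) for N in (5, 7, 9, 11)]
assert all(b < a for a, b in zip(errs, errs[1:]))  # error decreases with N
assert errs[-1] < 0.1
```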

Is this better compared to the Taylor expansion? TL;DR: yes, by far! Taylor's approximation is a *local* approximation around a single point, whereas the projection minimises the error over the whole interval.

### 2.4. Focused approximation

Let's take the same function as in the example of Section 2.3, but now suppose we want a better fit close to $-1$ at the expense of accuracy close to $1$; that is, we treat errors close to $-1$ as more important and errors close to $1$ as less important. To that end, we use the weight function

$$w(x) = (1-x)^a(1+x)^b,$$

with $a=2$ and $b=0.1$.

**Figure 3.** Approximation of $f(x) = \sin(5x) \exp(-2x^2),$ with polynomials which are orthogonal with respect to $w$.

The Jacobi polynomials^{[ref]} are orthogonal with respect to $w$.

In Figure 3 we see that the approximation using an expansion with Jacobi polynomials is better close to $-1$ and less so as we move to $1$.
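A sketch of the weighted projection using scipy's Jacobi routines (`roots_jacobi`, `eval_jacobi`; scipy assumed available): the Gauss–Jacobi nodes and weights already account for $w$, and the norms $\|P_n^{(a,b)}\|_w^2$ are computed by the same quadrature. As a sanity check, a cubic is recovered exactly by its degree-3 projection.

```python
import numpy as np
from scipy.special import roots_jacobi, eval_jacobi

# Weighted projection with w(x) = (1-x)^a (1+x)^b, a = 2, b = 0.1, in the
# Jacobi basis P_n^{(a,b)}; the quadrature weights include w.
a, b = 2.0, 0.1
x, w = roots_jacobi(30, a, b)              # Gauss-Jacobi quadrature

def project(f, N):
    """Coefficients of the projection of f onto span{P_0, ..., P_N}."""
    coeffs = []
    for n in range(N + 1):
        Pn = eval_jacobi(n, a, b, x)
        coeffs.append(np.sum(w * f(x) * Pn) / np.sum(w * Pn * Pn))
    return np.array(coeffs)

def evaluate(coeffs, t):
    return sum(c * eval_jacobi(n, a, b, t) for n, c in enumerate(coeffs))

def g(t):                                  # sanity check: a cubic
    return t ** 3 - 2 * t + 1

c = project(g, 3)
t = np.linspace(-0.9, 0.9, 5)
assert np.allclose(evaluate(c, t), g(t))   # exact recovery of the cubic
```

Replacing `g` with $f(x) = \sin(5x)\exp(-2x^2)$ and a larger $N$ reproduces the behaviour of Figure 3: the weighted projection concentrates accuracy near $-1$.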


## References

1. G.E. Fasshauer, *Positive definite kernels: past, present and future*, 2011
2. M. Cuturi, *Positive definite kernels in machine learning*, 2009
3. R.B. Ash, *Information theory*, Dover Books on Mathematics, 2012
4. D. Xiu, *Numerical methods for stochastic computations: a spectral method approach*, Princeton University Press, 2010