Spectral mixture kernel


Squared Exponential

To show the difference from more commonly used kernels, such as the squared exponential, it’s helpful to first define a base kernel type. From this type, the squared exponential can then easily be defined. Each kernel will get its own struct in which the kernel-specific parameters are stored. Using this struct, a function will be defined that accepts a kernel struct as one argument, along with a pair of inputs on which the kernel itself will act.

The squared exponential kernel accepts two parameters: a lengthscale $\ell$ and a variance $\sigma^2$. Together, these two parameters define the squared exponential kernel $$k(x, x') = \sigma^2\exp\left(-\frac{||x-x'||^2}{2\ell^2} \right).$$

using Plots, Distributions, StatsPlots

abstract type Kernel end

struct SqExp <: Kernel
    ℓ::Float64
    σ::Float64
end

SqExp(; ℓ, σ) = SqExp(ℓ, σ)
kern(κ::SqExp, x::Real, y::Real) = κ.σ^2 * exp(-(x - y)^2 / (2 * κ.ℓ^2))
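As a quick sanity check on this definition: when $x = y$ the exponent vanishes, so the kernel should simply return $\sigma^2$.

κ = SqExp(ℓ=0.5, σ=2.0)
kern(κ, 1.0, 1.0)  # σ² = 4.0
kern(κ, 1.0, 2.0)  # 4.0 * exp(-2.0) ≈ 0.54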

All that’s left is to define a function that can construct the Gram matrix given a pair of arrays. It’ll be convenient (and easy through multiple dispatch) to define two functions here: one will compute the Gram matrix between all pairs of a single array, and a second will compute the Gram matrix between two distinct arrays. The fundamental operation is the same; the only difference is the resulting shape of the Gram matrix. Note that, for numerical stability, a small amount of noise will be added to the diagonal of the square (self-covariance) Gram matrix to ensure it is positive-definite.

function build_K(x::AbstractArray, y::AbstractArray, k::Kernel)
    K = Array{Float64}(undef, length(x), length(y))
    for (idx, i) in enumerate(x)
        for (jdx, j) in enumerate(y)
            K[idx, jdx] = kern(k, i, j)
        end
    end
    return K
end

# Self-covariance case: add a small jitter term to the diagonal to ensure
# the matrix remains positive-definite.
function build_K(x::AbstractArray, k::Kernel; noise::Float64=1e-6)
    K = build_K(x, x, k)
    for idx in 1:length(x)
        K[idx, idx] += noise
    end
    return K
end
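A brief sketch (with arbitrary random inputs) to illustrate the two resulting shapes:

x₁, x₂ = randn(5), randn(3)
size(build_K(x₁, x₂, SqExp(ℓ=0.5, σ=1.0)))  # (5, 3): cross-covariance
size(build_K(x₁, SqExp(ℓ=0.5, σ=1.0)))  # (5, 5): self-covariance with jitter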

Visualise the squared exponential

It’s useful to now visually inspect the squared exponential and its resulting Gram matrix. To achieve the former, the vector arising from evaluating $k(x, 0)$ is plotted. For the latter, the Gram matrix of $k(x, x')$ is computed and plotted. For both cases, values of $x\in\left[-1, 1\right]$ are considered.

Hyperparameter values of 0.5 and 1.0 are selected for the lengthscale and variance, respectively. Intuitively, the lengthscale controls the correlation between inputs: two inputs that are close in value are assumed to be more strongly correlated than two inputs far apart. The kernel variance determines how far function values deviate from the mean. TLDR: the lengthscale governs horizontal variation, the variance vertical.
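To make the lengthscale intuition concrete, here’s a quick comparison (using kern from above) of two inputs a fixed distance of 0.5 apart:

kern(SqExp(ℓ=0.1, σ=1.0), 0.0, 0.5)  # ≈ 3.7e-6, almost uncorrelated
kern(SqExp(ℓ=1.0, σ=1.0), 0.0, 0.5)  # ≈ 0.88, strongly correlated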

x = collect(-1:0.01:1)
K = build_K(x, SqExp(ℓ=0.5, σ=1.0))

p1 = plot(x, build_K(x, [0], SqExp(ℓ=0.5, σ=1.0)), leg=false)
p2 = heatmap(x, x, K, color=:viridis)
plot(p1, p2, size=(1000, 350))

Spectral Mixture Kernel

The spectral mixture kernel was first proposed in Wilson et al., and it fundamentally relies upon Bochner’s theorem, which states that any positive-definite (stationary) function, such as a kernel $k$, can be represented as the Fourier transform of a positive finite measure. Should this measure have a density, then it is called the spectral density of $k$. By the Wiener-Khintchine theorem, the spectral density and the autocorrelation function of a random process (i.e. the kernel) then form a Fourier pair

$$ k(\tau)=\int_{\mathbb{R}^{D}} e^{2 \pi i \mathbf{s}^{\top} \tau} S(\mathbf{s}) \,\mathrm{d} \mathbf{s} \qquad \text{and} \qquad S(\mathbf{s})=\int_{\mathbb{R}^{D}} e^{-2 \pi i \mathbf{s}^{\top} \tau} k(\tau) \,\mathrm{d} \tau.$$
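As a concrete check of this duality, the second integral can be evaluated in closed form for a one-dimensional squared exponential kernel (a standard Gaussian integral):

$$ k(\tau)=\sigma^{2} \exp\left(-\frac{\tau^{2}}{2\ell^{2}}\right) \quad \Longrightarrow \quad S(s)=\sigma^{2}\sqrt{2\pi\ell^{2}}\, \exp\left(-2 \pi^{2} \ell^{2} s^{2}\right).$$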

That is, the spectral density of a squared exponential kernel is itself a Gaussian. Wilson and Adams showed that if the spectral density is instead taken to be a mixture of $Q$ Gaussians with mixture weights $w=\lbrace w_{i} \rbrace_{i=1}^Q$, then the kernel expression in the Wiener-Khintchine theorem can be computed analytically as

$$ k(\tau)=\sum_{q=1}^{Q} w_{q} \cos \left(2 \pi \tau^{\top} \mu_{q}\right) \prod_{p=1}^{P} \exp \left(-2 \pi^{2} \tau_{p}^{2} v_{q}^{(p)}\right) $$

for $\tau = x - x'$, and Gaussian mean and variance values $\lbrace \mu_{i} \rbrace_{i=1}^{Q}$ and $\lbrace v_{i} \rbrace_{i=1}^{Q}$.

This can be implemented using the framework defined above: a kernel struct is first defined, and the Gram matrix’s value for a given pair of inputs is then computed by adding a method to the kern function through multiple dispatch.

struct SpectralMixture <: Kernel
    Q::Int64          # Number of Gaussians
    μ::AbstractArray  # Vector of means
    Σ::AbstractArray  # Vector of variances
    ω::AbstractArray  # Mixture weights
end

function SpectralMixture(mixtures::AbstractArray, weights::AbstractArray)
    μs = [m.μ for m in mixtures]
    Σs = [var(m) for m in mixtures]  # var returns σ², the variance of each Normal
    @assert length(μs) == length(Σs) == length(weights)
    @assert sum(weights) ≈ 1
    return SpectralMixture(length(mixtures), μs, Σs, weights)
end

function kern(κ::SpectralMixture, x::Real, y::Real)
    τ = x - y  # signed lag between the inputs
    return sum(κ.ω .* cos.(2π * τ .* κ.μ) .* exp.(-2π^2 * τ^2 .* κ.Σ))
end

The parameters of the individual Gaussian distributions can be related back to the squared exponential kernel’s parameters: the inverse mean $1/\mu_q$ gives the period of the $q$-th component, while the inverse standard deviation $1/\sqrt{v_q}$ acts as its lengthscale, and thus dictates how quickly inputs become uncorrelated with one another.
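As a quick check of this correspondence (a small sketch using the structs above), a single zero-mean component should collapse to a squared exponential with lengthscale $\ell = 1/(2\pi\sqrt{v})$:

# One zero-mean Gaussian component: the cosine term equals 1, leaving a
# pure squared exponential with 1/(2ℓ²) = 2π²v, i.e. ℓ = 1/(2π√v).
v = 0.5
sm = SpectralMixture([Normal(0.0, sqrt(v))], [1.0])
se = SqExp(ℓ=1/(2π*sqrt(v)), σ=1.0)
kern(sm, 0.3, 0.7) ≈ kern(se, 0.3, 0.7)  # true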

It’s possible to visualise the spectral mixture kernel using the framework defined for the squared exponential kernel.

mixtures = [Normal(1., 10.2), Normal(20.1, 5.1), Normal(0.2, 4.5)]
κ = SpectralMixture(mixtures, [0.2, 0.4, 0.4])

Ksm = build_K(x, κ; noise=1e-6)
p1 = plot(x, build_K(x, [0], κ), leg=false)
p2 = heatmap(x, x, Ksm, color=:viridis)
plot(p1, p2, size=(1000, 350))