
Singular Learning Theory - Part 1

“We approached the case, you remember, with an absolutely blank mind, which is always an advantage. We had formed no theories. We were simply there to observe and to draw inferences from our observations.” Sherlock Holmes, "The Adventure of The Cardboard Box"

This is intended as the first in a series of articles working through the core texts of Singular Learning Theory (henceforth SLT): "Algebraic Geometry and Statistical Learning Theory" (Watanabe, 2009) and "Mathematical Theory of Bayesian Statistics" (Watanabe, 2018), both written by Sumio Watanabe. This is perhaps mainly my attempt at digesting the texts, not for the first time, disguised as a series of lectures. The target audience I have in mind is anyone familiar with undergraduate-level mathematics who is interested in the theoretical and mathematical aspects of statistical learning and AI. That being said, the nature of the subject is such that it draws upon tools and concepts from a wide array of mathematical disciplines, spanning algebra to analysis. Some, like probability and statistics, are crucial in the sense that they are the objects of study. Some, like manifold theory, are only required to make sure that the mathematical objects we manipulate are well defined and sufficiently general for the theory to be useful. Others, like algebraic geometry and Schwartz distribution theory, export crucial theorems that we shall use to prove and understand the central results of SLT. Yet others, like statistical mechanics, are topics where we might find unexpected connections and possible cross-pollination. We shall introduce these topics in their own time, as they come up naturally in our exploration of SLT. Our modest aim regarding these prerequisites is to cover them in sufficient depth to follow the proofs of various results in SLT, and to at least appreciate their significance: why they are needed and what happens when we cannot borrow from them. Of course, each of them is a profound field of study in its own right, and if time and energy permit, we shall delve beyond what is strictly necessary to see the wonders they contain.

What do we aspire to study?

Before making definitions and carving out our domain, let’s list a few informal ways we think and talk about our field of study. If nothing else, this will help with triangulating what we aspire to study, and where the definitions and tools we shall develop are helpful and where they are not.

  • We are studying systems that can learn from examples.
  • We are studying algorithms or physical devices that perceive the world and learn its patterns.
  • We are studying ways to reason and make decisions with imperfect information and in the presence of randomness.
  • We are studying models of systems with randomness and how to find the best approximation for the particular part of the world under observation.
  • We are studying, or at least trying to study, what we think is intelligence. In particular, machine intelligence: how machines can learn patterns and make intelligent decisions.

Examples of objects of study that fall under these aspirations:
  • A device that implements the calculation $\sum_{i = 1}^6 f(i)/6$ can decide whether a game with payout $f(i)$ on the roll of a six-sided die is worth playing.
  • Perhaps a better, more intelligent device would instead use observations of 10000 dice throws and implement $\sum_{i = 1}^6 f(i) \frac{\mathrm{freq}(i)}{10000}$ (see the sketch after this list).
  • A statistician can use a sample of human height measurements $\set{h_1, \dots, h_n}$ to estimate the distribution of heights in a comparable human population, or to make a good guess at the probability that a child will grow to be taller than 150cm.
  • The natural language model GPT-3 is a massive and complicated approximation to human language.
  • AlphaGo is a machine intelligence that was able to learn from millions of generated games of Go, and learned the pattern of winning strategies so well that it defeated the reigning world champion.
  • The brains of humans and other intelligent species seem to implement complicated and as yet poorly understood systems that learn the patterns of their surroundings, triage tasks and resources, and make decisions critical to the survival of their species.
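As a concrete illustration of the second item, here is a minimal sketch in Python of the "more intelligent" device: estimate the expected payout from observed frequencies rather than assuming a fair die. The payout function and the simulated throws are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def payout(i):
    # hypothetical payout: win (i - 3) units when face i comes up
    return i - 3

# idealised device: expected payout assuming a fair die
ideal = sum(payout(i) / 6 for i in range(1, 7))

# empirical device: estimate from 10000 observed throws (simulated here)
throws = rng.integers(1, 7, size=10_000)
freq = np.bincount(throws, minlength=7)[1:]  # counts of faces 1..6
empirical = sum(payout(i) * freq[i - 1] / 10_000 for i in range(1, 7))

print(ideal, empirical)  # play only if the estimated expected payout is positive
```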

Statistical Background

The Task of Learning

Let’s come back down to earth with some definitions. In a typical statistical learning scenario, we are given a set of examples coming from a data generating process.

We are given a data set $$ D_N = \set{X_1, \dots, X_N} $$ of $N \in \N$ training samples consisting of independently and identically distributed random variables $X_i$, i.e. i.i.d. samples, drawn from what we shall call the true distribution, specified by a probability density $q(x)$.

Unless otherwise specified, the random variables take values in a fixed Euclidean space $\R^n$. For clarity, we shall often make rather strong regularity assumptions on any probability distributions that occur throughout the development of SLT. Not only do we assume that densities exist for the probability distributions we study, we shall also assume at least continuity, and often analyticity (the full set of conditions will be specified later on). This setting allows us to leverage heavy-duty machinery from algebraic geometry to explicate salient features of learning machines before pursuing further generality. Though this seems rather restrictive, it does cover a large class of distributions we encounter in practice, such as normal mixtures, neural networks, Bayesian networks and much more.

The truth $q$ is what a learning machine wants to access, but it can only see its shadow via the data set $D_N$. For our purposes, we shall formulate the task of a statistical learning machine as that of approximating the true distribution $q$ from the given set of training samples $D_N$.

A statistical inference method, or a learning machine, is a measurable function taking any data set $D_N$ to a probability distribution (again, we shall work with distributions on $\R^n$) specified by $\hat{p}(x)$, $$ \begin{align*} \mathrm{LearningMachine}: \set{\text{data set of any size $N$}} &\to \set{\text{probability distributions}} \\ D_N &\mapsto \hat{p}(x). \end{align*} $$
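A minimal sketch of this signature in Python, assuming we represent a density on $\R$ simply as a function; the toy moment-matching machine below is purely illustrative and is not one of the estimators discussed later.

```python
import math
from typing import Callable, Sequence

# a density p_hat on R, represented as a function x -> p_hat(x)
Density = Callable[[float], float]

# a learning machine maps a data set of any size N to a density p_hat
LearningMachine = Callable[[Sequence[float]], Density]

def moment_matching_gaussian(data: Sequence[float]) -> Density:
    """Toy learning machine: fit a Gaussian by matching the sample mean and variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return lambda x: math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

machine: LearningMachine = moment_matching_gaussian
p_hat = machine([0.1, 0.4, 0.9, 0.3])
print(p_hat(0.5))
```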

The concerns of statistical learning theory include studying

  • how to construct such a learning machine
  • how to obtain $\hat{p}$ efficiently (efficiency here being measured in terms of computational resources and the size of the data set),
  • and how well $\hat{p}$ thus constructed approximates $q$.

Model-Truth-Prior

A useful and general way of constructing and studying the approximation $\hat{p}$ to a given truth $q(x)$ is to first look at a large family of distributions, then to have a way of measuring how well each of them approximates $q$, and finally to come up with a procedure for finding the best-performing approximation.

A statistical model (we might sometimes call this the hypothesis space) is a family of probability distributions $p(x \mid w)$ parameterised by a subset $W \subset \R^d$ of a $d$-dimensional parameter space for some $d \in \N$. We assume that $p(x \mid w)$ has, for every $w \in W$, the same support as the true distribution $q(x)$, $$ \overline{\set{x \in \R^n \wh p(x | w) > 0}} = \overline{\set{x \in \R^n \wh q(x) > 0}}. $$ Given a set of i.i.d. training samples $D_N = \set{X_1, \dots, X_N}$ as before, we call the probability of observing $D_N$ for a particular parameter, $$ L_N(w) := \P(X_1, \dots, X_N \mid w) = \prod_{i = 1}^N p(X_i \mid w), $$ the likelihood function.
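To make this concrete, here is a minimal sketch (using NumPy and SciPy, with a hypothetical one-parameter model $p(x \mid w) = \mathcal{N}(w, 1)$ and truth $q = \mathcal{N}(0.5, 1)$ that we shall reuse in later sketches) of computing the likelihood and log-likelihood on a simulated data set.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# hypothetical setup: truth q = N(0.5, 1), model p(x | w) = N(w, 1), N = 100 samples
X = rng.normal(loc=0.5, scale=1.0, size=100)

def log_likelihood(w, X):
    # log L_N(w) = sum_i log p(X_i | w)
    return norm.logpdf(X, loc=w, scale=1.0).sum()

def likelihood(w, X):
    # L_N(w) = prod_i p(X_i | w), computed via the log for numerical stability
    return np.exp(log_likelihood(w, X))

print(log_likelihood(0.5, X), log_likelihood(0.0, X))
```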

Taking a Bayesian perspective, we shall endow the parameter space $W$ with a prior probability distribution with density $\varphi(w)$ so that, applying Bayes Theorem, we get the posterior distribution on $W$ $$ p(w \mid D_N) = \frac{L_N(w) \varphi(w)}{Z_N} $$ where the normalising constant in the denominator $$ Z_N := \int_W L_N(w) \varphi(w) dw = \int_W \prod_{i = 1}^N p(X_i \mid w) \varphi(w) dw $$ is an important quantity known as the model evidence.
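Continuing the hypothetical Gaussian setup from the previous sketch, the posterior and model evidence can be approximated numerically on a grid over a one-dimensional parameter space; the grid, prior and model here are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=1.0, size=100)   # data from the truth q = N(0.5, 1)

# model p(x | w) = N(w, 1), prior phi(w) = N(0, 1), grid over W = [-5, 5]
w = np.linspace(-5, 5, 2001)
dw = w[1] - w[0]
log_L = np.array([norm.logpdf(X, loc=wi, scale=1.0).sum() for wi in w])
log_prior = norm.logpdf(w, loc=0.0, scale=1.0)

# model evidence Z_N = int L_N(w) phi(w) dw, computed stably in log space
log_unnorm = log_L + log_prior
shift = log_unnorm.max()
log_Z_N = shift + np.log(np.exp(log_unnorm - shift).sum() * dw)

# posterior p(w | D_N) = L_N(w) phi(w) / Z_N
posterior = np.exp(log_unnorm - log_Z_N)
print(log_Z_N, w[posterior.argmax()])
```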

(Quite) a few things to say about $Z_N$

This quantity comes with a few different names (a testament to its importance and ubiquity in the mathematical sciences). In Bayesian statistics, it goes by the names "model evidence" or "marginal likelihood". In statistical mechanics, it goes by "partition function" or, after a negative log-transform, the "free energy", denoted $$ F_N = -\log Z_N. $$

To justify its name, we observe that $Z_N$ depends on the set of random variables $D_N$, hence it is itself a random variable. In fact, it is precisely the probability of observing the data $\set{X_1, \dots, X_N}$ given the model and prior. Perhaps this is clearer in the basic form of Bayes Theorem, $P(H|X) = P(X|H) P(H) / P(X)$, thinking of $H$ as a particular hypothesis in a class of hypotheses and $X$ as the observed data. The denominator $P(X) = \sum_{h \in \text{all hypotheses}} P(X|h)P(h)$ is the probability of observing $X$ under the hypothesis class. It is therefore the likelihood of the given model-prior pair given the data, hence "model evidence". It is computed by marginalising out the model parameter $w$, hence "marginal likelihood".

We shall also consider a generalisation of the posterior distribution and model evidence where we introduce a continuous parameter $\beta > 0$, which we shall interpret as an "inverse temperature" $\beta = \frac{1}{T}$: $$ \begin{align*} p(w \mid D_N) &= \frac{\varphi(w) \prod_{i = 1}^N p(X_i \mid w)^\beta}{Z_N(\beta)}\\ Z_N(\beta) &= \int_W \varphi(w) \prod_{i = 1}^N p(X_i \mid w)^\beta dw. \end{align*} $$ Note that

  • $\beta = 1$ reduces down to the case of usual Bayes posterior distribution and evidence.
  • As $\beta \to \infty$, the posterior distribution converges to the delta distribution concentrated at the maximum likelihood estimate of $w$, $\delta(w - \hat{w}_{ML})$, if it exists and is unique in the support of $\varphi(w)$. We can think of this case as taking the temperature down to zero, so that the system considers only the ground state specified by the observed data.
  • As $\beta \to 0$, the posterior collapses to the prior. This is the case of infinite temperature: the observed data has no effect, and the system explores all allowed states at the frequencies specified by the prior.
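A minimal sketch, again using the hypothetical Gaussian grid setup, showing how the tempered posterior interpolates between the prior ($\beta \to 0$) and a spike near the MLE (large $\beta$).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=1.0, size=100)

w = np.linspace(-5, 5, 2001)
dw = w[1] - w[0]
log_L = np.array([norm.logpdf(X, loc=wi, scale=1.0).sum() for wi in w])
log_prior = norm.logpdf(w, loc=0.0, scale=1.0)

def tempered_posterior(beta):
    # p_beta(w | D_N) proportional to phi(w) * L_N(w)^beta, normalised on the grid
    log_unnorm = log_prior + beta * log_L
    p = np.exp(log_unnorm - log_unnorm.max())   # stabilise before exponentiating
    return p / (p.sum() * dw)

# beta -> 0 recovers the prior; large beta concentrates near the MLE
for beta in (0.0, 1.0, 100.0):
    p = tempered_posterior(beta)
    print(beta, w[p.argmax()], p.max())
```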

To make the analogy with statistical mechanics even tighter, we can introduce the analogue of a potential by normalising these quantities by the true probabilities of the observed data, $q(X_i)$: $$ \begin{align*} Z^0_N(\beta) &= \frac{Z_N(\beta)}{\prod_{i = 1}^N q(X_i)^\beta} \\ F^0_N(\beta) &= -\log Z^0_N(\beta) = F_N(\beta) - N\beta S_N \\ S_N &= -\frac{1}{N} \sum_{i = 1}^N \log q(X_i) \end{align*} $$ where $S_N$ is the empirical entropy. "Empirical" because it is an empirical estimate of the entropy of the true distribution, given by $$ S(q) = - \int q(x) \log q(x) dx = \E[-\log q(X)] = \E[S_N]. $$
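A quick numerical check of this last statement, under the same hypothetical truth $q = \mathcal{N}(0.5, 1)$: the empirical entropy $S_N$ converges to $S(q)$ as $N$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# truth q = N(0.5, 1); empirical entropy S_N = -(1/N) sum_i log q(X_i)
for N in (10, 1_000, 100_000):
    X = rng.normal(loc=0.5, scale=1.0, size=N)
    S_N = -norm.logpdf(X, loc=0.5, scale=1.0).mean()
    print(N, S_N)

# the entropy of a unit-variance Gaussian is S(q) = (1/2) log(2 pi e)
print(0.5 * np.log(2 * np.pi * np.e))
```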

Some Important Statistical Estimation Methods

With the above objects defined, we can now specify a few ways to construct $\hat{p}$.

First, there is the maximum likelihood method, which constructs an approximation to the truth $q$ by finding, within the model parameter space $W$, the parameter $w \in W$ that maximises the likelihood function. The assumption is that, since the data set $D_N$ is generated from $q(x)$, a good approximation of $q$ is the model $p(x \mid w)$ with the highest likelihood of having generated that data.

The Maximum Likelihood Estimator (MLE) is given by the model parameter that maximises the likelihood function, $$ \hat{w}_{ML} = \mathrm{argmax}_{w \in W} L_N(w). $$ It turns out to be easier in practice, for numerical reasons, and more convenient for theoretical discussions to introduce the negative log-likelihood function $l_N(w) = -\log L_N(w)$; the MLE is then given by minimising $l_N(w)$, i.e. $\hat{w}_{ML} = \mathrm{argmin}_{w \in W} l_N(w)$. With the MLE, the approximate distribution is given by the model at $\hat{w}_{ML}$, $$ \hat{p}_{ML}(x) = p(x \mid \hat{w}_{ML}). $$
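A minimal sketch of maximum likelihood estimation for the hypothetical Gaussian model, minimising the negative log-likelihood numerically; for this model the exact MLE is just the sample mean, which makes the result easy to check.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=1.0, size=100)   # data from q = N(0.5, 1)

# negative log-likelihood l_N(w) for the model p(x | w) = N(w, 1)
def neg_log_likelihood(w):
    return -norm.logpdf(X, loc=w, scale=1.0).sum()

w_ml = minimize_scalar(neg_log_likelihood, bounds=(-5, 5), method="bounded").x
p_hat_ml = lambda x: norm.pdf(x, loc=w_ml, scale=1.0)   # hat{p}_ML(x) = p(x | hat{w}_ML)

print(w_ml, X.mean())   # the two should agree up to optimiser tolerance
```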

MLE is a typical point estimation method, meaning that an objective function (here, the likelihood) is constructed and an optimisation procedure is then employed to find the optimum, which gives the desired estimate. It is often assumed that the model space is parametrised in such a way that the likelihood function has functional properties suitable for optimisation, such as differentiability, convexity or, as in the case of large neural networks, a form whose gradient is convenient to compute for gradient descent.

Bringing some Bayesian perspective into the picture, we can construct another point estimate by biasing the likelihood function with a prior belief about which parameters $w \in W$ are a priori more likely to occur, before having observed any data. This gives us the Maximum a Posteriori method.

The Maximum a Posteriori (MAP) estimate is obtained by maximising the posterior distribution, $$ \hat{w}_{MAP} = \mathrm{argmax}_{w \in W}\, p(w \mid D_N) = \mathrm{argmin}_{w \in W} \left( l_N(w) - \log \varphi(w) \right), $$ where $\varphi(w)$ is the prior distribution. The approximation to $q(x)$ is then given by $$ \hat{p}_{MAP}(x) = p(x \mid \hat{w}_{MAP}). $$
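The corresponding sketch for MAP estimation, adding the hypothetical prior $\varphi(w) = \mathcal{N}(0, 1)$ to the objective.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=1.0, size=100)

# MAP objective: l_N(w) - log phi(w), with prior phi(w) = N(0, 1)
def neg_log_posterior(w):
    return -norm.logpdf(X, loc=w, scale=1.0).sum() - norm.logpdf(w, loc=0.0, scale=1.0)

w_map = minimize_scalar(neg_log_posterior, bounds=(-5, 5), method="bounded").x
p_hat_map = lambda x: norm.pdf(x, loc=w_map, scale=1.0)

print(w_map)   # pulled slightly towards the prior mean 0 compared with the MLE
```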

In extreme cases, if we believe (on sound theoretical grounds or just sheer human pig-headedness) that the model parameters lie in a certain region and nowhere else, then even if the MLE method claims that the data is most likely to have been generated by the model at $\hat{w}_{ML}$, the MAP estimate might completely disregard that possibility.

Instead of considering one “best” parameter in the model, a fully Bayesian perspective will advocate for taking the distributions at all parameters $w \in W$ into account, with the importance of each $p(x \mid w)$ weighted by its posterior probability.

The Bayes predictive distribution for the approximation of $q(x)$ is given by $$ \hat{p}_{B}(x) = p(x \mid D_N) = \E_{w}\sqbrac{p(x \mid w)} = \int_{W} p(x \mid w) p(w \mid D_N) dw $$ where $\E_w$ denotes taking expectation with respect to the posterior.
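A minimal grid-based sketch of the predictive distribution for the hypothetical Gaussian setup, averaging $p(x \mid w)$ over the numerically computed posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=1.0, size=100)

# posterior on a grid for the model N(w, 1) with prior N(0, 1)
w = np.linspace(-5, 5, 2001)
dw = w[1] - w[0]
log_unnorm = norm.logpdf(w, loc=0.0, scale=1.0) + np.array(
    [norm.logpdf(X, loc=wi, scale=1.0).sum() for wi in w])
post = np.exp(log_unnorm - log_unnorm.max())
post /= post.sum() * dw

# predictive density: hat{p}_B(x) = int p(x | w) p(w | D_N) dw
def p_hat_bayes(x):
    return np.sum(norm.pdf(x, loc=w, scale=1.0) * post) * dw

print(p_hat_bayes(0.0), norm.pdf(0.0, loc=0.5, scale=1.0))   # compare with q(0)
```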

Another method of constructing a statistical estimate $\hat{p}$, perhaps rare in practice but of theoretical importance, is Gibbs estimation. With Gibbs estimation, every time one wishes to estimate the probability of $X = x$, one starts by randomly drawing a parameter $\hat{w}$ from the posterior distribution $p(w \mid D_N)$ and answers with $p(X = x \mid \hat{w})$ (I am a little unsure about whether we draw $\hat{w}$ once and for all or perform multiple draws as described).

The Gibbs estimation method is given by randomly drawing a parameter from the posterior distribution, $$ \hat{w}_{gibbs} \sim p(w \mid D_N), $$ and giving the approximation to $q(x)$ as $$ \hat{p}_{gibbs}(x) = p(x \mid \hat{w}_{gibbs}). $$
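A corresponding sketch of Gibbs estimation on the same grid posterior: draw one parameter from the posterior and answer with the model at that draw.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=1.0, size=100)

# posterior over grid points, as in the previous sketch
w = np.linspace(-5, 5, 2001)
log_unnorm = norm.logpdf(w, loc=0.0, scale=1.0) + np.array(
    [norm.logpdf(X, loc=wi, scale=1.0).sum() for wi in w])
post = np.exp(log_unnorm - log_unnorm.max())
post /= post.sum()                      # probabilities for drawing a grid point

# Gibbs estimation: draw w_hat ~ p(w | D_N), then answer with p(x | w_hat)
w_gibbs = rng.choice(w, p=post)
p_hat_gibbs = lambda x: norm.pdf(x, loc=w_gibbs, scale=1.0)

print(w_gibbs, p_hat_gibbs(0.0))
```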

Kullback-Leibler Divergence

Next comes the problem of how to measure the success or failure of our approximation. We need a quantity that measures the difference between two density functions. Enter the Kullback-Leibler divergence.

The Kullback-Leibler Divergence or KL-divergence between two probability density functions $q(x)$ and $p(x)$ on an open set $A \subset \R^n$ is given by $$ K(q \mid\mid p) := \int_A q(x) \log \frac{q(x)}{p(x)} dx. $$
KL-divergence on general probability measures

KL-divergence can be defined in much greater generality between any two probability measures $P$ and $Q$ over the same measurable space $(X, \sigma)$ by $$ K(Q \mid\mid P) := \int_X \log \frac{dQ}{dP} dQ $$ whenever the Radon-Nikodym derivative $\frac{dQ}{dP}$ exists, i.e. whenever $Q$ is absolutely continuous with respect to $P$. If there is another measure $\mu$ on $X$ for which $p = \frac{dP}{d\mu}$ and $q = \frac{dQ}{d\mu}$ exist, then by the chain rule we recover a more familiar form, $$ K(Q \mid\mid P) = \int_X \log\brac{\frac{\frac{dQ}{d\mu}}{\frac{dP}{d\mu}}} \frac{dQ}{d\mu} d\mu = \int_X q \log \frac{q}{p} d\mu. $$ In the definition we gave above, we have been using the standard Lebesgue measure on $\R^n$. As another example, if we take the counting measure $\mu_c$ on a discrete measure space $X$ and probability measures $P$ and $Q$ on $X$, then the expression becomes $$ K(Q \mid\mid P) = \sum_{x \in X}Q(x) \log \frac{Q(x)}{P(x)}. $$
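As a small sanity check (not from the text), the KL-divergence between two unit-variance Gaussians computed by numerical quadrature agrees with the closed form $K(q \mid\mid p) = (\mu_q - \mu_p)^2 / 2$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# q = N(mu_q, 1) and p = N(mu_p, 1)
mu_q, mu_p = 0.5, 0.0
q = lambda x: norm.pdf(x, loc=mu_q, scale=1.0)
p = lambda x: norm.pdf(x, loc=mu_p, scale=1.0)

# K(q || p) = int q(x) log(q(x) / p(x)) dx, by numerical quadrature
kl_numeric, _ = quad(lambda x: q(x) * np.log(q(x) / p(x)), -20, 20)

print(kl_numeric, (mu_q - mu_p) ** 2 / 2)   # both approximately 0.125
```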

I find it helpful to understand a little about Shannon’s information entropy (we shall need the concept of information entropy in our study anyway) in order to get an intuitive sense of the KL-divergence. Informally, Shannon’s information quantifies how surprised one should be when an event that was thought to have probability $p \in [0, 1]$ occurs. Intuitively, the observation “the sun rose in the east this morning” has less information content than “it was raining this morning in Perth” or “the Dow Jones index rose by 3% this morning”. Think about which “insider knowledge” - knowledge that dispels prior uncertainty by raising the event’s probability to 1 - you would pay more to obtain in order to gain an edge in a bet. Stated another way, the information content quantifies our ignorance about an event prior to observation.

A little more formally, Shannon information is another way of encoding probability, with the following properties:

  • Information content is a monotonically decreasing non-negative function $I: [0, 1] \to \R_{\geq 0}$ of the probability $p$ of the event.
  • Events with complete certainty have zero information.
  • Information of independent events adds.

These properties, together with some smoothness assumptions on $I$, force
\[I(p) = -\log p.\]

If we have a set of events following a probability distribution (more precisely, a sigma algebra of events with a probability measure, i.e. a probability space), we can quantify the expected amount of information, or the expected amount of surprise, if we were to observe a large number of events drawn from the distribution. See Shannon’s original 1948 paper (Shannon, 1948), which started off information theory, for a formal derivation and discussion.

The Shannon entropy of a random variable $X$ taking on a finite number of values $\set{x_1, \dots, x_n}$, each with probability $p_i = p(X = x_i)$, is given by the following expectation (we use the letter $S$ for entropy here, consistent with statistical physics; Shannon originally used the letter $H$ for this quantity, named after Boltzmann's H-theorem in thermodynamics): $$ S(p) = \E_p[-\log p(X)] = -\sum_{i = 1}^n p_i \log p_i. $$

For a continuous distribution $p(x)$, we replace the sum with an integral, $$ S(p) = -\int p(x) \log p(x) dx. $$

Observe that even though low-probability events have high information content $-\log p(x)$, they occur with low frequency $p(x)$. This push and pull, together with the fact that probabilities sum to unity, means that the distribution with the highest entropy is the uniform distribution (if there are no other constraints and if a uniform distribution can be defined on the space), as the short derivation below makes precise.
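For the finite discrete case this is a standard Lagrange-multiplier calculation (a routine argument, not taken from Watanabe's texts): maximise $S(p)$ subject to the normalisation constraint, $$ \begin{align*} \mathcal{L}(p, \lambda) &= -\sum_{i = 1}^n p_i \log p_i + \lambda \left( \sum_{i = 1}^n p_i - 1 \right), \\ \frac{\partial \mathcal{L}}{\partial p_i} &= -\log p_i - 1 + \lambda = 0 \quad\Rightarrow\quad p_i = e^{\lambda - 1} \text{ for every } i, \end{align*} $$ so the constraint forces $p_i = \frac{1}{n}$ for all $i$, giving the maximum entropy $S = \log n$.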

Theory of Statistical Learning

By the law of large numbers, for each fixed $w \in W$ the normalised negative log-likelihood converges as $N \to \infty$: \[-\frac{1}{N} \log L_N(w) \to K(w) + S(q),\] where $K(w) := K(q \mid\mid p(\,\cdot \mid w))$ measures how far the model at $w$ is from the truth, and $S(q)$ is the entropy of the truth.
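The convergence is just the law of large numbers applied to the i.i.d. random variables $-\log p(X_i \mid w)$: $$ -\frac{1}{N} \log L_N(w) = \frac{1}{N}\sum_{i = 1}^N -\log p(X_i \mid w) \;\longrightarrow\; \E_q[-\log p(X \mid w)] = \int q(x) \log \frac{q(x)}{p(x \mid w)} dx - \int q(x) \log q(x) dx = K(w) + S(q). $$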

Since $S(q)$ does not depend on $w$, minimising the limiting quantity is the same as minimising $K(w)$, which would give the closest model to $q(x)$. But at finite $N$, minimisation of $-\log L_N(w)$ is not equivalent to minimisation of $K(w)$: the empirical average fluctuates around its expectation, and the two minimisers can differ. So MLE is neither the only valid nor necessarily the best statistical estimation method.

Fisher Information Matrix

Singular Models

Various equivalent definitions of singular models (via FIM and K(w)). Some implications.

Model Selection


References

Watanabe, Sumio. 2009. "Algebraic Geometry and Statistical Learning Theory." Cambridge University Press.

Watanabe, Sumio. 2018. "Mathematical Theory of Bayesian Statistics." Chapman and Hall/CRC.

Shannon, C. E. 1948. "A Mathematical Theory of Communication." The Bell System Technical Journal 27 (3): 379–423.

This post is licensed under CC BY 4.0 by the author.
