import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
from ipywidgets import interact

import torch

Important: Run this notebook in the cloud by clicking any of the options at the top of the page (e.g. google colab).

Recommended Educational Videos related to this topic

This is a series of videos by 3blue1brown on youtube that I HIGHLY RECOMMEND for understanding neural networks with pretty animations:

  • Introducing neural networks
  • Gradient descent
  • Backpropagation, to efficiently calculate the gradient
  • Details about backpropagation

Some visualisations

Activation functions

The following are visualisations of some common activation functions used in ANNs.
  • ReLU (rectified linear unit). Just a fancy name for $\mathrm{ReLU}(x) = \max(0, x)$
  • Sigmoid: $\sigma(x) = 1 / (1 + e^{-x})$
  • tanh: hyperbolic tangent.
@interact(w=(-3, 3), b=(-3, 3))
def visualise(w=1, b=0, activation=["relu", "tanh", "sigmoid"]):
    x = np.linspace(-5, 5, num=100)
    s = w * x + b  # affine pre-activation
    if activation == "relu":
        y = np.clip(s, a_min=0, a_max=np.inf)
    elif activation == "tanh":
        y = np.tanh(s)
    elif activation == "sigmoid":
        y = 1 / (1 + np.exp(-s))  # sigma(s) = 1 / (1 + e^(-s))
    plt.plot(x, y)

Composing activation functions with different parameters

This is an interactive visualisation of a neural network implementing the following function: $$ f_w(x) = \sigma(w_1 \sigma(v_1 x + a_1) + w_2 \sigma(v_2 x + a_2) + b) $$ which is a fully connected (feedforward) network with

  • 1 input node.
  • 2 nodes in the hidden layer.
  • 1 output node.

You can count the parameters:

  • from the input to the first hidden layer, there are $1 \times 2 + 2 = 4$ parameters: $(v_1, v_2, a_1, a_2)$.
  • from the first hidden layer to the output layer, there are $2 \times 1 + 1 = 3$ parameters: $(w_1, w_2, b)$.
  • in total there are $4 + 3 = 7$ parameters: $w = (v_1, v_2, a_1, a_2, w_1, w_2, b)$ (see the quick check below).
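As a quick sanity check of this count, here is the same 1-2-1 architecture written with torch (imported at the top of the notebook). This is only a sketch, assuming sigmoid activations; summing the sizes of its parameter tensors gives 7.

net = torch.nn.Sequential(
    torch.nn.Linear(1, 2),   # v1, v2 (weights) and a1, a2 (biases): 4 parameters
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 1),   # w1, w2 (weights) and b (bias): 3 parameters
    torch.nn.Sigmoid(),
)
sum(p.numel() for p in net.parameters())  # 7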

Play with the sliders to change the parameters and see how $f_w$ changes. How closely can you get $f_w$ (blue curve) to match the given orange curve?

@interact(v1=(-3., 3.), v2=(-3., 3.), a1=(-3., 3.), a2=(-3., 3.), w1=(-3., 3.), w2=(-3., 3.), b=(-3., 3.))
def visualise_1_hidden_layer(v1=1, v2=1, a1=1, a2=0, w1=0, w2=0, b=0, activation=["relu", "tanh", "sigmoid"]):
    x = np.linspace(-5, 5, num=100)
    # pre-activations of the two hidden nodes
    s1 = v1 * x + a1
    s2 = v2 * x + a2

    if activation == "relu":
        f = lambda s: np.clip(s, a_min=0, a_max=np.inf)
    elif activation == "tanh":
        f = lambda s: np.tanh(s)
    elif activation == "sigmoid":
        f = lambda s: 1 / (1 + np.exp(-s))
    # hidden activations and the network output f_w(x)
    h1 = f(s1)
    h2 = f(s2)
    y = f(w1 * h1 + w2 * h2 + b)
    plt.plot(x, y)        # blue: the network f_w
    plt.plot(x, -x**2)    # orange: the target curve

Visualise gradient descent path

This might be a better representation of what SGD does in more complicated (and therefore more realistic) scenarios than the 1D-parameter case with a unique minimum that is usually shown.

  • The contours in the picture show the "loss landscape", i.e. the contour plot of some loss function $L(w_1, w_2)$ with 2 parameters -- $(w_1, w_2)$ are the axes.
  • The lighter the colour of the contour, the higher the loss (the worse the model is at that particular $(w_1, w_2)$). The goal of any optimisation algorithm -- like SGD -- is to "slide" down to a low-loss region (darker red region).
  • The red dots are various random initialisations.
  • The coloured trajectories starting from the red dots are SGD trajectories. Their end points would be the output of the optimisation algorithm (i.e. the training output).
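The original figure is not reproduced here, but the following is a minimal sketch of how such a picture can be generated. The two-parameter loss below is made up purely for illustration (it is not the loss behind the original figure), and the trajectories use plain gradient descent rather than SGD (SGD would additionally subsample the data at each step).

def loss2d(w1, w2):
    # a made-up non-convex loss landscape with several local minima
    return np.sin(3 * w1) * np.cos(3 * w2) + 0.1 * (w1**2 + w2**2)

def grad2d(w1, w2, eps=1e-5):
    # numerical (central difference) gradient, good enough for a visualisation
    dw1 = (loss2d(w1 + eps, w2) - loss2d(w1 - eps, w2)) / (2 * eps)
    dw2 = (loss2d(w1, w2 + eps) - loss2d(w1, w2 - eps)) / (2 * eps)
    return np.array([dw1, dw2])

w1g, w2g = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
plt.contourf(w1g, w2g, loss2d(w1g, w2g), levels=30, cmap="Reds_r")  # darker red = lower loss

eta = 0.05                                # learning rate
for _ in range(10):                       # several random initialisations
    w = np.random.uniform(-3, 3, size=2)
    plt.plot(w[0], w[1], "ro")            # red dot: the initialisation
    path = [w.copy()]
    for _ in range(100):                  # gradient descent steps
        w = w - eta * grad2d(w[0], w[1])
        path.append(w.copy())
    path = np.array(path)
    plt.plot(path[:, 0], path[:, 1])      # trajectory sliding down the landscape

Try increasing eta or reducing the number of steps in this sketch while thinking about the questions below.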

Questions:

  • Could there be multiple optima?
  • Could there be multiple global optima?
  • Do the trajectories always find an optimum?
  • Do the trajectories always find a global optimum?
  • What happens if we increase the learning rate $\eta$?
  • What happens if we stop the trajectories earlier?


Finding the mean using gradient descent

This is an illustration of a (very) simple machine learning problem, where the goal is to find an unknown mean $\mu^*$ of a Gaussian that gives us an i.i.d. data set $D_n = \{x_1, x_2, \dots, x_n\}$, where the $x_i$ are drawn from the normal distribution $N(\mu^*, \sigma=3)$.

Let's generate the data first. (Change n as you like. Try a low n ~ 5 first and then a high n ~ 100 later.) We shall plot the randomly generated data on a histogram. Does the data look like a Gaussian for your choice of n?

true_mean = 2  # in practice we would not know this
sigma = 3
n = 10
x = sigma * np.random.randn(n) + true_mean  # n i.i.d. samples from N(true_mean, sigma^2)
plt.hist(x)

Gradient descent algorithm for finding empirical mean

We will now develop a gradient descent algorithm to estimate the mean. We first define a loss function $$ \ell(\mu) = \frac{1}{n} \sum_{i = 1}^n (\mu - x_i)^2. $$

Why do we choose this "sum-of-squared-errors" loss function? Well, because it has a (global) minimum that coincides with the empirical mean $$ \hat{\mu} = \frac{1}{n} \sum_{i = 1}^n x_i = \mathrm{argmin}_\mu \ell(\mu). $$
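To see this, set the derivative of $\ell$ to zero: $$ \ell'(\mu) = \frac{2}{n} \sum_{i = 1}^n (\mu - x_i) = 0 \iff \mu = \frac{1}{n} \sum_{i = 1}^n x_i = \hat{\mu}, $$ and since $\ell$ is a convex parabola in $\mu$, this stationary point is the global minimum.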

So, if we run gradient descent against this loss, we will find $\hat{\mu}$, and hopefully that is a good estimate of $\mu^*$. Remember: $\mu^*$, the thing we don't know, is our actual goal; $\hat{\mu}$ is but a good guess of what $\mu^*$ is.

We plot the "loss landscape" below. Observe that the minimum (which we shall find using gradient descent later) might not line up with the true mean.

Question: What if you increase your data set size to n > 100?

def loss(mu, x):
    # implemented as (1/2) * sum of squared errors rather than the mean in the
    # formula above; the rescaling does not change the minimiser, only the scale
    # of the loss and of its gradient
    return 1/2 * np.sum((x - mu)**2)

mu_range = np.linspace(-5, 5, num=100)
plt.plot(mu_range, [loss(mu, x) for mu in mu_range])

Here we perform a few steps of gradient descent:

  • The blue curve is the loss function given the data points.
  • The red line and stars represent the gradient descent trajectory. The endpoint of this trajectory is our guess $\tilde{\mu}$.
  • The blue star is the empirical mean, $\hat{\mu}$.
  • The green star is the true mean (the $\mu^*$ that generated the data in the first place).
  • Notice that (particularly if you pick a low n),
    • $\tilde{\mu}$ might not equal $\hat{\mu}$, because the optimisation might not actually reach the global minimum (only a few gradient steps are taken).
    • $\hat{\mu}$ might not equal $\mu^*$, because the empirical mean is subject to sampling error.
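Concretely, each step moves the current guess in the direction of steepest descent, $$ \mu \leftarrow \mu - \eta \, \ell'(\mu), $$ with learning rate $\eta$ (learning_rate in the code below). Note that grad below returns $\sum_i (\mu - x_i)$, the gradient of the $\frac{1}{2}\sum_i (\mu - x_i)^2$ loss implemented above; it differs from $\ell'(\mu)$ only by the positive factor $n/2$, so the descent direction is the same.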
def grad(mu, x):
    # gradient of the (1/2) * sum-of-squares loss defined above
    return np.sum(mu - x)

def step(mu, x, rate=0.001):
    # one gradient descent update: mu <- mu - rate * grad(mu, x)
    return mu - rate * grad(mu, x)

num_epoch = 5
learning_rate = 0.05
mu = -0.5  # initial guess (or try np.random.rand())
mus = [mu]
for e in range(num_epoch):
    mu = step(mu, x, rate=learning_rate)
    mus.append(mu)


fig, ax = plt.subplots(1, 1, figsize=(10, 10))

# blue curve: the loss landscape; red stars: the gradient descent trajectory
mu_range = np.linspace(np.min(mus), 3, num=100)
ax.plot(mu_range, [loss(mu, x) for mu in mu_range])
ax.plot(mus, [loss(mu, x) for mu in mus], 'r*', linestyle="dotted", markersize=15)

ax.plot([true_mean], loss(true_mean, x), 'g*', label="true mean", markersize=15)
ax.plot([np.mean(x)], loss(np.mean(x), x), 'b*', label="empirical mean", markersize=15)
ax.legend()
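To compare the three estimates numerically rather than just visually, you can print them out; with a small n and only a few epochs they will typically all differ slightly.

print("gradient descent estimate:", mus[-1])
print("empirical mean           :", np.mean(x))
print("true mean                :", true_mean)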