Week 4 notebook
This notebook provides some interactive visualisations of the components of artificial neural networks for you to play with.
- Recommended Educational Videos related to this topic
- Some visualisations
- Finding the mean using gradient descent
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from ipywidgets import interact
import torch
Recommended Educational Videos related to this topic
These are a series of videos by 3blue1brown on YouTube that I HIGHLY RECOMMEND for understanding neural networks, with pretty animations:
Introducing neural networks:
Gradient descent:
Backpropagation to effectively calculate gradient:
Details about backpropagation:
Some visualisations
Activation functions
- ReLU (rectified linear unit). Just a fancy name for $\mathrm{ReLU}(x) = \max(0, x)$
- Sigmoid: $\sigma(x) = 1 / (1 + e^{-x})$
- tanh: hyperbolic tangent, $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ (a rescaled sigmoid that ranges from $-1$ to $1$).
@interact(w=(-3, 3), b=(-3, 3))
def visualise(w=1, b=0, activation=["relu", "tanh", "sigmoid"]):
    x = np.linspace(-5, 5, num=100)
    s = w * x + b  # affine transformation of the input
    if activation == "relu":
        y = np.maximum(s, 0)  # ReLU(s) = max(0, s)
    elif activation == "tanh":
        y = np.tanh(s)
    elif activation == "sigmoid":
        y = 1 / (1 + np.exp(-s))  # sigma(s) = 1 / (1 + e^(-s))
    plt.plot(x, y)
Composing activation functions with different parameters
This is an interactive visualisation of a neural network implementing the following function: $$ f_w(x) = \sigma(w_1 \sigma(v_1 x + a_1) + w_2 \sigma(v_2 x + a_2) + b) $$ which is a fully connected (feedforward) network with
- 1 input node.
- 2 nodes in the hidden layer.
- 1 output node.
You can count that:
- from the input to the hidden layer there are $1 \times 2 + 2 = 4$ parameters: $(v_1, v_2, a_1, a_2)$;
- from the hidden layer to the output layer there are $2 \times 1 + 1 = 3$ parameters: $(w_1, w_2, b)$;
- so the total number of parameters is $4 + 3 = 7$, namely $w = (v_1, v_2, a_1, a_2, w_1, w_2, b)$ (see the quick check below).
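As a quick sanity check on this count (this snippet is an extra, not part of the original exercise), we can build the same 1-2-1 network in PyTorch, which is already imported above, and ask it how many parameters it has:
import torch  # already imported at the top of the notebook

# the same 1 -> 2 -> 1 fully connected network as f_w above
net = torch.nn.Sequential(
    torch.nn.Linear(1, 2),  # weights (v1, v2) and biases (a1, a2): 4 parameters
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 1),  # weights (w1, w2) and bias b: 3 parameters
    torch.nn.Sigmoid(),
)
print(sum(p.numel() for p in net.parameters()))  # prints 7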
Play with the sliders to change the parameters and see how $f_w$ changes. How closely can you get $f_w$ (blue curve) to match the given orange curve?
@interact(v1=(-3., 3.), v2=(-3., 3.), a1=(-3., 3.), a2=(-3., 3.), w1=(-3., 3.), w2=(-3., 3.), b=(-3., 3.))
def visualise_1_hidden_layer(v1=1, v2=1, a1=1, a2=0, w1=0, w2=0, b=0, activation=["relu", "tanh", "sigmoid"]):
    x = np.linspace(-5, 5, num=100)
    # pre-activations of the two hidden nodes
    s1 = v1 * x + a1
    s2 = v2 * x + a2
    if activation == "relu":
        f = lambda s: np.maximum(s, 0)
    elif activation == "tanh":
        f = np.tanh
    elif activation == "sigmoid":
        f = lambda s: 1 / (1 + np.exp(-s))
    # hidden activations
    h1 = f(s1)
    h2 = f(s2)
    # output node
    y = f(w1 * h1 + w2 * h2 + b)
    plt.plot(x, y)       # blue curve: f_w(x)
    plt.plot(x, -x**2)   # orange curve: the target to match
Visualise gradient descent paths
This may be a better picture of what SGD does in a more complicated (and therefore more realistic) scenario than the 1D-parameter case with a unique minimum that is usually shown.
- The contours in the picture show the "loss landscape", meaning it is the contour plot of some loss function $L(w_1, w_2)$ with 2 parameters -- $(w_1, w_2)$ are the axes.
- The lighter the colour of the contour, the higher the loss (i.e. the worse the model is at that particular $(w_1, w_2)$). The goal of any optimisation algorithm -- like SGD -- is to "slide" down to a low-loss region (darker red region).
- The red dots are various random initialisations.
- The coloured trajectories starting from the red dots are SGD trajectories. Their end points will be the output of the optimisation algorithm (i.e. the training output). (See the sketch below.)
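The figure itself is not part of this notebook, and the snippet below is not the original code: it is only a rough sketch of how such a contour plot with descent trajectories could be produced, using a made-up two-parameter loss and plain gradient descent in place of SGD.
# A made-up loss with two local minima, purely for illustration.
def toy_loss(w1, w2):
    return (w1**2 - 1)**2 + w2**2 + 0.3 * w1

def toy_grad(w1, w2):
    return np.array([4 * w1 * (w1**2 - 1) + 0.3, 2 * w2])

# contour plot of the loss landscape: lighter colour = higher loss, darker red = lower loss
w1s, w2s = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
plt.contourf(w1s, w2s, toy_loss(w1s, w2s), levels=30, cmap="Reds_r")

eta = 0.05  # learning rate
for _ in range(5):  # a few random initialisations
    w = np.random.uniform(-1.8, 1.8, size=2)
    plt.plot(w[0], w[1], "r.", markersize=12)  # red dot: initialisation
    path = [w.copy()]
    for _ in range(50):  # gradient descent steps
        w = w - eta * toy_grad(w[0], w[1])
        path.append(w.copy())
    path = np.array(path)
    plt.plot(path[:, 0], path[:, 1])  # trajectory ending at the trained parameters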
Questions:
- Could there be multiple optima?
- Could there be multiple global optima?
- Do the trajectories always find an optimum?
- Do the trajectories always find a global optimum?
- What happens if we increase the learning rate $\eta$?
- What happens if we stop the trajectories earlier?
Finding the mean using gradient descent
Let's generate the data first. (Change `n` as you like; try a low `n` ~ 5 first and then a high `n` ~ 100 later.) We shall plot the randomly generated data as a histogram. Does the data look Gaussian for your choice of `n`?
true_mean = 2  # in practice we would not know this -- estimating it is the goal.
sigma = 3
n = 10
x = sigma * np.random.randn(n) + true_mean  # n samples from a Gaussian with mean true_mean and std sigma
plt.hist(x)
Gradient descent algorithm for finding the empirical mean
We will now develop a gradient descent algorithm to estimate the mean. We first define a loss function $$ \ell(\mu) = \frac{1}{2} \sum_{i = 1}^n (\mu - x_i)^2. $$
Why do we choose this "sum-of-squared-error" loss function? Well, because it has a (global) minimum that coincides with the empirical mean $$ \hat{\mu} = \frac{1}{n} \sum_{i = 1}^n x_i = \mathrm{argmin}_\mu \ell(\mu). $$
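To see this, differentiate $\ell$ and set the derivative to zero: $$ \frac{d\ell}{d\mu} = \sum_{i = 1}^n (\mu - x_i) = n\mu - \sum_{i = 1}^n x_i = 0 \quad \Longleftrightarrow \quad \mu = \frac{1}{n} \sum_{i = 1}^n x_i = \hat{\mu}. $$ This derivative, $\sum_{i = 1}^n (\mu - x_i)$, is exactly what the `grad` function defined further below computes.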
So, if we run gradient descent on this loss, we will find $\hat{\mu}$, and hopefully that is a good estimate of $\mu^*$, the true mean that generated the data. Remember: $\mu^*$, the thing we don't know, is our actual goal; $\hat{\mu}$ is but a good guess of what $\mu^*$ is.
We plot the "loss landscape" below. Observe that the minimum (which we shall find using gradient descent later) might not line up with the true mean.
Question: What if you increase your dataset size to `n > 100`?
def loss(mu, x):
    # sum-of-squared-error loss: (1/2) * sum_i (mu - x_i)^2
    return 1/2 * np.sum((x - mu)**2)

mu_range = np.linspace(-5, 5, num=100)
plt.plot(mu_range, [loss(mu, x) for mu in mu_range])
Here we perform a few steps of gradient descent:
- The blue curve is the loss function given the data points.
- The red line and stars represent the gradient descent trajectory. The endpoint of this trajectory is our guess $\tilde{\mu}$.
- The blue star is the empirical mean, $\hat{\mu}$.
- The green star is the true mean (the $\mu^*$ that generated the data in the first place).
- Notice that (particularly if you pick a low `n`):
  - $\tilde{\mu}$ might not equal $\hat{\mu}$, because the optimisation might not actually find the global minimum.
  - $\hat{\mu}$ might not equal $\mu^*$, because the empirical mean is subject to sampling error.
def grad(mu, x):
    # derivative of the loss w.r.t. mu: sum_i (mu - x_i)
    return np.sum(mu - x)

def step(mu, x, rate=0.001):
    # one gradient descent update: mu <- mu - rate * dloss/dmu
    return mu - rate * grad(mu, x)
num_epoch = 5
learning_rate = 0.05
mu = -0.5  # initial guess; could also use np.random.rand()
mus = [mu]  # keep the whole trajectory so we can plot it
for e in range(num_epoch):
    mu = step(mu, x, rate=learning_rate)
    mus.append(mu)
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
mu_range = np.linspace(np.min(mus), 3, num=100)
ax.plot(mu_range, [loss(mu, x) for mu in mu_range])  # loss landscape
ax.plot(mus, [loss(mu, x) for mu in mus], 'r*', linestyle="dotted", markersize=15, label="gradient descent trajectory")
ax.plot([true_mean], [loss(true_mean, x)], 'g*', label="true mean", markersize=15)
ax.plot([np.mean(x)], [loss(np.mean(x), x)], 'b*', label="empirical mean", markersize=15)
ax.legend()
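As a small extra check (not part of the original plot), you can also print the three quantities next to each other:
print("gradient descent estimate:", mus[-1])
print("empirical mean           :", np.mean(x))
print("true mean                :", true_mean)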