0004_fastai_how_neuralnet_work

Official course site for lesson 3

Official notebooks repo, on nbviewer

Official how neuralnet work notebook on kaggle

Important: The interactive features of this notebook don’t work in Kaggle’s Reader mode. They only work in Edit mode. Therefore, before you start reading, please click “Copy & Edit” in the top right of this window, then in the menu click Run and then Run all. Then you’ll be able to use all the interactive sliders in this notebook.

Fitting a function with gradient descent

Is a neuralnet just a math function? what does the function look like?

A neural network is just a mathematical function. In the most standard kind of neural network, the function:

  1. Multiplies each input by a number of values. These values are known as parameters
  2. Adds them up for each group of values
  3. Replaces the negative numbers with zeros

why a neuralnet is random at first and how to make it useful

This represents one “layer”. Then these three steps are repeated, using the outputs of the previous layer as the inputs to the next layer. Initially, the parameters in this function are selected randomly. Therefore a newly created neural network doesn’t do anything useful at all – it’s just random!
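As a rough sketch (illustrative PyTorch, not part of the notebook’s code), the three steps of one such layer could look like this:

```python
import torch

torch.manual_seed(0)
x = torch.tensor([1.0, 2.0, 3.0])   # three inputs
w = torch.randn(4, 3)               # randomly initialised parameters (4 groups of values)
out = (w * x).sum(dim=1)            # steps 1 & 2: multiply each input, add up each group
out = torch.clamp(out, min=0.)      # step 3: replace the negative numbers with zeros
```

Since `w` is random, `out` is random too, which is exactly why a freshly created network isn’t useful yet.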

To get the function to “learn” to do something useful, we have to change the parameters to make them “better” in some way. We do this using gradient descent. Let’s see how this works…

plot_function: how to plot a function with plt; how to create x input with torch.linspace; how to plot x, y, color and title with plt;

from ipywidgets import interact
from fastai.basics import *

plt.rc('figure', dpi=90)

def plot_function(f, title=None, min=-2.1, max=2.1, color='r', ylim=None):
    x = torch.linspace(min,max, 100)[:,None]
    if ylim: plt.ylim(ylim)
    plt.plot(x, f(x), color)
    if title is not None: plt.title(title)

how to create a particular quadratic function

To learn how gradient descent works, we’re going to start by fitting a quadratic, since that’s a function most of us are probably more familiar with than a neural network. Here’s the quadratic we’re going to try to fit:

def f(x): return 3*x**2 + 2*x + 1

plot_function(f, "$3x^2 + 2x + 1$")

how to write a function quad to create any quadratic function

This quadratic is of the form \(ax^2+bx+c\), with parameters \(a=3\), \(b=2\), \(c=1\). To make it easier to try out different quadratics for fitting a model to the data we’ll create, let’s create a function that calculates the value of a point on any quadratic:

def quad(a, b, c, x): return a*x**2 + b*x + c

how do partial and quad work together to modify quad into a slightly different func?

If we fix some particular values of a, b, and c, then we’ll have made a quadratic. To fix values passed to a function in Python, we use the partial function, like so:

def mk_quad(a,b,c): return partial(quad, a,b,c)

So for instance, we can recreate our previous quadratic:

f2 = mk_quad(3,2,1)
plot_function(f2)

how to add noise to both mult and add of the neuralnet/function; how to create noise using np.random.normal

Now let’s simulate making some noisy measurements of our quadratic f. We’ll then use gradient descent to see if we can recreate the original function from the data.

Here’s a couple of functions to add some random noise to data:

def noise(x, scale): return np.random.normal(scale=scale, size=x.shape)
def add_noise(x, mult, add): return x * (1+noise(x,mult)) + noise(x,add)

how to create a random seed to ensure x and y are the same each run

Let’s use them now to create our noisy measurements based on the quadratic above:

np.random.seed(42)

x = torch.linspace(-2, 2, steps=20)[:,None]
y = add_noise(f(x), 0.15, 1.5)

Here’s the first few values of each of x and y:

x[:5],y[:5]
(tensor([[-2.0000],
         [-1.7895],
         [-1.5789],
         [-1.3684],
         [-1.1579]]),
 tensor([[11.8690],
         [ 6.5433],
         [ 5.9396],
         [ 2.6304],
         [ 1.7947]], dtype=torch.float64))

Automating gradient descent

how derivatives automatically guide params to change for a lower loss

The basic idea is this: if we know the gradient of our mae() function with respect to our parameters, a, b, and c, then that means we know how adjusting (for instance) a will change the value of mae(). If, say, a has a negative gradient, then we know that increasing a will decrease mae(). Then we know that’s what we need to do, since we’re trying to make mae() as low as possible.

So, we find the gradient of mae() for each of our parameters, and then adjust our parameters a bit in the opposite direction to the sign of the gradient.
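The mae() used below is fastai’s mean absolute error; as a sketch, a minimal equivalent looks like this:

```python
import torch

def mae(preds, acts):
    # mean absolute error: the average distance between predictions and actuals
    return (preds - acts).abs().mean()
```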

To do this, first we need a function that takes all the parameters a, b, and c as a single vector input, and returns the value mae() based on those parameters:

how to create a mean absolute error function on any quadratic model

def quad_mae(params):
    f = mk_quad(*params)
    return mae(f(x), y)

Let’s try it:

quad_mae([1.1, 1.1, 1.1])
tensor(2.4219, dtype=torch.float64)

how to create a random tensor with 3 values as initialized params

Yup, that’s the same as the starting mae() we had in our plot before.

We’re first going to do exactly the same thing as we did manually – pick some arbitrary starting point for our parameters. We’ll put them all into a single tensor:

abc = torch.tensor([1.1,1.1,1.1])

how to calc gradients of params? 1. tell PyTorch to get ready for calculating gradients for these params; 2. calc loss; 3. calc the gradients with loss.backward(); 4. how to access params’ gradients;

To tell PyTorch that we want it to calculate gradients for these parameters, we need to call requires_grad_():

abc.requires_grad_()
tensor([1.1000, 1.1000, 1.1000], requires_grad=True)

We can now calculate mae(). Generally, when doing gradient descent, the thing we’re trying to minimise is called the loss:

loss = quad_mae(abc)
loss
tensor(2.4219, dtype=torch.float64, grad_fn=<MeanBackward0>)

To get PyTorch to now calculate the gradients, we need to call backward()

loss.backward()

The gradients will be stored for us in an attribute called grad:

abc.grad
tensor([-1.3529, -0.0316, -0.5000])

how to change params with gradients properly to lower loss¶

According to these gradients, all our parameters are a little low. So let’s increase them a bit. If we subtract the gradient, multiplied by a small number, that should improve them a bit:

with torch.no_grad():
    abc -= abc.grad*0.01
    loss = quad_mae(abc)
    
print(f'loss={loss:.2f}')
loss=2.40

why with torch.no_grad(): when updating params with gradients

Yes, our loss has gone down!

The “small number” we multiply is called the learning rate, and is the most important hyper-parameter to set when training a neural network.

BTW, you’ll see we had to wrap our calculation of the new parameters in with torch.no_grad(). That disables the calculation of gradients for any operations inside that context manager. We have to do that, because abc -= abc.grad*0.01 isn’t actually part of our quadratic model, so we don’t want derivatives to include that calculation.

We can use a loop to do a few more iterations of this:

how to do 10 iterations of updating params with gradients

for i in range(10):
    loss = quad_mae(abc)
    loss.backward()
    with torch.no_grad(): abc -= abc.grad*0.01
    print(f'step={i}; loss={loss:.2f}')
step=0; loss=2.40
step=1; loss=2.36
step=2; loss=2.30
step=3; loss=2.21
step=4; loss=2.11
step=5; loss=1.98
step=6; loss=1.85
step=7; loss=1.72
step=8; loss=1.58
step=9; loss=1.46

As you can see, our loss keeps going down!

If you keep running this loop for long enough however, you’ll see that the loss eventually starts increasing for a while. That’s because once the parameters get close to the correct answer, our parameter updates will jump right over the correct answer! To avoid this, we need to decrease our learning rate as we train. This is done using a learning rate schedule, and can be automated in most deep learning frameworks, such as fastai and PyTorch.
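As a sketch of the idea (using a toy loss and a simple multiplicative decay, not fastai’s actual schedule – PyTorch provides real ones in torch.optim.lr_scheduler):

```python
import torch

def loss_fn(p): return (p - 3)**2        # toy loss, minimised at p=3 (illustrative)

p = torch.tensor(0., requires_grad=True)
lr = 0.3
for i in range(20):
    loss = loss_fn(p)
    loss.backward()
    with torch.no_grad():
        p -= p.grad * lr                 # update with the *current* learning rate
        p.grad.zero_()                   # reset gradients between steps
    lr *= 0.9                            # decay: later steps take smaller jumps
```

Because the learning rate shrinks each step, the updates can’t keep jumping over the answer once the parameter gets close to it.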

How a neural network approximates any given function

But neural nets are much more convenient and powerful than this example showed, because we can learn much more than just a quadratic with them. How does that work?

The trick is that a neural network is a very expressive function. In fact – it’s infinitely expressive. A neural network can approximate any computable function, given enough parameters. A “computable function” can cover just about anything you can imagine: understand and translate human speech; paint a picture; diagnose a disease from medical imaging; write an essay; etc…

The way a neural network approximates a function actually turns out to be very simple. The key trick is to combine two extremely basic steps:

  1. Matrix multiplication, which is just multiplying things together and then adding them up
  2. The function \(max(x,0)\), which simply replaces all negative numbers with zero.

how to combine a linear func with max(x, 0) into a rectified linear function; how to use torch.clip(y, 0.) to perform max(y, 0.)

In PyTorch, the function \(max(x,0)\) can be written as torch.clip(x,0). The combination of a linear function and this max() is called a rectified linear function, and it can be implemented like this:

def rectified_linear(m,b,x):
    y = m*x+b
    return torch.clip(y, 0.)

how to use partial to wrap rectified_linear to create a specific rectified_linear func

Here’s what it looks like:

plot_function(partial(rectified_linear, 1,1))

how to use F.relu to replace torch.clip to create a rectified linear func;

BTW, instead of torch.clip(y, 0.), we can use F.relu(y), which does exactly the same thing. In PyTorch, F refers to the torch.nn.functional module.

import torch.nn.functional as F
def rectified_linear2(m,b,x): return F.relu(m*x+b)
plot_function(partial(rectified_linear2, 1,1))

To understand how this function works, try using this interactive version to play around with the parameters m and b:

@interact(m=1.5, b=1.5)
def plot_relu(m, b):
    plot_function(partial(rectified_linear, m,b), ylim=(-1,4))

create double and quadruple relu func/neuralnet

As you see, m changes the slope, and b changes where the “hook” appears. This function doesn’t do much on its own, but look what happens when we add two of them together:

def double_relu(m1,b1,m2,b2,x):
    return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)

@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
    plot_function(partial(double_relu, m1,b1,m2,b2), ylim=(-1,6))

If you play around with that for a while, you’ll notice something quite profound: with enough of these rectified linear functions added together, you could approximate any function with a single input, to whatever accuracy you like! Any time the function doesn’t quite match, you can just add a few more additions to the mix to make it a bit closer. As an experiment, perhaps you’d like to try creating your own plot_triple_relu interactive function, and maybe even include the scatter plot of our data from before, to see how close you can get?

This exact same approach can be expanded to functions of 2, 3, or more parameters.

def quad_relu(m1,b1,m2,b2,m3,b3,m4,b4,x):
    return (rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x) +
            rectified_linear(m3,b3,x) + rectified_linear(m4,b4,x))

@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5, m3=3, b3=3, m4=-3, b4=-3)
def plot_quad_relu(m1, b1, m2, b2, m3, b3, m4, b4):
    plot_function(partial(quad_relu, m1,b1,m2,b2,m3,b3,m4,b4), ylim=(-1,6))

How to recognise an owl

deep learning basically is drawing squiggly lines infinitely given computation and time

OK great, we’ve created a nifty little example showing that we can draw squiggly lines that go through some points. So what?

Well… the truth is that actually drawing squiggly lines (or planes, or high-dimensional hyperplanes…) through some points is literally all that deep learning does! If your data points are, say, the RGB values of pixels in photos of owls, then you can create an owl-recogniser model by following the exact steps above.

This may, at first, sound about as useful as the classic “how to draw an owl” guide:

(image: the classic two-step “how to draw an owl” guide)

Students often ask me at this point, “OK Jeremy, but how do neural nets actually work?” But at a foundational level, there is no “step 2”. We’re done – the above steps will, given enough time and enough data, create (for example) an owl recogniser, if you feed in enough owls (and non-owls).

The devil, I guess, is in the “given enough time and enough data” part of the above sentence. There are a lot of tweaks we can make to reduce both of these things. For instance, instead of running our calculations on a normal CPU, as we’ve done above, we could do thousands of them simultaneously by taking advantage of a GPU. We could greatly reduce the amount of computation and data needed by using a convolution instead of a matrix multiplication, which basically means skipping over a bunch of the multiplications and additions for bits that you’d guess won’t be important. We could make things much faster if, instead of starting with random parameters, we start with the parameters of someone else’s model that does something similar to what we want (this is called transfer learning).

And, of course, there’s lots of helpful software out there to do this stuff for you without too much fuss. Like, say, fastai.

Learning these things is what we teach in our course, which, like everything we make, is totally free. So if you’re interested in learning more, do check it out!

As always, if you enjoyed this notebook, please upvote it to help others find it, and to encourage me to write more. If you upvote it, be careful you don’t accidentally upvote your copy that’s created when you click “Copy & Edit” – you can find my original at this link.