I’m going to use the notation $w_{jk}^{(l)}$ to indicate the weight of the connection from neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$.
The activation $a_j^{(l)}$ for the $j$-th neuron in layer $l$ is going to be:

$$a_j^{(l)} = \sigma\left(\sum_k w_{jk}^{(l)}\, a_k^{(l-1)}\right)$$

Where $a_0^{(l)} = 1$ for every layer, to account for the bias term.
For example, for the first neuron of layer 2 you can have:

$$a_1^{(2)} = \sigma\left(w_{10}^{(2)} a_0^{(1)} + w_{11}^{(2)} a_1^{(1)} + w_{12}^{(2)} a_2^{(1)}\right)$$
So you can write the activations in vectorized form as:

$$a^{(l)} = \sigma\left(W^{(l)} a^{(l-1)}\right)$$
When calculating the activations for one layer, you first need to calculate the quantity:

$$z_j^{(l)} = \sum_k w_{jk}^{(l)}\, a_k^{(l-1)}$$

This is called the weighted input to the $j$-th neuron in layer $l$, so that $a_j^{(l)} = \sigma(z_j^{(l)})$. So much for the feedforward phase.
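As a rough sketch of the feedforward phase (the function names and the NumPy representation are my own choices, not from the text; the bias is absorbed by prepending a constant activation of 1, matching the $a_0 = 1$ convention above):

```python
import numpy as np

def sigmoid(z):
    # Logistic activation, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(weights, x):
    """Compute the network output for input vector x.

    `weights` is a list of matrices, one per layer; each matrix has an
    extra first column holding the bias weights w_{j0}, which multiply
    the constant activation a_0 = 1 prepended at every layer.
    """
    a = x
    for W in weights:
        a_with_bias = np.concatenate(([1.0], a))  # a_0 = 1
        z = W @ a_with_bias                       # weighted input
        a = sigmoid(z)                            # activation
    return a
```

With all-zero weights every weighted input is 0, so every activation comes out as $\sigma(0) = 0.5$, which is a quick sanity check for the plumbing.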
We can think of neural networks as a class of parametric non-linear functions from a vector of input variables to a vector of output variables. So we can find the weights (parameters), as in polynomial curve fitting, by minimizing a sum-of-squares error function. So we define a cost function J as:

$$J = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2} \left\| y_n - a_n \right\|^2$$
Where N is the number of training examples, $y_n$ is the desired output for training sample $n$, and $a_n$ is the calculated output for the corresponding sample. Notice that the cost function is considered dependent only on the weights, not on the inputs and/or the ground truth, as those are given and cannot be changed.
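A minimal sketch of this cost, assuming it is the average over the N samples of the half squared error per sample (the function name is mine):

```python
import numpy as np

def cost(outputs, targets):
    """J = (1/N) * sum_n 0.5 * ||y_n - a_n||^2.

    `outputs` and `targets` are arrays of shape (N, output_dim):
    one row per training sample.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    per_sample = 0.5 * np.sum((targets - outputs) ** 2, axis=1)
    return per_sample.mean()
```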
To use gradient descent, we need to calculate the partial derivatives $\partial J_n / \partial w_{jk}^{(l)}$ for a single training example $n$ (with $J_n = \frac{1}{2}\|y_n - a_n\|^2$) and then recover $\partial J / \partial w_{jk}^{(l)}$ by averaging over the entire training set. The backpropagation algorithm will be used to calculate these derivatives.
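The averaging step can be sketched as follows (a hypothetical helper; `per_example_grads` is assumed to hold the per-sample gradients that backpropagation would produce):

```python
import numpy as np

def gradient_descent_step(weights, per_example_grads, eta):
    """One gradient-descent update on a list of weight matrices.

    `per_example_grads[n][l]` is assumed to be dJ_n/dW_l for training
    example n. The full gradient dJ/dW_l is the average over the N
    examples, and each matrix moves a step `eta` against it.
    """
    N = len(per_example_grads)
    new_weights = []
    for l in range(len(weights)):
        grad = sum(g[l] for g in per_example_grads) / N  # average over samples
        new_weights.append(weights[l] - eta * grad)
    return new_weights
```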
Consider the idea of changing the weighted input $z_j^{(l)}$ for the $j$-th neuron in layer $l$ by a small quantity $\Delta z_j^{(l)}$. The neuron will output $\sigma(z_j^{(l)} + \Delta z_j^{(l)})$ instead of $\sigma(z_j^{(l)})$, causing an overall change to the cost J of the amount:

$$\frac{\partial J}{\partial z_j^{(l)}} \Delta z_j^{(l)}$$
If the rate of change of the cost w.r.t. $z_j^{(l)}$ is small, then for a small $\Delta z_j^{(l)}$ the cost won’t change too much. In this case we say the neuron is nearly optimal. So the quantity $\frac{\partial J}{\partial z_j^{(l)}}$ measures, somehow, how much the neuron is not optimized, and we call it the error of the neuron. So by definition we have:

$$\delta_j^{(l)} \equiv \frac{\partial J}{\partial z_j^{(l)}}$$
Varying $z_j^{(l)}$ while keeping all other things fixed has some repercussions on the next layer. For a neuron $k$ in layer $l+1$ one can write, using the chain rule:

$$\frac{\partial J}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}$$

Which is to say: the contribution of neuron $k$ to the rate of change of the cost J caused by neuron $j$ (in the previous layer) is how the cost changes with respect to the weighted input of neuron $k$, multiplied by how that weighted input changes with respect to the weighted input of neuron $j$ (chain-rule differentiation of a function composition).
We can then sum up all the contributions at level $l+1$ and state the following:

$$\delta_j^{(l)} = \frac{\partial J}{\partial z_j^{(l)}} = \sum_k \frac{\partial J}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}$$

This is the most important equation of backpropagation, as it relates the error $\delta_j^{(l)}$ with the errors in the next layer. We can rewrite it as:

$$\delta_j^{(l)} = \sum_k \delta_k^{(l+1)} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}$$
Regarding the quantity $\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}$, we can calculate it starting from the following equation:

$$z_k^{(l+1)} = \sum_j w_{kj}^{(l+1)} a_j^{(l)} = \sum_j w_{kj}^{(l+1)} \sigma(z_j^{(l)})$$
If we differentiate w.r.t. $z_j^{(l)}$ we obtain:

$$\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}} = w_{kj}^{(l+1)} \sigma'(z_j^{(l)})$$
So we can finally write:

$$\delta_j^{(l)} = \sigma'(z_j^{(l)}) \sum_k w_{kj}^{(l+1)} \delta_k^{(l+1)}$$
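The backward recursion can be sketched like this (the function name and data layout are my own assumptions: each weight matrix carries its bias weights in column 0, which do not enter the sum over $k$, and the sigmoid is taken as the activation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the logistic function: sigma'(z) = sigma(z)(1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def backpropagate_deltas(weights, zs, delta_last):
    """Propagate the output-layer error back through the network.

    `zs[l]` is the weighted-input vector of layer l and `delta_last`
    the error of the last layer. Each step applies the recursion
    delta_l = sigma'(z_l) * (W_{l+1}^T delta_{l+1}),
    excluding the bias column w_{k0} from the weight matrix.
    """
    deltas = [delta_last]
    for l in range(len(zs) - 2, -1, -1):   # walk backwards over hidden layers
        W_next = weights[l + 1][:, 1:]     # drop the bias column
        delta = sigmoid_prime(zs[l]) * (W_next.T @ deltas[0])
        deltas.insert(0, delta)
    return deltas
```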