Deep learning: A non-mathematical intuition of how a neural network learns
First, we put the data into a form computers can work with — that is, we convert it into vectors and matrices. The data also needs to be organized as samples and features. Then we split it into training, validation, and test sets. For each sample in the training set, we have a target, or the value to predict, called y.
y — the actual value of the target in the training set.
y_hat — the value that will be predicted through forward propagation.
Now, for each sample, the inputs are sent through the neural network, and an output is computed for each neuron in each layer using the perceptron formula. These outputs are passed as inputs to the next layer, and at the end, y_hat is obtained in the output layer.
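As a rough sketch of that forward pass, here is a tiny two-layer network in NumPy. The weights, biases, and the choice of ReLU hidden activation with a linear output are all illustrative assumptions, not values from the text:

```python
import numpy as np

def forward(x, layers):
    """Send one sample through each layer: weighted sum, bias, activation."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b  # perceptron-style weighted sum plus bias
        # ReLU on hidden layers, linear output layer (an illustrative choice)
        a = np.maximum(z, 0.0) if i < len(layers) - 1 else z
    return a  # y_hat from the output layer

# A tiny network with hypothetical, fixed weights and biases
layers = [
    (np.array([[0.5, -0.2], [0.1, 0.4]]), np.array([0.0, 0.1])),
    (np.array([[1.0, -1.0]]),             np.array([0.2])),
]
y_hat = forward(np.array([1.0, 2.0]), layers)  # one sample with two features
```

Each layer's output becomes the next layer's input, exactly as described above; the final `y_hat` is the network's prediction for this sample.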
As samples are sent through the neural network, a y_hat value is collected for each one. This is repeated for every sample in the training dataset. Then y and y_hat are compared, and the error is computed with a cost/loss function.
There is a subtle difference between a cost function and a loss function. A loss function measures the prediction error for a single sample; a cost function measures the error across a set of samples.
A variety of loss functions are used in Machine Learning based on the problem at hand. A loss function helps determine error (aka “the loss”) between the output of an algorithm and the expected value. In layman’s terms, it represents the gap between the predicted values and the actual values of the target variables.
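To make the loss-versus-cost distinction concrete, here is a sketch using mean squared error, one common choice (the function names and data values are illustrative):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Loss: squared error for a single sample."""
    return (y - y_hat) ** 2

def mse_cost(y, y_hat):
    """Cost: mean of the per-sample losses over a set of samples."""
    return np.mean((y - y_hat) ** 2)

y     = np.array([3.0, -0.5, 2.0])   # actual target values
y_hat = np.array([2.5,  0.0, 2.0])   # predicted values
cost  = mse_cost(y, y_hat)           # mean of [0.25, 0.25, 0.0] = 0.1666...
```

The cost is just the per-sample losses averaged, which is why the two terms are so often used interchangeably.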
The prediction error is estimated with this cost function, and using this error, the weights and biases in the network are updated/adjusted with backpropagation.
Each node in the neural network contributes to the prediction error. This contribution is encoded in the values of its weights and biases. Different nodes contribute differently, based on how well their weights and biases model the relationship between the feature and target variables.
Training means adjusting those weights and biases for each node in order to lower its contribution to the error.
We use a loss function to calculate this error at the output layer. The choice of loss function depends heavily on the task: it is a quantitative measure of the distance between two tensors — the ground truth and the network's output. Those tensors can represent image labels, bounding boxes in an image, text translated into another language, and so on.
Backpropagation is about gradually minimizing this loss by updating the parameters of the network. In simple terms, it is about understanding how changing the weights (parameters) in a network changes the loss function by computing the partial derivatives.
When we have the error, we compute its gradient — the partial derivatives of the loss function with respect to the weights — and subtract it (usually scaled by a learning rate) from the current values. This adjustment is called a delta, and it is applied to all the weights and biases in the layer, yielding new values. We then derive a new delta for the previous layer based on the updated values in the current layer, and apply it to the previous layer's weights and biases.
The process of computing deltas, applying them to the weights and biases, and backpropagating continues until we reach the input layer. After multiple iterations of backpropagation, the loss is reduced.
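A single weight update from this process can be sketched in a few lines. The gradient values here are made up for illustration; in practice they come out of the backpropagation pass, and the learning rate is a tunable constant:

```python
import numpy as np

learning_rate = 0.1                 # scales the size of each adjustment
W      = np.array([[0.5, -0.2]])    # current weights of one layer
grad_W = np.array([[0.3,  0.1]])    # hypothetical gradient from backpropagation

# The delta is the scaled gradient; subtracting it nudges the weights
# in the direction that reduces the loss.
W_new = W - learning_rate * grad_W
```

The same update rule applies to the biases, each with its own gradient.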
In the image above, the deltas D1 to D4 are computed at each layer and applied to their weights and biases. They also propagate to the previous layer and influence their deltas. At the end of the backpropagation process, we have an updated set of weights and biases that should reduce the overall prediction error.
The purpose of backpropagation is to adjust the weights and biases based on the estimated error.
It is worth noticing that these values often have an absolute value less than 1, regardless of sign. To propagate the gradient to earlier layers, backpropagation multiplies partial derivatives together, as in the chain rule. Multiplying values with absolute value less than 1 lends some sense of training stability, although there is no strict mathematical theorem about that. However, one can observe that the gradient gets smaller and smaller with every layer we move backwards through the network.
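This shrinking effect is easy to see numerically. Below, a list of hypothetical per-layer partial derivatives (each with absolute value below 1) is multiplied together, mimicking the chain rule as we move backwards through the layers:

```python
# Hypothetical per-layer partial derivatives, each with |value| < 1
local_grads = [0.5, -0.8, 0.6, 0.4, -0.7]

g = 1.0
per_layer = []
for d in local_grads:
    g *= d               # chain rule: multiply one more local derivative
    per_layer.append(g)  # gradient reaching each earlier layer
# |g| shrinks at every step: 0.5, 0.4, 0.24, 0.096, 0.0672
```

After only five layers the gradient magnitude has dropped from 0.5 to under 0.07 — the intuition behind gradients vanishing in deep networks.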
Gradient descent is a method of finding the optimal weights for a model. We use the gradient descent algorithm to find the best machine learning model — the one with the lowest error and highest accuracy. A common explanation of gradient descent is the idea of standing blindfolded on an uneven baseball field, trying to find its lowest point. Naturally, you would use your feet to inch your way downhill, feeling for any downward slope. Conceptually, this is what we do to minimize our error and find our best-performing machine learning model.
Gradient descent is the process of repeating forward and backward propagations in order to reduce error and move closer to the desired model. In gradient descent, we repeat the learning process of forward propagation, estimating error, backward propagation, and adjusting weights and biases. As we do this, the overall error estimated by the cost function first oscillates around and then starts moving closer to zero.
That is our aim. We want to minimize this error — the distance between the predicted and true values. We keep measuring the error and computing deltas that would minimize the error contribution of individual nodes.
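The whole loop — forward pass, error, gradient, update — can be sketched on a deliberately tiny model with a single weight. The data and learning rate are illustrative; the model is y_hat = w * x fit against targets generated by y = 2 * x:

```python
# Illustrative data: targets follow y = 2 * x, so the ideal weight is 2.0
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w, lr = 0.0, 0.05  # initial weight and learning rate (assumed values)
for epoch in range(100):
    # Gradient of the MSE cost with respect to w: mean of 2*x*(w*x - y)
    grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # the update step: subtract the scaled gradient
# w converges toward 2.0 as the error approaches zero
```

Each iteration repeats the forward pass and the weight adjustment, and the error steadily shrinks — the same dynamic described above, just in one dimension.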
Batches and Epochs
Batches and epochs help control the number of passes through the neural network.
A batch is a set of training samples sent through the neural network in one pass. The training dataset is divided into one or more batches. The neural network receives one batch at a time and performs forward propagation. The cost function is then evaluated, and the weights and biases are updated after each batch. So when the next batch comes in, it works with a new set of weights and biases.
A training dataset is passed through the neural network multiple times during the learning process. Each complete pass of the entire training dataset through the neural network is called an epoch. An epoch consists of one or more batches.
As more epochs run, the same batch is sent through the neural network repeatedly, but each time it works with a different set of weights and biases. When all epochs are completed, the training process is complete.
Batch size and the number of epochs are hyperparameters. They are tuned during the model learning process to improve model accuracy.