Deep learning and neural networks explained. In this article, we’ll also look at supervised learning and convolutional neural networks.
Last week, we saw that deep learning algorithms always consist of the same bricks. These elements are the input, hidden, and output layers, as well as the neurons, i.e. the computing units. Once assembled, they can answer specific questions. First, however, the algorithm must be trained through either supervised or unsupervised learning. This week we’ll look at supervised learning, which entails providing the algorithm with labeled data to learn from. But how exactly does our algorithm learn from these examples? Let’s go back to the neurons.
How neurons work
Each neuron in an algorithm is unique, but the role of each neuron is essentially the same. It receives, transforms, and passes information from the previous layer of connected neurons to the next. Each successive layer can handle more complex information. In this way, the input features are fed forward from one layer to the next until they reach the output layer. Here, the result, the answer to your question, is transferred to the outside world.
A neuron transmits its signal by combining the information received from the connected neurons in the previous layer, weighting each input as it does so. The weighted inputs are then transformed in a non-linear fashion into an output signal that the next neuron can process. The non-linearity is important: without it, the network could not learn complex features.
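To make this concrete, here is a minimal sketch of a single neuron. The sigmoid activation, the bias term, and the numbers are our own assumptions for illustration; the idea is simply "weight the inputs, sum them, apply a non-linearity".

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0):
    """Weight the incoming signals, sum them, then apply a non-linearity.

    The bias term and the sigmoid are illustrative choices, not the only options.
    """
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# Three signals from the previous layer and their (made-up) weights.
signals = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
output = neuron_output(signals, weights)
print(round(output, 3))
```

If you removed the sigmoid, stacking many such neurons would still only produce a weighted sum of the inputs, which is why the non-linear step matters.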
The gradient descent approach
But how does the neuron know which weight to apply? It learns which features are important by making mistakes. First, the weights are chosen randomly. Then, they are adjusted as the algorithm sees more training examples. This is done in two steps:
- The evaluation of the mismatch between the algorithm’s prediction and the ground truth. In other words, the output of the defined loss function is calculated.
- The error then propagates backward from the output layer to the first hidden layer through the entire network. Subsequently, the corresponding weight correction is calculated for each neuron.
The entire process runs over the whole training dataset for several iterations, called epochs. To find the optimal weights for each neuron, we use the backward propagation of errors together with an optimization method to minimize the loss function. The gradient descent approach is the basis of the most common optimization methods. The gradient, i.e. the slope, of the loss function is estimated with respect to all the current weights in the network. In turn, these gradients are used to update the weights.
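As a rough sketch of this loop, consider the smallest possible "network": one neuron with a single weight and no activation, trained to predict y = 2x. The mean squared error loss, the toy data, the learning rate, and the fixed starting weight (instead of a random one, so the run is reproducible) are all assumptions made for this example.

```python
def mse_loss(w, xs, ys):
    """Mean squared mismatch between predictions w * x and the ground truth y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """Slope of the loss with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, w=0.0, learning_rate=0.05, epochs=50):
    """Run gradient descent over the whole dataset for a number of epochs."""
    history = []
    for _ in range(epochs):
        history.append(mse_loss(w, xs, ys))            # step 1: measure the error
        w -= learning_rate * mse_gradient(w, xs, ys)   # step 2: correct the weight
    return w, history

# Toy labeled data: the "ground truth" is y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
learned_w, history = train(xs, ys)
print(learned_w, history[0], history[-1])
```

In a real network the gradient for each weight comes from propagating the error backward through all the layers, but the update rule at the end is the same: move each weight a small step against its gradient.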
The desolation of a data scientist
We now have everything to build and train a deep learning algorithm. The success of your training depends on how well your algorithm can generalize. Generalization is the ability to provide the correct answer when presented with unseen examples. You can control the algorithm’s behavior by playing around with its ingredients, the so-called hyperparameters. First, you have the number of bricks you can stack: the number of hidden layers and the number of neurons in each hidden layer. You also have to carefully choose the type of activation function for each neuron to learn the complex features. Finally, you can also influence the learning process itself by slowing it down or speeding it up. Be careful! If the algorithm learns too fast, it might miss the minimum of the loss function and will, therefore, be less accurate. Conversely, if it learns too slowly, it might never find the optimal weights.
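The speed of learning is usually controlled by a step size called the learning rate. Here is a small sketch of what "too fast" and "too slow" look like, using the toy loss w² (our assumption, chosen because its minimum is obviously at w = 0) and three hand-picked learning rates.

```python
def descend(learning_rate, start=1.0, steps=20):
    """Minimize the toy loss w**2 (gradient 2 * w) with plain gradient descent."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

well_tuned = descend(0.3)    # ends up very close to the minimum at w = 0
too_fast = descend(1.1)      # overshoots further on every step and diverges
too_slow = descend(0.001)    # has barely moved away from the start after 20 steps
print(well_tuned, too_fast, too_slow)
```

The exact values that count as too fast or too slow depend on the loss function and the data, so in practice the learning rate is one of the hyperparameters you tune.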
Tweaking the deep learning algorithm
Tweaking your deep learning algorithm is an iterative process. You start with some configuration values and you use them to train a model. Depending on the initial output performance of your model, you will change the values to train a new one. This cycle can be quite long. To efficiently find the best values for your algorithm, the best approach is to split your dataset into three independent sets.
- A training dataset for your algorithm.
- A validation or development dataset, which the training algorithm does not observe, to evaluate your trained algorithm.
- A test dataset of unseen examples for the evaluation of the final algorithm.
The proportions of the different splits are entirely up to you and your data. First, you will need sufficient data in your validation and test datasets for statistically representative results. In addition, you will need enough training examples to teach the algorithm your complex problem.
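A three-way split can be sketched in a few lines. The function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are assumptions made for this example, not a recommendation for your data.

```python
import random

def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the examples and cut them into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    training = shuffled[n_val + n_test:]
    return training, validation, test

# 1000 placeholder examples, identified here only by their index.
training, validation, test = split_dataset(range(1000))
print(len(training), len(validation), len(test))
```

Shuffling before cutting matters when the data is stored in some order (for example, all cats first), because otherwise the three sets would not be comparable.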
Once you have trained a model, you can compare the error on your training dataset to pure chance and human performance to understand whether the algorithm learned the task correctly. Let’s assume we trained an algorithm to classify cats and dogs, and that it misclassified 15% of the training dataset. Here the algorithm does not seem to have learned to distinguish cats from dogs correctly. Compared to a human with an error rate of 0%, it under-fitted the data. If this occurs, you can try a bigger network, i.e. add more hidden layers and/or neurons in each layer. Increasing the number of layers, and therefore neurons, allows your algorithm to capture more complexity. Sometimes the problem is simply that you haven’t given the algorithm enough time to correctly learn the features. So, by increasing the number of epochs, you just might get lucky!
If your training error is sufficiently small, you can compare it to the error on the validation set to see if your algorithm is a victim of overfitting, meaning that it learned every quirk and all the noise in the training dataset and became overspecialized. Let’s assume that the algorithm has an error of only 1% on the training dataset but an error of 10% on the validation dataset. This large discrepancy is most likely due to the algorithm’s overspecialization. When the algorithm sees unknown examples from the validation dataset, it becomes confused and misclassifies the images. In this case, you can try to increase, diversify, and clean your training dataset. This allows your algorithm to learn more general features instead of becoming too specialized to outliers. Another option would be to reduce the number of neurons used and/or regularise the loss function.
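The two comparisons above can be written down as a simple rule of thumb. The function `diagnose` and its 5-percentage-point tolerance are hypothetical; what counts as a "large" gap depends on your problem.

```python
def diagnose(train_error, val_error, human_error=0.0, tolerance=0.05):
    """Rough diagnosis from error rates given as fractions (0.15 means 15%).

    The tolerance is an arbitrary threshold picked for this sketch.
    """
    if train_error - human_error > tolerance:
        # Far worse than a human on data it has already seen.
        return "underfitting: try a bigger network or more epochs"
    if val_error - train_error > tolerance:
        # Good on seen data, much worse on unseen data.
        return "overfitting: try more or cleaner data, a smaller network, or regularisation"
    return "looks fine"

cats_and_dogs_underfit = diagnose(train_error=0.15, val_error=0.16)
cats_and_dogs_overfit = diagnose(train_error=0.01, val_error=0.10)
print(cats_and_dogs_underfit)
print(cats_and_dogs_overfit)
```

The training error is checked first on purpose: there is little point in worrying about the validation gap while the algorithm still struggles with examples it has seen.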
Convolutional neural networks
Artificial neural networks and deep learning techniques have been around for some time now. So why is there so much fuss about deep learning at the moment? The answer is convolutional neural networks! Deep neural networks with fully connected layers are computationally expensive. In the case of image-related problems, this is a HUGE problem. Imagine you want to input 200 x 200-pixel color images. Then, every single neuron in the first fully connected layer would need to have 120,000 weights (200 x 200 pixels x 3 color channels)!
What is different with convolutional neural networks? First, they arrange their neurons in 3-dimensional layers (width, height, and depth), and transform a 3-dimensional input to a 3-dimensional output. They consist of different types of layers, with the convolutional layer at their core. Instead of connecting to every neuron in the previous layer, the neurons in a convolutional layer only process input from a local region of the input volume. The spatial extent of this connectivity is called the receptive field of the neuron.
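The arithmetic behind that number is quick to check. The 1,000-neuron hidden layer below is a hypothetical size chosen only to show how fast the total grows.

```python
def weights_per_neuron(width, height, channels):
    """A fully connected neuron needs one weight per input value."""
    return width * height * channels

# 200 x 200 pixels with 3 color channels (red, green, blue).
per_neuron = weights_per_neuron(200, 200, 3)

# Hypothetical first hidden layer with 1,000 such neurons.
layer_total = per_neuron * 1000
print(per_neuron, layer_total)
```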
The feature map
Hence, the convolutional layer reduces the number of free parameters, allowing the network to be deeper with fewer parameters. For instance, regardless of the image size, tiling regions of 5 x 5 pixels, each with the same shared weights (the filter), require only 25 learnable weights per input channel. The output of a convolutional layer is a so-called feature map. Stacking a lot of such layers together leads to a network that first creates representations of small parts of the input, then from them assembles representations of larger areas.
In practice, a convolutional neural network learns the values of these filters on its own during the training process. The more filters we have, the more image features the network can extract, which generally makes it better at recognizing patterns in unseen images. Three hyperparameters control the size of each feature map:
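Here is a minimal sketch of how one shared filter slides over an image to produce a feature map. The 4 x 4 image and the hand-written 2 x 2 edge filter are made up for the example (a real network would learn the filter values), and the sketch assumes a single channel, a stride of 1, and no padding.

```python
def convolve2d(image, kernel):
    """Slide one square filter over a 2D image and return the feature map.

    As in most deep learning code, the filter is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_height):
        row = []
        for j in range(out_width):
            # The same k * k weights are reused at every position.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Responds where brightness increases from left to right.
edge_filter = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, edge_filter)
print(feature_map)
```

The feature map is non-zero only in the column where the dark and bright halves meet. Note that the filter has just 4 weights here no matter how large the image is, in the same way that a 5 x 5 filter has 25.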
- Depth: number of stacked filters.
- Stride: number of pixels by which we slide our filter matrix over the input matrix.
- Zero-padding: sometimes, it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to bordering elements of our input image matrix.
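How these three settings determine the size of the feature maps can be sketched with the commonly used formula (input - filter + 2 x padding) / stride + 1. The function name `feature_map_shape` is hypothetical, and the sketch assumes square inputs and filters and rounds down when the filter does not fit evenly.

```python
def feature_map_shape(input_size, filter_size, stride=1, padding=0, n_filters=1):
    """Return (width, height, depth) of the output of one convolutional layer."""
    side = (input_size - filter_size + 2 * padding) // stride + 1
    return side, side, n_filters

# 200 x 200 input with a 5 x 5 filter, no padding: the border pixels are lost.
no_padding = feature_map_shape(200, 5)

# Zero-padding of 2 keeps the output at 200 x 200; 16 filters give depth 16.
same_size = feature_map_shape(200, 5, padding=2, n_filters=16)

# A stride of 5 tiles the image with non-overlapping 5 x 5 regions.
tiled = feature_map_shape(200, 5, stride=5)
print(no_padding, same_size, tiled)
```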
Do you want to shed some light on the deep learning black box? Then have a look at the visualization tool by Adam Harley.
Whoa, that’s a lot of information to digest, so we’re going to leave it at that for this week. We have seen the inner workings of deep learning algorithms and we hope that you now feel confident in building your own network. If you have any questions or comments, don’t hesitate to write to us in the comments section. In Part 4, we will have a look at how to use deep learning in practice, including its necessary computing resources, dataset creation, model training, and finally its deployment.