Foundations of deep learning
Today, we’ll introduce deep learning, one approach to machine learning. Last week, we focused on computer vision, the art of making computers understand images, and briefly went over how it used to be done. If you missed it, start with part 1 of AI for Dummies.
The idea of machine learning is to map some kind of input into an output. In other words, we ask a question – input – and the algorithm provides us with an answer – output. It sounds simple, doesn’t it? As you might have guessed, it isn’t actually that easy. When building a machine learning algorithm, the first and most important step is to accurately formulate the question to get the desired answer.

Let’s consider Barry’s picture above. We can now formulate different questions depending on the task that we want the algorithm to perform:
- Is there a dog in this image?
- Is this a dog or a cat?
- Which of the following objects are in this image: dog, cat, plane, or duck?
- Where and how many dogs are in this image?
To answer these questions, we use artificial neural networks.
The artificial neural network
Inspired by Mother Nature, researchers have been trying to imitate the inner workings of a biological brain for years. The resulting mathematical representation is the artificial neural network: a system of nodes, called neurons, that receive inputs and send outputs to each other.
Input layer
The input layer consists of a list of input features. For a single image, the input layer is a single 3-dimensional array: the width and height of the image (measured in pixels) and the number of color channels, i.e. 1 channel for grayscale images and 3 or more channels for color images.
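To make the 3-dimensional array concrete, here is a minimal sketch in plain Python. The image size, the pixel values, and the (height, width, channels) ordering are assumptions chosen for illustration; libraries differ on the ordering they use.

```python
# A tiny 2x3 color image stored as nested lists: image[row][column][channel].
# Size and pixel values are made up for illustration.
height, width, channels = 2, 3, 3

# Start with an all-black image (every channel value is 0).
image = [[[0 for _ in range(channels)] for _ in range(width)] for _ in range(height)]
image[0][1] = [255, 128, 0]  # pixel at row 0, column 1 gets an orange tone

# A fully-connected network works on a flat list of features,
# so the three dimensions are often unrolled into one.
features = [value for row in image for pixel in row for value in pixel]

shape = (len(image), len(image[0]), len(image[0][0]))
print(shape)          # (2, 3, 3)
print(len(features))  # 18
```

Even this toy image yields 18 input features; a real photo of a few hundred pixels per side yields hundreds of thousands.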
Hidden layers and neurons
Hidden layers are the secret sauce of your network. They allow you to model complex data thanks to their nodes/neurons. They are “hidden” because the training dataset does not tell us what values their nodes should take: we only know the input and the desired output. The networks discussed here have at least one hidden layer, and networks with multiple hidden layers are called deep neural networks. The most common type of hidden layer is the fully-connected layer. Here, each neuron is connected to every neuron in the layer before it and in the layer after it, but not to the neurons in its own layer. Convolutional layers are another type of hidden layer, and they are very prominent when dealing with images.
Neurons are the processing units of the network. Each neuron weighs and sums its inputs and passes the result through an activation function. The activation function transforms this weighted sum, usually non-linearly, before it is fed to the next layer. By choosing a different activation function, you change how the neuron responds to its inputs.
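A single neuron can be sketched in a few lines of Python. The input values, the weights, the bias term, and the two activation functions shown (sigmoid and ReLU) are assumptions picked for illustration; the article does not prescribe any of them.

```python
import math

def sigmoid(z):
    # Squashes any number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Passes positive values through and replaces negative values with 0.
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    # Weigh and sum the inputs, then apply the activation function.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical numbers, chosen only to show the mechanics.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

out_sigmoid = neuron(inputs, weights, bias, sigmoid)
out_relu = neuron(inputs, weights, bias, relu)
print(out_sigmoid, out_relu)
```

The weighted sum here is negative, so ReLU outputs 0 while sigmoid outputs a value below 0.5: the same neuron behaves differently depending on the activation function you pick.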
Output layer
The output layer is the final layer with neurons. This is where the data comes out of your model. So the number of neurons needs to be exactly the number of outputs you want, i.e. the questions you want to answer. If we want to know if Barry is a dog or a cat, the number of output neurons is exactly 2. One is for the probability of being a dog and the other for being a cat.
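As a rough sketch of a two-neuron output layer, the snippet below turns two raw scores into two probabilities. It assumes a softmax function, which is one common way to do this and is not named in the article; the raw scores are made up.

```python
import math

def softmax(scores):
    # Subtract the maximum first for numerical stability,
    # then normalize the exponentials so they sum to 1.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["dog", "cat"]          # one output neuron per possible answer
raw_scores = [2.0, 0.5]          # hypothetical outputs of the last layer

probabilities = softmax(raw_scores)
predicted = labels[probabilities.index(max(probabilities))]
print(dict(zip(labels, probabilities)), predicted)
```

With these scores, the "dog" neuron receives the larger probability, so the network would answer that Barry is a dog.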
OK, now we have nearly everything we need. What’s missing? Well, the learning part of course! The ability to judge which input is more important for identifying Barry as a dog in the image can be learned. That is the beauty of artificial neural networks: no manual extraction of features such as specific shapes, colors, or edges is required.
The only question remaining is: How should the algorithm learn? The answer depends on you and your data. Do you want the algorithm to learn from your precious labeled data (i.e., supervised learning)? Or do you want the algorithm to figure out by itself what makes your data special without any feedback from you (i.e., unsupervised learning)? In both cases, you need to define a loss function which is then minimized during the learning process.
Supervised learning
Back to Barry. With supervised learning, we need to give the algorithm examples of what a dog and a cat look like for it to correctly identify Barry as a dog and not as a cat. These examples are called labeled training data.
The next thing to do is to define our loss function, which allows the algorithm to learn to distinguish a dog from a cat. It measures the mismatch between the correct label, i.e. the ground truth, and the label predicted by the algorithm. The loss is minimal when the predicted label corresponds to the ground truth label.
Supervised learning is the most common learning scheme used in computer vision due to its simplicity. But you might have already guessed the major issue with this approach: you need a large number of well-annotated training examples. If you wish to train generic models, open-access datasets such as ImageNet or OpenImages are available.
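To illustrate the idea of a mismatch, here is a minimal sketch using cross-entropy, a loss commonly used for classification. The choice of cross-entropy and the predicted probabilities are assumptions for illustration; the article does not commit to a specific loss.

```python
import math

def cross_entropy(predicted_probs, true_index):
    # The loss is the negative log of the probability assigned to the
    # correct label: close to 0 when the model is confident and right,
    # large when it is confident and wrong.
    return -math.log(predicted_probs[true_index])

DOG, CAT = 0, 1  # index of each label in the prediction list

# Barry is a dog, so the ground truth label is DOG.
good_prediction = [0.9, 0.1]  # hypothetical: 90% dog, 10% cat
bad_prediction = [0.2, 0.8]   # hypothetical: 20% dog, 80% cat

good = cross_entropy(good_prediction, DOG)
bad = cross_entropy(bad_prediction, DOG)
print(good, bad)
```

The prediction that favors "dog" gets the smaller loss, which is exactly the signal the learning process uses to prefer it.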
Unsupervised learning
For unsupervised learning, the algorithm needs to figure out by itself what the most characteristic features of a dog and a cat are. We would give the algorithm the above pictures and it would group them by characteristics. This means that, instead of a dog/cat classification, we could end up with categories such as “animal in a sun chair” or “animal wearing sunglasses”. To avoid this, the loss function needs to be properly formulated according to the question you want to ask.
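Grouping without labels can be sketched with k-means, a classic clustering method. It is used here only as a simple stand-in and is not the method the article describes; the two-number "features" per picture and the starting centers are hypothetical.

```python
def kmeans(points, centers, steps=10):
    # Repeatedly assign each point to its nearest center,
    # then move each center to the mean of its assigned points.
    for _ in range(steps):
        clusters = [[] for _ in centers]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Each picture is reduced to two made-up feature values. No labels are given.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]

centers, clusters = kmeans(points, centers=[points[0], points[3]])
print(clusters)
```

The algorithm finds two groups, but nothing tells it what they mean: whether they correspond to dog/cat or to sunglasses/no sunglasses depends entirely on what the features capture.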
Conclusion
Long story short, deep learning algorithms are always made up of the same building blocks: input, hidden, and output layers, as well as your computing units, the neurons. What makes your algorithm unique is the way you stack and train them according to the problem you want to solve.
That’s it for part 2 of AI for Dummies! Next week we’ll dive even further into the learning details and strategies of deep learning. We will take a look at how to minimize the loss function, how the parameters of each layer are adjusted, and how convolutional layers work. Stay tuned!