Introduction to Image Recognition – AI for Dummies (1/4)

by Vincent Delaitre

Read our introduction to image recognition and computer vision and discover the most promising field of deep learning.

When it comes to images, artificial intelligence has existed under different names since the 60s: computer vision and image recognition. But what exactly is computer vision?

Computer vision is the art and science of making computers understand images.

You might not realize it, but your brain is indeed a beautiful machine. From one single picture, it can retrieve more information than we know what to make of. Have a look at the picture below.

Picture to train image recognition algorithm.
Barry is a cool dog. He likes surfing in Hawaii.

If asked what’s in the image, you would probably tell me there’s a dog, on the beach, with some kind of bodyboard, wearing red sunglasses and a Hawaiian necklace made of fake flowers…

Spoiler alert! The day when a computer can reach this level of both precision and generality at the same time has not come yet. Fortunately for us (otherwise we’d be out of business), there are already some practical use cases where computer vision proves highly valuable.

Tell Me What You See

So, what do we teach computers then? Simple: recognize, identify and locate objects with different degrees of precision. Barry and his friend Ducky will show you what I mean. For the sake of simplicity, I will illustrate the four main tasks used today in real-world applications:

  1. Classification.
  2. Tagging.
  3. Detection.
  4. Segmentation.

Classification and tagging

Image recognition: classification and tagging.
Classification (left): we are pretty sure there is only a dog and no cat. Tagging (right): there are both a dog and a duck.

The first and most straightforward task we can accomplish is to identify what is in an image and how sure we are about it, i.e. the probability percentages shown in the two pictures above. There are two main points to consider:

  1. What is the list of objects you want to detect?
    This is called the ontology. In the first image, it’s cats and dogs. To keep it (very) simple, you need to tell the algorithm what classes of objects it should identify beforehand. And as with all things simple… it’s actually more complicated than that. You don’t always have to list all the objects. However, this is an open area of research called unsupervised learning, so we’ll steer clear of it for the time being.
  2. Are there multiple objects in the same picture?
    If there is only one item in the picture, we call it classification (left). Otherwise, when several objects are in the same picture, it’s known as tagging (right).
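To make the difference concrete, here is a minimal Python sketch. The raw scores are made up (a real model would compute them from the image), and `softmax`, `sigmoid`, `classify` and `tag` are illustrative helper names, not any particular library's API. One common way to set things up, which I assume here, is to let the labels compete for a single probability budget in classification, and to score each label independently in tagging.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Turn one raw score into a probability, independently of other labels."""
    return 1.0 / (1.0 + math.exp(-score))

def classify(scores, ontology):
    """Classification: exactly one label of the ontology wins."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ontology[best], probs[best]

def tag(scores, ontology, threshold=0.5):
    """Tagging: keep every label whose own probability clears the threshold."""
    tags = []
    for label, score in zip(ontology, scores):
        prob = sigmoid(score)
        if prob >= threshold:
            tags.append((label, prob))
    return tags

# Hypothetical raw scores, standing in for a trained model's output.
print(classify([0.3, 4.1], ["cat", "dog"]))
print(tag([-3.0, 2.5, 1.8], ["cat", "dog", "duck"]))
```

With these made-up scores, `classify` picks "dog" alone, while `tag` keeps both "dog" and "duck" and drops "cat". The 0.5 threshold is an arbitrary choice for the sketch.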

Detection and segmentation

Image recognition: detection and segmentation.
Detection (left): we know in which box in the image Ducky and Barry are. Segmentation (right): we have the information at the pixel level.

Now that we’ve answered the What, the question becomes: Where are the objects we’re looking for? There are two ways to do it:

  1. Detection outputs the rectangle, or bounding box, where each object sits in the image. It can be prone to small errors in position, but it’s a very robust technology.
  2. Segmentation goes one step further. For each pixel, the most atomic element of information in an image, we identify which object, if any, it belongs to. The result is a very precise map, though it requires a lot of carefully annotated data. Annotating every pixel is tedious, but it can deliver impressive results. This is one of the reasons why use cases in healthcare, particularly in cancer detection, are becoming more and more widespread.
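The two outputs are closely related: once you have a pixel-level map, you can always derive a bounding box from it, but not the other way around. Here is a small sketch of that idea on a toy segmentation map. The grid, the label numbers and the function names `bounding_box` and `box_fill_ratio` are all invented for illustration.

```python
def bounding_box(mask, label):
    """Return (top, left, bottom, right) of the tightest box around `label`,
    or None if the label does not appear in the mask."""
    rows = [r for r, row in enumerate(mask) if label in row]
    cols = [c for row in mask for c, value in enumerate(row) if value == label]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)

def box_fill_ratio(mask, label):
    """Fraction of the bounding box that the object really covers."""
    box = bounding_box(mask, label)
    if box is None:
        return 0.0
    top, left, bottom, right = box
    area = (bottom - top + 1) * (right - left + 1)
    pixels = sum(row.count(label) for row in mask)
    return pixels / area

# Toy segmentation map: 0 = background, 1 = Barry, 2 = Ducky.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 2, 0],
    [0, 1, 0, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]

for name, label in (("Barry", 1), ("Ducky", 2)):
    print(name, bounding_box(mask, label), box_fill_ratio(mask, label))
```

The fill ratio is below 1 for both objects: part of each box is background, which is exactly the information a box throws away and a segmentation map keeps.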

These were the four main building blocks of computer vision. However, you also have instance identification, face key-points detection, action recognition, tracking, optical character recognition, image generation, style-transfer, denoising, depth estimation, 3D reconstruction, motion estimation, optical flow, etc. You get the idea; there’s a lot to do!

Traditional computer vision vs deep learning

Arthur C. Clarke, who wrote 2001: A Space Odyssey, said it better than anyone else: “Any sufficiently advanced technology is indistinguishable from magic.” My spin on this quote is that until you explain exactly how something works, you will never understand and accept it. It’s especially true when it comes to AI. Once you start peeling the onion, you realize it’s just another technology with its strengths and weaknesses. You shouldn’t be scared of it any more than you are of electricity.

The real game-changer and the most fundamental difference between traditional computer vision and what’s now called deep learning lies in how you build the algorithms.

  • The new ways.
    With deep learning, everything relies on examples. You need a large collection of dog and cat images; the algorithm then builds on what it extracted from those images to make predictions on pictures it has never seen before. This is what’s called generalization. Warning ⚠️: you should always be very suspicious of people who speak of algorithms as if they were sentient beings with motives, which is what I just did. Just because they appear to learn as we do doesn’t mean they are actually able to think.
  • The old ways.
    On the other hand, traditional computer vision is mostly rule-based. This means that you will look at images of what you want to detect and then use your imagination and logical thinking. The objective? Design a set of rules and instructions that will lead to the result you’re looking for.
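Here is a toy contrast between the two approaches. To be clear, the nearest-centroid classifier below is a deliberately simple stand-in for "learning from examples", not deep learning, and the two features (weight and ear length) as well as every number are made up for the sketch.

```python
def rule_based(features):
    """Old way: a human looks at the data and hand-picks a rule."""
    weight_kg, _ear_length_cm = features
    return "dog" if weight_kg > 6.0 else "cat"  # threshold chosen by hand

def train_centroids(examples):
    """New way: compute one average feature vector per label from examples."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Pick the label whose average example is closest to the new input."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(center, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical labelled examples: (weight in kg, ear length in cm).
EXAMPLES = [
    ((4.0, 6.0), "cat"), ((5.0, 6.5), "cat"), ((3.5, 5.5), "cat"),
    ((20.0, 10.0), "dog"), ((8.0, 9.0), "dog"), ((30.0, 12.0), "dog"),
]

centroids = train_centroids(EXAMPLES)
unseen = (25.0, 11.0)  # an animal that is not in the examples
print(rule_based(unseen), predict(centroids, unseen))
```

Both functions agree on this input. The point is where the decision comes from: in `rule_based` a person wrote the threshold, while `predict` only knows what `train_centroids` computed from the examples, and still gives an answer for an input it has never seen.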

Rules and instructions

Let’s have a look at examples of rules and instructions.

Wouldn’t it be sweet if you didn’t have to halt at the highway toll to pay your fare? What about ensuring that your crazy neighbor stops speeding down the road when your kids are playing? And what if your garage door could automatically recognize you and open itself? What you need is first to detect license plates, and then to be able to read them.

For now, let’s focus on the detection part. There are six main steps that I’m going to illustrate using our company car, the Deepomobile:

Image recognition: rules and instructions.
  • Step 1. Nothing much to say here except that it’s very convenient to go from point A to point B and get all the heads to turn.
  • Step 2. First, we transform the image to grayscale (black and white) by merging the red, green, and blue channels. Then, we blur it to remove the small artifacts and detect more general shapes.
  • Step 3. The gradient magnitude is computed. Put simply, the gradient is the difference between two adjacent pixels. The higher it is, the more different the pixels are, which is why it’s used to detect edges.
  • Step 4. Non-maximum suppression ensures that even if one edge spans multiple pixels, we only consider the most likely line.
  • Step 5. Hysteresis thresholding reinforces this: it keeps the strong edge pixels, plus the weaker ones connected to them, and provides clean-cut edges.
  • Step 6. The edges are converted into geometric lines which are then in turn used to detect the rectangle shape of the license plate.
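Steps 2, 3 and 5 above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code behind the pictures: it assumes a plain average for the channel merge, a 3x3 mean blur, central differences for the gradient, and hand-picked thresholds, and it leaves out non-maximum suppression (step 4) and line fitting (step 6). All function names are invented for the example.

```python
import math

def _clamped(img, r, c):
    """Read a pixel, reusing the nearest border pixel when out of range."""
    h, w = len(img), len(img[0])
    return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

def to_gray(rgb):
    """Step 2a: merge the red, green and blue channels by averaging them."""
    return [[sum(px) / 3.0 for px in row] for row in rgb]

def box_blur(img):
    """Step 2b: replace each pixel by the mean of its 3x3 neighbourhood."""
    offsets = (-1, 0, 1)
    return [[sum(_clamped(img, r + dr, c + dc) for dr in offsets for dc in offsets) / 9.0
             for c in range(len(img[0]))] for r in range(len(img))]

def gradient_magnitude(img):
    """Step 3: difference between neighbouring pixels, in both directions."""
    out = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            gx = _clamped(img, r, c + 1) - _clamped(img, r, c - 1)
            gy = _clamped(img, r + 1, c) - _clamped(img, r - 1, c)
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out

def hysteresis(mag, low, high):
    """Step 5: keep pixels >= high, plus pixels >= low connected to them."""
    h, w = len(mag), len(mag[0])
    edges = [[mag[r][c] >= high for c in range(w)] for r in range(h)]
    stack = [(r, c) for r in range(h) for c in range(w) if edges[r][c]]
    while stack:
        r, c = stack.pop()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w and not edges[nr][nc] and mag[nr][nc] >= low:
                    edges[nr][nc] = True
                    stack.append((nr, nc))
    return edges

def detect_edges(rgb, low, high):
    """Chain steps 2, 3 and 5."""
    return hysteresis(gradient_magnitude(box_blur(to_gray(rgb))), low, high)

# Toy 4x8 image: dark left half, bright right half, so one vertical edge.
DARK, BRIGHT = (0, 0, 0), (255, 255, 255)
image = [[DARK] * 4 + [BRIGHT] * 4 for _ in range(4)]

for row in detect_edges(image, low=50, high=100):
    print("".join("#" if edge else "." for edge in row))
```

On this toy image the detected edge comes out four pixels wide rather than one. That thickness is precisely what non-maximum suppression is there to remove.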

Each step has its own set of parameters and needs specific tuning. As a consequence, traditional computer vision techniques are not always reliable when the conditions change. For instance, if we designed our license plate detector to work in a garage, then using it outside, in the presence of shadows, at night or in broad daylight, might yield less-than-optimal results, or even render it useless.
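A tiny example shows how brittle a hand-tuned parameter can be. The pixel values and the threshold below are invented for illustration, and `find_edges` is a hypothetical one-dimensional edge detector, far simpler than the six steps above.

```python
def find_edges(row, threshold):
    """Return each index i where pixels i and i+1 differ by more than `threshold`."""
    return [i for i in range(len(row) - 1) if abs(row[i + 1] - row[i]) > threshold]

GARAGE_THRESHOLD = 100  # hypothetical value, hand-tuned for a well-lit garage

garage_row = [30, 30, 220, 220, 30]  # bright plate against a dark bumper
night_row = [5, 5, 45, 45, 5]        # same scene with much less light

print(find_edges(garage_row, GARAGE_THRESHOLD))  # [1, 3]
print(find_edges(night_row, GARAGE_THRESHOLD))   # []
```

The plate is still there at night, but the contrast has dropped below the threshold, so the detector sees nothing until someone retunes it. Multiply that by every parameter of every step and you can see where the maintenance cost comes from.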


Long story short, we used to design specific and tailored recipes for each computer vision task. Now, with deep learning, we build algorithms that learn to make their own rules.

Next week we’ll go over how we do things nowadays and what the term deep learning really means. Stay tuned!

