Published on

What is Computer Vision?

Deepomatic Text Breaker - Small

 What is Computer Vision?

Humans rely on their eyes to gather and interpret information, letting them understand their environment to execute actions properly. This "simple" action involves visual data collection and real-time interpretation. However, even though human vision is remarkable, it has limits, especially regarding errors or fatigue. In the end, we are not machines. 

So, what is Computer Vision? If we dissect the term "Computer Vision," it essentially signifies what a computer can perceive. Therefore, we can consider Computer Vision to be the eyes of a machine powered by an artificial intelligence system. Similarly, these "eyes" allow machines to observe, analyze, and comprehend, ultimately understanding their environment as humans do. However, contrary to human vision, these "visual senses" are powered by cameras and sophisticated algorithms that can swiftly and accurately analyze data.

Computer Vision systems can process thousands of images or videos within minutes while identifying defects or irregularities. This surpasses human capabilities in speed and efficiency as CV can significantly reduce its margin of error, leading to more accurate and reliable outcomes in high-volume operations.

The computer vision market has been thriving for the last few years. With a projected annual growth rate (CAGR) of 11.69% from 2024 to 2030, the market size will reach US$50.97 billion by the decade's end.

Deepomatic Text Breaker - Small

How does Computer Vision work?

Let’s see how Computer Vision technology works. In short, developing Computer Vision algorithms, or “Computer Vision models” typically follows three steps: 

  • Acquire visual data (e.g.,  pictures or videos) and manually annotate it automatically with the objects you intend to recognize later on.
  • Train the Computer Vision model to understand what those objects look like to recognize them on new unseen images
  • Use the trained Computer Vision model to automatically predict those objects on the flow of production images at scale.

Said like this, Computer Vision seems easy, right? Well, these “simple” steps are actually quite complex! Let’s dive deeper into the subject.

Images are represented digitally as a matrix of pixels, each containing values for the three primary colors: red, green, and blue (RGB). These values range from 0 to 255, where 0 represents complete darkness and 255 represents full brightness. All pixels come together to form the digital image. Here is a simple representation:

RGB representation of van gogh's "sunflower" painting

Source: ConnectCode

Another way to visualize how these pixels come together into a single image is to break down the image into a matrix showing the value of each pixel. Here is a representation of Andy Warhol’s “Marilyn Diptych” paint:

Andy Wharhol's Marilyn Diptych painting represented with 0-1 numbers showing the RGB colors

Keeping in mind how images are represented,  we proceed to understand how Computer Vision systems process them. Two pillar technologies allow Computer Vision to recognize images and what is inside: Deep Learning and Convolutional Neural Networks (CNNs). 

Deep Learning (DL) 

Deep Learning is a type of machine learning that uses specific algorithms called Neural Networks. These networks are inspired by how our brains work, with interconnected layers of "neurons." DL is excellent at learning from vast amounts of data to find complex patterns similar to human capacities. It is possible because it allows computers to learn independently after analyzing contextual datasets.

Convolutional Neural Networks (CNN)

CNNs are a specific family of neural networks especially suited for processing image or audio data. Inspired by how the retina works in our eyes, they work by applying spatial filters to the image pixels (an operation called “convolution”).

Each filter is responsible for producing a signal specific to a visual pattern. By applying this operation several times, the model can recognize more and more complex shapes, ultimately resulting in the identification of the image's content itself.

Explanation of how Convolutional Neural Networks work, using the example of the picture of a car

Source: Towards Data Science

Training the model involves adjusting filters to minimize errors on annotated images, enabling accurate identification of target objects in new images without manual intervention. Taking the example above, it means allowing a computer to learn how to recognize car pictures without human help.

Deepomatic Text Breaker - Small

Common Computer Vision Tasks

Deepomatic Text Breaker - Small

Computer Vision History

We have been talking a lot about what Computer Vision is, how it works, and some of the common tasks this technology performs, but when did it even start?

The origin

The very origin of Computer Vision dates back to 1957 when Russel Kirsch asked: What would happen if computers could look at pictures? This question led to Kirsch’s group creating the first digital scanner, thus generating the first digital image.

A couple of years later, in 1959, two neurophysiologists discovered by accident that the starting point of image processing was simple structures like straight edges. These findings resulted from a “Receptive fields of single neurons in the cat's striate cortex study” by David Hubel and Torsten Wiesel. 

Computer Vision in the 1960s

The following decade experienced two big moments. First, artificial intelligence became an academic field of study, with many researchers advocating for its development. Indeed, AI was founded in the summer of 1956 during the Dartmouth College conference but only boomed in the 1960’s.

Secondly, in 1963, Lawrence Roberts was able to turn 3D information into 2D information, reducing the visual world into regular geometric shapes. His Ph.D. thesis, “Machine Perception of three-dimensional Solids,” described this process. 

Computer Vision in the 1970s

The 1970s were the period where the groundwork for Computer Vision was developed, as researchers dived deeper into algorithm development for feature extraction (key points identification) and edge detection (object recognition) on images. This was also when industry players started to test computer vision for quality control (roots of visual inspection).

Computer Vision from 1980 to the end of the Millennium

During the 80s, two big contributions were made to the computer vision field. The first is David Marr's introduction to hierarchical models, which focuses on detecting edges, corners, curves, and similar basic shapes.

Over the same period, Kunihiko Fukushima developed Neocognitron, a technology that works on pattern recognition. Little did he know that his finding would be the starting point of Convolutional Neural Networks (CNN), core to today's computer vision.

Towards the end of the millennium and after years of research based on Marr’s work, there was a shift in the researcher's approach to Computer Vision. They stopped trying to reconstruct objects through 3D models and instead focused on feature-based object recognition. Parallely, the launch of LeNet-5, an essential ingredient for CNN’s, still used today. 

Computer Vision from the 2000s onwards

During the early 2000s, massive study results floated to the surface. Findings like the first real-time face recognition (Paul Viola and Michael Jones, 2001) and standardized datasets for object classification (Pascal VOC) appeared. Then, in 2011, ImageNet happened. An enormous data set with over a million images, all manually tagged, featuring around a thousand object classes.

Two years later, Alexnet appeared during an image recognition competition. The model achieved a 16.4% error rate compared to former “records” of around 26%. Since then, error rates have dropped substantially and will continue to do so.

Deepomatic Text Breaker - Small

Computer Vision Applications

Deepomatic Text Breaker - Small

Computer Vision's Technological Integration and Advancements

The Computer Vision models we described at the beginning of this post are called "fully supervised." This is because one needs to collect many examples of images annotated with the target objects we intend to teach to the model. 

Emerging trends and future potential

A new wave of foundation models has emerged, propelled by the adoption of 'self-supervised' learning techniques. In this approach, the objective typically revolves around taking raw data, such as images or text, and training a model to reconstruct a portion of that data that has been intentionally concealed. Unlike traditional methods reliant on external manual annotations, this technique operates on existing data, where the masked segment is known to the training algorithm. This effectively circumvents the conventional bottleneck of necessitating human-labeled images for training. Consequently, it has paved the way for training significantly larger models on datasets several orders of magnitude larger than previously feasible. This includes training models on text, images, or videos sourced directly from the vast expanse of the internet.

These self-supervised models might reconstruct data in a latent space of "features" that abstractly represent the underlying data, like, for example, Meta's Dinov2 model, which was trained on 142 million pictures. Still, they can also try to reconstruct the original data directly (e.g., generate human-readable text or images), called "generative models." Those models are impressive because they produce very realistic text or images.

Another recent evolution is multi-modality. Until recently, computer vision depended on models trained on a single data type, like images. On the contrary, multi-modal models can jointly process data of various kinds, for example, by generating a video corresponding to a text description, enabling machines to understand the world better. This functionality unlocks a vast potential in human-machine interactions as people can instruct computers in human language to perform various tasks, implying the manipulation of all sorts of data (images, PDFs, curves and trends, etc…).

Deepomatic Text Breaker - Small

Computer Vision Challenges and Limitations

Like any growing technology, Computer Vision has challenges, from performance to biases.

Accuracy VS size challenges

When trained on sufficient data, the larger a deep learning model is, the more accurate it is. However, a large model consumes lots of memory and compute time. When deploying this model in a constrained environment, e.g., on mobile, memory and computing resources are limited.

To overcome this limitation, one can work towards two complementary directions to squeeze out the most from mobile hardware in terms of accuracy versus speed/memory tradeoffs:

  • Reducing the model size with quantization and pruning techniques
  • Relying on more mobile-friendly models which are designed from scratch for mobile.

Performance Challenges

It is necessary to have a well-defined problem you want to solve, and it's usually the most challenging part. If you spend a lot of time making a dataset and then realize you forgot something, you would have to go back to train that bit again. Despite these challenges, teams with great tooling can develop Computer Vision projects within a few weeks, encompassing dataset creation, training, and initial results.

Sensibility to bias

Machine learning models are susceptible to bias present in the training data. A notable example is the bias observed in some publicly available face detection algorithms, which tend to perform less accurately on people of color, particularly women, due to their under-representation in the training dataset. This phenomenon extends to various automated tasks, emphasizing the importance of detecting and addressing bias to enhance the accuracy and fairness of models.

Proprietary models and datasets

Since the surge of deep learning models in 2012, academic research has progressed rapidly, somewhat diminishing the advantage once held by proprietary models. However, this trend has shifted in the past 3-4 years with the exponential growth in the scale of both models and datasets.

This expansion necessitates immense computational resources for model training, a demand that often surpasses what academic funding can provide. Nevertheless, there has been a recent trend towards open-sourcing models and datasets. Certain organizations, such as Meta, Mistral, and Kyutai, have committed to this direction, enabling the research community to leverage their insights and advancements.

Computing power

The electric consumption associated with running deep learning algorithms is increasingly under scrutiny due to its contribution to carbon emissions. For instance, during the evaluation of models such as GPT-3, researchers discovered that over six months, they trained 4,789 model versions, consuming 9,998 days' worth of GPU time and emitting over 78,000 pounds of CO2.

Deepomatic Text Breaker - Small

Computer Vision Practical Insights and Learning Resources


Automate Quality. Accelerate Growth.