Moving beyond: Deepomatic learns how to track multiple objects


What is the Multi-Object Tracking (MOT) system?

What if you could track the players of your favorite sports team in real time to analyze the game? Or follow the movement of cells, molecules, and organisms in time-lapse microscopy images? Or count and follow a herd of animals in a hard-to-reach area?

These are just some of the many potential uses of Multi-Object Tracking (MOT). Other applications that have sparked enormous interest include autonomous cars, robot navigation, medical imaging, and security surveillance.

In theory, MOT algorithms are capable of tracking closely spaced and crossing targets. However, the reality is much more complex. Imagine a crowded environment with many similar targets that interact with one another, such as pedestrians in a city. Some pedestrians may enter or leave the field of view of your camera, or be occluded by a tree, vehicle or building. They can also easily change their appearance, simply by taking off a jacket or looking in another direction. Moreover, pedestrians may not be detected at all because the images are noisy or the resolution is too low. Keeping track of them means maintaining each pedestrian's identity from frame to frame, a challenging task which Deepomatic has started to tackle!

In this blog post, we want to explore the most interesting aspects and some fundamental elements of MOT algorithms. First, we start with a general overview of MOT algorithms. Then, we dive into their two key components: the observation and dynamic models. Finally, we present a recent scientific paper focusing on the observation model.

MOT algorithms

MOT algorithms can be roughly classified into two distinct groups: detection-free and detection-based tracking [1]. The first group does not rely on an object detector to provide target detections, while the second one does.

The main advantage of the first approach is that it does not depend on the type of detector or on its performance. This allows the tracker to be applied to any kind of object (people, animals, cars, cells, etc.). Conversely, the detection-based or “tracking-by-detection” approach is mostly specialized in tracking one given type of object. What the latter group lacks in generality, it makes up for in practicality in real-life applications. It tends to be the most popular approach for two main reasons. First, new objects are discovered and disappearing objects are terminated automatically. Second, object detection has improved enormously in recent years.

Each MOT algorithm consists of two primary components:

  1. An observation model. It measures the similarity between tracked objects in past frames and detected targets in a new frame through appearance, motion, and interaction cues.
  2. A dynamic model. It receives the similarity matrix from the observation model as input and follows the behaviour of tracked objects over time (appearance or disappearance of some objects, continued tracking of the others).

Finding similarities: the observation model


The appearance model describes the visual representation of an object and computes the similarity between two observations at different times. Even though it is an important cue for affinity calculations, it is not necessarily sufficient to discriminate between different observations. A typical case would be two pedestrians with similar clothing in different locations in consecutive frames.
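To make the idea of an appearance score concrete, here is a minimal sketch in Python. It assumes each image crop has already been reduced to a small feature vector (here a hypothetical three-bin colour histogram) and uses cosine similarity as the comparison, which is just one common choice. The function name and the feature values are ours, for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no appearance information available
    return dot / (norm_a * norm_b)

# Hypothetical features: share of red, green and blue in each crop.
track_red_jacket = [0.7, 0.2, 0.1]
detection_red_jacket = [0.6, 0.3, 0.1]
detection_blue_coat = [0.1, 0.2, 0.7]

same = cosine_similarity(track_red_jacket, detection_red_jacket)
other = cosine_similarity(track_red_jacket, detection_blue_coat)
print(same > other)  # the red-jacket detection is the closer match
```

Two different pedestrians wearing similar red jackets would both score highly here, which is exactly why appearance alone is not enough.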


The motion model describes how an object is moving: a pedestrian can be static (e.g., standing at a traffic light), walking at constant speed in a given direction, turning a corner, accelerating or decelerating. The model can hence predict the possible positions of a pedestrian in future frames, which helps to distinguish between targets with similar appearances, but it does not take into account the influence of other objects.
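As a rough illustration, the simplest motion model assumes constant velocity: the displacement seen between the last two frames is expected to repeat. The sketch below predicts the next position this way and turns the distance between prediction and detection into a score. The function names, the exponential scoring and the `scale` parameter are our assumptions, not something prescribed by a particular tracker.

```python
import math

def predict_position(prev, curr, steps=1):
    """Extrapolate a position assuming constant velocity between frames."""
    vx, vy = curr[0] - prev[0], curr[1] - prev[1]
    return (curr[0] + steps * vx, curr[1] + steps * vy)

def motion_similarity(predicted, detected, scale=20.0):
    """Map the prediction-to-detection distance to a score in (0, 1]."""
    dist = math.hypot(predicted[0] - detected[0], predicted[1] - detected[1])
    return math.exp(-dist / scale)

# Hypothetical pixel coordinates of one pedestrian in two consecutive frames.
predicted = predict_position(prev=(100.0, 50.0), curr=(110.0, 52.0))
near = motion_similarity(predicted, (121.0, 54.0))
far = motion_similarity(predicted, (60.0, 90.0))
print(predicted, near > far)
```

A detection close to the predicted position scores near 1, while one far away scores near 0, even if it looks identical.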


The interaction model captures the influence that different objects have on each other. For instance, a pedestrian in a group tends to follow the movement of the group, while a lone pedestrian adapts their speed or trajectory to avoid colliding with others.

Thanks to these three models, we are able to build a similarity matrix between tracked objects in past frames and detected targets in a new frame, and this matrix will then feed the dynamic model.
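The similarity matrix itself can be sketched in a few lines. In the hypothetical example below, every cue is a function returning a score between 0 and 1 for a (track, detection) pair, and the cues are merged with a hand-picked weighted average. The toy cue functions, the dictionary fields and the weights are all assumptions made for illustration.

```python
def similarity_matrix(tracks, detections, cue_functions, weights):
    """matrix[i][j] = weighted mean of the cue scores for track i and detection j."""
    total = sum(weights)
    matrix = []
    for track in tracks:
        row = []
        for det in detections:
            score = sum(w * cue(track, det)
                        for cue, w in zip(cue_functions, weights))
            row.append(score / total)
        matrix.append(row)
    return matrix

# Toy cues on hypothetical fields: a colour value and a 1D position.
def appearance_cue(track, det):
    return 1.0 - abs(track["hue"] - det["hue"])

def motion_cue(track, det):
    return 1.0 / (1.0 + abs(track["x"] - det["x"]))

tracks = [{"hue": 0.1, "x": 10.0}, {"hue": 0.8, "x": 50.0}]
detections = [{"hue": 0.75, "x": 52.0}, {"hue": 0.15, "x": 12.0}]

matrix = similarity_matrix(tracks, detections,
                           [appearance_cue, motion_cue], weights=[0.5, 0.5])
for row in matrix:
    print([round(score, 2) for score in row])
```

Rows correspond to tracked objects and columns to new detections, so a high value at row i, column j suggests that detection j continues track i.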

Determining the tracks: the dynamic model

The main role of the dynamic model is to find the ‘optimal’ sequence of detections for each object, i.e. its track, using either all frames (so-called ‘offline’ methods) or only the frames observed so far (so-called ‘online’ methods). Two approaches exist to determine this sequence: probabilistic inference, mainly used in online algorithms, and deterministic optimisation, mainly used in offline algorithms. The first approach estimates the most probable state (size, position, velocity, etc.) of an object from its previous observations. The second treats data association as an optimisation problem and searches for the best overall assignment of detections to tracked objects.
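A well-known example of probabilistic inference is the Kalman filter. The sketch below is a deliberately simplified scalar version that tracks a single coordinate and assumes the object barely moves between frames; real trackers usually estimate position and velocity jointly. The noise values and measurements are hypothetical.

```python
def kalman_step(estimate, variance, measurement,
                process_noise=1.0, measurement_noise=4.0):
    """One predict/update cycle of a scalar Kalman filter (static-state model)."""
    # Predict: the state is assumed unchanged, but its uncertainty grows.
    variance = variance + process_noise
    # Update: blend prediction and measurement according to their uncertainty.
    gain = variance / (variance + measurement_noise)
    estimate = estimate + gain * (measurement - estimate)
    variance = (1.0 - gain) * variance
    return estimate, variance

estimate, variance = 0.0, 100.0    # vague initial guess of an x-coordinate
for z in [10.2, 9.7, 10.1, 9.9]:   # hypothetical noisy detections
    estimate, variance = kalman_step(estimate, variance, z)

print(round(estimate, 2), round(variance, 2))
```

Each new observation pulls the estimate towards the measured position and shrinks the uncertainty, which is the online behaviour described above: only past and current frames are needed.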


Until recently, work on MOT algorithms focused on the association of detected objects from one frame to another, and hence on improving the dynamic model. The trend has now shifted towards building stronger similarity scores based on appearance cues, and hence towards improving the appearance model [2]. In this post, we introduce a recent approach that follows this trend by combining several cues over a period of time [3].

“Tracking the Untrackable”

In the paper “Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies” [3], the authors Sadeghian, Alahi and Savarese propose a new method to calculate the similarity between objects. They combine the appearance, motion, and interaction cues discussed above using a Recurrent Neural Network (RNN), which learns and remembers the dependencies in a sequence of observations. This contrasts with pairwise similarity, where only the observations from the current and previous frames are used. Each cue is itself modelled by an RNN that retains information from the past (see Fig. 1, where the RNNs are represented by trapezoids).
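To give an intuition of what "remembering a sequence" means, here is a toy recurrent cell in plain Python. It is not the network from the paper: the state is a single number and the two weights are hand-picked rather than learned. It only shows that the final state depends on the whole history, whereas a pairwise comparison would see only the last observation.

```python
import math

def rnn_final_state(sequence, w_input=0.8, w_hidden=0.9):
    """Run a one-unit recurrent cell over a sequence and return its last state."""
    hidden = 0.0
    for x in sequence:
        # The new state mixes the current input with the previous state.
        hidden = math.tanh(w_input * x + w_hidden * hidden)
    return hidden

# Two hypothetical cue histories that end with the same observation.
history_a = [1.0, 1.0, 0.5]
history_b = [-1.0, -1.0, 0.5]

final_a = rnn_final_state(history_a)
final_b = rnn_final_state(history_b)
print(history_a[-1] == history_b[-1], final_a > final_b)
```

Both histories end with the same value, yet the recurrent state differs, so a score computed from that state can tell the two targets apart.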


Furthermore, the authors treat the computation of the appearance-based similarity score as a re-identification problem, where two bounding boxes are compared and the algorithm determines whether their content depicts the same object. This is done using a Siamese architecture, i.e. two identical Convolutional Neural Networks (CNNs) with exactly the same configuration used as subnetworks. The motion model computes the velocity of each object over time and scores how well a new detection fits that motion. Finally, the interactions between objects are encoded by occupancy grids centered on each target, which record where the surrounding objects are located relative to it in the current frame.
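An occupancy grid is easy to picture with a small example. The sketch below builds a binary grid centered on one target and marks the cells that contain a neighbour. The grid size, cell size and coordinates are hypothetical, and the paper's exact encoding may differ.

```python
def occupancy_grid(target, neighbours, grid_size=3, cell=10.0):
    """Binary grid centered on `target`; a cell is 1 if a neighbour falls in it."""
    half = grid_size * cell / 2.0
    grid = [[0] * grid_size for _ in range(grid_size)]
    for nx, ny in neighbours:
        # Shift coordinates so the grid's top-left corner is the origin.
        dx, dy = nx - target[0] + half, ny - target[1] + half
        if 0 <= dx < grid_size * cell and 0 <= dy < grid_size * cell:
            grid[int(dy // cell)][int(dx // cell)] = 1
    return grid

grid = occupancy_grid(target=(50.0, 50.0),
                      neighbours=[(52.0, 48.0), (62.0, 50.0), (120.0, 50.0)])
for row in grid:
    print(row)
```

The first neighbour falls in the central cell, the second in the cell to its right, and the third is too far away to appear in the grid.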

The final similarity scores produced by the observation model are then used by the dynamic model to assign the new detections in the current frame to objects already tracked in past frames. This is treated as a data association problem and solved by finding the assignment that maximises the total similarity. In layman’s terms, the dynamic model looks for the best overall match between existing tracks and new detections. A classic algorithm that solves this kind of assignment problem exactly, in polynomial time, is the so-called Hungarian algorithm.
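For readers curious about the assignment step, here is a compact pure-Python sketch of the Hungarian algorithm in its potential-based form. It assumes there are at most as many tracks (rows) as detections (columns), and the similarity values are made up. In practice you would more likely call a library routine such as SciPy's `linear_sum_assignment`.

```python
def hungarian_min_cost(cost):
    """Minimum-cost assignment of each row to a distinct column (rows <= columns).

    Returns a list where result[i] is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)     # row potentials
    v = [0.0] * (m + 1)     # column potentials
    match = [0] * (m + 1)   # match[j] = row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)     # previous column on the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        min_slack = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    slack = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if slack < min_slack[j]:
                        min_slack[j], way[j] = slack, j0
                    if min_slack[j] < delta:
                        delta, j1 = min_slack[j], j
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    min_slack[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        while j0 != 0:      # flip the matches along the augmenting path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    result = [-1] * n
    for j in range(1, m + 1):
        if match[j] != 0:
            result[match[j] - 1] = j - 1
    return result

def assign_detections(similarity):
    """Maximise total similarity by minimising the negated scores."""
    return hungarian_min_cost([[-s for s in row] for row in similarity])

# Hypothetical scores: rows are tracks, columns are new detections.
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.10, 0.20],
    [0.20, 0.30, 0.70],
]
assignment = assign_detections(similarity)
print(assignment)
```

Note that a greedy strategy would give the first track its favourite detection (score 0.90) and leave the second track with a poor match. The optimal assignment instead gives the first track its second choice so that the total similarity is higher.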


This approach makes it possible to recover targets that have been occluded for a certain period of time. The resulting tracks have better temporal continuity, which significantly improves tracking performance.

What do you think? How would you use Multi-Object Tracking in your applications? We would love to hear from you, and help you if you face this kind of issue.


[1] Luo et al., “Multiple Object Tracking: A Literature Review”, arXiv:1409.7618 [cs.CV]

[2] Leal-Taixé et al., “Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking”, arXiv:1704.02781 [cs.CV]

[3] Sadeghian, Alahi, Savarese, “Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies”, arXiv:1701.01909 [cs.CV]

