Samba: Synchronized Set-of-Sequences Modeling for End-to-end Multiple Object Tracking

Abstract

Multiple object tracking in complex scenarios, such as coordinated dance performances, team sports, or dynamic animal groups, presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions remains a key open research question. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker that effectively addresses all three issues: long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses the prior state of the art on the DanceTrack, BFT, and SportsMOT datasets.

SambaMOTR

Architecture

SambaMOTR combines a transformer-based object detector with a set-of-sequences Samba model. The object detector's encoder extracts image features from each frame, which are fed into its decoder together with detect and track queries to detect newborn objects or re-detect tracked ones. The Samba set-of-sequences model is composed of multiple synchronized Samba units that simultaneously process the past memory and the currently observed output queries of all tracklets to predict the next track queries and update the track memory. The hidden states of newborn objects are initialized to zero (barred squares in the figure). In case of occlusions or uncertain detections, the corresponding query is masked (red cross in the figure) during the Samba update.
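
The newborn initialization and query masking described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `prepare_update`, the confidence threshold of 0.5, and the all-zero placeholder used for a masked query are assumptions made for this example.

```python
def prepare_update(memory, observations, state_dim, score_threshold=0.5):
    """Prepare per-tracklet inputs for one Samba update (illustrative only).

    memory:        dict track_id -> hidden state (list of floats), edited in place
    observations:  dict track_id -> (query, confidence) from the detector decoder
    state_dim:     size of a hidden state
    score_threshold: hypothetical confidence below which a query counts as uncertain
    """
    inputs, masked = {}, {}
    for track_id, (query, score) in observations.items():
        if track_id not in memory:
            # Newborn object: its hidden state starts from zeros.
            memory[track_id] = [0.0] * state_dim
        is_masked = score < score_threshold
        # Assumption: a masked query is replaced by a zero placeholder, so the
        # update relies on the track memory instead of the uncertain observation.
        inputs[track_id] = [0.0] * len(query) if is_masked else list(query)
        masked[track_id] = is_masked
    return inputs, masked


# Track 1 already exists and is currently occluded; track 2 is newborn.
memory = {1: [0.3, 0.3]}
observations = {1: ([1.0, 2.0], 0.2), 2: ([3.0, 4.0], 0.9)}
inputs, masked = prepare_update(memory, observations, state_dim=2)
```

In this sketch the memory of an existing tracklet is left untouched by masking; only the observation fed to the update changes.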

Samba: Synchronized Set-of-Sequences Modeling

Synchronized Selective State-Space Models

We illustrate a set of \( k \) synchronized SSMs. A Long-Term Memory Update block updates each hidden state \( \tilde{h}_{t-1}^i \) based on the current observation \( x_t^i \), resulting in the updated memory \( h_t^i \). The Memory Synchronization block then derives the synchronized hidden state \( \tilde{h}_t^i \), which is fed into the Output Update module to predict the output \( y_t^i \).
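
The three stages above can be sketched for a single time step as follows. This is a toy sketch under stated assumptions, not the model itself: the coefficients `a`, `b`, and `c` are fixed scalars here (a selective SSM would compute its parameters from the input), the state has the same size as the observation, and the Memory Synchronization block is stood in for by plain dot-product attention across the \( k \) hidden states. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_update(h_sync_prev, x, a, b):
    """Long-Term Memory Update: h_t = a * h~_{t-1} + b * x_t (elementwise)."""
    return [a * h + b * xi for h, xi in zip(h_sync_prev, x)]


def synchronize(hidden):
    """Memory Synchronization stand-in: each state attends to all k states."""
    d = len(hidden[0])
    synced = []
    for h_i in hidden:
        scores = [sum(p * q for p, q in zip(h_i, h_j)) / math.sqrt(d)
                  for h_j in hidden]
        weights = softmax(scores)
        synced.append([sum(w * h_j[c] for w, h_j in zip(weights, hidden))
                       for c in range(d)])
    return synced


def output_update(h_sync, c):
    """Output Update: y_t = c * h~_t (elementwise)."""
    return [c * h for h in h_sync]


def synchronized_ssm_step(h_sync_prev, xs, a=0.9, b=0.1, c=1.0):
    """One step for k sequences; returns (synchronized states, outputs)."""
    hidden = [memory_update(h, x, a, b) for h, x in zip(h_sync_prev, xs)]
    h_sync = synchronize(hidden)
    ys = [output_update(h, c) for h in h_sync]
    return h_sync, ys


# Two sequences with one-dimensional states: only the first observes a signal.
h_sync, ys = synchronized_ssm_step([[0.0], [0.0]], [[1.0], [0.0]])
```

With a single sequence the attention weights collapse to one and the step reduces to an ordinary, unsynchronized update; with several sequences, each synchronized state becomes a convex combination of all updated memories, which is how information from one tracklet reaches the others in this sketch.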

Set-of-sequences Model

Our set-of-sequences model Samba simultaneously processes an arbitrary number \( M \) of input sequences. Each sequence is processed by its own Samba unit, which is kept in sync with the other units through our synchronized state-space model. All Samba units share weights, and each is composed of a stack of \( N \) Samba blocks. A Samba block has the same architecture as a Mamba block, but it adopts our synchronized SSM to synchronize long-term memory representations across the individual SSMs.
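
The overall structure (any number \( M \) of sequences, weights shared across units, a stack of \( N \) blocks) can be sketched with a deliberately simplified toy. Everything here is an assumption made for illustration: `ToySambaBlock` and `run_samba` are hypothetical names, features are scalars, the block is a linear recurrence with a residual connection rather than a Mamba block, and synchronization is approximated by blending each state with the mean over sequences.

```python
class ToySambaBlock:
    """Toy block: one set of weights applied to every sequence (shared units)."""

    def __init__(self, decay, gain):
        self.decay = decay  # how much of the previous state is kept
        self.gain = gain    # how strongly the current input is written

    def step(self, states, inputs):
        # Per-sequence memory update using the same weights for all sequences.
        updated = [self.decay * h + self.gain * x for h, x in zip(states, inputs)]
        # Synchronization stand-in: blend each state with the mean over sequences.
        mean = sum(updated) / len(updated)
        synced = [0.5 * h + 0.5 * mean for h in updated]
        # Residual output, passed on to the next block in the stack.
        outputs = [x + h for x, h in zip(inputs, synced)]
        return synced, outputs


def run_samba(blocks, sequences):
    """Run N stacked blocks over M equal-length sequences of scalar features."""
    num_seq = len(sequences)
    # One hidden state per (block, sequence), initialized to zero.
    states = [[0.0] * num_seq for _ in blocks]
    outputs = [[] for _ in range(num_seq)]
    for t in range(len(sequences[0])):
        layer_in = [seq[t] for seq in sequences]
        for n, block in enumerate(blocks):
            states[n], layer_in = block.step(states[n], layer_in)
        for i, y in enumerate(layer_in):
            outputs[i].append(y)
    return outputs


# N = 2 blocks, M = 3 sequences, T = 4 time steps.
blocks = [ToySambaBlock(0.9, 0.1), ToySambaBlock(0.8, 0.2)]
outputs = run_samba(blocks, [[1.0, 0.5, 0.0, 0.2],
                             [0.0, 0.0, 0.0, 0.0],
                             [0.3, 0.3, 0.3, 0.3]])
```

Because the same block weights are applied to every sequence and the state list is sized from the input, the number of sequences \( M \) can change between calls without touching the blocks.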

Results

Our work is motivated by the need for a tracking approach that can effectively model the joint motions and interactions of tracked objects over extended periods. By developing SambaMOTR, we aim to capture the intricate dynamics of group movements, enabling more accurate and robust tracking in scenarios where objects' motions are inherently interconnected. Modeling joint motions is crucial in datasets like DanceTrack, where dancers move in synchronization and frequently interact; SportsMOT, where players' movements depend heavily on one another through game strategy and ball position; and BFT (Bird Flock Tracking), where birds exhibit collective behavior influenced by their neighbors. By accounting for these complex interactions, SambaMOTR can maintain consistent tracking even during the occlusions, rapid movements, and densely packed scenes typical of these challenging datasets. Here's a glimpse of SambaMOTR's performance:

DanceTrack Results

BFT Results

SportsMOT Results

Explore our tracking results on SportsMOT across volleyball, basketball, and football categories:

Volleyball

Basketball

Football

BibTeX

@article{segu2024samba,
  author   =  {Segu, Mattia and Piccinelli, Luigi and Li, Siyuan and Yang, Yung-Hsu and Van Gool, Luc and Schiele, Bernt},
  title    =  {Samba: Synchronized Set-of-Sequences Modeling for End-to-end Multiple Object Tracking},
  journal  =  {arXiv preprint arXiv:2410.01806},
  year     =  {2024}
}