Overview

In this work, we study the one-shot video object localization problem, which aims to localize instances of unseen objects in a target video using a single query image of the object. To address this challenging problem, we extend a popular and successful object detection method, DETR (Detection Transformer), and introduce a novel approach: the query-guided detection transformer for videos (QDETRv). A distinctive feature of QDETRv is its capacity to exploit information from both the query image and the spatio-temporal context of the target video, which significantly aids in precisely pinpointing the desired object in the video.

We incorporate cross-attention mechanisms that capture temporal relationships across adjacent frames to handle the dynamic context in videos effectively. Further, to ensure a strong initialization for QDETRv, we introduce a novel unsupervised pretraining technique tailored to videos: the model is trained on synthetic object trajectories with an objective analogous to the query-guided localization task. During this pretraining phase, we incorporate recurrent object queries and loss functions that encourage accurate patch-feature reconstruction. Together, these additions enable better temporal understanding and more robust representation learning.
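To make the cross-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product cross-attention, where each patch of the current frame attends to the patches of an adjacent frame. This is only an illustration of the generic mechanism, not the released implementation; the function name `cross_attention` and the list-of-lists feature layout are our own illustrative choices.

```python
import math

def cross_attention(context_feats, frame_feats):
    """Scaled dot-product cross-attention (illustrative sketch).

    Each patch in frame_feats attends to all patches in context_feats
    (e.g., the previous frame), yielding context-conditioned features.
    Shapes: context_feats is [Nc][d], frame_feats is [Nf][d].
    """
    d = len(context_feats[0])
    out = []
    for f in frame_feats:
        # Similarity of this frame patch to every context patch.
        scores = [sum(fi * ci for fi, ci in zip(f, c)) / math.sqrt(d)
                  for c in context_feats]
        # Softmax over context patches (max-subtracted for stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Attended feature: weighted sum of context patches.
        out.append([sum(w * c[k] for w, c in zip(weights, context_feats))
                    for k in range(d)])
    return out
```

In practice such attention runs on batched tensors with learned query/key/value projections; the sketch omits those to keep the core weighting step visible. The same routine also illustrates query-image-to-frame attention when `context_feats` holds query-image patch features.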

Our experiments show that the proposed model significantly outperforms competitive baselines on two public benchmarks, VidOR and ImageNet-VidVRD, extended for one-shot open-set localization tasks.

Method




Illustration of the proposed QDETRv. The process begins with feature extraction from the query image and video frames using a CNN encoder. A cross-attention mechanism and dot-product attention are used to create an attention map that transforms the target frame features. The output is fed into DETR's encoder, and bounding-box predictions are generated by the DETR decoder. [Best viewed in color]
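The attention-map step in the figure can be sketched as follows: a pooled query-image vector is compared against every frame-patch feature by dot product, the similarities are softmax-normalized into a map, and the map reweights the frame features before they enter the encoder. This is a simplified stand-in for the pipeline stage, with the function name `attention_map` and the reweighting-by-multiplication choice being illustrative assumptions rather than the paper's exact formulation.

```python
import math

def attention_map(query_vec, frame_feats):
    """Dot-product attention map between a pooled query vector and
    frame-patch features (illustrative sketch).

    Returns the normalized map and the reweighted patch features that
    would feed a downstream transformer encoder.
    """
    d = len(query_vec)
    # Scaled dot-product similarity per frame patch.
    scores = [sum(q * f for q, f in zip(query_vec, patch)) / math.sqrt(d)
              for patch in frame_feats]
    # Softmax-normalize into an attention map over patches.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    attn = [e / z for e in exps]
    # Modulate: scale each patch feature by its attention weight.
    modulated = [[a * x for x in patch]
                 for a, patch in zip(attn, frame_feats)]
    return attn, modulated
```

Patches resembling the query receive higher weight, so query-relevant regions dominate the features passed to the encoder.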

Limitations




In the results marked with green boxes, QDETRv accurately localizes objects from the query image shown on the left. The red boxes highlight the model's limitations: localization is missed in videos where the object is only partially visible in the query image.

Citation


The website template was borrowed from Michaël Gharbi and Ref-NeRF.