Retrieving target vehicles through natural language descriptions is crucial for urban management in intelligent transportation systems. Existing methods use models such as CLIP that exploit the relationship between text and visual data. Since conventional CLIP models take images as input, they rely on synthetic data, such as moving maps, to represent vehicle trajectories. However, these models struggle to comprehend the temporal aspects of video data. Researchers have attempted to improve temporal understanding through various data augmentations and video encoders. Nonetheless, video encoders can only process a few frames at a time, and traditional frame sampling methods do not effectively capture the dynamics of vehicle movement. To address these issues, we propose a motion-based video sampling technique that efficiently harnesses the motion data of target vehicles. By leveraging state-of-the-art video foundation models and a re-ranking algorithm, we improve retrieval performance on public datasets for natural-language-based vehicle retrieval. Additionally, the available benchmark dataset is unique, limited in size, and exhibits significant class imbalance. We therefore apply a Video CutMix augmentation algorithm and demonstrate through experiments that vehicle augmentation is a feasible way to address this class imbalance.
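The motion-based sampling idea can be illustrated with a minimal sketch: instead of sampling frames uniformly in time, select frames so that the tracked vehicle's cumulative displacement is split evenly between samples, concentrating frames where the vehicle actually moves. This is only an illustration under our own assumptions (the function name, and the use of bounding-box centers as the motion signal, are not taken from the paper):

```python
import numpy as np

def motion_based_sample(centers, num_frames=8):
    """Select frame indices so that the vehicle's cumulative
    displacement is divided as evenly as possible between samples.

    centers: sequence of (x, y) bounding-box centers, one per frame.
    Returns sorted indices of the selected frames (duplicates are
    merged, so fewer than num_frames indices may be returned).
    """
    centers = np.asarray(centers, dtype=float)
    T = len(centers)
    if T <= num_frames:
        return list(range(T))
    # Per-step displacement and cumulative path length along the track.
    step = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(step)])
    if cum[-1] == 0.0:
        # Stationary vehicle: fall back to uniform temporal sampling.
        return list(np.linspace(0, T - 1, num_frames).astype(int))
    # Pick frames at equally spaced points along the motion path.
    targets = np.linspace(0.0, cum[-1], num_frames)
    idx = np.clip(np.searchsorted(cum, targets), 0, T - 1)
    return sorted(set(idx.tolist()))
```

For a vehicle moving at constant speed this reduces to uniform sampling; for a vehicle that stops and then accelerates, it allocates more frames to the moving segment, which is the behavior uniform sampling misses.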
Publisher: Ulsan National Institute of Science and Technology