Abstract
In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously. Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths. To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes. Our method incrementally processes incoming observation windows and leverages instance-aware context streaming to maintain and update latent agent representations across inference steps. A dual training objective further enables consistent forecasting accuracy across diverse observation horizons. Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions as well as on the single-agent benchmarks. Our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark while maintaining minimal latency, highlighting its suitability for real-world deployment.
Motivation and Contributions
- Trajectory prediction is a core component of the autonomous vehicle (AV) control stack, providing hypotheses on the future motions of surrounding agents based on perception outputs.
- Accurate and robust predictions are critical for safe and efficient motion planning, which enables the AV to anticipate and respond to dynamic behaviors in complex traffic scenarios.
- In real-world driving, traffic scenes are constantly changing. Agents entering the AV's field of view have been observed only briefly, whereas a more comprehensive motion history is available for other agents.
- Motion forecasting models must therefore be able to leverage heterogeneous historical observations effectively while operating under real-time constraints in continuously evolving scenes.
Challenges
- Existing benchmarks consider only fixed-size historical and future windows, while in practice, the historical context can range from a few frames up to several seconds.
- Methods that rely on extensive contexts typically achieve the most accurate results, but must delay predictions for newly detected agents until a sufficient number of observations has been accumulated.
- Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths.
- To handle the constantly evolving context of real-world traffic scenes, models require the ability to efficiently propagate motion features as long as the agents are visible, a challenge that has not yet been sufficiently addressed.
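The idea of propagating per-agent motion features across short observation windows can be illustrated with a toy example. The sketch below is not the SHARP architecture: an exponential moving average of velocity stands in for the learned latent agent representation, and the names `StreamingContext`, `AgentState`, and `momentum` are hypothetical. It only shows the bookkeeping: states are keyed by agent ID, updated with each incoming window, available for prediction after a single observation, and dropped once the agent is no longer visible.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class AgentState:
    """Hypothetical per-agent state; a stand-in for a learned latent."""
    feature: Point        # smoothed velocity per time step
    last_position: Point
    steps_observed: int = 1


class StreamingContext:
    """Keeps one state per agent ID and updates it window by window."""

    def __init__(self, momentum: float = 0.5) -> None:
        self.momentum = momentum  # assumed smoothing factor
        self.states: Dict[str, AgentState] = {}

    def _step(self, state: AgentState, point: Point) -> None:
        vx = point[0] - state.last_position[0]
        vy = point[1] - state.last_position[1]
        if state.steps_observed == 1:
            # First velocity estimate: use it directly.
            state.feature = (vx, vy)
        else:
            m = self.momentum
            state.feature = (m * state.feature[0] + (1 - m) * vx,
                             m * state.feature[1] + (1 - m) * vy)
        state.last_position = point
        state.steps_observed += 1

    def update(self, window: Dict[str, List[Point]]) -> None:
        """Consume one short observation window (agent ID -> positions)."""
        visible = {a for a, pts in window.items() if pts}
        # Drop agents that are no longer observed.
        self.states = {a: s for a, s in self.states.items() if a in visible}
        for agent_id in visible:
            points = window[agent_id]
            if agent_id not in self.states:
                # New agent: initialize from its first observation.
                self.states[agent_id] = AgentState((0.0, 0.0), points[0])
                points = points[1:]
            for point in points:
                self._step(self.states[agent_id], point)

    def predict(self, agent_id: str, horizon: int) -> List[Point]:
        """Constant-velocity rollout from the streamed state."""
        s = self.states[agent_id]
        x, y = s.last_position
        fx, fy = s.feature
        return [(x + fx * k, y + fy * k) for k in range(1, horizon + 1)]


ctx = StreamingContext(momentum=0.5)
ctx.update({"a": [(0.0, 0.0), (1.0, 0.0)]})
ctx.update({"a": [(2.0, 0.0)], "b": [(5.0, 5.0)]})  # "b" just appeared
print(ctx.predict("a", 2))
print(ctx.predict("b", 2))
```

In this toy setup, agent `"b"` receives a (trivial) forecast after a single observation, while agent `"a"` benefits from its longer history without any past window being reprocessed.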
Key Contributions
Results
AV2 Visualizations
BibTeX
@inproceedings{prutsch2026sharp,
title={{SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting}},
author={Prutsch, Alexander and Fruhwirth-Reisinger, Christian and Schinagl, David and Possegger, Horst},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}