Latency Matters: Real-Time Action Forecasting Transformer
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023, Highlight Presentation, Vancouver, Canada
We present RAFTformer, a real-time action forecasting transformer for latency-aware, real-world action forecasting. RAFTformer is a two-stage, fully transformer-based architecture comprising a video transformer backbone that operates on high-resolution, short-range clips, and a head transformer encoder that temporally aggregates information from multiple short-range clips to span a long-term horizon. Additionally, we propose a novel self-supervised shuffled causal masking scheme as a model-level augmentation to improve forecasting fidelity. Finally, we also propose a novel real-time evaluation setting for action forecasting that directly couples model inference latency to overall forecasting performance and exposes a previously overlooked trade-off between latency and action forecasting performance. Our parsimonious network design gives RAFTformer an inference latency 9× lower than that of prior works at the same forecasting accuracy. Owing to its two-stage design, RAFTformer uses 94% less training compute and 90% fewer training parameters to outperform prior state-of-the-art baselines by 4.9 points on EGTEA Gaze+ and by 1.4 points on the EPIC-Kitchens-100 validation set, as measured by Top-5 recall (T5R) in the offline setting. In the real-time setting, RAFTformer outperforms prior works by an even greater margin of up to 4.4 T5R points on the EPIC-Kitchens-100 dataset.
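To make the two-stage design concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the class name, dimensions, layer counts, and the classifier interface are all illustrative assumptions. Stage one embeds each short, high-resolution clip into a single feature token; stage two aggregates the clip tokens over a long horizon with a small transformer encoder.

```python
import torch
import torch.nn as nn

class TwoStageForecasterSketch(nn.Module):
    """Illustrative two-stage forecaster: a short-range clip backbone
    followed by a lightweight temporal-aggregation transformer encoder.
    Sizes and interfaces are assumptions, not the paper's configuration."""

    def __init__(self, clip_backbone: nn.Module, dim: int = 768,
                 num_layers: int = 4, num_heads: int = 8,
                 num_classes: int = 1000):  # action vocabulary size (dataset-dependent)
        super().__init__()
        self.clip_backbone = clip_backbone  # stage 1: video transformer on short clips
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.head = nn.TransformerEncoder(layer, num_layers=num_layers)  # stage 2
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, channels, frames, height, width);
        # the backbone is assumed to map one clip to one feature vector.
        b, n = clips.shape[:2]
        feats = torch.stack(
            [self.clip_backbone(clips[:, i]) for i in range(n)], dim=1
        )                                   # (batch, num_clips, dim)
        ctx = self.head(feats)              # long-horizon temporal aggregation
        return self.classifier(ctx[:, -1])  # forecast from the latest clip token
```

Because the expensive backbone only ever sees short clips while the small head spans the long horizon, this split is what enables the reductions in training compute and parameters cited above.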
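The shuffled causal masking scheme is only named in the abstract; one plausible reading is a causal attention mask defined over a random permutation of the clip tokens, so the model learns to predict held-out "future" tokens under many different orderings. A minimal sketch under that assumption (the function name and interface are ours):

```python
from typing import Optional
import torch

def shuffled_causal_mask(num_tokens: int,
                         generator: Optional[torch.Generator] = None):
    """Causal attention mask over a random permutation of token order.

    Returns (perm, mask), where mask[i, j] == True means token i may
    attend to token j, i.e., j is at or before i in the shuffled order.
    """
    perm = torch.randperm(num_tokens, generator=generator)  # shuffled temporal order
    rank = torch.empty_like(perm)
    rank[perm] = torch.arange(num_tokens)  # rank[t] = shuffled position of token t
    mask = rank.unsqueeze(1) >= rank.unsqueeze(0)
    return perm, mask
```

Note that PyTorch attention modules treat `True` in a boolean `attn_mask` as a disallowed position, so this mask would be passed as `attn_mask=~mask`.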
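The latency coupling in the real-time setting can be summarized with simple accounting: a model with inference latency ℓ must stop observing the video ℓ seconds earlier than an offline model so its forecast is ready by the deadline, and it therefore effectively forecasts ℓ seconds further into the future. A sketch of this accounting (the function name and interface are ours, not the paper's protocol definition):

```python
def realtime_cutoff_and_horizon(action_start_s: float,
                                anticipation_s: float,
                                latency_s: float) -> tuple[float, float]:
    """Latency-aware accounting for action forecasting (illustrative).

    Offline protocols let the model observe frames up to
    (action_start_s - anticipation_s) and score the prediction as if
    inference were instantaneous. In a real-time protocol, inference
    must begin latency_s earlier, lengthening the effective horizon.
    """
    observation_cutoff = action_start_s - anticipation_s - latency_s
    effective_horizon = anticipation_s + latency_s
    return observation_cutoff, effective_horizon

# With a 1.0 s anticipation horizon, a 0.2 s-latency model must forecast
# 1.2 s ahead, while a 1.8 s-latency model must forecast 2.8 s ahead.
print(realtime_cutoff_and_horizon(action_start_s=10.0,
                                  anticipation_s=1.0,
                                  latency_s=0.2))  # (8.8, 1.2)
```

This is the trade-off the real-time setting makes explicit: a slower model faces a strictly harder forecasting task, which is why low-latency design improves real-time T5R even at equal offline accuracy.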