How Should Video LLMs Output Time?
An Analysis of Efficient Temporal Grounding Paradigms

1University of Central Florida, 2Axon
CVPR 2025 Workshop on Efficient and On-Device Computational Vision (ECV)

Overview of three output formulation paradigms for temporal grounding in Video LLMs. (a) Text Numeral Generation directly generates timestamps as text tokens. (b) Temporal Token Generation introduces dedicated temporal tokens and a specialized head. (c) Continuous Temporal Decoding maps hidden states through an MLP to predict temporal distributions.
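
To make paradigm (c) concrete, here is a minimal PyTorch sketch of a continuous temporal decoding head, assuming a categorical distribution over uniformly spaced time bins read out by a two-layer MLP; the layer sizes and the start/end parameterization are illustrative assumptions, not the exact architecture of any compared method.

```python
# Hypothetical sketch of paradigm (c), Continuous Temporal Decoding.
# Sizes and the two-sided (start/end) parameterization are assumptions.
import torch
import torch.nn as nn

class ContinuousTemporalHead(nn.Module):
    def __init__(self, hidden_dim: int = 1536, num_bins: int = 100):
        super().__init__()
        # Small MLP mapping the LLM's hidden state to two categorical
        # distributions over num_bins time bins in [0, 1].
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2 * num_bins),
        )
        self.num_bins = num_bins

    def forward(self, h: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # h: (batch, hidden_dim) hidden state of a designated query token.
        logits = self.mlp(h).view(-1, 2, self.num_bins)
        probs = logits.softmax(dim=-1)  # two temporal distributions
        centers = torch.linspace(0.0, 1.0, self.num_bins, device=h.device)
        start, end = (probs * centers).sum(dim=-1).unbind(dim=1)
        return start, end  # normalized to [0, 1]; scale by video duration
```

Because both boundaries are read out in a single forward pass, no autoregressive decoding loop is needed, which is the source of the latency advantage analyzed later.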

Abstract

While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple their output paradigms with different backbones, datasets, and training protocols, making it difficult to isolate the impact of the output design alone. Moreover, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency demands systematic investigation.

In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate each paradigm on the same compact VLM backbones (SmolVLM2, FastVLM, and Molmo2) with consistent datasets and a shared LoRA fine-tuning protocol. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead.

Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead.

Key Contributions

  • Unified Fine-Tuning Comparison: We implement three representative VTG output paradigms (VTimeLLM-style text numeral, TRACE-style generative, and DisTime-style continuous) on the same compact backbones with identical LoRA training configurations, isolating the output formulation as the sole experimental variable (a minimal configuration sketch follows this list).
  • Efficiency-Aware Analysis: Beyond localization accuracy, we analyze parameter overhead, training stability, inference determinism, and computational cost — dimensions critical for practical deployment yet underexplored in the VTG-MLLM literature.
  • Paradigm Dividend: We discover that optimizing the output formulation provides an orthogonal, highly effective path to superior performance. A 1.5B continuous model outperforms 7B-class baselines using alternative paradigms, demonstrating that output design matters more than brute-force parameter scaling.
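
A minimal sketch of the kind of shared LoRA setup described above, using Hugging Face peft; the rank, alpha, dropout, and target modules shown are illustrative assumptions rather than the paper's reported hyperparameters.

```python
# Illustrative shared LoRA configuration; hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Any of the compared backbones can stand in here; the checkpoint id is
# shown for illustration only.
base = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
)
lora_cfg = LoraConfig(
    r=16,           # adapter rank (assumed)
    lora_alpha=32,  # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # adapters only; the backbone stays frozen
```

Reusing one configuration object across all three paradigms is what keeps the output formulation as the only varying factor.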

Main Results

Under strictly controlled settings, the Continuous Temporal Decoding paradigm consistently achieves the best localization accuracy across all tasks and backbone scales. At Molmo2-8B, it reaches 57.1% mIoU on Charades-STA and 72.6% R1@0.5 on QVHighlights, while the Text Numeral baseline plateaus at 27.5% mIoU on Charades-STA at the same scale.
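
For reference, the localization metrics used throughout (temporal IoU, mIoU, R1@τ) follow their standard VTG definitions; the sketch below is a minimal reference implementation of ours, not code released with the paper.

```python
# Standard VTG localization metrics over [start, end] segments in seconds.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between a predicted and a ground-truth temporal segment."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.5, 0.7)):
    """Returns mIoU and R1@tau over paired predictions and ground truths."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = sum(ious) / len(ious)
    recall = {t: sum(i >= t for i in ious) / len(ious) for t in thresholds}
    return miou, recall  # e.g. recall[0.5] is R1@0.5
```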

Quantitative comparison on Charades-STA and QVHighlights (Moment Retrieval)
Backbone        Paradigm   | Charades-STA            | QVHighlights
                           | R1@0.5  R1@0.7  mIoU    | R1@0.5  R1@0.7  mIoU
SmolVLM2-0.5B   Text       | 15.1    6.8     16.2    | 9.0     5.6     12.9
                Cont.      | 41.8    22.5    43.4    | 46.9    31.0    50.4
                Gen.       | 20.1    9.5     21.9    | 14.7    9.6     18.6
FastVLM-1.5B    Text       | 19.5    8.9     19.9    | 11.7    7.6     16.5
                Cont.      | 51.3    25.1    46.6    | 60.9    35.5    54.6
                Gen.       | 26.2    12.3    28.9    | 18.4    12.3    24.0
SmolVLM2-2.2B   Text       | 19.4    8.7     20.8    | 11.6    7.2     16.5
                Cont.      | 50.8    24.3    46.6    | 60.1    34.6    54.4
                Gen.       | 25.8    12.2    28.1    | 18.9    12.3    23.9
Molmo2-4B       Text       | 22.3    10.0    23.9    | 13.3    8.3     19.0
                Cont.      | 65.8    37.6    56.3    | 69.1    42.8    59.6
                Gen.       | 33.7    16.0    32.3    | 21.7    14.1    27.5
Molmo2-8B       Text       | 25.6    11.5    27.5    | 15.3    9.5     21.8
                Cont.      | 66.8    39.3    57.1    | 72.6    46.1    61.2
                Gen.       | 37.1    19.1    35.1    | 24.9    16.2    31.5

Paradigm Dividend & Efficiency-Accuracy Trade-off

(a) Scaling Curve: The paradigm gap persists structurally from 0.5B to 8B. The Continuous model at just 0.5B (43.4% mIoU) already outperforms the Text paradigm scaled all the way to 8B (27.5% mIoU), demonstrating a strong "paradigm dividend." (b) Pareto Frontier: The Continuous paradigm dominates the Pareto frontier, achieving the best trade-off between inference latency and localization accuracy thanks to its non-autoregressive decoding.
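
The latency argument in (b) comes down to the number of sequential decoder passes per grounding query. The sketch below is a back-of-envelope illustration with assumed token counts and per-step latency, not measurements from the paper.

```python
# Back-of-envelope: sequential decoder passes per query under each
# paradigm. All numbers here are illustrative assumptions.
STEP_MS = 25.0  # assumed per-token decoder latency on an edge device

decode_steps = {
    "text":       12,  # e.g. "from 12.4 to 31.7 seconds" as text tokens
    "generative":  4,  # a few dedicated temporal tokens (assumed)
    "continuous":  1,  # one non-autoregressive readout of both boundaries
}

for paradigm, steps in decode_steps.items():
    print(f"{paradigm:>10}: {steps:2d} passes = {steps * STEP_MS:5.1f} ms")
```

Since each autoregressive step costs a full decoder forward pass, even a short timestamp string multiplies latency relative to a single-pass readout.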

Scaling curves and Pareto frontier

Qualitative Comparison

Predictions from all three paradigms on SmolVLM2-2.2B across three difficulty levels: (a) a simple action where all paradigms succeed, (b) a temporally ambiguous query where only Continuous aligns precisely (IoU 0.99), and (c) a compound action requiring causal reasoning where all paradigms fail.

Qualitative comparison across three difficulty levels

Ablation Studies

(a) Context Length Robustness: The Text paradigm saturates early and degrades slightly at 64 frames, while Continuous scales steeply to 49.8% mIoU, remaining robust to sequence dilution. (b) Data Efficiency: Continuous trained on only 25% of the data (40.2% mIoU) substantially outperforms Text (20.8%) and Generative (28.1%) trained on 100% of the data.

Ablation: context length and data efficiency

Error Taxonomy

We systematically categorize failure cases (IoU < 0.5) into three error types: Type A (Temporal Hallucination), where predictions are completely disjoint from the ground truth; Type B (Boundary Jitter), where the event is identified but its boundaries are imprecise; and Type C (Semantic Confusion), where the query is fundamentally misunderstood. The Continuous paradigm drastically shifts the error mode from severe hallucinations (29.3%) to minor boundary jitters (66.5%).
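
A hedged sketch of how the geometric part of this taxonomy can be computed automatically, reusing temporal_iou from the metrics sketch above; Type C requires judging the query semantics, so the purely geometric A-vs-B split below is our simplifying assumption.

```python
# Illustrative failure bucketing; Type C (semantic confusion) cannot be
# decided geometrically and is therefore not assigned here.
def classify_failure(pred: tuple[float, float], gt: tuple[float, float]) -> str:
    iou = temporal_iou(pred, gt)  # from the metrics sketch above
    if iou >= 0.5:
        return "success"
    overlap = min(pred[1], gt[1]) - max(pred[0], gt[0])
    if overlap <= 0:
        return "Type A: temporal hallucination"  # disjoint from ground truth
    return "Type B: boundary jitter"             # right event, imprecise edges
```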

Error taxonomy across paradigms

Extended Qualitative Analysis

Type A: Temporal Hallucination

The Text and Generative paradigms frequently predict completely disjoint timeframes, while the Continuous paradigm is significantly more robust against blind guessing.

Type A: Temporal Hallucination examples

Type B: Boundary Jitter

Even when the Continuous paradigm fails the strict IoU threshold, its errors are predominantly benign boundary jitters rather than severe hallucinations, typically manifesting as slightly over-extended temporal windows.

Type B: Boundary Jitter examples

Type C: Semantic Confusion

Semantic failures represent a universal challenge across all paradigms, where models fundamentally misunderstand the query or confuse similar actions occurring at different times.

Type C: Semantic Confusion examples

BibTeX

@inproceedings{jin2025tgparadigms,
  author    = {Jin, Shengji and Zou, Yuanhao and Zhu, Victor and Ji, Zhengping and Chen, Chen},
  title     = {How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms},
  booktitle = {CVPR 2025 Workshop on Efficient and On-Device Computational Vision (ECV)},
  year      = {2025},
}