STREAMGAZE: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
A New Benchmark for Evaluating MLLMs in Gaze-Guided Streaming Video Understanding
Executive Impact
Streaming video understanding requires models not only to process frames as they arrive, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals in a streaming setting. To fill this gap, we introduce STREAMGAZE, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. STREAMGAZE introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build STREAMGAZE, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all STREAMGAZE tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
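The pipeline's first step, fixation extraction, turns a raw gaze trajectory into the stable attention points from which QA pairs can be grounded. As a minimal illustration, the sketch below uses a dispersion-threshold approach (a standard family of fixation-detection methods); the paper's actual extraction procedure, thresholds, and data format are not specified here, so every name and parameter is a placeholder assumption.

```python
# Hypothetical dispersion-threshold fixation extraction: a gaze point
# stays in the current fixation while the window's spatial spread is
# small; a large jump ends the fixation (saccade). Thresholds are
# illustrative, not values from the STREAMGAZE pipeline.
from dataclasses import dataclass

@dataclass
class Fixation:
    start: float  # timestamp of first sample (seconds)
    end: float    # timestamp of last sample (seconds)
    x: float      # centroid x (normalized image coordinates)
    y: float      # centroid y

def _emit(window, fixations, min_duration):
    # Keep the window as a fixation only if it lasted long enough.
    if window and window[-1][0] - window[0][0] >= min_duration:
        n = len(window)
        fixations.append(Fixation(
            window[0][0], window[-1][0],
            sum(p[1] for p in window) / n,
            sum(p[2] for p in window) / n))

def extract_fixations(samples, max_dispersion=0.02, min_duration=0.1):
    """samples: list of (t, x, y) gaze points sorted by time."""
    fixations, window = [], []
    for pt in samples:
        window.append(pt)
        xs = [p[1] for p in window]
        ys = [p[2] for p in window]
        # Dispersion = extent of the window in x plus extent in y.
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
            _emit(window[:-1], fixations, min_duration)
            window = [pt]  # the new point starts a fresh window
    _emit(window, fixations, min_duration)
    return fixations
```

Given fixations, a scanpath is simply their time-ordered sequence, which is what region-specific visual prompting would then anchor to frames.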
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Substantial Performance Gaps Observed
35% Average Accuracy Gap (Human vs. MLLM)
Enterprise Process Flow
| Task Category | Key Focus |
|---|---|
| Past Tasks | Gaze-guided reasoning over previously observed frames, following how the user's attention shifted over time |
| Present Tasks | Interpreting the real-time gaze signal to identify what the user is attending to in the current frame |
| Proactive Tasks | Inferring user intention and anticipating what comes next from only past and currently observed frames |
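The streaming constraint behind these task categories is that, at question time, a model may see only the frames and gaze observed so far, never future frames. The sketch below illustrates that evaluation loop; `model` and its call signature are hypothetical stand-ins, not the benchmark's actual harness.

```python
# Minimal sketch of a streaming, gaze-conditioned evaluation loop.
# At each question's timestep t, the model receives frames[0..t] and
# the gaze scanpath up to t; future frames are never exposed.

def answer_at(model, frames, gaze, questions):
    """questions: list of (t, question) pairs, with t an index into
    frames/gaze. Returns the model's answer for each question."""
    answers = []
    for t, question in questions:
        visible_frames = frames[: t + 1]  # past + current frames only
        visible_gaze = gaze[: t + 1]      # gaze observed so far
        answers.append(model(visible_frames, visible_gaze, question))
    return answers
```

Past tasks query earlier portions of `visible_frames`, present tasks the last frame, and proactive tasks ask the model to predict beyond `t` without ever seeing those frames.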
Why Current MLLMs Struggle
Current MLLMs fail to leverage gaze signals for long-term temporal reasoning and proactive inference. They often rely on frame-local visual cues and struggle with generalizing across different temporal reasoning requirements.
Advanced ROI Calculator
Estimate your potential savings and efficiency gains by integrating AI-powered gaze-guided video analysis.
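The arithmetic behind such an estimate can be written out directly. The sketch below is illustrative only: the rates, hours, and costs are placeholder assumptions, not figures from the benchmark or any deployment.

```python
# Illustrative ROI arithmetic for automating part of a manual
# video-review workflow. All inputs are hypothetical.

def roi(hours_per_month, hourly_cost, automation_fraction,
        monthly_solution_cost):
    """Return (monthly_savings, roi_ratio).

    monthly_savings: labor cost removed by automation.
    roi_ratio: net gain relative to what the solution costs.
    """
    monthly_savings = hours_per_month * hourly_cost * automation_fraction
    net = monthly_savings - monthly_solution_cost
    return monthly_savings, net / monthly_solution_cost

# e.g. 400 review hours/month at $60/h with 30% automated,
# against a $5,000/month solution cost:
#   savings = 400 * 60 * 0.30 = 7,200
#   ROI     = (7,200 - 5,000) / 5,000 = 0.44 (44%)
```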
Implementation Roadmap
A phased approach to integrate gaze-guided AI into your enterprise workflows.
Phase 1: Discovery & Strategy
Comprehensive assessment of current video understanding needs and development of a tailored AI strategy.
Phase 2: Data & Model Integration
Pipeline setup for gaze-guided data, fine-tuning MLLMs, and initial deployment in a controlled environment.
Phase 3: Pilot & Optimization
Pilot program rollout, performance monitoring, and iterative refinement based on user feedback and ROI analysis.
Phase 4: Full-Scale Deployment
Seamless integration of the optimized solution across relevant enterprise applications and continuous support.
Ready to Transform Your Video Understanding?
Connect with our AI specialists to discuss how STREAMGAZE-inspired solutions can elevate your operational efficiency and proactive capabilities.