Efficiency

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

STTS is a simple yet effective technique for unified, architecture-wide vision token pruning across both the ViT and the LLM, improving efficiency by 62% with minimal performance loss on video QA tasks.

Mar 18, 2026