Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
STTS (Spatio-Temporal Token Scoring) is a simple yet effective technique for unified, architecture-wide pruning of vision tokens across both the ViT and the LLM, improving efficiency by 62% with minimal performance loss on video QA tasks.
Mar 18, 2026
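
The summary above does not say how tokens are scored, so the following is only a generic sketch of score-based token pruning, not the STTS method itself. It assumes each vision token already has a scalar importance score and keeps the top fraction while preserving the original spatio-temporal order. The function name `prune_tokens`, the `keep_ratio` parameter, and the example scores are all hypothetical.

```python
from typing import List, Sequence, Tuple


def prune_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[str], List[int]]:
    """Keep the highest-scoring fraction of tokens, preserving original order.

    Hypothetical helper: the scores are assumed to be given; how a real
    system computes them is outside the scope of this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Keep at least one token so downstream layers never see an empty sequence.
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank indices by descending score; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore original (frame, patch) order among the kept tokens.
    kept = sorted(ranked[:num_keep])
    return [tokens[i] for i in kept], kept


# Toy example: 2 frames x 3 patches, with made-up scores.
tokens = ["f0p0", "f0p1", "f0p2", "f1p0", "f1p1", "f1p2"]
scores = [0.9, 0.1, 0.4, 0.2, 0.8, 0.3]
kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept_tokens)  # ['f0p0', 'f0p2', 'f1p1']
```

Sorting the kept indices back into their original order matters because positional information (which frame, which patch) is typically tied to sequence position; whether STTS handles ordering this way is not stated in the summary.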