E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

NeurIPS 2024

1The Hong Kong Polytechnic University 2ARC Lab, Tencent PCG 3Institute of Automation, Chinese Academy of Sciences 4Tencent AI Lab

Tasks: 12 · Samples: 7.3K · Domains: 8 · Videos: 7K · Total Duration: 251.4h

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To assess these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks evaluate models only through video-level question answering, lacking fine-grained event-level assessment and task diversity.

To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Organized under a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples across 12 tasks, covering 7K videos (251.4 hours in total) from 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding events of interest within videos, largely due to their short video context lengths, improper time representations, and lack of multi-event training data.

To address these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset, E.T. Instruct 164K, tailored for fine-grained event-level understanding. Our simple yet effective solution demonstrates superior performance in multiple scenarios.

Task Definitions

Generation Pipeline

Benchmark Statistics

Comparison with Existing Video-LLM Benchmarks
Task Taxonomy and Sample Distribution
Word Cloud of Text Queries
Distribution of Video Duration (seconds)

Our Method

Evaluation Results

Visualizations

Per-task visualizations: TVG, RVQ, GVQ, EVS, RAR, TAL, EPM, DVC, VHD, SLC, ECA, TEM

Citation

Please cite our paper if you find this project helpful.
@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}