International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187, Issue 95
Published: April 2026
Authors: Ahmad Khalil, Mahmoud Khalil, Alioune Ngom
DOI: 10.5120/ijca-1aef39d4b120
Ahmad Khalil, Mahmoud Khalil, and Alioune Ngom. TempHalluc-Bench: Evaluating Temporal Hallucination in VideoLLM-Based Video Search and Information Extraction. International Journal of Computer Applications 187, 95 (April 2026), 1-9. DOI=10.5120/ijca-1aef39d4b120
@article{ 10.5120/ijca-1aef39d4b120,
author = { Khalil, Ahmad and Khalil, Mahmoud and Ngom, Alioune },
title = { TempHalluc-Bench: Evaluating Temporal Hallucination in VideoLLM-Based Video Search and Information Extraction },
journal = { International Journal of Computer Applications },
year = { 2026 },
volume = { 187 },
number = { 95 },
pages = { 1-9 },
doi = { 10.5120/ijca-1aef39d4b120 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2026
%A Ahmad Khalil
%A Mahmoud Khalil
%A Alioune Ngom
%T TempHalluc-Bench: Evaluating Temporal Hallucination in VideoLLM-Based Video Search and Information Extraction
%J International Journal of Computer Applications
%V 187
%N 95
%P 1-9
%R 10.5120/ijca-1aef39d4b120
%I Foundation of Computer Science (FCS), NY, USA
Video Large Language Models (VideoLLMs) are increasingly deployed for video search and information extraction, where temporally grounded facts must be retrieved from long videos. A key failure mode in this setting is temporal hallucination: extracted statements that misattribute when an event occurs, how long it lasts, or what happens before or after it, even when relevant evidence exists in the video. Existing hallucination benchmarks largely target free-form video QA or rely on dense temporal supervision, leaving retrieval-style temporal reliability underexplored. TempHalluc-Bench is introduced as a benchmark protocol for evaluating temporal hallucination in retrieval-oriented VideoLLM pipelines, instantiated from ActivityNet Captions. TempHalluc-Bench is enabled by an annotation-free, post-generation verifier that treats the VideoLLM as a black box and uses only the video and the generated response (V, R) at inference. The verifier decomposes an extraction into atomic temporal claims, estimates soft temporal support over time using frozen vision-language encoders, and quantifies inconsistency via mismatch to a text-implied temporal prior. Across diverse VideoLLMs, TempHalluc-Bench outperforms text-only and similarity-based baselines in Accuracy and F1, and yields stronger accuracy-based reliability signals than prior benchmarks on overlapping models, with gains of up to +48.2 Accuracy points (e.g., VideoChatGPT: 78.4 vs. 30.17).
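The verification pipeline described in the abstract (per-claim soft temporal support from a frozen vision-language encoder, compared against a text-implied temporal prior) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the softmax-over-frames support estimate, the Gaussian prior, and the KL-style mismatch score are all assumptions standing in for whatever the authors actually use.

```python
import numpy as np

def soft_temporal_support(frame_embs: np.ndarray, claim_emb: np.ndarray,
                          temperature: float = 0.07) -> np.ndarray:
    """Cosine similarity between each frame embedding (T x D) and the claim
    text embedding (D,), softmax-normalized over time into a support
    distribution. Embeddings would come from a frozen vision-language
    encoder; here they are just arrays."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    c = claim_emb / np.linalg.norm(claim_emb)
    z = (f @ c) / temperature          # per-frame similarity logits
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

def temporal_prior(num_frames: int, center: float, width: float) -> np.ndarray:
    """Gaussian prior over frame indices implied by the time the extracted
    claim asserts (hypothetical choice of prior shape)."""
    t = np.arange(num_frames)
    p = np.exp(-0.5 * ((t - center) / width) ** 2)
    return p / p.sum()

def mismatch(support: np.ndarray, prior: np.ndarray, eps: float = 1e-9) -> float:
    """KL(prior || support): large when the visual evidence peaks far from
    where the claim's text says the event happens."""
    return float(np.sum(prior * np.log((prior + eps) / (support + eps))))
```

Under this sketch, a claim would be flagged as a temporal hallucination when its mismatch score exceeds a calibrated threshold: the support curve says the evidence is concentrated at one part of the video, while the claim's stated timing implies another.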