International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187, Issue 95
Published: April 2026
Authors: Ahmad Khalil, Mahmoud Khalil, Alioune Ngom
DOI: 10.5120/ijca-1aef39d4b120
Ahmad Khalil, Mahmoud Khalil, and Alioune Ngom. TempHalluc-Bench: Evaluating Temporal Hallucination in VideoLLM-Based Video Search and Information Extraction. International Journal of Computer Applications 187, 95 (April 2026), 1-9. DOI=10.5120/ijca-1aef39d4b120
@article{ 10.5120/ijca-1aef39d4b120,
author = { Khalil, Ahmad and Khalil, Mahmoud and Ngom, Alioune },
title = { TempHalluc-Bench: Evaluating Temporal Hallucination in VideoLLM-Based Video Search and Information Extraction },
journal = { International Journal of Computer Applications },
year = { 2026 },
volume = { 187 },
number = { 95 },
pages = { 1-9 },
doi = { 10.5120/ijca-1aef39d4b120 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2026
%A Ahmad Khalil
%A Mahmoud Khalil
%A Alioune Ngom
%T TempHalluc-Bench: Evaluating Temporal Hallucination in VideoLLM-Based Video Search and Information Extraction
%J International Journal of Computer Applications
%V 187
%N 95
%P 1-9
%R 10.5120/ijca-1aef39d4b120
%I Foundation of Computer Science (FCS), NY, USA
Video Large Language Models (VideoLLMs) are increasingly deployed for video search and information extraction, where temporally grounded facts must be retrieved from long videos. A key failure mode in this setting is temporal hallucination: extracted statements that misattribute when an event occurs, how long it lasts, or what happens before or after it, even when relevant evidence exists in the video. Existing hallucination benchmarks largely target free-form video QA or rely on dense temporal supervision, leaving retrieval-style temporal reliability underexplored. TempHalluc-Bench is introduced as a benchmark protocol for evaluating temporal hallucination in retrieval-oriented VideoLLM pipelines, instantiated from ActivityNet Captions. TempHalluc-Bench is enabled by an annotation-free, post-generation verifier that treats the VideoLLM as a black box and uses only the video and the generated response (V, R) at inference. The verifier decomposes an extraction into atomic temporal claims, estimates soft temporal support over time using frozen vision-language encoders, and quantifies inconsistency via mismatch to a text-implied temporal prior. Across diverse VideoLLMs, TempHalluc-Bench outperforms text-only and similarity-based baselines in Accuracy and F1, and yields stronger accuracy-based reliability signals than prior benchmarks on overlapping models, with gains of up to +48.2 Accuracy points (e.g., VideoChatGPT: 78.4 vs. 30.17).
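The verification pipeline described in the abstract (per-claim soft temporal support from a frozen vision-language encoder, compared against a text-implied temporal prior) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the softmax-over-frames support estimate, the Gaussian prior, and the KL-style mismatch score are all assumptions standing in for whatever the authors actually use.

```python
import numpy as np

def soft_temporal_support(frame_embs: np.ndarray, claim_emb: np.ndarray,
                          temperature: float = 0.07) -> np.ndarray:
    """Cosine similarity between each frame embedding (T x D) and the claim
    text embedding (D,), softmax-normalized over time into a support
    distribution. Embeddings would come from a frozen vision-language
    encoder; here they are just arrays."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    c = claim_emb / np.linalg.norm(claim_emb)
    z = (f @ c) / temperature          # per-frame similarity logits
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

def temporal_prior(num_frames: int, center: float, width: float) -> np.ndarray:
    """Gaussian prior over frame indices implied by the time the extracted
    claim asserts (hypothetical choice of prior shape)."""
    t = np.arange(num_frames)
    p = np.exp(-0.5 * ((t - center) / width) ** 2)
    return p / p.sum()

def mismatch(support: np.ndarray, prior: np.ndarray, eps: float = 1e-9) -> float:
    """KL(prior || support): large when the visual evidence peaks far from
    where the claim's text says the event happens."""
    return float(np.sum(prior * np.log((prior + eps) / (support + eps))))
```

Under this sketch, a claim would be flagged as a temporal hallucination when its mismatch score exceeds a calibrated threshold: the support curve says the evidence is concentrated at one part of the video, while the claim's stated timing implies another.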