International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 82
Published: February 2026
Authors: Amitesh Kumar Jha, Rajwant Singh Rao
10.5120/ijca2026926434
Amitesh Kumar Jha, Rajwant Singh Rao. BEYOND SINGLE-SCALE VISION TRANSFORMERS: MULTI-SCALE FEATURE FUSION FOR ROBUST SCENE AND DOCUMENT TEXT RECOGNITION. International Journal of Computer Applications. 187, 82 (February 2026), 29-42. DOI=10.5120/ijca2026926434
@article{ 10.5120/ijca2026926434,
author = { Amitesh Kumar Jha and Rajwant Singh Rao },
title = { BEYOND SINGLE-SCALE VISION TRANSFORMERS: MULTI-SCALE FEATURE FUSION FOR ROBUST SCENE AND DOCUMENT TEXT RECOGNITION },
journal = { International Journal of Computer Applications },
year = { 2026 },
volume = { 187 },
number = { 82 },
pages = { 29--42 },
doi = { 10.5120/ijca2026926434 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2026
%A Amitesh Kumar Jha
%A Rajwant Singh Rao
%T BEYOND SINGLE-SCALE VISION TRANSFORMERS: MULTI-SCALE FEATURE FUSION FOR ROBUST SCENE AND DOCUMENT TEXT RECOGNITION
%J International Journal of Computer Applications
%V 187
%N 82
%P 29-42
%R 10.5120/ijca2026926434
%I Foundation of Computer Science (FCS), NY, USA
Transformer-based Optical Character Recognition (OCR) systems have recently demonstrated strong performance by modeling long-range dependencies in text images. However, most existing approaches rely on single-scale visual representations, which limits their robustness to variable font sizes, degraded characters, and complex document layouts. This study proposes a Multi-Scale Feature-Based Transformer (MSFT-OCR) that explicitly integrates fine-, mid-, and coarse-scale visual features using scale-aware attention mechanisms. Through inter-scale attention, the proposed architecture lets character-level details interact with global word-level context. Extensive experiments on scene text and document OCR benchmarks (IIIT5K-Words, IAM, and SVT) demonstrate that the proposed method consistently outperforms single-scale Transformer models, as measured by CA (%), WA (%), and NED (%). Ablation studies and attention visualizations further validate the effectiveness of multi-scale modeling in text recognition.
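The abstract reports CA (%), WA (%), and NED (%) without defining them. As a rough sketch, assuming these denote character accuracy, word accuracy, and normalized edit distance as they are commonly used in text recognition, the metrics could be computed as follows. The function names and the exact normalizations are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Assumed WA: percentage of predictions that match the target exactly.

    Expects non-empty, equally long sequences.
    """
    correct = sum(p == t for p, t in zip(preds, targets))
    return 100.0 * correct / len(targets)


def char_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Assumed CA: 100 * (1 - total edit distance / total target characters)."""
    total_chars = sum(len(t) for t in targets)
    if total_chars == 0:
        return 0.0
    total_dist = sum(edit_distance(p, t) for p, t in zip(preds, targets))
    return max(0.0, 100.0 * (1.0 - total_dist / total_chars))


def normalized_edit_distance(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Assumed NED: mean of distance / max(len(pred), len(target)), in percent."""
    ratios = [
        edit_distance(p, t) / max(len(p), len(t), 1)
        for p, t in zip(preds, targets)
    ]
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical predictions and ground-truth labels, for illustration only.
    preds = ["hello", "wrld", "text"]
    targets = ["hello", "world", "text"]
    print(f"WA  = {word_accuracy(preds, targets):.2f}%")
    print(f"CA  = {char_accuracy(preds, targets):.2f}%")
    print(f"NED = {normalized_edit_distance(preds, targets):.2f}%")
```

Under these assumed definitions, higher CA and WA are better while lower NED is better. Some benchmarks instead report 1 - NED so that higher is better, so the convention used in the paper itself should be checked before comparing numbers.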