Research Article

BEYOND SINGLE-SCALE VISION TRANSFORMERS: MULTI-SCALE FEATURE FUSION FOR ROBUST SCENE AND DOCUMENT TEXT RECOGNITION

by Amitesh Kumar Jha, Rajwant Singh Rao
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 82
Published: February 2026
DOI: 10.5120/ijca2026926434

Amitesh Kumar Jha and Rajwant Singh Rao. BEYOND SINGLE-SCALE VISION TRANSFORMERS: MULTI-SCALE FEATURE FUSION FOR ROBUST SCENE AND DOCUMENT TEXT RECOGNITION. International Journal of Computer Applications, vol. 187, no. 82 (February 2026), pp. 29-42. DOI=10.5120/ijca2026926434

@article{10.5120/ijca2026926434,
  author    = {Amitesh Kumar Jha and Rajwant Singh Rao},
  title     = {BEYOND SINGLE-SCALE VISION TRANSFORMERS: MULTI-SCALE FEATURE FUSION FOR ROBUST SCENE AND DOCUMENT TEXT RECOGNITION},
  journal   = {International Journal of Computer Applications},
  year      = {2026},
  volume    = {187},
  number    = {82},
  pages     = {29-42},
  doi       = {10.5120/ijca2026926434},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
Abstract

Transformer-based Optical Character Recognition (OCR) systems have recently demonstrated strong performance by modeling long-range dependencies in text images. However, most existing approaches rely on single-scale visual representations, which limits their robustness in scenarios involving variable font sizes, degraded characters, and complex document layouts. This study proposes a Multi-Scale Feature-Based Transformer (MSFT-OCR) that explicitly integrates fine-, mid-, and coarse-scale visual features using scale-aware attention mechanisms. The proposed architecture enables effective interaction between character-level details and global word-level context through inter-scale attention. Extensive experiments on scene text and document OCR benchmarks demonstrate that the proposed method consistently outperforms single-scale Transformer models on IIIT5K-Words, SVT, and IAM in terms of character accuracy (CA, %), word accuracy (WA, %), and normalized edit distance (NED, %). Ablation studies and attention visualizations further validate the effectiveness of multi-scale modeling in text recognition.
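The inter-scale attention described in the abstract can be sketched as a single attention layer over the concatenation of token sequences from three feature scales, so that fine (character-level) tokens can attend to coarse (word-level) context. This is purely an illustrative sketch, not the authors' MSFT-OCR implementation; all shapes, names, and the random projections are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_scale_attention(fine, mid, coarse, d=32, seed=0):
    """Hypothetical fusion of three token sets via one attention layer.

    fine/mid/coarse: (n_i, d) token arrays from feature maps at three
    resolutions. Every token attends over the concatenation of all
    scales, mixing character-level detail with word-level context.
    """
    rng = np.random.default_rng(seed)
    tokens = np.concatenate([fine, mid, coarse], axis=0)   # (N, d)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)          # (N, N) row-stochastic
    return attn @ V                                        # fused tokens, (N, d)

# Toy example: 8 fine, 4 mid, 2 coarse tokens of width d=32.
d = 32
fused = inter_scale_attention(
    np.random.default_rng(1).standard_normal((8, d)),
    np.random.default_rng(2).standard_normal((4, d)),
    np.random.default_rng(3).standard_normal((2, d)),
    d=d,
)
print(fused.shape)  # (14, 32)
```

A real scale-aware design would additionally learn per-scale embeddings and keep separate projections per scale; the sketch collapses those details into one shared attention for brevity.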

References
  • B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2017.
  • A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006, pp. 369–376.
  • J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Word spotting and recognition with embedded attributes,” IEEE TPAMI, vol. 36, no. 12, pp. 2552–2566, 2014.
  • Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, “AON: Towards arbitrarily-oriented text recognition,” in Proc. CVPR, 2018, pp. 5571–5579.
Transformer Foundations & OCR Transformers
  • A. Vaswani et al., “Attention is all you need,” in Proc. NeurIPS, 2017, pp. 5998–6008.
  • A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021.
  • M. Li, T. Lv, L. Chen, Y. Cui, and M. R. Lyu, “TrOCR: Transformer-based optical character recognition with pre-trained models,” in Proc. AAAI, 2023.
  • G. Kim, T. Hong, and J. Park, “Donut: Document understanding transformer without OCR,” in Proc. ECCV, 2022, pp. 383–398.
  • K. Lee, M. Kim, and H. Kim, “Pix2Struct: Screenshot parsing as pretraining for visual language understanding,” in Proc. ICML, 2023.
  • Y. Du, C. Guo, and Z. Liu, “SVTR: Scene text recognition with a single visual model,” in Proc. IJCAI, 2022, pp. 884–890.
  • Y. Du et al., “SVTRv2: CTC beats encoder–decoder models in scene text recognition,” arXiv preprint arXiv:2401.00487, 2024.
  • S. Fang, H. Xie, Y. Wang, and L. Jin, “Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,” in Proc. CVPR, 2021, pp. 7098–7107.
Multi-Scale & Hierarchical Vision Models
  • T.-Y. Lin et al., “Feature pyramid networks for object detection,” in Proc. CVPR, 2017, pp. 2117–2125.
  • Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. ICCV, 2021, pp. 10012–10022.
  • O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015, pp. 234–241.
  • A. Jaegle et al., “Perceiver IO: A general architecture for structured inputs and outputs,” in Proc. ICML, 2021.
  • N. Lu, W. Yu, X. Qi, and X. Bai, “MASTER: Multi-aspect non-local network for scene text recognition,” Pattern Recognition, vol. 117, 2021.
  • K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
Modern Scene & Document OCR Systems
  • Y. Du et al., “PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system,” arXiv preprint arXiv:2206.03001, 2022.
  • X. Chen et al., “PaLI: A jointly-scaled multilingual language-image model,” arXiv preprint arXiv:2209.06794, 2022.
  • Y. Huang et al., “LayoutLMv3: Pre-training for document AI with unified text and image masking,” in Proc. ACM MM, 2022.
  • M. Li et al., “DocFormer: End-to-end transformer for document understanding,” in Proc. ICCV, 2021.
  • S. Powalski et al., “Going full OCR-free: End-to-end document understanding using vision language models,” in Proc. ICDAR, 2021.
  • H. Nam et al., “StrucTexT: Structured text understanding with multi-modal transformers,” in Proc. CVPR, 2022.
Surveys, Benchmarks & Recent Advances (2023–2025)
  • X. Wang, Y. Jiang, Z. Luo, and C. Yao, “Scene text recognition: A survey,” Pattern Recognition Letters, vol. 165, pp. 1–14, 2023.
  • A. W. M. Smeulders et al., “Deep learning for document analysis: A survey,” International Journal of Computer Vision, vol. 130, pp. 1–38, 2022.
  • D. Karatzas et al., “ICDAR 2015 competition on robust reading,” in Proc. ICDAR, 2015.
  • A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in Proc. CVPR, 2016.
  • Y. Zhou et al., “Vision-language models for OCR and document understanding,” in Proc. IJCAI, 2024.
  • H. Li et al., “Advances in transformer-based OCR: A comprehensive review,” in Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 2025.
  • A. K. Jha and R. S. Rao, “Advances in scene text recognition: A comprehensive review of sequential transformation attention-based networks (STANs) and related approaches,” Proceedings in Mathematics and Informatics, De Gruyter, 2021.
Index Terms
Computer Science
Information Sciences
Keywords

MSFT-OCR, Character Accuracy (CA), Word Accuracy (WA), Normalized Edit Distance (NED)
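The three metrics listed above can be computed from the Levenshtein edit distance between predicted and ground-truth strings. The sketch below uses one common set of definitions (CA as 1 minus the character error rate, NED averaged per sample); the paper may define them slightly differently, and the function names here are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution/match
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ocr_metrics(preds, gts):
    """Return (CA %, WA %, NED %) over paired prediction/ground-truth lists."""
    ca = 100 * (1 - sum(levenshtein(p, g) for p, g in zip(preds, gts))
                / sum(len(g) for g in gts))
    wa = 100 * sum(p == g for p, g in zip(preds, gts)) / len(gts)
    ned = 100 * (1 - sum(levenshtein(p, g) / max(len(p), len(g), 1)
                         for p, g in zip(preds, gts)) / len(gts))
    return ca, wa, ned

ca, wa, ned = ocr_metrics(["hello", "world"], ["hallo", "world"])
print(round(ca, 1), round(wa, 1), round(ned, 1))  # 90.0 50.0 90.0
```

Note that some benchmarks report NED itself (lower is better) rather than the 1-NED similarity shown here; the keywords' "NED (%)" leaves either reading open.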
