
Impact in Computics

Peer-Reviewed • Open Access • e-ISSN: 3122-7341


Language-Specific Tokenization for Assamese: Efficiency and Downstream Integration with LLMs

Basab Nath 1, Sagar Tamang 2
1 School of Computer Science Engineering and Technology, Bennett University, Greater Noida, Uttar Pradesh 201310, India
2 Department of Computer Applications, Indian Institute of Technology Patna, Bihta Campus, Patna, Bihar 801106, India
DOI: https://doi.org/10.65500/computics-2025-004
Received: 27 September 2025 | Revised: 3 November 2025 | Accepted: 17 November 2025 | Published: 8 December 2025

Abstract

Tokenization is a fundamental step in NLP, influencing both computational efficiency and downstream task performance. While subword methods such as Byte-Pair Encoding (BPE), WordPiece, and Unigram have shown strong results for high-resource languages, their suitability for low-resource and morphologically rich languages like Assamese remains insufficiently understood. This study presents a systematic evaluation of these tokenizers on a curated Assamese Wikipedia corpus, examining intrinsic efficiency metrics—including subword fertility, compression ratio, tokenization speed, and token diversity—alongside statistical validation and energy trade-offs. We further connect intrinsic behaviour to practical outcomes by fine-tuning IndicBERT and mBERT on sentiment and hate-speech tasks, and by assessing morphological boundary preservation. Results show that BPE-32K provides the most compact and semantically coherent segmentation, improves downstream F1 scores by 3–4 points, and preserves morpheme boundaries in 82% of cases, while WordPiece-16K offers the fastest tokenization speed. Overall, the findings demonstrate that vocabulary scaling reduces over-segmentation and that language-specific tokenizers substantially outperform multilingual defaults for Assamese. This work provides empirical guidelines for selecting tokenizers tailored to low-resource Indic languages and for integrating them effectively into LLM pipelines.
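The intrinsic metrics evaluated in the abstract can be made concrete with a short sketch. The snippet below computes subword fertility (average subword tokens per whitespace-delimited word) and a character-level compression ratio (average characters covered per emitted token) under their commonly used definitions; the paper's exact formulations may differ. The `toy_tokenize` function is a hypothetical stand-in for a trained BPE, WordPiece, or Unigram model.

```python
def fertility(tokenize, sentences):
    """Average number of subword tokens produced per whitespace-delimited word.
    Values near 1.0 indicate little over-segmentation; higher values mean
    words are being split into many pieces."""
    words = sum(len(s.split()) for s in sentences)
    tokens = sum(len(tokenize(s)) for s in sentences)
    return tokens / words

def compression_ratio(tokenize, sentences):
    """Average number of characters (excluding spaces) covered by each token.
    Higher values indicate a more compact segmentation."""
    chars = sum(len(s.replace(" ", "")) for s in sentences)
    tokens = sum(len(tokenize(s)) for s in sentences)
    return chars / tokens

def toy_tokenize(sentence, n=3):
    """Toy 'tokenizer': splits each word into chunks of at most n characters,
    standing in for a trained subword model."""
    return [w[i:i + n] for w in sentence.split() for i in range(0, len(w), n)]

corpus = ["tokenization matters", "subword units help"]
print(round(fertility(toy_tokenize, corpus), 2))          # 2.8
print(round(compression_ratio(toy_tokenize, corpus), 2))  # 2.5
```

A real evaluation would replace `toy_tokenize` with the encode function of a trained tokenizer and run both metrics over the full corpus, which is how vocabulary-size effects such as the reported 32K-vs-16K differences become visible.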

Keywords: Tokenization; Assamese; Low-Resource NLP; IndicBERT; mBERT


Acknowledgment

The authors would like to thank Bennett University, India, for providing computational resources and institutional support for this research.

References

  1. Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, 7–12 August 2016.
  2. Schuster, M.; Nakajima, K. Japanese and Korean voice search. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), Kyoto, Japan, 25–30 March 2012; pp. 5149–5152.
  3. Kudo, T. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, 15–20 July 2018.
  4. Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; Hughes, M.; Dean, J. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351.
  5. Tamang, S.; Bora, D.J. Performance evaluation of tokenizers in large language models for the Assamese language. arXiv 2024, arXiv:2410.03718.
  6. Goyal, N.; Gao, C.; Chaudhary, V.; Fan, A.; El-Kishky, A.; et al. The FLORES-200 evaluation benchmark for low-resource and multilingual machine translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), Abu Dhabi, UAE, 7–11 December 2022; pp. 6985–7002.
  7. Kakwani, D.; Gupta, A.; Siddhant, A.; et al. IndicNLP suite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020.
  8. Khanuja, S.; Doddapaneni, S.; Kumar, V.; et al. MuRIL: Multilingual representations for Indian languages. In Findings of the Association for Computational Linguistics: ACL 2021, Online, 6–11 June 2021.
  9. Ghosh, D.; Senapati, A. Hate speech detection in low-resourced Indian languages: An analysis of Assamese and Bodo. Nat. Lang. Process. J. 2025.
  10. Rust, P.; Pfeiffer, J.; Vulić, I.; et al. How good is your tokenizer? On the monolingual performance of multilingual language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021), Online, 1–6 August 2021.
  11. Bostrom, K.; Durrett, G. Byte-level subwords improve multilingual machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5–10 July 2020.
  12. Nath, B.; Sarkar, S.; Das, S.; et al. A trie-based lemmatizer for Assamese language. Int. J. Inf. Technol. 2022, 14, 2355–2360. https://doi.org/10.1007/s41870-022-00942-9
  13. Gazit, B.; Shmidman, S.; Shmidman, A.; Pinter, Y. Splintering nonconcatenative languages for better tokenization. arXiv 2025, arXiv:2503.14433.
  14. Singh, H.; Gupta, N.; Bharadwaj, S.; Tewari, D.; Talukdar, P. IndicGenBench: A multilingual benchmark to evaluate generation capabilities of LLMs on Indic languages. arXiv 2024, arXiv:2404.16816.
  15. Dagan, G.; Synnaeve, G.; Rozière, B. Getting the most out of your tokenizer for pre-training and domain adaptation. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024; Article No. 387.
  16. Nath, B.; Tamang, S.; Elwasila, O.; Gulzar, Y. Task-oriented evaluation of Assamese tokenizers using sentiment classification. Int. J. Adv. Comput. Sci. Appl. 2025, 16, 9.
  17. Karthika, N.J.; Brahma, M.; Saluja, R.; Ramakrishnan, G.; Desarkar, M.S. Multilingual tokenization through the lens of Indian languages: Challenges and insights. arXiv 2025, arXiv:2506.17789.
  18. Brahma, M.; Karthika, N.J.; Singh, A.; Adiga, D.; Bhate, S.; Ramakrishnan, G.; Saluja, R.; Desarkar, M.S. MorphTok: Morphologically grounded tokenization for Indian languages. arXiv 2025, arXiv:2504.10335.
  19. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  20. Nath, B.; Sarkar, S. Comparative analysis of neural machine translation models for low-resource English–Assamese language pair. In Biswas, S.K.; Bandyopadhyay, S.; Hayashi, Y.; Balas, V.E. (Eds.), Intelligent Computing Systems and Applications (ICICSA 2023); Lecture Notes in Networks and Systems, Vol. 1307; Springer: Singapore, 2025. https://doi.org/10.1007/978-981-96-3860-4_1

© 2026 by the authors. Published by Impaxon Publishing.
This article is licensed under the Creative Commons Attribution (CC BY) License.
You are free to share and adapt the material as long as appropriate credit is given.
Publisher’s Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of Impaxon Publishing or the journal editors. The publisher remains neutral with regard to jurisdictional claims in institutional affiliations.