A. V. Glazkova, I. A. Smal, O. N. Lyashevskaya, D. A. Morozov, “Discriminative lemmatization of abbreviations in the era of LLMs”, Dokl. RAN. Math. Inf. Proc. Upr., 2025, Volume 527, Pages 146

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

Discriminative lemmatization of abbreviations in the era of LLMs

A. V. Glazkova^ab, I. A. Smal^c, O. N. Lyashevskaya^de, D. A. Morozov^bc

^a Tyumen State University, Tyumen, Russia
^b Russian National Corpus, Moscow, Russia
^c Novosibirsk State University, Novosibirsk, Russia
^d National Research University Higher School of Economics, Moscow
^e V. V. Vinogradov Russian Language Institute of the Russian Academy of Sciences

Abstract: This paper presents a study on the effectiveness of discriminative methods for abbreviation lemmatization in Russian texts. Unlike generative approaches, discriminative models select the optimal lemma from a fixed set of candidates, eliminating the risk of generating grammatically incorrect word forms. For the first time in Russian language processing, we conduct a comprehensive analysis of four context-aware approaches: (1) masked language model ranking, (2) binary classification, (3) multi-class classification, and (4) prompt-based learning. Special attention is given to cases of contextual ambiguity, where the same abbreviation within a single text fragment corresponds to different lemmas. The results demonstrate that fine-tuned multi-class classification achieves the highest quality. However, with limited training data, both prompt-based learning and masked language model ranking show promising results. Moreover, the effectiveness of these approaches increases in cases of contextual ambiguity. The study contributes to the development of Russian text processing methods by providing practical recommendations for selecting architectures for abbreviation lemmatization tasks.
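The core idea of discriminative lemmatization, choosing a lemma from a fixed candidate set rather than generating it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the candidate inventory `CANDIDATES`, the cue lists `CUES`, and the functions `toy_score` and `select_lemma` are all hypothetical, and the toy overlap score merely stands in for a score that a masked language model or a fine-tuned classifier would provide.

```python
from typing import Callable, Dict, List, Set

# Hypothetical inventory: each abbreviation maps to a fixed set of lemmas.
CANDIDATES: Dict[str, List[str]] = {
    "г.": ["год", "город", "господин"],
}

# Hypothetical cue words; a stand-in for what a trained model would learn.
CUES: Dict[str, Set[str]] = {
    "год": {"1990", "родился", "году"},
    "город": {"живет", "Тюмень", "Москва"},
    "господин": {"уважаемый", "Иванов"},
}


def toy_score(context: List[str], lemma: str) -> float:
    """Count context tokens that appear among the lemma's cue words."""
    return float(len(set(context) & CUES.get(lemma, set())))


def select_lemma(
    abbreviation: str,
    context: List[str],
    score: Callable[[List[str], str], float] = toy_score,
) -> str:
    """Return the highest-scoring lemma from the fixed candidate set.

    The output is always one of the listed candidates, so an ill-formed
    lemma cannot be produced. On ties, the first candidate wins.
    """
    candidates = CANDIDATES[abbreviation]
    return max(candidates, key=lambda lemma: score(context, lemma))


year_lemma = select_lemma("г.", ["родился", "в", "1990", "г."])
city_lemma = select_lemma("г.", ["живет", "в", "г.", "Тюмень"])
```

Because each occurrence is scored against its own context, the same abbreviation can resolve to different lemmas, which is the contextual ambiguity case the abstract singles out. Replacing `toy_score` with a model-based scorer would leave the selection step unchanged.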

Keywords: lemmatization, abbreviations, Russian language, discriminative methods, text classification, natural language processing.

UDC: 004.8

Received: 21.08.2025
Accepted: 22.09.2025

DOI: 10.7868/S2686954325070124