Abstract:
This paper presents a study of the effectiveness of discriminative methods for abbreviation lemmatization in Russian texts. Unlike generative approaches, discriminative models select the optimal lemma from a fixed set of candidates, eliminating the risk of producing grammatically incorrect word forms. For the first time in Russian language processing, we conduct a comprehensive analysis of four context-aware approaches: (1) masked language model ranking, (2) binary classification, (3) multi-class classification, and (4) prompt-based learning. Special attention is given to cases of contextual ambiguity, where the same abbreviation within a single text fragment corresponds to different lemmas. The results demonstrate that fine-tuned multi-class classification yields the best overall quality. However, with limited training data, both prompt-based learning and masked language model ranking show promising results, and their effectiveness increases further in cases of contextual ambiguity. The study contributes to the development of Russian text processing methods by providing practical recommendations on architecture selection for abbreviation lemmatization.
Keywords: lemmatization, abbreviations, Russian language, discriminative methods, text classification, natural language processing.