Abstract:
This paper addresses the task of lemmatizing abbreviations in Russian. Abbreviation lemmatization is particularly challenging because it requires not only reducing a word to its normal form but also correctly expanding the abbreviation. We explore two approaches to this task, both of which leverage large pre-trained language models. The first approach is generative: the model produces the lemma as text output. The second approach uses classification models to select the most appropriate lemma for abbreviations that have multiple common expansions. We discuss the strengths and limitations of both approaches. The experiments are conducted on Russian texts selected from the Russian National Corpus.
Key words and phrases: lemmatization, abbreviations, morphological tagging, Russian language, text classification, generative models.