
Model. Anal. Inform. Sist., 2025 Volume 32, Number 3, Pages 298–310 (Mi mais853)

Artificial intelligence

Modern Russian-language texts models comparison for the task of CEFR levels classification

V. A. Lavrovskiy, N. S. Lagutina, O. B. Lavrovskaya

P.G. Demidov Yaroslavl State University, Yaroslavl, Russia

Abstract: High-quality tools for automatically determining the CEFR level of a text make it possible to create educational and testing materials faster and more objectively. This paper examines two types of modern text models, linguistic characteristics and embeddings from large language models, for the task of classifying Russian-language texts into six CEFR levels (A1-C2) and into three broader categories (A, B, C). Both types of models explicitly represent a text as a vector of numerical features, so assigning a level to a text is treated as a standard classification task in computational linguistics. The experiments were conducted on the authors' own corpus of 1904 texts. The best quality, for both six and three categories, is achieved by rubert-base-cased-conversational without additional adaptation. The maximum F-measure is 0.77 for levels A, B, C and 0.67 for the six levels. The quality of level determination depends more on the text model than on the machine learning classification algorithm: the results obtained with different algorithms differ by no more than 0.01-0.02, especially for ensemble methods.

Keywords: natural language processing, Russian-language texts classification, linguistic characteristics, embeddings, BERT, GPT, CEFR.

UDC: 004.912

MSC: 68T50

Received: 04.08.2025
Revised: 25.08.2025
Accepted: 27.08.2025

DOI: 10.18255/1818-1015-2025-3-298-310


