Abstract:
High-quality tools for automatically determining the CEFR level of a text make it possible to create educational and testing materials more quickly and objectively. In this paper, we examine two types of modern text models, linguistic features and embeddings from large language models, for the task of classifying Russian-language texts into six CEFR levels (A1-C2) and into three broader categories (A, B, C). Both types of models explicitly represent a text as a vector of numerical features, so assigning a level to a text can be treated as a standard classification task in computational linguistics. The experiments were conducted on our own corpus of 1904 texts. The best quality for both the six-level and the three-category setting is achieved by rubert-base-cased-conversational without additional adaptation. The maximum F-measure is 0.77 for the categories A, B, C and 0.67 for the six levels. The quality of level determination depends more on the text model than on the machine learning classification algorithm: across algorithms, the results differ by no more than 0.01-0.02, especially for ensemble methods.
Keywords: natural language processing, Russian-language text classification, linguistic features, embeddings, BERT, GPT, CEFR.
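
The two label granularities described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The helper names `to_coarse` and `macro_f1` are hypothetical, the toy labels are invented for illustration and do not come from the corpus, and macro averaging is an assumption, since the abstract does not say which F-measure averaging was used.

```python
from collections import defaultdict

# The six fine-grained CEFR levels named in the abstract.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def to_coarse(level):
    """Map a six-way label such as 'B2' to its broad category 'B'."""
    if level not in LEVELS:
        raise ValueError(f"unknown CEFR level: {level!r}")
    return level[0]


def macro_f1(y_true, y_pred):
    """Macro-averaged F-measure (assumed averaging scheme)."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true label but was missed
    labels = sorted(set(y_true) | set(y_pred))
    # Per-label F1 = 2TP / (2TP + FP + FN); the denominator is never zero
    # because every label occurs at least once in y_true or y_pred.
    scores = [2 * tp[l] / (2 * tp[l] + fp[l] + fn[l]) for l in labels]
    return sum(scores) / len(scores)


# Toy gold labels and predictions, invented for illustration only.
y_true = ["A1", "A2", "B1", "B2", "C1", "C2"]
y_pred = ["A1", "A1", "B1", "B1", "C1", "C2"]

fine_score = macro_f1(y_true, y_pred)
coarse_score = macro_f1([to_coarse(l) for l in y_true],
                        [to_coarse(l) for l in y_pred])
print(f"six levels: {fine_score:.2f}, three categories: {coarse_score:.2f}")
```

In this toy example every error confuses adjacent levels within the same letter, so the errors disappear once the labels are collapsed to A, B, C. This is one reason a three-category score can be higher than a six-level score for the same predictions.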