Abstract:
This article examines automatic text genre detection using modern BERT-based language models and models built on three types of linguistic text features, and compares these models from the perspectives of computational and classical linguistics. The authors collected their own corpus of Russian-language Internet texts in eight genres: VKontakte posts, comments, articles from the Habr portal, retail descriptions, news, scientific articles, advertisements, and movie reviews from the Kinopoisk website. Each text was represented as a vector of numerical features under each of the selected models: five BERT variants and linguistic features at the character, structure, and rhythm levels. The vectors based on linguistic features were also concatenated across two or three levels to obtain additional text models. The vectors were then classified into the eight genres using two neural network classifiers, a perceptron and an LSTM. The BERT models achieved high-quality genre detection, with precision, recall, and F-measure reaching 91–99%. Combining the linguistic features yielded an F-measure of about 90%. A linguistic analysis of the classification results and text models revealed characteristic features of individual genres and possible reasons for both the high results and the classification errors.
Keywords: stylometry, natural language processing, rhythm features, genres, text classification, BERT.