
Zap. Nauchn. Sem. POMI, 2025 Volume 546, Pages 246–258 (Mi znsl7640)

Determining the long-windedness of Russian text

D. R. Taldytova (a), V. A. Malykh (b)

(a) National University of Science and Technology «MISIS», Moscow
(b) St. Petersburg National Research University of Information Technologies, Mechanics and Optics

Abstract: Text redundancy occurs when information is duplicated within a sentence, a paragraph, or an entire document. The problem of identifying and eliminating redundancy has not been fully explored. In this work, we study what we call the “long-windedness” of a text document and methods for evaluating it. We present a dataset that can be used to train or fine-tune models for the task of eliminating text redundancy. The dataset is based on a set of articles from Russian-language media and was created using the Saiga and YandexGPT Lite LLMs. We also perform a comparative analysis of Russian-language LLMs on the compression of text documents. We find that GigaChat Lite is the best among the commercial LLMs and that the Saiga LLM performs close to it.

Key words and phrases: text redundancy, large language models, text summarization.

UDC: 004.912

Received: 28.02.2025


