Abstract:
Text redundancy occurs when information is duplicated within a sentence, a paragraph, or an entire document. The problem of identifying and eliminating such redundancy remains underexplored. In this work, we study what we call the “long-windedness” of a text document and methods for evaluating it. We present a dataset for training or fine-tuning models on the task of eliminating text redundancy; it is built from a set of articles from Russian-language media and created using the Saiga and YandexGPT Lite LLMs. We also perform a comparative analysis of Russian-language LLMs on the compression of text documents, finding that GigaChat Lite is the best among commercial LLMs and that Saiga performs close to it.
Key words and phrases: text redundancy, large language models, text summarization.