Abstract:
This article presents the construction, training, and performance
evaluation of models with the Encoder-Decoder and Sequence-to-Sequence (Seq2Seq) architectures for the problem of completing
incomplete texts. Problems of this type often arise when restoring the content
of documents from their low-quality images. The studies conducted in this work
address the practical task of producing electronic copies of scanned
documents of the «Roskadastr» PLC whose recognition is difficult or
impossible with standard tools.
The models were built and studied in Python using the
high-level API of the Keras package. A dataset of several thousand
pairs was assembled for training and evaluating the models. Each pair
in this set consists of an incomplete text and the corresponding full text. To evaluate the
quality of the models, the value of the loss function as well as the accuracy, BLEU, and
ROUGE-L metrics were calculated. Loss and accuracy made it possible to
evaluate the effectiveness of the models at the level of predicting individual
words, while the BLEU and ROUGE-L metrics were used to evaluate the similarity
between the full and reconstructed texts. The results showed that both the
Encoder-Decoder and Seq2Seq models cope with the task of reconstructing text
sequences from a fixed set, but the transformer-based Seq2Seq model achieves
better results in terms of training speed and quality.
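To make the evaluation metrics concrete, the following is a minimal, self-contained sketch of sentence-level BLEU (uniform n-gram weights, no smoothing) and ROUGE-L (LCS-based F-measure). It is an illustration of the standard metric definitions, not the exact evaluation code used in this work; the `beta` parameter of ROUGE-L and the whitespace tokenization are simplifying assumptions.

```python
from collections import Counter
import math

def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate, beta=1.2):
    # ROUGE-L F-measure based on the LCS of the token sequences.
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

def bleu(reference, candidate, max_n=4):
    # Sentence-level BLEU: geometric mean of modified n-gram
    # precisions (n = 1..max_n) times a brevity penalty.
    ref, cand = reference.split(), candidate.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
        if overlap == 0:
            return 0.0
        log_sum += math.log(overlap / sum(cand_ngrams.values())) / max_n
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_sum)
```

Both functions return 1.0 for a perfect reconstruction and values near 0 for unrelated texts, which is what makes them suitable for scoring reconstructed sequences against the full originals.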
Key words and phrases: deep learning models, encoder-decoder, sequence-to-sequence transformer, text recovery, BLEU, ROUGE-L, Keras, Python.