Abstract:
The article proposes a solution for generating Russian-language image captions in two distinct registers: formal and conversational. The study is motivated by the need for educational tools that help non-native speakers master colloquial Russian. The methodology employs a multimodal encoder-decoder ensemble architecture in which a pre-trained ResNet-152 convolutional neural network serves as the encoder and an LSTM network functions as the decoder; captioning performance is further enhanced by the Bahdanau attention mechanism. To facilitate training, the authors constructed a custom dataset derived from MS COCO, translated and stylistically adapted with the GigaChat large language model. During ensemble construction, ruCLIPScore is used to select the most effective model configurations. Experimental results indicate that the ensemble significantly outperforms its individual constituent models according to ruCLIPScore and can produce stylistically diverse captions across both registers.
Keywords: neural networks, image captioning, visual language models, computer vision.
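
For orientation, the following is a minimal sketch of the encoder-decoder architecture the abstract describes, assuming a PyTorch implementation: a pre-trained ResNet-152 encoder kept as a spatial feature map, a Bahdanau (additive) attention module, and an LSTM decoder. All class names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the captioning architecture from the abstract.
# Assumes torchvision >= 0.13 for the `weights=` API; names are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """ResNet-152 backbone; keeps the spatial feature grid for attention."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        # Drop the average-pool and FC head to retain a 7x7x2048 feature grid.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                      # (B, 3, 224, 224)
        feats = self.backbone(images)               # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)     # (B, 49, 2048)


class BahdanauAttention(nn.Module):
    """Additive attention: score(h, f) = v^T tanh(W_h h + W_f f)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, 49, feat_dim), hidden: (B, hidden_dim)
        scores = self.v(torch.tanh(
            self.w_feat(feats) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)        # weights over 49 regions
        context = (alpha * feats).sum(dim=1)        # (B, feat_dim)
        return context, alpha


class Decoder(nn.Module):
    """LSTM decoder that attends over encoder features at every step."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 feat_dim=2048, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # captions: (B, T) token ids; teacher forcing during training.
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            context, _ = self.attention(feats, h)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, T, vocab_size)
```

In this sketch, ensemble construction and ruCLIPScore-based model selection are left out; the abstract indicates those operate over multiple trained instances of a model of this general shape.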