
Program Systems: Theory and Applications, 2025 Volume 16, Issue 4, Pages 173–216 (Mi ps480)

Artificial intelligence and machine learning

Comparative analysis of backbone architectures for instance segmentation of objects in aerial imagery using Mask R-CNN

I. V. Vinokurov, D. A. Frolova, A. I. Ilyin, I. R. Kuznetsov

Financial University under the Government of the Russian Federation, Moscow

Abstract: This paper compares Mask R-CNN models with various pretrained backbone architectures for instance segmentation of real estate objects in aerial images. The models were fine-tuned on a specialized dataset provided by PLC «Roskadastr».
An analysis of bounding-box detection accuracy and segmentation-mask accuracy identified the preferred architectures: the Swin transformers (Swin-S and Swin-T) and the ConvNeXt-T convolutional network. The high accuracy of these models is attributed to their ability to capture global contextual dependencies in the image.
The results support the following recommendations for choosing a backbone architecture: for real-time monitoring systems, where inference speed is critical, lightweight models (EfficientNet-B3, ConvNeXt-T, Swin-T) are advisable; for offline tasks requiring maximum accuracy, such as real estate mapping, the larger Swin-S model is recommended.
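The abstract compares backbones by the accuracy of predicted bounding boxes and segmentation masks, but does not name the metric used. A common building block for such comparisons is intersection over union (IoU), computed separately for boxes and for masks. The sketch below is illustrative only: the function names `box_iou` and `mask_iou`, the `(x1, y1, x2, y2)` box layout, and the nested-list mask format are assumptions, not details taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2).

    The box layout is an assumption made for this sketch.
    """
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(m1, m2):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for row1, row2 in zip(m1, m2):
        for p, q in zip(row1, row2):
            if p and q:
                inter += 1  # pixel is set in both masks
            if p or q:
                union += 1  # pixel is set in at least one mask
    return inter / union if union else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    # Two small masks sharing one pixel out of three set pixels.
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```

In a typical evaluation, a prediction counts as correct when its IoU with a ground-truth object exceeds a chosen threshold, and precision is then averaged over thresholds and classes. Whether the study follows this exact protocol is not stated in the abstract.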

Key words and phrases: instance segmentation, backbone, Mask R-CNN, ResNet, DenseNet, EfficientNet, ConvNeXt, Swin.

UDC: 004.932.75'1, 004.89
BBK: 32.813.5: 32.973.202-018

MSC: Primary 68T20; Secondary 68T07, 68T45

Received: 22.09.2025
Accepted: 12.10.2025

Language: Russian and English

DOI: 10.25209/2079-3316-2025-16-4-173-216


