Abstract:
This paper compares Mask R-CNN models with various pretrained
backbone architectures for instance segmentation of real estate objects
in aerial images. The models were fine-tuned on a specialized dataset provided by the
PLC «Roskadastr».
An analysis of bounding-box detection accuracy and segmentation-mask accuracy
identified the preferred architectures: the Swin transformers (Swin-S and Swin-T) and the
ConvNeXt-T convolutional network. The high accuracy of these models is attributed to
their ability to capture global contextual dependencies in the image.
Based on these results, we offer the following recommendations for
choosing a backbone architecture: for real-time monitoring systems, where inference speed is
critical, lightweight models (EfficientNet-B3, ConvNeXt-T, Swin-T) are advisable; for
offline tasks requiring maximum accuracy (such as real estate mapping), the larger
Swin-S model is recommended.
Key words and phrases: instance segmentation, backbone, Mask R-CNN, ResNet, DenseNet, EfficientNet, ConvNeXt, Swin.