
Proceedings of ISP RAS, 2025 Volume 37, Issue 6(1), Pages 233–242 (Mi tisp1069)

Comparison of the interpretability of the ResNet50 and ViT-224 models in the task of classifying bacteria on scanning electron microscope images

V. N. Gridin, I. A. Novikov, B. R. Salem, V. I. Solodovnikov

Center of Information Technologies in Design, Russian Academy of Sciences

Abstract: The paper studies the interpretability of two popular deep learning architectures, ResNet50 and Vision Transformer (ViT-224), applied to classifying pathogenic microorganisms on images obtained with a scanning electron microscope after preliminary sample preparation using lanthanide contrasting. In addition to standard quality metrics such as precision, recall, and F1 score, the study focuses on the built-in attention maps of the Vision Transformer and on post-hoc interpretation of the trained ResNet50 model using the Grad-CAM method. The experiments were performed on the original dataset and on three of its modifications: with the background zeroed by thresholding, with image regions modified by inpainting, and with the background fully cleared by zeroing out background regions. To assess the generality of the attention mechanism in the Vision Transformer, an additional test was conducted on the classic MNIST handwritten digit recognition task. The results show that the Vision Transformer produces more localized and biologically grounded attention heatmaps and is more robust to changes in background noise.
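For context on the Grad-CAM method mentioned in the abstract, the core computation can be sketched in a few lines: channel weights are obtained by globally average-pooling the gradients of the class score with respect to the last convolutional feature maps, the feature maps are combined with those weights, and a ReLU keeps only positively contributing regions. This is a minimal NumPy illustration of the standard Grad-CAM weighting, not the authors' implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from precomputed tensors.

    activations: (K, H, W) feature maps of the last conv layer
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps
    Returns a (H, W) heatmap normalized to [0, 1].
    """
    # Channel importance weights: global average pooling of the gradients
    weights = gradients.mean(axis=(1, 2))                       # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize for visualization (skip if the map is all zeros)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

In practice the two input tensors would be captured from a trained ResNet50 with forward and backward hooks on its final convolutional block; the resulting heatmap is upsampled to the input resolution and overlaid on the image.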

Keywords: Vision Transformer, ResNet50, Grad-CAM, attention maps, attention heat maps, interpretability, classification, bacteria, image analysis

DOI: 10.15514/ISPRAS-2025-37(6)-15



© Steklov Math. Inst. of RAS, 2026