Abstract:
This paper studies the interpretability of two popular deep learning architectures, ResNet50 and Vision Transformer (ViT-224), applied to the problem of classifying pathogenic microorganisms in scanning electron microscopy images of samples prepared with lanthanide contrasting. Beyond standard quality metrics such as precision, recall, and F1 score, the key focus is the analysis of the Vision Transformer's built-in attention maps and the post-hoc interpretation of the trained ResNet50 model using the Grad-CAM method. Experiments were performed on the original dataset and on three modified versions of it: one with the background zeroed by thresholding, one with image regions altered using inpainting, and one with the background completely removed by zeroing out background areas. To assess the generality of the Vision Transformer's attention mechanism, an additional test was conducted on the classic MNIST handwritten digit recognition task. The results show that the Vision Transformer architecture produces more localized and biologically grounded attention heatmaps and exhibits greater robustness to changes in background noise.