Abstract:
Two approaches to modernizing the self-attention mechanism in transformer blocks are considered. The first approach introduces a multiplicative stochastic component into the self-attention weight coefficients. This provides structural regularization of the weights by smoothing them and preventing uncontrolled growth. The second approach adds a trainable scaling matrix for the scalar products of queries and keys, which allows the computed self-attention weights to be adjusted even when the standard softmax activation function saturates. A rigorous proof and justification of the resulting regularization are given for both approaches. To confirm the positive effects of the proposed modifications, results are presented for image classification tasks using the standard Vision Transformer architecture. The effectiveness of the methods is also demonstrated on the task of image quality improvement under external distortions and noise. In this case, an original transformer architecture is used that not only exhibits the described effects but also improves on state-of-the-art results.
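For orientation, the sketch below illustrates how the two modifications could enter a single-head self-attention computation. The elementwise scaling matrix S, the Gaussian form of the multiplicative noise, and the renormalization step are assumptions made for illustration only and are not necessarily the paper's exact formulation.

```python
import numpy as np

def modified_self_attention(Q, K, V, S, sigma=0.1, training=True, rng=None):
    """Sketch of self-attention with the two modifications described above.

    Q, K, V : (n, d) query/key/value matrices for one head.
    S       : (n, n) trainable scaling matrix applied elementwise to Q K^T
              (assumed parameterization).
    sigma   : std of the multiplicative noise ~ N(1, sigma^2) (assumed).
    """
    n, d = Q.shape
    rng = np.random.default_rng() if rng is None else rng

    # Trainable scaling of the query-key scalar products, applied before
    # the softmax so the attention weights remain adjustable even when
    # the unscaled softmax would saturate.
    logits = S * (Q @ K.T) / np.sqrt(d)

    # Standard row-wise softmax.
    logits -= logits.max(axis=-1, keepdims=True)
    A = np.exp(logits)
    A /= A.sum(axis=-1, keepdims=True)

    if training:
        # Multiplicative stochastic component: the attention weights are
        # perturbed by unit-mean noise, which smooths them and acts as a
        # structural regularizer (Gaussian form assumed here).
        A = A * rng.normal(1.0, sigma, size=A.shape)
        A /= A.sum(axis=-1, keepdims=True)  # renormalize rows

    return A @ V
```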