
Dokl. RAN. Math. Inf. Proc. Upr., 2025, Volume 527, Pages 117–133 (Mi danma672)

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

Dynamic division of labor in hybrid AI: contrasting encoder strategies and their impact on LSTM modulators

A. K. Zvereva, A. V. Grabovoy, M. S. Kaprielova

Moscow Institute of Physics and Technology, Dolgoprudny, Russia

Abstract: The collaboration between spatial encoders, such as CNNs and Vision Transformers (ViTs), and temporal modulators such as LSTMs is fundamental to hybrid models, yet the dynamics of this interplay remain poorly understood. This study introduces a cross-domain framework that leverages metrics from information theory and the Markov Information Bottleneck (MIB) principle to quantify the internal information flows of these architectures. We analyze two distinct models, a CNN-LSTM for surveillance video and a ViT-LSTM for medical fMRI sequences, on both their native and foreign domains. Our analysis reveals two fundamentally different encoding strategies. The CNN employs a rigid, data-agnostic “Gradual Compression” pipeline. The ViT, in contrast, demonstrates an “Adaptive Compression” strategy, in which the successful formation of an efficient information bottleneck is conditional on the data domain. We show that this adaptability is supported by a nuanced three-stage functional hierarchy within the ViT. Furthermore, we find that the LSTM modulator's role adapts to its partner: paired with the CNN, it acts as a conventional compressor, but when paired with the ViT, it switches to an “unpacker” role to process the hyper-compressed representations it receives. We propose that this interplay exemplifies a principle of “Dynamic Division of Labor”, in which functional roles are not fixed but emerge from the system's interaction with the data. This insight challenges the static view of neural network components and opens a path toward more robust, context-aware AI.
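As a concrete illustration of the setup the abstract describes, the sketch below pairs a toy CNN encoder with an LSTM temporal modulator and probes compression with a crude binned mutual-information estimate, the basic quantity behind information-bottleneck analyses. This is a minimal sketch, not the authors' implementation: the names CNNLSTMHybrid and mutual_information, the layer sizes, and the scalar summary statistics are all illustrative assumptions, and the paper's MIB metrics are computed over full layer representations rather than 1-D projections.

import numpy as np
import torch
import torch.nn as nn

class CNNLSTMHybrid(nn.Module):
    """Toy CNN encoder feeding an LSTM temporal modulator (illustrative only)."""
    def __init__(self, feat_dim=64, hidden_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(               # per-frame spatial encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, video):                       # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1))   # (B*T, feat_dim)
        feats = feats.view(b, t, -1)                # (B, T, feat_dim)
        states, _ = self.lstm(feats)                # (B, T, hidden_dim)
        return feats, states

def mutual_information(x, z, bins=8):
    """Binned estimate of I(X; Z) in nats for two 1-D samples."""
    joint, _, _ = np.histogram2d(x, z, bins=bins)
    pxz = joint / joint.sum()
    px = pxz.sum(axis=1, keepdims=True)             # marginal p(x)
    pz = pxz.sum(axis=0, keepdims=True)             # marginal p(z)
    nz = pxz > 0                                    # avoid log(0)
    return float((pxz[nz] * np.log(pxz[nz] / (px @ pz)[nz])).sum())

# Probe compression: MI between per-frame pixel means and feature means.
video = torch.randn(64, 8, 3, 32, 32)               # 64 clips, 8 frames each
feats, states = CNNLSTMHybrid()(video)
x = video.mean(dim=(2, 3, 4)).flatten().numpy()     # (64*8,) input summaries
z = feats.detach().mean(dim=2).flatten().numpy()    # (64*8,) feature summaries
print(f"I(X; Z) ~ {mutual_information(x, z):.3f} nats")

Reducing each frame and each feature vector to a scalar mean keeps the example runnable; in practice, bottleneck analyses of the kind reported here rely on higher-dimensional estimators (per-unit binning, kernel, or neural estimators).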

Keywords: hybrid models, spatio-temporal learning, convolutional neural networks (CNNs), vision transformer (ViT), LSTM, information theory, information bottleneck, model interpretability.

UDC: 004.8

Received: 15.08.2025
Accepted: 15.09.2025

DOI: 10.7868/S2686954325070100


