
Tr. SPIIRAN, 2018, Issue 58, Pages 77–110 (Mi trspy1007)

This article is cited in 13 papers

Artificial Intelligence, Knowledge and Data Engineering

An analytic survey of end-to-end speech recognition systems

N. M. Markovnikov (a), I. S. Kipyatkova (a, b)

a St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS)
b Saint Petersburg State University of Aerospace Instrumentation (SUAI)

Abstract: This article presents an analytic survey of end-to-end speech recognition systems and of approaches to their construction, training, and optimization. We consider models that use connectionist temporal classification (CTC) as a loss function for neural networks and models based on an encoder-decoder architecture with an attention mechanism. We also describe neural network models built with conditional random fields (CRFs), a generalization of hidden Markov models that addresses some drawbacks of standard hybrid speech recognition systems, such as the assumption that the elements of a sequence of speech frames are independent. We describe how language models can be integrated into end-to-end systems at the decoding stage. We further review modifications and improvements of the standard end-to-end models, for example a generalization of connectionist temporal classification and regularization of attention-based encoder-decoder models; these modifications significantly reduce recognition error rates for end-to-end models. The surveyed research shows that end-to-end systems achieve results close to those of state-of-the-art hybrid models while using a simple configuration and offering fast training and decoding. In addition, we consider popular frameworks and toolkits for building speech recognition systems, such as TensorFlow, Eesen, and Kaldi, and compare them by how simple and accessible they make the implementation of an end-to-end speech recognition system.
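
The abstract names CTC as a loss function but gives no formula or code. As a minimal illustrative sketch, not taken from the article, the CTC negative log-likelihood of one utterance can be computed with the standard forward recursion over the label sequence extended with blanks. The function name `ctc_neg_log_likelihood`, the argument layout (`probs[t][k]` is the per-frame probability of symbol `k`), and the choice of index 0 for the blank are assumptions made for this example; practical toolkits work in log space for numerical stability.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """Hypothetical helper: CTC loss for one utterance.

    probs[t][k] is the probability of symbol k at frame t (rows sum to 1).
    target is the list of label indices, without blanks.
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(probs)

    # alpha[s]: total probability of all partial alignments that end
    # in extended position s after the frames processed so far.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                 # stay in the same position
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # A valid alignment ends in the last label or the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf
```

For two frames with uniform probabilities over {blank, a} and the target "a", three alignments ("aa", "a blank", "blank a") each contribute 0.25, so the function returns -log(0.75). A target with a repeated label, such as "aa", needs at least three frames because a blank must separate the repeats; with two frames the loss is infinite.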

Keywords: speech recognition, end-to-end models, neural networks, deep learning.

UDC: 004.522

Received: 28.11.2017

DOI: 10.15622/sp.58.4


