Computing, Telecommunication and Control, 2022 Volume 15, Issue 4, Pages 73–85 (Mi ntitu333)

Multi-channel transformer: a transformer-based model for multi-speaker speech recognition

E. S. Fadeeva, V. A. Ershov

Company "Yandex"

Abstract: Most modern approaches to multi-speaker speech recognition are either inapplicable to overlapping speech or slow to run, which can be critical in settings such as real-time speech recognition. This paper presents a transformer-based end-to-end model for overlapping speech recognition, built as a generalization of the standard approach to speech recognition. The model achieves quality comparable to modern state-of-the-art models but requires fewer model calls, which speeds up inference. In addition, a procedure for generating synthetic training data is described. This procedure compensates for the lack of real multi-speaker training data by creating a stream of data from the initial collection.

Keywords: speech recognition, multi-speaker speech recognition, diarization, speech separation, voice technologies.

UDC: 004.8

Received: 29.11.2022

Language: English

DOI: 10.18721/JCSTCS.15406


