Abstract:
Automatic speech recognition (ASR) systems for real-life scenarios are required to process audio streams of arbitrary length with stable accuracy under limited computational resources. While the joint connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) model delivers high recognition quality, its vanilla form cannot meet these requirements. This paper proposes an input-synchronous blockwise decoding algorithm for the joint CTC-AED model. The algorithm processes overlapping blocks of audio synchronously with the input frames, using the CTC alignment to determine the proper context from the overlapping part for the AED component. The fixed block length ensures predictable and limited resource consumption and avoids long-form speech generalization issues, while the overlap mitigates word error rate (WER) degradation caused by edge effects. Unlike existing methods, the proposed approach requires neither model architecture modifications nor a special training procedure, while also supporting block overlapping. The WER performance of the algorithm is studied with respect to block size and overlap size.
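The fixed-length overlapping blocking described above can be sketched as follows. This is a minimal illustration only; the function name, the `block_len` and `overlap` parameters (in frames), and the plain-list representation of frames are assumptions for exposition, not the paper's actual implementation, which additionally uses the CTC alignment to pick the decoding context inside the overlap.

```python
def split_into_blocks(frames, block_len, overlap):
    """Split a frame sequence into fixed-length overlapping blocks.

    Hypothetical sketch: consecutive blocks share `overlap` frames,
    so per-block resource use stays bounded regardless of the total
    input length, while the shared frames give the next block context
    that mitigates edge effects.
    """
    if overlap >= block_len:
        raise ValueError("overlap must be smaller than block_len")
    hop = block_len - overlap  # how far the block start advances
    blocks = []
    start = 0
    while start < len(frames):
        blocks.append(frames[start:start + block_len])
        if start + block_len >= len(frames):
            break  # last block reached the end of the stream
        start += hop
    return blocks

# Example: 10 frames, blocks of 4 frames overlapping by 2.
print(split_into_blocks(list(range(10)), block_len=4, overlap=2))
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Because the block length is constant, memory and latency per block do not grow with the length of the audio stream, which is the property the abstract emphasizes.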