
Informatics and Automation, 2026, Volume 25, Issue 1, Pages 176–199 (Mi trspy1415)

Artificial Intelligence, Knowledge and Data Engineering

Assessing the influence of floating-point bit depth on speaker recognition accuracy

N. Kolmakov a, A. Golubinskiy b

a Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)
b Russian Science Foundation

Abstract: The article analyzes the impact of varying the bit depth (quantization) of a neural network's output tensor on speaker recognition accuracy. This tensor represents the network's latent space and contains the latent features used for speaker recognition. Typically, 32 bits are allocated per value of the output space (the output tensors of the methods under study contain 512 values), resulting in significant memory requirements for maintaining a continuously updated database. Consequently, the "minifloat" floating-point format, which represents numbers using only 8, 6, or 4 bits, is of particular interest. To ensure comprehensive results, three neural network models demonstrating superior recognition performance on the test set were selected: CAM++, WavLM, and ReDimNet. These models have distinct architectural characteristics, making it possible to assess how bit-depth reduction affects recognition accuracy across different neural network architectures. Recognition accuracy is evaluated using the Equal Error Rate (EER). The evaluation employs the English-language VoxCeleb1 dataset, whose audio characteristics correspond to those of a small-scale biometric system database. The relevance of this study is underscored by the growing body of research proposing the use of voice as a verification key. Managing large biometric datasets requires substantial storage capacity and RAM, and modern databases are continuously updated and expanded, further increasing the resources needed to maintain them. Applying quantization to the neural network's output tensor offers a potential solution. However, excessive reduction of the output tensor's bit depth can significantly degrade recognition quality compared to the baseline network. The primary focus of this research is minimizing the resources required to support a biometric system without additional neural network training.
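The minifloat idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function below simulates a generic sign + exponent + mantissa minifloat (e.g., 1+4+3 bits for an 8-bit format) by rounding each value's mantissa and restricting its exponent range, which is one common way to emulate low-bit floating-point storage of an embedding; the function name and exact rounding rules are illustrative assumptions.

```python
import numpy as np

def quantize_minifloat(x, exp_bits=4, man_bits=3):
    """Illustrative sketch: simulate a sign + exp_bits + man_bits
    minifloat by rounding the mantissa and limiting the exponent.
    Values too small for the exponent range are flushed to zero."""
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    # Decompose |x| into mantissa in [0.5, 1) and integer exponent.
    man, exp = np.frexp(np.abs(x))
    # Round the mantissa to (man_bits + 1) significant bits
    # (one implicit leading bit plus man_bits stored bits).
    scale = 2.0 ** (man_bits + 1)
    man = np.round(man * scale) / scale
    # Restrict the exponent to the representable range; flush underflow.
    emax = 2 ** (exp_bits - 1)
    underflow = exp < (-emax + 2)
    exp = np.clip(exp, -emax + 2, emax)
    out = sign * np.ldexp(man, exp)
    return np.where(underflow, 0.0, out).astype(np.float32)

# A 512-value embedding stored this way needs 8 bits per value
# instead of 32, i.e. roughly a 4x reduction in database memory.
emb = np.float32([1.0, -0.3, 2.5, 0.0])
q = quantize_minifloat(emb, exp_bits=4, man_bits=3)
```

In a speaker-verification pipeline of the kind studied here, such a transform would be applied to stored reference embeddings only, and the resulting EER would then be compared against the 32-bit baseline.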

Keywords: neural networks, speaker recognition, floating point, embedding quantization.

UDC: 004.008

Received: 29.06.2025

DOI: 10.15622/ia.25.1.6



© Steklov Math. Inst. of RAS, 2026