Abstract:
Machine learning is widely applied across diverse domains, with research teams continually developing new recognition models that compete on open benchmark datasets. In some tasks, accuracy surpasses 99%, and the differences between top-performing models are often marginal, measured in hundredths of a percent. These minimal differences, combined with the varying sizes of the benchmark datasets, raise questions about the reliability of model evaluation and ranking. This paper introduces a method for determining the dataset size needed for statistically robust hypothesis testing of model performance. It also examines the statistical significance of accuracy rankings reported in recent studies on the MNIST, CIFAR-10, and CIFAR-100 datasets.