Abstract:
The article presents a methodology for designing a benchmark to assess numerical reasoning skills in Large Language Models (LLMs). In the context of LLMs, numerical reasoning is defined as a model’s ability to correctly interpret, process, and utilize numerical information in text, including understanding magnitudes and relations between numbers, performing arithmetic operations, and generating numerals accurately in its outputs. The proposed methodology is based on decomposing applied tasks and enables targeted evaluation of specific facets of numerical reasoning through tasks that involve numerals. Particular attention is paid to the representation of numbers in textual prompts to LLMs, as this factor directly affects the quality of the final output. The need for rigorous assessment of LLMs’ numerical reasoning stems from its critical role across a wide range of text-centric applications, including automated summarization, generation of analytical reports, extraction and interpretation of quantitative data, and conversational systems operating on financial, scientific, or technical information. Based on an analysis of state-of-the-art LLM evaluation approaches, core principles for constructing evaluation benchmarks are formulated, with an emphasis on generality and real-world applicability. In accordance with the proposed methodology, the MUE (Math Understanding Evaluation) benchmark is introduced; it comprises five test suites, each designed to assess a distinct aspect of LLM numerical reasoning. A comparative evaluation of popular LLMs is conducted, leading models are identified, and the strengths and weaknesses of their numerical reasoning are characterized. The findings are intended to help LLM developers refine architectures and training strategies, and to guide end users and integrators in selecting the most suitable model for applied projects.
Keywords: methodology, Large Language Models (LLMs), LLM benchmark, Natural Language Processing (NLP), numerals.