
Zap. Nauchn. Sem. POMI, 2025 Volume 546, Pages 193–202 (Mi znsl7637)

RuMathBERT: a Russian-language model for mathematical formula interpretation

A. Latushko (a), E. Bruches (a,b)

a Novosibirsk State University
b Institute of Informatics Systems SB RAS

Abstract: Important information in scientific and technical texts is often contained in mathematical formulae and cannot be recovered from the plain text alone, which makes it challenging for vanilla language models to process such texts in a way that fully captures their semantics. While models have been developed for this purpose for English, none have so far been created for Russian. In this paper we present RuMathBERT, a model trained on Russian texts that can be used for processing scientific texts containing formulae. Evaluating the model's quality and comparing it with other models used for processing regular and scientific texts in Russian and English demonstrates that RuMathBERT has a better understanding of the semantics of formulae and of their relationship with the surrounding context. The dataset used for training the model is available on Hugging Face at https://huggingface.co/datasets/iis-research-team/ruwiki-formulae, and the RuMathBERT model itself is available at https://huggingface.co/iis-research-team/RuMathBERT.

Key words and phrases: BERT, mathematical texts, formulae, NLP.

UDC: 004.912

Received: 05.05.2025

Language: English



© Steklov Math. Inst. of RAS, 2026