
Comp. nanotechnol., 2025 Volume 12, Issue 2, Pages 19–27 (Mi cn552)

ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

Modification of the method for modeling the thematic environment of terms using the LDA approach

O. V. Zolotarev, V. A. Yurchak

Russian New University

Abstract: Topic modeling is an essential tool for analyzing large volumes of text, as it identifies latent semantic patterns. However, conventional approaches such as Latent Dirichlet Allocation (LDA) struggle with polysemous (multivalued) and single-word (unigram) tokens, which reduces the accuracy and interpretability of the results. This study develops a technique for constructing a thematic structure based on a refined LDA that incorporates contextual features, vector representations of words, and external vocabularies, with the objective of resolving terminological ambiguity and making thematic groups clearer. The paper employs a mathematical model that integrates probabilistic topic modeling with vector representations, which makes it possible to differentiate word meanings and establish precise connections between them. Using a corpus of publications from Dimensions AI and PubMed, the study demonstrates an improved distribution of terms within thematic clusters, drawing on frequency analysis and vector similarity. The results emphasize the effectiveness of an integrated approach to handling complex linguistic structures in automated text analysis.

Keywords: LDA method, thesauri, multivalued tokens, monolex tokens, Dimensions AI, PubMed.

UDC: 004.827

DOI: 10.33693/2313-223X-2025-12-2-19-27


