Abstract:
Thematic modeling is an essential tool for analyzing large volumes of textual data, enabling the identification of latent semantic patterns. However, conventional approaches such as Latent Dirichlet Allocation (LDA) encounter difficulties when dealing with multi-valued and unigram tokens, resulting in reduced accuracy and clarity in the outcomes. This study aims to develop a technique for constructing a thematic structure based on refined LDA, which incorporates contextual features, vector representations of words, and external vocabularies. The objective is to address terminological ambiguity and enhance the clarity of thematic groups. The paper employs a mathematical model that integrates probabilistic thematic modeling with vector representations, facilitating the differentiation of word meanings and the establishment of precise connections between them. Using the corpus of Dimensions AI and PubMed publications, the study demonstrates an improved distribution of terms within thematic clusters. This involves frequency analysis and vector similarity, which are essential components of the study. The results emphasize the effectiveness of an integrated approach to dealing with complex linguistic structures in automated text analysis.