
Proceedings of ISP RAS, 2020 Volume 32, Issue 1, Pages 137–152 (Mi tisp490)

Effective implementations of topic modeling algorithms

M. A. Apishev

Lomonosov Moscow State University

Abstract: Topic modeling is an area of natural language processing that has been actively developed over the last 15 years. A probabilistic topic model extracts a set of hidden topics from a collection of text documents: it defines each topic by a probability distribution over words and describes each document by a probability distribution over topics. The rapidly growing volume of text data motivates the community to continually adapt topic modeling algorithms to multiprocessor systems. In this paper, we provide an overview of efficient EM-like algorithms for learning latent Dirichlet allocation (LDA) and additively regularized topic models (ARTM). First, we review 11 techniques for efficient topic modeling based on synchronous and asynchronous parallel computing, distributed data storage, streaming, batch processing, RAM optimization, and fault tolerance improvements. Second, we review 14 efficient implementations of topic modeling algorithms proposed in the literature over the past 10 years, which use different combinations of the techniques above. Their comparison shows that no single implementation is universally best. All of the described improvements are applicable to every kind of topic modeling algorithm considered: PLSA, LDA, and ARTM models learned with MAP, VB, or GS inference.
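
For reference, the model structure shared by the surveyed algorithms can be written out explicitly (a sketch in the standard PLSA/LDA/ARTM notation, not taken verbatim from the paper). Each document $d$ is modeled as a mixture of topics,

p(w \mid d) = \sum_{t \in T} \varphi_{wt}\, \theta_{td}, \qquad \varphi_{wt} = p(w \mid t), \quad \theta_{td} = p(t \mid d),

and an EM-like iteration alternates the E-step

p(t \mid d, w) = \frac{\varphi_{wt}\, \theta_{td}}{\sum_{s \in T} \varphi_{ws}\, \theta_{sd}}

with M-step counter updates $\varphi_{wt} \propto \sum_{d} n_{dw}\, p(t \mid d, w)$ and $\theta_{td} \propto \sum_{w \in d} n_{dw}\, p(t \mid d, w)$; ARTM additionally adds regularizer gradients to these counters. The parallelization, batching, and storage techniques reviewed in the paper all exploit this E/M structure: the E-step decomposes over documents (and hence over batches), while the M-step only merges counters.

The following Python sketch illustrates one such batched EM scheme under these assumptions; the names process_batch, m_step and the parameter n_inner are illustrative and do not come from the paper or from any particular library.

import numpy as np

def process_batch(batch, phi, n_inner=10):
    # E-step over one batch: each document is a (word_ids, counts) pair
    # in bag-of-words form; phi is the W x T matrix of p(w|t).
    # Returns this batch's contribution to the topic-word counters n_wt,
    # which can be computed independently (and in parallel) for every batch.
    W, T = phi.shape
    n_wt = np.zeros((W, T))
    for word_ids, counts in batch:
        theta = np.full(T, 1.0 / T)              # p(t|d), re-estimated per document
        for _ in range(n_inner):
            p_tdw = phi[word_ids] * theta        # unnormalized p(t|d,w)
            p_tdw /= p_tdw.sum(axis=1, keepdims=True)
            theta = (counts[:, None] * p_tdw).sum(axis=0)
            theta /= theta.sum()
        n_wt[word_ids] += counts[:, None] * p_tdw
    return n_wt

def m_step(batch_counters):
    # M-step: merge the per-batch counters and renormalize columns into a new phi.
    n_wt = sum(batch_counters)
    return n_wt / n_wt.sum(axis=0, keepdims=True)

In a synchronous scheme the M-step waits for all batch counters before merging; asynchronous variants merge counters as batches arrive, which is where the consistency and fault tolerance issues discussed in the paper arise.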

Keywords: parallel algorithms, distributed data storage, stream data processing, fault tolerance, topic modeling, EM algorithm, latent Dirichlet allocation, additive regularization of topic models.

DOI: 10.15514/ISPRAS-2020-32(1)-8


