
Proceedings of ISP RAS, 2019 Volume 31, Issue 5, Pages 127–136 (Mi tisp458)

This article is cited in 3 papers

Cross-lingual similar document retrieval methods

D. V. Zubarev, I. V. Sochenkov

Federal Research Center «Computer Science and Control» of Russian Academy of Sciences

Abstract: In this paper, we compare methods for cross-lingual similar document retrieval, focusing on the Russian-English language pair. We compare well-known methods such as Cross-Lingual Explicit Semantic Analysis (CL-ESA) with methods based on cross-lingual embeddings. We use approximate nearest neighbor (ANN) search to retrieve documents based entirely on distances between learned document embeddings. We also employ a more traditional approach that uses an inverted index, with an extra step of mapping the top keywords from one language to the other with the help of cross-lingual word embeddings. We evaluate all approaches on Russian-English aligned Wikipedia articles. The experiments show that the inverted-index approach achieves better recall and MAP than the other methods.
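
The keyword-mapping step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that source-language and target-language word vectors already live in one shared space and that each keyword is mapped to its nearest target-language words by cosine similarity. The function names (`cosine`, `map_keywords`), the `top_n` parameter, and the three-dimensional toy vectors are hypothetical and chosen only for illustration; they are not taken from the paper, which may select and map keywords differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def map_keywords(keywords, src_vectors, tgt_vectors, top_n=1):
    """Map each source keyword to its top_n nearest target-language words.

    Assumes both vector tables share one cross-lingual embedding space.
    Keywords without a source vector are skipped.
    """
    mapped = {}
    for word in keywords:
        if word not in src_vectors:
            continue
        ranked = sorted(
            tgt_vectors,
            key=lambda t: cosine(src_vectors[word], tgt_vectors[t]),
            reverse=True,
        )
        mapped[word] = ranked[:top_n]
    return mapped

# Hand-made toy vectors (illustrative only, not real embeddings).
RU = {"кошка": [0.9, 0.1, 0.0], "поиск": [0.0, 0.2, 0.9]}
EN = {
    "cat": [0.8, 0.2, 0.1],
    "dog": [0.6, 0.7, 0.0],
    "search": [0.1, 0.1, 0.95],
}

# Top keywords of a Russian document, mapped into English query terms.
query_terms = map_keywords(["кошка", "поиск"], RU, EN)
print(query_terms)
```

In a setup like the one the abstract describes, the mapped terms would then serve as a query against an inverted index built over the target-language collection; that lookup is left out of this sketch.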

Keywords: cross-lingual document retrieval, cross-lingual plagiarism detection, cross-lingual word embeddings.

Language: English

DOI: 10.15514/ISPRAS-2019-31(5)-9


