Abstract:
This paper provides a comprehensive review of contemporary methods for automatic cognate detection, integrating deep learning techniques with traditional linguistic analyses. The primary objective is to systematize existing architectures, assess their strengths and limitations, and propose an integrative model that combines phonetic, morphological, and semantic representations of lexical data. To this end, we critically analyze studies published between 2015 and 2025, retrieved from the arXiv repository via a specialized parser. The review addresses three core tasks: (1) evaluating the accuracy and robustness of Siamese convolutional neural networks (CNNs) and transformer-based models in transferring phonetic patterns across diverse language families; (2) comparing the effectiveness of orthographic metrics (e.g., LCSR, normalized Levenshtein distance, Jaro–Winkler similarity) with that of semantic embeddings (fastText, MUSE, VecMap, XLM-R); and (3) examining hybrid architectures that incorporate morphological layers and transitive modules for identifying partial cognates. Our findings indicate that a combination of phonetic modules (Siamese CNNs with transformers), morphological processing (a BiLSTM leveraging UniMorph data), and learnable semantic vectors yields the best accuracy and stability across various language pairs, including low-resource scenarios. We propose an integrative architecture capable of adapting to linguistic diversity and of measuring word relatedness effectively. The outcome of this research includes both an analytical report on state-of-the-art methods and a set of recommendations for advancing automated cognate detection in large-scale linguistic applications.
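As a point of reference for the orthographic baselines named above, the following is a minimal Python sketch of LCSR, normalized Levenshtein similarity, and Jaro–Winkler similarity. It follows the standard textbook definitions of these measures (e.g., the conventional Winkler prefix weight p = 0.1), not the exact implementation of any study covered in the review; function names and the example word pairs are illustrative assumptions.

```python
def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: |LCS(a, b)| / max(|a|, |b|)."""
    # Standard dynamic-programming table for LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)] / max(len(a), len(b), 1)


def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """1 - (edit distance / length of the longer string)."""
    prev = list(range(len(b) + 1))  # row-based DP to keep memory O(|b|)
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1.0 - prev[len(b)] / max(len(a), len(b), 1)


def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a common-prefix bonus (Winkler variant)."""
    if a == b:
        return 1.0
    # Characters count as matching if they agree within this sliding window.
    window = max(0, max(len(a), len(b)) // 2 - 1)
    matched_b = [False] * len(b)
    matches_a = []
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not matched_b[j] and b[j] == ca:
                matched_b[j] = True
                matches_a.append(ca)
                break
    matches_b = [cb for j, cb in enumerate(b) if matched_b[j]]
    m = len(matches_a)
    if m == 0:
        return 0.0
    # Transpositions: matched characters appearing in a different order.
    transpositions = sum(x != y for x, y in zip(matches_a, matches_b)) / 2
    jaro = (m / len(a) + m / len(b) + (m - transpositions) / m) / 3
    # Winkler bonus for a shared prefix of up to four characters.
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)


if __name__ == "__main__":
    # Illustrative cognate pair (German/English) and non-cognate-looking pair.
    for pair in [("nacht", "night"), ("fish", "piscis")]:
        print(pair, round(lcsr(*pair), 3),
              round(normalized_levenshtein_similarity(*pair), 3),
              round(jaro_winkler(*pair), 3))
```

All three functions return scores in [0, 1], which is what makes them directly comparable to the cosine similarities produced by the semantic-embedding methods discussed in the review.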