Abstract:
The task of automated morpheme segmentation for morphologically rich but low-resource languages, such as Belarusian, remains insufficiently studied. This paper presents the first large-scale comparative study on the effectiveness of modern neural network approaches to morpheme segmentation using Belarusian language data. We compared three approaches that have demonstrated high quality for other languages: algorithms based on convolutional neural networks (CNNs), algorithms based on LSTM networks, and fine-tuning of BERT-like models. Due to the limited availability of monolingual Belarusian models, we also included larger Russian and multilingual models in the comparison. The experiments were conducted on the openly available Slounik dataset using two strategies for splitting the data into training and test sets. In the first case, the split was random; in the second, words were split by their roots to ensure that words with the same root did not appear in both the training and test sets simultaneously. An ensemble of LSTM networks achieved the best performance in the experiments, with a word accuracy of 91.42% on the random split and 73.89% on the root-based split. Comparable results were demonstrated by fine-tuned multilingual and Russian BERT-like models, highlighting the potential of applying large models, including those trained on closely related and higher-resource languages, to this task. An analysis of the errors confirmed that, as with other Slavic languages, the majority of inaccuracies are related to the identification of root boundaries.
Keywords: natural language processing, automated morpheme segmentation, deep learning, Belarusian language, low-resource languages.