I. O. Buyanov, D. V. Yaskova, D. S. Serenko, D. N. Shkereda, A. D. Yaskov, I. V. Sochenkov, “The methodology of constructing the large-scale dataset for detecting presuicidal and anti-suicidal signals in social media texts in Russian”, Proceedings of ISP RAS, 2025, Volume 37, Issue 6(2), Pages 191

The methodology of constructing the large-scale dataset for detecting presuicidal and anti-suicidal signals in social media texts in Russian

I. O. Buyanov^a, D. V. Yaskova^b, D. S. Serenko^a, D. N. Shkereda^a, A. D. Yaskov^c, I. V. Sochenkov^ade

^a Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences
^b MTS
^c Company "Yandex"
^d Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)
^e Ivannikov Institute for System Programming of the RAS

Abstract: Suicide is a terrifying act committed by a person misled by their own mental state. The problem arises in many countries, and Russia also has a rather high number of suicide deaths. Some of these people write about their struggles on social media, which offers a way to find and help them. However, these valuable texts are lost among many irrelevant ones, which considerably slows down decisions about a person's suicidal risk. To tackle this problem, we present a detailed methodology for building a dataset for detecting texts that describe presuicidal and anti-suicidal signals. The methodology covers the creation of the annotation instructions and the class table, the annotation process, verification, and post-annotation correction. Following this methodology, we collected and annotated a large-scale Russian dataset of more than 50 thousand social media texts. We report count statistics for the dataset and describe common annotation problems. We also conduct basic classification experiments to show the baseline performance at different levels of annotation. Furthermore, we make the dataset, code, and all materials publicly available.

Keywords: dataset construction, suicide, methodology, annotation

Language: English

DOI: 10.15514/ISPRAS-2025-37(6)-29