Abstract:
A large amount of data is currently presented in tabular form, and tables are widely used to solve practical problems across many domains. Specialized methods and software are being developed for the semantic interpretation (annotation) of tables and for the construction of knowledge graphs from them. Effective testing of such software requires creating and using Russian-language datasets. This paper proposes RF-200, a Russian-language tabular dataset containing 200 tables from 26 domains, labeled using the Talisman platform. We report the results of evaluating our approach to fact extraction from Russian-language tables on RF-200: it reached an F1 score of 0.464, surpassing traditional methods of fact extraction from texts (F1 = 0.277). These results emphasize the importance of specialized solutions for working with structured data, especially for Russian-language sources. The practical significance of the work lies in the integration of the approach into the Talisman platform, which expands the platform's capabilities for semantic analytics on tables. The study contributes to the automation of table processing, addresses the problem of semantic interpretation in the context of linguistic diversity, and opens up prospects for integrating deep learning methods and scaling up the created dataset.