Sistemy i Sredstva Informatiki [Systems and Means of Informatics]

Sistemy i Sredstva Inform., 2024 Volume 34, Issue 4, Pages 73–84 (Mi ssi957)

This article is cited in 1 paper

Developing the structure of supracorpora databases

A. A. Goncharov

Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119133, Russian Federation

Abstract: The paper presents methods for developing the structure of supracorpora databases so that the results of parallel text analysis can be represented in more detail. The initial data structure for annotating translation correspondences is examined, and four methods for improving it are described. These methods make it possible ($i$) to mark up text blocks in the original and in the translation in more detail; ($ii$) to classify the features of a text block using multiple facets; ($iii$) to store data about lexical markers of text block features; and ($iv$) to store data about the irrelevance of text fragment pairs to a search query. Together, these capabilities help improve the quality of the final data in terms of completeness and consistency, and the corresponding changes can make the data structure more flexible. The proposed changes to the data structure are independent of the goals and objectives of any specific study that may be conducted using supracorpora databases.

Keywords: supracorpora database, parallel texts, text annotation, corpus linguistics.
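
The four capabilities listed in the abstract can be pictured with a small Python sketch. This is an illustration only, not the schema described in the paper: the class names (TextBlock, CorrespondencePair), the field names, the facet labels, and the English-French sentence pair are all hypothetical and were chosen just to show how one annotation record might carry each kind of information.

```python
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    """A marked-up span inside one side of a parallel fragment (hypothetical)."""
    text: str
    start: int  # character offset of the block within its fragment
    end: int    # exclusive end offset
    # (ii) several independent facets classify the block's features
    facets: dict[str, str] = field(default_factory=dict)
    # (iii) lexical markers that signal those features
    markers: list[str] = field(default_factory=list)


@dataclass
class CorrespondencePair:
    """One original/translation fragment pair with its annotation (hypothetical)."""
    source_fragment: str
    target_fragment: str
    # (i) any number of blocks may be marked on each side
    source_blocks: list[TextBlock] = field(default_factory=list)
    target_blocks: list[TextBlock] = field(default_factory=list)
    # (iv) a pair can be recorded as irrelevant to the search query
    irrelevant_to_query: bool = False


def make_block(fragment, phrase, facets=None, markers=None):
    """Locate `phrase` in `fragment` and wrap it as a TextBlock."""
    start = fragment.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in fragment")
    return TextBlock(
        text=phrase,
        start=start,
        end=start + len(phrase),
        facets=dict(facets or {}),
        markers=list(markers or []),
    )


# Illustrative data only; the facet names and values are invented.
source = "He came, but she had left."
target = "Il est venu, mais elle était partie."

pair = CorrespondencePair(source_fragment=source, target_fragment=target)
pair.source_blocks.append(
    make_block(source, "but",
               facets={"relation": "contrast", "position": "clause-initial"},
               markers=["but"])
)
pair.target_blocks.append(
    make_block(target, "mais",
               facets={"relation": "contrast", "position": "clause-initial"},
               markers=["mais"])
)

# A second pair returned by the same query but judged irrelevant to it.
noise = CorrespondencePair(
    source_fragment="Everyone but him agreed.",
    target_fragment="Tout le monde sauf lui était d'accord.",
    irrelevant_to_query=True,
)
```

Keeping facets as a mapping rather than a single category field is one way to let a block be classified along several dimensions at once; an actual supracorpora database would more likely express this with relational tables, which this sketch does not attempt to reproduce.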

Received: 15.09.2024

DOI: 10.14357/08696527240406


