Constructing a machine-readable dictionary based on Russian Wiktionary
A. A. Krizhanovsky St. Petersburg Institute for Informatics and Automation of RAS
Abstract:
Thanks to its large number of articles and its many-sided description of words, Wiktionary is an important linguistic resource for tasks such as information search, ontology alignment, word sense disambiguation, spell checking, and machine translation.
This paper addresses the practical questions of extracting data from Wiktionary. Wiktionary is a multilingual, web-based project to create a free-content dictionary, available in over 151 languages (Russian Wiktionary alone covers more than 300 languages).
To store the lexicographic data extracted from Russian Wiktionary, (1) a database structure (tables and relations) was designed and (2) an application programming interface to this database was developed.
The structure of the developed database corresponds to the parts of a Wiktionary article. The application programming interface supports reading, writing, and searching data in this database.
A graphical user interface was also implemented to present word cards to the user. The paper is devoted to the creation of a machine-readable dictionary based on data from Russian Wiktionary.
It should be noted that (1) other language editions of Wiktionary are outside the scope of this paper and (2) only a small part of the lexicographic information in Russian Wiktionary texts has been extracted and stored in the machine-readable dictionary. Extraction of pronunciations (phonetic transcriptions, sound samples), hyphenation, etymologies, quotations (example sentences), parallel texts (examples with translations), and figures (which illustrate word meanings) was not considered, because this work is a first step towards the creation of open-source Wiktionary parser software.
Keywords:
machine-readable dictionary, lexicography, information retrieval, wiki.
UDC:
004.912
Received: 10.12.2009