Abstract:
This paper considers the problem of extracting filled-in elements (fields) from a recognized document image using descriptors, that is, descriptions of one or more structural elements. Structural elements may be words of static text or the line segments that form the document's layout. We consider business documents with a simplified structure and a limited vocabulary, as well as flexible business documents that allow significant modifications to the page design. Descriptors are designed to tolerate a significant number of possible errors in page recognition. Combined descriptors consisting of several terms and line segments are described, and a descriptor-based binding algorithm is given. Experiments show that using combined descriptors improves the accuracy of document field recognition by 17% and the accuracy of information extraction from the document image by 16%. The Smart Document Engine SDK was used for OCR in the experiments.