Abstract:
The article studies the recognition of special structural segments of genomes called promoters. Machine learning methods based on logical analysis and data classification are applied to promoter recognition for the first time. These methods search for informative fragments in the feature descriptions of precedents (training examples) and are designed for integer data that takes a small number of distinct values. The fragments found are readily interpretable and make it possible to distinguish promoters from other regions of the genome; however, the search for them is time-consuming. Experimental results on a large, imbalanced sample are presented for two approaches: the traditional method of feature formation using $k$-mers, and direct application of the logical classifier to the original data. In the second case the quality of logical classification is significantly higher, reaching a ROC-AUC of 94.3% with the ensemble approach. The best result, a ROC-AUC of 95.1%, was obtained by the CatBoost classifier applied directly to the original sample. With the traditional method of feature generation, CatBoost reaches 94.8%.
Keywords: gene promoter prediction, machine learning, supervised classification, logical classifier, logical analysis of data, ensemble of classifiers, $k$-mer, model organism, Drosophila melanogaster.
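
To make the two feature-formation schemes mentioned in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a four-letter nucleotide alphabet, overlapping $k$-mers counted in lexicographic order, and an arbitrary mapping of nucleotides to the integers 0 to 3 for the direct representation. The function names `kmer_counts` and `direct_encoding` are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumption: sequences contain only the four standard nucleotides.
ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Traditional scheme: count overlapping k-mers in a sequence.

    Returns a vector of length 4**k, ordered lexicographically over ALPHABET.
    """
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[kmer] for kmer in all_kmers]


def direct_encoding(seq):
    """Direct scheme: one small integer per position (A=0, C=1, G=2, T=3).

    The particular integer assignment is an illustrative assumption.
    """
    code = {nucleotide: i for i, nucleotide in enumerate(ALPHABET)}
    return [code[nucleotide] for nucleotide in seq]


if __name__ == "__main__":
    example = "ACGTAC"  # toy sequence, not real promoter data
    print(kmer_counts(example, 2))
    print(direct_encoding(example))
```

The direct encoding keeps positional information and yields features with only four distinct values, the kind of small-valued integer data that the logical classifiers described above are designed for. The $k$-mer vector discards position and grows as $4^k$.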