Bacteria Taxonomic Classification Using Machine Learning Models
Abstract
Taxonomic classification of genomic sequences commonly depends on evolutionary distances obtained by alignment methods. This work introduces an alignment-free method based on probabilistic topic modeling for clustering, or on the Naïve Bayes algorithm for classification. Using k-mer (substrings of length k) decompositions of DNA sequences and the Latent Dirichlet Allocation (LDA) algorithm, clusters are built for 16S rRNA bacterial sequences for different numbers of topics; alternatively, Naïve Bayes classifiers are trained on the k-mer fragments. To test the efficiency of the proposed model, the classification model is evaluated with a cross-validation procedure on a bacterial data set of 1000 sequences belonging to the most numerous bacterial phyla, at four taxonomic levels: class, order, family and genus. With k-mers of length 8, the accuracy scores across the four levels range from 100% at the class level to 98% at the genus level. These results indicate the robustness of the proposed model.
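To make the k-mer and Naïve Bayes idea concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes a multinomial Naïve Bayes model with Laplace smoothing over overlapping k-mer counts. The names `kmers` and `KmerNaiveBayes`, the toy sequences, and the labels `class_A` and `class_B` are hypothetical, and k = 3 is used only because the toy strings are far shorter than real 16S rRNA sequences (the abstract reports results for k = 8).

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k from a DNA sequence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts (illustrative assumption)."""

    def __init__(self, k=8, alpha=1.0):
        self.k = k          # k-mer length
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, sequences, labels):
        self.label_counts = Counter(labels)  # label -> number of sequences
        self.kmer_counts = {}                # label -> Counter of k-mers
        for seq, label in zip(sequences, labels):
            counter = self.kmer_counts.setdefault(label, Counter())
            counter.update(kmers(seq, self.k))
        # Vocabulary: every k-mer observed in any training sequence.
        self.vocab = set()
        for counter in self.kmer_counts.values():
            self.vocab.update(counter)
        self.totals = {lab: sum(c.values()) for lab, c in self.kmer_counts.items()}
        return self

    def predict(self, seq):
        n_seqs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counter in self.kmer_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.label_counts[label] / n_seqs)
            denom = self.totals[label] + self.alpha * vocab_size
            for km in kmers(seq, self.k):
                if km in self.vocab:  # ignore k-mers never seen in training
                    score += math.log((counter[km] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data: made-up sequences and labels, for illustration only.
train_seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "TTGGCCTTGGCC", "TTGGCCTAGGCC"]
train_labels = ["class_A", "class_A", "class_B", "class_B"]

model = KmerNaiveBayes(k=3).fit(train_seqs, train_labels)
print(model.predict("ACGTACGTAC"))
print(model.predict("TTGGCCTTGG"))
```

In a real experiment one such classifier would be trained per taxonomic level (class, order, family, genus) and scored by cross-validation; the sketch omits both steps, as well as the LDA clustering branch.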