Bacteria Taxonomic Classification Using Machine Learning Models

  • Najah Abed Alhadi Shanan, Hussein Attya Lafta, Sura Z. Al_Rashid


Taxonomic classification of genomic sequences commonly depends on evolutionary distances obtained by alignment methods. This work introduces an alignment-free method based on probabilistic topic modeling for clustering, or on the Naïve Bayes algorithm for classification. Using k-mers (fragments of length k) extracted from DNA sequences and the Latent Dirichlet Allocation (LDA) algorithm, clusters are built for 16S rRNA bacterial sequences for different numbers of topics; alternatively, Naïve Bayes classifiers are trained on the k-mer fragments. To test the efficiency of the proposed model, it is evaluated with a cross-validation procedure on a data set of 1000 bacterial sequences belonging to the most numerous bacterial phyla, at four taxonomic levels: class, order, family and genus. Accuracy scores across the four levels range from 100% at the class level to 98% at the genus level, using k-mers of length 8. These results indicate the robustness of the proposed model.
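
The k-mer and Naïve Bayes step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper `kmer_counts`, the class `KmerNaiveBayes`, the smoothing parameter `alpha`, the toy sequences and the labels `ClassA` and `ClassB` are all hypothetical, and the toy example uses k = 3 rather than the k = 8 reported above so that the short sequences still yield shared k-mers.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count the overlapping k-mers (substrings of length k) in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts (illustrative sketch only)."""

    def __init__(self, k=8, alpha=1.0):
        self.k = k          # k-mer length
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, sequences, labels):
        # Number of training sequences per label, used for the class prior.
        self.class_sizes = Counter(labels)
        # Pooled k-mer counts of all training sequences with each label.
        self.class_kmers = {label: Counter() for label in self.class_sizes}
        for seq, label in zip(sequences, labels):
            self.class_kmers[label].update(kmer_counts(seq, self.k))
        # Vocabulary: every k-mer seen in any training sequence.
        self.vocab = set().union(*self.class_kmers.values())
        return self

    def predict(self, seq):
        query = kmer_counts(seq, self.k)
        n_train = sum(self.class_sizes.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.class_kmers.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of the query k-mers.
            score = math.log(self.class_sizes[label] / n_train)
            for kmer, n in query.items():
                if kmer not in self.vocab:
                    continue  # ignore k-mers never seen during training
                p = (counts[kmer] + self.alpha) / (total + self.alpha * vocab_size)
                score += n * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data: made-up sequences and labels, not taken from the paper's data set.
train_seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "GGCCGGCCTTGG", "GGCCGGACGGCC"]
train_labels = ["ClassA", "ClassA", "ClassB", "ClassB"]

model = KmerNaiveBayes(k=3).fit(train_seqs, train_labels)
print(model.predict("ACGTACGTTCGT"))
```

In a setting like the one the abstract describes, one such classifier would presumably be trained per taxonomic level (class, order, family, genus) and scored by cross-validation; the sketch omits both steps, as well as the LDA clustering variant.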