This is a Preprint and has not been peer reviewed. This is version 2 of this Preprint.
Anomaly detection in metabarcoding sequences using an LSTM-CNN deep neural network ensemble (MetAnoDe)
Abstract
Metabarcoding has emerged as a critical tool in ecology and other scientific disciplines, facilitating species identification in mixed samples for biodiversity monitoring, community and microbiome analysis, dietary studies, and understanding species interactions. However, challenges arise from errors and artifacts introduced during sampling and laboratory processes such as PCR and sequencing. Manual inspection is impractical due to the vast number of sequences, necessitating rapid algorithms for data cleanup. Thorough bioinformatic processing can reduce such errors through removal of low-quality or non-target sequences using quality-, abundance-, and alignment-based approaches. However, in practice, some anomalous sequences evade detection, while valid sequences may also be incorrectly removed.
Deep neural networks (DNNs) offer a promising complementary solution to alignment-based methods by recognizing complex DNA sequence patterns. This study introduces MetAnoDe (Metabarcoding Anomaly Detection), a software workflow combining LSTM and CNN models within an ensemble framework. MetAnoDe employs an alignment-free approach that complements existing tools, enhancing metabarcoding data cleanup efficiency. Cross-validation and independent real-world dataset testing demonstrated high classification accuracy across both bacterial 16S-V4 and plant ITS2 markers. The ensemble model achieved validation accuracies of up to 97%, while also identifying substantial proportions of anomalous sequences not detected by current alignment-based workflows. The software additionally supports automated generation of new models for other metabarcoding markers.
In conclusion, MetAnoDe enhances metabarcoding data cleanup by efficiently identifying anomalous sequences. Combining deep-learning and traditional bioinformatic approaches improves identification of residual non-target reads, thereby increasing robustness and reliability of downstream biodiversity analyses.
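As a rough illustration of the alignment-free ensemble idea described in the abstract, the sketch below one-hot encodes DNA sequences to a fixed length and combines anomaly probabilities from two models by simple averaging. This is not MetAnoDe's implementation: the function names `one_hot` and `soft_vote`, the averaging rule, the padding scheme, and the 0.5 threshold are all assumptions made for illustration, and the probability values stand in for the outputs of trained LSTM and CNN models.

```python
from typing import List, Sequence

# Channel order for the one-hot encoding (an assumption for this sketch).
# Any other symbol, such as the ambiguity code N, maps to an all-zero row.
ALPHABET = "ACGT"


def one_hot(seq: str, length: int) -> List[List[int]]:
    """Encode a DNA string as `length` rows of 4-element one-hot vectors.

    Sequences longer than `length` are truncated; shorter ones are
    padded on the right with all-zero rows.
    """
    rows = [
        [1 if base == letter else 0 for letter in ALPHABET]
        for base in seq.upper()[:length]
    ]
    rows.extend([0, 0, 0, 0] for _ in range(length - len(rows)))
    return rows


def soft_vote(
    p_lstm: Sequence[float],
    p_cnn: Sequence[float],
    threshold: float = 0.5,
) -> List[bool]:
    """Average two models' anomaly probabilities and flag each sequence.

    Returns True where the mean probability reaches `threshold`.
    Plain averaging is a hypothetical combination rule, not necessarily
    the one used by MetAnoDe.
    """
    if len(p_lstm) != len(p_cnn):
        raise ValueError("both models must score the same number of sequences")
    return [(a + b) / 2 >= threshold for a, b in zip(p_lstm, p_cnn)]


# Hypothetical example: one short read and made-up model outputs.
encoded = one_hot("ACGTN", 6)
flags = soft_vote([0.9, 0.2, 0.6], [0.7, 0.1, 0.3])
```

In this example the third sequence is scored 0.6 by one model and 0.3 by the other, so the averaged score falls below the threshold and the sequence is not flagged. Averaging lets a confident model be tempered by a doubtful one.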
DOI
https://doi.org/10.32942/X2792N
Subjects
Life Sciences
Keywords
machine learning, microbiome, metabarcoding, 16S, ITS2, outlier detection, convolutional neural network, long short-term memory, recurrent neural networks
Dates
Published: 2025-03-20 12:06
Last Updated: 2026-05-28 01:02
License
CC-BY Attribution-NonCommercial 4.0 International
Additional Metadata
Data and Code Availability Statement:
https://github.com/chiras/MetAnoDe
Language:
English