This is a Preprint and has not been peer reviewed. This is version 3 of this Preprint.
Authors
Abstract
Species identification using DNA barcodes has revolutionized biodiversity sciences and society at large. However, conventional barcoding methods may lack power and universal applicability across the Tree of Life, and alternative methods based on whole genome sequencing are hard to scale because of their large data requirements. Here, we develop a novel DNA-based identification method, varKoding, which uses exceptionally low-coverage genome skim data to create two-dimensional images representing the genomic signature of a species. Using these representations, we train neural networks for taxonomic identification. With a novel, taxonomically verified genomic dataset of Malpighiales plant accessions, we optimize training hyperparameters and find the highest performance by combining a transformer architecture with a new modified chaos game representation. Remarkably, >91% precision is achieved despite minimal input data, exceeding the alternative methods tested. We illustrate the broad utility of varKoding across several focal clades of eukaryotes and prokaryotes. We also train a model capable of identifying all species in NCBI SRA using less than 10 Mbp of sequencing data, with 96% precision and 95% recall, that is robust to sequencing platform. Enhanced computational efficiency and scalability, minimal data inputs robust to sequencing details, and modularity for further development make varKoding an ideal approach for biodiversity science.
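As a rough illustration of how sequence data can become a two-dimensional image, the sketch below builds a classic frequency chaos game representation from k-mer counts. It is not the modified representation developed in this preprint, and it is not taken from the varKoder code base; the function names (`kmer_to_pixel`, `cgr_image`), the corner convention, and the choice to skip k-mers containing ambiguous bases are all assumptions made for the example.

```python
from collections import Counter

# Assumed corner convention (several are in use): base -> (x_bit, y_bit)
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_pixel(kmer):
    """Map a k-mer to (x, y) in a 2^k by 2^k grid.

    The last base sets the most significant bit, so k-mers that share
    a suffix fall into the same quadrant, as in the classic chaos game.
    """
    x = y = 0
    for base in reversed(kmer):
        bx, by = CORNERS[base]
        x = (x << 1) | bx
        y = (y << 1) | by
    return x, y


def cgr_image(reads, k):
    """Count k-mers across reads and place the counts on the grid."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers with ambiguous bases such as N (an assumption).
            if all(base in CORNERS for base in kmer):
                counts[kmer] += 1

    size = 2 ** k
    image = [[0] * size for _ in range(size)]
    for kmer, n in counts.items():
        x, y = kmer_to_pixel(kmer)
        image[y][x] = n
    return image


if __name__ == "__main__":
    # Hypothetical toy reads, for illustration only.
    for row in cgr_image(["ACGTACGT", "TTGACA"], k=2):
        print(row)
```

In practice the counts would usually be normalized or rescaled to pixel intensities before being passed to an image classifier; that step is left out here because the scaling used by the authors is not described in the abstract.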
DOI
https://doi.org/10.32942/X24891
Subjects
Bioinformatics, Computational Biology, Genomics, Other Ecology and Evolutionary Biology
Keywords
biodiversity science, computer vision, DNA barcoding, Malpighiaceae, natural history collections, Neural Networks, Species identification, taxonomy
Dates
Published: 2024-01-18 06:01
Last Updated: 2024-12-11 22:23
License
Creative Commons Attribution-NonCommercial 4.0
Additional Metadata
Language:
English
Conflict of interest statement:
None
Data and Code Availability Statement:
The current version of varKoder is available at https://github.com/brunoasm/varKoder. A fastai model pre-trained on SRA data is available at https://huggingface.co/brunoasm/vit_large_patch32_224.NCBI_SRA. Open data is not available, pending manuscript peer review.