This is a Preprint and has not been peer reviewed. This is version 2 of this Preprint.
Abstract
Species identification using DNA barcodes has revolutionized biodiversity sciences and society at large. However, conventional barcoding methods do not reflect genomic complexity, may lack sufficient variation, and rely on a limited set of genomic loci that are not universal across the Tree of Life. Here, we develop a novel barcoding method that uses exceptionally low-coverage genome skim data to create a “varKode”, a two-dimensional image representing the genomic landscape of a species. We then train neural networks on these varKodes for precise taxonomic identification. Using an expertly annotated genomic dataset that includes hundreds of newly sequenced samples from the plant clade Malpighiales, we demonstrate >91% precision when identifying species or genera. Remarkably, accuracy remains high even with data amounts so small that alternative methods fail. We further illustrate the broad utility of varKodes across several focal clades of eukaryotes and prokaryotes. As a final test, we classify the entire NCBI eukaryote sequence-read archive to identify its 861 constituent families with >95% precision while using less than 10 Mbp of data per sample. Enhanced computational efficiency and scalability, minimal data inputs that are robust to degraded DNA, and modularity for further development make varKoding an ideal approach for biodiversity science.
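The abstract does not say how a varKode is built, so the following is only a generic sketch of the broad idea of turning sequence reads into a small two-dimensional image. It assumes the image is derived from k-mer counts laid out in lexicographic order and scaled to 0-255; the function name `kmer_image`, the choice of `k`, and the pixel layout are all hypothetical and should not be read as varKoder's actual encoding.

```python
from collections import Counter
from itertools import product


def kmer_image(reads, k=2):
    """Return a (2**k x 2**k) grid of k-mer counts scaled to 0-255.

    Hypothetical illustration only: layout and scaling are assumptions,
    not the encoding used by varKoder.
    """
    # All 4**k possible k-mers, in lexicographic order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1

    # Scale relative to the most frequent k-mer; an empty input gives all zeros.
    peak = max(counts.values(), default=0)
    flat = [round(255 * counts[km] / peak) if peak else 0 for km in kmers]

    # 4**k == (2**k)**2, so the flat list reshapes into a square grid.
    side = 2 ** k
    return [flat[r * side:(r + 1) * side] for r in range(side)]


if __name__ == "__main__":
    for row in kmer_image(["ACGTACGTNNACGT", "TTTTACGA"], k=2):
        print(row)
```

A grid like this could in principle be saved as a grayscale image and passed to an image classifier, which is the general workflow the abstract describes; the specifics of the real pipeline are in the varKoder repository.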
DOI
https://doi.org/10.32942/X24891
Subjects
Bioinformatics, Computational Biology, Genomics, Other Ecology and Evolutionary Biology
Keywords
biodiversity science, computer vision, DNA barcoding, Malpighiaceae, natural history collections, Neural Networks, Species identification, taxonomy
Dates
Published: 2024-01-18 09:01
Last Updated: 2024-04-18 20:34
License
Creative Commons Attribution-NonCommercial 4.0
Additional Metadata
Language:
English
Conflict of interest statement:
None
Data and Code Availability Statement:
The current version of varKoder is available at https://github.com/brunoasm/varKoder. A fastai model pre-trained on SRA data is available at https://huggingface.co/brunoasm/vit_large_patch32_224.NCBI_SRA. Open data is not available, pending manuscript peer review.