A universal DNA barcode for the Tree of Life

This is a Preprint and has not been peer reviewed. This is version 2 of this Preprint.

Authors

Bruno A S de Medeiros, Liming Cai, Peter J Flynn, Yujing Yan, Xiaoshan Duan, Lucas C Marinho, Christiane Anderson, Charles Davis

Abstract

Species identification using DNA barcodes has revolutionized biodiversity sciences and society at large. However, conventional barcoding methods do not reflect genomic complexity, may lack sufficient variation, and rely on limited genomic loci that are not universal across the Tree of Life. Here, we develop a novel barcoding method that uses exceptionally low-coverage genome skim data to create a “varKode”, a two-dimensional image representing the genomic landscape of a species. Using these varKodes, we then train neural networks for precise taxonomic identification. Applying an expertly annotated genomic dataset including hundreds of newly sequenced genomic samples from the plant clade Malpighiales, we demonstrate >91% precision when identifying species or genera. Remarkably, high accuracy remains despite minimal data amounts that lead to failure when applying alternative methods. We further illustrate the broad utility of varKodes across several focal clades of eukaryotes and prokaryotes. As a final test, we classify the entire NCBI eukaryote sequence-read archive to identify its 861 constituent families with >95% precision despite utilizing less than 10 Mbp of data per sample. Enhanced computational efficiency and scalability, minimal data inputs robust to degraded DNA, and modularity for further development make varKoding an ideal approach for biodiversity science.
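
The abstract does not specify how a varKode is computed, so the following is only an illustrative sketch of the general idea of turning raw sequence reads into a small two-dimensional image. It assumes, purely for illustration, that the image is built from counts of canonical k-mers laid out in alphabetical order; the function names `canonical` and `kmer_image`, the parameters `k` and `width`, and the scaling to 8-bit intensities are all hypothetical and should not be read as the varKoder implementation.

```python
from collections import Counter
from itertools import product

# Translation table used to build reverse complements
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement (hypothetical helper)."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_image(reads, k=3, width=8):
    """Count canonical k-mers in `reads` and arrange the counts as a 2D grid.

    Assumption: alphabetical pixel order and linear scaling to 0-255 are
    arbitrary choices made for this sketch, not the published method.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or other ambiguity codes
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1

    # Fixed ordering of all possible canonical k-mers, so that images
    # built from different samples are directly comparable
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})

    # Scale counts so the most frequent k-mer maps to intensity 255
    peak = max(counts.values(), default=0)
    pixels = [round(255 * counts[key] / peak) if peak else 0 for key in keys]

    # Pad the final row with zeros so the grid is rectangular
    pixels += [0] * (-len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


image = kmer_image(["ACGTACGT", "TTTTNACG"])
```

With `k=3` there are 32 canonical k-mers, so the toy example yields a 4 x 8 grid. An image of this kind could then be passed to any standard image classifier, which is the step the abstract describes as training neural networks on varKodes.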

DOI

https://doi.org/10.32942/X24891

Subjects

Bioinformatics, Computational Biology, Genomics, Other Ecology and Evolutionary Biology

Keywords

biodiversity science, computer vision, DNA barcoding, Malpighiaceae, natural history collections, Neural Networks, Species identification, taxonomy

Dates

Published: 2024-01-18 09:01

Last Updated: 2024-04-18 20:34

License

Creative Commons Attribution-NonCommercial 4.0

Additional Metadata

Language:
English

Conflict of interest statement:
None

Data and Code Availability Statement:
The current version of varKoder is available at https://github.com/brunoasm/varKoder. A fastai model pre-trained on SRA data is available at https://huggingface.co/brunoasm/vit_large_patch32_224.NCBI_SRA. Open data is not available, pending manuscript peer review.