A composite universal DNA signature for the Tree of Life

This is a Preprint and has not been peer reviewed. This is version 5 of this Preprint.

Authors

Bruno A S de Medeiros, Liming Cai, Peter J Flynn, Yujing Yan, Xiaoshan Duan, Lucas C Marinho, Christiane Anderson, Charles Davis

Abstract

Species identification using DNA barcodes has revolutionized biodiversity sciences.
However, conventional barcoding methods may lack power and universal applicability
across the Tree of Life. Alternative methods based on whole genome sequencing are hard
to scale due to large data requirements. Here, we develop a novel DNA-based identification
method, varKoding, using exceptionally low-coverage genome skim data to create
two-dimensional images representing the genomic signature of a species. Using these
representations, we train neural networks for taxonomic identification. Using a novel,
taxonomically verified genomic dataset of Malpighiales plant accessions, we optimize
training hyperparameters and find the highest performance by combining a transformer
architecture with a new modified chaos game representation. Greater than 91% precision
is achieved despite minimal input data, exceeding alternative methods tested. We illustrate
the broad utility of varKoding across several focal clades of eukaryotes and prokaryotes.
We also train a model capable of identifying all species in NCBI SRA using less than 10 Mbp
of sequencing data, with 96% precision and 95% recall, that is robust to sequencing platforms.
The varKoding approach offers enhanced computational efficiency and scalability, minimal
data inputs robust to sequencing details, and modularity for further development in
biodiversity science.
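
To make the idea of a two-dimensional genomic signature concrete, the following is a minimal sketch of the classical frequency chaos game representation, in which each k-mer count is placed in one cell of a 2^k x 2^k grid. It is not the modified representation developed in this work, whose details the abstract does not give. The corner assignment, the function names `kmer_to_cell` and `fcgr`, and the choice to skip k-mers containing ambiguous bases are all assumptions made for illustration.

```python
# Classical frequency chaos game representation (FCGR), illustrative only.
# Assumption: this corner layout is one common convention, not necessarily
# the one used by varKoder.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to (row, col) in a 2^k x 2^k grid.

    In the chaos game the point moves halfway toward the corner of each
    successive base, so the LAST base selects the coarsest quadrant.
    Iterating over the reversed k-mer makes it the most significant bit.
    """
    row = col = 0
    for base in reversed(kmer):
        x, y = CORNERS[base]
        row = (row << 1) | y
        col = (col << 1) | x
    return row, col


def fcgr(sequence, k):
    """Count every k-mer in `sequence` and store counts in a square grid."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if any(base not in CORNERS for base in kmer):
            continue
        r, c = kmer_to_cell(kmer)
        grid[r][c] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("ACGTACGTTTGA", 2):
        print(row)
```

A grid like this can be normalized and rendered as a grayscale image, which is the kind of input an image-classification network can be trained on. A real pipeline would count k-mers across many reads rather than one short string.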

DOI

https://doi.org/10.32942/X24891

Subjects

Bioinformatics, Computational Biology, Genomics, Other Ecology and Evolutionary Biology

Keywords

biodiversity science, computer vision, DNA barcoding, Malpighiaceae, natural history collections, neural networks, species identification, taxonomy

Dates

Published: 2024-01-18 11:01

Last Updated: 2025-04-23 13:58

License

Creative Commons Attribution-NonCommercial 4.0

Additional Metadata

Conflict of interest statement:
None

Data and Code Availability Statement:
The current version of varKoder is available at https://github.com/brunoasm/varKoder. A fastai model pre-trained on SRA data is available at https://huggingface.co/brunoasm/vit_large_patch32_224.NCBI_SRA. Open data is not available, pending manuscript peer review.

Language:
English