BOLDistilled: Comprehensive but compact DNA barcode reference libraries

This is a Preprint and has not been peer reviewed. This is version 2 of this Preprint.

Authors

Sean William John Prosser, Robin M Floyd, Ken A Thompson, Paul DN Hebert

Abstract

Advances in DNA sequencing technology have stimulated the rapid uptake of protocols—such as eDNA analysis and metabarcoding—that infer the species composition of environmental samples from DNA sequences. DNA barcode reference libraries play a critical role in the interpretation of sequences gathered through such protocols, but many lack adequate taxonomic curation, include redundant records, do not support end-user analytical pipelines, and are not permanently archived in repositories. Furthermore, because DNA sequencers are outpacing Moore’s Law and reference libraries are rapidly expanding, the computational power required to assign sequences to source taxa increases yearly. To address these limitations while also providing access to anonymized private data from the Barcode of Life Data System (BOLD), we introduce an algorithmic approach to construct DNA barcode reference libraries that overcome the above issues. Hosted online, ‘BOLDistilled’ libraries are comprehensive but compact, because the algorithm distills genetic variation into a minimal set of records. We generated a BOLDistilled library for the barcode region of the cytochrome c oxidase 1 gene (COI) based on all data in BOLD. This library contains 1.2M records versus 17.5M in the complete library, a compression which reduced the time required for sequence analysis of metabarcoded samples by ≥98% with no reduction in the accuracy of taxonomic placements. BOLDistilled libraries will be updated routinely, with the current version and all previous versions available at boldsystems.org/BOLDistilled. By providing access to persistent, comprehensive, and high-quality reference data, BOLDistilled libraries will strengthen the capacity of DNA-based identification systems to advance biodiversity science.

DOI

https://doi.org/10.32942/X2DG9K

Subjects

Biodiversity, Bioinformatics, Ecology and Evolutionary Biology, Life Sciences

Keywords

DNA barcoding, metabarcoding, bioinformatics, molecular ecology, biodiversity

Dates

Published: 2025-04-21 19:32

Last Updated: 2025-04-21 19:32

License

CC-BY Attribution-NonCommercial 4.0 International

Additional Metadata

Conflict of interest statement:
None

Data and Code Availability Statement:
This study contains no original data. The genetic resource resulting from our study is available on Dryad (doi: 10.5061/dryad.k98sf7mjd) under a CC BY-NC-ND license.

Language:
English