Removing the bad apples: a simple bioinformatic method to improve loci-recovery in de novo RADseq data for non-model organisms

This is a Preprint and has not been peer reviewed. The published version of this Preprint is available: https://doi.org/10.1111/2041-210X.13562. This is version 1 of this Preprint.

Authors

José Cerca, Marius F. Maurstad, Nicolas Rochette, Angel Rivera-Colón, Niraj Rayamajhi, Julian Catchen, Torsten Struck

Abstract

The restriction site-associated DNA sequencing (RADseq) family of protocols involves digesting DNA and sequencing the regions flanking the cut sites, providing a cost- and time-efficient way of obtaining thousands of genomic markers. However, when working with non-model taxa with few genomic resources, optimizing the RADseq wet-lab protocol and bioinformatic tools can be challenging, and often results in allele dropout, in which a given RADseq locus is not sequenced in one or more individuals and so appears as missing data. Moreover, when datasets include divergent taxa, rates of dropout increase because restriction sites may be lost through mutation. Mitigating the impact of allele dropout is crucial, as missing data may lead to incorrect inferences in population genetics and phylogenetics. Here, we demonstrate a simple pipeline for optimizing RADseq datasets, which involves subsetting the data and analysing it at the population or species level. By running the software Stacks at this level, we were able to reliably identify and remove individuals with high levels of missing data (bad apples), which likely stem from artefacts in library preparation, DNA quality or sequencing. Removing the bad apples generally led to an increase in the number of loci and a decrease in missing data in the final datasets, thus improving the biological interpretability of the data.
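
The idea of screening for bad apples within each population can be illustrated with a minimal Python sketch. This is not the authors' Stacks-based pipeline: the function name `flag_bad_apples`, the input layout (one list of genotype calls per individual, with `None` marking a missing locus), the comparison against the population median and the `max_excess` cut-off of 0.2 are all assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median


def flag_bad_apples(genotypes, populations, max_excess=0.2):
    """Flag individuals whose missing-data fraction exceeds the median of
    their own population by more than `max_excess` (hypothetical cut-off).

    genotypes:   dict of sample name -> non-empty list of calls (None = missing)
    populations: dict of sample name -> population label
    """
    # Fraction of loci with no call, per individual.
    missing = {
        sample: sum(call is None for call in calls) / len(calls)
        for sample, calls in genotypes.items()
    }

    # Collect the fractions population by population.
    by_population = defaultdict(list)
    for sample, fraction in missing.items():
        by_population[populations[sample]].append(fraction)

    # Compare each individual only against its own population.
    flagged = [
        sample
        for sample, fraction in missing.items()
        if fraction - median(by_population[populations[sample]]) > max_excess
    ]
    return sorted(flagged)


# Toy data, invented for this example.
genotypes = {
    "popA_1": ["AA", "AT", None, "TT"],
    "popA_2": ["AA", "AA", "AT", "TT"],
    "popA_3": [None, None, None, "TT"],
    "popB_1": ["CC", None, "CG", "GG"],
    "popB_2": ["CC", "CG", None, "GG"],
}
populations = {
    "popA_1": "A", "popA_2": "A", "popA_3": "A",
    "popB_1": "B", "popB_2": "B",
}

flagged = flag_bad_apples(genotypes, populations)
print(flagged)  # ['popA_3']
```

Comparing each individual with its own population, rather than with the whole dataset, is meant to mirror the population-level analysis described above: loci lost through mutation in a divergent lineage then raise the missingness of every member of that lineage and do not, on their own, single out any one individual.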

DOI

https://doi.org/10.32942/osf.io/47tka

Subjects

Bioinformatics, Computational Biology, Genetics and Genomics, Genomics, Life Sciences

Keywords

genetics, genome, genomics, Library preparation, methods, Pipeline

Dates

Published: 2020-08-30 18:31

License

CC-By Attribution-ShareAlike 4.0 International