This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.
pynnotate: a flexible tool for retrieving and processing GenBank data in molecular evolution research and education
Authors
Abstract
Pynnotate is a Python-based tool for the automated retrieval, parsing, and extraction of annotated gene sequences from GenBank records. It addresses common challenges researchers face when working with GenBank data: inconsistent gene nomenclature, redundant sequences, and the need for standardised gene extraction across multiple taxa. Pynnotate offers both a graphical user interface and a command-line interface, making it accessible to users with varying levels of bioinformatics experience. Sequences can be retrieved from manually defined accession numbers or NCBI query terms, and three filtering modes are available: unconstrained (all sequences), strict (one sequence per species, prioritising gene completeness), and flexible (multiple sequences per species when they contribute different genes). Key features include synonym resolution for gene names, customisable sequence headers, metadata tracking, and automated extraction of genes into separate files. Built-in dictionaries cover animal and plant mitochondrial DNA, chloroplast DNA, and ribosomal DNA, and users can supply custom synonym dictionaries. The tool generates structured output, including FASTA files, metadata matrices, and detailed logs, which facilitates integration with downstream analyses. Designed for speed and scalability, pynnotate handles large datasets efficiently, allowing quick retrieval and extraction of annotated sequences across multiple taxa. Pynnotate is a resource for both research and teaching, and is particularly useful to educators running bioinformatics analyses with students who have limited command-line experience.
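To make the synonym resolution and the three filtering modes concrete, the following is a minimal illustrative sketch in plain Python. It is not pynnotate's actual code or API: the names `SYNONYMS`, `canonical_gene`, and `filter_records`, the record layout, and the alias lists are all hypothetical, and "gene completeness" is assumed here to mean the number of recognised genes a record carries.

```python
from collections import defaultdict

# Hypothetical synonym table (an assumption for illustration, not pynnotate's
# built-in dictionary): canonical gene name -> lower-case aliases.
SYNONYMS = {
    "COI": {"coi", "cox1", "co1", "coxi", "cytochrome c oxidase subunit i"},
    "CYTB": {"cytb", "cob", "cyt b", "cytochrome b"},
}

MODES = ("unconstrained", "strict", "flexible")


def canonical_gene(name):
    """Map a raw gene label to its canonical name, or None if unknown."""
    key = name.strip().lower()
    for canonical, aliases in SYNONYMS.items():
        if key in aliases:
            return canonical
    return None


def filter_records(records, mode="unconstrained"):
    """Filter records (dicts with 'accession', 'species', 'genes') by mode.

    Assumption: completeness is approximated by the count of recognised genes.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    # Resolve synonyms first so that records are compared on canonical names.
    resolved = []
    for rec in records:
        genes = {g for g in map(canonical_gene, rec["genes"]) if g}
        resolved.append({**rec, "genes": genes})

    if mode == "unconstrained":
        return resolved  # keep every sequence

    by_species = defaultdict(list)
    for rec in resolved:
        by_species[rec["species"]].append(rec)

    kept = []
    for recs in by_species.values():
        # Most complete record first; ties keep their input order (stable sort).
        recs.sort(key=lambda r: len(r["genes"]), reverse=True)
        if mode == "strict":
            kept.append(recs[0])  # one record per species
        else:  # flexible: keep a record only if it adds a gene not yet covered
            seen = set()
            for rec in recs:
                if rec["genes"] - seen:
                    kept.append(rec)
                    seen |= rec["genes"]
    return kept


records = [
    {"accession": "A1", "species": "Homo sapiens", "genes": ["COX1"]},
    {"accession": "A2", "species": "Homo sapiens", "genes": ["CO1"]},
    {"accession": "A3", "species": "Homo sapiens", "genes": ["cob"]},
    {"accession": "A4", "species": "Pan troglodytes", "genes": ["cytochrome b"]},
]
strict_ids = [r["accession"] for r in filter_records(records, "strict")]
flexible_ids = [r["accession"] for r in filter_records(records, "flexible")]
```

In this toy input, the strict mode keeps one record per species, while the flexible mode also keeps the second human record that contributes a gene the first one lacks. The accession labels are placeholders, not real GenBank identifiers.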
DOI
https://doi.org/10.32942/X2294V
Subjects
Bioinformatics, Ecology and Evolutionary Biology, Evolution
Keywords
bioinformatics, comparative genomics, feature extraction, molecular evolution, phylogenetics, Python, sequence annotation
Dates
Published: 2026-02-26 12:24
Last Updated: 2026-02-26 12:24
License
CC BY Attribution 4.0 International
Additional Metadata
Conflict of interest statement:
None.
Data and Code Availability Statement:
The ‘pynnotate’ public repository is available at https://github.com/fernandacaron/pynnotate.
Language:
English