pynnotate: a flexible tool for retrieving and processing GenBank data in molecular evolution research and education

Fernanda S. Caron; Felipe de M. Magalhães; Matheus Salles; Fabricius M. C. B. Domingos

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Abstract

Pynnotate is a Python-based tool for the automated retrieval, parsing, and extraction of annotated gene sequences from GenBank records. It addresses common challenges researchers face when working with GenBank data, including inconsistent gene nomenclature, redundant sequences, and the need for standardised gene extraction across multiple taxa. Pynnotate offers both a graphical user interface and a command-line interface, making it accessible to users with varying levels of bioinformatics experience. Sequences can be retrieved either from manually defined accession numbers or from NCBI query terms, and three filtering modes are available: unconstrained (all sequences), strict (one sequence per species, prioritising gene completeness), and flexible (multiple sequences per species when they contribute different genes). Key features include synonym resolution for gene names, customisable sequence headers, metadata tracking, and automated extraction of each gene into a separate file. Built-in synonym dictionaries cover animal and plant mitochondrial DNA, chloroplast DNA, and ribosomal DNA, and users can also supply their own custom dictionaries. The tool generates structured output, including FASTA files, metadata matrices, and detailed logs, which facilitates integration with downstream analyses. Designed for speed and scalability, pynnotate handles large datasets efficiently, allowing quick retrieval and extraction of annotated sequences across multiple taxa. Finally, pynnotate is a valuable resource for both research and education, particularly for educators conducting bioinformatics analyses with students who have limited command-line experience.
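
The synonym resolution and the three filtering modes described in the abstract can be illustrated with a minimal Python sketch. This is not pynnotate's code or API: the names `SYNONYMS`, `canonical_gene`, and `select_records`, the record layout, and the example accessions are all hypothetical, and "gene completeness" is approximated here simply by the number of distinct genes a record carries.

```python
from collections import defaultdict

# Hypothetical synonym dictionary (NOT pynnotate's built-in one):
# canonical gene name -> aliases seen in GenBank annotations.
SYNONYMS = {
    "COI": {"coi", "cox1", "co1", "coxi"},
    "CYTB": {"cytb", "cob", "cyt b"},
}


def canonical_gene(name):
    """Map a gene label to its canonical name; unknown labels are upper-cased."""
    key = name.strip().lower()
    for canonical, aliases in SYNONYMS.items():
        if key in aliases:
            return canonical
    return name.strip().upper()


def select_records(records, mode="unconstrained"):
    """Return the accessions kept under a given filtering mode.

    records: list of (accession, species, [gene labels]) tuples.
    Assumption: a record with more distinct genes is "more complete".
    """
    if mode not in {"unconstrained", "strict", "flexible"}:
        raise ValueError(f"unknown mode: {mode}")

    normalised = [
        (acc, species, {canonical_gene(g) for g in genes})
        for acc, species, genes in records
    ]
    if mode == "unconstrained":
        return [acc for acc, _, _ in normalised]

    by_species = defaultdict(list)
    for record in normalised:
        by_species[record[1]].append(record)

    kept = []
    for species_records in by_species.values():
        # Most gene-rich record first; sorted() is stable, so ties keep input order.
        ranked = sorted(species_records, key=lambda r: len(r[2]), reverse=True)
        if mode == "strict":
            kept.append(ranked[0][0])
        else:  # flexible: keep a record only if it adds a gene not yet covered
            covered = set()
            for acc, _, genes in ranked:
                if genes - covered:
                    kept.append(acc)
                    covered |= genes
    return kept


# Made-up example records (accessions are placeholders, not real GenBank IDs).
records = [
    ("A1", "Homo sapiens", ["cox1", "cob"]),
    ("A2", "Homo sapiens", ["CO1"]),
    ("A3", "Homo sapiens", ["ND2"]),
    ("B1", "Pan troglodytes", ["cytb"]),
]
print(select_records(records, "strict"))    # ['A1', 'B1']
print(select_records(records, "flexible"))  # ['A1', 'A3', 'B1']
```

In this sketch, record A2 is dropped in flexible mode because its only gene, once resolved to COI, is already covered by A1, whereas A3 is kept because it contributes a gene not found in A1.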

DOI

https://doi.org/10.32942/X2294V

Subjects

Bioinformatics, Ecology and Evolutionary Biology, Evolution

Keywords

bioinformatics, comparative genomics, feature extraction, molecular evolution, phylogenetics, Python, sequence annotation

Dates

Published: 2026-02-27 02:24

Last Updated: 2026-02-27 02:24

License

CC BY Attribution 4.0 International

Additional Metadata

Conflict of interest statement:
None.

Data and Code Availability Statement:
The ‘pynnotate’ public repository is available at https://github.com/fernandacaron/pynnotate.

Language:
English