Choices that matter: the impact of substitution models on machine learning-based species delimitation inference

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Matheus Salles, Fabricius Domingos

Abstract

The choice of nucleotide substitution model is a cornerstone of phylogenetic inference, influencing the accuracy of estimated evolutionary parameters and, by extension, demographic and species delimitation model selection. With the growing adoption of machine learning methods trained on simulated data, it remains unclear how the substitution model used to simulate the training data influences classifier performance and robustness when applied to empirical data, which are usually characterized by pervasive genomic heterogeneity. To address this gap, we conducted a controlled simulation study to evaluate the impact of substitution-model misspecification on supervised machine learning inference. We trained supervised classifiers on data simulated under three common substitution models (JC69, HKY, and GTR) and evaluated their performance in selecting the correct delimitation model from test datasets featuring a mixture of substitution processes across loci, a realistic scenario mimicking genomic heterogeneity. Our results demonstrate that classifiers trained under a single, simplistic substitution model generalized effectively to mixed-model test data, consistently identifying the true demographic model with high posterior probability (mean probability > 0.86 even with only 100 SNPs), with performance plateauing beyond 600–800 SNPs. Notably, the differences in accuracy among classifiers trained under JC69, HKY, or GTR were minimal, indicating that the demographic signal captured by the site frequency spectrum predominates over substitution-model artifacts within the tested parameter space. However, this robustness is context-dependent. We caution that some extreme, though realistic, evolutionary scenarios (such as deep divergence, strong among-site rate variation, or protein-coding data) likely exceed the conditions tested here and may severely degrade classifier performance. Furthermore, robust model selection does not imply accurate parameter estimation, as branch lengths and evolutionary rates remain sensitive to model misspecification. We conclude that for many practical applications in species delimitation, faster, computationally efficient training under simple models can be sufficient, provided it is coupled with rigorous validation, model-adequacy assessment, and an awareness of the limitations imposed by complex genomic data. Our findings offer a pragmatic framework for integrating phylogenetic model selection with modern ML workflows, balancing computational efficiency with biological rigor.
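
As background to the three substitution models named in the abstract: JC69, HKY, and GTR are nested, so JC69 is HKY with equal base frequencies and no transition/transversion bias, and HKY is GTR with only two classes of exchangeability. The following is a minimal illustrative sketch of that nesting, not the authors' simulation pipeline. The function names, the dictionary-based matrix layout, and the example value of kappa are assumptions made for this illustration.

```python
from itertools import combinations

BASES = "ACGT"
# Transitions are purine<->purine (A,G) and pyrimidine<->pyrimidine (C,T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def gtr_rate_matrix(exchange, freqs):
    """Build a GTR rate matrix Q with Q[a][b] = r_ab * pi_b for a != b.

    exchange: dict mapping frozenset({a, b}) to a symmetric exchangeability.
    freqs: dict mapping each base to its equilibrium frequency.
    The matrix is rescaled so the expected substitution rate is 1.
    """
    q = {a: {b: 0.0 for b in BASES} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                q[a][b] = exchange[frozenset((a, b))] * freqs[b]
        # Each row must sum to zero.
        q[a][a] = -sum(q[a][b] for b in BASES if b != a)
    mean_rate = -sum(freqs[a] * q[a][a] for a in BASES)
    return {a: {b: q[a][b] / mean_rate for b in BASES} for a in BASES}


def hky_rate_matrix(kappa, freqs):
    """HKY as a special case of GTR: transitions get kappa, transversions 1."""
    exchange = {
        frozenset(pair): (kappa if frozenset(pair) in TRANSITIONS else 1.0)
        for pair in combinations(BASES, 2)
    }
    return gtr_rate_matrix(exchange, freqs)


def jc69_rate_matrix():
    """JC69 as a special case of HKY: kappa = 1 and equal base frequencies."""
    return hky_rate_matrix(1.0, {b: 0.25 for b in BASES})


if __name__ == "__main__":
    equal = {b: 0.25 for b in BASES}
    for name, q in [("JC69", jc69_rate_matrix()),
                    ("HKY (kappa=2)", hky_rate_matrix(2.0, equal))]:
        print(name)
        for a in BASES:
            print("  ", a, [round(q[a][b], 3) for b in BASES])
```

Under JC69 every off-diagonal rate is identical, whereas HKY with kappa above 1 favours transitions over transversions. This is the kind of difference a classifier trained under one model would have to tolerate when applied to data generated under another.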

DOI

https://doi.org/10.32942/X2P665

Subjects

Bioinformatics, Computational Biology, Genetics and Genomics, Life Sciences, Molecular Genetics

Keywords

supervised learning, random forest, site frequency spectrum, phylogenetics, simulation

Dates

Published: 2026-06-17 04:35

Last Updated: 2026-06-17 04:35

License

No Creative Commons license

Additional Metadata

Data and Code Availability Statement:
The data underlying this article, including phylogenetic datasets, corresponding trees, input and output files for all analyses, and any other relevant supplementary files, are available on Zenodo at https://doi.org/10.5281/zenodo.17274456

Language:
English