This is a Preprint and has not been peer reviewed. This is version 2 of this Preprint.
High Data Quality Enhances Microplastic Toxicity Prediction
Abstract
Unlike chemicals, microplastics (MPs) lack standardized identifiers, which limits the applicability of traditional predictive ecotoxicology methods such as quantitative structure-activity relationship (QSAR) models. This study aimed to predict MP toxicity from MP properties, MP concentration, organismal traits, endpoints, and experimental design, and to evaluate how data pre-processing, dataset size, and data quality influence model performance. We applied the Boosted Regression Tree (BRT) machine learning algorithm to four datasets derived from the Toxicity of Microplastics Explorer database (ToMEx 2.0): (i) an imputed dataset (missing values imputed), (ii) a complete-case dataset (missing values removed), (iii) a high-quality dataset, and (iv) a low-quality dataset. The high-quality dataset yielded the best final predictions under both random cross-validation (AUC = 0.93) and blocked cross-validation by particle identifier (AUC = 0.87). Explainable artificial intelligence (xAI) analyses showed that predictive performance was determined primarily by endpoints and concentration, with MP properties also contributing despite limited reporting. Our findings demonstrate the feasibility of using machine learning to predict MP toxicity and identify its key drivers, and they highlight that high-quality data improves predictive performance while reducing data mining and computational costs. Standardized experiments, detailed MP characterization, and high reporting standards would better support risk assessment frameworks and inform the design of safer materials.
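The difference between the two validation schemes named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code: the function names, the greedy fold-balancing rule, the example particle identifiers, and the assumption of a binary toxicity label (suggested by the use of AUC) are all illustrative. Random folds may place records of the same particle in both training and test sets, whereas blocked folds keep each particle identifier in a single fold, which typically gives a stricter estimate of performance on unseen particles.

```python
import random
from collections import defaultdict


def random_folds(n_rows, k, seed=0):
    """Assign each row to one of k folds at random (rows of one particle may be split)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [0] * n_rows
    for pos, i in enumerate(idx):
        folds[i] = pos % k
    return folds


def blocked_folds(groups, k):
    """Assign whole groups to folds, so no group is shared between train and test."""
    rows_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        rows_by_group[g].append(i)
    folds = [0] * len(groups)
    sizes = [0] * k
    # Largest groups first, each placed in the currently smallest fold (greedy balancing).
    for g in sorted(rows_by_group, key=lambda g: (-len(rows_by_group[g]), str(g))):
        target = sizes.index(min(sizes))
        for i in rows_by_group[g]:
            folds[i] = target
        sizes[target] += len(rows_by_group[g])
    return folds


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical particle identifiers for eight toxicity records.
particle_ids = ["P1", "P1", "P1", "P2", "P2", "P3", "P3", "P4"]
fold_of = blocked_folds(particle_ids, k=2)
```

In use, a model would be fitted on all folds but one and scored with `auc` on the held-out fold, once per fold, under each scheme. The gap between the two averaged scores indicates how much the random-split estimate depends on particles already seen during training.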
DOI
https://doi.org/10.32942/X2C96D
Subjects
Life Sciences
Keywords
ecotoxicology, explainable artificial intelligence, predictive modeling, microplastic properties, risk assessment
Dates
Published: 2026-03-23 04:06
Last Updated: 2026-03-23 04:06
License
CC-By Attribution-ShareAlike 4.0 International
Additional Metadata
Conflict of interest statement:
The authors declare that they have no conflicts of interest.
Data and Code Availability Statement:
All data and code will be made publicly available upon publication.
Language:
English