Ten simple rules to follow when cleaning occurrence data in palaeobiology

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Lewis A. Jones, Christopher D. Dean, Bethany J. Allen, Harriet B. Drage, Joseph T. Flannery-Sutherland, William Gearty, Alfio Alessandro Chiarenza, Erin M. Dillon, Bruna M. Farina, Pedro L. Godoy

Abstract

Large datasets of fossil occurrences, often downloaded from online community-maintained databases, are a vital resource for understanding broad-scale evolutionary patterns, such as how biodiversity has changed through time and space. Such datasets, however, are not infallible and must be ‘cleaned’ of inaccurate, incomplete, or duplicate data prior to analysis. Researchers must decide upon the extent, feasibility, and value of data cleaning steps to perform, but while guides are available for working with neontological occurrences, there is currently no clear procedure for palaeobiological data despite its unique attributes. Here, we outline ten rules that aim to aid the process of cleaning fossil occurrence data for downstream analysis. These rules cover the major steps involved in processing data prior to analysis, including project setup, data exploration and cleaning, and finalising and reporting work. We provide accompanying examples and a vignette covering the entire data cleaning process to demonstrate the application of each rule. We believe that these rules will serve as a useful guideline to support data cleaning and foster new standards for the palaeobiological community.

DOI

https://doi.org/10.32942/X2FS8M

Subjects

Paleobiology

Keywords

palaeontology, fossils, biodiversity, reproducibility, data cleaning

Dates

Published: 2025-03-21 16:30

Last Updated: 2025-03-21 16:30

License

CC BY 4.0 (Creative Commons Attribution 4.0 International)

Additional Metadata

Conflict of interest statement:
We declare we have no competing interests.

Data and Code Availability Statement:
The data and code generated for this article have been included within a dedicated GitHub repository: https://github.com/palaeoverse/ten-rules. In addition, they have been uploaded to a Zenodo repository through integrated version control: https://doi.org/10.5281/zenodo.14938533.

Language:
English