Paying it forward: Crowdsourcing of taxonomic harmonization and linking of biodiversity identifiers

This is a Preprint and has not been peer reviewed. The published version of this Preprint is available: https://doi.org/10.3897/BDJ.11.e114076. This is version 2 of this Preprint.

Authors

Brandon Kwee Boon Seah 

Abstract

Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (the same taxon under different names). Therefore, most projects will require some degree of curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon, and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility. At the same time, the task requires specialist knowledge and cannot be easily automated without careful validation. Here we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist dataset. This represents a common usage scenario: finding publicly available sequencing data (available from NCBI) for species chosen by their occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name matching and data cleaning, describe common issues encountered during curation, and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name matching caused by homonyms. By correcting such errors during data cleaning, either directly (by editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to the improvement of community resources, thereby improving the quality of downstream analyses.
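
To illustrate why name-matching alone is error-prone, the following is a minimal sketch, not the toolkit described in the preprint. It links two record lists by exact scientific name and sets aside any case where the name matches but the kingdom does not, which is the typical signature of a homonym. The function name `link_by_name`, the record fields (`id`, `name`, `kingdom`), and the identifiers `G1`, `N1`, and so on are hypothetical placeholders, and the sketch assumes that kingdom labels have already been normalized to a shared vocabulary across the two sources.

```python
from collections import defaultdict


def link_by_name(gbif_records, ncbi_records):
    """Link records by exact name; flag cases that need manual curation.

    Hypothetical helper for illustration only. Each record is a dict with
    the keys "id", "name", and "kingdom" (assumed to be pre-normalized).
    """
    # Index the NCBI-side records by scientific name.
    ncbi_by_name = defaultdict(list)
    for rec in ncbi_records:
        ncbi_by_name[rec["name"]].append(rec)

    linked = {}          # GBIF-side id -> NCBI-side id
    needs_curation = {}  # GBIF-side id -> reason for manual review

    for g in gbif_records:
        hits = ncbi_by_name.get(g["name"], [])
        # Keep only name matches whose kingdom agrees with the query record.
        compatible = [n for n in hits if n["kingdom"] == g["kingdom"]]
        if len(compatible) == 1:
            linked[g["id"]] = compatible[0]["id"]
        elif not hits:
            needs_curation[g["id"]] = "no name match"
        elif not compatible:
            needs_curation[g["id"]] = "name match in another kingdom (possible homonym)"
        else:
            needs_curation[g["id"]] = "multiple candidate matches"
    return linked, needs_curation


# Toy data with made-up identifiers. "Morus" is a genus name used both for
# mulberries (plants) and for gannets (birds).
gbif = [
    {"id": "G1", "name": "Morus", "kingdom": "Plantae"},
    {"id": "G2", "name": "Morus", "kingdom": "Animalia"},
    {"id": "G3", "name": "Homo sapiens", "kingdom": "Animalia"},
]
ncbi = [
    {"id": "N1", "name": "Morus", "kingdom": "Plantae"},
    {"id": "N2", "name": "Homo sapiens", "kingdom": "Animalia"},
]

linked, needs_curation = link_by_name(gbif, ncbi)
```

In this toy example, a matcher that compared names only would link the animal record `G2` to the plant record `N1`. The kingdom check routes it to manual curation instead. Synonyms are not handled here; they would appear as "no name match" and likewise require curation.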

DOI

https://doi.org/10.32942/X2Q01H

Subjects

Biodiversity

Keywords

data curation, biodiversity informatics, data integration

Dates

Published: 2023-10-11 03:54

Last Updated: 2023-11-24 18:16

License

CC BY Attribution 4.0 International

Additional Metadata

Language:
English

Data and Code Availability Statement:
Code associated with this preprint is available from https://github.com/monagrland/taxo-harmo