This is a Preprint and has not been peer reviewed. The published version of this Preprint is available: https://doi.org/10.3897/BDJ.11.e114076. This is version 2 of this Preprint.
Abstract
Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone because of issues such as homonyms (unrelated taxa with the same name) and synonyms (the same taxon under different names). Therefore, most projects will require some degree of curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon, and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility. Yet the task requires specialist knowledge and cannot easily be automated without careful validation. Here we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist dataset. This represents a common usage scenario: finding publicly available sequencing data (available from NCBI) for species chosen by their occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name matching and data cleaning, describe common issues encountered during curation, and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name matching caused by homonyms. By correcting such errors during data cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to the improvement of community resources, thereby improving the quality of downstream analyses.
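The homonym problem described in the abstract can be illustrated with a minimal sketch. This is not the authors' toolkit: the records, the identifiers (`gbif-1`, `ncbi-10`, and so on), and the helper names `match_by_name` and `flag_conflicts` are hypothetical and chosen only for illustration. The sketch also assumes that both sources label kingdoms the same way, which real GBIF and NCBI data do not. The genus name Morus is used because it is a genuine homonym, applied both to mulberries (plants) and to gannets (birds).

```python
from collections import defaultdict

# Toy records. All identifiers are made-up placeholders,
# not real GBIF or NCBI identifiers.
gbif_records = [
    {"id": "gbif-1", "name": "Morus", "kingdom": "Plantae"},
    {"id": "gbif-2", "name": "Morus", "kingdom": "Animalia"},
    {"id": "gbif-3", "name": "Quercus", "kingdom": "Plantae"},
]
ncbi_records = [
    {"id": "ncbi-10", "name": "Morus", "kingdom": "Plantae"},
    {"id": "ncbi-11", "name": "Morus", "kingdom": "Animalia"},
    {"id": "ncbi-12", "name": "Quercus", "kingdom": "Plantae"},
]


def match_by_name(left, right):
    """Pair every left record with every right record that shares its name."""
    by_name = defaultdict(list)
    for rec in right:
        by_name[rec["name"]].append(rec)
    return [(l, r) for l in left for r in by_name[l["name"]]]


def flag_conflicts(pairs):
    """Return ID pairs whose kingdoms disagree, i.e. likely homonym mismatches."""
    return [(l["id"], r["id"]) for l, r in pairs if l["kingdom"] != r["kingdom"]]


pairs = match_by_name(gbif_records, ncbi_records)
conflicts = flag_conflicts(pairs)
```

Matching on the name alone yields four candidate links for Morus, two of which join a plant to a bird. Comparing a higher rank is one simple way to surface such pairs for manual curation; it does not detect synonyms, which need a separate check.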
DOI
https://doi.org/10.32942/X2Q01H
Subjects
Biodiversity
Keywords
data curation, biodiversity informatics, data integration
Dates
Published: 2023-10-10 18:54
Last Updated: 2023-11-24 09:16
License
CC BY Attribution 4.0 International
Additional Metadata
Language:
English
Data and Code Availability Statement:
Code associated with this preprint is available from https://github.com/monagrland/taxo-harmo