Using large language models to address the bottleneck of georeferencing natural history collections

Yuyang Xie; Daniel S Park; Miranda A Sinnott-Armstrong; Joyce Ho; Tianlong Chen; Alan S Weakley; Luis José Aguirre; Jaein Choi; Marisa Laitinen; Nicholas Steeves; Chingyan Huang; Ran Xu; Xiao Feng

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Abstract

Natural history collections are fundamental to biodiversity research. Their broad use depends on digitization, especially georeferencing, which translates textual locality descriptions into geographic coordinates. However, traditional georeferencing approaches are labor-intensive and costly, making georeferencing a major bottleneck in digitization that prevents the use of millions of specimens worldwide. This study investigated the potential of large language models (LLMs) to facilitate georeferencing. We used LLMs from OpenAI and DeepSeek to georeference 5,000 vascular plant specimen records with known coordinates, and compared the results against those of GEOLocate (a widely used georeferencing tool) and manual georeferencing. We found that the best-performing LLMs (e.g., gpt-4o) outperformed specialized tools like GEOLocate in spatial applicability and demonstrated near-human-level accuracy, with a median georeferencing error of <10 km. LLM-based georeferencing was also fast (<1 s per record) and affordable ($0.10 per 100 records), making it a cost-effective approach. LLMs may not fully replace human curation in the short term, but they can be incorporated into current workflows to greatly increase the efficiency of georeferencing. Future advances in LLMs may revolutionize the digitization of natural history collections.
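
As an illustration of how a median georeferencing error like the one reported above could be computed, the following is a minimal sketch. It assumes the error for each record is the great-circle (haversine) distance between the predicted and the known coordinates; the abstract does not state the exact distance metric, and the function names `haversine_km` and `median_error_km` are hypothetical, not taken from the study.

```python
import math
from statistics import median

# Mean Earth radius in kilometres (assumed spherical Earth for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def median_error_km(predicted, reference):
    """Median distance in km between paired predicted and reference (lat, lon) tuples."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    errors = [haversine_km(p[0], p[1], r[0], r[1])
              for p, r in zip(predicted, reference)]
    return median(errors)
```

Given a list of model-predicted coordinates and the matching list of known coordinates, `median_error_km(predicted, reference)` returns a single summary value that could be compared against a threshold such as 10 km. The median is less sensitive than the mean to a few records placed very far from their true location.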

DOI

https://doi.org/10.32942/X2134G

Subjects

Biodiversity, Ecology and Evolutionary Biology

Keywords

Artificial Intelligence, Large Language Model, biodiversity, herbarium, museum, Specimen

Dates

Published: 2025-05-03 00:58

Last Updated: 2025-05-03 00:58

License

CC BY Attribution 4.0 International

Additional Metadata

Language: English