This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.
Mapping Cultural Ecosystem Service Flows from Social Media Imagery with Vision–Language Models: A Zero-Shot CLIP Framework
Abstract
Geotagged social media imagery provides a valuable source for mapping cultural ecosystem service (CES) flows, which represent realized human interactions with nature, yet its open-world, user-generated content poses challenges for automated content analysis. Supervised models require large labeled datasets and show limited generalization across contexts, whereas unsupervised approaches often need post-hoc interpretation. Vision–language models (VLMs) offer a promising alternative but remain largely unexplored in CES research. We present a label-efficient framework that leverages the open-source Contrastive Language–Image Pretraining (CLIP) model to classify and map 12 CES flows across Florida using only 120 labeled images. Five CLIP variants and three prompt strategies were benchmarked to evaluate zero-shot performance under closed-set conditions, and three CLIP-based pipelines with differing supervision levels were compared to address the open-set challenge of filtering irrelevant content. Mixed class-specific prompts increased closed-set accuracy to 97%. Under open-set conditions, a hybrid pipeline combining a lightweight binary classifier with zero-shot CLIP inference achieved the strongest performance (accuracy = 88%; F1-macro = 0.88; F1-other = 0.91), demonstrating major gains in label efficiency and open-set robustness. Statewide flow maps reveal consistent hotspots for outdoor recreation, wildlife viewing, and landscape aesthetics along coastal areas and major inland greenspaces, extending beyond formal park systems into urban greenspaces and other natural and working lands. The resulting map products and interactive web application provide actionable tools for identifying CES hotspots and the landscapes that support human–nature interactions. Overall, this study demonstrates the transformative potential of foundation VLMs for large-scale CES assessment using social media imagery.
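The core zero-shot step the abstract describes can be sketched as follows: CLIP embeds the image and each class prompt into a shared space, and the prediction is a temperature-scaled softmax over cosine similarities. This is a minimal sketch, not the authors' released code; the embeddings below are synthetic stand-ins for CLIP's encoders (512-d, as in ViT-B/32), and the class names are illustrative examples drawn from the abstract.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=100.0):
    """Pick the class whose prompt embedding is most similar to the image."""
    # L2-normalize, as CLIP does before taking dot products
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over candidate classes
    return labels[int(np.argmax(probs))], probs

# Illustrative CES classes; synthetic embeddings stand in for CLIP encoders.
rng = np.random.default_rng(0)
labels = ["outdoor recreation", "wildlife viewing", "landscape aesthetics"]
text_embs = rng.normal(size=(3, 512))
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)  # image near class 1
pred, probs = zero_shot_classify(image_emb, text_embs, labels)
```

In the paper's hybrid pipeline, a lightweight binary classifier would first filter out irrelevant ("other") images before this zero-shot step assigns one of the 12 CES flow classes.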
DOI
https://doi.org/10.32942/X29S8C
Subjects
Computational Engineering, Computer Sciences, Natural Resources and Conservation, Nature and Society Relations, Sustainability
Keywords
cultural ecosystem services, contrastive language–image pre-training, vision–language model, zero-shot learning, open-set recognition, social media imagery, natural and working landscapes
Dates
Published: 2025-12-10 18:40
Last Updated: 2025-12-10 18:40
License
CC BY Attribution 4.0 International
Additional Metadata
Language:
English
Data and Code Availability Statement:
Interactive CES maps related to this study are available at: https://es-geoai.rc.ufl.edu/agroes-ces-clip/. Open data/code may be released in a future update.