This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.
Mapping Cultural Ecosystem Service Flows from Social Media Imagery with Vision–Language Models: A Zero-Shot CLIP Framework
Abstract
Geotagged social media imagery provides a valuable source for mapping cultural ecosystem service (CES) flows, which represent realized human interactions with nature, yet its open-world, user-generated content poses challenges for automated content analysis. Supervised models require large labeled datasets and show limited generalization across contexts, whereas unsupervised approaches often need post-hoc interpretation. Vision–language models (VLMs) offer a promising alternative but remain largely unexplored in CES research. We present a label-efficient framework that leverages the open-source Contrastive Language–Image Pretraining (CLIP) model to classify and map 12 CES flows across Florida using only 120 labeled images. Five CLIP variants and three prompt strategies were benchmarked to evaluate zero-shot performance under closed-set conditions, and three CLIP-based pipelines with differing supervision levels were compared to address the open-set challenge of filtering irrelevant content. Mixed class-specific prompts increased closed-set accuracy to 97%. Under open-set conditions, a hybrid pipeline combining a lightweight binary classifier with zero-shot CLIP inference achieved the strongest performance (accuracy = 88%; F1-macro = 0.88; F1-other = 0.91), demonstrating major gains in label efficiency and open-set robustness. Statewide flow maps reveal consistent hotspots for outdoor recreation, wildlife viewing, and landscape aesthetics along coastal areas and major inland greenspaces, extending beyond formal park systems into urban greenspaces and other natural and working lands. The resulting map products and interactive web application provide actionable tools for identifying CES hotspots and the landscapes that support human–nature interactions. Overall, this study demonstrates the transformative potential of foundation VLMs for large-scale CES assessment using social media imagery.
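The core zero-shot step the abstract describes can be sketched as follows: CLIP embeds the image and each class prompt into a shared space, and the prediction is a temperature-scaled softmax over cosine similarities. This is a minimal sketch, not the authors' released code; the embeddings below are synthetic stand-ins for CLIP's encoders (512-d, as in ViT-B/32), and the class names are illustrative examples drawn from the abstract.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=100.0):
    """Pick the class whose prompt embedding is most similar to the image."""
    # L2-normalize, as CLIP does before taking dot products
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over candidate classes
    return labels[int(np.argmax(probs))], probs

# Illustrative CES classes; synthetic embeddings stand in for CLIP encoders.
rng = np.random.default_rng(0)
labels = ["outdoor recreation", "wildlife viewing", "landscape aesthetics"]
text_embs = rng.normal(size=(3, 512))
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)  # image near class 1
pred, probs = zero_shot_classify(image_emb, text_embs, labels)
```

In the paper's hybrid pipeline, a lightweight binary classifier would first filter out irrelevant ("other") images before this zero-shot step assigns one of the 12 CES flow classes.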
DOI
https://doi.org/10.32942/X29S8C
Subjects
Computational Engineering, Computer Sciences, Natural Resources and Conservation, Nature and Society Relations, Sustainability
Keywords
cultural ecosystem services, contrastive language–image pre-training, vision–language model, zero-shot learning, open-set recognition, social media imagery, natural and working landscapes
Dates
Published: 2025-12-10 18:40
Last Updated: 2025-12-10 18:40
License
CC BY Attribution 4.0 International
Additional Metadata
Language:
English
Data and Code Availability Statement:
Interactive CES maps related to this study are available at: https://es-geoai.rc.ufl.edu/agroes-ces-clip/. Open data/code may be released in a future update.