Using Elicit AI research assistant for data extraction in systematic reviews: a feasibility study across environmental and life sciences

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.


Authors

Malgorzata Lagisz, Ayumi Mizuno, Kyle Morrison, Pietro Pollo, Lorenzo Ricolfi, Yefeng Yang, Shinichi Nakagawa

Abstract

Data extraction in systematic reviews, maps, and meta-analyses is time-consuming and prone to human error and subjective judgment. Large Language Models (LLMs) offer potential for automating this process, yet their performance has been evaluated across only a limited range of platforms, disciplines, and review types.
We assessed the performance of the Elicit platform across diverse data extraction tasks using journal articles from seven systematic-like reviews in the life and environmental sciences. Human-extracted data served as the gold standard. For each review, we used eight articles for prompt development and another eight for testing. Initial prompts were iteratively refined for up to five rounds or until they exceeded 87% accuracy. We then tested extraction accuracy, reproducibility across user accounts, and the effect of Elicit's high-accuracy mode.
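For illustration, a minimal sketch of this refinement loop under stated assumptions: exact matching after light normalisation stands in for the study's actual matching rules, and `run_elicit` and `refine_prompt` are hypothetical placeholders for the Elicit extraction call and the manual prompt-revision step (neither is from the paper).

```python
THRESHOLD = 0.87  # accuracy target from the study
MAX_ROUNDS = 5    # refinement cap from the study

def field_accuracy(extracted: dict, gold: dict) -> float:
    """Fraction of gold-standard fields whose extracted value matches.

    Assumes exact string matching after light normalisation; the study's
    matching criteria may have differed.
    """
    hits = sum(
        str(extracted.get(field, "")).strip().lower() == str(value).strip().lower()
        for field, value in gold.items()
    )
    return hits / len(gold)

def refine_until_accurate(prompt, dev_articles, gold_data, run_elicit, refine_prompt):
    """Refine a prompt for up to MAX_ROUNDS or until accuracy exceeds THRESHOLD.

    `run_elicit(prompt, article)` and `refine_prompt(...)` are injected
    stand-ins for the extraction call and the human revision step.
    """
    accuracy = 0.0
    for _ in range(MAX_ROUNDS):
        extractions = [run_elicit(prompt, article) for article in dev_articles]
        accuracy = sum(
            field_accuracy(e, g) for e, g in zip(extractions, gold_data)
        ) / len(dev_articles)
        if accuracy > THRESHOLD:
            break
        prompt = refine_prompt(prompt, extractions, gold_data)  # manual revision
    return prompt, accuracy
```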
Of the 90 prompts considered, 70 exceeded the 87% accuracy threshold when compared against gold-standard values, but accuracy tended to drop when the prompts were applied to a new set of articles. Repeating data extractions with different Elicit user accounts resulted in 90% agreement on extracted values, though supporting quotes and reasoning matched in only 46% and 30% of cases, respectively. In high-accuracy mode, value matches dropped to 77%, with just 10% quote matches and 0% reasoning matches. Extraction accuracy did not differ across data types. Elicit also helped identify eight (<1%) errors in the gold-standard data.
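A hedged sketch of the cross-account agreement check reported above: each extracted item from two independent runs is compared on its value, supporting quote, and reasoning. The field names are illustrative assumptions, not Elicit's actual export schema.

```python
def agreement_rates(run_a: list[dict], run_b: list[dict]) -> dict[str, float]:
    """Per-field agreement between two extraction runs over the same articles.

    Field names ("value", "quote", "reasoning") are illustrative only.
    """
    rates = {}
    for field in ("value", "quote", "reasoning"):
        same = sum(a.get(field) == b.get(field) for a, b in zip(run_a, run_b))
        rates[field] = same / len(run_a)
    return rates

# Example: two accounts agree on the value but cite different quotes/reasoning.
run_1 = [{"value": "n = 24", "quote": "Twenty-four mice...", "reasoning": "Stated in Methods."}]
run_2 = [{"value": "n = 24", "quote": "...24 animals total.", "reasoning": "Table 1."}]
print(agreement_rates(run_1, run_2))  # {'value': 1.0, 'quote': 0.0, 'reasoning': 0.0}
```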
Our results show that Elicit can complement, but not replace, human data extractors. It may be best used as a secondary reviewer and as a tool for evaluating the clarity of data extraction protocols. Prompts must be fine-tuned and independently validated.

DOI

https://doi.org/10.32942/X2F346

Subjects

Life Sciences

Keywords

Artificial Intelligence, evidence synthesis, systematic maps, meta-analysis, research methods, proof of concept

Dates

Published: 2025-08-11 03:04

Last Updated: 2025-08-11 03:04

License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Additional Metadata

Conflict of interest statement:
We acknowledge that we used temporary free Elicit Plus plan access, provided by Elicit Research, PBC, to conduct tests from different user accounts. Representatives of Elicit Research, PBC, did not participate in the conceptualisation or design of this study. We did not receive any financial payments from Elicit Research, PBC, and have no other relationships or activities that could have influenced our work on this project.

Data and Code Availability Statement:
All data and code used in this study are available at https://github.com/mlagisz/elicit_extractions_testing and https://osf.io/ejyva/

Language:
English