Bonifazi, G.; Buratti, C.; Marchetti, M.; Parlapiano, F.; Traini, D.; Ursino, D.; Virgili, L. (2025). Are Large Language Models Better Peer-Reviewers Than Humans? An Early Investigation on OpenReview. In: ITADATA-WS 2025: The 4th Italian Conference on Big Data and Data Science, Turin, Italy, 9–11 September 2025.
Are Large Language Models Better Peer-Reviewers Than Humans? An Early Investigation on OpenReview
G. Bonifazi; C. Buratti; M. Marchetti; F. Parlapiano; D. Traini; D. Ursino; L. Virgili
2025-01-01
Abstract
In recent years, Large Language Models (LLMs) have often been used by paper reviewers, despite this practice being generally prohibited. This raises ongoing concerns about ethics, review reliability, and the risk of review manipulation. Indeed, several arXiv preprints were recently found to contain invisible, LLM-targeted instructions designed to persuade an AI reviewer to produce a positive review. In this paper, we present a systematic analysis of LLMs' reviewing capabilities in this complex and evolving scenario. In particular, we address two research questions: (i) How do LLM ratings compare with human ratings? and (ii) Can hidden positive prompts injected into a manuscript alter the review an LLM generates? To address these questions, we built a dataset of 400 papers from OpenReview. For each paper, the dataset contains the human reviews and scores already present on OpenReview, together with reviews that we generated using three state-of-the-art LLMs. Our results show that human reviewers assign higher and more widely dispersed scores, which clearly distinguish accepted from rejected papers. In contrast, LLM ratings cluster close to their mean value, blurring the distinction between accepted and rejected papers. Furthermore, a negative prompt given by the reviewer leads the LLM to lower its scores, whereas a hidden positive prompt injected by the author often fails to raise scores and, when detected by the LLM, can even trigger lower ones. These results reveal both the potential and the fragility of delegating peer-review tasks to LLMs.
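As a rough, hypothetical illustration of research question (i), the Python sketch below shows the kind of distribution comparison the abstract describes: for each rating source, the overall mean and standard deviation of the scores, plus the gap between the average scores of accepted and rejected papers. The records and scores here are synthetic placeholders, not the study's data, and the helper names are our own invention.

```python
# Hypothetical sketch (not the authors' code) of comparing human vs. LLM
# score distributions. All numbers are synthetic, illustrative placeholders.
from statistics import mean, stdev

# Synthetic records: (decision, mean_human_score, mean_llm_score) on a 1-10 scale.
papers = [
    ("accepted", 7.5, 6.2),
    ("accepted", 8.0, 6.4),
    ("rejected", 3.5, 5.8),
    ("rejected", 4.0, 6.0),
    ("rejected", 5.0, 6.1),
]

for source, idx in (("human", 1), ("LLM", 2)):
    accepted = [p[idx] for p in papers if p[0] == "accepted"]
    rejected = [p[idx] for p in papers if p[0] == "rejected"]
    all_scores = [p[idx] for p in papers]
    # A small overall stdev means the ratings cluster near their mean (as
    # reported for the LLMs); a large accepted-rejected gap means the scores
    # separate the two groups (as reported for human reviewers).
    print(f"{source}: mean={mean(all_scores):.2f}, stdev={stdev(all_scores):.2f}, "
          f"accepted-rejected gap={mean(accepted) - mean(rejected):.2f}")
```

On the paper's findings, one would expect the human row to show a larger standard deviation and a larger accepted-rejected gap than the LLM row.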
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| Bonifazi_Are-Large-Language-Models-Better_2025.pdf | Open access | Publisher's version (published with the publisher's layout) | Creative Commons | 1.25 MB | Adobe PDF |


