arXiv:2601.02370v1 Announce Type: new
Abstract: Large language models (LLMs) are becoming essential tools for strategy scholars who need to evaluate text corpora at scale. This paper provides a systematic analysis of the reliability of LLM-as-evaluator in strategy research. After classifying the typical ways in which LLMs can be deployed for evaluation purposes in strategy research, we draw on the specialised AI literature to analyse their properties as measurement instruments. Our empirical analysis reveals substantial instability in LLMs' evaluation output, stemming from multiple factors: the specific phrasing of prompts, the context provided, sampling procedures, extraction methods, and disagreements across different models. We quantify these effects and demonstrate how this unreliability can compromise the validity of research inferences drawn from LLM-generated evaluations. To address these challenges, we develop a comprehensive protocol that is variance-aware, normative, and auditable. We provide practical guidance for flexible implementation of this protocol, including approaches to preregistration and transparent reporting. By establishing these methodological standards, we aim to elevate LLM-based evaluation of business text corpora from its current ad hoc status to a rigorous, actionable, and auditable measurement approach suitable for scholarly research.
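To make the abstract's notion of a "variance-aware" protocol concrete, the following is a minimal sketch of how one might score a document under multiple models, prompt phrasings, and sampling seeds, then report the dispersion of the resulting scores rather than a single point estimate. This is an illustrative reading of the abstract, not the authors' protocol; the `call_llm` callable, model names, and prompt variants are hypothetical placeholders for whatever evaluator setup a researcher actually uses.

```python
"""Sketch of a variance-aware LLM evaluation loop (illustrative only)."""
from statistics import mean, stdev
from typing import Callable, Dict, List


def variance_aware_scores(
    call_llm: Callable[[str, str, str, int], float],  # (model, prompt, text, seed) -> score
    models: List[str],
    prompt_variants: List[str],
    text: str,
    seeds: List[int],
) -> Dict[str, float]:
    """Score one document under every (model, prompt, seed) combination and
    summarise the spread of scores, so downstream inference can account for
    measurement instability instead of treating one run as ground truth."""
    scores = [
        call_llm(model, prompt, text, seed)
        for model in models
        for prompt in prompt_variants
        for seed in seeds
    ]
    return {
        "mean": mean(scores),
        "sd": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        "n_runs": float(len(scores)),
    }


if __name__ == "__main__":
    # Toy stand-in for an LLM call: a deterministic pseudo-score that shifts
    # with the prompt wording and seed, mimicking evaluator instability.
    def fake_llm(model: str, prompt: str, text: str, seed: int) -> float:
        return (hash((model, prompt, seed)) % 100) / 100.0

    report = variance_aware_scores(
        fake_llm,
        models=["model-a", "model-b"],
        prompt_variants=[
            "Rate the strategic clarity of this text from 0 to 1.",
            "Score the strategic focus of this text on a 0-1 scale.",
        ],
        text="Example annual-report excerpt...",
        seeds=[0, 1, 2],
    )
    print(report)
```

A reported mean together with the standard deviation and run count gives readers an auditable sense of how sensitive the evaluation is to prompt phrasing, sampling, and model choice, which is the kind of instability the paper quantifies.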