RefineBench: Evaluating Refinement Capability of Language Models via Checklists

1 KAIST 2 Carnegie Mellon University 3 NVIDIA
4 Independent Researcher
* Co-first authors
Multi-Turn Interactions in Large Language Models @ NeurIPS 2025 (Top 1%, Oral)

Why Is RefineBench Needed?

(Left) Strong LMs such as Claude-4-Sonnet can self-refine effectively on AIME-24, where they already solve problems reasonably well in the first iteration. On saturated benchmarks such as MATH-500, however, there is little headroom left for improvement, while on our proposed benchmark, RefineBench, performance gains from refinement remain limited. Hence, RefineBench serves as a testbed for measuring the self-refinement capability of frontier LMs. (Right) The biggest bottleneck when an LM refines its output is that it often struggles to identify which aspects need to be corrected. Therefore, beyond the self-refinement setting, in which the LM must independently identify and fix errors, RefineBench also introduces settings where partial hints about what needs to be revised are provided and where the amount of feedback varies, enabling a systematic analysis of refinement capability.

Key Features of RefineBench

Covering Verifiable / Non-Verifiable Tasks

Includes both free-form generation tasks and tasks with verifiable final answers, enabling diverse refinement evaluation.

Supporting Various Refinement Scenarios

Supports both guided and self-refinement settings, enabling controlled analysis of refinement strategies.

Broad Domain Diversity

Covers 11 domains, from math, statistics, and STEM to humanities, law, and social sciences.

Checklist-Based Evaluation

Each task is assessed via a detailed checklist defining explicit and transparent evaluation criteria.

Consistent Multi-Turn Assessment

Measures performance improvement across multiple turns and evaluates refinement under both self- and guided-refinement settings.

Evaluation Workflow

RefineBench evaluates an LM’s ability to iteratively refine its own answers through a structured three-step workflow:

1. Refinement Step: Given the user query and the previous answer, the target LM generates a refined response (or an initial answer at the first turn). In self-refinement, the model autonomously decides whether to continue refining.
2. Evaluation Step: An evaluator LM (e.g., GPT-4.1) checks the refined answer against a predefined checklist of binary ("Yes" / "No") criteria to assess quality and completeness.
3. Feedback Step: The checklist-based feedback forms the next prompt, enabling multi-turn iterative refinement (typically up to five rounds); a minimal sketch of this loop is shown below.
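
The sketch below is a plain-Python rendering of this three-step loop, not the benchmark's reference implementation: call_target_lm and call_evaluator_lm are placeholders for whatever LM clients you plug in, and the prompt templates and stopping rule are illustrative assumptions.

MAX_ROUNDS = 5  # the workflow typically runs up to five refinement rounds


def call_target_lm(prompt: str) -> str:
    raise NotImplementedError("plug in your target-LM client here")


def call_evaluator_lm(prompt: str) -> str:
    raise NotImplementedError("plug in your evaluator-LM client (e.g., GPT-4.1) here")


def judge_checklist(query: str, answer: str, checklist: list[str]) -> list[bool]:
    """Ask the evaluator LM for a Yes/No verdict on each binary criterion."""
    verdicts = []
    for criterion in checklist:
        prompt = (
            f"Question:\n{query}\n\nAnswer:\n{answer}\n\n"
            f"Criterion: {criterion}\n"
            "Does the answer satisfy this criterion? Reply Yes or No."
        )
        verdicts.append(call_evaluator_lm(prompt).strip().lower().startswith("yes"))
    return verdicts


def refine(query: str, checklist: list[str]) -> str:
    answer = call_target_lm(query)  # initial answer at the first turn
    for _ in range(MAX_ROUNDS):
        verdicts = judge_checklist(query, answer, checklist)  # Evaluation Step
        if all(verdicts):
            break  # every criterion satisfied; stop refining
        failed = [c for c, ok in zip(checklist, verdicts) if not ok]
        feedback = (  # Feedback Step: surface the unmet criteria
            "Your previous answer does not yet satisfy the following points:\n- "
            + "\n- ".join(failed)
        )
        next_prompt = (
            f"{query}\n\nPrevious answer:\n{answer}\n\n{feedback}\n\n"
            "Please revise your answer."
        )
        answer = call_target_lm(next_prompt)  # Refinement Step
    return answer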

Dataset Overview

RefineBench includes 1,000 problems spanning 11 domains and 239 subjects, each paired with a checklist averaging 9.9 binary criteria. Major domains include Math (32%), Humanities/Social Science (19%), and Law (14%), ensuring balanced coverage of verifiable and non-verifiable reasoning tasks.
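
For concreteness, a single benchmark item can be pictured roughly as the record below; the field names (problem_id, domain, subject, query, checklist) and the example values are illustrative assumptions, not the dataset's actual schema.

from dataclasses import dataclass, field


@dataclass
class RefineBenchItem:
    """Hypothetical shape of one RefineBench item (field names are illustrative)."""
    problem_id: str
    domain: str       # one of the 11 domains, e.g., "Math" or "Law"
    subject: str      # one of the 239 finer-grained subjects
    query: str        # the problem the target LM must answer
    checklist: list[str] = field(default_factory=list)  # ~9.9 binary criteria on average


example = RefineBenchItem(
    problem_id="math-0001",
    domain="Math",
    subject="Number Theory",
    query="Prove that there are infinitely many prime numbers.",
    checklist=[
        "Does the answer state a complete proof strategy (e.g., Euclid's argument)?",
        "Does the answer make the contradiction or construction step explicit?",
    ],
)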

Compared to existing datasets, RefineBench is distinctive in several ways:

  • Extrinsic (guided) and intrinsic (self-refinement) setups are both supported, enabling comprehensive evaluation.
  • Partially guided refinement enables fine-grained control over how much feedback is exposed (see the sketch after this list).
  • Checklist-based supervision provides explicit, interpretable evaluation criteria.
  • Multi-turn evaluation tracks consistency and iterative improvement across refinement steps.
  • Coverage spans both verifiable and open-ended reasoning tasks.
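
As a rough illustration of how feedback exposure can be varied in the partially guided setting, the helper below reveals only a fraction of the failed checklist items; the exposure parameter and prompt wording are assumptions for illustration, not the exact protocol used in the paper.

import random


def build_feedback(failed_criteria: list[str], exposure: float, seed: int = 0) -> str:
    """Reveal a fraction of the failed criteria: exposure=0.0 approximates pure
    self-refinement (no hints), exposure=1.0 approximates fully guided refinement."""
    rng = random.Random(seed)
    k = round(exposure * len(failed_criteria))
    revealed = rng.sample(failed_criteria, k) if k > 0 else []
    if not revealed:
        return "Review your previous answer and revise it if you find any issues."
    return (
        "Your previous answer does not yet satisfy the following points:\n- "
        + "\n- ".join(revealed)
    )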

BibTeX

@misc{lee2025refinebenchevaluatingrefinementcapability,
  title={RefineBench: Evaluating Refinement Capability of Language Models via Checklists},
  author={Young-Jun Lee and Seungone Kim and Byung-Kwan Lee and Minkyeong Moon and Yechan Hwang and Jong Myoung Kim and Graham Neubig and Sean Welleck and Ho-Jin Choi},
  year={2025},
  eprint={2511.22173},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.22173},
}