Overview
The evaluator-optimizer pattern creates a feedback loop between two LLM roles: a generator that produces output and an evaluator that critiques it. The generator creates an initial attempt, the evaluator assesses it against defined criteria and provides specific feedback, and the generator revises its output based on that feedback. This loop continues until the evaluator determines the output meets the quality threshold or a maximum iteration count is reached. The pattern is particularly effective when there are clear quality criteria that an LLM can assess but that are difficult to satisfy in a single generation pass.
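In outline, the loop can be sketched with pluggable generator and evaluator callables (a minimal sketch; the `refine`, `generate`, and `evaluate` names are illustrative, not tied to any particular SDK):

```python
def refine(generate, evaluate, threshold: int, max_iterations: int) -> str:
    """Run an evaluator-optimizer loop.

    generate(feedback) -> output: the generator role (feedback=None on the first pass).
    evaluate(output) -> (score, feedback): the evaluator role.
    """
    output = generate(feedback=None)           # initial attempt
    for _ in range(max_iterations):
        score, feedback = evaluate(output)     # critique against the criteria
        if score >= threshold:                 # evaluator approves: stop early
            break
        output = generate(feedback=feedback)   # revise using the critique
    return output
```

In practice `generate` and `evaluate` would each wrap an LLM call; here they are plain callables so the control flow stays visible.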
How It Works
1. The generator produces an initial output. Given the task description and any constraints, the generator LLM creates a first draft.
2. The evaluator assesses the output. A separate LLM call (or the same model with an evaluator prompt) reviews the output against defined criteria. It returns a structured evaluation: a pass/fail decision, a quality score, and specific feedback identifying what needs improvement.
3. Check the termination condition. If the evaluator approves the output (meets the quality threshold) or the maximum number of iterations has been reached, return the current output.
4. The generator revises. If the output did not pass evaluation, the generator receives the evaluator’s feedback along with the previous output, and produces an improved version.
5. Repeat from step 2. The loop continues until the termination condition is met.
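The structured evaluation returned in step 2 and the termination check in step 3 can be made concrete with a small schema (a sketch; the field names here are illustrative, not prescribed by the pattern):

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    passed: bool   # did the evaluator approve the output outright?
    score: int     # quality score, e.g. on a 1-10 scale
    feedback: str  # specific, actionable notes for the next revision

def meets_threshold(ev: Evaluation, threshold: int = 8) -> bool:
    """Termination check from step 3: approved outright or scored above the bar."""
    return ev.passed or ev.score >= threshold
```

Forcing the evaluator to fill a schema like this (rather than free-form prose) keeps the feedback parseable and the loop's exit condition unambiguous.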
When to Use
- You have clear, articulable quality criteria that an LLM can evaluate (e.g., code correctness, adherence to a style guide, factual accuracy against source material).
- A single generation pass frequently produces output that is close but not quite right.
- The task benefits from iterative refinement — each round of feedback produces measurably better output.
- You are willing to trade additional latency and cost for higher output quality.
- Human-in-the-loop review is expensive, and automated evaluation can approximate it.
When Not to Use
- The task is simple enough that the generator produces acceptable output on the first attempt.
- You lack clear evaluation criteria — if the evaluator cannot provide actionable feedback, the loop will not converge.
- Latency is critical and you cannot afford multiple round-trips.
- The generator tends to degrade rather than improve with iterative feedback (a sign that the evaluation criteria or feedback format needs rethinking).
Example
```python
# Evaluator-Optimizer: iterative code review and improvement.
# Assumes `llm` is a client object exposing llm.call(system=..., prompt=...);
# substitute your provider's SDK.
import json

MAX_ITERATIONS = 3
QUALITY_THRESHOLD = 8  # out of 10

def generate_code(task: str, feedback: str = "") -> str:
    """Generator: write or revise code based on the task and any feedback."""
    prompt = f"Task: {task}"
    if feedback:
        prompt += f"\n\nPrevious feedback to address:\n{feedback}"
    response = llm.call(
        system="You are an expert Python developer. Write clean, well-documented code.",
        prompt=prompt,
    )
    return response.text

def evaluate_code(task: str, code: str) -> dict:
    """Evaluator: review the code and return a score with feedback."""
    response = llm.call(
        system=(
            "You are a senior code reviewer. Evaluate the code on correctness, readability, "
            "edge-case handling, and documentation. Return JSON with 'score' (1-10) and 'feedback'."
        ),
        prompt=f"Task: {task}\n\nCode:\n{code}",
    )
    return json.loads(response.text)

# Run the evaluator-optimizer loop.
task = "Write a function to merge two sorted linked lists."
code = generate_code(task)
for i in range(MAX_ITERATIONS):
    evaluation = evaluate_code(task, code)
    print(f"Iteration {i + 1}: score={evaluation['score']}")
    if evaluation["score"] >= QUALITY_THRESHOLD:
        print("Quality threshold met.")
        break
    code = generate_code(task, feedback=evaluation["feedback"])
# 'code' now holds the last version produced within the iteration budget.
# Note: if the loop exhausts its budget, the final revision is returned unevaluated.
```
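One practical hardening step, not shown above: evaluator models sometimes wrap their JSON in Markdown code fences, which makes a bare json.loads call fail. A defensive parser (a sketch) tolerates that:

```python
import json
import re

def parse_evaluation(text: str) -> dict:
    """Parse the evaluator's JSON, tolerating a ```json ... ``` fence around it."""
    # Strip a surrounding Markdown code fence, if present.
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)
```

Swapping this in for the bare json.loads call in evaluate_code makes the loop more robust to formatting drift in the evaluator's responses.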
Related Patterns
- Reflection — A lighter-weight version where a single LLM critiques its own output in the same call or a follow-up, without a separate evaluator role.
- Prompt Chaining — The evaluator-optimizer loop can be embedded as one stage within a larger chain.
- Orchestrator-Worker — The orchestrator can use evaluator-optimizer loops internally to refine worker outputs before final synthesis.