LLM as an Oracle

Most discussions of LLM evaluation ask which evaluator is best.

I think that is the wrong first question.

Some tasks need judgment. Others need verification. A rubric can help decide whether an explanation is clear, persuasive, or useful. It is a poor substitute for test cases when evaluating code. Likewise, exact checks are not enough when the output is inherently qualitative.

That distinction is the motivation behind llm-as-an-oracle.

An Oracle, in this setting, is not an all-knowing model. It is an adaptive evaluation layer that decides whether a task should be evaluated by an LLM-as-a-Judge or an LLM-as-a-Verifier.

The central claim is simple:

Evaluation should be routed to the strategy that best matches the structure of the task.

That sounds obvious once stated plainly, but it is easy to violate in practice. As soon as a benchmark, agent workflow, or production evaluation pipeline standardizes on a single evaluator, it begins to treat fundamentally different tasks as though they required the same kind of evidence.

The evaluation problem

Human evaluation remains the reference point for many LLM systems. It is often the most flexible form of assessment because humans can interpret incomplete instructions, account for context, distinguish severity from style, and notice when a candidate answer is technically correct but pragmatically poor.

It also scales badly.

That scaling pressure created demand for automated evaluation. Traditional metrics can be useful, but they are narrow:

None of these solves the broader evaluation problem on its own.

The rise of capable instruction-following models created a new option: LLM-based evaluators. This has produced a family of LLM-as-* patterns:

These ideas have appeared across several lines of work on model-based judging, verification, critique generation, and ranking. [1]

These patterns are often discussed as alternatives. I think they are better understood as evaluation modes with different operating assumptions.

The question is not merely whether LLM evaluators are useful. The more precise question is:

Which evaluator is appropriate for this task, given the evidence available?

Judge and Verifier solve different problems

The easiest way to understand the Oracle idea is to first separate the two evaluation strategies it routes between.

LLM-as-a-Judge

An LLM-as-a-Judge performs holistic evaluation. It reads the task, candidate trajectory, and evaluation criteria, then emits a score or preference. This is the natural fit when the target quality is open-ended, subjective, or otherwise difficult to reduce to executable checks. [2]

Typical Judge-friendly questions include:

The Judge pattern is valuable because many real tasks do not collapse cleanly into executable checks. They require interpretation.

In llm-as-an-oracle, the Judge strategy supports:

These details matter because a Judge is not merely “ask another model what it thinks.” A useful Judge has structure around how scores are produced and how comparisons are stabilized.

For example, a Judge can score each trajectory against several weighted criteria, then use pairwise comparisons only when two candidates are close. If the pairwise order is swapped and averaged, the system can reduce simple positional bias without pretending that the evaluator has become objective. [2]
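As a rough illustration of that idea, the sketch below combines weighted criterion scores and falls back to a swap-averaged pairwise comparison only when two candidates are close. Every name here (`weighted_score`, `swap_averaged_preference`, `pick_winner`, the `margin` of 0.5) is an assumption made for illustration, not the project's API. The `compare` argument stands in for an LLM call that returns the probability that the first-listed candidate is better.

```python
from typing import Callable

# Hypothetical comparator: compare(first, second) returns the evaluator's
# probability that the first-listed candidate is better, in [0, 1].
Compare = Callable[[str, str], float]


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores (criterion names are illustrative)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def swap_averaged_preference(a: str, b: str, compare: Compare) -> float:
    """Estimate P(a is better) by averaging over both presentation orders."""
    a_listed_first = compare(a, b)  # P(a better) when a is shown first
    b_listed_first = compare(b, a)  # P(b better) when b is shown first
    return (a_listed_first + (1.0 - b_listed_first)) / 2.0


def pick_winner(a: str, b: str, score_a: float, score_b: float,
                compare: Compare, margin: float = 0.5) -> str:
    """Trust rubric scores when they clearly differ; otherwise compare pairwise."""
    if abs(score_a - score_b) > margin:
        return a if score_a > score_b else b
    return a if swap_averaged_preference(a, b, compare) >= 0.5 else b


def position_biased(first: str, second: str) -> float:
    """Toy comparator with pure positional bias: always favors the first slot."""
    return 0.75


rubric = weighted_score({"correctness": 4, "clarity": 2},
                        {"correctness": 3, "clarity": 1})
neutralized = swap_averaged_preference("A", "B", position_biased)
```

With the toy comparator, the two orderings cancel and `neutralized` lands at 0.5, which is the intended effect: a preference that exists only because of position disappears. Swapping does not address other biases such as a preference for longer answers.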

LLM-as-a-Verifier

An LLM-as-a-Verifier is better suited to tasks where stronger evidence exists. It is appropriate when candidate trajectories can be evaluated against signals that are closer to correctness than preference. [3]

Typical Verifier-friendly tasks include:

The Verifier strategy in this project is designed around:

That last point is important. A Verifier tries to squeeze more discriminative signal out of the evaluator than a single coarse score can provide.

The Judge asks, “Which answer seems better under this rubric?”

The Verifier asks, “Which trajectory survives the strongest evidence-sensitive checks available for this task?”

Those are related questions, but not the same question.

What the Oracle adds

If Judge and Verifier are both useful, a natural response is to expose both and let the caller choose.

That is necessary, but insufficient.

Many workflows mix task types:

In those settings, asking the caller to manually select an evaluator every time creates friction and invites inconsistency.

The Oracle layer addresses that problem.

Its job is to:

  1. inspect the task and trajectories
  2. extract signals about the task structure
  3. decide which evaluator is the better fit
  4. execute only that strategy
  5. return both the result and the routing explanation

The Oracle is therefore not a third evaluator. It is a decision layer above the two evaluators.
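The five steps can be sketched as a thin wrapper around two evaluators. This is a minimal sketch under stated assumptions: `run_oracle`, `OracleResult`, and the callable signatures are hypothetical names chosen for illustration, and `route` is assumed to cover steps 1 to 3 by returning a strategy name and a human-readable explanation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical signatures, not the project's API.
Evaluator = Callable[[str, Sequence[str]], list[float]]
Router = Callable[[str, Sequence[str]], tuple[str, str]]


@dataclass
class OracleResult:
    strategy: str        # "judge" or "verifier"
    scores: list[float]  # one score per trajectory
    explanation: str     # why this strategy was chosen


def run_oracle(task: str, trajectories: Sequence[str], route: Router,
               judge: Evaluator, verifier: Evaluator) -> OracleResult:
    # Steps 1-3: inspect the inputs, extract signals, decide on a strategy.
    strategy, explanation = route(task, trajectories)
    # Step 4: execute only the selected evaluator.
    evaluator = verifier if strategy == "verifier" else judge
    scores = evaluator(task, trajectories)
    # Step 5: return the result together with the routing explanation.
    return OracleResult(strategy, scores, explanation)


# Stub components, used only to show the control flow.
calls: list[str] = []


def stub_route(task, trajectories):
    return "verifier", "ground truth and test cases present"


def stub_judge(task, trajectories):
    calls.append("judge")
    return [0.5 for _ in trajectories]


def stub_verifier(task, trajectories):
    calls.append("verifier")
    return [1.0, 0.0]


result = run_oracle("fix the N+1 bug", ["patch A", "patch B"],
                    stub_route, stub_judge, stub_verifier)
```

Note that the Judge stub is never invoked in this run. The wrapper selects one path rather than running both and reconciling the results.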

Anatomy of the Oracle router

The default router in llm-as-an-oracle is deterministic. It does not call an LLM to decide which evaluator to use. Instead, it extracts interpretable signals and applies a fixed chain of routing policies.

That design choice is intentional. The system should make evaluator selection more legible, not less.

[Figure: LLM as an Oracle routing diagram]

Step 1: extract routing signals

The router converts the task and trajectories into a structured set of signals. The current implementation uses features such as:

These features encode simple but meaningful intuitions.

For example:

The goal is not to perfectly infer task type from text. The goal is to make the selection logic explicit enough to inspect, revise, and extend.
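A minimal sketch of what signal extraction could look like is shown below. The field names, the dictionary layout of the task and trajectories, and the keyword list are all assumptions for illustration; they are not the project's actual feature set.

```python
import re
from dataclasses import dataclass

# Illustrative keyword heuristic for open-ended wording (an assumption).
OPEN_ENDED_HINTS = re.compile(
    r"\b(explain|summarize|persuasive|tone|style|creative)\b", re.IGNORECASE
)


@dataclass(frozen=True)
class RoutingSignals:
    has_ground_truth: bool
    has_test_cases: bool
    has_execution_output: bool
    open_ended_wording: bool


def extract_signals(task: dict, trajectories: list[dict]) -> RoutingSignals:
    """Turn raw inputs into explicit, inspectable booleans."""
    return RoutingSignals(
        has_ground_truth=bool(task.get("ground_truth")),
        has_test_cases=bool(task.get("test_cases")),
        has_execution_output=any(t.get("execution_output") for t in trajectories),
        open_ended_wording=bool(OPEN_ENDED_HINTS.search(task.get("prompt", ""))),
    )


coding_signals = extract_signals(
    {"prompt": "Fix the N+1 query bug", "ground_truth": "...", "test_cases": ["t1"]},
    [{"execution_output": "3 passed"}],
)
writing_signals = extract_signals(
    {"prompt": "Explain the tradeoffs in a persuasive memo"},
    [{"final_output": "..."}],
)
```

A keyword heuristic like this will misclassify some prompts. Because every signal is a named field, though, a wrong routing decision can be traced back to the specific signal that caused it.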

Step 2: collect policy votes

Signals are passed through a chain of policies. The default router uses policies that reason about:

Each policy casts a weighted vote for either Judge or Verifier.

Conceptually:

Ground truth present?        -> favor Verifier
Execution output available?  -> favor Verifier
Open-ended wording?          -> favor Judge
Very low routing confidence? -> fall back to Judge

The implementation is more nuanced than that sketch, but the spirit is the same. Evaluation mode is chosen by accumulating evidence.
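The vote-collection step might be sketched as follows. The policy names, the weights, and the `Vote` structure are hypothetical and chosen only to mirror the conceptual table above; the real policies are described as more nuanced.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Vote:
    strategy: str  # "judge" or "verifier"
    weight: float  # illustrative weight, not a tuned value
    reason: str


# A policy inspects the signals and either votes or abstains (returns None).
Policy = Callable[[dict[str, bool]], Optional[Vote]]


def ground_truth_policy(signals: dict[str, bool]) -> Optional[Vote]:
    if signals.get("has_ground_truth"):
        return Vote("verifier", 2.0, "ground truth present")
    return None


def execution_policy(signals: dict[str, bool]) -> Optional[Vote]:
    if signals.get("has_execution_output"):
        return Vote("verifier", 1.5, "execution output available")
    return None


def wording_policy(signals: dict[str, bool]) -> Optional[Vote]:
    if signals.get("open_ended_wording"):
        return Vote("judge", 1.0, "open-ended wording")
    return None


DEFAULT_POLICIES: tuple[Policy, ...] = (
    ground_truth_policy,
    execution_policy,
    wording_policy,
)


def collect_votes(signals: dict[str, bool],
                  policies: tuple[Policy, ...] = DEFAULT_POLICIES) -> list[Vote]:
    """Run every policy in order and keep the votes that were actually cast."""
    votes = []
    for policy in policies:
        vote = policy(signals)
        if vote is not None:
            votes.append(vote)
    return votes


votes = collect_votes({"has_ground_truth": True, "open_ended_wording": True})
```

Keeping each policy as a small function makes it straightforward to add, remove, or reweight one without touching the others.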

Step 3: aggregate confidence

The router aggregates weighted policy votes into a final confidence score. The winning strategy is selected only if its confidence clears a minimum threshold. Otherwise, the system falls back to the more general-purpose Judge path.
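One simple way to aggregate is to treat confidence as the winner's share of the total vote weight. The sketch below assumes that definition and a threshold of 0.6; both are illustrative choices, not values taken from the project.

```python
def aggregate(votes: list[tuple[str, float]],
              threshold: float = 0.6) -> tuple[str, float]:
    """Return (strategy, confidence) from (strategy, weight) votes.

    Confidence is the winning strategy's share of the total weight.
    Below the threshold, or with no votes at all, fall back to "judge".
    """
    totals = {"judge": 0.0, "verifier": 0.0}
    for strategy, weight in votes:
        totals[strategy] += weight

    total_weight = sum(totals.values())
    if total_weight == 0:
        return "judge", 0.0

    winner = max(totals, key=totals.get)
    confidence = totals[winner] / total_weight
    if confidence < threshold:
        return "judge", confidence
    return winner, confidence


strong = aggregate([("verifier", 2.0), ("verifier", 1.5), ("judge", 1.0)])
split = aggregate([("verifier", 1.0), ("judge", 1.0)])
empty = aggregate([])
```

In this sketch, an evenly split vote and an empty vote both resolve to the Judge path, matching the fallback behavior described above.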

This creates an important separation:

That is a better design than letting every downstream evaluator silently assume the task is well suited to its own strengths.

Step 4: expose the routing trace

The output of a routing decision includes:

This is one of the parts I care about most.

Evaluation pipelines already accumulate ambiguity. A score without a path to understanding how it was obtained is difficult to debug. The Oracle makes one critical source of ambiguity observable: why this evaluator was chosen in the first place.
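A routing trace can be as simple as a serializable record attached to each evaluation. The field names below are assumptions for illustration; the idea is only that the decision, its confidence, and its reasons travel together with the score.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RoutingDecision:
    strategy: str        # evaluator that was selected
    confidence: float    # aggregated routing confidence
    fallback_used: bool  # True if the router defaulted to the Judge
    reasons: list[str] = field(default_factory=list)


decision = RoutingDecision(
    strategy="verifier",
    confidence=0.78,
    fallback_used=False,
    reasons=["ground truth present", "test cases present"],
)

# One JSON line per decision is easy to log, diff, and query later.
trace_line = json.dumps(asdict(decision), sort_keys=True)
```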

Evaluating trajectories, not just answers

The project uses the term trajectory deliberately.

A trajectory is a candidate task-solving attempt. It may contain:

This is especially relevant for agents. When a coding agent fixes a bug, the thing we care about is not only the final patch. We may care about:

For text-only tasks, evaluating the final answer may be sufficient. For agents, the evaluation object often needs to be richer.

The Oracle architecture assumes that richer object from the beginning.
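A trajectory could be modeled roughly as below. This is a sketch with hypothetical field names (`steps`, `final_output`, `execution_output`) and step kinds, not the project's schema. It shows an evaluation object that carries the intermediate steps as well as the final answer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    kind: str     # e.g. "thought", "tool_call", "tool_result" (illustrative)
    content: str


@dataclass
class Trajectory:
    trajectory_id: str
    final_output: str
    steps: list[Step] = field(default_factory=list)
    execution_output: Optional[str] = None  # e.g. a test runner's output

    def tool_calls(self) -> list[Step]:
        """Return only the steps in which the agent invoked a tool."""
        return [step for step in self.steps if step.kind == "tool_call"]


patch_attempt = Trajectory(
    trajectory_id="agent-join",
    final_output="Rewrote get_orders_with_items to use a single JOIN.",
    steps=[
        Step("thought", "The loop issues one query per order."),
        Step("tool_call", "run_tests()"),
        Step("tool_result", "3 passed"),
    ],
    execution_output="3 passed",
)
```

A text-only task would simply leave `steps` empty and `execution_output` unset, so the same structure covers both cases.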

A concrete example: the N+1 query bug

One example in the repository asks three agents to fix an N+1 query problem. The original function loads orders first, then issues one SQL query per order to load items. That means a user with N orders triggers 1 + N queries.

The task asks for a constant-query solution and provides both ground truth and test cases.

The original bug has this shape:

def get_orders_with_items(user_id: int) -> list[dict]:
    # Query 1: load every order for the user.
    orders = db.execute(
        "SELECT * FROM orders WHERE user_id = ?",
        [user_id],
    )

    # Queries 2..N+1: one extra query per order (the N+1 pattern).
    for order in orders:
        order["items"] = db.execute(
            "SELECT * FROM items WHERE order_id = ?",
            [order["id"]],
        )

    return orders

The loop is the problem. The first query fetches the orders, then each order triggers another query for its items.

Three candidate trajectories are evaluated:

  1. one rewrites the query with a JOIN
  2. one performs a batched prefetch with WHERE IN
  3. one adds an lru_cache around the inner item lookup

The first two are legitimate fixes, though they make different tradeoffs. The third sounds plausible because caching often improves performance. But it does not solve the stated problem. On a cold cache, query count still grows with the number of orders.

The misleading fix looks like this:

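from functools import lru_cache


# Caches items per order_id, but a cold cache still issues one query per order.
@lru_cache(maxsize=256)
def fetch_items(order_id: int) -> list[dict]:
    return db.execute(
        "SELECT * FROM items WHERE order_id = ?",
        [order_id],
    )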


def get_orders_with_items(user_id: int) -> list[dict]:
    orders = db.execute(
        "SELECT * FROM orders WHERE user_id = ?",
        [user_id],
    )

    for order in orders:
        order["items"] = fetch_items(order["id"])

    return orders

This may help repeated calls for the same order, but it does not change the first-run query pattern:

expected: query_count is constant, independent of number_of_orders
actual:   query_count == 1 + number_of_orders
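That gap can be checked mechanically. The sketch below runs the cached version against an in-memory SQLite database and counts queries. `CountingDB`, the table schema, and the sample rows are assumptions made for this illustration; the article's `db` object is not specified.

```python
import sqlite3
from functools import lru_cache


class CountingDB:
    """Hypothetical stand-in for `db` that counts executed queries."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.row_factory = sqlite3.Row
        self.query_count = 0

    def execute(self, sql: str, params: list) -> list[dict]:
        self.query_count += 1
        return [dict(row) for row in self.conn.execute(sql, params)]


db = CountingDB()
# Assumed schema and data: one user with three orders.
db.conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, name TEXT);
    INSERT INTO orders (id, user_id) VALUES (1, 7), (2, 7), (3, 7);
    INSERT INTO items (order_id, name) VALUES (1, 'a'), (1, 'b'), (2, 'c');
    """
)


@lru_cache(maxsize=256)
def fetch_items(order_id: int) -> list[dict]:
    return db.execute("SELECT * FROM items WHERE order_id = ?", [order_id])


def get_orders_with_items(user_id: int) -> list[dict]:
    orders = db.execute("SELECT * FROM orders WHERE user_id = ?", [user_id])
    for order in orders:
        order["items"] = fetch_items(order["id"])
    return orders


get_orders_with_items(7)
cold_cache_queries = db.query_count                     # 1 + 3 orders

get_orders_with_items(7)
warm_cache_queries = db.query_count - cold_cache_queries  # orders query only
```

On the first call the count is 1 plus the number of orders, so a query-count test fails even though the second call looks efficient. The cache also returns the same list object on every hit, which is a separate hazard if callers mutate the result.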

A real fix changes the query shape. For example, the JOIN trajectory uses one SQL query and groups the rows afterward:

SELECT o.*, i.*
FROM orders o
LEFT JOIN items i ON i.order_id = o.id
WHERE o.user_id = ?;
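To make "groups the rows afterward" concrete, here is a minimal sketch of the JOIN approach against an in-memory SQLite database. The `CountingDB` helper, the schema, and the sample data are assumptions for illustration. The query aliases its columns instead of selecting `o.*, i.*`, because both tables expose an `id` column and the two would collide in a dict-style row.

```python
import sqlite3


class CountingDB:
    """Hypothetical stand-in for `db` that counts executed queries."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.row_factory = sqlite3.Row
        self.query_count = 0

    def execute(self, sql: str, params: list) -> list[dict]:
        self.query_count += 1
        return [dict(row) for row in self.conn.execute(sql, params)]


db = CountingDB()
# Assumed schema and data: order 3 deliberately has no items.
db.conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, name TEXT);
    INSERT INTO orders (id, user_id) VALUES (1, 7), (2, 7), (3, 7);
    INSERT INTO items (order_id, name) VALUES (1, 'a'), (1, 'b'), (2, 'c');
    """
)


def get_orders_with_items(user_id: int) -> list[dict]:
    rows = db.execute(
        """
        SELECT o.id AS order_id, o.user_id AS user_id,
               i.id AS item_id, i.name AS item_name
        FROM orders o
        LEFT JOIN items i ON i.order_id = o.id
        WHERE o.user_id = ?
        ORDER BY o.id, i.id
        """,
        [user_id],
    )

    orders: dict[int, dict] = {}
    for row in rows:
        order = orders.setdefault(
            row["order_id"],
            {"id": row["order_id"], "user_id": row["user_id"], "items": []},
        )
        # LEFT JOIN yields NULL item columns for orders without items.
        if row["item_id"] is not None:
            order["items"].append({"id": row["item_id"], "name": row["item_name"]})
    return list(orders.values())


orders = get_orders_with_items(7)
```

The query count stays at one no matter how many orders the user has. The tradeoff is that order columns are repeated on every item row, which is one reason the batched `WHERE IN` variant can be preferable for wide order tables.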

This is exactly the sort of case where evaluator choice matters.

Why Judge alone is risky here

A holistic Judge may recognize that the cache-based answer is weaker. But it is also possible for that answer to benefit from surface plausibility:

If the score is driven too much by presentation quality, the wrong candidate can become competitive.

Why Verifier is the better fit

The task has stronger evidence: a ground-truth solution, test cases, and a query-count requirement that can be checked directly.

That is precisely the situation where the Oracle should route toward the Verifier. The evaluation problem is not mainly aesthetic. It is evidential.

The interesting part is not merely that Verifier can help. The more general point is that the Oracle can identify this task shape before evaluation begins.

Judge and Verifier are not rivals

It is tempting to treat this as a winner-take-all comparison:

I do not think that framing is useful.

Each strategy fails differently.

Failure modes of Judge

A Judge can:

Failure modes of Verifier

A Verifier can:

The Oracle does not eliminate these problems. It tries to reduce one avoidable problem: choosing the wrong mode of evaluation for the task at hand.

When the Oracle should ask for help

There is another failure mode worth making explicit: sometimes the task itself is underspecified.

Suppose three architecture recommendations are all defensible, but the best one depends on a missing fact about team size, latency goals, compliance constraints, or deployment environment. No evaluator should pretend confidence if the information needed to decide was never supplied.

The fuller design in this project explores a Human Oracle escalation path for those cases. When evaluation evidence runs out, the system can ask a targeted clarifying question, incorporate the answer, and re-evaluate.
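A bounded escalation loop might look like the sketch below. The names (`Evaluation`, `evaluate_with_escalation`, `ask_human`), the confidence threshold, and the one-question limit are hypothetical. The sketch assumes the evaluator can report which missing fact is blocking a confident decision.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evaluation:
    winner: str
    confidence: float
    missing_fact: Optional[str] = None  # question that would unblock the decision


def evaluate_with_escalation(
    task: str,
    evaluate: Callable[[str, list[str]], Evaluation],
    ask_human: Callable[[str], str],
    min_confidence: float = 0.6,
    max_questions: int = 1,
) -> tuple[Evaluation, list[str]]:
    """Re-evaluate with human-supplied context, asking a bounded number of questions."""
    context: list[str] = []
    result = evaluate(task, context)
    asked = 0
    while (result.confidence < min_confidence
           and result.missing_fact is not None
           and asked < max_questions):
        answer = ask_human(result.missing_fact)
        context.append(f"{result.missing_fact} {answer}")
        asked += 1
        result = evaluate(task, context)
    return result, context


# Stubs standing in for an evaluator and a person.
def stub_evaluate(task: str, context: list[str]) -> Evaluation:
    if not context:
        return Evaluation("monolith", 0.4, "What is the team size?")
    return Evaluation("modular monolith", 0.9)


def stub_human(question: str) -> str:
    return "Five engineers."


final, gathered = evaluate_with_escalation(
    "Pick an architecture", stub_evaluate, stub_human
)
```

The `max_questions` bound keeps the loop from turning into an open-ended interview. When the limit is reached, the low confidence is returned as-is rather than hidden.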

The point is not to put a person in the loop by default. It is to avoid manufactured certainty when a decision depends on missing context.

That same principle motivates the router itself:

What this suggests about evaluation design

The Oracle pattern leads to a broader design lesson.

Good evaluation systems should have at least four properties.

1. Match evaluator to task structure

Open-ended tasks and evidence-grounded tasks are different. Treating them as if they require the same evaluator introduces avoidable error.

2. Make evaluator selection explicit

The choice between Judge and Verifier should be part of the system design, not a hidden convention buried in notebook code or benchmark glue.

3. Preserve inspectability

Scores are easier to trust when their provenance is visible. Routing traces, criteria, and confidence should be inspectable artifacts.

4. Admit when evidence is insufficient

Some tasks are underspecified. A strong evaluator is not a substitute for missing context. Escalation can be the correct system behavior.

Where this pattern is useful

I think LLM-as-an-Oracle is especially relevant for:

In all of those cases, evaluator selection is part of the problem.

Treating it as a first-class system component is cleaner than standardizing on a single evaluation method and compensating later with increasingly elaborate exceptions.

Closing thought

The evaluator should fit the task.

LLM-as-an-Oracle is my attempt to turn that principle into a concrete system: route between judgment and verification, expose the reasons for the choice, and leave room for human escalation when neither automated path has enough evidence.

  1. Zheng et al. introduce LLM-as-a-Judge (https://arxiv.org/abs/2306.05685); the LLM-as-a-Verifier framework develops evidence-sensitive verification (https://llm-as-a-verifier.notion.site/); CritiqueLLM studies critique generation for evaluation (https://arxiv.org/abs/2311.18702); and pairwise ranking prompting shows how LLMs can be used directly as rankers (https://arxiv.org/abs/2306.17563).

  2. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” 2023. https://arxiv.org/abs/2306.05685

  3. “LLM-as-a-Verifier: A General-Purpose Verification Framework.” https://llm-as-a-verifier.notion.site/