Part 3

The acceptance rules we had to loosen

A lot of the real design work was in defining what counts as an acceptable variant.

Memo type detection

The old benchmark baked one narrow preference for how confidence should be expressed into its definition of correctness. That was a mistake.

The better check is:

  • did the model choose an acceptable type?
  • did it represent confidence numerically?
  • did it handle uncertainty in some reasonable way?
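To make those three questions concrete, here is a minimal sketch of how a lenient check could look. The field names (type, confidence, alternatives, needs_review), the set of acceptable types, and the 0.7 cutoff are all assumptions for illustration, not the actual card definition.

```python
from numbers import Real

# Hypothetical acceptable types for one card; the real card set may differ.
ACCEPTABLE_TYPES = {"task", "reminder", "todo"}
LOW_CONFIDENCE = 0.7  # illustrative cutoff, not a value from the benchmark


def check_memo_type(output: dict, acceptable_types: set = ACCEPTABLE_TYPES) -> dict:
    """Answer the three questions separately instead of one exact-match verdict."""
    memo_type = str(output.get("type", "")).strip().lower()
    confidence = output.get("confidence")

    # Numeric means a real number in [0, 1]; bool is excluded because it is an int subclass.
    numeric = (
        isinstance(confidence, Real)
        and not isinstance(confidence, bool)
        and 0.0 <= confidence <= 1.0
    )

    # "Reasonable" uncertainty handling is read loosely here (an assumption):
    # either the model flags its doubt, or it is confident enough not to need to.
    flagged = bool(output.get("alternatives")) or bool(output.get("needs_review"))
    handles_uncertainty = flagged or (numeric and confidence >= LOW_CONFIDENCE)

    return {
        "acceptable_type": memo_type in acceptable_types,
        "numeric_confidence": numeric,
        "handles_uncertainty": handles_uncertainty,
    }


result = check_memo_type({"type": "Task", "confidence": 0.62, "alternatives": ["reminder"]})
```

The point of returning three booleans is that a grader can report which question failed, rather than collapsing everything into one pass or fail.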

Transcript cleanup

The benchmark should care about preserved meaning, not exact phrasing. If the model says “3 out of 5” instead of “3 of 5”, that should not sink the card.
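One crude way to approximate this is to normalize both strings before comparing them. The sketch below is only a stand-in for a real meaning check: it handles case, punctuation, whitespace, and the one phrasing variant mentioned above, and the function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Collapse surface differences that should not affect the grade."""
    text = text.lower()
    text = re.sub(r"\bout of\b", "of", text)  # "3 out of 5" -> "3 of 5"
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation
    return " ".join(text.split())             # collapse whitespace


def same_meaning(expected: str, actual: str) -> bool:
    # A rough proxy only: equal after normalization. It cannot detect paraphrase.
    return normalize(expected) == normalize(actual)


ok = same_meaning("Finished 3 of 5 tasks.", "finished 3 out of 5 tasks")
```

A rule list like this will never cover every paraphrase, so anything it rejects is still a candidate for manual review rather than an automatic failure.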

Calendar intent

This was one of the most obviously over-constrained cards.

The model should get credit for any safe, usable representation of scheduling intent, including:

  • event drafts
  • scheduling packets
  • direct event objects with unresolved fields
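A grader that accepts all three shapes might look something like the sketch below. The wrapper keys (event_draft, scheduling_packet, event) and the fields (title, start, unresolved) are assumed names for illustration; the real outputs may be shaped differently.

```python
from typing import Optional

# Hypothetical top-level keys under which a model might place scheduling intent.
ACCEPTED_WRAPPERS = ("event_draft", "scheduling_packet", "event")


def extract_scheduling_intent(output: dict) -> Optional[dict]:
    """Return the first recognized scheduling object, whatever it is called."""
    for key in ACCEPTED_WRAPPERS:
        value = output.get(key)
        if isinstance(value, dict):
            return value
    return None


def is_usable_intent(output: dict) -> bool:
    intent = extract_scheduling_intent(output)
    if intent is None:
        return False
    has_title = bool(intent.get("title"))
    # "Safe" is read here as: a missing start time must be declared unresolved,
    # not silently omitted. This rule is an assumption, not the benchmark's own.
    start_known = bool(intent.get("start"))
    start_declared_open = "start" in (intent.get("unresolved") or [])
    return has_title and (start_known or start_declared_open)


draft_ok = is_usable_intent({"event_draft": {"title": "Dentist", "start": "2025-03-04T09:00"}})
```

Under these assumptions, a direct event object with an explicit unresolved start gets the same credit as a fully specified draft.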

Similar memo recall

We should care about whether the right memo came back with a usable rationale, not whether the ranking object was named exactly the way we expected.

Context packet builder

We should reward the useful compression itself:

  • active tasks
  • decisions
  • open questions
  • relevant memo references

not one single exact wrapper object.
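As a sketch of what wrapper-agnostic grading could mean in practice, the check below looks for the four sections at the top level or one level down, and scores coverage rather than exact shape. The section keys are assumed names, and treating an empty list as present is a design choice of this sketch.

```python
# Hypothetical keys for the four sections listed above.
REQUIRED_SECTIONS = ("active_tasks", "decisions", "open_questions", "memo_refs")


def find_sections(output: dict, depth: int = 2) -> dict:
    """Collect required sections from the top level or inside one wrapper object."""
    found = {}
    for key, value in output.items():
        if key in REQUIRED_SECTIONS and isinstance(value, list):
            found[key] = value  # an empty list still counts as present
        elif isinstance(value, dict) and depth > 1:
            for inner_key, inner_value in find_sections(value, depth - 1).items():
                found.setdefault(inner_key, inner_value)
    return found


def packet_coverage(output: dict) -> float:
    """Fraction of required sections present, regardless of the wrapper name."""
    return len(find_sections(output)) / len(REQUIRED_SECTIONS)


wrapped = {
    "context_packet": {
        "active_tasks": ["ship v2"],
        "decisions": [],
        "open_questions": ["who audits failures?"],
        "memo_refs": ["memo-12"],
    }
}
coverage = packet_coverage(wrapped)
```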

The calibration rule

Once we had the structure, we needed a discipline for checking whether it was working.

That rule became:

A strong mainstream model should look obviously good on the core eval.

Not perfect. Not magical. Just clearly good.

This is why the first gpt-4.1 V2 slice mattered so much.

On a three-card V2 slice it posted:

{
  "pass_rate": 1.0,
  "average_score": 0.95,
  "task_score": 1.0,
  "usable_score": 1.0,
  "contract_score": 0.6667
}

That result is almost more useful than a perfect score.

It says:

  • the tasks are now reasonable
  • the model solved them cleanly
  • the output was fully usable
  • and exact contract is still something separate we can improve

That is exactly the kind of signal we wanted.
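For readers who want to see how separate dimensions can roll up into numbers like these, here is a small sketch. The weights, the pass threshold, and the per-card breakdown (two cards meeting the exact contract, one not) are assumptions chosen because they happen to reproduce the reported summary; the actual scoring code may work differently.

```python
from statistics import mean

# Hypothetical weights; the post does not state them.
WEIGHTS = {"task": 0.5, "usable": 0.35, "contract": 0.15}


def card_score(dims: dict) -> float:
    """Weighted score for one card from its three dimensions."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)


def summarize(cards: list, pass_threshold: float = 0.8) -> dict:
    scores = [card_score(card) for card in cards]
    return {
        "pass_rate": mean([1.0 if s >= pass_threshold else 0.0 for s in scores]),
        "average_score": round(mean(scores), 4),
        "task_score": round(mean([c["task"] for c in cards]), 4),
        "usable_score": round(mean([c["usable"] for c in cards]), 4),
        "contract_score": round(mean([c["contract"] for c in cards]), 4),
    }


# Assumed per-card results: every task solved and usable, one card off-contract.
cards = [
    {"task": 1.0, "usable": 1.0, "contract": 1.0},
    {"task": 1.0, "usable": 1.0, "contract": 1.0},
    {"task": 1.0, "usable": 1.0, "contract": 0.0},
]
summary = summarize(cards)
```

Keeping contract as a small, separate weight is what lets an off-contract card still pass while remaining visible in the summary.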

What we implemented in the repo

The benchmark is not just an essay now. It has a concrete shape in the repo.

Key files:

That matters. We now have:

  • written principles
  • an explicit card set
  • separate scoring dimensions
  • calibration runs against real models
  • a clearer story about what the benchmark is for

What I would still treat as unfinished

Even though V2 is much better, I would still call it a living benchmark.

The biggest unfinished pieces are:

  • broadening the calibration set beyond one great reference score
  • continuing to test weaker and mid-tier models for healthy variance
  • manually auditing failures to make sure the grader and product truth still agree
  • deciding which stretch tasks deserve their own benchmark instead of staying in a pile of “interesting extras”

That is healthy unfinishedness, though. It is not the same as basic confusion.

The real standard

The standard for this benchmark is simple.

A strong model should look strong. A weak model should look weaker. The score should tell us whether the problem is:

  • the model
  • the schema contract
  • the product integration layer
  • or the benchmark itself
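One possible way to read the scores along those lines is a simple triage order, sketched below. The mapping from each score to a layer (in particular, tying usable_score to the integration layer), the 0.8 threshold, and the function name are assumptions for illustration, not rules the benchmark defines.

```python
def triage(candidate: dict, reference: dict, threshold: float = 0.8) -> str:
    """Guess which layer to inspect first, given a candidate and a strong reference run."""
    if reference["task_score"] < threshold:
        # Even the strong reference model looks bad: suspect the cards or grader.
        return "benchmark"
    if candidate["task_score"] < threshold:
        return "model"
    if candidate["usable_score"] < threshold:
        return "product integration layer"
    if candidate["contract_score"] < threshold:
        return "schema contract"
    return "none"


# The gpt-4.1 slice reported earlier, used as both candidate and reference.
gpt41_slice = {"task_score": 1.0, "usable_score": 1.0, "contract_score": 0.6667}
first_suspect = triage(gpt41_slice, gpt41_slice)
```

On the reported slice this points at the schema contract, which matches the reading above that exact contract is the remaining thing to improve.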

That is what core_eval_v2 is trying to do.

Not settle the whole question of intelligence. Just tell the truth about a small workflow product.

And honestly, that is already hard enough.