The acceptance rules we had to loosen
A lot of the real design work was in defining what counts as an acceptable variant.
Memo type detection
The old benchmark encoded a narrow confidence preference into correctness. That was a mistake.
The better check is:
- did the model choose an acceptable type?
- did it represent confidence numerically?
- did it handle uncertainty in some reasonable way?
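The three questions above can be sketched as a small checker. This is a minimal illustration, not the repo's grader: the field names `memo_type`, `confidence`, `alternatives`, and `needs_review`, and the 0.7 threshold, are all assumptions made for the example.

```python
from numbers import Real

def check_memo_type(output: dict, acceptable_types: set) -> dict:
    """Answer each acceptance question with a boolean (hypothetical field names)."""
    memo_type = output.get("memo_type")
    confidence = output.get("confidence")

    # Question 1: any type in the acceptable set counts, not one exact label.
    type_ok = isinstance(memo_type, str) and memo_type in acceptable_types

    # Question 2: confidence must be a real number in [0, 1]. bool is excluded
    # explicitly because bool is a subclass of int in Python.
    confidence_ok = (
        isinstance(confidence, Real)
        and not isinstance(confidence, bool)
        and 0.0 <= confidence <= 1.0
    )

    # Question 3: uncertainty counts as handled if the model is confident, or
    # if it surfaces alternatives or a review flag (assumed conventions).
    uncertainty_ok = confidence_ok and (
        confidence >= 0.7
        or bool(output.get("alternatives"))
        or output.get("needs_review") is True
    )
    return {
        "type_ok": type_ok,
        "confidence_ok": confidence_ok,
        "uncertainty_ok": uncertainty_ok,
    }
```

The point of returning three separate booleans is that a card can then fail on confidence format without also being marked wrong on the type choice.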
Transcript cleanup
The benchmark should care about preserved meaning, not exact phrasing. If the model says “3 out of 5” instead of “3 of 5”, that should not sink the card.
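One cheap way to approximate "preserved meaning" is to normalize both strings before comparing them. The sketch below is an assumption about how that could look, not a description of the actual grader; `normalize_transcript` and `same_meaning` are hypothetical names, and normalization is only a rough proxy for meaning.

```python
import re

def normalize_transcript(text: str) -> str:
    """Reduce surface variation that should not affect grading."""
    text = text.lower()
    # Treat "3 out of 5" and "3 of 5" as the same phrase.
    text = re.sub(r"\b(\d+)\s+out\s+of\s+(\d+)\b", r"\1 of \2", text)
    # Drop punctuation, then collapse runs of whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def same_meaning(expected: str, actual: str) -> bool:
    """Compare two transcripts after normalization."""
    return normalize_transcript(expected) == normalize_transcript(actual)
```

With this, `same_meaning("Finished 3 of 5 items.", "finished 3 out of 5 items")` holds, while a changed number such as "4 of 5" still fails.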
Calendar intent
This was one of the most obviously over-constrained cards.
The model should get credit for any safe, usable representation of scheduling intent, including:
- event drafts
- scheduling packets
- direct event objects with unresolved fields
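A grader that accepts all three shapes could look something like the sketch below. The wrapper keys (`event_draft`, `scheduling_packet`, `event`) and the payload fields (`title`, `start`, `unresolved_fields`) are hypothetical names chosen for the example; "safe" is interpreted here as "a missing start time is flagged as unresolved rather than silently dropped".

```python
from typing import Optional

# Assumed wrapper names, one per accepted representation.
ACCEPTED_WRAPPERS = ("event_draft", "scheduling_packet", "event")

def extract_scheduling_intent(output: dict) -> Optional[dict]:
    """Return the scheduling payload from any accepted wrapper, else None."""
    for key in ACCEPTED_WRAPPERS:
        payload = output.get(key)
        if isinstance(payload, dict):
            return payload
    return None

def calendar_intent_ok(output: dict) -> bool:
    payload = extract_scheduling_intent(output)
    if payload is None:
        return False
    # Usable: there is at least something to put on the calendar.
    has_title = bool(payload.get("title"))
    # Safe: a start time is either present or explicitly marked unresolved.
    unresolved = payload.get("unresolved_fields") or []
    start_ok = bool(payload.get("start")) or "start" in unresolved
    return has_title and start_ok
```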
Similar memo recall
We should care about whether the right memo came back with a usable rationale, not whether the ranking object was named exactly the way we expected.
Context packet builder
We should reward the useful compression itself, not one single exact wrapper object. That means checking for:
- active tasks
- decisions
- open questions
- relevant memo references
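A wrapper-agnostic check for those four sections might look like this. It is a sketch under assumptions: the section keys (`active_tasks`, `decisions`, `open_questions`, `memo_references`) are hypothetical, and the search only looks at the top level and one level down.

```python
# Assumed key names for the four sections the packet should contain.
REQUIRED_SECTIONS = ("active_tasks", "decisions", "open_questions", "memo_references")

def find_sections(output: dict) -> dict:
    """Collect required sections from the top level or from any wrapper dict."""
    # Candidates: the output itself, then any dict nested one level down.
    candidates = [output] + [v for v in output.values() if isinstance(v, dict)]
    found = {}
    for name in REQUIRED_SECTIONS:
        for candidate in candidates:
            # An empty list still counts as present: "no open questions" is valid.
            if isinstance(candidate.get(name), list):
                found[name] = candidate[name]
                break
    return found

def context_packet_score(output: dict) -> float:
    """Fraction of required sections found, regardless of wrapper name."""
    return len(find_sections(output)) / len(REQUIRED_SECTIONS)
```

Because the wrapper key is never inspected, a packet named `context_packet`, `summary`, or nothing at all scores the same as long as the sections are there.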
The calibration rule
Once we had the structure, we needed a discipline for checking whether it was working.
That rule became:
a strong mainstream model should look obviously good on the core eval.
Not perfect. Not magical. Just clearly good.
This is why the first gpt-4.1 V2 slice mattered so much.
On a three-card V2 slice it posted:
{
  "pass_rate": 1.0,
  "average_score": 0.95,
  "task_score": 1.0,
  "usable_score": 1.0,
  "contract_score": 0.6667
}
That result is almost more useful than a perfect score.
It says:
- the tasks are now reasonable
- the model solved them cleanly
- the output was fully usable
- and exact contract is still something separate we can improve
That is exactly the kind of signal we wanted.
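The reported numbers fit together under one simple reading: two of the three cards met the exact contract (2/3 ≈ 0.6667), and the per-card score is a weighted sum in which contract carries a weight of 0.15. The sketch below reproduces the summary under that reading. The weights, the binary per-card contract score, and the 0.8 pass threshold are assumptions inferred from the arithmetic, not values taken from grader.py.

```python
# Hypothetical weights. Only the 0.15 contract weight is implied by the
# reported numbers, and only if the score is a linear weighted sum; the
# split between task and usable is arbitrary here.
WEIGHTS = {"task": 0.50, "usable": 0.35, "contract": 0.15}

def card_score(task: float, usable: float, contract: float) -> float:
    """Weighted per-card score across the three dimensions."""
    return (
        WEIGHTS["task"] * task
        + WEIGHTS["usable"] * usable
        + WEIGHTS["contract"] * contract
    )

def summarize(cards: list, pass_threshold: float = 0.8) -> dict:
    """Aggregate per-card dimension scores into a run summary."""
    n = len(cards)
    scores = [card_score(c["task"], c["usable"], c["contract"]) for c in cards]
    return {
        "pass_rate": round(sum(s >= pass_threshold for s in scores) / n, 4),
        "average_score": round(sum(scores) / n, 4),
        "task_score": round(sum(c["task"] for c in cards) / n, 4),
        "usable_score": round(sum(c["usable"] for c in cards) / n, 4),
        "contract_score": round(sum(c["contract"] for c in cards) / n, 4),
    }

# Three cards: all solved and usable, one of them missing the exact contract.
cards = [
    {"task": 1.0, "usable": 1.0, "contract": 1.0},
    {"task": 1.0, "usable": 1.0, "contract": 1.0},
    {"task": 1.0, "usable": 1.0, "contract": 0.0},
]
summary = summarize(cards)
```

Under these assumptions the card that misses the contract still scores 0.85 and passes, which is why the pass rate stays at 1.0 while the average drops to 0.95.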
What we implemented in the repo
The benchmark is not just an essay now. It has a concrete shape in the repo.
Key files:
- eval/local_intelligence/v2/PRINCIPLES.md
- eval/local_intelligence/v2/CORE_EVAL_V2_SPEC.md
- eval/local_intelligence/v2/card_manifest_v2.json
- eval/local_intelligence/v2/core_eval_v2_cards.json
- eval/local_intelligence/grader.py
- eval/local_intelligence/run_eval.py
That matters because the benchmark is no longer only a theory. We now have:
- written principles
- an explicit card set
- separate scoring dimensions
- calibration runs against real models
- a clearer story about what the benchmark is for
What I would still treat as unfinished
Even though V2 is much better, I would still call it a living benchmark.
The biggest unfinished pieces are:
- broadening the calibration set beyond one great reference score
- continuing to test weaker and mid-tier models for healthy variance
- manually auditing failures to make sure the grader and product truth still agree
- deciding which stretch tasks deserve their own benchmark instead of staying in a pile of “interesting extras”
That is healthy unfinishedness, though. It is not the same as basic confusion.
The real standard
The standard for this benchmark is simple.
A strong model should look strong. A weak model should look weaker. The score should tell us whether the problem is:
- the model
- the schema contract
- the product integration layer
- or the benchmark itself
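One way to read the separate score dimensions as a diagnosis is sketched below. The mapping is an interpretation, not something the repo is known to implement: it assumes a low task score on a strong reference model points at the benchmark, a low usable score points at the integration layer, and a low contract score alone points at the schema contract. The function name `triage` and the 0.8 threshold are hypothetical.

```python
def triage(task: float, usable: float, contract: float,
           is_strong_reference: bool, threshold: float = 0.8) -> str:
    """Map score dimensions to the most likely source of the problem."""
    if task < threshold:
        # Calibration rule: a strong reference model should look good, so a
        # failure here is more likely a benchmark problem than a model problem.
        return "benchmark" if is_strong_reference else "model"
    if usable < threshold:
        # Task solved but output not usable downstream (assumed mapping).
        return "product integration layer"
    if contract < threshold:
        # Solved and usable, but the exact schema was not followed.
        return "schema contract"
    return "no problem"
```

Applied to the slice scores quoted earlier (task 1.0, usable 1.0, contract 0.6667), this returns "schema contract".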
That is what core_eval_v2 is trying to do.
Not settle the whole question of intelligence. Just tell the truth about a small workflow product.
And honestly, that is already hard enough.