Why this post exists
The previous eval post was the reset.
That post was about admitting the first benchmark had drifted. We had built a solid harness around a shaky question. A strong mainstream model did not score clearly well on tiny workflow tasks, and that meant the benchmark itself was still wrong in an important way.
What that post did not do was walk through the construction of the replacement. It argued for a better eval, but it did not show how we chose the tasks, how we structured the scoring, or what rules we used to keep the new benchmark honest.
This post is that missing piece.
It is the practical design document behind core_eval_v2, written in plain English instead of benchmark shorthand.
The problem we were actually trying to solve
The product question is much smaller than a generic AI benchmark usually assumes.
We are not asking whether a model can:
- reason across arbitrary domains
- act autonomously for a long time
- manage a whole tool graph
- behave like a full personal agent
We are asking whether it can take a short spoken note or transcript and turn it into something small, useful, and structured.
That means the real product loop looks more like this:
- capture the note
- extract what matters
- normalize it into a usable shape
- connect it to relevant context when needed
- avoid hallucinating or overreaching
That is the loop the eval needs to measure.
Once we wrote it that plainly, a lot of the confusion disappeared.
What was wrong with the first benchmark
The first benchmark mixed together three different questions:
- Did the model solve the task?
- Could the product use the output?
- Did it follow our exact favorite schema?
Those are not the same question.
That sounds obvious in hindsight, but mixing them created most of the brittleness. A model could produce the right summary with the wrong field name. It could return a useful action list in a top-level array instead of a wrapped object. It could represent a scheduling intent correctly but in a different structured packet than the one we preferred.
In all of those cases, the benchmark was too eager to say “fail.”
So the first design rule for V2 became:
solve the task first, then evaluate usability, then evaluate exact contract
That sounds small, but it changes the whole benchmark: a schema mismatch no longer erases credit for a task the model actually solved.
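One way to picture the rule is as three separate checks that run in order, each one reported on its own. The sketch below is illustrative only and is not the actual core_eval_v2 harness: LayeredResult, grade, the verdict labels, and the three example checkers are all hypothetical names, and the action-item data is made up.

```python
from dataclasses import dataclass
from typing import Any, Callable

Check = Callable[[Any], bool]


@dataclass
class LayeredResult:
    task_solved: bool      # did the model solve the task?
    usable: bool           # could the product use the output?
    exact_contract: bool   # does it match the preferred schema exactly?

    @property
    def verdict(self) -> str:
        # Report the first layer that failed, so a schema miss
        # is never confused with a task miss.
        if not self.task_solved:
            return "fail"
        if not self.usable:
            return "solved_but_unusable"
        if not self.exact_contract:
            return "usable_off_contract"
        return "pass"


def grade(output: Any, solved: Check, usable: Check, exact: Check) -> LayeredResult:
    # Later layers only count if the earlier ones passed.
    task_solved = solved(output)
    is_usable = task_solved and usable(output)
    is_exact = is_usable and exact(output)
    return LayeredResult(task_solved, is_usable, is_exact)


# Hypothetical card: extract two action items from a memo.
EXPECTED = {"email Dana", "book room"}


def _items(output: Any) -> list:
    # Accept a top-level array or a wrapped object.
    if isinstance(output, list):
        return output
    if isinstance(output, dict):
        return output.get("action_items", [])
    return []


def solved(output: Any) -> bool:
    return set(_items(output)) == EXPECTED


def usable(output: Any) -> bool:
    return all(isinstance(item, str) and item for item in _items(output))


def exact(output: Any) -> bool:
    return isinstance(output, dict) and set(output) == {"action_items"}


result = grade(["email Dana", "book room"], solved, usable, exact)
print(result.verdict)  # usable_off_contract
```

The top-level array here is the same case described earlier: the task is solved and the product could use the output, so the only thing recorded against it is the contract miss.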
The design constraints for V2
We wrote down a small set of principles and then forced the benchmark to obey them.
1. Product truth first
If a task is clever but not central to the actual product loop, it does not belong in the core eval.
That immediately demoted a bunch of things that were interesting but too architectural:
- local agent loops
- momentum scoring
- voice operating layer behavior
- live meeting state tracking
- knowledge graph maintenance
Those tasks may still matter eventually. They just should not decide whether the benchmark is sane.
2. Strong models should look strong
This became our sanity rule.
If a strong mainstream model cannot post a clearly good score on the core set, then one of these is true:
- the task is underspecified
- the grading is brittle
- the schema contract is too narrow
- the benchmark is simply wrong
That rule turned out to be incredibly useful because it is hard to talk yourself out of.
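The sanity rule can be written down as a check that runs before anyone reads a leaderboard. This is a minimal sketch under stated assumptions: the function name, the card ids, the scores, and the 0.8 floor are all hypothetical, since the post does not define what counts as "clearly good."

```python
def cards_failing_sanity(reference_scores: dict[str, float], floor: float = 0.8) -> list[str]:
    """Return the cards where a strong reference model scored below the floor.

    A non-empty result points at the card or its grader, not at the model.
    The 0.8 default is an assumed threshold, not a value from core_eval_v2.
    """
    return sorted(card for card, score in reference_scores.items() if score < floor)


# Made-up per-card scores for a strong mainstream model.
reference_scores = {
    "title_memo": 0.95,
    "redact_transcript": 0.90,
    "extract_action_items": 0.55,
}

suspect_cards = cards_failing_sanity(reference_scores)
print(suspect_cards)  # ['extract_action_items']
```

Any card this returns gets reviewed for underspecification, brittle grading, or a narrow contract before its score is trusted.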
3. Small, auditable, explainable
Every core card should be easy to explain to a non-specialist.
Good core cards are things like:
- title this memo
- redact this transcript
- extract the action items
- ask a clarifying question if the reminder is ambiguous
If a card requires three paragraphs of architecture context before it makes sense, it probably does not belong in the core eval.
4. Usability matters more than one exact JSON shape
The product can normalize a lot.
It can handle:
- aliases
- top-level arrays
- equivalent field names
- alternate but still safe representations
The benchmark should not punish those differences as harshly as real task mistakes.
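A small normalizer shows what "the product can normalize a lot" might look like in practice. This is a sketch, not the product's real normalization layer: the alias table, the action_items key, and the normalize function are assumptions made for illustration.

```python
from typing import Any

# Hypothetical alias table: alternate field names mapped to the preferred one.
FIELD_ALIASES = {
    "tasks": "action_items",
    "todos": "action_items",
    "actions": "action_items",
}


def normalize(output: Any, canonical_key: str = "action_items") -> dict[str, Any]:
    """Coerce equivalent output shapes into one wrapped object."""
    if isinstance(output, list):
        # A top-level array is treated as the wrapped list.
        return {canonical_key: output}
    if not isinstance(output, dict):
        raise ValueError("output is neither a list nor an object")
    # Rename known aliases and leave every other field untouched.
    return {FIELD_ALIASES.get(key, key): value for key, value in output.items()}


print(normalize(["email Dana"]))             # {'action_items': ['email Dana']}
print(normalize({"todos": ["email Dana"]}))  # {'action_items': ['email Dana']}
```

Grading the normalized form means an alias or a bare array costs at most a contract point, while a missing or wrong action item still counts as a real task mistake.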
5. Keep the core benchmark hard to game
A small hand-authored eval can become a trap. If we optimize models too hard against the benchmark before the benchmark is stable, we risk teaching them the shape of the answer instead of the job itself.
So the core set had to stay:
- small enough to audit
- broad enough not to collapse into template memorization
- stable enough that a good score would mean something
How we structured the benchmark
The cleanest move was to split the benchmark into layers.
Core Eval v2
This is the benchmark that is supposed to answer the product question:
Can a model do the small workflow task clearly and usefully?
It is deliberately small.
The current core set has ten cards.
Stretch / draft benchmark
This is where we keep the broader, more architectural, more subjective tasks. That older set is still useful, but it no longer gets to define sanity.
This split matters because it keeps us from mixing:
- product-real tasks
- internal architecture probes
- research-y ambition
into one misleading blended score.
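In reporting terms, the split means each layer gets its own number and nothing averages across them. The sketch below assumes a hypothetical result format of (layer, passed) pairs; the layer names and the pass/fail data are made up.

```python
from collections import defaultdict
from typing import Iterable


def scores_by_layer(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return the pass rate for each layer separately; never a blended score."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for layer, passed in results:
        totals[layer][0] += int(passed)  # cards passed in this layer
        totals[layer][1] += 1            # cards seen in this layer
    return {layer: passed / seen for layer, (passed, seen) in totals.items()}


# Made-up run: ten core cards and four stretch cards.
results = (
    [("core", True)] * 9
    + [("core", False)]
    + [("stretch", True)]
    + [("stretch", False)] * 3
)

print(scores_by_layer(results))  # {'core': 0.9, 'stretch': 0.25}
```

A single blended number over these fourteen cards would be about 0.71, which describes neither the core result nor the stretch result.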