Part 1

Why this post exists

The previous eval post was the reset.

That post was about admitting the first benchmark had drifted. We had built a solid harness around a shaky question. A strong mainstream model was not scoring clearly well on tiny workflow tasks, and that meant the benchmark was still wrong in an important way.

What that post did not do was walk through the construction of the replacement. It argued for a better eval, but it did not show how we chose the tasks, how we structured the scoring, or what rules we used to keep the new benchmark honest.

This post is that missing piece.

It is the practical design document behind core_eval_v2, written in plain English instead of benchmark shorthand.

The problem we were actually trying to solve

The product question is much smaller than a generic AI benchmark usually assumes.

We are not asking whether a model can:

  • reason across arbitrary domains
  • act autonomously for a long time
  • manage a whole tool graph
  • behave like a full personal agent

We are asking whether it can take a short spoken note or transcript and turn it into something small, useful, and structured.

That means the real product loop looks more like this:

  1. capture the note
  2. extract what matters
  3. normalize it into a usable shape
  4. connect it to relevant context when needed
  5. avoid hallucinating or overreaching

That is the loop the eval needs to measure.

Once we wrote it that plainly, a lot of the confusion disappeared.

What was wrong with the first benchmark

The first benchmark mixed together three different questions:

  1. Did the model solve the task?
  2. Could the product use the output?
  3. Did it follow our exact favorite schema?

Those are not the same question.

That sounds obvious in hindsight, but conflating them created most of the brittleness. A model could produce the right summary with the wrong field name. It could return a useful action list in a top-level array instead of a wrapped object. It could represent a scheduling intent correctly but in a different structured packet than the one we preferred.

In all of those cases, the benchmark was too eager to say “fail.”

So the first design rule for V2 became:

solve the task first, then evaluate usability, then evaluate exact contract

That sounds small, but it changes the whole benchmark: a schema mismatch no longer gets recorded as a task failure.
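To make the ordering concrete, here is a minimal sketch of layered grading in Python. It is an illustration, not the actual harness: the names LayeredResult, grade_layered, and extract_items, the "action_items" field, and the choice to only credit a later layer when the earlier one passed are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class LayeredResult:
    task_solved: bool     # did the model do the job?
    usable: bool          # could the product consume the output?
    exact_contract: bool  # does it match the preferred schema exactly?


def extract_items(output: Any) -> Optional[list]:
    """Hypothetical helper: accept a bare list or a wrapped object."""
    if isinstance(output, list):
        return output
    if isinstance(output, dict) and isinstance(output.get("action_items"), list):
        return output["action_items"]
    return None


def grade_layered(
    output: Any,
    solves_task: Callable[[Any], bool],
    is_usable: Callable[[Any], bool],
    matches_contract: Callable[[Any], bool],
) -> LayeredResult:
    # Assumption: each layer is only credited if the previous one passed.
    solved = solves_task(output)
    usable = solved and is_usable(output)
    exact = usable and matches_contract(output)
    return LayeredResult(solved, usable, exact)


# A correct action list returned as a top-level array instead of a wrapped object.
expected = {"email Dana", "book the room"}
result = grade_layered(
    ["email Dana", "book the room"],
    solves_task=lambda o: set(extract_items(o) or []) == expected,
    is_usable=lambda o: extract_items(o) is not None,
    matches_contract=lambda o: isinstance(o, dict) and "action_items" in o,
)
```

In this sketch the top-level array passes the first two layers and misses only the exact-contract layer, so the schema difference shows up as its own signal instead of a blanket failure.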

The design constraints for V2

We wrote down a small set of principles and then forced the benchmark to obey them.

1. Product truth first

If a task is clever but not central to the actual product loop, it does not belong in the core eval.

That immediately demoted a bunch of things that were interesting but too architectural:

  • local agent loops
  • momentum scoring
  • voice operating layer behavior
  • live meeting state tracking
  • knowledge graph maintenance

Those tasks may still matter eventually. They just should not decide whether the benchmark is sane.

2. Strong models should look strong

This became our sanity rule.

If a strong mainstream model cannot post a clearly good score on the core set, then one of these is true:

  • the task is underspecified
  • the grading is brittle
  • the schema contract is too narrow
  • or the benchmark is simply wrong

That rule turned out to be incredibly useful because it is hard to talk yourself out of.
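One way to turn the sanity rule into an automatic check is sketched below. This is a hypothetical helper, not part of the real harness: the function name sanity_check, the card ids, and especially the 0.8 threshold are assumptions, since "clearly good" is not pinned to a number here.

```python
from typing import Dict, List


def sanity_check(reference_results: Dict[str, bool], min_pass_rate: float = 0.8) -> dict:
    """Flag the benchmark, not the model, when a strong reference model scores poorly.

    reference_results maps a card id to whether the reference model passed it.
    min_pass_rate is an assumed threshold for "clearly good".
    """
    if not reference_results:
        raise ValueError("need at least one card result")
    failed: List[str] = sorted(card for card, ok in reference_results.items() if not ok)
    passed_count = len(reference_results) - len(failed)
    pass_rate = passed_count / len(reference_results)
    return {
        "pass_rate": pass_rate,
        "benchmark_suspect": pass_rate < min_pass_rate,
        # Each failed card gets audited for an underspecified task,
        # brittle grading, or a schema contract that is too narrow.
        "cards_to_audit": failed,
    }


# Hypothetical results for a strong reference model on four cards.
report = sanity_check(
    {"title_memo": True, "redact_transcript": False, "action_items": True, "clarify_reminder": False}
)
```

The point of the sketch is the direction of blame: a low reference score produces a list of cards to audit, not a verdict on the model.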

3. Small, auditable, explainable

Every core card should be easy to explain to a non-specialist.

Good core cards are things like:

  • title this memo
  • redact this transcript
  • extract the action items
  • ask a clarifying question if the reminder is ambiguous

If a card requires three paragraphs of architecture context before it makes sense, it probably does not belong in the core eval.

4. Usability matters more than one exact JSON shape

The product can normalize a lot.

It can handle:

  • aliases
  • top-level arrays
  • equivalent field names
  • alternate but still safe representations

The benchmark should not punish those differences as harshly as real task mistakes.
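As a rough illustration of what "the product can normalize a lot" might mean in code, here is a small normalization sketch. The alias table, the field names, and the normalize function are all invented for this example and are not the product's actual normalizer.

```python
from typing import Any, Dict

# Hypothetical alias table: equivalent field names mapped to one canonical name.
FIELD_ALIASES: Dict[str, str] = {
    "actions": "action_items",
    "todos": "action_items",
    "tasks": "action_items",
    "heading": "title",
}


def normalize(output: Any, list_field: str = "action_items") -> Dict[str, Any]:
    """Map a model output onto canonical field names before grading."""
    # A top-level array is wrapped into the object shape the product expects.
    if isinstance(output, list):
        return {list_field: output}
    if isinstance(output, dict):
        normalized: Dict[str, Any] = {}
        for key, value in output.items():
            canonical = FIELD_ALIASES.get(key, key)
            # If two keys map to the same canonical name, the first one wins.
            normalized.setdefault(canonical, value)
        return normalized
    raise TypeError(f"cannot normalize output of type {type(output).__name__}")


from_array = normalize(["email Dana"])
from_alias = normalize({"todos": ["email Dana"], "heading": "Weekly sync"})
```

Grading after a step like this means a bare array or an aliased field name costs a model at most an exact-contract point, while a wrong or missing action item still counts as a real mistake.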

5. Keep the core benchmark hard to game

A small hand-authored eval can become a trap. If we optimize models too hard against the benchmark before the benchmark is stable, we risk teaching them the shape of the answer instead of the job itself.

So the core set had to stay:

  • small enough to audit
  • broad enough not to collapse into template memorization
  • stable enough that a good score would mean something

How we structured the benchmark

The cleanest move was to split the benchmark into layers.

Core Eval v2

This is the benchmark that is supposed to answer the product question:

Can a model do the small workflow task clearly and usefully?

It is deliberately small.

The current core set has ten cards.

Stretch / draft benchmark

This is where we keep the broader, more architectural, more subjective tasks. That older set is still useful, but it no longer gets to define sanity.

This split matters because it keeps us from mixing:

  • product-real tasks
  • internal architecture probes
  • research-y ambition

into one misleading blended score.
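A minimal sketch of reporting the layers separately could look like this. The layer labels "core" and "stretch" follow the split described above; the function name score_by_layer and the (layer, passed) input shape are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def score_by_layer(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return one pass rate per layer instead of a single blended score."""
    totals = defaultdict(lambda: [0, 0])  # layer -> [passed, total]
    for layer, passed in results:
        totals[layer][1] += 1
        if passed:
            totals[layer][0] += 1
    return {layer: passed / total for layer, (passed, total) in totals.items()}


# Hypothetical results: the core cards pass, the stretch cards are mixed.
scores = score_by_layer(
    [("core", True), ("core", True), ("stretch", False), ("stretch", True)]
)
```

A blended score over these four results would read 0.75 and hide the fact that every core card passed.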