Building Core Eval v2 (Part 2)

The actual core cards

We grouped the core set into three buckets.

Capture

These cards ask whether the model can cleanly turn a note into a more useful artifact.

Card	Why it belongs
`memo-auto-title`	Tiny, useful, easy to judge
`memo-type-detection`	Useful routing signal
`transcript-cleanup-presets`	Real post-capture transformation
`private-redaction-pass`	High-value and safety-sensitive

Action

These cards ask whether the model can move from note content to the next usable step.

Card	Why it belongs
`action-item-extraction`	Very close to what people actually want
`reminder-normalization`	Good ambiguity-handling test
`calendar-intent-detection`	Common assistant task, but with looser acceptable shapes
`follow-up-question-generator`	Tests whether the model asks instead of inventing

Context

These cards ask whether the model can retrieve or compress the right surrounding context.

Card	Why it belongs
`similar-memo-recall`	Concrete memory retrieval task
`context-packet-builder`	Good proxy for useful compression

That gives us ten cards that are practical, legible, and reasonably narrow.
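As a rough illustration, the core set could be written down as a small registry keyed by bucket. The bucket and card names come from the tables above; the dictionary layout and the `bucket_of` helper are assumptions made for this sketch, not the benchmark's actual code.

```python
# Hypothetical registry of the core set. Names are taken from the tables
# above; the data layout itself is an assumption for illustration.
CORE_CARDS: dict[str, list[str]] = {
    "capture": [
        "memo-auto-title",
        "memo-type-detection",
        "transcript-cleanup-presets",
        "private-redaction-pass",
    ],
    "action": [
        "action-item-extraction",
        "reminder-normalization",
        "calendar-intent-detection",
        "follow-up-question-generator",
    ],
    "context": [
        "similar-memo-recall",
        "context-packet-builder",
    ],
}


def bucket_of(card: str) -> str:
    """Return the bucket a core card belongs to, or raise if it is not core."""
    for bucket, cards in CORE_CARDS.items():
        if card in cards:
            return bucket
    raise KeyError(f"{card!r} is not in the core set")


# Total number of core cards across all three buckets.
total_cards = sum(len(cards) for cards in CORE_CARDS.values())
```

Keeping the list explicit like this makes it obvious when a new card is being added to the core set rather than to a stretch suite.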

What we explicitly moved out

This was just as important as what we kept.

The easiest way to ruin a benchmark is to let every interesting task into the core set.

So we moved out things like:

project clustering
daily brief generation
contradiction drift detection
model routing
writing style memory
checklist dependency logic
local agent loop behavior
voice OS command layer behavior

Some of those are still valuable. Some may come back in a future stretch suite. But they are not part of the benchmark we use to answer the simple question:

Does this model seem natively capable of our small workflow tasks?

The scoring model

This is the piece that made the new benchmark feel sane again.

Instead of forcing every card into one pass/fail bucket, v2 scores three separate dimensions.

1. Task score

Did the model do the actual job?

Examples:

the title is specific and not generic
the redaction removes the sensitive information
the right action item was extracted
the retrieved memo is actually relevant

2. Usability score

Could the product use this output with light normalization?

Examples:

the model used `matches` instead of `top_matches`
it returned a top-level list instead of a wrapped object
it used a different but still safe date packet
it represented ranking or rationale differently but still usefully
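To make "light normalization" concrete, here is a minimal sketch for the first two examples. It assumes the preferred shape is an object with a `top_matches` list. The `top_matches` and `matches` names come from the example above, while the wrapper shape and the `normalize_matches` function are hypothetical.

```python
from typing import Any


def normalize_matches(raw: Any) -> dict[str, list]:
    """Coerce a few tolerated output shapes into {"top_matches": [...]}.

    Assumed preferred shape: a dict with a single "top_matches" list.
    """
    # A bare top-level list is wrapped in the preferred object.
    if isinstance(raw, list):
        return {"top_matches": raw}
    if isinstance(raw, dict):
        # Already in the preferred shape.
        if isinstance(raw.get("top_matches"), list):
            return {"top_matches": raw["top_matches"]}
        # Tolerated alternative field name.
        if isinstance(raw.get("matches"), list):
            return {"top_matches": raw["matches"]}
    # Anything else is beyond light normalization.
    raise ValueError("output shape not recoverable by light normalization")
```

Under this sketch, an output earns usability credit if the function returns without raising, even when the raw shape was not the preferred one.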

3. Contract score

Did it match our exact preferred schema?

Examples:

preferred field names
preferred object nesting
preferred packet wrapper
preferred canonical representation

The contract score is still useful. It just does not get to dominate the whole result.
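By contrast with a usability check, a contract check is deliberately strict. The sketch below assumes, purely for illustration, that the preferred schema for a retrieval card is an object with exactly one `top_matches` list. The real preferred schema is not spelled out here, so treat both the shape and the `matches_contract` name as hypothetical.

```python
from typing import Any

# Field name taken from the usability example; the wrapper shape is assumed.
PREFERRED_KEY = "top_matches"


def matches_contract(raw: Any) -> bool:
    """Return True only for the exact assumed shape: {"top_matches": [...]}."""
    return (
        isinstance(raw, dict)
        and set(raw) == {PREFERRED_KEY}
        and isinstance(raw[PREFERRED_KEY], list)
    )
```

An output such as `{"matches": [...]}` fails this check while still being perfectly usable, which is exactly the case the separate usability score is meant to capture.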

That distinction gives us much more truthful interpretations.

A model can now be:

high task
high usability
medium contract

and that tells us something practical:

the model is probably good enough, but the product integration layer still needs normalization work.

That is a much better insight than a flat “fail.”
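One way to picture the three-dimension result is as a small record with an interpretation step. This is a sketch under stated assumptions: the 0.0 to 1.0 scale, the `HIGH` and `MEDIUM` cutoffs, and the `CardScore` and `interpret` names are all hypothetical and not taken from the benchmark itself.

```python
from dataclasses import dataclass

# Hypothetical cutoffs on an assumed 0.0-1.0 scale.
HIGH, MEDIUM = 0.8, 0.5


@dataclass(frozen=True)
class CardScore:
    task: float       # did the model do the actual job?
    usability: float  # usable with light normalization?
    contract: float   # exact match to the preferred schema?


def level(value: float) -> str:
    """Map a numeric score onto high / medium / low."""
    if value >= HIGH:
        return "high"
    if value >= MEDIUM:
        return "medium"
    return "low"


def interpret(score: CardScore) -> str:
    """Turn three separate scores into one practical reading."""
    if level(score.task) != "high":
        return "model is not doing the job yet"
    if level(score.usability) != "high":
        return "model does the job, but the output is hard to use"
    if level(score.contract) != "high":
        return "model is probably good enough; integration layer needs normalization work"
    return "model is good enough and matches the preferred schema"
```

The ordering of the checks reflects the priority described above: task first, usability second, and contract last, so a schema mismatch alone never turns into a flat failure.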