The actual core cards
We grouped the core set into three buckets.
Capture
These cards ask whether the model can cleanly turn a note into a more useful artifact.
| Card | Why it belongs |
|---|---|
| memo-auto-title | Tiny, useful, easy to judge |
| memo-type-detection | Useful routing signal |
| transcript-cleanup-presets | Real post-capture transformation |
| private-redaction-pass | High-value and safety-sensitive |
Action
These cards ask whether the model can move from note content to the next usable step.
| Card | Why it belongs |
|---|---|
| action-item-extraction | Very close to what people actually want |
| reminder-normalization | Good ambiguity-handling test |
| calendar-intent-detection | Common assistant task, but with looser acceptable shapes |
| follow-up-question-generator | Tests whether the model asks instead of inventing |
Context
These cards ask whether the model can retrieve or compress the right surrounding context.
| Card | Why it belongs |
|---|---|
| similar-memo-recall | Concrete memory retrieval task |
| context-packet-builder | Good proxy for useful compression |
That gives us ten cards that are practical, legible, and reasonably narrow.
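As a rough illustration, the three buckets could be written down as a small registry. The card names come from the tables above; the dictionary layout and the `bucket_of` helper are assumptions made for this sketch, not the benchmark's actual format.

```python
from typing import Optional

# Card IDs are taken from the tables above; the structure itself is hypothetical.
CORE_CARDS = {
    "capture": [
        "memo-auto-title",
        "memo-type-detection",
        "transcript-cleanup-presets",
        "private-redaction-pass",
    ],
    "action": [
        "action-item-extraction",
        "reminder-normalization",
        "calendar-intent-detection",
        "follow-up-question-generator",
    ],
    "context": [
        "similar-memo-recall",
        "context-packet-builder",
    ],
}


def bucket_of(card_id: str) -> Optional[str]:
    """Return the bucket a card belongs to, or None if it is not a core card."""
    for bucket, cards in CORE_CARDS.items():
        if card_id in cards:
            return bucket
    return None


# Total number of cards across all buckets.
core_card_count = sum(len(cards) for cards in CORE_CARDS.values())
```

Keeping the list explicit like this makes it obvious when a new card is being added to the core set rather than to a stretch suite.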
What we explicitly moved out
This was just as important as what we kept.
The easiest way to ruin a benchmark is to let every interesting task into the core set.
So we moved out things like:
- project clustering
- daily brief generation
- contradiction drift detection
- model routing
- writing style memory
- checklist dependency logic
- local agent loop behavior
- voice OS command layer behavior
Some of those are still valuable. Some may come back in a future stretch suite. But they are not part of the benchmark we use to answer the simple question:
does this model seem natively capable of our small workflow tasks?
The scoring model
This is the piece that made the new benchmark feel sane again.
Instead of forcing every card into one pass/fail bucket, V2 scores three separate dimensions.
1. Task score
Did the model do the actual job?
Examples:
- the title is specific and not generic
- the redaction removes the sensitive information
- the right action item was extracted
- the retrieved memo is actually relevant
2. Usability score
Could the product use this output with light normalization?
Examples:
- the model used `matches` instead of `top_matches`
- it returned a top-level list instead of a wrapped object
- it used a different but still safe date packet
- it represented ranking or rationale differently but still usefully
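To make "light normalization" concrete, here is a minimal sketch of what such a step might look like for the first two examples above. Only the `matches` and `top_matches` field names come from this post; the function name, the alias list, and the return shape are assumptions for illustration.

```python
from typing import Any, List, Optional, Tuple

PREFERRED_KEY = "top_matches"     # the preferred field name
ACCEPTED_ALIASES = ("matches",)   # near-miss names we are willing to accept


def normalize_matches_output(raw: Any) -> Tuple[Optional[dict], List[str]]:
    """Coerce a model output into the preferred shape.

    Returns (normalized, fixes). `normalized` is None when the output cannot
    be used even with light normalization. `fixes` lists what was changed,
    so an empty list means the output already matched the preferred shape.
    """
    fixes: List[str] = []

    # Case 1: a bare top-level list instead of a wrapped object.
    if isinstance(raw, list):
        fixes.append("wrapped top-level list")
        return {PREFERRED_KEY: raw}, fixes

    if isinstance(raw, dict):
        # Case 2: already in the preferred shape.
        if PREFERRED_KEY in raw:
            return {PREFERRED_KEY: raw[PREFERRED_KEY]}, fixes
        # Case 3: a known alias for the preferred key.
        for alias in ACCEPTED_ALIASES:
            if alias in raw:
                fixes.append(f"renamed {alias!r} to {PREFERRED_KEY!r}")
                return {PREFERRED_KEY: raw[alias]}, fixes

    # Anything else is not usable without real repair work.
    return None, fixes
```

Under this sketch, an output that normalizes successfully would count toward usability, while a non-empty `fixes` list is the signal that it missed the exact contract.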
3. Contract score
Did it match our exact preferred schema?
Examples:
- preferred field names
- preferred object nesting
- preferred packet wrapper
- preferred canonical representation
The contract score is still useful. It just does not get to dominate the whole result.
That distinction gives us much more truthful interpretations.
A model can now be:
- high task
- high usability
- medium contract
and that tells us something practical:
the model is probably good enough, but the product integration layer still needs normalization work.
That is a much better insight than a flat “fail.”
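One way to picture the three-dimension result is as a small record plus an interpretation step. This is a sketch only: the `CardScore` type, the 0-to-1 scale, and the `HIGH` threshold are assumptions, not the benchmark's real scoring code.

```python
from dataclasses import dataclass

HIGH = 0.8  # assumed cutoff for a "high" score on a 0..1 scale


@dataclass(frozen=True)
class CardScore:
    task: float       # did the model do the actual job?
    usability: float  # could the product use the output with light normalization?
    contract: float   # did it match the exact preferred schema?


def interpret(score: CardScore) -> str:
    """Turn three separate scores into a practical reading instead of pass/fail."""
    if score.task < HIGH:
        return "model did not reliably do the job"
    if score.usability < HIGH:
        return "job done, but the output is hard for the product to use"
    if score.contract < HIGH:
        return "model is probably good enough; integration layer needs normalization work"
    return "model is good and matches the preferred schema"


# High task, high usability, medium contract: the case described above.
reading = interpret(CardScore(task=0.9, usability=0.9, contract=0.5))
```

The checks run in order of importance, so a weak contract score can only downgrade the message, never hide a model that actually did the task well.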