Part 2

The actual core cards

We grouped the core set into three buckets.

Capture

These cards ask whether the model can cleanly turn a note into a more useful artifact.

| Card | Why it belongs |
| --- | --- |
| memo-auto-title | Tiny, useful, easy to judge |
| memo-type-detection | Useful routing signal |
| transcript-cleanup-presets | Real post-capture transformation |
| private-redaction-pass | High-value and safety-sensitive |

Action

These cards ask whether the model can move from note content to the next usable step.

| Card | Why it belongs |
| --- | --- |
| action-item-extraction | Very close to what people actually want |
| reminder-normalization | Good ambiguity-handling test |
| calendar-intent-detection | Common assistant task, but with looser acceptable shapes |
| follow-up-question-generator | Tests whether the model asks instead of inventing |

Context

These cards ask whether the model can retrieve or compress the right surrounding context.

| Card | Why it belongs |
| --- | --- |
| similar-memo-recall | Concrete memory retrieval task |
| context-packet-builder | Good proxy for useful compression |

That gives us ten cards that are practical, legible, and reasonably narrow.
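
To make the grouping concrete, here is a minimal sketch of the core set as a plain lookup structure. The card names come from the three buckets above; the dictionary layout and the helper names `bucket_of` and `core_card_count` are assumptions made for illustration, not the benchmark's actual code.

```python
# Hypothetical registry of the core set. Card names are taken from the
# three buckets described in the text; the layout itself is an assumption.
CORE_CARDS = {
    "capture": [
        "memo-auto-title",
        "memo-type-detection",
        "transcript-cleanup-presets",
        "private-redaction-pass",
    ],
    "action": [
        "action-item-extraction",
        "reminder-normalization",
        "calendar-intent-detection",
        "follow-up-question-generator",
    ],
    "context": [
        "similar-memo-recall",
        "context-packet-builder",
    ],
}


def bucket_of(card: str) -> str:
    """Return the bucket a core card belongs to, or raise KeyError."""
    for bucket, cards in CORE_CARDS.items():
        if card in cards:
            return bucket
    raise KeyError(f"{card!r} is not in the core set")


def core_card_count() -> int:
    """Total number of cards across all buckets."""
    return sum(len(cards) for cards in CORE_CARDS.values())
```

A lookup that raises on unknown names is one way to keep the core set closed: a card that is not listed cannot be scored as a core card by accident.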

What we explicitly moved out

This was just as important as what we kept.

The easiest way to ruin a benchmark is to let every interesting task into the core set.

So we moved out things like:

  • project clustering
  • daily brief generation
  • contradiction drift detection
  • model routing
  • writing style memory
  • checklist dependency logic
  • local agent loop behavior
  • voice OS command layer behavior

Some of those are still valuable. Some may come back in a future stretch suite. But they are not part of the benchmark we use to answer the simple question:

Does this model seem natively capable of handling our small workflow tasks?

The scoring model

This is the piece that made the new benchmark feel sane again.

Instead of forcing every card into one pass/fail bucket, V2 scores three separate dimensions.

1. Task score

Did the model do the actual job?

Examples:

  • the title is specific and not generic
  • the redaction removes the sensitive information
  • the right action item was extracted
  • the retrieved memo is actually relevant

2. Usability score

Could the product use this output with light normalization?

Examples:

  • the model used matches instead of top_matches
  • it returned a top-level list instead of a wrapped object
  • it used a different but still safe date packet
  • it represented ranking or rationale differently but still usefully
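
As a rough illustration of what "light normalization" could mean for the first two examples, the sketch below accepts `matches` as an alias for `top_matches` and wraps a bare top-level list into the preferred object. The function name `normalize_recall_output` and the alias table are hypothetical, and the shape of the individual match items is left open.

```python
from typing import Any

# Hypothetical alias table: near-miss key -> preferred key.
KEY_ALIASES = {"matches": "top_matches"}


def normalize_recall_output(raw: Any) -> dict:
    """Coerce a near-miss output into the preferred wrapped shape.

    Raises ValueError when the output needs more than light normalization.
    """
    # A bare top-level list is treated as the list of matches.
    if isinstance(raw, list):
        return {"top_matches": raw}
    # A wrapped object only needs its keys renamed to the preferred names.
    if isinstance(raw, dict):
        return {KEY_ALIASES.get(key, key): value for key, value in raw.items()}
    raise ValueError("output is not usable with light normalization")
```

If an output survives a pass like this, it can earn a high usability score even though it would not match the preferred schema exactly.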

3. Contract score

Did it match our exact preferred schema?

Examples:

  • preferred field names
  • preferred object nesting
  • preferred packet wrapper
  • preferred canonical representation

The contract score is still useful. It just does not get to dominate the whole result.
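
By contrast, a contract check is deliberately strict. The sketch below assumes a preferred schema for a recall-style card: a `top_matches` wrapper whose items carry exactly the fields `memo_id` and `score`. Only `top_matches` appears in the text; the item fields and the function name `matches_contract` are invented for illustration.

```python
# Hypothetical preferred schema. Only the wrapper name comes from the text;
# the item field names are assumptions.
PREFERRED_WRAPPER = "top_matches"
PREFERRED_ITEM_FIELDS = {"memo_id", "score"}


def matches_contract(output: object) -> bool:
    """True only when the output matches the preferred schema exactly."""
    # Preferred packet wrapper: a dict with exactly one key.
    if not isinstance(output, dict) or set(output) != {PREFERRED_WRAPPER}:
        return False
    items = output[PREFERRED_WRAPPER]
    if not isinstance(items, list):
        return False
    # Preferred field names: every item has exactly the expected fields.
    return all(
        isinstance(item, dict) and set(item) == PREFERRED_ITEM_FIELDS
        for item in items
    )
```

No aliases and no unwrapping are allowed here, which is the point: near misses are absorbed by the usability score, not by loosening the contract.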

That distinction gives us much more truthful interpretations.

A model can now be:

  • high task
  • high usability
  • medium contract

and that tells us something practical:

The model is probably good enough, but the product integration layer still needs normalization work.

That is a much better insight than a flat “fail.”
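
One way to picture the three-dimensional result is a small record plus an interpretation step. This is a sketch under stated assumptions: the 0.0 to 1.0 scale, the thresholds for "high" and "medium", and the names `CardScore`, `level`, and `interpret` are all hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardScore:
    """Three separate scores for one card, each on an assumed 0.0-1.0 scale."""
    task: float       # did the model do the actual job?
    usability: float  # usable by the product with light normalization?
    contract: float   # exact match with the preferred schema?


def level(score: float, high: float = 0.8, medium: float = 0.5) -> str:
    """Map a numeric score to a coarse level; thresholds are assumptions."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


def interpret(score: CardScore) -> str:
    """Turn the three levels into a practical reading, checked in priority order."""
    if level(score.task) != "high":
        return "the model is not reliably doing the job yet"
    if level(score.usability) != "high":
        return "right answer, but the output needs more than light normalization"
    if level(score.contract) != "high":
        return "the model is probably good enough; the integration layer needs normalization work"
    return "the model meets the preferred contract as-is"
```

For example, `interpret(CardScore(task=0.9, usability=0.9, contract=0.6))` yields the "probably good enough" reading described above, where a single pass/fail bucket would have reported a failure.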