Part 1

What this eval is for

Small local models are not miniature API assistants. They are better thought of as compact semantic helpers.

Give them a short memo and a narrow question, and they can often do useful work: suggest a title, identify the user's intent, rewrite a messy note, pull out the next step, or ask the one clarifying question that would make the memo usable.

That is the job this benchmark is designed to measure.

The core question

The point of this eval is not to ask whether a tiny model can emit a perfect JSON packet.

The point is to ask whether it can do small semantic jobs that make a voice-memo app feel smarter.

  • Can it name the note?
  • Can it identify the user's intent?
  • Can it extract the next step?
  • Can it clean up a messy transcript?
  • Can it notice reminder or calendar intent?
  • Can it retrieve a related note?
  • Can it ask a useful follow-up question?

The design rule

Prefer plain-language answers unless structure is truly the product requirement.

That means this pack does not default to exact JSON, nested field contracts, tool-call packets, or agent-loop outputs.

Those things may matter elsewhere in the stack. They are just not the right default test for a tiny local model.

The card set

The first version of the pack uses nine app moments:

  1. Give this memo a useful title
  2. What kind of memo is this?
  3. What should the user do next?
  4. Rewrite this memo more clearly
  5. What matters most?
  6. Should this become a reminder?
  7. Should this become a calendar event?
  8. What follow-up question should we ask?
  9. Which old memo is most similar?

Each card is short, concrete, and close to a real product moment. That makes the pack easier to audit, and it makes it much harder for us to fool ourselves about the results.
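
One way to keep a card set like this auditable is to store it as plain data. The sketch below is illustrative only: the `Card` class, the `card_id` values, and the `build_prompt` helper are hypothetical names, not the pack's actual format. Only the nine prompt strings come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    card_id: str  # hypothetical identifier, invented for this sketch
    prompt: str   # the plain-language question shown to the model


# The nine app moments from the list above, in the same order.
CARDS = [
    Card("title", "Give this memo a useful title"),
    Card("memo_type", "What kind of memo is this?"),
    Card("next_step", "What should the user do next?"),
    Card("rewrite", "Rewrite this memo more clearly"),
    Card("key_point", "What matters most?"),
    Card("reminder", "Should this become a reminder?"),
    Card("calendar", "Should this become a calendar event?"),
    Card("follow_up", "What follow-up question should we ask?"),
    Card("similar_memo", "Which old memo is most similar?"),
]


def build_prompt(card: Card, memo: str) -> str:
    """Combine a short memo and one narrow question into a single prompt."""
    return f"Memo:\n{memo.strip()}\n\n{card.prompt}"


if __name__ == "__main__":
    print(build_prompt(CARDS[0], "call the dentist tomorrow about the crown"))
```

Keeping each card to one identifier and one question means a human can read the whole pack in a minute, which is the auditing property the pack is after.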

Part 2

How the scores work

Every card is graded on three dimensions. They are meant to answer three different questions, not one blended one.

| Dimension | What it asks | What a high score means |
|---|---|---|
| task | Did the model get the substance right? | The answer is actually useful and semantically correct. |
| clarity | Was the answer concise and readable? | The answer is clean enough for a user or wrapper layer to use. |
| discipline | Did it avoid filler, evasiveness, or prompt-breakage habits? | The model stays well-behaved while answering. |

How a card score is calculated

The overall card score is weighted like this:

  • 75% task
  • 25% supporting quality

The supporting quality is the average of clarity and discipline.

So if a card gets:

  • task = 1.0
  • clarity = 0.5
  • discipline = 1.0

then the supporting quality is 0.75, and the overall card score is 0.9375.

That weighting is intentional. A semantically right answer should still do well even if it is a little rough.
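
The weighting above can be written out as a short sketch. The function names are hypothetical; the 75/25 split and the averaging of clarity and discipline are taken from the description above, and scores are assumed to lie between 0 and 1.

```python
TASK_WEIGHT = 0.75
SUPPORT_WEIGHT = 0.25


def supporting_quality(clarity: float, discipline: float) -> float:
    """Average of the two supporting dimensions."""
    return (clarity + discipline) / 2


def card_score(task: float, clarity: float, discipline: float) -> float:
    """Overall card score: 75% task, 25% supporting quality."""
    return TASK_WEIGHT * task + SUPPORT_WEIGHT * supporting_quality(clarity, discipline)


if __name__ == "__main__":
    # The worked example from the text: supporting quality 0.75, overall 0.9375.
    print(card_score(task=1.0, clarity=0.5, discipline=1.0))
    # A right-but-rough answer still outscores a polished-but-wrong one.
    print(card_score(task=1.0, clarity=0.5, discipline=0.5))  # 0.875
    print(card_score(task=0.0, clarity=1.0, discipline=1.0))  # 0.25
```

The last two calls show why the weighting matters: a perfectly polished answer with the wrong substance cannot score above 0.25.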

What counts as a pass

A card passes when:

  • the task dimension passes in full
  • the supporting dimensions are at least decent overall

In practice, that means:

  • a semantically wrong answer should fail
  • a semantically right but slightly messy answer can still pass
  • a polished but semantically wrong answer should not
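
As a rough sketch, the pass rule could look like the following. Two things here are assumptions rather than facts from the pack: that "passes in full" means a task score of 1.0, and that "at least decent overall" means the average of clarity and discipline reaches some threshold. The 0.5 value is a placeholder chosen for illustration.

```python
# Hypothetical threshold; the pack's real cutoff is not stated in this article.
SUPPORT_THRESHOLD = 0.5


def card_passes(
    task: float,
    clarity: float,
    discipline: float,
    support_threshold: float = SUPPORT_THRESHOLD,
) -> bool:
    """Pass only if the task is fully right and the supporting average is decent."""
    task_ok = task >= 1.0  # assumption: "in full" means a perfect task score
    support_ok = (clarity + discipline) / 2 >= support_threshold
    return task_ok and support_ok


if __name__ == "__main__":
    print(card_passes(task=1.0, clarity=0.5, discipline=1.0))  # right, slightly messy
    print(card_passes(task=0.0, clarity=1.0, discipline=1.0))  # polished, wrong
```

Under these assumptions the task check acts as a gate: no amount of clarity or discipline can rescue a semantically wrong answer.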

How to interpret score shapes

| Score pattern | What it usually means |
|---|---|
| High task, high clarity, high discipline | The model is genuinely strong for this pack. |
| High task, lower clarity | The model understands the memo but answers a little messily. |
| Low task, high clarity | The answer sounds neat but gets the job wrong. |
| High discipline, low task | The model is well-behaved but not very useful. |

For this benchmark, task_score is the main signal to trust. The other two dimensions help explain why a model looks good or bad.

Part 3

What the first local runs show

Once the pack was in place, we ran it locally through MLX on a few small on-device models.

| Model | Pass | Task | Clarity | Discipline |
|---|---|---|---|---|
| Qwen2.5 0.5B Instruct 4bit | 6/9 | 0.759 | 0.926 | 1.000 |
| Llama 3.2 1B Instruct 4bit | 5/9 | 0.648 | 0.852 | 0.889 |
| Qwen2.5 1.5B Instruct 4bit | 4/9 | 0.611 | 0.870 | 1.000 |

The important thing is not that these numbers are perfect. They are not.

The important thing is that the benchmark is now both winnable and interpretable.

The misses are things like:

  • saying Yes to the calendar question without describing the event
  • answering the memo question instead of asking a follow-up question
  • choosing the wrong memo type

Those are real semantic misses. They tell us something useful about the models.
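
For readers who want to build a similar summary row from per-card results, here is a minimal sketch. It assumes each column is a simple mean over the cards, which this article does not state explicitly. The `summarize` function and the sample values are made up for illustration; they are not the real per-card scores from these runs.

```python
from statistics import mean


def summarize(results: list[dict]) -> dict:
    """Collapse per-card results into one summary row for a model."""
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass": f"{passed}/{len(results)}",
        "task": round(mean(r["task"] for r in results), 3),
        "clarity": round(mean(r["clarity"] for r in results), 3),
        "discipline": round(mean(r["discipline"] for r in results), 3),
    }


if __name__ == "__main__":
    # Illustrative placeholder results for three cards, not real benchmark data.
    sample = [
        {"task": 1.0, "clarity": 1.0, "discipline": 1.0, "passed": True},
        {"task": 0.5, "clarity": 0.5, "discipline": 1.0, "passed": False},
        {"task": 1.0, "clarity": 1.0, "discipline": 1.0, "passed": True},
    ]
    print(summarize(sample))
```

Keeping the pass count separate from the dimension means makes it easy to see whether a model fails a few cards badly or many cards slightly.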

What this benchmark tells us now

  • It is possible for a tiny model to look clearly useful on this pack.
  • The pack still produces spread across local models.
  • The differences are easy to interpret by eye.

That is what I wanted from this reset: a benchmark that is fair to small models without becoming so soft that every model looks the same.

What happens next

The next steps are straightforward:

  • tighten the cards that are still a little loose
  • add a few harder semantic app moments
  • keep structured-output tests as a separate layer instead of the default one

Tiny models still need evaluation. They just need to be evaluated on the job they are actually being asked to do.

Semantic Eval Context

This supporting benchmark panel sits below the essay so the article can introduce itself before dropping into the scorecard.

Semantic Eval

Tiny models should be judged on semantic usefulness.

This pack stops asking local models for machine-perfect structure and instead asks whether they understand the memo, identify the user’s intent, and offer the right help.

semantic-core-v1

Capability First

Ask whether the model understood the memo and gave the right help, not whether it emitted our favorite wrapper object.

Plain Answers

Use natural-language prompts for titles, summaries, next actions, reminders, and follow-up questions unless the product truly needs structure.

Auditable

Keep the pack small and concrete enough that a human can quickly decide whether the answer was actually useful.

Core moments
  • Give this memo a useful title
  • What kind of memo is this?
  • What should the user do next?
  • Rewrite this memo more clearly
  • What matters most?
  • Should this become a reminder?
  • Should this become a calendar event?
  • What follow-up question should we ask?
  • Which old memo is most similar?

What we stopped measuring
  • Exact JSON obedience
  • Nested schema reliability
  • Agent-loop orchestration
  • Routing policy output
  • Knowledge graph packets
  • Voice OS action plans

Local Runs

How the first model sweep compares

3 local models
| Model | Pass | Task | Clarity | Discipline |
|---|---|---|---|---|
| Qwen2.5 0.5B Instruct 4bit | 6/9 | 0.759 | 0.926 | 1.000 |
| Llama 3.2 1B Instruct 4bit | 5/9 | 0.648 | 0.852 | 0.889 |
| Qwen2.5 1.5B Instruct 4bit | 4/9 | 0.611 | 0.870 | 1.000 |

Qwen2.5 0.5B Instruct 4bit

Best first local result. Misses were semantic, not parser-shaped.

Llama 3.2 1B Instruct 4bit

Different family, similar shape. Still useful, but less consistent.

Qwen2.5 1.5B Instruct 4bit

Interesting underperformer. Useful reminder that bigger is not automatically better here.