What this eval is for
Small local models are not miniature API assistants. They are better thought of as compact semantic helpers.
Give them a short memo and a narrow question, and they can often do useful work: suggest a title, identify the user's intent, rewrite a messy note, pull out the next step, or ask the one clarifying question that would make the memo usable.
That is the job this benchmark is designed to measure.
The core question
The point of this eval is not to ask whether a tiny model can emit a perfect JSON packet.
The point is to ask whether it can do small semantic jobs that make a voice-memo app feel smarter.
- Can it name the note?
- Can it identify the user's intent?
- Can it extract the next step?
- Can it clean up a messy transcript?
- Can it notice reminder or calendar intent?
- Can it retrieve a related note?
- Can it ask a useful follow-up question?
The design rule
Prefer plain-language answers unless structure is truly the product requirement.
That means this pack does not default to exact JSON, nested field contracts, tool-call packets, or agent-loop outputs.
Those things may matter elsewhere in the stack. They are just not the right default test for a tiny local model.
The card set
The first version of the pack uses nine app moments:
- Give this memo a useful title
- What kind of memo is this?
- What should the user do next?
- Rewrite this memo more clearly
- What matters most?
- Should this become a reminder?
- Should this become a calendar event?
- What follow-up question should we ask?
- Which old memo is most similar?
Each card is short, concrete, and close to a real product moment. That makes the pack easier to audit and makes it much harder for us to fool ourselves.
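As a rough illustration of how small a card can be, here is a minimal sketch in Python. The `Card` class, the `card_id` values, and the `build_prompt` wording are all hypothetical and are not taken from the actual pack; only the nine questions come from the list above. The similarity card would also need a set of candidate memos, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    card_id: str   # hypothetical short identifier, not from the real pack
    question: str  # the app moment, worded as in the list above


CARDS = [
    Card("title", "Give this memo a useful title"),
    Card("memo_type", "What kind of memo is this?"),
    Card("next_step", "What should the user do next?"),
    Card("rewrite", "Rewrite this memo more clearly"),
    Card("key_point", "What matters most?"),
    Card("reminder", "Should this become a reminder?"),
    Card("calendar", "Should this become a calendar event?"),
    Card("follow_up", "What follow-up question should we ask?"),
    Card("similar", "Which old memo is most similar?"),
]


def build_prompt(memo: str, card: Card) -> str:
    """Combine a short memo and one narrow question into a plain-language prompt."""
    return (
        f"Memo:\n{memo.strip()}\n\n"
        f"Task: {card.question}\n"
        "Answer in plain language."
    )


if __name__ == "__main__":
    print(build_prompt("call the dentist tomorrow about the crown", CARDS[2]))
```

Keeping each card to one question and one memo is what lets a human check the answer at a glance.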
How the scores work
Every card is graded on three dimensions. They are meant to answer three different questions, not one blended one.
| Dimension | What it asks | What a high score means |
|---|---|---|
| task | Did the model get the substance right? | The answer is actually useful and semantically correct. |
| clarity | Was the answer concise and readable? | The answer is clean enough for a user or wrapper layer to use. |
| discipline | Did it avoid filler, evasiveness, or prompt-breakage habits? | The model stays well-behaved while answering. |
How a card score is calculated
The overall card score is weighted like this:
- 75% task
- 25% supporting quality
The supporting quality is the average of clarity and discipline.
So if a card gets:
- task = 1.0
- clarity = 0.5
- discipline = 1.0

then the supporting quality is 0.75, and the overall card score is 0.9375.
That weighting is intentional. A semantically right answer should still do well even if it is a little rough.
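The weighting can be written out as a short Python sketch. The function names are illustrative and not taken from the pack's code; the 75/25 split and the averaging of clarity and discipline follow the description above.

```python
def supporting_quality(clarity: float, discipline: float) -> float:
    """Average of the two supporting dimensions."""
    return (clarity + discipline) / 2


def card_score(task: float, clarity: float, discipline: float) -> float:
    """75% task plus 25% supporting quality."""
    return 0.75 * task + 0.25 * supporting_quality(clarity, discipline)


if __name__ == "__main__":
    # Worked example from the text: 0.75 * 1.0 + 0.25 * 0.75
    print(card_score(task=1.0, clarity=0.5, discipline=1.0))  # 0.9375
```

Because task carries three quarters of the weight, dropping clarity from 1.0 to 0.5 costs only 0.0625 of the card score in this example.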
What counts as a pass
A card passes when:
- the `task` dimension passes in full
- the supporting dimensions are at least decent overall
In practice, that means:
- a semantically wrong answer should fail
- a semantically right but slightly messy answer can still pass
- a polished but semantically wrong answer should not
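The pass rule can be sketched as a small Python check. The text does not give a number for "at least decent overall", so the `SUPPORT_THRESHOLD` of 0.5 below is an assumption, as is reading "passes in full" as a task score of 1.0.

```python
# Assumed cutoff for "at least decent overall"; the source gives no number.
SUPPORT_THRESHOLD = 0.5


def card_passes(task: float, clarity: float, discipline: float,
                support_threshold: float = SUPPORT_THRESHOLD) -> bool:
    """Pass only if task is fully right and supporting quality is decent."""
    task_ok = task >= 1.0  # assumption: "in full" means a perfect task score
    support_ok = (clarity + discipline) / 2 >= support_threshold
    return task_ok and support_ok


if __name__ == "__main__":
    print(card_passes(1.0, 0.5, 1.0))  # right but slightly messy
    print(card_passes(0.5, 1.0, 1.0))  # polished but semantically off
```

Under these assumptions the first call passes and the second fails, which matches the three cases listed above.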
How to interpret score shapes
| Score pattern | What it usually means |
|---|---|
| High task, high clarity, high discipline | The model is genuinely strong for this pack. |
| High task, lower clarity | The model understands the memo but answers a little messily. |
| Low task, high clarity | The answer sounds neat but gets the job wrong. |
| High discipline, low task | The model is well-behaved but not very useful. |
For this benchmark, task_score is the main signal to trust. The other two dimensions help explain why a model looks good or bad.
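For readers who want to apply the table mechanically, here is one possible Python sketch. The `HIGH` cutoff of 0.8 is an assumption chosen for illustration (the article does not define "high" or "low"), and `describe_shape` is a hypothetical helper, not part of the pack.

```python
HIGH = 0.8  # assumed cutoff between "high" and "low"; not defined in the source


def describe_shape(task: float, clarity: float, discipline: float,
                   high: float = HIGH) -> str:
    """Map one set of scores to the closest row of the score-pattern table."""
    task_hi = task >= high
    clarity_hi = clarity >= high
    discipline_hi = discipline >= high

    if task_hi and clarity_hi and discipline_hi:
        return "genuinely strong for this pack"
    if task_hi and not clarity_hi:
        return "understands the memo but answers a little messily"
    if not task_hi and clarity_hi:
        return "sounds neat but gets the job wrong"
    if not task_hi and discipline_hi:
        return "well-behaved but not very useful"
    return "no pattern from the table applies"


if __name__ == "__main__":
    print(describe_shape(task=0.9, clarity=0.5, discipline=1.0))
```

The checks run in the table's row order, so a low-task answer that is both clear and disciplined is reported under the "sounds neat" row first.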
What the first local runs show
Once the pack was in place, we ran it locally through MLX on a few small on-device models.
| Model | Pass | Task | Clarity | Discipline |
|---|---|---|---|---|
| Qwen2.5 0.5B Instruct 4bit | 6/9 | 0.759 | 0.926 | 1.000 |
| Llama 3.2 1B Instruct 4bit | 5/9 | 0.648 | 0.852 | 0.889 |
| Qwen2.5 1.5B Instruct 4bit | 4/9 | 0.611 | 0.870 | 1.000 |
The important thing is not that these numbers are perfect. They are not.
The important thing is that the benchmark is now both winnable and interpretable.
The misses are things like:
- saying `Yes` to the calendar question without describing the event
- answering the memo question instead of asking a follow-up question
- choosing the wrong memo type
Those are real semantic misses. They tell us something useful about the models.
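A summary row like the ones in the table could be produced from per-card results along these lines. This is a sketch, assuming the Task, Clarity, and Discipline columns are plain per-card means, which the article does not state. The `summarize` helper and the three sample cards are made up for illustration and are not data from the actual runs.

```python
from statistics import mean


def summarize(results: list[dict]) -> dict:
    """Collapse per-card results into one row: pass count plus mean scores."""
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass": f"{passed}/{len(results)}",
        "task": round(mean([r["task"] for r in results]), 3),
        "clarity": round(mean([r["clarity"] for r in results]), 3),
        "discipline": round(mean([r["discipline"] for r in results]), 3),
    }


if __name__ == "__main__":
    # Invented per-card scores, for illustration only.
    sample = [
        {"task": 1.0, "clarity": 1.0, "discipline": 1.0, "passed": True},
        {"task": 0.5, "clarity": 1.0, "discipline": 1.0, "passed": False},
        {"task": 0.0, "clarity": 0.5, "discipline": 1.0, "passed": False},
    ]
    print(summarize(sample))
```

Keeping the pass count separate from the mean scores preserves the distinction the article draws: the pass column says how many cards worked, and the other columns help explain why.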
What this benchmark tells us now
- It is possible for a tiny model to look clearly useful on this pack.
- The pack still produces spread across local models.
- The differences are easy to interpret by eye.
That is what I wanted from this reset: a benchmark that is fair to small models without becoming so soft that every model looks the same.
What happens next
The next steps are straightforward:
- tighten the cards that are still a little loose
- add a few harder semantic app moments
- keep structured-output tests as a separate layer instead of the default one
Tiny models still need evaluation. They just need to be evaluated on the job they are actually being asked to do.
Semantic Eval Context
Semantic Eval
Tiny models should be judged on semantic usefulness.
This pack stops asking local models for machine-perfect structure and instead asks whether they understand the memo, identify the user’s intent, and offer the right help.
Ask whether the model understood the memo and gave the right help, not whether it emitted our favorite wrapper object.
Use natural-language prompts for titles, summaries, next actions, reminders, and follow-up questions unless the product truly needs structure.
Keep the pack small and concrete enough that a human can quickly decide whether the answer was actually useful.
What this pack does not test by default:
- Exact JSON obedience
- Nested schema reliability
- Agent-loop orchestration
- Routing policy output
- Knowledge graph packets
- Voice OS action plans
Local Runs
How the first model sweep compares
- Qwen2.5 0.5B Instruct 4bit: Best first local result. Misses were semantic, not parser-shaped.
- Llama 3.2 1B Instruct 4bit: Different family, similar shape. Still useful, but less consistent.
- Qwen2.5 1.5B Instruct 4bit: Interesting underperformer. Useful reminder that bigger is not automatically better here.