Designing A Semantic Eval For Tiny Models

Part 1

What this eval is for

Small local models are not miniature API assistants. They are better thought of as compact semantic helpers.

Give them a short memo and a narrow question, and they can often do useful work: suggest a title, identify the user's intent, rewrite a messy note, pull out the next step, or ask the one clarifying question that would make the memo usable.

That is the job this benchmark is designed to measure.

The core question

The point of this eval is not to ask whether a tiny model can emit a perfect JSON packet.

The point is to ask whether it can do small semantic jobs that make a voice-memo app feel smarter.

Can it name the note?
Can it identify the user's intent?
Can it extract the next step?
Can it clean up a messy transcript?
Can it notice reminder or calendar intent?
Can it retrieve a related note?
Can it ask a useful follow-up question?

The design rule

Prefer plain-language answers unless structure is truly the product requirement.

That means this pack does not default to exact JSON, nested field contracts, tool-call packets, or agent-loop outputs.

Those things may matter elsewhere in the stack. They are just not the right default test for a tiny local model.

The card set

The first version of the pack uses nine app moments:

Give this memo a useful title
What kind of memo is this?
What should the user do next?
Rewrite this memo more clearly
What matters most?
Should this become a reminder?
Should this become a calendar event?
What follow-up question should we ask?
Which old memo is most similar?

Each card is short, concrete, and close to a real product moment. That makes the pack easier to audit, and it makes it much harder for us to fool ourselves about the results.
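
One way to keep the card set auditable is to hold it as plain data. The sketch below is only an illustration: the `Card` class and the `card_id` values are hypothetical names introduced here, while the prompts are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    """One app moment in the pack (hypothetical structure, not the pack's real schema)."""
    card_id: str  # hypothetical short identifier, invented for this sketch
    prompt: str   # the app moment, copied from the card list


CARDS = [
    Card("title", "Give this memo a useful title"),
    Card("memo_type", "What kind of memo is this?"),
    Card("next_step", "What should the user do next?"),
    Card("rewrite", "Rewrite this memo more clearly"),
    Card("priority", "What matters most?"),
    Card("reminder", "Should this become a reminder?"),
    Card("calendar", "Should this become a calendar event?"),
    Card("follow_up", "What follow-up question should we ask?"),
    Card("similar_memo", "Which old memo is most similar?"),
]

# Nine app moments, each with a unique identifier.
assert len(CARDS) == 9
assert len({card.card_id for card in CARDS}) == len(CARDS)
```

Keeping the cards as a flat list like this makes it easy to read the whole pack in one screen.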

Part 2

How the scores work

Every card is graded on three dimensions. They are meant to answer three different questions, not one blended one.

Dimension	What it asks	What a high score means
`task`	Did the model get the substance right?	The answer is actually useful and semantically correct.
`clarity`	Was the answer concise and readable?	The answer is clean enough for a user or wrapper layer to use.
`discipline`	Did it avoid filler, evasiveness, or prompt-breakage habits?	The model stays well-behaved while answering.

How a card score is calculated

The overall card score is weighted like this:

75% task
25% supporting quality

The supporting quality is the average of clarity and discipline.

So if a card gets:

task = 1.0
clarity = 0.5
discipline = 1.0

then the supporting quality is 0.75, and the overall card score is 0.9375.

That weighting is intentional. A semantically right answer should still do well even if it is a little rough.
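
The weighting can be written out as a short calculation. This is a minimal sketch that assumes each dimension is a score between 0 and 1; the function and constant names are illustrative, not taken from the benchmark's own code.

```python
TASK_WEIGHT = 0.75     # 75% task
SUPPORT_WEIGHT = 0.25  # 25% supporting quality


def supporting_quality(clarity: float, discipline: float) -> float:
    """Average of the two supporting dimensions."""
    return (clarity + discipline) / 2


def card_score(task: float, clarity: float, discipline: float) -> float:
    """Overall card score: 75% task plus 25% supporting quality."""
    return TASK_WEIGHT * task + SUPPORT_WEIGHT * supporting_quality(clarity, discipline)


# The worked example from the text: supporting quality 0.75, overall 0.9375.
print(supporting_quality(clarity=0.5, discipline=1.0))      # 0.75
print(card_score(task=1.0, clarity=0.5, discipline=1.0))    # 0.9375
```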

What counts as a pass

A card passes when:

the task dimension passes in full
the supporting dimensions are at least decent overall

In practice, that means:

a semantically wrong answer should fail
a semantically right but slightly messy answer can still pass
a polished but semantically wrong answer should not
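
The pass rule can be sketched the same way. The text does not give a number for "at least decent", so the `SUPPORT_THRESHOLD` value below is an assumption made only for illustration, as are the function name and the choice to apply the threshold to the average of clarity and discipline.

```python
# ASSUMPTION: the source does not state a cutoff for "at least decent overall".
# 0.5 is a placeholder chosen for this sketch only.
SUPPORT_THRESHOLD = 0.5


def card_passes(task: float, clarity: float, discipline: float,
                support_threshold: float = SUPPORT_THRESHOLD) -> bool:
    """A card passes when task passes in full and supporting quality is decent."""
    task_passes_in_full = task >= 1.0
    support = (clarity + discipline) / 2
    return task_passes_in_full and support >= support_threshold


# Semantically right but slightly messy: can still pass.
print(card_passes(task=1.0, clarity=0.5, discipline=1.0))  # True
# Polished but semantically wrong: should not pass.
print(card_passes(task=0.0, clarity=1.0, discipline=1.0))  # False
```

Because the task check is a hard gate, no amount of polish can rescue a wrong answer, which matches the three cases listed above.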

How to interpret score shapes

Score pattern	What it usually means
High task, high clarity, high discipline	The model is genuinely strong for this pack.
High task, lower clarity	The model understands the memo but answers a little messily.
Low task, high clarity	The answer sounds neat but gets the job wrong.
High discipline, low task	The model is well-behaved but not very useful.

For this benchmark, the `task` score is the main signal to trust. The other two dimensions help explain why a model looks good or bad.
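
The score-shape table can also be read as a small lookup. The sketch below is a rough aid for eyeballing results, not part of the grader: the `HIGH` cutoff of 0.8 is an assumed value, and the function name is hypothetical.

```python
# ASSUMPTION: the source does not define what counts as a "high" score.
# 0.8 is an illustrative cutoff only.
HIGH = 0.8


def interpret_shape(task: float, clarity: float, discipline: float,
                    high: float = HIGH) -> list[str]:
    """Return every interpretation from the score-shape table that applies."""
    task_hi = task >= high
    clarity_hi = clarity >= high
    discipline_hi = discipline >= high

    notes = []
    if task_hi and clarity_hi and discipline_hi:
        notes.append("genuinely strong for this pack")
    if task_hi and not clarity_hi:
        notes.append("understands the memo but answers a little messily")
    if not task_hi and clarity_hi:
        notes.append("sounds neat but gets the job wrong")
    if not task_hi and discipline_hi:
        notes.append("well-behaved but not very useful")
    return notes


print(interpret_shape(task=0.9, clarity=0.5, discipline=1.0))
# ['understands the memo but answers a little messily']
```

A low-task result can match two rows at once, which is why the function returns a list rather than a single label.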

Part 3

What the first local runs show

Once the pack was in place, we ran it locally through MLX on a few small on-device models.

Model	Pass	Task	Clarity	Discipline
`Qwen2.5 0.5B Instruct 4bit`	`6/9`	`0.759`	`0.926`	`1.000`
`Llama 3.2 1B Instruct 4bit`	`5/9`	`0.648`	`0.852`	`0.889`
`Qwen2.5 1.5B Instruct 4bit`	`4/9`	`0.611`	`0.870`	`1.000`

The important thing is not that these numbers are perfect. They are not.

The important thing is that the benchmark is now both winnable and interpretable.
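
For readers who want to build a similar summary row from per-card results, here is a hedged sketch. It assumes the Task, Clarity, and Discipline columns are plain means over the cards, which the text does not state, and the `toy_results` values are made up for illustration; they are not outputs from any of the models above.

```python
from statistics import mean


def summarize(results: list[dict]) -> dict:
    """Collapse per-card results into one summary row (assumed: simple means)."""
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass": f"{passed}/{len(results)}",
        "task": round(mean(r["task"] for r in results), 3),
        "clarity": round(mean(r["clarity"] for r in results), 3),
        "discipline": round(mean(r["discipline"] for r in results), 3),
    }


# Toy data only: three invented cards, not real benchmark output.
toy_results = [
    {"task": 1.0, "clarity": 1.0, "discipline": 1.0, "passed": True},
    {"task": 0.5, "clarity": 1.0, "discipline": 1.0, "passed": False},
    {"task": 1.0, "clarity": 0.5, "discipline": 1.0, "passed": True},
]

summary = summarize(toy_results)
print(summary["pass"])  # 2/3
```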

The misses are things like:

saying "Yes" to the calendar question without describing the event
answering the memo question instead of asking a follow-up question
choosing the wrong memo type

Those are real semantic misses. They tell us something useful about the models.

What this benchmark tells us now

It is possible for a tiny model to look clearly useful on this pack.
The pack still produces spread across local models.
The differences are easy to interpret by eye.

That is what I wanted from this reset: a benchmark that is fair to small models without becoming so soft that every model looks the same.

What happens next

The next steps are straightforward:

tighten the cards that are still a little loose
add a few harder semantic app moments
keep structured-output tests as a separate layer instead of the default one

Tiny models still need evaluation. They just need to be evaluated on the job they are actually being asked to do.