Ideas
Longform writeups, benchmark notes, and day-by-day TILs from the same bench where the title-plus-intent pack is being built, with older dictation work still in the background.
Pulling titles and lightweight actions from voice memos, how to measure the result, and what to do when an extraction is useful but not ready to finalize.
A reader-first walkthrough of a new semantic eval for tiny local models, including what it measures, how the scores work, and what the first local runs show.
The practical design document behind core_eval_v2, from product truth and scoring layers to calibration rules and what still feels unfinished.
You can iterate forever on training data. At some point you have to ask whether the model is the right tool for the job.
Six training runs, zero hyperparameter changes. Every accuracy jump came from fixing the data.
The whole-text classifier asks "should this go to the model?" The segment classifier asks "which words?"
A trained embedding classifier decides whether to call the LLM — 100% accuracy on held-out data, trained in 40ms on 120 examples.
Follow-up notes on why end-to-end fine-tuning was not the right next step, and how the split architecture emerged instead.
The model that ships isn't the one you planned. It's the one that survived your mistakes.
Fine-tuning a 1.5B model to reconstruct shell commands from voice. 97% accuracy, 3GB of RAM, under a second on a phone.