Ideas

Essays, notes, and benchmark drafts.

Long-form writeups, benchmark notes, and day-by-day TILs from the bench where the title-plus-intent pack is being built. Older dictation work is still in the background.

Essay · April 8, 2026 · evals, extraction, uncertainty

Notes on extractions from voice memos

Pulling titles and lightweight actions from voice memos, how to measure the result, and what to do when an extraction is useful but not ready to finalize.

Essay · April 6, 2026 · evals, on-device-ml, tiny-models

Designing A Semantic Eval For Tiny Models

A reader-first walkthrough of a new semantic eval for tiny local models, including what it measures, how the scores work, and what the first local runs show.

Essay · April 5, 2026 · evals, benchmark-design, calibration

Building Core Eval v2

The practical design document behind core_eval_v2, from product truth and scoring layers to calibration rules and what still feels unfinished.

Essay · March 9, 2026 · fine-tuning, model-capacity, failure-analysis

Part 6: What a 0.6B Model Can't Learn

You can iterate forever on training data. At some point you have to ask whether the model is the right tool for the job.

Essay · March 8, 2026 · fine-tuning, mlx, lora

Part 5: From 18% to 79% in an Afternoon

Six training runs, zero hyperparameter changes. Every accuracy jump came from fixing the data.

Essay · March 7, 2026 · classifier, segmentation, on-device-ml

Part 4: Splitting the Stream

The whole-text classifier asks "should this go to the model?" The segment classifier asks "which words?"

Essay · March 6, 2026 · nlembedding, classifier, on-device-ml

Part 3: The 40-Millisecond Gate

A trained embedding classifier decides whether to call the LLM: 100% accuracy on held-out data, trained in 40 ms on 120 examples.

Essay · March 5, 2026 · on-device-ml, fine-tuning, architecture

Part 2: When Fine-Tuning Isn't the Answer (Yet)

Follow-up notes on why end-to-end fine-tuning was not the right next step, and how the split architecture emerged instead.

Essay · March 5, 2026 · mlx, fine-tuning, on-device-ml

Training on a Mac Mini

The model that ships isn't the one you planned. It's the one that survived your mistakes.

Essay · March 4, 2026 · mlx, fine-tuning, lora

Teaching a Tiny Model to Hear Bash

Fine-tuning a 1.5B model to reconstruct shell commands from voice. 97% accuracy, 3GB of RAM, under a second on a phone.
