A field note

$180 a year for dictation,
or 1.5 GB of free Whisper

I came to a session with a complaint and two weeks of logs. I was on Wispr Flow's free tier, hitting the 2,000-words-a-week ceiling almost daily. The paid plan is $12–15 a month for speech-to-text, a thing my Apple Silicon Mac is already good at. So before paying, I tested the free way: a local Whisper model against the fastest cloud API, on the hardest audio I produce, Czech and English mixed mid-sentence. Claude's job was to check the math. I have a stake in the answer: I dictate everything, including my instructions to Claude. Spoiler: I was right. The local model lost the race by two seconds and won everything else.

The one-sentence version

A free, local Whisper model on Apple Silicon transcribes dictation in about a second — fast enough that a $15/month subscription buys you almost nothing.

The short recipe: install Handy (open source, MIT), pick the Whisper large-v3-turbo model (1.5 GB, runs on Metal), set a hotkey, cancel the subscription. My voice never leaves my machine.

The benchmark: cloud vs my Mac Mini

Every dictation app — paid or free — is a thin shell around a speech model. It either runs one locally (Whisper, Parakeet) or calls a cloud API, usually Groq because it's the fastest. So the real question isn't which app to buy — it's whether the local model is fast and accurate enough. I fed 42 seconds of mixed Czech + English audio to four setups on my Mac Mini M4 (16 GB):

42 s of Czech + English audio, transcription time
Approach                    | Time  | Quality
Groq API (cloud)            | 1.0 s | Perfect
whisper.cpp + Metal (local) | 3.0 s | Perfect
Whisper sherpa-onnx (CPU)   | 5.7 s | Good
Parakeet (CoreML)           | 7.8 s | Minor errors

Yes, Groq is 3× faster. But that is 3× faster than 3 seconds, for 42 seconds of speech. I cannot tell the difference in daily use, and my logs back me up. The thing I was afraid of, a hot, wheezing Mac, never happened. The model is 1.5 GB, Apple Silicon handles it easily, and the M4's fan doesn't even spin up.
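To see the table in absolute terms, here is a minimal Python sketch that turns each row into a speed relative to real time and an extra wait compared with the fastest setup. The timings are copied from the table; the function and variable names are made up for illustration.

```python
# Seconds needed to transcribe the 42 s clip, copied from the benchmark table.
CLIP_SECONDS = 42.0
RESULTS = {
    "Groq API (cloud)": 1.0,
    "whisper.cpp + Metal (local)": 3.0,
    "Whisper sherpa-onnx (CPU)": 5.7,
    "Parakeet (CoreML)": 7.8,
}


def summarize(results, clip_seconds):
    """Return (name, seconds, speed vs real time, extra wait vs fastest) rows."""
    fastest = min(results.values())
    rows = []
    for name, seconds in results.items():
        rows.append((name, seconds, clip_seconds / seconds, seconds - fastest))
    return rows


for name, seconds, realtime_x, extra_wait in summarize(RESULTS, CLIP_SECONDS):
    print(f"{name:30s} {seconds:4.1f} s  {realtime_x:5.1f}x real time  +{extra_wait:.1f} s")
```

Read this way, the local whisper.cpp run is still 14 times faster than real time, and the gap to Groq is two seconds on a 42-second clip.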

The honest split: local for dictation (short clips, under two minutes — prompts, messages, notes), Groq for long recordings (30+ minute meetings — it does an hour in about a minute for a few cents). My daily use is entirely the first kind.

The hard test: Czech and English in one sentence

My native language is Czech, and like most Czech developers I mix in English constantly — product names, tech terms, whole clauses. This is what kills most speech models. My 18.5-second test clip, verbatim:

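„Tohle je test. By mě zajímalo, jak moc to bude přesný. And also I'll try to speak in English, jestli to dokáže udělat i oboje jazyky. A pak některá slova jako Open Code, Cloud Code CLI a tak."
(In English: "This is a test. I'd like to know how accurate it will be. And also I'll try to speak in English, to see whether it can handle both languages. And then some words like Open Code, Cloud Code CLI and so on.")

How each model did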
Groq API (0.5 s) — perfect, character-for-character what I said
whisper.cpp + Metal, local (1.6 s) — also perfect, identical
Parakeet (CoreML) (1.2 s) — „by mě zajímalo" → „aby mě zajímalo", „and also" → „a also", and „Cloud Code CLI" → kotko CLI

So Whisper handles the language mixing perfectly whether it runs in the cloud or on my desk. Parakeet is the smallest and fastest model, but it trips on Czech — for pure English it's probably fine.
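Spotting these slips by eye gets tedious once you have several models to compare. Below is a small sketch of a word-level comparison using Python's standard difflib module. The function name is hypothetical, and the sample strings are just the fragments quoted above, not full model outputs.

```python
import difflib


def word_diffs(reference, hypothesis):
    """List (expected, got) word-level mismatches between two transcripts."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the reference words and whatever the model produced instead.
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2])))
    return diffs


# Fragments taken from the Parakeet result above.
print(word_diffs("Cloud Code CLI", "kotko CLI"))
print(word_diffs("and also", "a also"))
```

An empty list means the transcript matched the reference word for word. This splits on whitespace only, so punctuation and capitalization differences count as mismatches.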

Two weeks of real dictations

Benchmarks lie; usage logs don't. Handy keeps the last dictations, so after two weeks and ~260 dictations I handed Claude my log: the average clip is ~30 words, ~15 seconds of audio, and local processing takes ~1.1 s against Groq's ~0.4 s.

Claude did the math across all 260 dictations. The cloud would have saved me about three minutes of waiting — total — in exchange for shipping every clip of my voice to a server. That's the whole trade, and it's not close.
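The arithmetic is short enough to check yourself. A minimal sketch, assuming every dictation takes the average times from my log (the variable names are illustrative):

```python
# Averages from two weeks of Handy logs, as quoted above.
DICTATIONS = 260
LOCAL_S = 1.1   # average local processing time per clip, in seconds
CLOUD_S = 0.4   # average Groq time per clip, in seconds


def total_extra_wait_s(count, local_s, cloud_s):
    """Total extra seconds spent waiting on local instead of cloud transcription."""
    return count * (local_s - cloud_s)


extra = total_extra_wait_s(DICTATIONS, LOCAL_S, CLOUD_S)
print(f"{extra:.0f} s total, about {extra / 60:.1f} min over {DICTATIONS} dictations")
```

That is 260 × 0.7 s = 182 s, or roughly three minutes across two weeks.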

The setup that won

I deliberately skipped the one-time-purchase apps (Sotto $49, VoiceInk $25+) — this market moves so fast that whatever you buy today competes with a better free app next month. The open-source field is crowded: Handy (20k stars, MIT), OpenWhispr, FluidVoice, FreeFlow, TypeWhisper — most of them wrappers around whisper.cpp, the 48k-star engine underneath everything. Handy is the most active, so that's my daily driver. The whole install:

[Image: the handy.computer homepage, with a cartoon waving-hand mascot, the tagline "speak into any text field" and a download button for Mac]
"Speak into any text field" — handy.computer. Free, MIT-licensed, and the mascot is a hand.
  1. Install Handy: brew install --cask handy
  2. Pick the Whisper large-v3-turbo model (1.5 GB download).
  3. Set a hotkey.
  4. Cancel your subscription.

Under 15 minutes. Handy offered four models; I tested them and stuck with Whisper large-v3-turbo. The quantized large-v3 (1.0 GB) is a fine backup, NVIDIA's Canary I barely touched, and Parakeet you've already seen lose the Czech test. All four together are ~4 GB of disk; just the turbo model is 1.5 GB.

One thing I miss from Wispr Flow: streaming preview — text appearing live as you speak. Handy waits until you stop, then pastes. Someone tried to add it (PR #864) but it was closed unmerged; TypeWhisper already has it, so that's next on my list to try.

Three things worth stealing

Benchmark the model, not the app

Every dictation app is a wrapper around Whisper, Parakeet, or a cloud API. Once I tested the models directly on my own audio, choosing an app became trivial — pick the most active open-source wrapper around the model that won.
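If you want to repeat this on your own audio, a timing harness can be very small. The sketch below is not the script I used; it assumes you wrap each backend (a local model, a cloud API) in a plain Python function that takes a file path and returns text. The placeholder function and the file name are made up so the example runs on its own.

```python
import statistics
import time


def time_transcriber(transcribe, audio_path, runs=3):
    """Call transcribe(audio_path) several times; return (median seconds, last text)."""
    timings = []
    text = ""
    for _ in range(runs):
        start = time.perf_counter()
        text = transcribe(audio_path)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), text


def fake_transcribe(audio_path):
    # Placeholder backend so the sketch runs; swap in a real local or cloud call.
    time.sleep(0.01)
    return f"(transcript of {audio_path})"


seconds, text = time_transcriber(fake_transcribe, "my-hardest-clip.wav")
print(f"{seconds:.3f} s -> {text}")
```

Taking the median over a few runs smooths out one-off delays such as model loading or a slow network round trip.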

Test with your hardest audio, not a demo

Clean English demos make every model look perfect. One 18-second clip of mixed Czech + English instantly separated the models that work for me from the ones that turn "Claude Code CLI" into "kotko CLI." Claude takes that mistake personally.

Convert "3× slower" into absolute seconds

Relative numbers sell subscriptions; absolute numbers make decisions. 3× slower sounds bad — until it means 1.1 s instead of 0.4 s, 260 times over two weeks, for a grand total of three saved minutes.

The subscription isn't selling you transcription — your Mac already owns that. It's selling you a second of latency and a streaming preview, for $180 a year, paid in money and in every clip of your voice. A 1.5 GB model on your own silicon does the actual job, free, offline, forever.