I came to a session with a complaint and two weeks of logs. I was on Wispr Flow's free tier, hitting the 2,000-words-a-week ceiling almost daily. The paid plan is $12–15 a month — for speech-to-text, a thing my Apple Silicon Mac is already good at. So before paying, I tested the free way: a local Whisper model against the fastest cloud API, on the hardest audio I produce — Czech and English mixed mid-sentence. Claude's job was to check the math. I have a stake here too: I dictate everything, including my instructions to Claude. Spoiler: I'm right. The local model lost the race by two seconds and won everything else.
A free, local Whisper model on Apple Silicon transcribes dictation in about a second — fast enough that a $15/month subscription buys you almost nothing.
The short recipe: install Handy (open source, MIT), pick the Whisper large-v3-turbo model (1.5 GB, runs on Metal), set a hotkey, cancel the subscription. My voice never leaves my machine.
Every dictation app — paid or free — is a thin shell around a speech model. It either runs one locally (Whisper, Parakeet) or calls a cloud API, usually Groq because it's the fastest. So the real question isn't which app to buy — it's whether the local model is fast and accurate enough. I fed 42 seconds of mixed Czech + English audio to four setups on my Mac Mini M4 (16 GB):
| Approach | Time | Quality |
|---|---|---|
| Groq API (cloud) | 1.0 s | Perfect |
| whisper.cpp + Metal (local) | 3.0 s | Perfect |
| Whisper sherpa-onnx (CPU) | 5.7 s | Good |
| Parakeet (CoreML) | 7.8 s | Minor errors |
Yes, Groq is 3× faster. But that is 3× faster than 3 seconds, for 42 seconds of speech: a gap of two seconds. I cannot tell the difference in daily use, and my logs back me up. The thing I was afraid of, a hot, wheezing Mac, never happened. The model is 1.5 GB, Apple Silicon handles it easily, and the M4's fan doesn't even spin up.
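To put the table in absolute terms, here is a small sketch that converts the four measured times into a real-time factor and the extra wait compared with the fastest setup. The numbers are copied from the table; the function and field names are my own, not from any tool.

```python
# Audio length and wall-clock times copied from the benchmark table.
AUDIO_SECONDS = 42.0
TIMES = {
    "Groq API (cloud)": 1.0,
    "whisper.cpp + Metal (local)": 3.0,
    "Whisper sherpa-onnx (CPU)": 5.7,
    "Parakeet (CoreML)": 7.8,
}


def speed_report(audio_seconds, times):
    """Return one row per setup with relative and absolute speed figures."""
    fastest = min(times.values())
    rows = []
    for name, seconds in times.items():
        rows.append({
            "name": name,
            # How many seconds of audio are processed per second of compute.
            "realtime_x": audio_seconds / seconds,
            # The relative number that makes the cloud look essential.
            "slowdown_vs_fastest": seconds / fastest,
            # The absolute number you actually feel while waiting.
            "extra_wait_s": seconds - fastest,
        })
    return rows


REPORT = speed_report(AUDIO_SECONDS, TIMES)
for row in REPORT:
    print(f"{row['name']:<30} {row['realtime_x']:5.1f}x real time, "
          f"{row['slowdown_vs_fastest']:.1f}x the fastest time, "
          f"+{row['extra_wait_s']:.1f} s")
```

The local whisper.cpp row comes out at 14× real time and two seconds behind Groq, which is the same gap described above.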
The honest split: local for dictation (short clips, under two minutes — prompts, messages, notes), Groq for long recordings (30+ minute meetings — it does an hour in about a minute for a few cents). My daily use is entirely the first kind.
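That split can be written down as a tiny routing rule. This is only a sketch: the two thresholds come straight from the paragraph above, the function name and the "either" label are hypothetical, and the post draws no conclusion for clips between two and thirty minutes, so the sketch doesn't either.

```python
LOCAL_MAX_S = 2 * 60    # "under two minutes": prompts, messages, notes
CLOUD_MIN_S = 30 * 60   # "30+ minute meetings"


def pick_backend(duration_s: float) -> str:
    """Suggest where to transcribe a clip of the given length in seconds."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    if duration_s < LOCAL_MAX_S:
        return "local"
    if duration_s >= CLOUD_MIN_S:
        return "cloud"
    # The middle range is not covered by the post; leave the choice open.
    return "either"


for seconds in (15, 10 * 60, 60 * 60):
    print(f"{seconds:>5} s -> {pick_backend(seconds)}")
```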
My native language is Czech, and like most Czech developers I mix in English constantly — product names, tech terms, whole clauses. This is what kills most speech models. My 18.5-second test clip, verbatim:
So Whisper handles the language mixing perfectly whether it runs in the cloud or on my desk. Parakeet is the smallest model and is supposed to be the fastest, but in my test it came in last (7.8 s) and it trips on Czech. For pure English it's probably fine.
Benchmarks lie; usage logs don't. Handy keeps the last dictations, so after two weeks and ~260 dictations I handed Claude my log: the average clip is ~30 words, ~15 seconds of audio, and local processing takes ~1.1 s against Groq's ~0.4 s.
Claude did the math across all 260 dictations. The cloud would have saved me about three minutes of waiting — total — in exchange for shipping every clip of my voice to a server. That's the whole trade, and it's not close.
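The arithmetic is short enough to check by hand. A minimal version, using only the three numbers from my log (the averages are rounded, so treat the result as an estimate):

```python
DICTATIONS = 260   # clips over two weeks
LOCAL_S = 1.1      # average local processing time per clip
CLOUD_S = 0.4      # average Groq processing time per clip


def total_wait_saved(count, local_s, cloud_s):
    """Total seconds the cloud would have saved across all clips."""
    return count * (local_s - cloud_s)


saved = total_wait_saved(DICTATIONS, LOCAL_S, CLOUD_S)
print(f"{saved:.0f} s total, about {saved / 60:.1f} min over two weeks")
# prints: 182 s total, about 3.0 min over two weeks
```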
I deliberately skipped the one-time-purchase apps (Sotto $49, VoiceInk $25+) — this market moves so fast that whatever you buy today competes with a better free app next month. The open-source field is crowded: Handy (20k stars, MIT), OpenWhispr, FluidVoice, FreeFlow, TypeWhisper — most of them wrappers around whisper.cpp, the 48k-star engine underneath everything. Handy is the most active, so that's my daily driver. The whole install:
Under 15 minutes. Handy offered four models; I tested them and stuck with Whisper large-v3-turbo. The quantized large-v3 (1.0 GB) is a fine backup, NVIDIA's Canary I barely touched, and Parakeet you've already seen lose the Czech test. All four together are ~4 GB of disk; just the turbo model is 1.5 GB.
One thing I miss from Wispr Flow: streaming preview — text appearing live as you speak. Handy waits until you stop, then pastes. Someone tried to add it (PR #864) but it was closed unmerged; TypeWhisper already has it, so that's next on my list to try.
Every dictation app is a wrapper around Whisper, Parakeet, or a cloud API. Once I tested the models directly on my own audio, choosing an app became trivial — pick the most active open-source wrapper around the model that won.
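If you want to repeat the test on your own audio, the harness can be very small. This sketch is not the script I used. It assumes each backend is wrapped in a plain Python function that takes an audio path and returns text; `fake_backend` is a stand-in so the example runs anywhere, and you would replace it with your own wrapper around a local model or a cloud API.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(backends: Dict[str, Callable[[str], str]],
              audio_path: str, runs: int = 3) -> Dict[str, dict]:
    """Time each backend on the same clip and keep its last transcript."""
    results = {}
    for name, transcribe in backends.items():
        timings = []
        text = ""
        for _ in range(runs):
            start = time.perf_counter()
            text = transcribe(audio_path)
            timings.append(time.perf_counter() - start)
        results[name] = {
            # Median is less sensitive to a slow first run (model loading).
            "median_s": statistics.median(timings),
            "text": text,
        }
    return results


def fake_backend(audio_path: str) -> str:
    """Placeholder backend: pretends to work for 10 ms."""
    time.sleep(0.01)
    return f"transcript of {audio_path}"


RESULTS = benchmark({"fake": fake_backend}, "clip.wav", runs=2)
for name, result in RESULTS.items():
    print(f"{name}: {result['median_s']:.3f} s -> {result['text']}")
```

Timing alone is half the job: read the transcripts side by side, because the quality column is where the models separated.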
Clean English demos make every model look perfect. One 18-second clip of mixed Czech + English instantly separated the models that work for me from the ones that turn "Claude Code CLI" into "kotko CLI." Claude takes that mistake personally.
Relative numbers sell subscriptions; absolute numbers make decisions. 3× slower sounds bad — until it means 1.1 s instead of 0.4 s, 260 times over two weeks, for a grand total of three saved minutes.