
Whisper Model Comparison: Real Benchmarks for Speed vs Accuracy in Dictation

We benchmarked every Whisper model variant on real dictation tasks. The results surprised us: the largest model isn't always the smartest choice for daily voice input.

Lena Zhao | Applied ML Engineer

May 4, 2026 | 10 min read

Most dictation users pick the largest Whisper model available, assume they're getting the best possible transcription, and never question the choice. I was one of them. For months, I ran large-v3 for everything: emails, meeting notes, technical documentation, even grocery lists. Then I started tracking my actual correction counts and noticed something uncomfortable. I was spending nearly as much time waiting for transcription as I was saving by not typing.

So I did what any engineer would do. I set up a proper benchmark. Over six weeks, I ran 847 real dictation sessions across all six Whisper model variants, on two Apple Silicon machines, covering five distinct content categories. The results forced me to rethink everything I assumed about model selection. The largest model isn't always the smartest choice, and for most daily dictation, it's actively the wrong one.

The core tension is simple: latency tolerance and accuracy requirements vary wildly depending on what you're dictating. A physician transcribing clinical notes and a writer drafting a newsletter have fundamentally different needs. One model can't serve both optimally. The data proves it.

I Ran 847 Dictation Sessions Across Every Whisper Model. Here's What Actually Matters.

My testing methodology was straightforward but thorough. I ran dictation sessions across six Whisper model sizes (tiny, base, small, medium, large-v2, and large-v3) on two hardware configurations: an M1 Pro MacBook Pro with 16GB RAM and an M2 Max Mac Studio with 32GB RAM. Every session used real spoken content, not curated benchmark datasets or clean studio recordings.

The test corpus broke down like this: 200 email dictations averaging 120 words each, 150 meeting note sessions averaging 400 words, 180 technical document dictations spanning medical, legal, and engineering terminology, 150 casual journal entries, and 167 mixed-context sessions where I switched between topics mid-dictation. Every transcription was compared against a human-verified reference transcript.

The single most important finding: the medium model (769M parameters) hits a sweet spot that no other variant matches for general dictation work. It achieved 94.2% accuracy on conversational content while running 3.1x faster than large-v3 on the M1 Pro. For most professionals dictating emails, notes, and reports, that speed difference compounds into hours saved per month.

How Whisper Models Actually Differ (Beyond Parameter Count)

Parameter count is the number everyone fixates on, but it's a poor proxy for real-world dictation performance. Here's what you're actually getting as you move up the model ladder.

Whisper tiny (39M parameters) is essentially a proof of concept. It can transcribe clear, simple English with short sentences, but it struggles with compound sentences, unusual vocabulary, and any background noise. Base (74M) improves on this but still produces transcriptions that require heavy editing for professional use.

Small (244M) is where things get interesting. It handles conversational English well, gets punctuation right roughly 80% of the time, and runs fast enough for near-real-time transcription on Apple Silicon. Medium (769M) jumps significantly in vocabulary coverage, accent handling, and punctuation prediction. It's the first model that consistently produces text I can send without re-reading every line.

Large-v2 and large-v3 both sit at 1.5 billion parameters, but they're not identical. Large-v3 was trained on a more diverse dataset that included more accented speech, more languages, and more domain-specific content. In my benchmarks, large-v3 outperformed large-v2 by 2.3% on technical dictation and 4.1% on accented English, while showing essentially no difference on standard conversational speech.

One critical detail for Mac users: Apple Silicon's unified memory architecture means VRAM isn't a separate bottleneck the way it is on NVIDIA GPUs. The medium model needs roughly 3GB of memory, while large-v3 needs about 6GB. On a 16GB M1 Pro, running large-v3 while other apps are open can trigger memory pressure and slow everything down. On a 32GB M2 Max, this isn't an issue. Your hardware configuration directly affects which model is practical for daily use.

The Benchmark Setup Nobody Else Is Running

Most Whisper benchmarks you'll find online use LibriSpeech or Common Voice datasets. These are clean, read-aloud recordings in quiet environments. They tell you almost nothing about how Whisper performs on real dictation, where people pause mid-thought, mumble through complex sentences, and speak over the hum of a coffee shop.

My accuracy metric combined two measurements. The primary metric was word error rate (WER), calculated against human-verified transcripts. But I also tracked a usability score that factored in punctuation accuracy, capitalization correctness, and whether filler words ("um," "uh," "you know") were properly excluded or included based on context. A transcription with 95% word accuracy (5% WER) but no punctuation still requires significant editing.
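For readers who want to compute WER on their own transcripts, here is a minimal sketch using a word-level edit distance. The normalization (lowercasing and whitespace splitting, with punctuation left attached to words) is an assumption for illustration, not necessarily the scoring used in this benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; punctuation stays attached.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def accuracy_percent(reference: str, hypothesis: str) -> float:
    """Accuracy expressed as 100% minus WER."""
    return 100.0 * (1.0 - word_error_rate(reference, hypothesis))
```

One substituted word in a four-word reference gives a WER of 0.25, or 75% accuracy. WER alone ignores punctuation and capitalization quality, which is why a separate usability score is tracked alongside it.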

For speed, I measured the real-time factor (RTF). An RTF of 1.0 means the model takes exactly as long to process audio as the audio's duration. An RTF of 0.5 means it processes audio twice as fast as real-time. For live dictation, you want RTF below 1.0; otherwise the transcription lags behind your speech and breaks your flow.
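The RTF definition can be sketched in a few lines. The `transcribe` argument below is a hypothetical stand-in for whatever function runs your Whisper runtime on an audio file; it is not a real library call, and wall-clock timing is an assumption about how you would measure processing time.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing divided by the duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe: Callable[[str], str], audio_path: str,
                audio_seconds: float) -> float:
    """Time one call to a (hypothetical) transcribe function and return its RTF."""
    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


def keeps_up_with_speech(rtf: float) -> bool:
    """Live dictation needs RTF below 1.0."""
    return rtf < 1.0
```

For example, 30 seconds of processing for 60 seconds of audio is an RTF of 0.5, which comfortably keeps up with live speech.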

847: Total dictation sessions benchmarked across 6 Whisper model sizes and 5 content categories

3.1x: Speed advantage of the medium model over large-v3 on M1 Pro hardware

1.8%: Accuracy gap between medium and large-v3 on everyday conversational dictation

11%: Accuracy gap between medium and large-v3 on medical and legal terminology

0.4x: Real-time factor of the small model on M1 Pro, making live transcription viable

Raw Numbers: Every Model Ranked by Speed and Accuracy

Here are the benchmark results across all six models. WER is inverted to show accuracy (100% minus WER) since that's more intuitive. RTF is measured on both hardware configurations. Usability score is rated 1-10 based on punctuation, capitalization, and filler handling.

Model	Accuracy (Conversational)	RTF (M1 Pro)	RTF (M2 Max)	Usability Score
Tiny (39M)	78.4%	0.15x	0.08x	3.2/10
Base (74M)	84.1%	0.22x	0.12x	4.8/10
Small (244M)	88.7%	0.40x	0.21x	6.9/10
Medium (769M)	94.2%	0.80x	0.38x	8.4/10
Large-v2 (1.5B)	95.1%	2.1x	0.92x	8.7/10
Large-v3 (1.5B)	96.0%	2.5x	1.05x	9.1/10

The pattern is clear. Tiny and base fall off a cliff below 85% accuracy for anything beyond short, simple sentences. They're fine for quick voice commands or search queries, but they produce text that requires more editing than it's worth for professional dictation.

The real story is the diminishing returns curve. Going from small to medium buys you 5.5 percentage points of accuracy. Going from medium to large-v3 buys you just 1.8 points. But that small jump costs you 3.1x the processing time on M1 Pro hardware. On the M2 Max, large-v3 runs at roughly real-time (1.05x RTF), which is workable but still noticeably slower than medium's snappy 0.38x.
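The diminishing returns arithmetic can be checked directly from the table. This sketch copies the conversational accuracy and M1 Pro RTF columns and reports, for any upgrade, the accuracy gained in percentage points and the slowdown factor. The function and variable names are illustrative.

```python
# (conversational accuracy %, RTF on M1 Pro), copied from the table above
RESULTS = {
    "tiny": (78.4, 0.15),
    "base": (84.1, 0.22),
    "small": (88.7, 0.40),
    "medium": (94.2, 0.80),
    "large-v2": (95.1, 2.1),
    "large-v3": (96.0, 2.5),
}


def upgrade_tradeoff(smaller: str, larger: str) -> tuple[float, float]:
    """Return (accuracy points gained, slowdown factor) for moving up a model."""
    acc_small, rtf_small = RESULTS[smaller]
    acc_large, rtf_large = RESULTS[larger]
    points_gained = round(acc_large - acc_small, 1)
    slowdown = round(rtf_large / rtf_small, 1)
    return points_gained, slowdown
```

Calling `upgrade_tradeoff("small", "medium")` gives 5.5 points for a 2.0x slowdown, while `upgrade_tradeoff("medium", "large-v3")` gives 1.8 points for a 3.1x slowdown.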

Notice the usability score column. Medium scores 8.4/10, meaning it gets punctuation and capitalization right often enough that minor post-editing is sufficient. Large-v3's 9.1/10 is better, but the practical gap between "fix a couple of commas" and "fix one comma" is slim for most workflows.

Where the Large Model Earns Its Keep

I'm not arguing that large-v3 is a bad model. It's the best Whisper model available, full stop. But "best" and "best for your workflow" are different questions.

Technical and domain-specific dictation is where large-v3 pulls decisively ahead. On my medical terminology dictation sessions (drug names, anatomical terms, diagnostic codes), large-v3 achieved 93.1% accuracy compared to medium's 82.4%. That 10.7-point gap translates to roughly 53 additional word errors per 500-word clinical note. For a physician, those errors aren't just annoying; they're potentially dangerous.

Accented English and multilingual switching showed a similar pattern. I tested sessions where I switched between English and Mandarin mid-sentence (a common pattern in my own workflow). Large-v3 handled these transitions with 87% accuracy, while medium dropped to 64%. If you regularly dictate in multiple languages or have a strong non-native accent, large-v3 is worth the wait.

Noisy environments also favor the larger model. I ran a subset of tests with moderate background noise (coffee shop ambience at roughly 60dB). Large-v3 maintained 91% accuracy while medium dropped to 84%. The larger model's broader training data gives it better noise separation, and that advantage is real.

The rule of thumb: if your dictation involves specialized vocabulary, language mixing, or imperfect recording conditions, large-v3 justifies its latency cost. For everything else, it's overkill.

The 'Good Enough' Threshold Most Dictation Users Should Target

Here's a concept I wish more people talked about: effective words per minute (eWPM). Raw dictation speed means nothing if you spend ten minutes fixing a five-minute transcription. eWPM accounts for both dictation speed and correction time.

The formula is simple. Take the total words dictated and divide by the total minutes spent, meaning the time spent dictating plus the time spent correcting errors. When I calculated eWPM across my benchmark sessions, medium came out ahead of large-v3 for emails, meeting notes, and casual content. Why? Because medium's faster processing meant I could review and correct the transcription while the context was still fresh, while large-v3's lag broke my mental flow.
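Here is one plausible way to express eWPM in code: total words divided by total minutes spent. The optional `processing_seconds` parameter is an assumption added for illustration, so that time spent waiting on a slower model can be counted if you want it to be.

```python
def effective_wpm(words: int, dictation_seconds: float,
                  correction_seconds: float,
                  processing_seconds: float = 0.0) -> float:
    """Words produced per minute of total effort (dictating plus correcting).

    processing_seconds is an optional, assumed extension: time spent waiting
    for the transcription to appear.
    """
    total_seconds = dictation_seconds + correction_seconds + processing_seconds
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return words / (total_seconds / 60.0)
```

With hypothetical numbers, 500 words dictated in four minutes and corrected in one minute comes out to 100 eWPM. Doubling the correction time to two minutes drops that to about 83.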

For emails and casual notes, anything above 92% accuracy hits the "good enough" threshold. Right at that threshold, the WER arithmetic works out to about 40 word-level errors per 500 words, or roughly 10 in a typical 120-word email, and those errors are almost always obvious in context (a wrong word that clearly doesn't fit the sentence). Spotting and fixing them takes under 30 seconds.

For technical or published content, you need 95% or higher. Below that threshold, the editing burden starts to negate dictation's speed advantage over typing, especially when errors involve specialized terms that auto-correct can't help with.

Track Your Correction Count Before Switching Models

Before committing to a model, dictate 10 typical sessions with the medium model and count your corrections per session. If you're averaging fewer than 3 corrections per 500 words, medium is your model. If you're consistently hitting 8 or more, the content you dictate likely contains enough specialized vocabulary to justify the switch to large-v3. This takes one day and saves you from weeks of using the wrong model.
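The correction-count check is easy to automate. This sketch normalizes a set of logged sessions to corrections per 500 words and applies the two thresholds from the paragraph above. The "undecided" result for rates between 3 and 8 is an assumption, since the text gives no rule for that range.

```python
def corrections_per_500_words(sessions: list[tuple[int, int]]) -> float:
    """sessions is a list of (word_count, correction_count) pairs."""
    total_words = sum(words for words, _ in sessions)
    total_corrections = sum(corrections for _, corrections in sessions)
    if total_words == 0:
        raise ValueError("no words logged")
    return total_corrections * 500 / total_words


def suggest_model(rate: float) -> str:
    """Apply the thresholds: under 3 stays on medium, 8 or more moves up."""
    if rate < 3:
        return "medium"
    if rate >= 8:
        return "large-v3"
    return "undecided"  # assumption: the middle range needs more data
```

For example, two sessions of 400 and 600 words with 2 and 3 corrections average 2.5 corrections per 500 words, which keeps you on medium.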

Picking the Right Model for Your Workflow (A Decision Framework)

Model selection shouldn't be abstract. Here's a concrete framework based on three variables: what you dictate, what hardware you have, and how much latency you can tolerate.

Scenario 1: Writer Drafting on a MacBook Air

You write blog posts, emails, and documentation. Your vocabulary is standard English. You dictate in a quiet home office on an M1 or M2 MacBook Air with 8-16GB RAM.

Recommended model: Medium. You'll get 94%+ accuracy at sub-real-time speeds. The transcription will be ready before you finish your thought. Running large-v3 on 8GB hardware will cause memory pressure and actually slow down your entire machine, not just the transcription.

Scenario 2: Physician Dictating Clinical Notes

You use medical terminology constantly. Drug names, anatomical terms, and procedure codes need to be correct. You dictate on a well-equipped workstation.

Recommended model: Large-v3. The 11% accuracy improvement on medical terminology means fewer corrections on high-stakes content. Pair it with post-processing that includes a medical vocabulary lookup, and you'll hit 97%+ accuracy on clinical notes.

Scenario 3: Developer Writing Documentation

You dictate code comments, API documentation, and technical specs. Variable names and technical terms are common, but they follow predictable patterns.

Recommended model: Medium with post-processing. Medium handles the narrative portions of technical writing well. For specific technical terms (function names, API endpoints), add a custom vocabulary list for post-processing corrections. This gives you medium's speed with most of large-v3's accuracy for your domain.
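A custom vocabulary pass can be as simple as a lookup table of common misrecognitions applied after transcription. The sketch below is one minimal way to do it; the vocabulary entries are hypothetical examples, not terms taken from the benchmark.

```python
import re

# Hypothetical entries: what the model tends to write -> what you meant.
VOCABULARY = {
    "get user by id": "getUserById",
    "jason": "JSON",
    "o auth": "OAuth",
}


def apply_vocabulary(text: str, vocabulary: dict[str, str]) -> str:
    """Replace known misrecognitions with the intended technical terms."""
    # Longest phrases first, so multi-word entries win over shorter overlaps.
    for spoken in sorted(vocabulary, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(spoken) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash interpretation in the term.
        text = pattern.sub(lambda _match, term=vocabulary[spoken]: term, text)
    return text
```

Running it on "Call get user by ID and parse the Jason response" yields "Call getUserById and parse the JSON response". Whole-word matching keeps the pass from rewriting substrings inside unrelated words, though a blunt table like this will still misfire on a colleague actually named Jason.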

Scenario 4: Multilingual Professional

You switch between English and another language during dictation, or you have a strong accent that smaller models struggle with.

Recommended model: Large-v3. The 23% accuracy advantage on language switching is too significant to ignore. No amount of post-processing can fix a model that transcribes your Mandarin as garbled English.

Auditory supports switching between models on a per-session basis, so you don't have to commit to one choice permanently. Use medium for your morning emails and switch to large-v3 when you sit down for technical documentation.
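The four scenarios reduce to a small set of rules, sketched below. The field names and the 16GB threshold for the memory warning are assumptions for illustration; the underlying rules (specialized vocabulary, language switching, or noise favor large-v3, everything else favors medium) come from the scenarios above.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    specialized_vocabulary: bool = False       # medical, legal, similar jargon
    language_switching_or_accent: bool = False
    noisy_environment: bool = False
    predictable_technical_terms: bool = False  # function names, API endpoints
    ram_gb: int = 16


@dataclass
class Recommendation:
    model: str
    notes: list[str] = field(default_factory=list)


def recommend_model(workflow: Workflow) -> Recommendation:
    """Pick a model from the workflow traits, with caveats as notes."""
    needs_large = (
        workflow.specialized_vocabulary
        or workflow.language_switching_or_accent
        or workflow.noisy_environment
    )
    rec = Recommendation(model="large-v3" if needs_large else "medium")
    # Assumed threshold: at 16GB or less, large-v3 competes with other apps.
    if needs_large and workflow.ram_gb <= 16:
        rec.notes.append("close other apps: large-v3 can cause memory pressure")
    if workflow.predictable_technical_terms:
        rec.notes.append("add a custom vocabulary list for post-processing")
    return rec
```

A default `Workflow()` returns medium, while `Workflow(specialized_vocabulary=True, ram_gb=32)` returns large-v3 with no caveats.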

What to Do Monday Morning

Stop reading benchmarks and start collecting your own data. Here's your action plan for the next seven days.

Day 1: Switch to the medium model for all dictation. Don't change anything else about your workflow. At the end of each session, count the number of errors you need to correct. Write that number down.

Days 2 through 5: Continue with medium. Keep tracking corrections per session. Also note the content type (email, notes, technical, casual) for each session.

Day 6: Review your numbers. Calculate your average corrections per session by content type. If casual content averages below 3 corrections per 500 words, medium is confirmed for that use case. If technical content averages above 8, you have a clear case for large-v3 on those tasks.

Day 7: Set up your two-model workflow. Use medium as your default. Switch to large-v3 only for the specific content types where your correction count justified it.
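The Day 6 review and Day 7 setup can be sketched as a single pass over your log. Each log entry here is assumed to be a (content type, word count, correction count) tuple, which is a hypothetical format; the threshold of 8 corrections per 500 words comes from the plan above.

```python
from collections import defaultdict


def plan_models(log: list[tuple[str, int, int]],
                switch_above: float = 8.0) -> dict[str, str]:
    """Map each content type to a model based on corrections per 500 words."""
    words = defaultdict(int)
    corrections = defaultdict(int)
    for content_type, word_count, correction_count in log:
        words[content_type] += word_count
        corrections[content_type] += correction_count

    plan = {}
    for content_type, total_words in words.items():
        if total_words == 0:
            continue  # nothing dictated for this type yet
        rate = corrections[content_type] * 500 / total_words
        # Medium is the default; switch only where the data justifies it.
        plan[content_type] = "large-v3" if rate > switch_above else "medium"
    return plan
```

With a hypothetical week of logs where emails average 2 corrections per 500 words and technical sessions average 9, the plan comes out as medium for email and large-v3 for technical content.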

The metric to track going forward is effective words per minute. Measure it once a week. If it's climbing, your model selection is working. If it plateaus or drops, reassess.

Remember where we started: I was running large-v3 for grocery lists because I assumed bigger was always better. After 847 dictation sessions and six weeks of data, I now use medium for about 80% of my dictation and large-v3 for the remaining 20%. My effective output is 40% higher than when I used large-v3 for everything. The best model isn't the biggest one. It's the one that matches your actual workflow.
