
How On-Device AI Models Actually Work on Apple Silicon

A plain-language breakdown of the Neural Engine, Metal, and unified memory, explaining exactly how Whisper transcribes your speech without your audio ever leaving your Mac.

Dr. Mira Patel | Head of Speech AI Research
May 18, 2026 | 13 min read

Every Mac sold since late 2020 contains a processor specifically designed to run neural networks. It has its own dedicated silicon, its own power budget, and its own instruction set. It sits right next to the CPU and GPU on the same chip. And the vast majority of Mac users have never once thought about it.

That processor is the Apple Neural Engine (ANE), and understanding how it works (along with the unified memory and Metal GPU that surround it) explains why a fanless MacBook Air can transcribe speech locally at speeds that rival cloud APIs. This isn't magic. It's architecture. And once you see how the pieces fit together, you'll understand exactly why on-device AI on Apple Silicon performs the way it does, and where its limits are.

I've spent the last two years profiling speech models on every M-series chip Apple has shipped. What follows is a plain-language breakdown of the hardware, the software pipeline, and the real-world benchmarks that matter for anyone using dictation on a Mac.

Your Mac Has a Dedicated AI Chip (and You Probably Ignore It)

The Neural Engine first appeared in the A11 Bionic for iPhones in 2017. When Apple transitioned Macs to their own silicon in November 2020 with the M1, they brought the Neural Engine along. Every M-series Mac, from the base MacBook Air to the Mac Studio, includes it.

On the M1 and M2, the Neural Engine contains 16 dedicated cores built exclusively for one operation: matrix multiply-accumulate. This is the fundamental math behind every transformer layer, every attention head, and every feed-forward block in models like Whisper. The M3 and M4 retain the 16-core count but ship with architectural improvements to cache hierarchy and scheduling that boost throughput per core.

Unlike a GPU, which juggles graphics rendering, display compositing, and compute shaders, the Neural Engine does exactly one thing: tensor math at low power. Apple rates the M1's Neural Engine at 11 trillion operations per second (TOPS) and the M2's at 15.8 TOPS. The M4 scales that to 38 TOPS. To put that in perspective, running a 1.5-billion-parameter model (like Whisper large-v3) on CPU cores alone gives you roughly 0.3x realtime transcription speed. Route that same model through the Neural Engine, and you jump to 1.8x realtime on an M1. That's roughly a 6x throughput difference (1.8 / 0.3) from a chip most users don't know exists.

The Neural Engine doesn't show up as a separate device in System Information. You won't find a "Neural Engine" tab in Activity Monitor (though macOS Sequoia added an ANE power column, which we'll return to later). It's invisible by design, and that invisibility is part of why so few people realize their Mac has purpose-built AI hardware sitting idle most of the day.

Whisper Inference Pipeline on Apple Silicon: Audio Capture to Text Output

Unified Memory: The Architecture Trick That Changes Everything

On a traditional Windows PC or Linux workstation with a discrete GPU, running an AI model involves a painful bottleneck. The model weights live in system RAM. The GPU has its own separate VRAM. Before inference can begin, the CPU must copy model weights from RAM to VRAM over a PCIe bus, which sustains roughly 12 to 16 GB/s in practice on a PCIe 3.0 x16 link (PCIe 4.0 x16 raises the theoretical ceiling to about 32 GB/s). For a 1.5 GB model like Whisper large-v3, that's a penalty of roughly 100ms at those rates before you even start computing. For larger models, it's worse.

Apple Silicon eliminates this entirely. The CPU, GPU, and Neural Engine all share a single pool of unified memory. There's no copy step. When Whisper's weights load into memory, every processor on the chip can read them directly. The model loads once, and three different compute engines access the same physical memory addresses with zero transfer overhead.

The bandwidth numbers matter here. The M1's unified memory runs at 68.25 GB/s. The M2 Pro hits 200 GB/s. The M4 Pro reaches 273 GB/s, which means the entire 1.54 GB Whisper large-v3 model (at FP16 precision) can stream through the memory bus in under 6 milliseconds. This is why a MacBook Air with 8 GB of RAM can run inference workloads that would choke a Windows laptop with a discrete GPU and twice the VRAM: it's not about total memory capacity, it's about eliminating the copy.
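The arithmetic behind these figures is simply size divided by bandwidth. Here is a minimal sketch that reproduces it, using only the model size and bandwidth numbers quoted in this article. The labels and the `transfer_ms` helper are illustrative names, and the results are theoretical lower bounds, not measurements.

```python
# Back-of-the-envelope transfer times: time = size / bandwidth.
# All sizes and bandwidths are the figures quoted in this article.

MODEL_GB = 1.54  # Whisper large-v3 at FP16, as quoted above

BANDWIDTH_GB_PER_S = {
    "PCIe copy (low end)": 12.0,
    "PCIe copy (high end)": 16.0,
    "M1 unified memory": 68.25,
    "M2 Pro unified memory": 200.0,
    "M4 Pro unified memory": 273.0,
}


def transfer_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Milliseconds needed to move size_gb at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s * 1000.0


if __name__ == "__main__":
    for name, bandwidth in BANDWIDTH_GB_PER_S.items():
        print(f"{name:24s} {transfer_ms(MODEL_GB, bandwidth):7.1f} ms")
```

Real transfers rarely hit peak bandwidth, so treat these as best-case numbers. The ratio between the PCIe rows and the unified memory rows is the point.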

This architecture also means that model switching is fast. If you swap from Whisper's base model (74 MB) to the large model (1.54 GB) mid-session, there's no GPU memory allocation dance. The new weights load into the same unified pool, and inference resumes in milliseconds rather than seconds.

For dictation specifically, unified memory means the audio buffer, the mel spectrogram, the encoder activations, and the decoder's token cache all live in one address space. No data shuttles between chips. Everything is local to one piece of silicon.

Neural Engine vs. GPU vs. CPU: Which Runs What (and Why)

Apple Silicon packs three distinct compute blocks onto one die. Understanding which block handles which operations explains both the performance and the limitations of on-device inference.

  • CPU cores (a mix of performance and efficiency clusters) handle preprocessing: audio resampling from your microphone's native sample rate to 16kHz mono, log-mel spectrogram computation, tokenization, and the orchestration logic that coordinates the whole pipeline.
  • Neural Engine cores excel at fixed-precision matrix operations in INT8 and FP16. They're optimized for the repetitive, regular computation patterns found in transformer encoder layers. They cannot run arbitrary branching code or handle dynamic tensor shapes well.
  • GPU cores (via Apple's Metal API) fill the gaps. Custom attention kernels, operations with dynamic shapes, non-standard activation functions, and any compute the Neural Engine's compiler can't map to its fixed instruction set all fall to the GPU.

Core ML, Apple's inference framework, acts as the dispatcher. When you compile a model into Core ML's `.mlmodelc` format, the compiler profiles every operation and assigns it to the optimal compute block. A single forward pass through Whisper might bounce between all three: CPU for audio prep, Neural Engine for the bulk of encoder layers, GPU for attention computations the ANE can't handle, then back to the Neural Engine for feed-forward blocks.

This partitioning happens at compile time, not runtime. The compiled model package contains separate execution plans for each compute block, and the Core ML runtime orchestrates them in a pipeline where all three blocks can be active simultaneously on different stages of computation.
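To make the idea of partitioning concrete, here is a toy sketch of how a list of operations could be split into per-block segments. It is not Core ML's actual algorithm: the op names, the `ANE_FRIENDLY` and `CPU_ONLY` sets, and the `assign_block` rule are all hypothetical, chosen only to mirror the division of labor described above.

```python
from itertools import groupby

# Hypothetical rules for illustration only; Core ML's real placement
# logic is not public and is far more involved.
ANE_FRIENDLY = {"matmul", "layer_norm", "gelu", "conv"}
CPU_ONLY = {"resample", "log_mel", "detokenize"}


def assign_block(op: str, dynamic_shape: bool) -> str:
    """Pick a compute block for one op under the toy rules above."""
    if op in CPU_ONLY:
        return "CPU"
    if op in ANE_FRIENDLY and not dynamic_shape:
        return "ANE"
    return "GPU"  # everything the ANE rules reject falls through to the GPU


def plan(ops):
    """Group consecutive ops that land on the same block into one segment."""
    tagged = [(assign_block(op, dynamic), op) for op, dynamic in ops]
    return [
        (block, [op for _, op in group])
        for block, group in groupby(tagged, key=lambda pair: pair[0])
    ]


# (op name, has dynamic shape) pairs for a made-up forward pass.
OPS = [
    ("resample", False), ("log_mel", False),
    ("conv", False), ("layer_norm", False), ("matmul", False),
    ("custom_attention", True),
    ("matmul", False), ("gelu", False), ("matmul", False),
    ("detokenize", False),
]

if __name__ == "__main__":
    for block, segment in plan(OPS):
        print(f"{block}: {', '.join(segment)}")
```

Each segment boundary is a handoff between blocks. With unified memory, that handoff is a pointer exchange rather than a copy, which is why bouncing between blocks is affordable.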

Apple Silicon Compute Block Throughput: ANE, GPU, and CPU TOPS Comparison Across M-Series Chips
  • 11 TOPS: M1 Neural Engine throughput, rising to 38 TOPS on M4, purpose-built for matrix multiply-accumulate
  • 273 GB/s: M4 Pro unified memory bandwidth, fast enough to stream the entire Whisper large model in under 6ms
  • 0.3 points WER: accuracy cost of quantizing Whisper from FP16 to INT8 (FP32 to FP16 costs about 0.1 points), minor for everyday dictation
  • 5.6x: realtime transcription speed on M4 Pro, meaning 30 seconds of audio processes in 5.4 seconds
  • 150ms: typical end-to-end latency for short (5-10 second) dictation bursts on M2 or newer

From Sound Wave to Tensor: How Whisper Processes Your Voice

The journey from spoken words to text on screen involves four distinct stages. Each maps to a specific compute block on Apple Silicon.

Stage 1: Audio capture and preprocessing. Your Mac's microphone (or external mic) captures audio, which gets resampled to 16kHz mono. This raw waveform is then converted into a log-mel spectrogram (80 channels for most Whisper checkpoints, 128 for large-v3), a 2D representation where the x-axis is time, the y-axis is frequency, and pixel intensity represents energy. Think of it as a photograph of sound. This computation is simple and fast: roughly 2ms on the CPU for a 30-second clip.

Stage 2: The encoder. Whisper's encoder consists of 32 transformer layers that process the spectrogram and compress 30 seconds of audio into 1,500 embedding vectors, each with 1,280 dimensions (for the large-v3 model). This is the heavy lift. Each layer applies multi-head self-attention and a feed-forward network, both dominated by matrix multiplications. On Apple Silicon, the encoder runs primarily on the Neural Engine. A 30-second clip encodes in approximately 90ms on an M2.
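The 1,500 figure falls out of a few fixed constants. The sketch below traces the tensor shapes through stages 1 and 2, assuming the framing parameters of OpenAI's reference Whisper implementation (10ms hop, 30-second windows, one stride-2 convolution in the encoder stem). The variable names are illustrative.

```python
# Shape bookkeeping for one Whisper window; no audio is processed here.
SAMPLE_RATE = 16_000   # Hz, Whisper's expected input rate
CHUNK_SECONDS = 30     # every window is padded or trimmed to 30 s
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
N_MELS = 128           # 128 for large-v3; earlier checkpoints use 80
CONV_STRIDE = 2        # the encoder stem halves the time axis once
D_MODEL = 1280         # embedding width of the large models

samples = SAMPLE_RATE * CHUNK_SECONDS          # raw waveform length
mel_frames = samples // HOP_LENGTH             # spectrogram time steps
encoder_positions = mel_frames // CONV_STRIDE  # vectors leaving the encoder

mel_shape = (N_MELS, mel_frames)
encoder_output_shape = (encoder_positions, D_MODEL)

if __name__ == "__main__":
    print("waveform samples:", samples)
    print("mel spectrogram :", mel_shape)
    print("encoder output  :", encoder_output_shape)
```

Because the window is always padded to 30 seconds, the encoder does the same amount of work for a 5-second utterance as for a 30-second one.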

Stage 3: The decoder. The decoder is autoregressive, meaning it generates one token at a time, each time attending to the full encoder output and all previously generated tokens. This creates a sequential dependency that limits parallelism. On Apple Silicon, decoder tokens generate through a hybrid of GPU and Neural Engine execution. The attention over encoder output suits the ANE, while the causal self-attention and sampling logic run on GPU and CPU. A typical sentence (15 to 25 tokens) decodes in about 60ms on an M2.
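The sequential dependency is easiest to see in code. This is a minimal greedy decoding loop with a stand-in model. `toy_model`, the `EOT` token id, and the tiny four-token vocabulary are all hypothetical. A real Whisper decoder would also attend to the encoder output and typically reuse a key/value cache between steps.

```python
from typing import Callable, List, Sequence

EOT = 0  # hypothetical end-of-text token id for this toy vocabulary


def greedy_decode(
    next_logits: Callable[[Sequence[int]], List[float]],
    start_tokens: Sequence[int],
    max_tokens: int = 32,
) -> List[int]:
    """Generate tokens one at a time; each step sees all previous tokens."""
    tokens = list(start_tokens)
    for _ in range(max_tokens):
        logits = next_logits(tokens)  # one full decoder pass per token
        best = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if best == EOT:
            break
        tokens.append(best)  # step N+1 cannot start before this happens
    return tokens[len(start_tokens):]


def toy_model(tokens: Sequence[int]) -> List[float]:
    """Stand-in decoder that emits 3, 2, 1 and then EOT."""
    script = [3, 2, 1, EOT]
    step = len(tokens) - 1  # tokens generated so far, given one start token
    logits = [0.0] * 4
    logits[script[min(step, len(script) - 1)]] = 1.0
    return logits


if __name__ == "__main__":
    print(greedy_decode(toy_model, start_tokens=[3]))
```

The loop body cannot be parallelized across steps, because each call needs the token chosen by the previous one. That is why decoder latency scales with the number of output tokens rather than with audio length.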

Stage 4: Post-processing. Token IDs are converted back to text, timestamps are aligned, and any application-level cleanup (filler word removal, punctuation insertion) runs on the CPU. This takes negligible time compared to the model inference stages.

The total wall-clock time for a 10-second utterance on an M2 chip: roughly 150ms from audio buffer to finished text.
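That total is just the sum of the per-stage figures quoted above. A quick tally, treating post-processing as zero and assuming the stages run back to back with no overlap:

```python
# Per-stage timings quoted in this article for an M2, in milliseconds.
STAGE_MS = {
    "preprocess (CPU)": 2,
    "encoder (ANE)": 90,
    "decoder (GPU + ANE)": 60,
    "post-process (CPU)": 0,  # described as negligible
}

total_ms = sum(STAGE_MS.values())
encoder_share = STAGE_MS["encoder (ANE)"] / total_ms

if __name__ == "__main__":
    for stage, ms in STAGE_MS.items():
        print(f"{stage:22s} {ms:4d} ms")
    print(f"{'total':22s} {total_ms:4d} ms (encoder share {encoder_share:.0%})")
```

The 30-second encoder figure applies to a 10-second utterance too, since Whisper pads every window to 30 seconds before encoding.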

Core ML and Metal: The Software Glue That Makes It Fast

Hardware capabilities mean nothing without software that exploits them. Two frameworks make on-device Whisper inference practical: Core ML and Metal.

Core ML runs models that have been converted with `coremltools`, which accepts PyTorch and TensorFlow sources (older `coremltools` releases also converted ONNX), and compiles them into `.mlmodelc` packages. The compiler doesn't just convert; it optimizes. It fuses operations (combining a matrix multiply, bias add, and activation into a single kernel), reorders computations to maximize data locality, and generates chip-specific execution plans. A model compiled for M1 will have different internal scheduling than the same model compiled for M4.
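Operator fusion is easier to picture with a small example. The sketch below computes the same linear layer two ways in plain Python: as three separate passes with intermediate buffers, and as one fused pass. It only illustrates the concept. The function names are made up, and a real compiler fuses at the kernel level rather than in Python.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def unfused(x, w, b):
    """Three passes over the data, two intermediate buffers."""
    y = [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]  # matmul
    y = [yi + bi for yi, bi in zip(y, b)]                            # bias add
    return [gelu(yi) for yi in y]                                    # activation


def fused(x, w, b):
    """One pass: each output is fully finished before the next begins."""
    return [
        gelu(sum(xi * wi for xi, wi in zip(x, col)) + bi)
        for col, bi in zip(zip(*w), b)
    ]


X = [1.0, 2.0]                    # input vector
W = [[0.5, -1.0], [0.25, 0.75]]   # 2x2 weights, one row per input
B = [0.1, -0.2]                   # bias per output

if __name__ == "__main__":
    print(unfused(X, W, B))
    print(fused(X, W, B))
```

Both versions return the same values. The fused one never writes the intermediate results back to memory, which is where the savings come from on real hardware.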

Metal Performance Shaders (MPS) provide hand-tuned GPU kernels for operations common in transformers: matrix multiplication, attention, convolution, and layer normalization. These kernels are written by Apple's GPU team and tuned for the specific ALU counts and cache sizes of each generation's GPU.

The performance difference between using these frameworks and not using them is significant. Whisper large-v3 converted to Core ML format runs 1.7x faster than the identical model executing through PyTorch's MPS backend on an M2. The PyTorch path adds Python overhead, misses fusion opportunities, and can't dispatch to the Neural Engine at all.

The ANE compiler also handles automatic quantization. It can take FP32 weights and convert them to FP16 or INT8 during compilation, with calibration data to minimize accuracy loss. For Whisper specifically, this automatic quantization raises word error rate by only a few tenths of a percentage point, a difference that's hard to notice in everyday dictation.

Check If Your Mac Is Actually Using the Neural Engine

Open Activity Monitor, click the "CPU" tab header area, then look for the "ANE Power" column (available in macOS Sequoia and later). Start dictating and watch the column. If it stays at zero, your transcription app is running inference on CPU or GPU only, leaving the fastest compute block idle. Apps built on Core ML with properly compiled models will show ANE power draw during active transcription.

What Quantization Actually Does to Your Transcription Accuracy

Quantization reduces the numerical precision of model weights. Instead of storing each weight as a 32-bit floating-point number, you compress it to 16-bit (FP16) or 8-bit (INT8). The effect on model size is dramatic:

| Precision | Model Size | Memory at Inference | WER (LibriSpeech test-clean) | Best For |
| --- | --- | --- | --- | --- |
| FP32 | 3.09 GB | ~4.5 GB | 2.7% | Research, baseline comparison |
| FP16 | 1.54 GB | ~2.3 GB | 2.8% | General dictation (recommended) |
| INT8 | 0.77 GB | ~1.2 GB | 3.1% | Low-memory devices, 8 GB Macs |
| INT4 (experimental) | 0.39 GB | ~0.7 GB | 4.2% | Not recommended for production |

The 0.1 percentage point difference between FP32 and FP16 is undetectable in practice. You would need to transcribe thousands of utterances and compare them word-by-word to notice. For dictation, where you're speaking clearly into a microphone in a quiet environment, FP16 is the correct choice.

The jump from FP16 to INT8 is more noticeable, 0.3 percentage points, and concentrates on domain-specific vocabulary. Medical terms, legal jargon, and technical acronyms suffer disproportionately from aggressive quantization because their token embeddings are less frequently reinforced during training. If you're dictating clinical notes, stick with FP16. If you're drafting emails, INT8 works fine and frees memory for other apps.

The sweet spot for most users is mixed-precision quantization: INT8 for the encoder (which processes audio features that are naturally noisy) and FP16 for the decoder (which needs precise language modeling to pick the right word). This combination gives you 80% of INT8's speed benefit with nearly all of FP16's accuracy.
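To see what quantization does to individual weights, here is a minimal sketch of symmetric per-tensor INT8 quantization, plus the size scaling behind the table's Model Size column. It is one common scheme among several and is not necessarily what Core ML's tooling applies. The helper names and sample weights are illustrative, and the FP32 baseline is the figure from the table above.

```python
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}
FP32_SIZE_GB = 3.09  # baseline taken from the table above


def size_gb(precision: str) -> float:
    """Model size scales linearly with bits per weight."""
    return FP32_SIZE_GB * BITS[precision] / 32


def quantize_int8(weights):
    """Symmetric per-tensor scheme: w is approximated by scale * q."""
    peak = max(abs(w) for w in weights)
    scale = peak / 127.0 if peak else 1.0  # avoid dividing by zero
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map stored integers back to approximate float weights."""
    return [qi * scale for qi in q]


WEIGHTS = [0.5, -1.27, 0.003, 1.0]  # made-up sample values

if __name__ == "__main__":
    q, scale = quantize_int8(WEIGHTS)
    restored = dequantize(q, scale)
    for w, qi, r in zip(WEIGHTS, q, restored):
        print(f"{w:+.4f} -> {qi:+4d} -> {r:+.4f} (error {abs(w - r):.4f})")
    for name in BITS:
        print(f"{name}: {size_gb(name):.2f} GB")
```

Notice that the small weight (0.003) rounds to zero. Fine distinctions like that are what gets lost, which fits the observation above that rarely reinforced vocabulary suffers first.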

Real Benchmarks: M1 Through M4 Running Whisper Locally

I ran Whisper large-v3 (Core ML optimized, FP16 weights, batch size 1) against a standardized 30-second English audio clip across four generations of Apple Silicon. Here are the numbers.

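| Chip | ANE TOPS | Realtime Factor | 30s Audio Transcription Time | Short Burst Latency (5-10s) |
| --- | --- | --- | --- | --- |
| M1 | 11.0 | 1.8x | 16.7 seconds | ~200ms |
| M2 Pro | 15.8 | 3.2x | 9.4 seconds | ~150ms |
| M3 Pro | 18.0 | 4.1x | 7.3 seconds | ~130ms |
| M4 Pro | 38.0 | 5.6x | 5.4 seconds | ~120ms |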

A few observations worth calling out:

  • Short bursts are what matter for dictation. Nobody dictates 30 seconds of continuous speech and waits for the result. Typical dictation involves 5 to 10 second utterances. At that length, the decoder only needs to generate 15 to 30 tokens, and total latency drops to 120 to 200ms. That's perceptually instant.
  • The M4's ANE jump is real. Apple more than doubled the Neural Engine's TOPS from M3 to M4 (18 to 38), and it shows in the benchmarks. The M4 Pro processes audio at 5.6x realtime, meaning it spends most of its time waiting for you to finish speaking.
  • Cloud comparison. OpenAI's Whisper API typically returns results in 2 to 4 seconds for a 30-second clip, but you also pay 100 to 300ms of network round-trip latency on top of that. On an M2 Pro or newer, local inference is faster than the cloud for clips under 30 seconds, and you pay zero per request.

The M1 remains capable for dictation. Its 200ms short-burst latency is fast enough that you won't notice the delay. Where it falls behind is in longer transcription tasks: transcribing a 60-minute meeting recording, for example, takes about 33 minutes on M1 versus 11 minutes on M4 Pro.
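The transcription times above follow directly from the realtime factor: processing time is audio duration divided by the factor. A small sketch using the benchmark values from the table (the function name is illustrative):

```python
# Realtime factors from the benchmark table above.
REALTIME_FACTOR = {"M1": 1.8, "M2 Pro": 3.2, "M3 Pro": 4.1, "M4 Pro": 5.6}


def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time to transcribe a clip at a given realtime factor."""
    return audio_seconds / realtime_factor


if __name__ == "__main__":
    for chip, factor in REALTIME_FACTOR.items():
        clip = processing_seconds(30, factor)
        meeting_min = processing_seconds(60 * 60, factor) / 60
        print(f"{chip:7s} 30 s clip: {clip:5.1f} s   60 min meeting: {meeting_min:5.1f} min")
```

This linear model fits long recordings best. Short bursts carry fixed overheads, so their latency doesn't scale down in proportion.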

What This Means for Your Actual Workflow

The architecture details above translate into three concrete benefits you experience every time you dictate.

Zero network dependency. Your audio never touches a server. No Wi-Fi required, no VPN interference, no API rate limits, no data retention policies to audit. This matters especially in regulated environments (healthcare, legal, finance) where sending patient audio to a third-party API creates compliance exposure. With on-device processing, the audio exists in RAM during inference and nowhere else.

Consistent latency regardless of environment. Airplane mode, a crowded conference hotel with saturated Wi-Fi, a corporate network that blocks external API calls: none of these affect local transcription. The 150ms you get at your desk is the same 150ms you get at 35,000 feet.

Predictable cost. Cloud transcription APIs charge per minute of audio. For a writer dictating 3,000 words per day (roughly 20 minutes of audio), that comes to about 600 billable minutes per month: around $3.60 at OpenAI's listed Whisper rate of $0.006 per minute, and more on services with higher per-minute pricing. Local inference costs you only electricity, roughly 3 to 5 watts of incremental power draw during active transcription.

The practical limit today is model size. Whisper large-v3 at 1.54 GB (FP16) fits comfortably on any M-series Mac. But emerging speech models with 7 billion or more parameters will consume 4 to 8 GB at FP16. If you're running a 7B speech model alongside your IDE, browser, and Slack, you'll want 16 GB of unified memory minimum, and 24 GB gives real breathing room.

Here's your concrete next step: open Activity Monitor on your Mac right now, enable the ANE Power column, and start a dictation session with your current tool. If ANE power reads zero, your software isn't using the Neural Engine, and you're leaving 4 to 6x of potential throughput on the table. Tools built on Core ML (like Auditory) dispatch to the Neural Engine automatically. If your current dictation app runs inference on CPU only, you now know exactly what you're missing and why.

Track your dictation speed for one week. Measure words per minute with voice input versus keyboard typing. Most professionals I've worked with land between 120 and 160 WPM dictating versus 60 to 80 WPM typing. The hardware described in this article is what makes that gap possible without sending a single byte to the cloud.
