04.04.2026 15:26
Author: Viacheslav Vasipenok

Ollama Just Got Blazing Fast on Macs: Full MLX Support Brings 2× Speedups and NVIDIA-Quality 4-Bit Inference


Ollama — the go-to open-source tool for running powerful LLMs locally — has officially switched to Apple’s MLX framework on Apple Silicon. The update (Ollama 0.19 preview) turns every Mac with sufficient unified memory into a significantly faster local AI machine.

The results speak for themselves. On an M5 Max, the new Qwen3.5-35B-A3B model quantized with NVIDIA's NVFP4 format decodes at 112 tokens per second.

Even higher numbers are possible with int4 quantization: up to 1851 tokens/s prefill and 134 tokens/s decode.

112 tokens per second on a 35-billion-parameter model is *fire*. That’s the kind of speed where a powerful coding agent can generate, review, and iterate on code changes faster than most developers can read them.


What Changed Under the Hood

1. Native MLX Backend

Ollama is now built directly on top of Apple’s open-source MLX framework. It takes full advantage of unified memory architecture and the new GPU Neural Accelerators on M5-series chips. No more fighting between CPU and GPU memory pools — everything just flies.

2. NVIDIA NVFP4 Quantization

For the first time, Ollama supports NVIDIA’s NVFP4 format. This 4-bit quantization delivers noticeably higher response quality than traditional 4-bit methods while slashing memory usage and bandwidth. The model feels almost full-precision in practice.
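To make the idea concrete, here is a toy Python sketch of block-scaled 4-bit quantization in the spirit of NVFP4. The real format stores FP4 (E2M1) values with an FP8 scale factor per 16-element block; this simplified version uses a plain float scale and an 8-value magnitude grid, purely as an illustration.

```python
# Toy sketch of block-scaled 4-bit quantization (NVFP4-style).
# Simplification: a plain float scale per block instead of FP8.

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(block):
    """Quantize one block of floats to (scale, 4-bit codes)."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # map the largest magnitude onto the grid maximum
    codes = []
    for x in block:
        mag = abs(x) / scale
        # index of the nearest representable FP4 magnitude
        q = min(range(len(FP4_GRID)), key=lambda i: abs(FP4_GRID[i] - mag))
        codes.append((q, x < 0))  # (3-bit magnitude index, sign bit)
    return scale, codes

def dequantize_block(scale, codes):
    return [-FP4_GRID[q] * scale if neg else FP4_GRID[q] * scale
            for q, neg in codes]

weights = [0.12, -0.95, 0.33, 2.4, -0.07, 1.1, -3.2, 0.0]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each small block gets its own scale, one outlier weight only degrades its 16 neighbors instead of the whole tensor; that locality is a big part of why NVFP4 holds up better than coarse-grained 4-bit schemes.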

3. Smart KV Cache Reuse

Cache snapshots are now intelligently stored and reused across conversations. Shared prompts (common in agentic workflows or coding sessions) hit the cache far more often, dramatically cutting down on repeated prompt processing.
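The mechanism can be sketched in a few lines. Note this is a hypothetical illustration: Ollama's actual cache stores transformer KV tensors keyed by token prefixes, not strings, and the class and method names below are invented for clarity.

```python
# Toy sketch of prompt-prefix cache reuse (illustrative names only).

import hashlib

class PrefixCache:
    def __init__(self):
        self._snapshots = {}  # prefix hash -> simulated KV snapshot

    def _key(self, tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def store(self, tokens):
        # snapshot every prefix length, so partial matches also hit
        for i in range(1, len(tokens) + 1):
            self._snapshots[self._key(tokens[:i])] = i

    def longest_prefix(self, tokens):
        """Return how many leading tokens can be skipped via the cache."""
        for i in range(len(tokens), 0, -1):
            if self._key(tokens[:i]) in self._snapshots:
                return i
        return 0

cache = PrefixCache()
cache.store("you are a coding agent . task one".split())
hit = cache.longest_prefix("you are a coding agent . task two".split())
# only the tokens after the shared prefix need fresh prefill
```

In an agentic coding session the system prompt and conversation history form a long shared prefix, so each new turn only pays prefill cost for the newly appended tokens.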

4. One-Command Model Launch

Starting the right model for a specific task is now dead simple.

Example:
```bash
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
```

Currently Available

The spotlight is on Qwen3.5-35B-A3B (tagged `qwen3.5:35b-a3b-coding-nvfp4`), a sparse Mixture-of-Experts model heavily optimized for coding and agentic tasks. It requires a Mac with more than 32 GB of unified memory to run comfortably.
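Once the model is running, you can talk to it programmatically through Ollama's local REST API (`/api/generate` on the default port 11434). A minimal client sketch, using only the standard library and the model tag from this article; the helper names are our own:

```python
# Minimal client for Ollama's local /api/generate endpoint.
# Assumes an Ollama server is running and the model has been pulled.

import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="qwen3.5:35b-a3b-coding-nvfp4"):
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of a token stream
    }).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )

def generate(prompt):
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["response"]

# generate("Write a binary search in Go.")  # needs a running Ollama server
```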

More models and easier import of custom fine-tunes are already in the works.


Why This Matters

Local LLMs on Macs have always been convenient. Now they’re genuinely fast — fast enough to power serious daily workflows: instant code assistance, personal agents, research tools, or quick prototyping.

Whether you’re running Claude-style coding agents, OpenClaw, or your own fine-tuned models, the experience just jumped a generation ahead. And because it’s all local, private, and offline, you keep full control of your data.

Ready to try it? 
Download Ollama 0.19 preview from ollama.com/download and fire up the new Qwen3.5 coding model.

Mac users just got one of the best local LLM upgrades of 2026.  
The gap between “runs locally” and “feels like a cloud supercomputer” has never been smaller.

