github.com
March 22, 2026
Summary
Flash-Moe is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model with roughly 17 billion parameters active per token, on a MacBook Pro with 48 GB of RAM at over 4.4 tokens per second. The 209 GB model streams its weights from SSD through a custom Metal compute pipeline, without relying on Python or other frameworks.
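The repository's actual implementation is not shown in the summary; as a rough illustration of how SSD expert streaming can work for an MoE model, here is a minimal C sketch of a per-token expert loader with a small LRU cache. All names and sizes (NUM_EXPERTS, EXPERT_BYTES, CACHE_SLOTS, the file layout, model.bin) are assumptions for illustration, not Flash-Moe's real layout, and the Metal dispatch is elided.

```c
/* Hypothetical sketch of SSD expert streaming for MoE inference.
 * Names, sizes, and file layout are illustrative, not from Flash-Moe. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define EXPERT_BYTES (64u << 20)  /* bytes per quantized expert (assumed)       */
#define CACHE_SLOTS  16           /* experts kept resident in RAM (assumed)     */

typedef struct {
    int      expert_id;           /* which expert occupies this slot, -1 if empty */
    uint64_t last_used;           /* logical clock for LRU eviction               */
    uint8_t *weights;             /* EXPERT_BYTES of quantized weights            */
} slot_t;

static slot_t   cache[CACHE_SLOTS];
static uint64_t clock_now = 0;

/* Return a pointer to the expert's weights, streaming them from SSD on a miss. */
static uint8_t *load_expert(int fd, off_t layer_base, int expert_id) {
    clock_now++;

    /* Cache hit: reuse the resident copy. */
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].expert_id == expert_id) {
            cache[i].last_used = clock_now;
            return cache[i].weights;
        }
    }

    /* Cache miss: evict the least recently used slot. */
    int victim = 0;
    for (int i = 1; i < CACHE_SLOTS; i++)
        if (cache[i].last_used < cache[victim].last_used) victim = i;
    if (!cache[victim].weights)
        cache[victim].weights = malloc(EXPERT_BYTES);

    /* Read only this expert's weight blob from its offset in the model file. */
    off_t off = layer_base + (off_t)expert_id * EXPERT_BYTES;
    ssize_t n = pread(fd, cache[victim].weights, EXPERT_BYTES, off);
    if (n != (ssize_t)EXPERT_BYTES) { perror("pread"); exit(1); }

    cache[victim].expert_id = expert_id;
    cache[victim].last_used = clock_now;
    return cache[victim].weights;
}

int main(void) {
    for (int i = 0; i < CACHE_SLOTS; i++) cache[i].expert_id = -1;

    int fd = open("model.bin", O_RDONLY);  /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    /* Per token, the router picks top-k experts; only those are read from SSD. */
    int topk[4] = {3, 17, 42, 99};         /* example router output */
    for (int k = 0; k < 4; k++) {
        uint8_t *w = load_expert(fd, /*layer_base=*/0, topk[k]);
        (void)w;  /* dispatch w to the Metal compute pipeline here */
    }
    close(fd);
    return 0;
}
```

A real engine of this kind would likely keep the dense (non-expert) layers and the hottest experts resident in RAM and overlap reads with compute (via mmap or asynchronous I/O), but the hit-or-stream loop above is the essential pattern the summary describes.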