github.com
March 22, 2026
Summary
Flash-Moe is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB of RAM at more than 4.4 tokens per second. The 209GB model is streamed from SSD through a custom Metal compute pipeline, with no dependency on Python or other frameworks.
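The figures in the summary can be put side by side with a little arithmetic. The sketch below is only a back-of-the-envelope check, not part of Flash-Moe itself (which uses no Python). It assumes that GB means 10^9 bytes and that the "A17B" suffix in the model name means roughly 17 billion parameters are active per token. The function name `footprint` is made up for illustration.

```python
# Figures quoted in the summary (GB taken as 10**9 bytes: an assumption).
MODEL_BYTES = 209e9     # on-disk model size
RAM_BYTES = 48e9        # MacBook Pro memory
TOTAL_PARAMS = 397e9    # total parameter count

# Assumption: "A17B" in the model name means ~17B parameters active per token.
ACTIVE_PARAMS = 17e9


def footprint():
    """Derive rough ratios from the quoted sizes; illustrative only."""
    bytes_per_param = MODEL_BYTES / TOTAL_PARAMS
    return {
        # Average storage cost per parameter across the whole file.
        "bits_per_param": bytes_per_param * 8,
        # How many times larger the model is than RAM.
        "model_to_ram_ratio": MODEL_BYTES / RAM_BYTES,
        # Share of parameters used for any single token.
        "active_fraction": ACTIVE_PARAMS / TOTAL_PARAMS,
        # Upper bound on weight bytes touched per token, in GB.
        "active_gb_per_token": ACTIVE_PARAMS * bytes_per_param / 1e9,
    }


if __name__ == "__main__":
    for name, value in footprint().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the weights average a little over 4 bits per parameter, the model is more than four times larger than RAM, and only about 4% of the parameters are needed for any one token. That sparsity is what makes streaming weights from SSD plausible. The per-parameter figure is an average over the file and says nothing definite about the quantization scheme the project uses.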
