github.com
March 22, 2026
6 min read
Summary
Flash-Moe is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB of RAM at over 4.4 tokens per second. The 209GB model streams from SSD through a custom Metal compute pipeline, with no dependency on Python or other ML frameworks.