Themata.AI


Tags: #llms #developer-tools #ai-inference #macbook-pro

Flash-MoE: Running a 397B Parameter Model on a Laptop

GitHub - danveloper/flash-moe: Running a big model on a small laptop

github.com

March 22, 2026

6 min read

Summary

Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB of RAM at over 4.4 tokens per second. The 209GB model streams from SSD through a custom Metal compute pipeline, with no reliance on Python or other frameworks.

Key Takeaways

  • Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, runs on a MacBook Pro with 48GB of RAM at over 4.4 tokens per second using a custom C/Metal inference engine.
  • The 209GB of model weights stream from SSD; the engine is written only in C, Objective-C, and hand-tuned Metal shaders, with no Python or framework dependencies.
  • The architecture has 60 transformer layers with 512 experts per layer, of which K=4 experts are activated per token.
  • Expert data caching relies on the OS page cache and achieves a 71% hit rate, outperforming custom caching approaches.

Community Sentiment

Mixed

Positives

  • Running the 397B parameter Qwen 3.5 model on consumer devices is now feasible, showcasing advancements in model quantization and accessibility.
  • Achieving an 87.86% score on the MMLU benchmark indicates that the model performs well even on limited hardware, which is promising for broader AI applications.
  • The success of running the model on an M1 Ultra with substantial context length highlights the potential for high-performance AI on consumer-grade devices.

Concerns

  • Reducing the number of experts per token to fit the model on consumer hardware may significantly degrade performance, raising concerns about the trade-offs in quality.
  • The reliance on 2-bit quantization for running large models could lead to inadequate performance for real-world applications, limiting its practical usability.

Read original article

Source

github.com

Published

March 22, 2026

Reading Time

6 minutes

Relevance Score

65/100

