Themata.AI

© 2026 Themata.AI • All Rights Reserved

Tags: #llms #developer-tools #apple-silicon #model-optimization

Run a 1T-parameter model on a 32 GB Mac by streaming tensors from NVMe

GitHub - t8/hypura: Run models too big for your Mac's memory

github.com

March 24, 2026

6 min read

🔥🔥🔥🔥🔥

59/100

Summary

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that lets users run models larger than their Mac's physical memory. It distributes model tensors across GPU, RAM, and NVMe storage according to access patterns and hardware capabilities, avoiding the out-of-memory crashes that naive loading causes.

Key Takeaways

  • Hypura is a storage-tier-aware LLM inference scheduler designed for Apple Silicon that allows models exceeding physical memory to run without crashing the system.
  • The system optimizes tensor placement across GPU, RAM, and NVMe based on access patterns, achieving significant I/O reduction and enabling efficient model execution.
  • Hypura supports multiple inference modes, including full-resident, expert-streaming, and dense FFN-streaming, automatically selecting the best mode based on model size and available hardware resources.
  • Benchmarks show that Hypura can run large models like Mixtral 8x7B and Llama 70B on a 32 GB Mac Mini, which would otherwise crash under naive loading methods.
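The automatic mode selection described above can be sketched as a simple heuristic. This is a hypothetical illustration, not Hypura's actual API: the function name, parameters, and thresholds are assumptions based only on the three modes named in the takeaways.

```python
# Hypothetical sketch of storage-tier-aware mode selection.
# Names and logic are illustrative, not taken from the Hypura codebase.

def select_mode(model_bytes: int, ram_bytes: int, is_moe: bool) -> str:
    """Pick an inference mode from model size and available memory."""
    if model_bytes <= ram_bytes:
        # Everything fits: keep all tensors resident in RAM/GPU memory.
        return "full-resident"
    if is_moe:
        # MoE model: keep hot experts resident, stream cold ones from NVMe.
        return "expert-streaming"
    # Dense model too big for RAM: stream FFN weights layer by layer.
    return "dense-ffn-streaming"

GiB = 1024 ** 3
# e.g. a ~90 GB MoE checkpoint on a 32 GB Mac Mini:
print(select_mode(90 * GiB, 32 * GiB, is_moe=True))
```

The key design point is that the decision depends on both model architecture (MoE vs. dense) and the hardware's memory ceiling, which is why the same scheduler can handle Mixtral 8x7B and Llama 70B on the same 32 GB machine.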

Community Sentiment

Mixed

Positives

  • The project demonstrates an innovative approach to utilizing NVMe storage, effectively extending RAM capabilities for running large models, which could enhance accessibility for users with limited hardware.
  • Access-pattern-aware tensor placement could make better use of scarce memory during inference, keeping hot weights resident while relegating cold ones to slower tiers.

Concerns

  • Concerns about the potential stress on NVMe drives during intensive model generation highlight the risks to hardware longevity, which could deter users from adopting this approach.
  • The overhead associated with mmap in similar designs raises questions about the efficiency of this method, suggesting it may not perform as well as hoped in practice.
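The mmap overhead the commenters refer to comes from demand paging: the OS reads tensor bytes from NVMe on first access via a page fault, and may evict them again under memory pressure. A minimal, self-contained illustration of that access pattern (the file name and tensor layout are invented for this example):

```python
# Minimal illustration of mmap-based weight access: the OS pages bytes in
# from storage on first touch. Repeated faults during token-by-token
# generation are where the overhead critics mention can accumulate.
import mmap
import os
import struct

path = "weights.bin"
with open(path, "wb") as f:
    # Stand-in "weight file": four little-endian float32 values.
    f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing the map triggers a page fault + storage read the first time;
    # later touches hit the page cache until the OS evicts the page.
    tensor = struct.unpack("<4f", mm[:16])
    mm.close()
os.remove(path)
print(tensor)  # (1.0, 2.0, 3.0, 4.0)
```

Whether a scheduler beats plain mmap in practice depends on how well its explicit prefetching and placement match real access patterns, which is exactly what the benchmarks claim and the commenters question.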

Related Articles

GitHub - AlexsJones/llmfit: Hundreds models & providers. One command to find what runs on your hardware.

Right-sizes LLM models to your system's RAM, CPU, and GPU

Mar 1, 2026

GitHub - danveloper/flash-moe: Running a big model on a small laptop

Flash-MoE: Running a 397B Parameter Model on a Laptop

Mar 22, 2026

GitHub - SharpAI/SwiftLM: ⚡ Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, + iOS iPhone app.

TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS

Apr 1, 2026

GitHub - TrevorS/voxtral-mini-realtime-rs

Rust implementation of Mistral's Voxtral Mini 4B Realtime runs in your browser

Feb 10, 2026

GitHub - Luce-Org/lucebox-hub: Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware.

We got 207 tok/s with Qwen3.5-27B on an RTX 3090

Apr 20, 2026