Themata.AI

© 2026 Themata.AI • All Rights Reserved

Tags: #llms #developer-tools #apple-silicon #model-optimization

Run a 1T-parameter model on a 32 GB Mac by streaming tensors from NVMe

GitHub - t8/hypura: Run models too big for your Mac's memory

github.com

March 24, 2026

6 min read

🔥🔥🔥🔥🔥

59/100

Summary

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that lets users run models larger than their Mac's physical memory. It distributes model tensors across GPU, RAM, and NVMe storage according to access patterns and hardware capabilities, avoiding the out-of-memory crashes that naive loading causes.

Key Takeaways

  • Hypura is a storage-tier-aware LLM inference scheduler designed for Apple Silicon that allows models exceeding physical memory to run without crashing the system.
  • The system optimizes tensor placement across GPU, RAM, and NVMe based on access patterns, achieving significant I/O reduction and enabling efficient model execution.
  • Hypura supports multiple inference modes, including full-resident, expert-streaming, and dense FFN-streaming, automatically selecting the best mode based on model size and available hardware resources.
  • Benchmarks show that Hypura can run large models like Mixtral 8x7B and Llama 70B on a 32 GB Mac Mini, which would otherwise crash under naive loading methods.
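The automatic mode selection described above can be sketched as a simple heuristic. This is a hypothetical illustration, not Hypura's actual API: the function name, parameters, and thresholds are assumptions based only on the three modes named in the takeaways.

```python
# Hypothetical sketch of storage-tier-aware mode selection.
# Names and logic are illustrative, not taken from the Hypura codebase.

def select_mode(model_bytes: int, ram_bytes: int, is_moe: bool) -> str:
    """Pick an inference mode from model size and available memory."""
    if model_bytes <= ram_bytes:
        # Everything fits: keep all tensors resident in RAM/GPU memory.
        return "full-resident"
    if is_moe:
        # MoE model: keep hot experts resident, stream cold ones from NVMe.
        return "expert-streaming"
    # Dense model too big for RAM: stream FFN weights layer by layer.
    return "dense-ffn-streaming"

GiB = 1024 ** 3
# e.g. a ~90 GB MoE checkpoint on a 32 GB Mac Mini:
print(select_mode(90 * GiB, 32 * GiB, is_moe=True))
```

The key design point is that the decision depends on both model architecture (MoE vs. dense) and the hardware's memory ceiling, which is why the same scheduler can handle Mixtral 8x7B and Llama 70B on the same 32 GB machine.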

Community Sentiment

Mixed

Positives

  • The project demonstrates an innovative approach to utilizing NVMe storage, effectively extending RAM capabilities for running large models, which could enhance accessibility for users with limited hardware.
  • Access-pattern-aware tensor placement could make better use of scarce memory during inference, keeping hot weights resident while relegating cold ones to slower tiers.

Concerns

  • Concerns about the potential stress on NVMe drives during intensive model generation highlight the risks to hardware longevity, which could deter users from adopting this approach.
  • The overhead associated with mmap in similar designs raises questions about the efficiency of this method, suggesting it may not perform as well as hoped in practice.
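The mmap overhead the commenters refer to comes from demand paging: the OS reads tensor bytes from NVMe on first access via a page fault, and may evict them again under memory pressure. A minimal, self-contained illustration of that access pattern (the file name and tensor layout are invented for this example):

```python
# Minimal illustration of mmap-based weight access: the OS pages bytes in
# from storage on first touch. Repeated faults during token-by-token
# generation are where the overhead critics mention can accumulate.
import mmap
import os
import struct

path = "weights.bin"
with open(path, "wb") as f:
    # Stand-in "weight file": four little-endian float32 values.
    f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing the map triggers a page fault + storage read the first time;
    # later touches hit the page cache until the OS evicts the page.
    tensor = struct.unpack("<4f", mm[:16])
    mm.close()
os.remove(path)
print(tensor)  # (1.0, 2.0, 3.0, 4.0)
```

Whether a scheduler beats plain mmap in practice depends on how well its explicit prefetching and placement match real access patterns, which is exactly what the benchmarks claim and the commenters question.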

Related Articles

GitHub - AlexsJones/llmfit: Hundreds models & providers. One command to find what runs on your hardware.

Right-sizes LLM models to your system's RAM, CPU, and GPU

Mar 1, 2026

GitHub - danveloper/flash-moe: Running a big model on a small laptop

Flash-MoE: Running a 397B Parameter Model on a Laptop

Mar 22, 2026

GitHub - SharpAI/SwiftLM: ⚡ Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, + iOS iPhone app.

TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS

Apr 1, 2026

GitHub - TrevorS/voxtral-mini-realtime-rs

Rust implementation of Mistral's Voxtral Mini 4B Realtime runs in your browser

Feb 10, 2026

GitHub - Luce-Org/lucebox-hub: Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware.

We got 207 tok/s with Qwen3.5-27B on an RTX 3090

Apr 20, 2026