Themata.AI

© 2026 Themata.AI • All Rights Reserved

Tags: llms, developer-tools, optimization, cuda

We got 207 tok/s with Qwen3.5-27B on an RTX 3090

GitHub - Luce-Org/lucebox-hub: Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware.

github.com

April 20, 2026

5 min read

🔥🔥🔥🔥🔥

54/100

Summary

Lucebox is an optimization hub for hand-tuned LLM inference, built for specific consumer hardware rather than generic targets. It ships kernels, speculative decoding, and quantization tailored to each target, including the first megakernel for hybrid DeltaNet/Attention LLMs, which reaches 1.87 tokens per joule on a 2020 GPU.
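For readers unfamiliar with the tokens-per-joule metric quoted above, it is simply generated tokens divided by energy consumed. A minimal sketch, with illustrative numbers that are not taken from the repo:

```python
# Sketch: how a tokens-per-joule figure such as the quoted 1.87 tok/J can be
# derived from a benchmark run. All numbers below are illustrative assumptions.
def tokens_per_joule(tokens_generated: int, avg_power_watts: float,
                     elapsed_seconds: float) -> float:
    """Energy efficiency = tokens / (watts * seconds) = tokens per joule."""
    energy_joules = avg_power_watts * elapsed_seconds
    return tokens_generated / energy_joules

# e.g. 2805 tokens over 10 s at an average draw of 150 W works out to 1.87 tok/J
print(round(tokens_per_joule(2805, 150.0, 10.0), 2))
```

In practice the average power draw would come from a sampler such as the GPU vendor's management interface, polled during the decode loop.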

Key Takeaways

  • Lucebox hand-tunes LLM inference for specific consumer hardware, improving performance without waiting for better silicon.
  • The first megakernel for hybrid DeltaNet/Attention LLMs achieves 1.87 tok/J on a 2020 GPU, matching the throughput of Apple's latest silicon.
  • DFlash speculative decoding on Qwen3.5-27B delivers a 3.43× speedup over plain autoregressive decoding on the HumanEval benchmark.
  • Each release is self-contained, with benchmarks and writeups documenting the optimizations.
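To see where speedups like the reported 3.43× come from, consider the standard speculative-decoding analysis: a cheap draft model proposes k tokens, the target model verifies them in one forward pass, and accepted tokens amortise the target's cost. The sketch below uses the textbook independent-acceptance approximation; the acceptance rate and draft-cost ratio are illustrative assumptions, not DFlash's actual numbers.

```python
# Hedged sketch of speculative-decoding speedup, under the common simplifying
# assumption that each of k drafted tokens is accepted independently with
# probability alpha.
def expected_tokens_per_target_pass(k: int, alpha: float) -> float:
    """Expected tokens emitted per target-model forward pass:
    (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(k: int, alpha: float, draft_cost_ratio: float) -> float:
    """Speedup over plain autoregressive decoding, assuming each drafted
    token costs draft_cost_ratio of one target forward pass."""
    return expected_tokens_per_target_pass(k, alpha) / (1 + k * draft_cost_ratio)

# With k=8 drafted tokens, 80% acceptance, and a draft 20x cheaper than the
# target, the speedup lands in the ~3x range seen in reports like this one.
print(speedup(8, 0.8, 0.05))
```

Note that verified speculative decoding with rejection sampling preserves the target model's output distribution; the speedup depends mainly on how often the draft's guesses are accepted.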

Community Sentiment

Mixed

Positives

  • Running Qwen3.5-27B on an RTX 3090 demonstrates the potential for local AI deployment, allowing users to leverage existing hardware without vendor lock-in.
  • The ability to run models on Vulkan and Apple's hardware indicates a growing flexibility in AI infrastructure, which could enhance accessibility for developers.
  • The reported peak performance of 207.6 tok/s showcases significant advancements in model efficiency, potentially enabling faster inference for real-time applications.
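Throughput figures like the 207.6 tok/s cited above are typically measured by counting generated tokens over a wall-clock window. A minimal harness sketch; `generate_stream` is a hypothetical stand-in for whatever streaming inference API is being benchmarked:

```python
import time

# Sketch of a decode-throughput measurement: tokens generated per wall-clock
# second. generate_stream is an assumed placeholder for any streaming API
# that yields one token at a time.
def measure_tok_per_s(generate_stream, prompt: str, max_tokens: int) -> float:
    start = time.perf_counter()
    count = 0
    for _token in generate_stream(prompt, max_tokens):
        count += 1
    return count / (time.perf_counter() - start)
```

Careful benchmarks usually exclude prompt processing (prefill) from the window and report decode throughput separately, since the two phases have very different performance characteristics.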

Concerns

  • The claim of achieving 207 tok/s is misleading, as it relies on speculative decoding, which may compromise output quality compared to traditional methods.
  • The focus on a specific GPU like the RTX 3090 raises concerns about accessibility, as it may not represent the average developer's hardware capabilities.
  • The implementation appears to prioritize speed over quality, which could lead to inferior outputs and undermine the reliability of the model's performance.

Related Articles

How to Run Qwen 3.5 Locally | Unsloth Documentation

Mar 7, 2026

LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?

Mar 24, 2026

TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS (SharpAI/SwiftLM)

Apr 1, 2026

Flash-MoE: Running a 397B Parameter Model on a Laptop (danveloper/flash-moe)

Mar 22, 2026

Unsloth Dynamic 2.0 GGUFs | Unsloth Documentation

Feb 28, 2026