Themata.AI

© 2026 Themata.AI • All Rights Reserved

Tags: llms, developer-tools, optimization, cuda

We got 207 tok/s with Qwen3.5-27B on an RTX 3090

GitHub - Luce-Org/lucebox-hub: Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware.

github.com

April 20, 2026

5 min read

🔥🔥🔥🔥🔥

54/100

Summary

Lucebox is an optimization hub for hand-tuned LLM inference, built for specific consumer hardware rather than generic targets. It ships kernels, speculative decoding, and quantization tailored to each target, including the first megakernel for hybrid DeltaNet/Attention LLMs, which reaches 1.87 tokens per joule on a 2020 GPU.
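For readers unfamiliar with the tokens-per-joule metric quoted above, it is simply generated tokens divided by energy consumed. A minimal sketch, with illustrative numbers that are not taken from the repo:

```python
# Sketch: how a tokens-per-joule figure such as the quoted 1.87 tok/J can be
# derived from a benchmark run. All numbers below are illustrative assumptions.
def tokens_per_joule(tokens_generated: int, avg_power_watts: float,
                     elapsed_seconds: float) -> float:
    """Energy efficiency = tokens / (watts * seconds) = tokens per joule."""
    energy_joules = avg_power_watts * elapsed_seconds
    return tokens_generated / energy_joules

# e.g. 2805 tokens over 10 s at an average draw of 150 W works out to 1.87 tok/J
print(round(tokens_per_joule(2805, 150.0, 10.0), 2))
```

In practice the average power draw would come from a sampler such as the GPU vendor's management interface, polled during the decode loop.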

Key Takeaways

  • Lucebox hand-tunes LLM inference for specific consumer hardware, improving performance without waiting for better silicon.
  • The first megakernel for hybrid DeltaNet/Attention LLMs achieves 1.87 tok/J on a 2020 GPU, matching the throughput of Apple's latest silicon.
  • DFlash speculative decoding on Qwen3.5-27B delivers a 3.43× speedup over plain autoregressive decoding on the HumanEval benchmark.
  • Each release is self-contained, with benchmarks and writeups documenting the optimizations.
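To see where speedups like the reported 3.43× come from, consider the standard speculative-decoding analysis: a cheap draft model proposes k tokens, the target model verifies them in one forward pass, and accepted tokens amortise the target's cost. The sketch below uses the textbook independent-acceptance approximation; the acceptance rate and draft-cost ratio are illustrative assumptions, not DFlash's actual numbers.

```python
# Hedged sketch of speculative-decoding speedup, under the common simplifying
# assumption that each of k drafted tokens is accepted independently with
# probability alpha.
def expected_tokens_per_target_pass(k: int, alpha: float) -> float:
    """Expected tokens emitted per target-model forward pass:
    (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(k: int, alpha: float, draft_cost_ratio: float) -> float:
    """Speedup over plain autoregressive decoding, assuming each drafted
    token costs draft_cost_ratio of one target forward pass."""
    return expected_tokens_per_target_pass(k, alpha) / (1 + k * draft_cost_ratio)

# With k=8 drafted tokens, 80% acceptance, and a draft 20x cheaper than the
# target, the speedup lands in the ~3x range seen in reports like this one.
print(speedup(8, 0.8, 0.05))
```

Note that verified speculative decoding with rejection sampling preserves the target model's output distribution; the speedup depends mainly on how often the draft's guesses are accepted.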

Community Sentiment

Mixed

Positives

  • Running Qwen3.5-27B on an RTX 3090 demonstrates the potential for local AI deployment, allowing users to leverage existing hardware without vendor lock-in.
  • The ability to run models on Vulkan and Apple's hardware indicates a growing flexibility in AI infrastructure, which could enhance accessibility for developers.
  • The reported peak performance of 207.6 tok/s showcases significant advancements in model efficiency, potentially enabling faster inference for real-time applications.
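Throughput figures like the 207.6 tok/s cited above are typically measured by counting generated tokens over a wall-clock window. A minimal harness sketch; `generate_stream` is a hypothetical stand-in for whatever streaming inference API is being benchmarked:

```python
import time

# Sketch of a decode-throughput measurement: tokens generated per wall-clock
# second. generate_stream is an assumed placeholder for any streaming API
# that yields one token at a time.
def measure_tok_per_s(generate_stream, prompt: str, max_tokens: int) -> float:
    start = time.perf_counter()
    count = 0
    for _token in generate_stream(prompt, max_tokens):
        count += 1
    return count / (time.perf_counter() - start)
```

Careful benchmarks usually exclude prompt processing (prefill) from the window and report decode throughput separately, since the two phases have very different performance characteristics.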

Concerns

  • The claim of achieving 207 tok/s is misleading, as it relies on speculative decoding, which may compromise output quality compared to traditional methods.
  • The focus on a specific GPU like the RTX 3090 raises concerns about accessibility, as it may not represent the average developer's hardware capabilities.
  • The implementation appears to prioritize speed over quality, which could lead to inferior outputs and undermine the reliability of the model's performance.

Related Articles

How to Run Qwen 3.5 Locally | Unsloth Documentation

Mar 7, 2026

LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?

Mar 24, 2026

TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS (SharpAI/SwiftLM)

Apr 1, 2026

Flash-MoE: Running a 397B Parameter Model on a Laptop (danveloper/flash-moe)

Mar 22, 2026

Unsloth Dynamic 2.0 GGUFs | Unsloth Documentation

Feb 28, 2026