Because only a few experts are active per token, MoE models like GPT-OSS-20B read far fewer weights per decoding step than a dense model of the same total size, so they can produce more tokens per second on the same hardware, making them efficient for local AI applications.
Recent models like Qwen 3.5 show significant performance improvements, indicating that smaller models keep getting better as the field advances.
Concerns
Estimating tokens per second from memory bandwidth divided by model size may not accurately reflect MoE models, where only the active experts' weights are read per token, so such estimates can mislead.
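To make this concrete, here is a minimal sketch of the bandwidth-based estimate, comparing the naive total-parameter figure with one based on active parameters only. The numbers are illustrative assumptions, not measurements: roughly 21B total and 3.6B active parameters (GPT-OSS-20B-class figures), 4-bit weights, and 400 GB/s of memory bandwidth.

```python
def est_tokens_per_sec(params_read_per_token: float,
                       bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Naive upper bound: each decoded token must stream the weights it uses."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 400    # assumed memory bandwidth
BYTES_PER_PARAM = 0.5   # assumed 4-bit quantization

# Treating all weights as read per token vs. only the active experts' weights.
naive_estimate = est_tokens_per_sec(21e9, BYTES_PER_PARAM, BANDWIDTH_GB_S)
moe_estimate = est_tokens_per_sec(3.6e9, BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"naive (total params): {naive_estimate:6.1f} tok/s")   # ~38 tok/s
print(f"MoE (active params):  {moe_estimate:6.1f} tok/s")     # ~222 tok/s
```

The gap between the two figures shows how a total-parameter estimate can misrepresent MoE throughput; neither number accounts for activations, KV cache traffic, or compute limits.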
Perplexity matters more than tokens per second: a model can generate quickly and still produce poor-quality output, so quality should take priority over speed.
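For reference, perplexity is the exponential of the average negative log-likelihood per token; a minimal sketch follows, with made-up per-token log-probabilities purely for illustration.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities from two equally fast runs:
good_run = [-0.3, -0.5, -0.2, -0.4]
bad_run  = [-2.1, -1.8, -2.5, -1.9]

print(perplexity(good_run))  # ~1.4 (lower is better)
print(perplexity(bad_run))   # ~8.0
```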