Two different tricks for fast LLM inference
Hacker News
February 15, 2026
AI-Generated Deep Dive Summary
Anthropic and OpenAI have both introduced "fast mode" features to speed up large language model (LLM) inference, but their approaches differ significantly. Anthropic's fast mode delivers up to 2.5x the tokens per second of its standard mode by using low-batch-size inference: requests begin processing immediately instead of waiting for a batch to fill, at a higher price per token. OpenAI's fast mode instead runs on Cerebras' ultra-fast chips and introduces a new model, GPT-5.3-Codex-Spark, which generates over 1,000 tokens per second, roughly six times Anthropic's fast-mode speed, but with reduced capability compared to OpenAI's full model.
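As a rough sketch of the batching trade-off behind Anthropic's approach, the toy latency model below (in Python) compares large-batch serving, where a request first queues while a batch fills, against batch-of-1 serving, which starts decoding immediately. Every number is an invented assumption, picked only so the ratio lands near the reported ~2.5x; nothing here reflects either provider's actual serving stack.

    from dataclasses import dataclass

    @dataclass
    class ServingConfig:
        batch_size: int       # requests decoded together (documentation only here)
        batch_wait_ms: float  # mean time a request queues while the batch fills
        step_ms: float        # wall-clock time to generate one token

    def latency_ms(cfg: ServingConfig, tokens: int) -> float:
        """Mean end-to-end latency for one request generating `tokens` tokens."""
        return cfg.batch_wait_ms + tokens * cfg.step_ms

    # Large-batch serving: high GPU utilization, but requests queue first
    # and each decode step is shared across many sequences.
    batched = ServingConfig(batch_size=64, batch_wait_ms=500.0, step_ms=20.0)

    # Batch-of-1 "fast mode": no queueing and faster steps for the lone
    # request, but far lower throughput per GPU, hence the higher price.
    fast = ServingConfig(batch_size=1, batch_wait_ms=0.0, step_ms=8.0)

    for name, cfg in [("batched", batched), ("fast", fast)]:
        t = latency_ms(cfg, tokens=500)
        print(f"{name}: {t / 1000:.1f} s for 500 tokens, "
              f"{500 / (t / 1000):.0f} tokens/s per request")

The point is structural: dropping the batch size removes queueing delay and devotes the hardware to a single request, buying latency at the cost of per-GPU throughput, and that lost throughput is what the higher price covers.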
Anthropic's strategy keeps the model itself unchanged: fast mode serves the same Opus 4.6, only under a serving configuration that avoids batching delays, so users pay more for latency rather than giving up capability. OpenAI instead pursues raw speed through specialized hardware and a model tailored to it. Spark is faster but sacrifices some accuracy on complex tasks such as tool calls, highlighting the trade-off between speed and capability.
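The distinction also shows up at the API surface: one design is a speed knob on an existing model, the other is a separate model ID. The Python sketch below contrasts the two shapes; the endpoints, parameter names (including "speed"), and response fields are hypothetical illustrations, not either provider's real API.

    import requests

    def anthropic_style(prompt: str) -> str:
        # Same model, different serving tier: output quality should match
        # the standard mode, with only latency and price changing.
        # The endpoint and "speed" parameter are invented for illustration.
        resp = requests.post(
            "https://api.example-anthropic.test/v1/messages",
            json={"model": "opus-4.6", "speed": "fast", "prompt": prompt},
        )
        return resp.json()["completion"]

    def openai_style(prompt: str) -> str:
        # A distinct, smaller model served on specialized hardware: the
        # caller opts into speed by choosing the model itself, accepting
        # reduced capability. Endpoint and fields are likewise invented.
        resp = requests.post(
            "https://api.example-openai.test/v1/responses",
            json={"model": "gpt-5.3-codex-spark", "prompt": prompt},
        )
        return resp.json()["completion"]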
The choice of method reflects each company's priorities. Anthropic emphasizes quality and flexibility, catering to users willing to pay more for immediate results from the full model. OpenAI prioritizes scalability and raw throughput, pairing purpose-built hardware with a model optimized for speed rather than precision. These differences matter to developers and businesses because they determine which tool fits a given workload: rapid coding tasks on one hand, or more nuanced applications that require accuracy on the other.
For tech enthusiasts and AI adopters, these advancements underscore the evolving trade-off between speed, cost, and capability in LLM serving.