GitHub - xaskasdf/ntransformer: High-efficiency LLM inference engine in C++/CUDA. Run Llama 70B on RTX 3090.

Hacker News
February 21, 2026
AI-Generated Deep Dive Summary
NTransformer is a high-efficiency LLM inference engine written in C++/CUDA, designed to run large language models such as Llama 70B on consumer hardware like an RTX 3090 with 24 GB of VRAM. By streaming weights over PCIe, with optional NVMe direct I/O, it can execute models whose weights exceed GPU VRAM. The engine supports three modes: Resident Mode (the model fits entirely in VRAM), Tiered Mode (hybrid caching across VRAM, pinned RAM, and NVMe storage), and Streaming Mode (models too large to fit in memory at all).

Its key feature is a 3-tier adaptive caching system that allocates resources based on the hardware available: GPU-resident layers are prioritized, with pinned RAM and an NVMe SSD absorbing the overflow. A userspace driver gives the GPU direct access to NVMe storage, removing the traditional CPU bottleneck from the data path and significantly improving inference speed. On an RTX 3090 paired with 48 GB of RAM and an NVMe SSD, NTransformer achieves a 33x speedup over an mmap baseline when running Llama 70B. The engine also supports multiple quantization formats (Q4_0, Q…