Founding Reliability & Performance Engineer at LiteLLM | Y Combinator
Hacker News
February 27, 2026
TL;DR
LiteLLM is an open-source AI gateway (36K+ GitHub stars) that routes hundreds of millions of LLM API calls daily for companies like NASA, Adobe, Netflix, Stripe, and Nvidia. We're at $7M ARR, 10 people, YC W23.
When LiteLLM goes down, our customers' entire AI stack goes down. We need someone who makes sure that doesn't happen.
You'd be the first dedicated reliability hire. You'll own reliability, performance, and production stability end-to-end. Nobody will tell you how to do it.
What this job actually is
We'll be straight with you: this role is roughly 60% operational reliability and 40% deep performance engineering. On any given week you might be:
Hunting a memory leak in our async streaming handler that causes OOMs after 4 hours under load
Fixing a race condition where PodLockManager releases another pod's lock
Profiling why update_database() does 7 deep copies per request in the spend tracking hot path
Helping a Fortune 500 customer debug why their 20-pod deployment is exhausting Postgres connections
Building soak tests that catch degradation before a release goes out
Reviewing a PR that touches the request hot path and saying "this will add 50ms at P99, here's why"
If you're looking for a pure optimization role where you sit in a profiler all day — this isn't it. If you want to own production health for one of the most widely deployed AI infrastructure projects in the world — keep reading.
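The streaming-leak hunt in the first bullet usually comes down to lifecycle: if a client disconnects mid-stream and the upstream response is never closed, its connection and buffers outlive the request. A minimal, self-contained illustration of the fix (FakeStream is a hypothetical stand-in, not LiteLLM code):

```python
import asyncio

class FakeStream:
    """Stand-in for an upstream HTTP streaming response (hypothetical)."""
    def __init__(self):
        self.closed = False

    async def chunks(self):
        for i in range(100):
            yield f"chunk-{i}"

    async def aclose(self):
        self.closed = True

async def relay(stream, client_disconnects_after=3):
    # The gateway relays chunks to the client. If the client disconnects
    # mid-stream and aclose() is never called, the upstream connection and
    # its buffers stay alive -- the classic slow leak under load.
    try:
        sent = 0
        async for chunk in stream.chunks():
            sent += 1
            if sent >= client_disconnects_after:
                break  # simulated client disconnect
        return sent
    finally:
        await stream.aclose()  # guaranteed cleanup, even on early exit

stream = FakeStream()
sent = asyncio.run(relay(stream))
print(sent, stream.closed)  # -> 3 True
```

The `try/finally` (or an async context manager) is what keeps this correct under cancellation, timeouts, and early client disconnects alike.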
Why this matters
We route traffic for some of the largest AI deployments on the planet. One customer is scaling from 20M to 200M daily AI calls through our gateway. Another has 150K users hitting us daily. When we ship a bad release, it doesn't just break a dashboard — it breaks production AI systems at companies you've heard of.
The problems here are genuinely hard:
Memory management in long-running Python async services — our proxy handles thousands of concurrent streaming connections. HTTP client sessions, response iterators, and background tasks all need careful lifecycle management.
Database at scale — spend logging, auth, and rate limiting all interact with Postgres. At 100K+ requests/day, naive patterns fall apart.
100+ provider surface area — we translate between OpenAI, Anthropic, Bedrock, Vertex, and 100+ other APIs. Each has unique streaming behavior. A refactor that fixes one provider can break three others.
You won't run out of interesting problems.
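The connection-exhaustion failure mode above is often just multiplication: per-pod connection pools stack up across a deployment. A back-of-the-envelope check (pool size and max_connections here are illustrative defaults, not any customer's actual config):

```python
# Why a 20-pod deployment can exhaust Postgres connections:
# every pod opens its own pool, and the pools multiply.
pods = 20
pool_size_per_pod = 10          # an illustrative per-pod pool size
postgres_max_connections = 100  # Postgres's shipped default

needed = pods * pool_size_per_pod
print(needed, needed > postgres_max_connections)  # -> 200 True
```

This is why scaling out the proxy without a shared pooler (e.g. PgBouncer) or smaller per-pod pools eventually hits the database's connection ceiling.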
What you'll own
Production reliability
On-call for critical issues (shared rotation with the team, not solo)
Incident response and blameless post-mortems
Customer escalation support for enterprise deployments
Making the proxy self-healing when DB/Redis is temporarily unavailable
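"Self-healing when DB/Redis is temporarily unavailable" typically means failing open on non-critical writes behind a circuit breaker, so request serving continues while the dependency recovers. A minimal sketch of the pattern (illustrative, not LiteLLM's actual implementation):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures,
    probes again after a cooldown. Illustrative only."""
    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # half-open: let one probe through
            self.failures = 0
            return True
        return False

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic() if now is None else now

    def record_success(self):
        self.failures = 0
        self.opened_at = None

cb = CircuitBreaker(failure_threshold=3, cooldown_s=30.0)
for _ in range(3):
    cb.record_failure(now=0.0)    # three DB write failures in a row
print(cb.allow(now=1.0))          # -> False (circuit open, skip DB writes)
print(cb.allow(now=31.0))         # -> True  (cooldown elapsed, probe again)
```

While the circuit is open, spend-log writes would queue in memory or drop with a metric, rather than stalling every request on a dead database.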
Performance engineering
Memory leak detection and prevention (soak tests, CI integration)
Hot path optimization — our target is <10ms overhead at 5K+ RPS
P50/P95/P99 latency benchmarks that block releases on regression
Profiling and fixing bottlenecks (Pydantic validation, connection pools, async task scheduling)
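A latency gate like the one in the bullets above can be as simple as comparing P99 between a baseline run and a candidate run and failing the build past a budget. A toy sketch with synthetic latencies (all numbers illustrative):

```python
import random

def p99(samples_ms):
    # nearest-rank style P99 over per-request latencies
    s = sorted(samples_ms)
    return s[int(0.99 * (len(s) - 1))]

random.seed(0)
baseline = [random.gauss(8.0, 2.0) for _ in range(10_000)]  # last release
candidate = [x + 0.5 for x in baseline]   # simulate a uniform +0.5 ms slowdown

budget_ms = 1.0  # P99 regression allowed before the release is blocked
regression = p99(candidate) - p99(baseline)
print(f"P99 regression: {regression:.2f} ms; gate passes: {regression <= budget_ms}")
# -> P99 regression: 0.50 ms; gate passes: True
```

In practice the samples come from a load-test harness against a release candidate, and the gate runs in CI so a regression never reaches customers.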
Observability & release safety
Structured logging, distributed tracing, correlation IDs
Prometheus metrics that are actually accurate and actionable
Building toward canary deployments and automated rollback
SLO definition and tracking for enterprise customers
Who you are
Must have:
2+ years of experience running Python services in production, with real exposure to debugging things that break at scale
Strong understanding of Python async internals — asyncio event loop, aiohttp/httpx session management, connection pooling
Experience debugging production memory leaks, OOMs, or latency degradation (bonus if you've used memray, py-spy, or tracemalloc)
Solid PostgreSQL knowledge — connection pool tuning, query optimization, understanding how DB operations on the request path degrade under load
Comfort with Kubernetes at an operational level — pod lifecycle, resource limits, health probes
You've been on-call before and you didn't hate it
Strong signals:
You've worked on a proxy, API gateway, load balancer, or middleware service where overhead itself is what you optimize
You've worked at Meta (Production Engineering), Cloudflare, Fastly, Datadog, Stripe, or a similar infrastructure company
You've been an early reliability/infra hire at a startup and built production practices from scratch
You've contributed to open-source infrastructure projects
You understand HTTP/2, streaming responses (SSE), and how async Python handles them under concurrency
Why LiteLLM
Scale & impact: Your work is in the critical path for hundreds of millions of AI API calls daily. NASA, Netflix, Adobe, Stripe depend on this.
Open source visibility: 36K GitHub stars. Your contributions are visible to the entire AI infrastructure community. Your GitHub profile will look incredible.
Ownership: First dedicated reliability hire. You define what reliability means here. No bureaucracy, no tickets — you see a problem, you fix it.
Trajectory: $7M ARR growing fast, 10-person team, YC W23. Meaningful equity at a stage where it can matter.