Scaling Kubernetes to 2,500 nodes

OpenAI Blog
January 18, 2018
Our Dota project started out on Kubernetes, and as it scaled, we noticed that fresh Kubernetes nodes often had pods sitting in Pending for a long time. The game image is around 17 GB and would often take 30 minutes to pull on a fresh cluster node, so we understood why the Dota container would be Pending for a while, but this was true for other containers as well. Digging in, we found that kubelet has a --serialize-image-pulls flag which defaults to true, meaning the Dota image pull blocked all other image pulls on that node. Changing it to false required switching Docker to overlay2 rather than AUFS. To further speed up pulls, we also moved the Docker root to the instance-attached SSD, like we did for the etcd machines.
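As a rough illustration of the three changes described above, the following Python sketch renders the corresponding settings. The kubelet flag and the overlay2 storage driver come from the text. Everything else is an assumption made for the example: the SSD mount point `/mnt/ssd/docker` is hypothetical, the helper names are invented, and `data-root` is the Docker `daemon.json` key in current releases (older releases used `graph`). It is not the configuration we actually deployed.

```python
import json

# Hypothetical mount point for the instance-attached SSD; the post does not name one.
SSD_DOCKER_ROOT = "/mnt/ssd/docker"


def docker_daemon_config(data_root: str = SSD_DOCKER_ROOT) -> dict:
    """Return daemon.json settings for overlay2 with the Docker root on the SSD."""
    return {
        "storage-driver": "overlay2",  # used instead of AUFS
        "data-root": data_root,        # assumed key name; older Docker used "graph"
    }


def kubelet_args() -> list:
    """Return the kubelet flag that stops one image pull from blocking the others."""
    return ["--serialize-image-pulls=false"]


if __name__ == "__main__":
    # Contents one might write to /etc/docker/daemon.json on each node.
    print(json.dumps(docker_daemon_config(), indent=2))
    # Extra argument to append to the kubelet command line.
    print(" ".join(kubelet_args()))
```

Keeping the two pieces separate mirrors the dependency noted in the text: the kubelet flag should only be flipped on nodes where Docker already runs on overlay2.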
Originally published on OpenAI Blog on 1/18/2018