As the total number of clusters grows, so does the size of individual clusters: the number of nodes (hundreds), total compute resources, and the number of pods, services, and other cluster resources. Managing an individual cluster requires understanding how Kubernetes and its components behave in mid-sized and large clusters. That understanding should be the basis for developing scalable operational practices and a solution for the whole cluster fleet.
We’re working on bringing the service to new Yandex.Cloud regions in several countries outside Russia and the CIS. That means developing and improving a unified system for service bootstrap and deployment.
We’re looking for experienced Site Reliability Engineers to help us with the above tasks.