Our team develops and operates Managed Service for Kubernetes at Yandex.Cloud, which runs user infrastructure on a constantly growing number of Kubernetes clusters spanning several data centers (currently about 2,000 clusters with tens of thousands of cores in total).
Each cluster is a distributed system whose components require proactive monitoring, scaling, and upgrading with no negative impact on the infrastructure and applications deployed to the cluster. Managing the cluster fleet is impossible without extensive ops automation: monitoring, backup and restore, cluster maintenance, upgrades, and so on.
As the total number of clusters grows, individual clusters grow too: in the number of nodes (hundreds), total compute resources, and the number of pods, services, and other cluster resources. Managing an individual cluster requires learning how Kubernetes and its components behave in mid-sized and large clusters. That knowledge should form the basis for scalable operational practices and a solution for the whole cluster fleet.
We’re working on providing the service for new Yandex.Cloud regions in several countries outside Russia and the CIS. That means developing and improving a unified service bootstrap and deployment system.
We’re looking for experienced Site Reliability Engineers to help us with the above tasks.