Site Reliability Engineer, Yandex.Cloud Managed Service for Kubernetes

Our team develops and operates Managed Service for Kubernetes at Yandex.Cloud, which handles user infrastructure in a constantly growing number of Kubernetes clusters spanning several data centers (currently about 2,000 clusters with tens of thousands of cores in total). Each cluster is a distributed system whose components require proactive monitoring, scaling, and upgrading with no negative impact on the infrastructure and applications deployed to the cluster. Managing the cluster fleet is impossible without extensive ops automation for monitoring, backup and restore, cluster maintenance, upgrades, and so on.

As the total number of clusters grows, so do individual clusters: in the number of nodes (hundreds), total compute resources, and the number of pods, services, and other cluster resources. Managing an individual cluster requires understanding how Kubernetes and its components behave in mid-sized and large clusters. That understanding is the basis for developing scalable operation practices and solutions for the whole cluster fleet.

We’re working on launching the service in new Yandex.Cloud regions in several countries outside Russia and the CIS. That means developing and improving a unified system for bootstrapping and deploying the service.

We’re looking for experienced Site Reliability Engineers to help us with the above tasks.

Tasks that await you

Investigate and fix problems with both the managed infrastructure and the infrastructure of the service itself
Help analyze and mitigate problems on the boundary between our service and the underlying IaaS services
Help analyze and mitigate problems on the boundary between our service and user applications deployed to clusters
Develop and implement infrastructure operation practices, improve monitoring and alerting systems
Participate in service design, suggesting corrections and solutions for better service operability

We expect you to have

Software development experience
Experience operating mid-sized and large distributed systems
Practical experience designing and deploying infrastructure, including "infrastructure as code" and specifically Terraform
Practical experience designing and improving monitoring and alerting systems
Practical experience troubleshooting mid-sized and large distributed systems
Expertise in Linux networking and container technologies