Site Reliability Engineer, Yandex.Cloud Managed Service for Kubernetes

Our team develops and operates Managed Service for Kubernetes at Yandex.Cloud, which handles user infrastructure in a constantly growing number of Kubernetes clusters spanning several data centers (currently about 2,000 clusters with tens of thousands of cores in total). Each cluster is a distributed system whose components require proactive monitoring, scaling, and upgrading with no negative impact on the infrastructure and applications deployed to the cluster. Managing the cluster fleet is impossible without extensive ops automation: monitoring management, backup and restore, cluster maintenance, upgrades, and so on.

As the total number of clusters grows, so does the size of individual clusters: the number of nodes (hundreds), total compute resources, and the number of pods, services, and other cluster resources. Managing an individual cluster requires understanding how Kubernetes and its components behave in mid-sized and large clusters. That understanding is the basis for developing scalable operation practices and a solution for the whole cluster fleet.

We’re working on providing the service for new Yandex.Cloud regions in several countries outside Russia and the CIS. That means developing and improving a unified service bootstrap and deployment system.

We’re looking for experienced Site Reliability Engineers to help us with the above tasks.

Responsibilities:

  • Investigate and fix problems with the managed infrastructure and with the infrastructure of the service itself
  • Help with problem analysis and mitigation for:
      • Problems on the border between our service and the underlying IaaS services
      • Problems on the border between our service and user applications deployed to clusters
  • Develop and implement infrastructure operation practices, improve monitoring and alerting systems
  • Participate in service design, suggesting corrections and solutions for better service operability

Qualifications:

  • Software development experience
  • Experience operating mid-sized and large distributed systems
  • Practical experience designing and deploying infrastructure; skill in "infrastructure as code," specifically Terraform
  • Practical experience designing and improving monitoring and alerting systems
  • Practical experience troubleshooting mid-sized and large distributed systems
  • Expertise in Linux networking and container technologies