SRE for Yandex Private Cloud
We are building one of the biggest private Linux container-based clouds in Russia.

Our team is responsible for developing and operating the Yandex internal cloud cluster. Almost all Yandex production services live here. We have tens of thousands of Linux hosts in multiple locations, along with several HPC TOP500 clusters, each of which brings its own challenges to our job. Our clients are Yandex developers who deploy their workloads in cloud-based containers.

To provide a safe and reliable service, we develop, support, and adopt many infrastructure services and agents. We also fix open-source system applications when needed and send patches upstream. Our main goal is to minimize routine work and improve automation so that the effort of sustaining our systems stays under control.

Tasks that await you

  • Perform critical operations and safely deploy solutions
  • Resolve automation conflicts and hiccups for routine operations
  • Optimize the automation of routine operations
  • Prepare infrastructure services for working with external computing resource environments (e.g. non-Yandex cloud provider VMs)
  • Find workarounds for hardware issues, and improve hardware error detection and the automation that fixes or resolves those issues
  • Support and develop host authentication and authorization systems
  • Profile and fine-tune our hardware, Linux kernel, and system services
  • Support our package repository and the Ubuntu-based distro package sets for our hosts

We expect that you

  • Have experience developing in Python or Golang
  • Are motivated to engage in infrastructure development tasks

It'd be a plus if you

  • Are able to rapidly dive into new programming languages and technologies
  • Want to get first-hand Linux system programming experience in a real production environment at scale

Mon Feb 12 2024 19:00:36 GMT+0300 (Moscow Standard Time)