Yandex Supercomputers

Machine learning helps Yandex improve people’s lives. Thanks to machine learning, Yandex users instantly get accurate search results, chat with their AI assistant as if she were a real person, watch foreign films in their own language, and solve many other tasks.

The three Yandex supercomputers are dedicated specifically to machine learning tasks. Each of these supercomputers carries the name of a Russian scientist whose work defined the development of machine learning and shaped the way we work with big data today.

Chervonenkis

Named after Alexey Chervonenkis, one of the greatest theorists of machine learning.

19th place in the Top500, November 2021

Nodes: 199
Performance: 21,530 Tflops
Cores: 25,472
GPUs: 1,592 (NVIDIA A100 80G)
RAM: 199 TB
Power: 583 kW

Galushkin

Named after Alexander Galushkin, a leading researcher in neural network theory.

36th place in the Top500, November 2021

Nodes: 136
Performance: 16,020 Tflops
Cores: 17,408
GPUs: 1,088 (NVIDIA A100 80G)
RAM: 136 TB
Power: 330 kW

Lyapunov

Carries the name of Alexey Lyapunov, a famous mathematician whose groundbreaking work laid the foundation for computer science and machine learning.

40th place in the Top500, November 2021

Nodes: 137
Performance: 12,810 Tflops
Cores: 17,536
GPUs: 1,096 (NVIDIA A100 40G)
RAM: 68.5 TB
Power: 323 kW
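The totals listed for the three machines can be broken down into per-node and per-GPU figures with simple division. The sketch below is only an illustration: the `Cluster` class and its method names are hypothetical, the inputs are the numbers quoted above, and it assumes that every node in a cluster is identical and that the power figure refers to the same run as the performance figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Totals for one supercomputer, as quoted on this page."""
    name: str
    nodes: int
    tflops: float
    cores: int
    gpus: int
    ram_tb: float
    power_kw: float

    def per_node(self) -> dict:
        # Assumes homogeneous nodes, so totals divide evenly.
        return {
            "cores": self.cores / self.nodes,
            "gpus": self.gpus / self.nodes,
            "ram_tb": self.ram_tb / self.nodes,
        }

    def tflops_per_gpu(self) -> float:
        return self.tflops / self.gpus

    def tflops_per_kw(self) -> float:
        return self.tflops / self.power_kw


CLUSTERS = [
    Cluster("Chervonenkis", 199, 21_530, 25_472, 1_592, 199, 583),
    Cluster("Galushkin", 136, 16_020, 17_408, 1_088, 136, 330),
    Cluster("Lyapunov", 137, 12_810, 17_536, 1_096, 68.5, 323),
]

for c in CLUSTERS:
    print(c.name, c.per_node(),
          round(c.tflops_per_gpu(), 2), round(c.tflops_per_kw(), 1))
```

Under these assumptions every cluster works out to 128 cores and 8 GPUs per node, with 1 TB of RAM per node on Chervonenkis and Galushkin and 0.5 TB per node on Lyapunov.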

To distribute machine learning tasks among these supercomputers, Yandex has developed YT (released publicly as YTsaurus), a platform for the storage and processing of big data.

About YTsaurus

YTsaurus is Yandex’s core platform for storing and processing big data, analogous to Hadoop MapReduce and HBase in terms of its features and functionality. In 2023 the platform went open-source.

The bottom layer of YTsaurus consists of a distributed file system (DFS), similar to HDFS or GFS. Crucial differences include support for transactions, data storage in tables, and a system of locks on nodes, which allows YTsaurus to be used as a coordination service (analogous to Apache ZooKeeper).

The layer above DFS is a scheduler, which can manage a group of hosts with thousands of GPUs and more than a million CPU cores. It splits large computations (operations) into separate blocks (jobs), distributes resources between them, monitors their execution, and restarts those jobs that fail.
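The split, distribute, and restart cycle described above can be sketched in a few lines. This is not the YTsaurus scheduler or its API: `split_into_jobs`, `run_with_retries`, `run_operation`, and `flaky_sum` are hypothetical names, a thread pool stands in for a group of hosts, and a job failure is simulated with a `RuntimeError`.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_jobs(rows, job_size):
    """Cut the input of one operation into fixed-size blocks, one per job."""
    return [rows[i:i + job_size] for i in range(0, len(rows), job_size)]


def run_with_retries(job_fn, block, max_attempts):
    """Run one job, restarting it after a failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job_fn(block, attempt)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up: the whole operation fails


def run_operation(rows, job_fn, job_size=4, slots=2, max_attempts=3):
    """Split an operation into jobs and run them on a limited number of slots."""
    blocks = split_into_jobs(rows, job_size)
    with ThreadPoolExecutor(max_workers=slots) as pool:
        futures = [pool.submit(run_with_retries, job_fn, block, max_attempts)
                   for block in blocks]
        # Collect results in block order, regardless of completion order.
        return [future.result() for future in futures]


def flaky_sum(block, attempt):
    """Toy job: sum a block, failing once for any block that contains 5."""
    if 5 in block and attempt == 1:
        raise RuntimeError("simulated node failure")
    return sum(block)


if __name__ == "__main__":
    print(run_operation(list(range(10)), flaky_sum))  # [6, 22, 17]
```

The second block fails on its first attempt and succeeds when restarted, so the operation as a whole still completes. A real scheduler would also account for resources such as GPUs and memory when placing jobs, which this sketch leaves out.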

The top layer of YTsaurus is a distributed key-value storage (comparable to BigTable and HBase). The storage and the file system share a common namespace, so to the end user the storage appears as a special type of table in the DFS. These tables support efficient read and write operations on rows by primary key. The key features of the KV storage include transactionality, strict consistency (the snapshot isolation level), and support for distributed transactions.
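To make the snapshot isolation level concrete, here is a toy in-memory key-value store with multi-version reads. It is a single-process illustration of the semantics only, not how YTsaurus implements them: the `Store`, `Transaction`, and `ConflictError` names are hypothetical, and the integer clock stands in for real commit timestamps.

```python
class ConflictError(Exception):
    """Raised when a key was overwritten after the transaction started."""


class Store:
    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # timestamp of the latest commit

    def begin(self):
        return Transaction(self, self._clock)

    def _read(self, key, ts):
        # Return the newest value committed at or before the snapshot ts.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

    def _commit(self, txn):
        # Write-write conflict check: first committer wins.
        for key in txn.writes:
            versions = self._versions.get(key, [])
            if versions and versions[-1][0] > txn.start_ts:
                raise ConflictError(key)
        self._clock += 1
        for key, value in txn.writes.items():
            self._versions.setdefault(key, []).append((self._clock, value))


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.store._read(key, self.start_ts)

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        self.store._commit(self)


store = Store()
setup = store.begin()
setup.put("a", 1)
setup.commit()

reader = store.begin()   # snapshot taken here
writer = store.begin()
writer.put("a", 2)
writer.commit()

print(reader.get("a"))         # 1: the reader still sees its snapshot
print(store.begin().get("a"))  # 2: a new transaction sees the commit
```

If `reader` now tried to write `"a"` and commit, the store would raise `ConflictError`, because another transaction committed a newer version of that key after the reader's snapshot was taken.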

Hardware

Yandex supercomputers run on NVIDIA A100 graphics accelerators with an InfiniBand interconnect based on Mellanox switches. Yandex has optimized the standard NVIDIA HGX A100 architecture for machine learning tasks, which makes it possible to increase the cluster size and train even the largest ML models about twice as fast.

Facts and figures

500+ employees use the supercomputers
200,000+ tasks completed per month
Up to 3,500 tasks executed simultaneously
32 seconds: fastest task execution time
11 minutes: median task execution time
25 days: longest task execution time

The data is calculated for the three Yandex supercomputers in October 2021.

The NVIDIA A100 accelerators used in Yandex supercomputers are also available to customers of Yandex’s cloud computing platform, Yandex.Cloud. Customers can select NVIDIA A100 accelerators when building cloud infrastructure, either as part of a virtual machine or in DataSphere, a specialized Yandex service for machine learning tasks. With DataSphere, customers can develop and run their machine learning models while paying only for the computing time they actually use.
