Yandex
Supercomputers
Machine learning helps Yandex improve people’s lives. Thanks to machine learning, Yandex users instantly get accurate search results, chat with their AI assistant as if it were a real person, watch foreign films in their own language, and handle many other tasks.

The three Yandex supercomputers are dedicated specifically to machine learning tasks. Each of these supercomputers carries the name of a Russian scientist whose work defined the development of machine learning and shaped the way we work with big data today.
Chervonenkis
Named after Alexey Chervonenkis, one of the greatest theorists of machine learning.
19th place (Top500, November 2021)
Nodes: 199
Performance: 21,530 Tflops
Cores: 25,472
GPUs: 1,592 NVIDIA A100 80G
RAM: 199 TB
Power: 583 kW
Galushkin
Named after Alexander Galushkin, a leading researcher in neural network theory.
36th place (Top500, November 2021)
Nodes: 136
Performance: 16,020 Tflops
Cores: 17,408
GPUs: 1,088 NVIDIA A100 80G
RAM: 136 TB
Power: 330 kW

Lyapunov
Carries the name of Alexey Lyapunov, a famous mathematician whose groundbreaking work laid the foundation for computer science and machine learning.
40th place (Top500, November 2021)
Nodes: 137
Performance: 12,810 Tflops
Cores: 17,536
GPUs: 1,096 NVIDIA A100 40G
RAM: 68.5 TB
Power: 323 kW

To distribute machine learning tasks among these supercomputers, Yandex has developed YT, a platform for the storage and processing of big data.
About YT
YT is Yandex’s core platform for storing and processing big data, analogous to Hadoop MapReduce and HBase in terms of its features and functionality.

The bottom layer of YT is a distributed file system (DFS), similar to HDFS or GFS. Crucial differences include support for transactions, data storage in tables, and a locking system on nodes, which allows YT to be used as a coordination service (analogous to Apache ZooKeeper).
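To illustrate how node locks enable coordination, here is a minimal sketch in Python. It is a toy in-memory lock manager, not YT's actual API (the class and method names are invented for illustration): exclusive locks on tree nodes are enough to build primitives like leader election, which is the kind of role ZooKeeper-style services play.

```python
# Toy lock manager: exclusive locks keyed by node path.
# Illustrative only; YT's real lock API is not shown in this document.

class LockManager:
    def __init__(self):
        self._locks = {}  # node path -> current lock owner

    def acquire(self, path, owner):
        """Try to take an exclusive lock on a node; return True on success."""
        if path in self._locks:
            return False
        self._locks[path] = owner
        return True

    def release(self, path, owner):
        """Release the lock, but only if the caller actually holds it."""
        if self._locks.get(path) == owner:
            del self._locks[path]

# Leader election: whichever worker locks the node first becomes leader.
manager = LockManager()
print(manager.acquire("//tmp/leader", "worker-1"))  # True: worker-1 leads
print(manager.acquire("//tmp/leader", "worker-2"))  # False: node is locked
manager.release("//tmp/leader", "worker-1")
print(manager.acquire("//tmp/leader", "worker-2"))  # True: leadership moves
```

The same pattern generalizes to distributed mutexes and configuration ownership once the lock table itself is replicated and fault-tolerant.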

The layer above the DFS is a scheduler that can manage a group of hosts with thousands of GPUs and more than a million CPU cores. It splits large computations (operations) into separate blocks (jobs), distributes resources among them, monitors their execution, and restarts jobs that fail.

The top layer of YT is a distributed key-value storage (comparable to BigTable and HBase). The storage and the file system share a common namespace, so to the end user the storage looks like a special type of table in the DFS. These tables support efficient reads and writes of rows by primary key. The key features of the KV storage are transactionality, strong consistency (at the snapshot isolation level), and support for distributed transactions.
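Snapshot isolation is easiest to see with a small multi-version (MVCC) example. The sketch below is a toy in-memory store with invented names, not YT's storage engine: every commit gets a timestamp, and a transaction reads the version of each key that was latest at its snapshot timestamp, unaffected by later commits.

```python
# Toy MVCC key-value store illustrating snapshot isolation.
# Illustrative only; this is not YT's actual storage API.

class MVCCStore:
    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), in ts order
        self.last_ts = 0

    def commit(self, writes):
        """Atomically commit a set of key/value writes at a new timestamp."""
        self.last_ts += 1
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((self.last_ts, value))
        return self.last_ts

    def snapshot_read(self, key, snapshot_ts):
        """Return the latest value of key visible at snapshot_ts."""
        latest = None
        for ts, value in self._versions.get(key, []):
            if ts <= snapshot_ts:
                latest = value
        return latest

store = MVCCStore()
snapshot = store.commit({"user:1": "Alice"})  # a reader's snapshot starts here
store.commit({"user:1": "Bob"})               # a later write is invisible to it
print(store.snapshot_read("user:1", snapshot))       # Alice
print(store.snapshot_read("user:1", store.last_ts))  # Bob
```

Keeping multiple versions per key is what lets long-running reads proceed without blocking concurrent writers.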

Hardware
Yandex supercomputers run on NVIDIA A100 graphics accelerators with an InfiniBand interconnect based on Mellanox switches. Yandex has optimized the standard NVIDIA HGX A100 architecture for machine learning tasks, increasing the cluster size and training even the largest ML models roughly twice as fast.
Facts and figures
The data was calculated for the three Yandex supercomputers in October 2021.
The NVIDIA A100 accelerators used in Yandex supercomputers are also available to customers of Yandex’s cloud computing platform, Yandex.Cloud. Customers can choose NVIDIA A100 accelerators when creating cloud infrastructure as part of a virtual machine, or in DataSphere, a specialized Yandex service for machine learning tasks. With DataSphere, customers can develop and operate their machine learning models while paying only for actual computing time.