Yandex Blog

Yandex Data Factory and the Next Industrial Revolution: Steel, Oil, & AI

Jane Zavalishina is the CEO of Yandex Data Factory, a spin-off of Yandex founded in 2014 to provide machine learning solutions to enterprises. In this post, Jane explains how YDF’s business has evolved since its launch, and why industrial AI is now the focus of its strategy.

Since its inception, Yandex Data Factory (YDF) has pioneered an innovative way to create value for companies by applying our expertise in machine learning and artificial intelligence (AI) to help solve their business needs. YDF arose as a solution to the problem many businesses faced at the peak of the big data craze. Essentially, businesses had begun amassing huge amounts of information, but were struggling to extract tangible value from this data.

The solution, of course, is in machine learning. Our parent company, Yandex, was an early leader in machine learning technology, and today machine learning powers 70 percent of Yandex’s products and services. We realised that wherever large stores of data exist, so does the opportunity to use that data to reach measurable business improvements. The same algorithms that power Yandex’s services can be used to help other businesses improve their operations, revenues and profitability.

Over the past two years, we have worked with a number of companies across multiple industries on various successful projects. Together with our clients, we discovered the best use cases where machine learning can be applied to increase the efficiency of existing processes in a measurable way – be it predicting demand for a retail chain, or using computer vision to cut moderation costs for an online service. Along the way, we accumulated a huge amount of expertise in merging data science with business.

One such case was our work with Magnitogorsk Iron & Steel Works (MMK), which marked one of the first collaborations of its kind between a technology company and a steel company. MMK, one of the world’s largest steel producers, wanted to reduce its production costs while maintaining the same high-quality product. YDF developed a machine learning-based service that recommends the optimal amount of ferroalloys, the ingredients needed to produce specific steel grades. Our predictive system demonstrated a reduction in ferroalloy use of five percent on average, equating to annual savings of more than $4 million in production costs, while consistently maintaining the same high quality of steel.

Similarly, we are now optimising the operations of a gas fractionation unit for a petrochemical company. Our solution recommends the fractionation unit parameters for maintaining the best performance and energy savings, decreasing costs in the process. Last week, we also signed a collaboration agreement with Gazprom Neft, an integrated oil company. We plan to apply our technologies to well drilling and completion, and other production processes. These successful efforts demonstrate the high potential for collaboration between artificial intelligence and industrial manufacturing.

The industrial sector – responsible for one-third of global GDP – has proven to be the ideal vertical for the effective application of our technologies. It has become YDF’s focal point for two reasons: our own successful application of predictive analytics to industrial data, and the fit of the industry itself. Put simply, manufacturers know the value of optimisation at heart. Industrial manufacturing is also a unique cultural fit. Manufacturers value measurements above opinions, they have perfected the integration of new technologies into existing processes, and they know how to estimate their effect through properly designed experiments.

For decades, the cornerstone of competitiveness in manufacturing has been the optimisation of existing processes, reaching for each tenth of a percent of efficiency in each step. And when all traditional optimisation means have been applied, the next efficiency leap of five to ten percent is often prohibitively expensive and time-consuming. These improvements typically consist of equipment upgrades with multi-million dollar investments, years spent on construction, rigorous training and implementation, and a lengthy delay before seeing any tangible financial return. Compared to this, achieving the same level of optimisation via machine learning in a matter of months with minimal upfront investment is nothing short of revolutionary.

These long-term benefits extend far beyond a simple profit and loss sheet, and can help conserve both human capital and natural resources. By training machines to focus on the mundane, routine decisions that keep a factory running, artificial intelligence and machine learning allow human employees time to tackle more important tasks. By applying these technologies to oil and gas, companies not only achieve time and material savings, they can also reduce their energy consumption by up to 25 percent.

Our AI-enhanced models create endless opportunities to add value to the manufacturing industry. These benefits are especially noticeable in process manufacturing, where materials and mixtures – metals, chemicals, etc. – are produced. Essentially, these are also the industries responsible for the highest resource consumption.

The AI revolution in manufacturing is happening right now, and we are thrilled to be leading the charge. As this future becomes a reality, we’ll be there – at the forefront – blazing new trails in the industrial sector and delivering far-reaching effects for both the companies we work with and the larger communities they serve.

Now We’re Looking for Lepton Flavour Violation

Wouldn’t we all like to think that the world we’re living in is more or less stable? Isn’t there a certain pleasure in being sure that our feet will be pulled to the ground as firmly tomorrow as they are today? Isn’t it reassuring to know that the cup of tea we’ve just put on our desk won’t disappear instantly and reappear on the bottom of the sea on the other side of the planet, having travelled along its diameter in a straight line? In classical physics, Newton’s laws give us this reassurance. These laws bestow predictability on objects or events as they exist or happen in our reality - on a macroscopic level. On a microscopic level - in particle physics - Fermi’s interaction theory, for instance, postulates that the laws of physics remain the same even after a particle undergoes substantial transformation.

In 1964, however, it became apparent that this isn’t always the case. James Cronin and Val Fitch showed, by examining the decay of subatomic particles called kaons, that a reaction run in reverse does not necessarily retrace the path of the original reaction. This discovery opened a pathway to the theory of electroweak interaction, which in turn gave rise to the theory we all now know as the Standard Model of particle physics.

Although the Standard Model is currently the most convenient paradigm to live with, it doesn’t explain a number of phenomena, including gravity and dark matter. Other theories compete very actively for the leading role in describing the laws of nature in the most accurate and comprehensive way. To succeed, they have to provide evidence of something that happens outside the limitations of the Standard Model. A promising area to look for this kind of evidence is the decay of a charged lepton (tau lepton) into three lighter leptons (muons), which have a certain characteristic - flavour - that differs from that of their ‘mother’ particle. According to the Standard Model, the probability of this decay is vanishingly low, but it can be much higher in other theories.

One experiment at CERN, LHCb, aims at finding this τ → 3μ decay. How are they going to find it? By searching for statistically significant anomalies in an unthinkably large amount of data. How can they find statistically significant anomalies in an unthinkably large amount of data? By using algorithms. These can be trained to separate signal (lepton decays) from background (anything else, really) better than humans. The problem here, however, is not only to find these lepton decays, but also to find them in statistically significant numbers. If the Standard Model is correct, the τ → 3μ decays are so rare that observing them is beyond current experimental sensitivity.
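To make the idea of separating signal from background a little more concrete, here is a minimal sketch of how a cut on a classifier's output score might be chosen. It assumes the common approximate figure of merit s / sqrt(s + b), where s and b are the numbers of signal and background events passing the cut. This is an illustration only, not the procedure used by LHCb or the contest metric; the function names and the toy scores and labels are invented.

```python
import math

def significance(signal_count, background_count):
    """Approximate significance s / sqrt(s + b); 0.0 when nothing is selected."""
    total = signal_count + background_count
    return signal_count / math.sqrt(total) if total > 0 else 0.0

def best_threshold(scores, labels):
    """Scan every observed score as a candidate cut.

    scores: classifier outputs, higher means more signal-like (hypothetical)
    labels: 1 for signal, 0 for background
    Returns (threshold, significance) for the cut with the highest significance.
    """
    best = (None, 0.0)
    for cut in sorted(set(scores)):
        s = sum(1 for sc, y in zip(scores, labels) if sc >= cut and y == 1)
        b = sum(1 for sc, y in zip(scores, labels) if sc >= cut and y == 0)
        z = significance(s, b)
        if z > best[1]:
            best = (cut, z)
    return best

# Toy data, invented for illustration only.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
threshold, z = best_threshold(scores, labels)
print(threshold, z)
```

A better classifier pushes signal events towards high scores, so a cut can keep most of the signal while rejecting more of the background, which raises the achievable significance.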

To come up with a more sensitive and scale-appropriate solution that would help physicists find evidence of the tau lepton decay into three muons at a statistically significant level, Yandex and CERN’s LHCb experiment have launched a contest for a perfect algorithm. The contest, called ‘Flavours of Physics’, starts on July 20th with the deadline for code submissions on October 12th. It is co-organised with an associated member of the LHCb collaboration, the Yandex School of Data Analysis, and Yandex Data Factory - a big data analytics division of Yandex - and is hosted on Kaggle, a website for predictive modelling and analytics competitions. The winning team or participant will claim a cash prize of $7,000, with $5,000 and $3,000 awarded to the first and the second runners-up. An additional prize, consisting of an opportunity to participate in an LHCb workshop at the University of Zurich and $2,000 provided by Intel, will be given to the creator of the algorithm that proves most useful to the LHCb experiment. The data used in this contest will consist of both simulated and real data, acquired in 2011 and 2012, that was used for the τ → 3μ decay analysis in the LHCb experiment.

Contest participants can build on the algorithm provided by the Yandex School of Data Analysis and Yandex Data Factory to make an algorithm of their own.

The metric for evaluating the algorithms submitted for this contest is very similar to the one used by physicists to evaluate the significance of their results, but is much simpler and more robust thanks to the collective effort of the Yandex School of Data Analysis and LHCb specialists, who have adapted procedures routinely used in the LHCb experiment specifically for this contest. Our expectation is that this metric will help scientists choose the algorithms that they could use on data that will be collected in the LHCb experiment in 2015, and in a wide range of other experiments.

Finding the tau lepton decay might take us out of the comfort zone of the Standard Model, but it just as well may open the door to extra dimensions, shed light on dark matter, and finally explain how gravity works on a quantum level.


Collisions as seen within the LHCb experiment's detector (Image: LHCb/CERN)

Yandex Data Factory Predicts ‘Churn’ for World of Tanks

Customer loyalty and satisfaction are crucial in community-based gaming, where every single player matters, and devoted, experienced gamers are especially valuable for the game. Our big data unit, Yandex Data Factory, took game churn prediction – knowing how many gamers are likely to leave the game – to another level. Wargaming, an international MMOG developer whose World of Tanks is one of the world’s most financially successful games, with over 100 million registered players, can now determine more accurately which players are likely to stop playing soon and take measures to prevent that.


The challenge presented to the YDF team was to help increase WoT players’ loyalty and satisfaction with minimal effort and at minimal cost. To approach this challenge, a sample dataset of 100,000 random players who had 20 games or more in the past year was selected – this was done to exclude those who joined the game by accident or just to have a try. Based on a similar concept used in telecom and Wargaming’s own understanding, YDF analysts defined a ‘churner’ as a player who had zero games in the month following a gaming session. Next, the raw data for the ‘churners’, which included over 100 parameters – personal (obfuscated payment balances, purchase logs, etc.), as well as gaming (game logs, number of battles, battle types, number of destroyed tanks, clan battles data, free experience etc.) – was fed to our proprietary machine-learning algorithm, MatrixNet, to find similarities in gamers' behaviour and personal profiles. As a result, a probability of churn was assigned to every gamer in the dataset.
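The sampling rule and the churner definition above can be sketched in a few lines of code. This is a minimal illustration rather than YDF's actual pipeline: it assumes each player is represented by a list of game dates, approximates "the month" as 30 days, and uses invented function names and toy players.

```python
from datetime import date, timedelta

MIN_GAMES = 20          # from the text: 20 games or more in the past year
CHURN_WINDOW_DAYS = 30  # assumption: "the month" approximated as 30 days

def is_eligible(game_dates, as_of):
    """True if the player has at least MIN_GAMES games in the 365 days up to as_of."""
    year_ago = as_of - timedelta(days=365)
    return sum(1 for d in game_dates if year_ago < d <= as_of) >= MIN_GAMES

def is_churner(game_dates, session_date):
    """True if no game was played in the window following session_date."""
    window_end = session_date + timedelta(days=CHURN_WINDOW_DAYS)
    return not any(session_date < d <= window_end for d in game_dates)

# Hypothetical players, one list entry per game played.
as_of = date(2014, 6, 30)
active = [as_of - timedelta(days=i) for i in range(25)] + [date(2014, 7, 10)]
lapsed = [as_of - timedelta(days=3 * i) for i in range(25)]
casual = [as_of - timedelta(days=i) for i in range(5)]

for name, games in [("active", active), ("lapsed", lapsed), ("casual", casual)]:
    print(name, is_eligible(games, as_of), is_churner(games, as_of))
```

Labels produced this way would serve as the target for a model trained on the behavioural and personal parameters.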


WoT could then apply this churn prediction formula to the whole gaming community to spot top potential churners and target customer retention measures, such as special offers, new frictions, bonuses or community activities, specifically at them. The YDF formula’s churn predictions proved at least 20-30% more accurate than the current standard used in the gaming industry. Churn prevention – developing a formula for personalised retention measures – is the next challenge that YDF is ready to take on. Read more about YDF's churn prediction project for Wargaming.
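Once every player has a churn probability, picking the "top potential churners" is a simple ranking step. The sketch below assumes the model output is available as a mapping from player id to probability; the ids, values and function name are hypothetical.

```python
import heapq

def top_potential_churners(churn_probability, k):
    """Return the k player ids with the highest predicted churn probability."""
    return heapq.nlargest(k, churn_probability, key=churn_probability.get)

# Hypothetical model output: player id -> predicted probability of churn.
churn_probability = {"p1": 0.12, "p2": 0.87, "p3": 0.55, "p4": 0.91, "p5": 0.30}
targets = top_potential_churners(churn_probability, 2)
print(targets)
```

Retention measures can then be directed at the returned players only, which keeps the effort and cost of a campaign small.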

Yandex Data Factory Opens for Business

As far as the laws of mathematics refer to reality, they are not certain, and as far as they are certain, they do not refer to reality.

Albert Einstein

A search engine is all about very big data and very advanced mathematics. For more than 17 years, we at Yandex have been developing and implementing technologies and algorithms that pick, from a billion pages on the internet, the one that offers an answer to a web user’s question or solves their problem.

The technologies that power our search are based on machine learning – an approach that allows automating the process of making a decision. Our core machine learning technology, MatrixNet, not only makes its own decisions about whether a certain piece of information is a good answer to a user’s question or not, based on previous experience, but it does so based on a relatively limited experience.

At this point in time, when we can feel that our technologies can be put to use in spheres other than internet search, we are prepared to offer what we’ve got for a larger range of applications.

Today, at the LeWeb innovation conference in Paris, we’re cutting the red ribbon for Yandex Data Factory, our new B2B service for corporate and enterprise clients who would like to use our machine-learning technologies to turn the large volumes of data they possess into hands-on business tools and, by doing so, increase sales, cut costs, optimise processes, prevent losses, forecast demand, and develop new or improve existing methods of audience targeting.

We first branched out of our natural realm with our collaboration with CERN on their Large Hadron Collider beauty (LHCb) experiment. For this project we trained our MatrixNet to search for specific types of particle collisions, or events, among thousands of terabytes of information about these events registered by the detector in the LHCb. Yandex provided the LHCb researchers with instant access to the details of any specific event.

The success of this project gave us reasons to believe it can be repeated in other areas of application. Any industry producing large amounts of data and focused on business goals could benefit from our expertise and our MatrixNet-based technologies: personalisation of search suggestions, recommendations or search results; image or speech recognition; road traffic monitoring and prediction; word form prediction and ranking for machine translation; demographic profiling for audience targeting.

Prior to today's announcement, we ran pilot projects for about a year, designing experimental custom-made solutions for clients all over the world. Most of these projects used data that already existed to train a MatrixNet-based model, which was then applied to new data. Depending on the client's goal, the model would generate suggestions for buying a specific product or predict, with a high degree of accuracy and based on the behaviour of thousands or millions of shoppers with similar behaviour patterns, exactly which product would be bought.

Using this machine-learning technique, we helped one of the leading European banks increase their sales by matching each of their products that needed upselling with the best communication channel for each customer. By applying MatrixNet to behaviour data on a few million of the bank’s clients, we created a model that could predict the net present value of communicating a product to a specific client via a specific channel. This model was then applied to the bank’s new data to generate personalised product recommendations for each client, paired with a communication channel and ranked by potential net profit value. Preliminary results of the first wave of the bank’s marketing campaign, which was run on three million clients, were used to fine-tune the original model, which, in turn, was used in the second wave on a much larger number of the bank’s customers. The resulting sales increase beat the increase forecast by the bank’s own analysts by 13%.
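The pairing and ranking step described above can be illustrated with a short sketch. It assumes a model has already produced a predicted value for every (client, product, channel) combination; the data layout, names and figures are hypothetical, not the bank's or YDF's.

```python
from collections import defaultdict

def rank_offers(predicted_npv):
    """Build per-client recommendations from model predictions.

    predicted_npv: {(client, product, channel): predicted value}
    Returns {client: [(product, channel, value), ...]}, keeping only the best
    channel for each product and sorting each client's offers by value.
    """
    best = {}
    for (client, product, channel), npv in predicted_npv.items():
        key = (client, product)
        if key not in best or npv > best[key][1]:
            best[key] = (channel, npv)

    ranked = defaultdict(list)
    for (client, product), (channel, npv) in best.items():
        ranked[client].append((product, channel, npv))
    for offers in ranked.values():
        offers.sort(key=lambda offer: offer[2], reverse=True)
    return dict(ranked)

# Hypothetical predictions, invented for illustration only.
predicted_npv = {
    ("alice", "credit card", "email"): 12.0,
    ("alice", "credit card", "sms"): 18.5,
    ("alice", "deposit", "call"): 40.0,
    ("bob", "deposit", "email"): 7.0,
}
recommendations = rank_offers(predicted_npv)
print(recommendations)
```

Results from a first campaign wave could be appended to the training data before the predictions are regenerated, which mirrors the fine-tuning loop described in the text.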

The same machine-learning approach, together with our own data and expertise in geolocation, helped a road and traffic management agency make their accident prediction 30 times more accurate. To enable the agency to take measures to prevent road traffic accidents, we provided them with one-hour forecasts for traffic jams, as well as alerts for high-risk traffic conditions, in real time, and visualised potential congestion on interactive maps. Using MatrixNet, we first trained predictive formulas on our own UGC information about almost 40,000 road accidents and 5bn speed tracks mined over 2.5 years, complemented by the information provided by the agency: traffic information (i.e., number of cars passing through a given segment of the road in any given time), information about road conditions (type of surface, number of lanes, gradient etc.), and weather information. These formulas were then applied to larger data sets, and a predictive system for road traffic accidents was developed and deployed in the agency’s situation rooms.

Currently, we’re continuing to work on about 20 projects in various stages of completion across the globe. In essence, we're continuing to experiment, but this time, we know in which direction, or rather – in which directions – we are to move. While the majority of our potential partners, as well as data, come from finance, telecommunications, retail, logistics, utilities, and even the new-fangled 'smart cities', anyone who has data and a business goal can discover new opportunities brought about by mathematics. No matter what industry your business is in, mathematics will work for you. Despite what Einstein said.
