Introducing Yandex CatBoost, a state-of-the-art open-source gradient boosting library

18 July 2017, 12:56

By Misha Bilenko, Head of Machine Intelligence and Research

Recent developments in machine learning have accelerated its transition from a computer science research area to a technology that drives numerous customer applications. One of the most talked-about methods leading this transition is deep learning. At Yandex, our homegrown deep neural networks are an important part of the machine learning portfolio that helps sustain our market-leading performance in search, speech recognition and synthesis, vision applications and machine translation. At the same time, we’ve also integrated many other forms of machine learning across our products and services.

One thing to remember about machine learning is that there is no single best approach: it is a rich collection of algorithms, each with its own strengths and weaknesses for specific types of data and customer problems. Deep learning has unlocked amazing capabilities in artificial intelligence, but, at the end of the day, it’s just one part of a much broader machine learning tech stack that also includes linear and tree-based models, factorization methods, and numerous other techniques that leverage statistics and optimization.

Gradient boosting is a machine learning algorithm that is widely applied to the kinds of problems businesses encounter every day, such as detecting fraud, predicting customer engagement and ranking recommended items like top web pages or the most relevant ads. It delivers highly accurate results even when relatively little data is available, unlike deep learning frameworks, which need to learn from massive amounts of data. Gradient boosting is well suited to predictive models that analyze many different forms of data, including descriptive data formats with categorical features. In most applications, it serves as the final, “ultimate” model that integrates inputs from many different machine learning techniques, including deep learning models. This makes it one of the most important methods in a practitioner’s toolkit: it can be used to leverage a wide range of data formats and to combine a variety of more specialized models.
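
For readers new to the method, the core idea of gradient boosting can be sketched in a few lines of plain Python. This is a minimal, generic illustration under simplifying assumptions (one numeric feature, squared-error loss, one-split "stump" trees). It is not CatBoost's algorithm, and all function names and the toy data are made up for the example.

```python
# Minimal gradient boosting for regression with squared-error loss.
# Generic textbook procedure for illustration only; NOT CatBoost's implementation.
# All names and the toy data below are hypothetical.

def fit_stump(xs, residuals):
    """Pick the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: a roughly linear relationship with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]

model = fit_boosting(xs, ys)
```

Each round adds a small correction aimed at the errors the ensemble still makes, so the training error shrinks as rounds accumulate. Production libraries build on this loop with deeper trees, regularization and many engineering refinements.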

Today, we are thrilled to announce that we are open-sourcing CatBoost, a gradient boosting library. It is especially powerful in two ways: it yields state-of-the-art results without the extensive training data typically required by other machine learning methods, and it provides powerful out-of-the-box support for the more descriptive data formats that accompany many business problems. Developed by Yandex researchers and engineers, it is the successor to the MatrixNet algorithm that is widely used within our services for ranking tasks, weather forecasting, fraud detection and making recommendations. We believe that it can be applied across a wide range of industrial machine learning tasks, in domains ranging from finance to scientific research.

CatBoost can be integrated with deep learning tools like Google’s TensorFlow, as demonstrated in the accompanying tutorials, where TensorFlow-trained models for text provide inputs to CatBoost. Models trained by CatBoost can be used in production via Apple’s Core ML framework. Apps can be built with CatBoost-trained models, bringing intelligent features directly to customers’ devices.

CatBoost delivers best-in-class accuracy unmatched by other gradient boosting algorithms today. It is an out-of-the-box solution that significantly improves data scientists’ ability to create predictive models using a variety of data sources, such as sensor, historical and transactional data. While most competing gradient boosting algorithms require categorical features to be converted to numerical form first, CatBoost supports categorical data directly, which saves businesses time while increasing accuracy and efficiency.
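
To make the conversion step concrete, here is a small sketch of one common way to turn a categorical column into numbers by hand, known as one-hot encoding. It illustrates the kind of preprocessing that numeric-only libraries call for. It does not show how CatBoost handles categorical features internally, and the helper name, column names and rows are invented for the example.

```python
# Manual one-hot encoding of a categorical column.
# Illustrates preprocessing needed when a library accepts only numeric inputs;
# it does NOT depict CatBoost's internal handling. Data and names are hypothetical.

def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being expanded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"city": "Moscow", "visits": 12},
    {"city": "Berlin", "visits": 3},
    {"city": "Moscow", "visits": 7},
]

encoded_rows = one_hot_encode(rows, "city")
```

With only two cities this is harmless, but a column with thousands of distinct values would expand into thousands of sparse columns, which is one reason direct support for categorical data can save time.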

Over the coming months, we will be rolling out CatBoost to benefit the majority of our Yandex services as we strive to deliver the best customer experience we possibly can. For example, today our weather forecasting tool Yandex.Weather uses MatrixNet to deliver minute-to-minute hyper-local forecasts, while in the near future, CatBoost will help provide our users with even more precise weather forecasting so people can better plan for quick weather changes.

Outside of Yandex, CatBoost is already being used by data scientists at the European Organization for Nuclear Research (CERN) to reduce the number of particle identification errors in data produced by the Large Hadron Collider beauty (LHCb) experiment.

For 20 years now, Yandex has been pioneering innovation in machine learning and artificial intelligence to build intelligent products and services that help consumers and businesses better navigate the online and offline worlds. We feel it’s our duty to share our expertise in machine learning with the open-source community. By making CatBoost available as an open-source library, we hope to enable data scientists and engineers to obtain top-accuracy models with minimal effort, and ultimately define a new standard of excellence in machine learning. Learn more at http://catboost.yandex.