About machine translation

Yandex Translate uses a hybrid model of machine translation that includes both neural network (deep learning) and statistical approaches.

Statistical machine translation

The statistical approach is based on language and translation models:

• To create a translation model, the system compares hundreds of thousands of parallel texts that have the same meaning but are written in different languages. Words that are likely matches are extracted during comparison and stored in a matrix. For example, the system decides that «dog» and «собака» are probable translations of each other and saves this information. The resulting matrix helps determine which pairs of phrases in a sentence pair can serve as translations for each other.

• To create a language model, the system analyzes texts in one language and makes lists of all the words and phrases that are used. Each word and phrase is assigned a numeric ID and a statistical frequency (how often it is used in the language).
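
The two models described above can be illustrated with a minimal sketch. The toy corpus, the function names, and the use of raw co-occurrence counts are assumptions made for illustration; a production system would rely on far more data and on proper alignment statistics rather than simple counts.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus (hypothetical data): English / Russian sentence pairs.
parallel = [
    ("the dog sleeps", "собака спит"),
    ("the dog eats", "собака ест"),
    ("the cat sleeps", "кошка спит"),
]

def cooccurrence_counts(pairs):
    """Translation-model side: count how often a source word and a
    target word appear in the same sentence pair."""
    counts = Counter()
    for src, tgt in pairs:
        for s, t in product(set(src.split()), set(tgt.split())):
            counts[(s, t)] += 1
    return counts

def word_frequencies(sentences):
    """Language-model side: relative frequency of each word in one language."""
    words = Counter(w for sent in sentences for w in sent.split())
    total = sum(words.values())
    return {w: n / total for w, n in words.items()}

matrix = cooccurrence_counts(parallel)
freqs = word_frequencies(tgt for _, tgt in parallel)

# "dog" co-occurs with "собака" in two sentence pairs and never with "кошка".
print(matrix[("dog", "собака")], matrix[("dog", "кошка")])
```

Note that raw counts also pair frequent words such as «the» with «собака», which is one reason real systems need much larger corpora and more careful statistics than this sketch uses.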

During translation, each source sentence is divided into words and phrases that are translated independently of each other. Each part of the sentence is matched to a potential translation from the matrix. Then the system «puts together» several versions of the translated sentence and selects the statistically best option based on optimal combinations of words in natural language.
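
As a rough sketch of this step, the code below translates each phrase independently, assembles every combination, and keeps the one with the best combined score. The phrase table, the bigram scores, and the function name `best_translation` are all hypothetical values chosen for illustration, not data from Yandex Translate.

```python
from itertools import product

# Hypothetical phrase table: source word -> candidate translations with scores.
phrase_table = {
    "big": [("большой", 0.6), ("большая", 0.4)],
    "dog": [("собака", 0.8), ("пёс", 0.2)],
}

# Hypothetical language-model scores for adjacent target words.
bigram_score = {
    ("большой", "собака"): 0.01,  # gender mismatch, rarely seen
    ("большая", "собака"): 0.5,
    ("большой", "пёс"): 0.4,
    ("большая", "пёс"): 0.01,     # gender mismatch, rarely seen
}

def best_translation(source_phrases):
    """Assemble every combination of candidates and return the one with
    the highest product of translation and language-model scores."""
    best, best_score = None, 0.0
    for combo in product(*(phrase_table[p] for p in source_phrases)):
        words = [w for w, _ in combo]
        score = 1.0
        for _, prob in combo:
            score *= prob                           # translation model
        for pair in zip(words, words[1:]):
            score *= bigram_score.get(pair, 1e-6)   # language model
        if score > best_score:
            best, best_score = " ".join(words), score
    return best

print(best_translation(["big", "dog"]))
```

In this toy setup the individually most likely translation of «big» is «большой», but the language model favors the combination «большая собака», which shows how the final choice depends on how well the words fit together.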

Statistical machine translation works well for remembering and translating short phrases and uncommon words. However, it has a drawback: because the statistical model doesn't take context into account, phrases in the translation may be out of place or disjointed.

Neural machine translation

Similar to the statistical approach, the neural network also analyzes an array of parallel texts, learns to find patterns in them, and makes lists of all the words and phrases used.

However, instead of using simple identifiers like the statistical approach, neural machine translation uses what is called word embedding: a vector representation is formed for each word, consisting of numbers that identify its lexical and semantic features.

The neural network translates each source sentence as a whole, instead of breaking it down into words and phrases for separate translation. Each word in the sentence is mapped to a vector that is several hundred numbers long. As a result, the whole sentence is represented in a vector space. This representation allows the neural network to determine the semantics of words and their relationships, even if the words are in different parts of the original sentence.
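
A minimal sketch of this mapping follows. The four-dimensional vectors, the `EMBEDDINGS` table, and the all-zero fallback for unknown words are assumptions for illustration; real embeddings are learned during training and are several hundred numbers long.

```python
# Hypothetical 4-dimensional embeddings (real ones have hundreds of dimensions).
EMBEDDINGS = {
    "the": [0.1, 0.0, 0.2, 0.1],
    "dog": [0.9, 0.3, 0.1, 0.7],
    "sleeps": [0.2, 0.8, 0.5, 0.1],
}

# Assumed fallback for words the model has never seen.
UNKNOWN = [0.0, 0.0, 0.0, 0.0]

def embed_sentence(sentence):
    """Map every word of the sentence to its vector, keeping word order."""
    return [EMBEDDINGS.get(word, UNKNOWN) for word in sentence.lower().split()]

# The whole sentence becomes a sequence of vectors processed together.
sentence_matrix = embed_sentence("The dog sleeps")
for vector in sentence_matrix:
    print(vector)
```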

For example:

  1. The system can recognize that «tea» and «coffee» often appear in similar contexts.

  2. Both words might be found in the context of the new word «bottled».

  3. However, in the training data, the word «bottled» occurs with only one of these words («tea»). As a result, machine translation uses the word «tea».
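
The idea that «tea» and «coffee» appear in similar contexts can be sketched by comparing the words that surround them. The toy corpus and the count-based context vectors below are assumptions for illustration; a neural network learns its vectors during training rather than counting neighbors directly.

```python
import math
from collections import Counter

# Toy monolingual corpus (hypothetical data).
corpus = [
    "i drink hot tea",
    "i drink hot coffee",
    "she buys bottled tea",
    "he drives a car",
]

def context_vector(word, sentences):
    """Count the words that share a sentence with the given word."""
    context = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            context.update(t for t in tokens if t != word)
    return context

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

tea = context_vector("tea", corpus)
coffee = context_vector("coffee", corpus)
car = context_vector("car", corpus)

# "tea" shares contexts with "coffee" but none with "car".
print(cosine(tea, coffee) > cosine(tea, car))
```

In this corpus «bottled» appears only next to «tea», which mirrors the situation described in step 3.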

The advantage of neural machine translation is that it considers the relationships between words, which results in a smoother translation. The disadvantage of the neural approach is that it sometimes lacks information about words that don't occur frequently enough, so the system isn't able to build an acceptable vector representation for them. Such rare words include uncommon names and toponyms.

Choosing the translation option and assessing the quality

As soon as the user enters text to translate, Yandex Translate sends this text to both systems: the neural network and the statistical translator.

The results obtained from both systems are evaluated by an algorithm based on the CatBoost machine learning method. The algorithm analyzes dozens of factors, from sentence length (short phrases and rare words are better translated by the statistical model) to syntax. The two translations are compared across all factors, and the best one is shown to the user.
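
The selection step can be sketched as follows. This is not the CatBoost-based algorithm itself: the two features, the thresholds, and the function names are hypothetical, and a hand-written rule stands in for a trained model that would weigh dozens of factors.

```python
def extract_features(source, known_words):
    """Compute two illustrative factors: sentence length and the share of rare words."""
    tokens = source.lower().split()
    rare = sum(1 for t in tokens if t not in known_words)
    return {
        "length": len(tokens),
        "rare_ratio": rare / len(tokens) if tokens else 0.0,
    }

def choose_translation(source, statistical_output, neural_output, known_words):
    """Pick one of the two candidate translations.

    Assumed rule: short inputs and inputs dominated by rare words go to the
    statistical result, everything else goes to the neural result.
    """
    features = extract_features(source, known_words)
    prefer_statistical = features["length"] <= 2 or features["rare_ratio"] > 0.5
    return statistical_output if prefer_statistical else neural_output

# Hypothetical vocabulary of words seen often enough in training.
known = {"the", "dog", "sleeps", "on", "sofa"}

print(choose_translation("Zagorsk", "statistical", "neural", known))
print(choose_translation("the dog sleeps on the sofa", "statistical", "neural", known))
```

A trained model would replace the fixed thresholds with weights learned from examples where the better of the two translations is already known.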

• Yandex Translate: Two models are better than one. (habr.com)