How the Yandex search works

Before your site can appear in search results, Yandex must discover it with the help of robots.

A robot is a system that crawls site pages and loads them into its database. Yandex has many robots. Saving pages to the database and then processing them with algorithms is called indexing. The loaded data is used to generate search results. The search results are regularly updated, and these updates may affect the site ranking.

There are several stages before a site appears in search results:

Stage 1. Crawling the site

Stage 2. Loading and processing (indexing) the data

Stage 3. Creating a database of the pages that can be included in the search results

Stage 4. Generating search results

Stage 1. Crawling the site

The robot determines which sites to crawl and how often, as well as how many pages to crawl on each of them.

When crawling them, the robot takes into account its list of already known pages.

Robots continually monitor the appearance of new links, content updates on previously downloaded pages, and page availability. They do this as long as:
  • A link to the page is placed on your own site or on a third-party site.
  • Indexing of the page is not prohibited in the robots.txt file.
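
The robots.txt condition above can be checked locally before the robot visits. The following is a minimal sketch that uses Python's standard robots.txt parser; the rules in ROBOTS_TXT and the example.com URLs are hypothetical, the user-agent token "Yandex" is an assumption, and the standard-library parser may interpret some directives differently from the Yandex robot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules do not prohibit user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "Yandex" is an assumed user-agent token; it falls under the "*" rules here.
print(is_crawl_allowed(ROBOTS_TXT, "Yandex", "https://example.com/catalog/"))     # True
print(is_crawl_allowed(ROBOTS_TXT, "Yandex", "https://example.com/admin/users"))  # False
```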

When the robot tries to load a site page, it receives a response from the server with an HTTP status code:

HTTP status code | Note
200 OK | The robot will crawl the page.
3XX | The robot needs to crawl the page that is the redirect target. Learn more about handling redirects.
4XX and 5XX | A page with this code won't be included in the search. If the page was in the search before the robot crawled it, it will be removed from the search.
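
The status code rules above can be summarized as a small lookup. This is only an illustrative sketch: the function name robot_action and the returned labels are hypothetical, and codes that the table does not mention (for example, other 2XX codes) are reported as unspecified rather than guessed.

```python
def robot_action(status: int) -> str:
    """Map an HTTP status code to the behavior described in the table."""
    if status == 200:
        return "crawl the page"
    if 300 <= status < 400:
        return "crawl the redirect target"
    if 400 <= status < 600:
        return "exclude from search"
    # The table does not cover any other codes.
    return "unspecified"


for code in (200, 301, 404, 503):
    print(code, "->", robot_action(code))
```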

Sometimes you may need to make a page temporarily unavailable to the robot and tell it to come back later, rather than have the page removed from search results. For example, a site page may look incorrect because of problems with the CMS, and you want the robot to index the page only after you fix the error. In this case, configure the server so that it responds with code 503 for the affected page. The robot will try to access the page again during its next few crawls. After the error is fixed, change the server response back.

Note. If the page responds with code 503 for a long time, it will be removed from the search.
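
One way to return code 503 for a single broken page is sketched below as a small WSGI application. This is a sketch under stated assumptions, not a recommended server setup: the path in UNDER_MAINTENANCE, the app function, and the status_for helper are all hypothetical, and a real site would normally configure this in its web server or CMS.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical set of paths that are temporarily broken.
UNDER_MAINTENANCE = {"/broken-page/"}


def app(environ, start_response):
    """Respond with 503 for pages under maintenance and 200 otherwise."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path in UNDER_MAINTENANCE:
        start_response("503 Service Unavailable", headers)
        return [b"Temporarily unavailable"]
    start_response("200 OK", headers)
    return [b"Page content"]


def status_for(path: str) -> str:
    """Call the app in-process and return the status line it sends."""
    captured = {}

    def start_response(status, response_headers):
        captured["status"] = status

    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    b"".join(app(environ, start_response))
    return captured["status"]


print(status_for("/broken-page/"))  # 503 Service Unavailable
print(status_for("/"))              # 200 OK
```

After the page is fixed, removing its path from UNDER_MAINTENANCE restores the 200 response.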

Stage 2. Loading and processing (indexing) the data

The robot determines the content of the page and saves it to its database. To do this, it analyzes the page content, for example:
  • The contents of the Description meta tag, the title element, and the Schema.org micro markup, which can be used to generate a page snippet.
  • The noindex directive in the robots meta tag. If it's found, the page won't be included in the search results.
  • The rel="canonical" attribute indicating the address that you consider a priority for displaying in the search results for a group of pages with the same content.
  • Text, images, and videos. If the robot determines that the content of several pages matches, it may treat them as duplicates.
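
The page elements listed above can be inspected with a short script before the robot processes the page. The sketch below uses Python's standard HTML parser; the PageSignals class, the extract_signals function, and the sample markup are hypothetical, and the sketch does not reproduce how the Yandex robot actually analyzes pages.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collect the title, Description, robots noindex, and canonical URL."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.noindex = False
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content") or ""
            if name == "description":
                self.description = content
            elif name == "robots":
                directives = {d.strip().lower() for d in content.split(",")}
                self.noindex = self.noindex or "noindex" in directives
        elif tag == "link":
            rels = (attrs.get("rel") or "").lower().split()
            if "canonical" in rels:
                self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_signals(html: str) -> dict:
    parser = PageSignals()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "noindex": parser.noindex,
        "canonical": parser.canonical,
    }


# Hypothetical sample page.
SAMPLE = """<html><head>
<title>Red shoes</title>
<meta name="description" content="Buy red shoes online.">
<link rel="canonical" href="https://example.com/shoes/red/">
</head><body><p>Text</p></body></html>"""

print(extract_signals(SAMPLE))
```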

Useful tools
  • Diagnostics — Helps check the quality of a site and fix errors, if any.
  • Crawl statistics — Shows which pages the robot has crawled and how often it accesses the site.
  • How to reindex a site — Allows you to report a new page on the site or an update of a page already included in the search.

Stage 3. Creating a database of the pages that can be included in the search results

Based on the information collected by the robot, the algorithms determine the pages that can be included in the search results. The algorithms take into account a variety of ranking and indexing factors that are used to make the final decision. For example, the database won't include pages with indexing disabled or duplicate pages.

A page may contain original, well-structured text and still not be added to the database if the algorithm considers it highly unlikely that the page will be seen by users in the search results. For example, this can happen because of low user demand or high competition in the topic.

Useful tools
  • Pages in search — Helps you track the status of site pages, for example, HTTP response status codes or duplicate pages.
  • Site security — Provides information about violations and infected files.

To find out if a site subdomain appears in the search results, subscribe to notifications.

FAQ

The page description in the snippet differs from the content of the Description meta tag
The page description in the search results is based on the text that is most relevant to the search query. This can be the content of the Description meta tag or the text placed on the page. For more information, see Displaying the site title and description in search results.
Search results show links to internal site frames
Before loading the page in the browser console, check if the parent frame with navigation is open. If not, open it.
My server doesn't provide last-modified

Your site will still be indexed even if your server doesn't provide last-modified document dates. However, you should keep in mind the following:

  • The date won't be displayed in the search results next to your site pages.

  • The robot won't know if a site page has been updated since it was last indexed. Modified pages will be indexed less often, as the number of pages that the robot gets from a site each time is limited.
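
If you want the server to provide this date, the Last-Modified header value uses the standard HTTP date format. The sketch below shows one way to build that value in Python; the function name last_modified_header and the sample timestamp are hypothetical, and how the header is attached to a response depends on your server or framework.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(modified_at: datetime) -> str:
    """Format an aware datetime as an HTTP date for the Last-Modified header."""
    return format_datetime(modified_at.astimezone(timezone.utc), usegmt=True)


# Hypothetical modification time of a page.
modified_at = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
print("Last-Modified:", last_modified_header(modified_at))
# Last-Modified: Wed, 01 May 2024 12:00:00 GMT
```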

How does encoding affect indexing?
The type of encoding used on the site doesn't affect the site indexing. If your server doesn't pass the encoding in the header, the Yandex robot will identify the encoding itself.
Can I manage reindexing frequency with the Revisit-After directive?
No. The Yandex robot ignores it.
Does Yandex index a site on a foreign domain?
Yes. Sites containing pages in Russian, Ukrainian, and Belarusian are indexed automatically. Sites in English, German, and French are indexed if they might be interesting to users.
Is the content of the frame and frameset elements indexed?
Yes. The Yandex robot indexes the content loaded in the frame and frameset elements and finds the source document.
How does a large number of URL parameters and the URL length affect indexing?

A large number of parameters and nested directories in the URL, or overly long URLs, may interfere with the site indexing.

A URL can be up to 1024 characters long.
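
A quick way to audit URLs against these points is sketched below. Only the 1024-character limit comes from the text; the url_report function and the reported fields are hypothetical, and the text gives no thresholds for how many parameters or nested directories are too many, so the sketch only counts them.

```python
from urllib.parse import parse_qsl, urlsplit

MAX_URL_LENGTH = 1024  # limit stated in the text


def url_report(url: str) -> dict:
    """Report the length, parameter count, and directory depth of a URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return {
        "length": len(url),
        "within_limit": len(url) <= MAX_URL_LENGTH,
        "param_count": len(parse_qsl(parts.query, keep_blank_values=True)),
        "directory_depth": len(segments),
    }


print(url_report("https://example.com/a/b/c/?x=1&y=2"))
```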

How do I switch the case of a page URL?
You can do this using one of these methods:
  • Set a 301 redirect to the pages with the correct case.
  • Specify a canonical URL on the page with the URL you want to change.
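
The first method can be sketched as a function that decides whether a request should get a 301 redirect. This sketch assumes that lowercase is the correct case for your URLs, which may not be true for your site; the function name case_redirect and the returned tuple shape are hypothetical, and in practice the redirect is usually configured in the web server.

```python
from typing import Optional, Tuple


def case_redirect(path: str, query: str = "") -> Optional[Tuple[str, str]]:
    """Return (status, location) for a 301 redirect, or None if none is needed.

    Assumption: the lowercase form of the path is the correct one.
    """
    target = path.lower()
    if target == path:
        return None
    if query:
        # Keep the query string unchanged; only the path case is normalized.
        target += "?" + query
    return ("301 Moved Permanently", target)


print(case_redirect("/Catalog/Shoes/"))
print(case_redirect("/catalog/shoes/"))
```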
Does the robot index GZIP archives?
Yes, the robot indexes archives in GZIP format (GNU ZIP compression).
Does the robot index anchor URLs (#)?

The Yandex robot doesn't index anchor URLs of pages, except for AJAX pages (URLs containing the #! characters). For example, the http://example.com/page/#title page won't get into the robot database. Instead, the robot will index the http://example.com/page/ page (the URL before the # character).
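
The rule above can be illustrated by stripping the fragment from a URL. This is a minimal sketch: the function name indexed_url is hypothetical, and returning "#!" URLs unchanged is an assumption, since the text says only that such pages are the exception.

```python
from urllib.parse import urldefrag


def indexed_url(url: str) -> str:
    """Return the URL without its anchor, leaving "#!" URLs untouched."""
    base, fragment = urldefrag(url)
    if fragment.startswith("!"):
        # AJAX-style "#!" URL: the exception named in the text (handling assumed).
        return url
    return base


print(indexed_url("http://example.com/page/#title"))  # http://example.com/page/
print(indexed_url("http://example.com/#!/catalog"))   # http://example.com/#!/catalog
```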

How does the robot index paginated pages?
The robot ignores the rel attribute with the prev and next values. This means that paginated pages can be indexed and included in the search without any restrictions.