Site indexing

  1. Add the site in Yandex.Webmaster.

  2. Sitemap. A Sitemap is a format that webmasters and search engines use to describe the structure of a site: a list of links to the site's internal pages, presented in XML. Yandex supports this format. You can upload the Sitemap for your site in the dedicated section of Yandex.Webmaster. Use it to set the crawl priority of particular pages. For example, if some pages are updated more often than others, indicate this in the Sitemap so that the robot can plan its crawling accordingly.
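
The Sitemap described in step 2 can be generated with a short script. The following is a minimal sketch in Python, assuming the standard sitemaps.org XML schema; the example.com URLs, the build_sitemap helper, and the chosen changefreq and priority values are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return Sitemap XML as a string for an iterable of page dicts.

    Each dict needs "loc"; "lastmod", "changefreq" and "priority"
    are optional hints for the robot.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages: the news section changes often, the archive rarely.
pages = [
    {"loc": "https://example.com/news/", "changefreq": "daily", "priority": 0.9},
    {"loc": "https://example.com/archive/", "changefreq": "yearly", "priority": 0.3},
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

The resulting string can be saved as a file (for example, sitemap.xml) and then submitted in Yandex.Webmaster.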

  3. Robots.txt is a file for search engine robots. In this file, the webmaster can specify indexing parameters for all robots or for each search engine separately. Here are the three most important parameters specified in this file:

    • Disallow. This directive prohibits indexing of certain site sections. Use it for technical pages and for pages that aren't important to users or search engines. Examples of pages to exclude are the site's search results pages, site statistics, duplicate pages, various logs, and database service pages. Read more in the help section about the robots.txt file.

    • Crawl-delay. Use it to specify the minimum interval (in seconds) between page requests. This option is useful for large projects with tens of thousands of pages or more, where the Yandex search robot can overload the site and cause disruptions and delays. In such cases it may be necessary to limit the request rate. For example, Crawl-delay: 2 tells the robot to wait 2 seconds between requests to the server.

    • Clean-param. Use it to indicate which CGI parameters in the page URL are unimportant. Sometimes page URLs contain session identifiers. Formally, pages with different IDs are different, but their content is the same. If the site has many such pages, the robot may spend its time indexing them rather than downloading the useful content. Read the help section for more information about using the Clean-param directive.

      You can view the list of indexed URLs on your site in Yandex.Webmaster. Check it regularly, because even small mistakes in the code can increase the number of unnecessary URLs on the site and overload it.
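
The three directives above can be tried out locally before the file is published. The sketch below is written in Python and uses the standard urllib.robotparser module, which understands Disallow and Crawl-delay but does not interpret Clean-param. For that reason the effect of Clean-param is only illustrated with a hypothetical helper, drop_params, and is not a description of how the Yandex robot works internally. The robots.txt content, the paths, and the "sid" parameter are made-up examples.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths and the "sid" parameter are made up.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /search/
Disallow: /stats/
Crawl-delay: 2
Clean-param: sid /forum/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Disallow: URLs under /search/ are closed to the robot, others stay open.
search_allowed = parser.can_fetch("Yandex", "https://example.com/search/?q=shoes")
catalog_allowed = parser.can_fetch("Yandex", "https://example.com/catalog/")

# Crawl-delay: minimum pause, in seconds, between requests.
delay = parser.crawl_delay("Yandex")


def drop_params(url, ignored):
    """Illustrate Clean-param: remove query parameters that do not change the content."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two URLs that differ only in the session ID collapse to the same address.
first = drop_params("https://example.com/forum/showthread.php?sid=111&t=8243", {"sid"})
second = drop_params("https://example.com/forum/showthread.php?sid=222&t=8243", {"sid"})
print(search_allowed, catalog_allowed, delay, first == second)
```

Checking a handful of important URLs this way helps catch a Disallow rule that is broader than intended.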

  4. Yandex indexes the main types of documents distributed online. But there are limitations that affect how the document is indexed and whether it is indexed at all:

    • A large number of CGI parameters in a URL, a large number of nested directories, and overly long URLs may interfere with document indexing.

    • The size of the document is important for indexing. Documents larger than 10 MB aren't indexed.

    • Indexing Flash:

      1. The robot indexes *.swf files if there is a direct link to them or they are embedded in the HTML with the "object" or "embed" tags.

      2. If the Flash contains useful content, the original HTML document can be found by the content indexed in the swf file.

    • In PDF documents, only text content is indexed. Text represented as images is not indexed.

    • Yandex indexes documents in the Office Open XML and OpenDocument formats (including Microsoft Office and OpenOffice documents). However, support for new formats can take some time.

    • You can use the <frameset> and <frame> tags. The Yandex robot indexes the content loaded in them and finds the source document based on the contents of frames.

  5. If you configure custom server behavior for non-existent URLs, make sure that the server still returns the 404 status code. Once the search engine receives the 404 code, it removes the document from the index. Also make sure that all necessary pages on the site respond with the 200 OK code.

  6. Make sure that the HTTP headers are correct. The server's response to a request with the If-Modified-Since header is particularly important. The Last-Modified header must contain the correct last modification date of the document.
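
The behavior described in steps 5 and 6 can be summarized as a small decision function. The following Python sketch is not tied to any particular web server: the PAGES table and the respond function are hypothetical. It assumes the usual HTTP convention that a request whose If-Modified-Since date is not older than the document's modification date is answered with 304 Not Modified.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical site content: path -> (body, last modification time in UTC).
PAGES = {
    "/": ("<h1>Home</h1>", datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)),
}


def respond(path, if_modified_since=None):
    """Return (status, headers, body) for a GET request."""
    if path not in PAGES:
        # Non-existent URL: always answer 404, even if a custom page is shown.
        return 404, {}, "Not Found"

    body, modified = PAGES[path]
    headers = {"Last-Modified": format_datetime(modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # ignore a malformed date and send the full page
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if modified <= since:
                # The robot already has the current version.
                return 304, headers, ""

    return 200, headers, body


status, headers, _ = respond("/")
print(status, headers["Last-Modified"])
print(respond("/", headers["Last-Modified"])[0])
print(respond("/no-such-page")[0])
```

A server that answers unknown URLs with 200 and a friendly error page would fail the first check in this sketch, which is exactly the situation step 5 warns about.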

  7. Place site versions adapted for mobile devices as well as language versions in subdomains.

Note.

Manage the Yandex robot and prohibit indexing for pages that are not intended for users.
