
Using robots.txt

What is robots.txt?

Robots.txt is a text file that contains site indexing parameters for search engine robots.

How to set up robots.txt

  1. Create a file named robots.txt in a text editor and fill it in following the guidelines below.
  2. Check the file using the Yandex.Webmaster service (Robots.txt analysis in the menu).
  3. Upload the file to your site's root directory (you can verify the result with the check sketched below).
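
To confirm that the file is in place, you can request it directly and check that the server returns it with a 200 OK status and a text content type. Below is a minimal sketch using Python's standard library; https://example.com is a placeholder for your own domain.

# Minimal check that robots.txt is reachable at the site root.
# "https://example.com" is a placeholder; substitute your own domain.
from urllib.request import urlopen

with urlopen("https://example.com/robots.txt") as response:
    body = response.read().decode("utf-8", errors="replace")
    print("HTTP status:", response.status)                        # expect 200
    print("Content-Type:", response.headers.get("Content-Type"))  # expect a text type
    print(body.splitlines()[:5])                                   # first lines of the file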

User-agent directive

The Yandex robot supports the robot exclusion standard with the enhanced capabilities that are described below.

The Yandex robot uses the session robot principle: for every session, the robot puts together a pool of webpages that it plans to visit.

A session begins when the robots.txt file is loaded. If the file is missing, is not a text file, or the robot's request returns an HTTP status other than 200 OK, the robot assumes that it has unrestricted access to the site's documents.

In the robots.txt file, the robot checks for records beginning with User-agent: and looks for either the substring Yandex (case-insensitive) or *. If it finds the line User-agent: Yandex, the directives for User-agent: * are disregarded. If both User-agent: Yandex and User-agent: * are absent, robot access is assumed to be unrestricted.

Separate directives can be entered for the following Yandex robots:

  • 'YandexBot' — the main indexing robot
  • 'YandexDirect' — downloads information about the content on Yandex Advertising Network partner sites for selecting relevant ads; interprets robots.txt in a special way
  • 'YandexDirectDyn' — generates dynamic banners and interprets robots.txt in a special way
  • 'YandexMedia' — robot used to index multimedia data
  • 'YandexImages' — indexing robot for Yandex.Images
  • 'YaDirectFetcher' — the Yandex.Direct robot; it interprets robots.txt in a special way
  • 'YandexBlogs' — blog search robot that indexes posts and comments
  • 'YandexNews' — Yandex.News robot
  • 'YandexPagechecker' — micro markup validator
  • 'YandexMetrika' — Yandex.Metrica robot
  • 'YandexCalendar' — Yandex.Calendar robot

If directives are found for a specific robot, User-agent: Yandex and User-agent: * are not used.

For example:

User-agent: YandexBot # will be used by the main indexing robot only
Disallow: /*id=

User-agent: Yandex # will be used by all Yandex robots
Disallow: /*sid= # except for the main indexing robot

User-agent: * # won't be used by Yandex robots
Disallow: /cgi-bin
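
The selection logic above can be sketched in a few lines of Python. This is only an illustration of the rule, not Yandex's actual parser: a robot named YandexBot picks the directives from its own record if present, otherwise from User-agent: Yandex, otherwise from User-agent: *.

# Illustrative sketch of User-agent record selection (not Yandex's actual code).
# 'groups' maps a lower-cased User-agent value to the directives of its record.
def pick_record(groups, robot_name="YandexBot"):
    for name in (robot_name.lower(), "yandex", "*"):
        if name in groups:
            return groups[name]
    return []  # no matching record: access is unrestricted

groups = {
    "yandexbot": ["Disallow: /*id="],
    "yandex": ["Disallow: /*sid="],
    "*": ["Disallow: /cgi-bin"],
}
print(pick_record(groups))  # ['Disallow: /*id='] -- the most specific record wins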

Disallow and Allow directives

If you don't want to allow robots to access your site or certain sections of it, use the Disallow directive.

Examples:

User-agent: Yandex
Disallow: / # blocks access to the whole site

User-agent: Yandex
Disallow: /cgi-bin # blocks access to pages  
                   # starting with '/cgi-bin'

In accordance with the standard, we recommend that you insert a blank line before every User-agent directive.

The # character marks a comment. Everything following this character, up to the first line break, is disregarded.

Use the Allow directive to allow the robot access to specific parts of the site or to the entire site.

Examples:

User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# forbids downloads of anything except for pages 
# starting with '/cgi-bin'
Note. Empty lines are not allowed between the User-agent, Disallow, and Allow directives.

Using directives jointly

The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect how they are used by the robot. Examples:

# Source robots.txt:
User-agent: Yandex
Allow: /catalog
Disallow: /
# Sorted robots.txt:
User-agent: Yandex
Disallow: /
Allow: /catalog
# only allows downloading pages
# starting with '/catalog'

# Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog
# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# disallows downloading pages starting with '/catalog',
# but allows downloading pages starting with '/catalog/auto'.
Note. If there is a conflict between two directives with prefixes of the same length, the Allow directive takes precedence.
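
As an illustration of the sorting rule and the tie-break described in the note, here is a rough Python sketch. It is not the robot's actual implementation: prefixes are treated literally, ignoring the special characters * and $ described below.

# Illustrative sketch of the Allow/Disallow resolution described above.
def is_allowed(url_path, rules):
    # rules: list of ('Allow' | 'Disallow', prefix) pairs from one User-agent block.
    # Sort by prefix length; on equal length put Allow last so it takes precedence.
    ordered = sorted(rules, key=lambda r: (len(r[1]), r[0] == "Allow"))
    verdict = True  # no matching rule means the page is allowed
    for directive, prefix in ordered:
        if url_path.startswith(prefix):
            verdict = (directive == "Allow")  # the last matching rule wins
    return verdict

rules = [("Allow", "/"), ("Allow", "/catalog/auto"), ("Disallow", "/catalog")]
print(is_allowed("/catalog/auto/page1", rules))  # True
print(is_allowed("/catalog/other", rules))       # False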

Allow and Disallow directives without parameters

If the directives don't contain parameters, the robot handles data in the following manner:

User-agent: Yandex
Disallow: # the same as Allow: /

User-agent: Yandex
Allow: # not taken into account by the robot

Using the special characters * and $

You can use the special characters * and $ when specifying paths for the Allow and Disallow directives, thus setting certain regular expressions. The * character matches any sequence of characters (including an empty one). Examples:

User-agent: Yandex
Disallow: /cgi-bin/*.aspx # disallow '/cgi-bin/example.aspx'
                          # and '/cgi-bin/private/test.aspx'
Disallow: /*private # disallow both '/private'
                    # and '/cgi-bin/private'

The $ character

By default, the * character is appended to the end of every rule described in the robots.txt file. For example:

User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages 
                    # starting with '/cgi-bin'
Disallow: /cgi-bin # the same

To cancel * at the end of the rule, you can use the $ character, for example:

User-agent: Yandex
Disallow: /example$ # disallows '/example', 
                    # but allows '/example.html'
User-agent: Yandex
Disallow: /example # disallows both '/example', 
                   # and '/example.html'

The $ character doesn't cancel a * specified at the end of the rule. In other words:

User-agent: Yandex
Disallow: /example$  # prohibits only '/example'
Disallow: /example*$ # exactly the same as 'Disallow: /example' 
                     # prohibits both /example.html and /example
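
One way to picture how * and $ work is to translate a rule into a regular expression: each * becomes '.*', a trailing $ anchors the match, and an implicit * is appended otherwise. The following is a rough sketch for illustration, not the robot's actual matcher.

import re

# Rough sketch of matching an Allow/Disallow value that uses * and $
# (for illustration only, not the robot's actual matcher).
def rule_to_regex(rule_value):
    anchored = rule_value.endswith("$")
    if anchored:
        rule_value = rule_value[:-1]
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in rule_value)
    # Without a trailing $, an implicit * is appended to the rule.
    return re.compile(pattern + ("$" if anchored else ".*"))

print(bool(rule_to_regex("/example$").match("/example")))        # True
print(bool(rule_to_regex("/example$").match("/example.html")))   # False
print(bool(rule_to_regex("/example*$").match("/example.html")))  # True, same as '/example'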

Sitemap directive

If you use a Sitemap file to describe your site's structure, indicate the path to the file as a parameter of the Sitemap directive (if you have multiple files, indicate all paths). Example:

User-agent: Yandex
Allow: /
Sitemap: http://example.com/site_structure/my_sitemaps1.xml
Sitemap: http://example.com/site_structure/my_sitemaps2.xml

The robot will remember the path to your file, process your data, and use the results during the next visit to your site.
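
If you want to check which Sitemap entries a standards-based parser sees in your file, Python's standard library exposes them through RobotFileParser.site_maps() (Python 3.8 and later). A minimal sketch; https://example.com is a placeholder for your own domain.

from urllib.robotparser import RobotFileParser

# Read back the Sitemap entries that a standards-based parser finds in the file.
# "https://example.com" is a placeholder; substitute your own domain.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()
print(parser.site_maps())  # a list of the Sitemap URLs, or None if there are none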

Host directive

If your site has mirrors, a special mirror bot (Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)) detects them and forms a mirror group for your site. Only the main mirror will participate in search. You can indicate which site is the main one in the robots.txt file: the name of the main mirror should be listed as the value of the Host directive.

The 'Host' directive does not guarantee that the specified main mirror will be selected. However, the decision-making algorithm will assign it a high priority. For example:

# If www.main-mirror.com is your site's main mirror, then
# robots.txt for all sites in the mirror group will look like this:
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: www.main-mirror.com

To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Host directive needs to be added to the group that starts with the User-Agent record right after the Disallow and Allow directives. The Host directive argument is the domain name with the port number (80 by default), separated by a colon.

#Example of a well-formed robots.txt file, where
#the Host directive will be taken into account during processing

User-Agent: *
Disallow:
Host: www.myhost.com

However, the Host directive is intersectional, so it will be used by the robot regardless of its location in robots.txt.

Note. For every robots.txt file, only one Host directive is processed. If several directives are indicated in the file, the robot will use the first one.

For example:

Host: myhost.ru # is used

User-agent: *
Disallow: /cgi-bin

User-agent: Yandex
Disallow: /cgi-bin
Host: www.myhost.ru # is not used

The Host directive should contain:

  • The protocol set to HTTPS if the mirror is only available via a secure channel (Host: https://myhost.com).

  • One valid domain name that conforms to RFC 952 and is not an IP address.

  • The port number, if necessary (Host: myhost.com:8080).

An incorrectly formed Host directive will be ignored.

# Examples of Host directives that will be ignored

Host: www.myhost-.com
Host: www.-myhost.com
Host: www.myhost.com:100000
Host: www.my_host.com
Host: .my-host.com:8000
Host: my-host.com.
Host: my..host.com
Host: www.myhost.com:8080/
Host: 213.180.194.129
Host: www.firsthost.ru,www.secondhost.com
Host: www.firsthost.ru www.secondhost.com
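
The checks implied by these examples can be approximated with a short script. This is a rough sketch for pre-checking a Host value before publishing robots.txt, not Yandex's actual validator, and it only covers the cases listed above.

import re

# Rough sketch of the checks implied by the examples above (not Yandex's validator).
HOST_RE = re.compile(
    r"^(https?://)?"                       # optional scheme for HTTPS-only mirrors
    r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)"       # first label: no leading or trailing '-'
    r"(\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))+"  # further labels: no empty labels, no '_', etc.
    r"(:\d{1,5})?$"                        # optional port number
)

def looks_like_valid_host(value):
    # IP addresses are not accepted as a Host value.
    if re.fullmatch(r"(https?://)?\d{1,3}(\.\d{1,3}){3}(:\d+)?", value):
        return False
    return bool(HOST_RE.match(value))

print(looks_like_valid_host("www.myhost.com:8080"))  # True
print(looks_like_valid_host("www.my_host.com"))      # False (underscore)
print(looks_like_valid_host("213.180.194.129"))      # False (IP address)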

Examples of Host directive use:

# domain.myhost.ru is the main mirror for
# www.domain.myhost.com, so the correct use of 
# the Host directive is:

User-Agent: *
Disallow:
Host: domain.myhost.ru

Crawl-delay directive

If the server is overloaded and can't process download requests in time, use the Crawl-delay directive. It lets you specify the minimum interval (in seconds) the search robot should wait after loading one page before starting to load the next.

To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent entry right after the Disallow and Allow directives.

The Yandex search robot supports fractional values for Crawl-delay, such as "0.5". This doesn't mean that the search robot will access your site every half second, but it may speed up site processing.

Examples:

User-agent: Yandex
Crawl-delay: 2 # sets a 2 second time-out

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a 4.5 second time-out
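
A crawler that honours Crawl-delay simply waits between requests. Below is a minimal sketch using Python's standard-library robots.txt parser, which exposes the value through crawl_delay(). The https://example.com address and the paths are placeholders, and note that the standard-library parser does not implement the Yandex-specific extensions described on this page.

import time
from urllib.robotparser import RobotFileParser

# Read the Crawl-delay value for a given user agent and pause between requests.
# "https://example.com" is a placeholder; substitute your own domain.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

delay = parser.crawl_delay("Yandex") or 0  # None if the directive is absent
for path in ["/page1", "/page2"]:
    print("fetching", path)  # a real crawler would download the page here
    time.sleep(delay)        # wait at least Crawl-delay seconds between requests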

Clean-param directive

If your site page addresses contain dynamic parameters that do not affect the content (e.g. identifiers of sessions, users, referrers etc.), you can describe them using the Clean-param directive.

Using this information, the Yandex robot will not repeatedly reload duplicate information. This will improve how efficiently the robot processes your site and reduce the server load.

For example, your site contains the following pages:

www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

The ref parameter is only used to track which resource the request was sent from, and does not change the content. All three addresses will display the same page with book_id=123. Then, if you indicate the directive in the following way:

User-agent: Yandex
Disallow:
Clean-param: ref /some_dir/get_book.pl

the Yandex robot will converge all the page addresses into one:

www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123

If a page without parameters is available on the site:

www.example.com/some_dir/get_book.pl?book_id=123

all other page addresses will be reduced to it after the robot indexes that page. Other pages of your site will be crawled more often, because there will be no need to update the pages:

www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

Directive syntax

Clean-param: p0[&p1&p2&..&pn] [path]

In the first field, list the parameters that must be disregarded, separated by the & symbol. In the second field, indicate the path prefix for the pages the rule should apply to.

Note. The Clean-Param directive is intersectional, so it can be indicated in any place within the robots.txt file. If several directives are specified, all of them will be taken into account by the robot.

The prefix can contain a regular expression in a format similar to the one used in the robots.txt file, but with a few restrictions: only the characters A-Za-z0-9.-/*_ can be used. However, * is interpreted in the same way as in robots.txt. A * is always implicitly appended to the end of the prefix. For example:

Clean-param: s /forum/showthread.php

means that the s parameter will be disregarded for all URLs that begin with /forum/showthread.php. The second field is optional; if it is omitted, the rule applies to all pages on the site. The rule is case-sensitive. The maximum length of a rule is 500 characters. For example:

Clean-param: abc /forum/showthread.php
Clean-param: sid&sort /forumt/*.php
Clean-param: someTrash&otherTrash
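
Conceptually, Clean-param tells the robot to treat addresses as the same page once the listed parameters are removed, for URLs whose path starts with the given prefix. The following rough sketch shows that normalization; it is an illustration only, not Yandex's implementation, and it ignores the * character in prefixes.

from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# Rough sketch of Clean-param normalization (not Yandex's implementation):
# drop the listed parameters when the URL path starts with the given prefix.
def clean_url(url, params, path_prefix):
    parts = urlparse(url)
    if not parts.path.startswith(path_prefix):
        return url
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in params]
    return urlunparse(parts._replace(query=urlencode(kept)))

urls = [
    "http://www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123",
    "http://www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123",
]
# Clean-param: ref /some_dir/get_book.pl
print({clean_url(u, {"ref"}, "/some_dir/get_book.pl") for u in urls})
# both addresses collapse to the same URL without the 'ref' parameter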

Additional examples

#for these types of addresses:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243

#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s /forum/showthread.php
#for these types of addresses:
www.example2.com/index.php?page=1&sort=3a&sid=2564126ebdec301c607e5df
www.example2.com/index.php?page=1&sort=3a&sid=974017dcd170d6c4a5d76ae

#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: sid /index.php
#if there are several of these parameters:
www.example1.com/forum_old/showthread.php?s=681498605&t=8243&ref=1311
www.example1.com/forum_new/showthread.php?s=1e71c417a&t=8243&ref=9896

#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s&ref /forum*/showthread.php
#if the parameter is used in multiple scripts:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/index.php?s=1e71c4427317a117a&t=8243

#robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s /forum/index.php
Clean-param: s /forum/showthread.php

Additional information

The Yandex robot doesn't support robots.txt directives that aren't shown on this page. The file processing rules described above represent an extension of the basic standard. Other robots may interpret robots.txt contents in different ways.

The results when using the extended robots.txt format may differ from results that use the basic standard, particularly:

User-agent: Yandex 
Allow: /
Disallow: /
# without extensions everything is disallowed since 'Allow: /' is ignored, 
# with extension support everything is allowed

User-agent: Yandex
Disallow: /private*html
# without extensions '/private*html' is disallowed, 
# but with extensions it disallows '/private*html', 
# and '/private/test.html', and '/private/html/test.aspx' etc.

User-agent: Yandex
Disallow: /private$
# without extensions, '/private$' and '/private$test' etc. are disallowed, 
# but with extensions, only '/private' is disallowed

User-agent: *
Disallow: /
User-agent: Yandex
Allow: /
# without extensions due to no empty line break, 
# 'User-agent: Yandex' would be ignored and  
# the result would be 'Disallow: /', but the Yandex robot 
# selects entries that have 'User-agent:' in the line, 
# so the result for the Yandex robot in this case is 'Allow: /'

User-agent: *
Disallow: /
# commentary1...
# commentary2...
# commentary3...
User-agent: Yandex
Allow: /
# same as in the previous example (see above)

Examples of extended robots.txt format use:

User-agent: Yandex
Allow: /archive
Disallow: /
# allows everything that starts with '/archive'; everything else is disallowed

User-agent: Yandex
Allow: /obsolete/private/*.html$ # allows html files
                                 # at the path '/obsolete/private/...'
Disallow: /*.php$  # disallows all '*.php' on site
Disallow: /*/private/ # disallows all subpaths containing
                      # '/private/', but the Allow above negates
                      # part of the disallow
Disallow: /*/old/*.zip$ # disallows all '*.zip' files containing
                        # '/old/' in the path

User-agent: Yandex
Disallow: /add.php?*user= 
# disallows all 'add.php?' scripts with the 'user' parameter

When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything, meaning it is interpreted the same way as:

User-agent: Yandex
Disallow:

Similarly, robots.txt is assumed to allow everything if it couldn't be accessed (for example, if the HTTP headers are not set properly or a 404 Not found HTTP status message is returned).

Exceptions

A number of Yandex robots download web documents for purposes other than indexing. Since site owners could otherwise block them unintentionally, they may not follow the robots.txt limiting directives intended for arbitrary robots (User-agent: *).

Robots.txt restrictions may also be partially ignored for certain sites if there is an agreement between Yandex and the owners of those sites.

Attention! If such a robot downloads a document that the main Yandex robot can't access, this document will never be indexed and won't be found in search results.

Here is a list of Yandex robots that don't follow general limiting rules in robots.txt:

  • YaDirectFetcher downloads ad landing pages to check their availability and content. This is compulsory for placing ads in Yandex search results and YAN partner sites.
  • YandexCalendar regularly downloads calendar files requested by users, even if they are located in directories that are blocked from indexing.
  • 'YandexDirect' downloads information about YAN partner site content in order to clarify what their topics are so that relevant ads can be selected.
  • YandexDirectDyn is the robot that generates dynamic banners.
  • YandexMobileBot downloads documents for analysis in order to determine if their page layouts are suitable for mobile devices.
  • YandexAccessibilityBot downloads pages to check how accessible they are for users.
  • YandexScreenshotBot takes a screenshot of a page.
  • YandexMetrika is the Yandex.Metrica robot.
  • YandexVideoParser is the Yandex.Video indexer.

To prevent this behavior, you can restrict access for these robots to some or all of your site using the following disallow robots.txt directives, for example:

User-agent: YaDirectFetcher
Disallow: /
User-agent: YandexMobileBot
Disallow: /private/*.txt$