Robots.txt is a text file that contains site indexing parameters for search engine robots.
The Yandex robot supports the robot exclusion standard with the enhanced capabilities that are described below.
The Yandex robot uses the session robot principle: for every session, a given pool of webpages is put together that the robot plans to visit.
A session begins when the robots.txt file is loaded. If the file is missing, is not a text file, or the robot's request returns an HTTP status other than 200 OK, the robot assumes that it has unrestricted access to the site's documents.
In the robots.txt file, the robot checks for records beginning with User-agent: and looks for either the substring Yandex (case doesn't matter) or * . If it finds the line User-agent: Yandex, the directives for User-agent: * are disregarded. If the lines User-agent: Yandex and User-agent: * are absent, robot access is assumed to be unrestricted.
Separate directives can be entered for the following Yandex robots:
If directives are found for a specific robot, User-agent: Yandex and User-agent: * are not used.
User-agent: YandexBot # used only by the main indexing robot
Disallow: /*id=

User-agent: Yandex # used by all Yandex robots
Disallow: /*sid= # except the main indexing robot

User-agent: * # not used by Yandex robots
Disallow: /cgi-bin
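The selection order described above (a block for the specific robot, then User-agent: Yandex, then User-agent: *) can be sketched in Python. This is a minimal illustration, not the robot's actual implementation: the function names parse_groups and select_group are hypothetical, and the sketch compares whole robot names rather than searching for the substring Yandex.

```python
def parse_groups(text):
    """Split robots.txt text into {lowercased user agent: [(field, value), ...]}."""
    groups = {}
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines are being read
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new block starts here
            reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def select_group(groups, robot_name):
    """Pick the rules for robot_name: its own block, else 'yandex', else '*'."""
    for key in (robot_name.lower(), "yandex", "*"):
        if key in groups:
            return groups[key]
    return []  # no matching block: access is assumed to be unrestricted
```

With the example file above, select_group(parse_groups(text), "YandexBot") returns only the Disallow: /*id= rule, while any other Yandex robot name falls back to the User-agent: Yandex block.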
If you don't want to allow robots to access your site or certain sections of it, use the Disallow directive.
User-agent: Yandex
Disallow: / # blocks access to the whole site

User-agent: Yandex
Disallow: /cgi-bin # blocks access to pages starting with '/cgi-bin'
In accordance with the standard, we recommend that you insert a blank line before every User-agent directive.
The # character starts a comment. Everything after this character, up to the first line break, is disregarded.
Use the Allow directive to allow the robot access to specific parts of the site or to the entire site.
User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# forbids downloading anything except pages starting with '/cgi-bin'
The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect how they are used by the robot. Examples:
# Source robots.txt:
User-agent: Yandex
Allow: /catalog
Disallow: /

# Sorted robots.txt:
User-agent: Yandex
Disallow: /
Allow: /catalog
# only allows downloading pages starting with '/catalog'

# Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog

# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# disallows downloading pages starting with '/catalog',
# but allows downloading pages starting with '/catalog/auto'
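The sorting rule can be sketched as follows. This is a simplified illustration that handles plain prefixes only (no * or $), and the names sort_rules and is_allowed are hypothetical. The tie-break for rules of equal length (Allow placed after Disallow) is an assumption inferred from the example later on this page in which Allow: / combined with Disallow: / allows everything.

```python
def sort_rules(rules):
    """Sort (kind, prefix) pairs by prefix length, shortest first.

    Assumption: when two prefixes have the same length, 'allow' is placed
    after 'disallow' so that it wins the tie.
    """
    return sorted(rules, key=lambda rule: (len(rule[1]), rule[0] == "allow"))


def is_allowed(rules, path):
    """Apply the sorted rules in order; the last matching rule decides."""
    verdict = True  # if nothing matches, access is allowed
    for kind, prefix in sort_rules(rules):
        if path.startswith(prefix):
            verdict = (kind == "allow")
    return verdict
```

For the second example above, is_allowed reports that /catalog/auto/x is allowed and /catalog/moto is not, regardless of the order in which the three rules are listed.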
If a directive has no parameter, the robot handles it as follows:
User-agent: Yandex
Disallow: # the same as Allow: /

User-agent: Yandex
Allow: # not taken into account by the robot
You can use the special characters * and $ when specifying paths for the Allow and Disallow directives to define simple patterns. The * character matches any sequence of characters, including an empty one. Examples:
User-agent: Yandex
Disallow: /cgi-bin/*.aspx # disallows '/cgi-bin/example.aspx' and '/cgi-bin/private/test.aspx'
Disallow: /*private # disallows both '/private' and '/cgi-bin/private'
By default, the * character is appended to the end of every rule described in the robots.txt file. For example:
User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages starting with '/cgi-bin'
Disallow: /cgi-bin # the same
To cancel * at the end of the rule, you can use the $ character, for example:
User-agent: Yandex
Disallow: /example$ # disallows '/example', but allows '/example.html'

User-agent: Yandex
Disallow: /example # disallows both '/example' and '/example.html'

User-agent: Yandex
Disallow: /example$ # prohibits only '/example'
Disallow: /example*$ # exactly the same as 'Disallow: /example':
                     # prohibits both '/example.html' and '/example'
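One way to reason about * and $ is to translate each rule into a regular expression. The sketch below is an illustration under that assumption, not the robot's actual matcher; rule_to_regex and rule_matches are hypothetical names.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule containing * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape every literal piece, then join the pieces with "any characters".
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    if not anchored:
        pattern += ".*"  # without a trailing $, the rule behaves as if it ended with *
    return re.compile(pattern, re.DOTALL)


def rule_matches(rule, path):
    """Return True if the whole path is covered by the rule."""
    return rule_to_regex(rule).fullmatch(path) is not None
```

For instance, rule_matches("/example$", "/example.html") is False, while rule_matches("/example", "/example.html") and rule_matches("/example*$", "/example.html") are both True, which matches the comments in the examples above.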
If you use a Sitemap file to describe your site's structure, indicate the path to the file as a parameter of the Sitemap directive (if you have multiple files, indicate all paths). Example:
User-agent: Yandex
Allow: /
Sitemap: http://example.com/site_structure/my_sitemaps1.xml
Sitemap: http://example.com/site_structure/my_sitemaps2.xml
The robot will remember the path to your file, process your data, and use the results during the next visit to your site.
If your site has mirrors, special mirror bots (Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)) detect them and form a mirror group for your site. Only the main mirror will participate in search. You can indicate which site is the main one in the robots.txt file. The name of the main mirror should be listed as the value of the Host directive.
The 'Host' directive does not guarantee that the specified main mirror will be selected. However, the decision-making algorithm will assign it a high priority. For example:
# If www.main-mirror.com is your site's main mirror, then
# robots.txt for all your sites from the mirror group will look like this:
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: www.main-mirror.com
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Host directive needs to be added to the group that starts with the User-Agent record right after the Disallow and Allow directives. The Host directive argument is the domain name with the port number (80 by default), separated by a colon.
# Example of a well-formed robots.txt file, where
# the Host directive will be taken into account during processing
User-Agent: *
Disallow:
Host: www.myhost.com
However, the Host directive is cross-sectional: the robot uses it regardless of where it appears in robots.txt.
Host: myhost.ru # is used

User-agent: *
Disallow: /cgi-bin

User-agent: Yandex
Disallow: /cgi-bin
Host: www.myhost.ru # is not used
The Host directive should contain:
The protocol set to HTTPS if the mirror is only available via a secure channel (Host: https://myhost.com).
One valid domain name that conforms to RFC 952 and is not an IP address.
The port number, if necessary (Host: myhost.com:8080).
An incorrectly formed Host directive will be ignored.
# Examples of Host directives that will be ignored
Host: www.myhost-.com
Host: www.-myhost.com
Host: www.myhost.com:100000
Host: www.my_host.com
Host: .my-host.com:8000
Host: my-host.com.
Host: my..host.com
Host: www.myhost.com:8080/
Host: 126.96.36.199
Host: www.firsthost.ru,www.secondhost.com
Host: www.firsthost.ru www.secondhost.com
Examples of Host directive use:
# domain.myhost.ru is the main mirror for
# www.domain.myhost.com, so the correct use of
# the Host directive is:
User-Agent: *
Disallow:
Host: domain.myhost.ru
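The requirements for a Host value can be approximated with a small validator. The exact checks the robot performs are not specified on this page, so the sketch below is an assumption built from the listed requirements and the ignored examples; is_valid_host_value is a hypothetical name, and the label pattern is a loose reading of RFC 952 (letters, digits and hyphens, with no leading or trailing hyphen).

```python
import ipaddress
import re

# One domain label: letters, digits, hyphens; no hyphen at either end.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_host_value(value):
    """Return True if value looks like a well-formed Host directive argument."""
    if value.startswith("https://"):
        value = value[len("https://"):]
    host, colon, port = value.partition(":")
    if colon:  # a colon must be followed by a port number in 1..65535
        if not re.fullmatch(r"[0-9]{1,5}", port) or not 1 <= int(port) <= 65535:
            return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not an IP address, keep checking
    else:
        return False  # IP addresses are not accepted
    # Empty labels (leading, trailing or doubled dots) fail the pattern.
    return all(LABEL.fullmatch(label) for label in host.split("."))
```

Under these assumptions, every value in the list of ignored examples above is rejected, while values such as www.main-mirror.com, myhost.com:8080 and https://myhost.com are accepted.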
If the server is overloaded and it isn't possible to process downloading requests, use the Crawl-delay directive. You can specify the minimum interval (in seconds) for a search robot to wait after loading one page, before starting to load another.
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent entry right after the Disallow and Allow directives.
The Yandex search robot supports fractional values for Crawl-Delay, such as "0.5". This does not mean that the search robot will access your site every half a second, but it may speed up the site processing.
User-agent: Yandex
Crawl-delay: 2 # sets a 2 second time-out

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a 4.5 second time-out
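As a rough illustration of how a crawler might honor this directive, the sketch below reads a Crawl-delay value (fractional values included) and computes the earliest moment the next page load may start. Both function names are hypothetical, and the handling of invalid values is an assumption, since this page does not describe it.

```python
def parse_crawl_delay(value):
    """Return the delay in seconds as a float, or None if the value is unusable."""
    try:
        delay = float(value)
    except ValueError:
        return None
    # Assumption: negative, infinite or NaN values are treated as unusable.
    if delay >= 0 and delay != float("inf"):
        return delay
    return None


def earliest_next_start(previous_load_finished_at, delay):
    """Earliest time (in seconds) at which the next page load may begin."""
    return previous_load_finished_at + delay
```

For example, with Crawl-delay: 4.5 and a page that finished loading at t = 10.0 seconds, the next load would start no earlier than t = 14.5.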
If your site page addresses contain dynamic parameters that do not affect the content (e.g. identifiers of sessions, users, referrers etc.), you can describe them using the Clean-param directive.
Using this information, the Yandex robot will not repeatedly reload duplicate content. This makes the robot's processing of your site more efficient and reduces the server load.
For example, your site contains the following pages:
www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123
The ref parameter is only used to track which resource the request was sent from, and does not change the content. All three addresses will display the same page with book_id=123. Then, if you indicate the directive in the following way:
User-agent: Yandex
Disallow:
Clean-param: ref /some_dir/get_book.pl
the Yandex robot will reduce all of these page addresses to a single one. If a page without the ref parameter is available on the site (for example, www.example.com/some_dir/get_book.pl?book_id=123), everything will be reduced to that page once the robot indexes it. Other pages of your site will be processed more often, because the robot no longer needs to refresh the duplicate addresses.
The directive syntax is:
Clean-param: p0[&p1&p2&..&pn] [path]
In the first field, list the parameters that must be disregarded, separated by the & symbol. In the second field, indicate the path prefix for the pages the rule should apply to.
The prefix can contain a pattern in a format similar to the one used in robots.txt rules, but with a few restrictions: only the characters A-Za-z0-9.-/*_ can be used. The * character is interpreted in the same way as in robots.txt, and a * is always implicitly appended to the end of the prefix. For example:
Clean-param: s /forum/showthread.php
means that the s parameter will be disregarded for all URLs that begin with /forum/showthread.php. The second field is optional; if it is omitted, the rule applies to all pages on the site. Rules are case sensitive. The maximum length of a rule is 500 characters. For example:
Clean-param: abc /forum/showthread.php
Clean-param: sid&sort /forumt/*.php
Clean-param: someTrash&otherTrash

# for these types of addresses:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243
# robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s /forum/showthread.php

# for these types of addresses:
www.example2.com/index.php?page=1&sort=3a&sid=2564126ebdec301c607e5df
www.example2.com/index.php?page=1&sort=3a&sid=974017dcd170d6c4a5d76ae
# robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: sid /index.php

# if there are several of these parameters:
www.example1.com/forum_old/showthread.php?s=681498605&t=8243&ref=1311
www.example1.com/forum_new/showthread.php?s=1e71c417a&t=8243&ref=9896
# robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s&ref /forum*/showthread.php

# if the parameter is used in multiple scripts:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/index.php?s=1e71c4427317a117a&t=8243
# robots.txt will contain:
User-agent: Yandex
Disallow:
Clean-param: s /forum/index.php
Clean-param: s /forum/showthread.php
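The effect of Clean-param on duplicate detection can be sketched by stripping the listed parameters from URLs whose path matches the prefix. This is an illustration only: clean_url and prefix_to_regex are hypothetical names, the example URLs are given an http:// scheme so that the standard library can parse them, and which address the robot actually keeps for a group of duplicates is decided by the robot, not by this code.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def prefix_to_regex(prefix):
    """Turn a Clean-param path prefix (with optional *) into a start-anchored regex."""
    return re.compile(".*".join(re.escape(part) for part in prefix.split("*")))


def clean_url(url, params, prefix=""):
    """Drop the listed query parameters when the URL path starts with prefix.

    params uses the directive's own format, e.g. "s&ref".
    An empty prefix applies the rule to every page.
    """
    parts = urlsplit(url)
    if not prefix_to_regex(prefix).match(parts.path):
        return url
    ignored = set(params.split("&"))  # comparison is case sensitive
    kept = [pair for pair in parts.query.split("&")
            if pair and pair.split("=", 1)[0] not in ignored]
    return urlunsplit(parts._replace(query="&".join(kept)))
```

Applied with params "s" and prefix "/forum/showthread.php", the two showthread.php addresses from the first example both reduce to the same URL with only t=8243 left in the query string, so they can be recognized as one page.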
The Yandex robot doesn't support robots.txt directives that aren't shown on this page. The file processing rules described above represent an extension of the basic standard. Other robots may interpret robots.txt contents in different ways.
The results when using the extended robots.txt format may differ from results that use the basic standard, particularly:
User-agent: Yandex
Allow: /
Disallow: /
# without extensions everything is disallowed since 'Allow: /' is ignored,
# with extension support everything is allowed

User-agent: Yandex
Disallow: /private*html
# without extensions '/private*html' is disallowed,
# but with extensions it disallows '/private*html',
# and '/private/test.html', and '/private/html/test.aspx' etc.

User-agent: Yandex
Disallow: /private$
# without extensions, '/private$' and '/private$test' etc. are disallowed,
# but with extensions, only '/private' is disallowed

User-agent: *
Disallow: /
User-agent: Yandex
Allow: /
# without extensions due to no empty line break,
# 'User-agent: Yandex' would be ignored and
# the result would be 'Disallow: /', but the Yandex robot
# selects entries that have 'User-agent:' in the line,
# so the result for the Yandex robot in this case is 'Allow: /'

User-agent: *
Disallow: /
# commentary1...
# commentary2...
# commentary3...
User-agent: Yandex
Allow: /
# same as in the previous example (see above)
Examples of extended robots.txt format use:
User-agent: Yandex
Allow: /archive
Disallow: /
# allows everything that starts with '/archive'; everything else is disallowed

User-agent: Yandex
Allow: /obsolete/private/*.html$ # allows html files at the path '/obsolete/private/...'
Disallow: /*.php$ # disallows all '*.php' on site
Disallow: /*/private/ # disallows all subpaths containing '/private/',
                      # but the Allow above negates part of the disallow
Disallow: /*/old/*.zip$ # disallows all '*.zip' files containing '/old/' in the path

User-agent: Yandex
Disallow: /add.php?*user=
# disallows all 'add.php?' scripts with the 'user' parameter
When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything, meaning it is interpreted the same way as:
User-agent: Yandex
Disallow:
Similarly, robots.txt is assumed to allow everything if it couldn't be accessed (for example, if the HTTP headers are not set properly or a 404 Not found HTTP status message is returned).
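The fallback behavior described above can be summarized in a short sketch. The name effective_robots is hypothetical, 32 KB is taken here to mean 32 * 1024 bytes, and treating bytes that cannot be decoded as UTF-8 as "not a text file" is an assumption made for illustration.

```python
MAX_ROBOTS_SIZE = 32 * 1024  # assumption: 32 KB counted as 32 * 1024 bytes
ALLOW_ALL = "User-agent: Yandex\nDisallow:\n"


def effective_robots(status, body):
    """Return the rules text to apply, or allow-all when the file is unusable.

    status is the HTTP status code of the robots.txt request,
    body is the raw response body as bytes (or None if nothing was received).
    """
    if status != 200 or body is None:
        return ALLOW_ALL  # the file could not be accessed
    if len(body) > MAX_ROBOTS_SIZE:
        return ALLOW_ALL  # the file exceeds the size limit
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return ALLOW_ALL  # assumption: undecodable content is not a text file
```

A 404 response or an oversized file therefore yields the same result as an explicit empty Disallow rule.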
A number of Yandex robots download web documents for purposes other than indexing. To avoid being unintentionally blocked by site owners, they may ignore the restrictive robots.txt directives intended for arbitrary robots (User-agent: *).
It's also possible to partially ignore robots.txt restrictions for certain sites if there is an agreement between “Yandex” and the owners of those sites.
Here is a list of Yandex robots that don't follow general limiting rules in robots.txt:
To prevent this behavior, you can restrict these robots' access to some or all of your site with Disallow directives addressed to them by name, for example:
User-agent: YaDirectFetcher
Disallow: /

User-agent: YandexMobileBot
Disallow: /private/*.txt$