Using robots.txt

    What is robots.txt?

    Robots.txt is a text file that contains site indexing parameters for search engine robots.

    How to set up robots.txt

    1. Create a file named robots.txt in a text editor and fill it in following the guidelines below.
    2. Check the file using the Yandex.Webmaster service (Robots.txt analysis in the menu).
    3. Upload the file to your site's root directory.

    User-agent directive

    The Yandex robot supports the robot exclusion standard with the enhanced capabilities that are described below.

    The Yandex robot uses a session-based approach: for every session, it builds a pool of pages that it plans to visit.

    A session begins when the robots.txt file is loaded. If the file is missing, is not a text file, or the robot's request returns an HTTP status other than 200 OK, the robot assumes that it has unrestricted access to the site's documents.
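
    For example, you can quickly confirm that your robots.txt is reachable and served with 200 OK. Below is a minimal sketch using Python's standard library; example.com stands in for your own domain.

    # A quick check (sketch) that robots.txt at the site root returns HTTP 200.
    # Replace example.com with your own domain.
    import urllib.error
    import urllib.request

    req = urllib.request.Request("https://example.com/robots.txt",
                                 headers={"User-Agent": "robots-txt-check"})
    try:
        with urllib.request.urlopen(req) as response:
            print(response.status)    # any status other than 200 OK means the
                                      # robot assumes unrestricted access
    except urllib.error.HTTPError as err:
        print("request failed with HTTP status", err.code)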

    In the robots.txt file, the robot checks for records beginning with User-agent: and looks for either the substring Yandex (case-insensitive) or *. If it finds the line User-agent: Yandex, the directives for User-agent: * are disregarded. If both User-agent: Yandex and User-agent: * are absent, robot access is assumed to be unrestricted.

    Separate directives can be entered for the following Yandex robots:

    • 'YandexBot' — the main indexing robot
    • 'YandexDirect' — downloads information about the content on Yandex Advertising Network partner sites for selecting relevant ads; interprets robots.txt in a special way
    • 'YandexDirectDyn' — generates dynamic banners and interprets robots.txt in a special way
    • 'YandexMedia' — robot used to index multimedia data
    • 'YandexImages' — indexing robot for Yandex.Images
    • 'YaDirectFetcher' — the Yandex.Direct robot; it interprets robots.txt in a special way
    • 'YandexBlogs' — blog search robot that indexes posts and comments
    • 'YandexNews' — the Yandex.News robot
    • 'YandexPagechecker' — the micro markup validator
    • 'YandexMetrika' — the Yandex.Metrica robot
    • 'YandexCalendar' — the Yandex.Calendar robot

    If directives are found for a specific robot, User-agent: Yandex and User-agent: * are not used.

    For example:

    User-agent: YandexBot # will be used only by the main indexing robot
    Disallow: /*id=

    User-agent: Yandex # will be used by all Yandex robots
    Disallow: /*sid= # except the main indexing robot

    User-agent: * # won't be used by Yandex robots
    Disallow: /cgi-bin
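
    The way a record is selected can be pictured with a short Python sketch. It is only an illustration of the rule above, not part of any Yandex tooling; select_group and the sample data are invented for the example.

    # Illustration of User-agent record selection: a record addressed to the
    # specific robot wins over 'Yandex', which in turn wins over '*'.
    def select_group(groups, robot_name="YandexBot"):
        """groups maps a User-agent value to its list of directives."""
        lowered = {ua.lower(): rules for ua, rules in groups.items()}
        for candidate in (robot_name.lower(), "yandex", "*"):
            if candidate in lowered:
                return lowered[candidate]
        return []  # no matching record: access is unrestricted

    groups = {
        "YandexBot": ["Disallow: /*id="],
        "Yandex": ["Disallow: /*sid="],
        "*": ["Disallow: /cgi-bin"],
    }
    print(select_group(groups, "YandexBot"))     # ['Disallow: /*id=']
    print(select_group(groups, "YandexImages"))  # falls back to the 'Yandex' record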

    Disallow and Allow directives

    If you don't want to allow robots to access your site or certain sections of it, use the Disallow directive.

    Examples:

    User-agent: Yandex
    Disallow: / # blocks access to the whole site
    
    User-agent: Yandex
    Disallow: /cgi-bin # blocks access to pages  
                       # starting with '/cgi-bin'

    In accordance with the standard, we recommend that you insert a blank line before every User-agent directive.

    The # character marks a comment. Everything following this character, up to the first line break, is disregarded.

    Use the Allow directive to allow the robot access to specific parts of the site or to the entire site.

    Examples:

    User-agent: Yandex
    Allow: /cgi-bin
    Disallow: /
    # forbids downloads of anything except for pages 
    # starting with '/cgi-bin'
    Note. Empty line breaks are not allowed between the User-agent, Disallow and Allow directives.
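
    If Python is available, the standard urllib.robotparser module offers a quick way to sanity-check simple Disallow/Allow rules locally. Note that it implements the original standard, so rules relying on the Yandex extensions described below may evaluate differently.

    # Sanity-check basic Disallow rules with Python's standard library.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.parse("""
    User-agent: *
    Disallow: /cgi-bin
    """.splitlines())

    print(parser.can_fetch("Yandex", "http://example.com/cgi-bin/test"))  # False
    print(parser.can_fetch("Yandex", "http://example.com/index.html"))    # True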

    Using directives jointly

    The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect how they are used by the robot. Examples:

    # Source robots.txt:
    User-agent: Yandex
    Allow: /catalog
    Disallow: /

    # Sorted robots.txt:
    User-agent: Yandex
    Disallow: /
    Allow: /catalog
    # only allows downloading pages
    # starting with '/catalog'

    # Source robots.txt:
    User-agent: Yandex
    Allow: /
    Allow: /catalog/auto
    Disallow: /catalog

    # Sorted robots.txt:
    User-agent: Yandex
    Allow: /
    Disallow: /catalog
    Allow: /catalog/auto
    # disallows downloading pages starting with '/catalog',
    # but allows downloading pages starting with '/catalog/auto'.
    Note. If there is a conflict between two directives with prefixes of the same length, the Allow directive takes precedence.
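
    As an illustration of this rule (not Yandex's actual implementation; is_allowed and its inputs are invented for the example), the selection can be sketched in Python using plain prefixes, leaving the * and $ wildcards aside.

    # Illustration of the longest-prefix rule for Allow/Disallow described above.
    def is_allowed(url_path, directives):
        """directives: list of ('Allow'|'Disallow', prefix) pairs from one block."""
        # Sort by prefix length; on equal length, Allow takes precedence.
        ordered = sorted(directives, key=lambda d: (len(d[1]), d[0] == "Allow"))
        verdict = True  # nothing matched: access is allowed
        for rule, prefix in ordered:
            if url_path.startswith(prefix):
                verdict = (rule == "Allow")  # the longest matching prefix wins
        return verdict

    rules = [("Allow", "/catalog/auto"), ("Disallow", "/catalog"), ("Allow", "/")]
    print(is_allowed("/catalog/auto/bmw", rules))  # True
    print(is_allowed("/catalog/phones", rules))    # False
    print(is_allowed("/about", rules))             # True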

    Allow and Disallow directives without parameters

    If the directives don't contain parameters, the robot handles data in the following manner:

    User-agent: Yandex
    Disallow: # the same as Allow: /
    
    User-agent: Yandex
    Allow: # not taken into account by the robot

    Using the special characters * and $

    You can use the special characters * and $ when specifying paths for the Allow and Disallow directives, forming certain regular expressions this way. The * character matches any sequence of characters (including an empty one). Examples:

    User-agent: Yandex
    Disallow: /cgi-bin/*.aspx # disallow '/cgi-bin/example.aspx'
                              # and '/cgi-bin/private/test.aspx'
    Disallow: /*private # disallow both '/private'
                        # and '/cgi-bin/private'

    The $ character

    By default, the * character is appended to the end of every rule described in the robots.txt file. For example:

    User-agent: Yandex
    Disallow: /cgi-bin* # blocks access to pages 
                        # starting with '/cgi-bin'
    Disallow: /cgi-bin # the same

    To cancel * at the end of the rule, you can use the $ character, for example:

    User-agent: Yandex
    Disallow: /example$ # disallows '/example', 
                        # but allows '/example.html'

    User-agent: Yandex
    Disallow: /example # disallows both '/example', 
                       # and '/example.html'

    The $ character doesn't cancel a * that is explicitly specified at the end of the rule; in other words:

    User-agent: Yandex
    Disallow: /example$  # prohibits only '/example'
    Disallow: /example*$ # exactly the same as 'Disallow: /example' 
                         # prohibits both /example.html and /example
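
    Roughly translated into a regular expression, the combined behavior of * and $ looks like this. This is a hedged sketch only; pattern_to_regex is an invented helper, and Yandex's actual matcher is not published.

    # Rough sketch: turn an Allow/Disallow path pattern into a regular expression.
    import re

    def pattern_to_regex(pattern):
        # An implicit * is appended unless the pattern ends with $.
        anchored = pattern.endswith("$")
        body = pattern[:-1] if anchored else pattern
        regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
        return re.compile("^" + regex + ("$" if anchored else ".*"))

    print(bool(pattern_to_regex("/example$").match("/example")))        # True
    print(bool(pattern_to_regex("/example$").match("/example.html")))   # False
    print(bool(pattern_to_regex("/example").match("/example.html")))    # True
    print(bool(pattern_to_regex("/example*$").match("/example.html")))  # True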

    Sitemap directive

    If you use a Sitemap file to describe your site's structure, indicate the path to the file as a parameter of the Sitemap directive (if you have multiple files, indicate all paths). Example:

    User-agent: Yandex
    Allow: /
    Sitemap: http://example.com/site_structure/my_sitemaps1.xml
    Sitemap: http://example.com/site_structure/my_sitemaps2.xml

    The robot will remember the path to your file, process your data, and use the results during the next visit to your site.
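
    To double-check that the directive is picked up by a standards-based parser, Python's urllib.robotparser (3.8 and later) exposes the listed files through site_maps().

    # Read the Sitemap entries back with the standard library (Python 3.8+).
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.parse("""
    User-agent: Yandex
    Allow: /
    Sitemap: http://example.com/site_structure/my_sitemaps1.xml
    Sitemap: http://example.com/site_structure/my_sitemaps2.xml
    """.splitlines())

    print(parser.site_maps())
    # ['http://example.com/site_structure/my_sitemaps1.xml',
    #  'http://example.com/site_structure/my_sitemaps2.xml']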

    Host directive

    If your site has mirrors, special mirror bots (Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)) detect them and form a mirror group for your site. Only the main mirror will participate in search. You can indicate which site is the main one in the robots.txt file. The name of the main mirror should be listed as the value of the Host directive.

    The 'Host' directive does not guarantee that the specified main mirror will be selected. However, the decision-making algorithm will assign it a high priority. For example:

    #If www.main-mirror.com is your site's main mirror, then
    #robots.txt for all your sites from the mirror group will look like this:
    User-Agent: *
    Disallow: /forum
    Disallow: /cgi-bin
    Host: www.main-mirror.com

    To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Host directive needs to be added to the group that starts with the User-Agent record right after the Disallow and Allow directives. The Host directive argument is the domain name with the port number (80 by default), separated by a colon.

    #Example of a well-formed robots.txt file, where
    #the Host directive will be taken into account during processing
    
    User-Agent: *
    Disallow:
    Host: www.myhost.com

    However, the Host directive is intersectional, so it will be used by the robot regardless of its location in robots.txt.

    Note. For every robots.txt file, only one Host directive is processed. If several directives are indicated in the file, the robot will use the first one.

    For example:

    Host: myhost.ru # is used
    
    User-agent: *
    Disallow: /cgi-bin
    
    User-agent: Yandex
    Disallow: /cgi-bin
    Host: www.myhost.ru # is not used

    The Host directive should contain:

    • The HTTPS protocol, if the mirror is only available over a secure channel (Host: https://myhost.com).

    • One valid domain name that conforms to RFC 952 and is not an IP address.

    • The port number, if necessary (Host: myhost.com:8080).

    An incorrectly formed Host directive will be ignored.

    # Examples of Host directives that will be ignored
    
    Host: www.myhost-.com
    Host: www.-myhost.com
    Host: www.myhost.com:100000
    Host: www.my_host.com
    Host: .my-host.com:8000
    Host: my-host.com.
    Host: my..host.com
    Host: www.myhost.com:8080/
    Host: 213.180.194.129
    Host: www.firsthost.ru,www.secondhost.com
    Host: www.firsthost.ru www.secondhost.com

    Examples of Host directive use:

    # domain.myhost.ru is the main mirror for
    # www.domain.myhost.com, so the correct use of 
    # the Host directive is:
    
    User-Agent: *
    Disallow:
    Host: domain.myhost.ru
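
    For illustration, the constraints above can be approximated with a small Python check. host_is_valid is an invented helper that only approximates the rules; the robot's actual validation may be stricter.

    # Approximate check of a Host directive value against the rules listed above.
    import re

    HOST_RE = re.compile(
        r"^(?:https?://)?"                                  # optional protocol
        r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)"                    # first domain label
        r"(?:\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))+"             # remaining labels
        r"(?::(?P<port>\d+))?$"                             # optional port
    )

    def host_is_valid(value):
        match = HOST_RE.match(value)
        if not match:
            return False
        port = match.group("port")
        if port is not None and not 0 < int(port) <= 65535:
            return False
        # A bare IP address is not accepted as a Host value.
        return re.fullmatch(r"(?:\d{1,3}\.){3}\d{1,3}", value) is None

    print(host_is_valid("www.myhost.com"))         # True
    print(host_is_valid("https://myhost.com"))     # True
    print(host_is_valid("www.myhost.com:100000"))  # False: port out of range
    print(host_is_valid("www.my_host.com"))        # False: underscore in name
    print(host_is_valid("213.180.194.129"))        # False: IP address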

    Crawl-delay directive

    If the server is overloaded and can't keep up with download requests, use the Crawl-delay directive. It specifies the minimum interval (in seconds) for the search robot to wait after loading one page before starting to load the next.

    To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent entry right after the Disallow and Allow directives.

    The Yandex search robot supports fractional values for Crawl-delay, such as "0.5". This does not mean that the search robot will access your site every half second, but it may speed up site processing.

    Examples:

    User-agent: Yandex
    Crawl-delay: 2 # sets a 2 second time-out
    
    User-agent: *
    Disallow: /search
    Crawl-delay: 4.5 # sets a 4.5 second time-out
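
    On the crawler side, honoring Crawl-delay simply means pausing between requests. Python's urllib.robotparser can read the value via crawl_delay() (Python 3.6 and later), though it only accepts whole-number delays; fractional values such as 4.5 would need custom parsing.

    # Read Crawl-delay with the standard library and pause between requests.
    import time
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.parse("""
    User-agent: *
    Disallow: /search
    Crawl-delay: 2
    """.splitlines())

    delay = parser.crawl_delay("*") or 0   # None when the directive is absent
    print("delay:", delay)                 # delay: 2
    for path in ("/page1", "/page2"):
        print("fetching", path)            # a real crawler would download here
        time.sleep(delay)                  # wait at least Crawl-delay seconds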

    Clean-param directive

    If your site page addresses contain dynamic parameters that do not affect the content (for example, session IDs, user IDs, or referrers), you can describe them using the Clean-param directive.

    Using this information, the Yandex robot will not repeatedly reload duplicated information. This improves the efficiency with which the robot processes your site and reduces the server load.

    For example, your site contains the following pages:

    www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
    www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
    www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

    The ref parameter is only used to track which resource the request was sent from, and does not change the content. All three addresses will display the same page with book_id=123. Then, if you indicate the directive in the following way:

    User-agent: Yandex
    Disallow:
    Clean-param: ref /some_dir/get_book.pl

    the Yandex robot will converge all the page addresses into one:

    www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123

    If a page without parameters is available on the site:

    www.example.com/some_dir/get_book.pl?book_id=123

    all such addresses will be consolidated on that page after the robot indexes it. Other pages of your site will be crawled more often, since there will be no need to update the following pages:

    www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
    www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

    Directive syntax

    Clean-param: p0[&p1&p2&..&pn] [path]

    In the first field, list the parameters that must be disregarded, separated by the & symbol. In the second field, indicate the path prefix for the pages the rule should apply to.

    Note. The Clean-Param directive is intersectional, so it can be indicated in any place within the robots.txt file. If several directives are specified, all of them will be taken into account by the robot.

    The prefix can contain a regular expression in a format similar to the one used in the robots.txt file, but with a few restrictions: only the characters A-Za-z0-9.-/*_ can be used. However, * is interpreted in the same way as in robots.txt. A * is always implicitly appended to the end of the prefix. For example:

    Clean-param: s /forum/showthread.php

    means that the s parameter will be disregarded for all URLs that begin with /forum/showthread.php. The second field is optional; in this case, the rule applies to all pages on the site. Case matters. The maximum length of a rule is 500 characters. For example:

    Clean-param: abc /forum/showthread.php
    Clean-param: sid&sort /forumt/*.php
    Clean-param: someTrash&otherTrash

    Additional examples

    #for these types of addresses:
    www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
    www.example1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243
    
    #robots.txt will contain:
    User-agent: Yandex
    Disallow:
    Clean-param: s /forum/showthread.php

    #for these types of addresses:
    www.example2.com/index.php?page=1&sort=3a&sid=2564126ebdec301c607e5df
    www.example2.com/index.php?page=1&sort=3a&sid=974017dcd170d6c4a5d76ae
    
    #robots.txt will contain:
    User-agent: Yandex
    Disallow:
    Clean-param: sid /index.php

    #if there are several of these parameters:
    www.example1.com/forum_old/showthread.php?s=681498605&t=8243&ref=1311
    www.example1.com/forum_new/showthread.php?s=1e71c417a&t=8243&ref=9896
    
    #robots.txt will contain:
    User-agent: Yandex
    Disallow:
    Clean-param: s&ref /forum*/showthread.php

    #if the parameter is used in multiple scripts:
    www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
    www.example1.com/forum/index.php?s=1e71c4427317a117a&t=8243
    
    #robots.txt will contain:
    User-agent: Yandex
    Disallow:
    Clean-param: s /forum/index.php
    Clean-param: s /forum/showthread.php
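
    The effect of Clean-param can be pictured with a short Python sketch. strip_params is an invented helper and simplifies matching to exact parameter names and a plain path prefix.

    # Illustration: collapse URLs that differ only in Clean-param parameters.
    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    def strip_params(url, params, prefix):
        parts = urlsplit(url)
        if not parts.path.startswith(prefix):
            return url
        kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in params]
        return urlunsplit(parts._replace(query=urlencode(kept)))

    urls = [
        "http://www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123",
        "http://www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123",
        "http://www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123",
    ]
    # Clean-param: ref /some_dir/get_book.pl
    print({strip_params(u, {"ref"}, "/some_dir/get_book.pl") for u in urls})
    # -> a single address: {'http://www.example.com/some_dir/get_book.pl?book_id=123'}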

    Additional information

    The Yandex robot doesn't support robots.txt directives that aren't shown on this page. The file processing rules described above represent an extension of the basic standard. Other robots may interpret robots.txt contents in different ways.

    The results when using the extended robots.txt format may differ from results that use the basic standard, particularly:

    User-agent: Yandex 
    Allow: /
    Disallow: /
    # without extensions everything is disallowed since 'Allow: /' is ignored, 
    # with extension support everything is allowed
    
    User-agent: Yandex
    Disallow: /private*html
    # without extensions '/private*html' is disallowed, 
    # but with extensions it disallows '/private*html', 
    # and '/private/test.html', and '/private/html/test.aspx' etc.
    
    User-agent: Yandex
    Disallow: /private$
    # without extensions, '/private$' and '/private$test' etc. are disallowed, 
    # but with extensions, only '/private' is disallowed
    
    User-agent: *
    Disallow: /
    User-agent: Yandex
    Allow: /
    # without extensions due to no empty line break, 
    # 'User-agent: Yandex' would be ignored and  
    # the result would be 'Disallow: /', but the Yandex robot 
    # selects entries that have 'User-agent:' in the line, 
    # so the result for the Yandex robot in this case is 'Allow: /'
    
    User-agent: *
    Disallow: /
    # commentary1...
    # commentary2...
    # commentary3...
    User-agent: Yandex
    Allow: /
    # same as in the previous example (see above)

    Examples of extended robots.txt format use:

    User-agent: Yandex
    Allow: /archive
    Disallow: /
    # allows everything that starts with '/archive'; everything else is disallowed
    
    User-agent: Yandex
    Allow: /obsolete/private/*.html$ # allows html files
                                     # at the path '/obsolete/private/...'
    Disallow: /*.php$  # disallows all '*.php' on site
    Disallow: /*/private/ # disallows all subpaths containing
                          # '/private/', but the Allow above negates
                          # part of the disallow
    Disallow: /*/old/*.zip$ # disallows all '*.zip' files containing
                            # '/old/' in the path
    
    User-agent: Yandex
    Disallow: /add.php?*user= 
    # disallows all 'add.php?' scripts with the 'user' parameter

    When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything, meaning it is interpreted the same way as:

    User-agent: Yandex
    Disallow:

    Similarly, robots.txt is assumed to allow everything if it couldn't be accessed (for example, if the HTTP headers are not set properly or a 404 Not found HTTP status message is returned).
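
    A quick local check of the size limit before uploading (a sketch; the 32 KB figure is the limit described above):

    # Warn if a local robots.txt exceeds the 32 KB limit mentioned above.
    import os

    MAX_BYTES = 32 * 1024
    size = os.path.getsize("robots.txt")   # path to your local copy
    if size > MAX_BYTES:
        print(f"{size} bytes: the robot will treat the file as allowing everything")
    else:
        print(f"{size} bytes: within the {MAX_BYTES}-byte limit")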

    Exceptions

    A number of Yandex robots download web documents for purposes other than indexing. To avoid being unintentionally blocked by site owners, they may not follow the robots.txt directives that restrict arbitrary robots (User-agent: *).

    It's also possible to partially ignore robots.txt restrictions for certain sites if there is an agreement between Yandex and the owners of those sites.

    Attention. If such a robot downloads a document that the main Yandex robot can't access, this document will never be indexed and won't be found in search results.

    Here is a list of Yandex robots that don't follow general limiting rules in robots.txt:

    • YaDirectFetcher downloads ad landing pages to check their availability and content. This is compulsory for placing ads in Yandex search results and YAN partner sites.
    • YandexCalendar regularly downloads calendar files requested by users, even if they are located in directories that are blocked from indexing.
    • 'YandexDirect' downloads information about YAN partner site content in order to clarify what their topics are so that relevant ads can be selected.
    • YandexDirectDyn is the robot that generates dynamic banners.
    • YandexMobileBot downloads documents for analysis in order to determine if their page layouts are suitable for mobile devices.
    • YandexAccessibilityBot downloads pages to check how accessible they are for users.
    • YandexScreenshotBot takes a screenshot of a page.
    • YandexMetrika is the Yandex.Metrica robot.
    • YandexVideoParser is the Yandex.Video indexer.

    To prevent this behavior, you can restrict these robots' access to part or all of your site by addressing Disallow directives to them directly, for example:

    User-agent: YaDirectFetcher
    Disallow: /
    User-agent: YandexMobileBot
    Disallow: /private/*.txt$