What is robots.txt?
Robots.txt is a text file that contains site indexing parameters for the search engine robots.
How to set up robots.txt
- Create a file named robots.txt in a text editor and fill it in using the guidelines below.
- Check the file in the Yandex.Webmaster service (Robots.txt analysis in the menu).
- Upload the file to your site's root directory.
The User-agent directive
The Yandex robot supports the robots exclusion standard with enhanced capabilities described below.
The Yandex robot's work is based on sessions: for every session, there is a pool of pages for the robot to download.
A session begins with the download of the robots.txt file. If the file is missing, is not a text file, or the robot's request returns an HTTP status other than 200 OK, the robot assumes that it has unrestricted access to the site's documents.
In the robots.txt file, the robot checks for records starting with User-agent: and looks for either the substring Yandex (the case doesn't matter) or *. If a User-agent: Yandex string is detected, directives for User-agent: * are ignored. If the User-agent: Yandex and User-agent: * strings are not found, the robot is considered to have unlimited access.
You can enter separate directives for the following Yandex robots:
- YandexBot — The main indexing robot.
- YandexDirect — Downloads information about the content on Yandex Advertising Network partner sites for selecting relevant ads. Interprets robots.txt in a special way.
- YandexDirectDyn — Generates dynamic banners. Interprets robots.txt in a special way.
- YandexMedia — Indexes multimedia data.
- YandexImages — Indexer of Yandex.Images.
- YaDirectFetcher — The Yandex.Direct robot. Interprets robots.txt in a special way.
- YandexBlogs — The Blog search robot. Indexes posts and comments.
- YandexNews — The Yandex.News robot.
- YandexPagechecker — Semantic markup validator.
- YandexMetrika — The Yandex.Metrica robot.
- YandexMarket — The Yandex.Market robot.
- YandexCalendar — The Yandex.Calendar robot.
If there are directives for a specific robot, the User-agent: Yandex and User-agent: * directives aren't used for that robot.
User-agent: YandexBot # will be used only by the main indexing robot
Disallow: /*id=

User-agent: Yandex # will be used by all Yandex robots
Disallow: /*sid= # except for the main indexing robot

User-agent: * # won't be used by Yandex robots
Disallow: /cgi-bin
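The selection order described above can be sketched in Python. This is a simplified illustration, not the robot's actual implementation: the function name select_group and the dictionary layout are hypothetical, names are compared as whole strings rather than substrings, and the robots that interpret robots.txt in a special way are not modeled.

```python
def select_group(robot_name, groups):
    """Pick the User-agent group a Yandex robot would use.

    groups maps a User-agent value (as written in robots.txt) to its rules.
    Order of preference: the robot's own name, then 'Yandex', then '*'.
    Returns None when no group applies (unrestricted access).
    """
    # The text says that case doesn't matter when matching the name.
    lowered = {name.lower(): rules for name, rules in groups.items()}
    for candidate in (robot_name.lower(), "yandex", "*"):
        if candidate in lowered:
            return lowered[candidate]
    return None


groups = {
    "YandexBot": ["Disallow: /*id="],
    "Yandex": ["Disallow: /*sid="],
    "*": ["Disallow: /cgi-bin"],
}
print(select_group("YandexBot", groups))     # rules from the YandexBot group
print(select_group("YandexImages", groups))  # rules from the Yandex group
```

With the example file above, the main indexing robot gets only its own group, while any other Yandex robot falls back to the Yandex group and never reaches the * group.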
Disallow and Allow directives
To prohibit the robot from accessing your site or certain sections of it, use the Disallow directive.
User-agent: Yandex
Disallow: / # blocks access to the whole site

User-agent: Yandex
Disallow: /cgi-bin # blocks access to the pages
                   # starting with '/cgi-bin'
According to the standard, you should insert a blank line before every User-agent directive.
The # character designates commentary. Everything following this character, up to the first line break, is disregarded.
Use the Allow directive to allow the robot to access specific parts of the site or the entire site.
User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# prohibits downloading anything except for the pages
# starting with '/cgi-bin'
The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect the way they are used by the robot. Examples:
# Source robots.txt:
User-agent: Yandex
Allow: /catalog
Disallow: /

# Sorted robots.txt:
User-agent: Yandex
Disallow: /
Allow: /catalog
# only allows downloading pages
# starting with '/catalog'

# Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog

# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# prohibits downloading pages starting with '/catalog',
# but allows downloading pages starting with '/catalog/auto'.
If two directives with prefixes of the same length conflict, the Allow directive takes precedence.
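The sorting rule can be illustrated with a short Python sketch. It is a minimal model under stated assumptions: the function name is_allowed and the tuple layout are hypothetical, prefixes are non-empty and matched literally (no * or $ handling), and a tie between prefixes of equal length is resolved in favor of Allow.

```python
def is_allowed(path, rules):
    """Decide access for path from (directive, prefix) pairs.

    directive is 'Allow' or 'Disallow'; prefix is a plain, non-empty
    URL prefix such as '/catalog'.
    """
    # Sort by prefix length, shortest first. On equal length, Disallow
    # sorts before Allow, so Allow ends up last and wins the tie.
    ordered = sorted(rules, key=lambda r: (len(r[1]), r[0] == "Allow"))
    verdict = True  # no matching rule means access is allowed
    for directive, prefix in ordered:
        if path.startswith(prefix):
            verdict = directive == "Allow"  # the last match wins
    return verdict


rules = [("Allow", "/"), ("Allow", "/catalog/auto"), ("Disallow", "/catalog")]
print(is_allowed("/catalog/auto/sedan", rules))
print(is_allowed("/catalog/moto", rules))
```

Because the rules are sorted before they are applied, shuffling the input list does not change the result, which mirrors the statement that the order of directives in the file doesn't matter.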
Allow and Disallow directives without parameters
If the directives don't contain parameters, the robot handles the data as follows:
User-agent: Yandex
Disallow: # same as Allow: /

User-agent: Yandex
Allow: # isn't taken into account by the robot
Using the special characters * and $
You can use the special characters * and $ to set regular expressions when specifying paths for the Allow and Disallow directives. The * character indicates any sequence of characters (or none). Examples:
User-agent: Yandex
Disallow: /cgi-bin/*.aspx # prohibits '/cgi-bin/example.aspx'
                          # and '/cgi-bin/private/test.aspx'
Disallow: /*private # prohibits both '/private',
                    # and '/cgi-bin/private'
The $ character
By default, the * character is appended to the end of every rule described in the robots.txt file. Example:
User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages
                    # starting with '/cgi-bin'
Disallow: /cgi-bin # the same
To cancel * at the end of the rule, use the $ character, for example:
User-agent: Yandex
Disallow: /example$ # prohibits '/example',
                    # but allows '/example.html'

User-agent: Yandex
Disallow: /example # prohibits both '/example',
                   # and '/example.html'
The $ character doesn't forbid * at the end, that is:
User-agent: Yandex
Disallow: /example$ # prohibits only '/example'
Disallow: /example*$ # exactly the same as 'Disallow: /example'
                     # prohibits both /example.html and /example
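One way to reason about * and $ is to translate each rule into an ordinary regular expression. The Python sketch below does this under the semantics described in this section; the helper names rule_to_regex and rule_matches are hypothetical, and the translation is an illustration rather than the robot's real matcher.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule with * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape every literal part, then join the parts with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    # Without a trailing '$', a '*' is implicitly appended to the rule.
    pattern += r"\Z" if anchored else ".*"
    return re.compile(pattern)


def rule_matches(rule, path):
    """Return True if path falls under rule (matching starts at the path start)."""
    return rule_to_regex(rule).match(path) is not None


print(rule_matches("/example$", "/example.html"))
print(rule_matches("/example*$", "/example.html"))
```

The two calls differ only in the * before $, which is what makes the second rule equivalent to a plain 'Disallow: /example'.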
The Sitemap directive
If you use a Sitemap file to describe your site's structure, indicate the path to the file as a parameter of the Sitemap directive (if you have multiple files, indicate all paths). Example:
User-agent: Yandex
Allow: /
sitemap: https://example.com/site_structure/my_sitemaps1.xml
sitemap: https://example.com/site_structure/my_sitemaps2.xml
The directive is intersectional, meaning it is used by the robot regardless of its location in robots.txt.
The robot remembers the path to your file, processes your data and uses the results during the next visit to your site.
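Since the Sitemap directive applies regardless of where it appears, a parser can collect it in a single pass over the file without tracking User-agent groups. The Python sketch below shows the idea; the function name extract_sitemaps is hypothetical, and treating the directive name as case-insensitive is an assumption based on the lowercase spelling in the example above.

```python
def extract_sitemaps(robots_txt):
    """Collect Sitemap URLs from anywhere in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        # Split on the first ':' only, so the URL's own ':' is preserved.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


sample = """\
sitemap: https://example.com/site_structure/my_sitemaps1.xml
User-agent: Yandex
Allow: /
Sitemap: https://example.com/site_structure/my_sitemaps2.xml
"""
print(extract_sitemaps(sample))
```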
The Crawl-delay directive
If the server is overloaded and it isn't possible to process downloading requests, use the Crawl-delay directive. You can specify the minimum interval (in seconds) for the search robot to wait after downloading one page, before starting to download another.
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, add the Crawl-delay directive to the group that starts with the User-agent entry, right after the Disallow and Allow directives.
The Yandex search robot supports fractional values for Crawl-delay, such as "0.5". This doesn't mean that the search robot will access your site every half a second, but it may speed up the site processing.
User-agent: Yandex
Crawl-delay: 2 # sets a 2-second timeout

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a 4.5-second timeout
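As an illustration of how fractional Crawl-delay values might be read per User-agent group, here is a small Python sketch. The function name crawl_delays and the grouping logic are assumptions made for this example; it does not validate values and raises ValueError on a non-numeric delay.

```python
def crawl_delays(robots_txt):
    """Map each User-agent value to its Crawl-delay in seconds (as a float)."""
    delays = {}
    current_agents = []
    in_rules = False  # True once the current group has a non-User-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value)
        else:
            in_rules = True
            if name == "crawl-delay":
                for agent in current_agents:
                    delays[agent] = float(value)
    return delays


sample = """\
User-agent: Yandex
Crawl-delay: 2 # sets a 2-second timeout

User-agent: *
Disallow: /search
Crawl-delay: 4.5 # sets a 4.5-second timeout
"""
print(crawl_delays(sample))
```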
The Clean-param directive
If your site page addresses contain dynamic parameters that don't affect the content (for example, identifiers of sessions, users, referrers, and so on), you can describe them using the Clean-param directive.
The Yandex robot uses this information to avoid reloading duplicate information. This improves the robot's efficiency and reduces the server load.
For example, your site contains the following pages:
www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123
The ref parameter is only used to track which resource the request was sent from. It doesn't change the page content. All three URLs will display the same page with the book_id=123 book. Then, if you indicate the directive in the following way:
User-agent: Yandex
Disallow:
Clean-param: ref /some_dir/get_book.pl
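the Yandex robot will converge all the page addresses into one.
If a page without the ref parameter is available on the site (here, www.example.com/some_dir/get_book.pl?book_id=123), all other URLs are replaced with it after the robot indexes it. Other pages of your site will be crawled more often, because there will be no need to update the duplicate pages that differ only in the ref parameter.
The directive has the following syntax: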
Clean-param: p0[&p1&p2&..&pn] [path]
In the first field, list the parameters that must be disregarded, separated by the & character. In the second field, indicate the path prefix for the pages the rule should apply to.
The prefix can contain a regular expression in the format similar to the one used in the robots.txt file, but with some restrictions: you can only use the characters A-Za-z0-9.-/*_. However, * is interpreted in the same way as in robots.txt. A * is always implicitly appended to the end of the prefix. For example:
Clean-param: s /forum/showthread.php
means that the s parameter is disregarded for all URLs that begin with /forum/showthread.php. The second field is optional; if it is omitted, the rule applies to all pages on the site.
Rules are case sensitive.
The maximum length of the rule is 500 characters. For example:
Clean-param: abc /forum/showthread.php
Clean-param: sid&sort /forum/*.php
Clean-param: someTrash&otherTrash
#for addresses like:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243
#robots.txt will contain the following:
User-agent: Yandex
Disallow:
Clean-param: s /forum/showthread.php

#for addresses like:
www.example2.com/index.php?page=1&sort=3a&sid=2564126ebdec301c607e5df
www.example2.com/index.php?page=1&sort=3a&sid=974017dcd170d6c4a5d76ae
#robots.txt will contain the following:
User-agent: Yandex
Disallow:
Clean-param: sid /index.php

#if there are several of these parameters:
www.example1.com/forum_old/showthread.php?s=681498605&t=8243&ref=1311
www.example1.com/forum_new/showthread.php?s=1e71c417a&t=8243&ref=9896
#robots.txt will contain the following:
User-agent: Yandex
Disallow:
Clean-param: s&ref /forum*/showthread.php

#if the parameter is used in multiple scripts:
www.example1.com/forum/showthread.php?s=681498b9648949605&t=8243
www.example1.com/forum/index.php?s=1e71c4427317a117a&t=8243
#robots.txt will contain the following:
User-agent: Yandex
Disallow:
Clean-param: s /forum/index.php
Clean-param: s /forum/showthread.php
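To check which addresses a Clean-param rule would treat as duplicates, the rule can be approximated in Python. This is a rough sketch under stated assumptions: the function name clean_url is hypothetical, the https:// scheme is added to the sample addresses so that the standard URL parser can split them, and duplicates are modeled by simply dropping the listed parameters, which is not necessarily the form the robot keeps.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def clean_url(url, params, path_prefix=""):
    """Drop the listed query parameters if the URL path matches the prefix.

    params is the first Clean-param field ('s&ref'); path_prefix is the
    optional second field ('/forum*/showthread.php').
    """
    parts = urlsplit(url)
    # '*' matches any sequence of characters; a '*' is implied at the end.
    prefix_re = ".*".join(re.escape(p) for p in path_prefix.split("*")) + ".*"
    if not re.match(prefix_re, parts.path):
        return url  # the rule doesn't apply to this path
    ignored = set(params.split("&"))  # parameter names are case sensitive
    kept = [
        pair for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0] not in ignored
    ]
    return urlunsplit(parts._replace(query="&".join(kept)))


urls = [
    "https://www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123",
    "https://www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123",
    "https://www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123",
]
print({clean_url(u, "ref", "/some_dir/get_book.pl") for u in urls})
```

All three sample addresses collapse to a single entry in the resulting set, which is the deduplication effect the directive is meant to achieve.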
Using Cyrillic characters
The use of the Cyrillic alphabet is not allowed in the robots.txt file or in HTTP server headers.
For domain names, use Punycode. For page addresses, use the same encoding as the one used for the current site structure.
Example of the robots.txt file:
#Incorrect:
User-agent: Yandex
Disallow: /корзина
Sitemap: сайт.рф/sitemap.xml

#Correct:
User-agent: Yandex
Disallow: /%D0%BA%D0%BE%D1%80%D0%B7%D0%B8%D0%BD%D0%B0
Sitemap: http://xn--80aswg.xn--p1ai/sitemap.xml
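The encoded forms in the example can be produced with Python's standard library. The sketch below assumes that the site's page addresses are UTF-8 encoded, which matches the percent-encoded path in the example; a site that uses a different encoding would need that encoding passed to quote instead.

```python
from urllib.parse import quote

# Percent-encode the Cyrillic path (UTF-8 is assumed here).
path = quote("/корзина")

# Convert the Cyrillic domain name to its Punycode (IDNA) form.
host = "сайт.рф".encode("idna").decode("ascii")

print("Disallow: " + path)
print("Sitemap: http://" + host + "/sitemap.xml")
```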
The Yandex robot supports only the robots.txt directives listed on this page. The file processing rules described above represent an extension of the basic standard. Other robots may interpret robots.txt contents in a different way.
The results when using the extended robots.txt format may differ from results that use the basic standard, particularly:
User-agent: Yandex
Allow: /
Disallow: /
# without extensions everything was prohibited because 'Allow: /' was ignored,
# with extensions supported, everything is allowed

User-agent: Yandex
Disallow: /private*html
# without extensions, '/private*html' was prohibited,
# with extensions supported, '/private*html',
# '/private/test.html', '/private/html/test.aspx', and so on are prohibited as well

User-agent: Yandex
Disallow: /private$
# without extensions supported, '/private$' and '/private$test', and so on were prohibited,
# with extensions supported, only '/private' is prohibited

User-agent: *
Disallow: /
User-agent: Yandex
Allow: /
# without extensions supported, because of the missing line break,
# 'User-agent: Yandex' would be ignored
# the result would be 'Disallow: /', but the Yandex robot
# parses strings based on the 'User-agent:' substring.
# In this case, the result for the Yandex robot is 'Allow: /'

User-agent: *
Disallow: /
# comment1...
# comment2...
# comment3...
User-agent: Yandex
Allow: /
# same as in the previous example (see above)
Examples using the extended robots.txt format:
User-agent: Yandex
Allow: /archive
Disallow: /
# allows everything that starts with '/archive'; the rest is prohibited

User-agent: Yandex
Allow: /obsolete/private/*.html$ # allows HTML files
                                 # in the '/obsolete/private/...' path
Disallow: /*.php$ # prohibits all '*.php' on site
Disallow: /*/private/ # prohibits all subpaths containing
                      # '/private/', but the Allow above negates
                      # part of the prohibition
Disallow: /*/old/*.zip$ # prohibits all '*.zip' files containing
                        # '/old/' in the path

User-agent: Yandex
Disallow: /add.php?*user=
# prohibits all 'add.php?' scripts with the 'user' option
When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything, meaning it is interpreted the same way as:
User-agent: Yandex
Disallow:
Similarly, robots.txt is assumed to allow everything if it couldn't be downloaded (for example, if HTTP headers are not set properly or a 404 Not found status is returned).
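The fallback behavior for oversized or unavailable files can be summarized in a few lines of Python. This is only a sketch of the rules stated above: the function name effective_robots is hypothetical, 32 KB is taken to mean 32 * 1024 bytes, and the content-type check mentioned earlier on the page is left out.

```python
MAX_ROBOTS_SIZE = 32 * 1024  # size limit from the text, assumed to be in bytes

# The "allow everything" file the text gives as the equivalent.
ALLOW_ALL = "User-agent: Yandex\nDisallow:\n"


def effective_robots(status_code, body):
    """Return the robots.txt text to apply, given the HTTP status and raw body."""
    if status_code != 200 or len(body) > MAX_ROBOTS_SIZE:
        return ALLOW_ALL  # missing, failed, or oversized: nothing is restricted
    return body.decode("utf-8", errors="replace")


print(effective_robots(404, b""))
```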
A number of Yandex robots download web documents for purposes other than indexing. To avoid being unintentionally blocked by the site owners, they may ignore the robots.txt directives designed for arbitrary robots (User-agent: *).
In addition, robots may ignore some robots.txt restrictions for certain sites if there is an agreement between “Yandex” and the owners of those sites.
Yandex robots that don't follow common disallow directives in robots.txt:
- YaDirectFetcher downloads ad landing pages to check their availability and content. This is needed for placing ads in the Yandex search results and on partner sites. When crawling a site, the robot does not use the robots.txt file and ignores the directives set for it.
- YandexCalendar regularly downloads calendar files by users' requests. These files are often located in directories prohibited from indexing.
- YandexDirect downloads information about the content of Yandex Advertising network partner sites to identify their topic categories to match relevant advertising.
- YandexDirectDyn is the robot that generates dynamic banners.
- YandexFavicons — The favicons indexer.
- YandexMobileBot downloads documents to determine if their layout is suitable for mobile devices.
- YandexAccessibilityBot downloads pages to check their accessibility for users.
- YandexScreenshotBot takes a screenshot of a page.
- YandexMetrika is the Yandex.Metrica robot.
- YandexVideoParser is the Yandex video indexer.
- YandexSearchShop regularly downloads product catalogs in YML files by users' requests. These files are often placed in directories prohibited for indexing.
To prevent this behavior, you can restrict access for these robots to some pages or the whole site using the robots.txt directives, for example:
User-agent: YandexCalendar
Disallow: /

User-agent: YandexMobileBot
Disallow: /private/*.txt$