Disallow and Allow directives
Disallow
Use this directive to prohibit crawling of sections or individual pages of a site. For example:
- Pages that contain confidential data.
- Pages with site search results.
- Site traffic statistics.
- Duplicate pages.
- Various logs.
- Database service pages.
Note
When choosing a directive for pages that should be excluded from the search, if their addresses contain GET parameters, we recommend using the Clean-param directive rather than Disallow. With Disallow, the robot may fail to merge duplicates to the link without the parameter, and some metrics of the disallowed pages may not be transferred.
Examples:
User-agent: Yandex
Disallow: / # prohibits crawling of the entire site
User-agent: Yandex
Disallow: /catalogue # prohibits crawling of pages whose addresses start with /catalogue
User-agent: Yandex
Disallow: /page? # prohibits crawling of pages whose URLs contain parameters
Allow
This directive allows crawling of sections or individual pages of a site.
Examples:
User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# prohibits downloading of everything except pages
# starting with “/cgi-bin”
User-agent: Yandex
Allow: /file.xml
# allows downloading of the file.xml file
Note
Empty line breaks aren't allowed between the User-agent, Disallow, and Allow directives.
Combining directives
The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way, the order of directives in the robots.txt file doesn't affect how the robot applies them.
Note
If two directives with prefixes of the same length conflict, the Allow directive takes precedence.
# Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog
# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# prohibits downloading of pages starting with “/catalog”,
# but allows downloading of pages starting with “/catalog/auto”.
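The sorting behavior described above can be sketched in Python. This is a simplified model assuming plain prefix rules without the * and $ wildcards; the function and variable names are illustrative, not part of any robots.txt tooling:

```python
# Simplified model of the combining rule (assumption: plain prefix
# matching, no * or $ wildcards). Rules are sorted by prefix length;
# the last matching rule in sorted order wins, and on equal-length
# prefixes Allow sorts after Disallow so it takes precedence.

def is_allowed(path, rules):
    """rules: list of (directive, prefix) pairs, e.g. ("Allow", "/catalog/auto")."""
    ordered = sorted(rules, key=lambda r: (len(r[1]), r[0] == "Allow"))
    verdict = True  # a path matched by no rule is allowed
    for directive, prefix in ordered:
        if path.startswith(prefix):
            verdict = directive == "Allow"  # later (longer) match overrides
    return verdict

# The rules from the example above:
rules = [
    ("Allow", "/"),
    ("Allow", "/catalog/auto"),
    ("Disallow", "/catalog"),
]

print(is_allowed("/catalog/page", rules))       # False: “/catalog” is the longest match
print(is_allowed("/catalog/auto/page", rules))  # True: “/catalog/auto” overrides it
```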
Common example:
User-agent: Yandex
Allow: /archive
Disallow: /
# allows everything starting with “/archive”; the rest is prohibited
User-agent: Yandex
Allow: /obsolete/private/*.html$ # allows html files
# in “/obsolete/private/...”
Disallow: /*.php$ # prohibits all “*.php” on this site
Disallow: /*/private/ # prohibits all subpaths containing
# “/private/”, but the Allow above negates
# part of this prohibition
Disallow: /*/old/*.zip$ # prohibits all “*.zip” files whose
# paths contain “/old/”
User-agent: Yandex
Disallow: /add.php?*user=
# prohibits all “add.php?” scripts with the “user” parameter
Allow and Disallow directives without parameters
If directives don't contain parameters, the robot handles the data as follows:
User-agent: Yandex
Disallow: # same as Allow: /
User-agent: Yandex
Allow: # isn’t taken into account by the bot
Using the special characters * and $
You can use the special characters * and $ when specifying the paths of the Allow and Disallow directives to set certain regular expressions.
The * character indicates any sequence of characters (or none). Examples:
User-agent: Yandex
Disallow: /cgi-bin/*.aspx # prohibits “/cgi-bin/example.aspx”
# and “/cgi-bin/private/test.aspx”
Disallow: /*private # prohibits both “/private”
# and “/cgi-bin/private”
By default, the * character is appended to the end of every rule described in the robots.txt file. Example:
User-agent: Yandex
Disallow: /cgi-bin* # blocks access to pages
# starting with “/cgi-bin”
Disallow: /cgi-bin # the same
To cancel the * at the end of the rule, use the $ character, for example:
User-agent: Yandex
Disallow: /example$ # prohibits “/example”,
# but allows “/example.html”
User-agent: Yandex
Disallow: /example # prohibits both “/example”
# and “/example.html”
The $ character doesn't cancel the * at the end, that is:
User-agent: Yandex
Disallow: /example$ # only prohibits “/example”
Disallow: /example*$ # same as “Disallow: /example”
# prohibits both /example.html and /example
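The matching rules for * and $ can be modeled by translating a directive path into a regular expression, under the assumptions stated above (an implicit * at the end unless the pattern ends with $). The helper names are illustrative:

```python
import re

def pattern_to_regex(pattern):
    """Translate a Disallow/Allow path pattern into a compiled regex:
    * matches any character sequence (or none), $ anchors the end,
    and a trailing * is implied when the pattern doesn't end with $."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ".*"))

def matches(pattern, path):
    return bool(pattern_to_regex(pattern).match(path))

print(matches("/example$", "/example"))       # True
print(matches("/example$", "/example.html"))  # False: $ cancels the implicit *
print(matches("/example", "/example.html"))   # True: implicit * at the end
```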
Processing the # character
According to the standard, you should insert a blank line before every User-agent directive. The # character designates a comment: everything from this character up to the first line break is disregarded.
The search bot doesn't index pages with addresses like https://example.com/page#part_1 separately: it crawls them at the https://example.com/page address. Therefore, specify page addresses without the anchor in directives.
If you overlook this behavior and write a disallow directive containing the # character, it may block the entire site from indexing. For example, a directive in the Disallow: /# format is interpreted by the search system as Disallow: /, that is, a complete ban on indexing.
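The anchor behavior can be seen with Python's standard library: urllib.parse.urldefrag splits a URL into the address the robot actually requests and the fragment it ignores.

```python
from urllib.parse import urldefrag

# The fragment (everything after #) is not sent to the server,
# so the crawler only ever sees the address without it.
url, fragment = urldefrag("https://example.com/page#part_1")
print(url)       # https://example.com/page (the address that is crawled)
print(fragment)  # part_1 (the anchor, ignored by the robot)
```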
Examples of how directives are interpreted
User-agent: Yandex
Allow: /
Disallow: /
# everything is allowed
User-agent: Yandex
Allow: /$
Disallow: /
# everything is prohibited except the home page
User-agent: Yandex
Disallow: /private*html
# prohibits “/private*html”,
# “/private/test.html”, “/private/html/test.aspx”, etc.
User-agent: Yandex
Disallow: /private$
# only prohibits “/private”
User-agent: *
Disallow: /
User-agent: Yandex
Allow: /
# since the Yandex robot selects the entries
# that contain “User-agent: Yandex”,
# everything is allowed