URL and hashing
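Safe Browsing lists consist of variable-length SHA-256 hash prefixes (from 4 to 32 bytes) that correspond to URLs of internet resources that pose threats to users. To check whether a URL is on a list (in the local database or on the server), compute its hash prefixes. To do this:
- Canonicalize the URL.
- Compose expressions for the host and path.
- Compute the hash of each expression and take its prefix.
Then search the local Safe Browsing database for the prefixes. If you find a hash prefix, send it in your request to the server.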
Canonicalized URL
The URL must conform to RFC 2396. Make sure that you convert URLs that use internationalized domain names (IDN), such as .рф, to ASCII using the Punycode method. The URL must contain a path component. This means it must end with a slash, for example: https://yandex.ru/.
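As an illustration of the IDN conversion, here is a minimal Python sketch that relies on the standard library's built-in `idna` codec (which implements the older IDNA 2003 rules). The function name `host_to_ascii` and the host `пример.рф` are hypothetical and used only for this example; a production client may prefer a dedicated IDNA library.

```python
def host_to_ascii(host: str) -> str:
    """Convert an internationalized host name to its ASCII (Punycode) form."""
    # The built-in "idna" codec converts each dot-separated label separately
    # and leaves labels that are already ASCII unchanged.
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    # Hypothetical host name, chosen only because it uses the .рф zone.
    print(host_to_ascii("пример.рф"))
    print(host_to_ascii("yandex.ru"))
```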
To canonicalize a URL:
- Delete the tab characters (0x09), carriage return characters (0x0d), and line feed characters (0x0a).
- Delete the anchor. For example, reduce https://yandex.ru/#anchor to https://yandex.ru/.
- Separately process:
  - Host
    - Delete the leading and trailing dots.
    - Replace several consecutive dots with one dot.
    - If the host name is an IP address, convert it to four decimals, separated by a dot.
    - Make it lowercase.
  - Path
    - Note. Don't apply these actions to the request parameters.
    - Delete the /../ sequences and replace /./ with /.
    - Replace multiple consecutive slashes with one slash.
    - Escape all the characters with ASCII codes less than 32 and more than 127, as well as the # and % characters. Use uppercase hexadecimal notation.
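The canonicalization steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names are hypothetical, only the single-integer form of an IPv4 address is converted, `/../` is assumed to also remove the preceding path component (the text only says to delete the sequence), escaping is applied to the path only, and the user name, password, and port are dropped because the later steps discard them anyway.

```python
import ipaddress
import re
from urllib.parse import urlsplit


def canonicalize_host(host: str) -> str:
    # Drop leading/trailing dots, collapse runs of dots, make lowercase.
    host = re.sub(r"\.{2,}", ".", host.strip(".")).lower()
    # Assumption: only the single-integer IPv4 form (e.g. "3279880203") is
    # converted to dotted decimal; hex and octal forms are not handled here.
    if host.isascii() and host.isdigit() and int(host) < 2**32:
        host = str(ipaddress.IPv4Address(int(host)))
    return host


def canonicalize_path(path: str) -> str:
    segments = []
    for segment in path.split("/"):
        if segment in ("", "."):
            continue  # "//" and "/./" both collapse to "/"
        if segment == "..":
            # Assumption: "/../" also removes the preceding path component.
            if segments:
                segments.pop()
            continue
        segments.append(segment)
    result = "/" + "/".join(segments)
    if segments and path.endswith("/"):
        result += "/"
    # Escape bytes below 32 and above 127, plus "#" and "%", in uppercase hex.
    return "".join(
        f"%{byte:02X}" if byte < 32 or byte > 127 or byte in b"#%" else chr(byte)
        for byte in result.encode("utf-8")
    )


def canonicalize(url: str) -> str:
    url = re.sub(r"[\t\r\n]", "", url)  # delete tab, CR, and LF characters
    url = url.split("#", 1)[0]          # delete the anchor
    parts = urlsplit(url)
    host = canonicalize_host(parts.hostname or "")
    path = canonicalize_path(parts.path)
    # The request parameters are passed through unchanged.
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{host}{path}{query}"


if __name__ == "__main__":
    print(canonicalize("https://yandex.ru/#anchor"))
    print(canonicalize("http://WWW.Example.COM./a/./b/../c//d?x=//1"))
```

Because `%` is escaped literally, a path that already contains percent-escapes would be escaped a second time; whether such input should be unescaped first is not stated in the text, so the sketch leaves that decision open.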
Expressions for the host and path
For each canonicalized URL, compose:
- Expressions for the host and path.
- Combinations of those expressions.

- Expressions for the host
  Compose no more than 5 different strings:
  - A fully qualified host name.
  - Up to 4 host names, starting with the last five components and removing the first component sequentially. You can skip the top-level domain. Don't create those strings if the host name is an IP address.
- Expressions for the path
  Compose no more than 6 different strings:
  - Full path with the request parameters.
  - Full path without the request parameters.
  - 4 paths, starting from the root and sequentially adding path components with a trailing slash.
- Combinations
  Compose up to 30 different combinations of expressions for the host and path. Those combinations must include only the host name and path: remove the scheme (protocol), username, password, and port. If the URL contains request parameters, at least one combination must contain the full path and request parameters.
- Example
  - For the URL http://a.b.c/1/2.html?param=1, compose the following strings:
    - a.b.c/1/2.html?param=1
    - a.b.c/1/2.html
    - a.b.c/
    - a.b.c/1/
    - b.c/1/2.html?param=1
    - b.c/1/2.html
    - b.c/
    - b.c/1/
  - For the URL http://a.b.c.d.e.f.g/1.html, compose the following strings:
    - a.b.c.d.e.f.g/1.html
    - a.b.c.d.e.f.g/
    - c.d.e.f.g/1.html
    - c.d.e.f.g/
    - d.e.f.g/1.html
    - d.e.f.g/
    - e.f.g/1.html
    - e.f.g/
    - f.g/1.html
    - f.g/
    Skip b.c.d.e.f.g, since only the last five host name components and the fully qualified host name are used.
  - For the URL http://1.2.3.4/1/, compose the following strings:
    - 1.2.3.4/1/
    - 1.2.3.4/
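The rules for composing host and path expressions can be sketched in Python as shown below. The function names are hypothetical, and the sketch assumes the URL has already been canonicalized; it reproduces the three examples above but is not meant as a definitive implementation.

```python
import ipaddress
from urllib.parse import urlsplit


def host_expressions(host: str) -> list[str]:
    """Return the fully qualified host plus up to 4 shorter suffixes."""
    try:
        ipaddress.ip_address(host)
        return [host]  # no suffixes are generated for an IP address
    except ValueError:
        pass
    parts = host.split(".")
    expressions = [host]
    # Start from the last five components and drop the leading component
    # each time; stop before the bare top-level domain.
    for i in range(max(1, len(parts) - 5), len(parts) - 1):
        expressions.append(".".join(parts[i:]))
    return expressions


def path_expressions(path: str, query: str) -> list[str]:
    """Return the full path (with and without parameters) plus up to 4 prefixes."""
    expressions = []
    if query:
        expressions.append(f"{path}?{query}")
    expressions.append(path)
    components = [c for c in path.split("/") if c]
    # The last component counts as a directory only if the path ends with "/".
    directories = components if path.endswith("/") else components[:-1]
    prefix = "/"
    prefixes = [prefix]
    for directory in directories[:3]:  # root plus 3 directories = 4 paths
        prefix += directory + "/"
        prefixes.append(prefix)
    for candidate in prefixes:
        if candidate not in expressions:  # avoid duplicating the full path
            expressions.append(candidate)
    return expressions


def url_expressions(url: str) -> list[str]:
    """Combine every host expression with every path expression."""
    parts = urlsplit(url)
    hosts = host_expressions(parts.hostname or "")
    paths = path_expressions(parts.path or "/", parts.query)
    return [host + path for host in hosts for path in paths]


if __name__ == "__main__":
    for expression in url_expressions("http://a.b.c/1/2.html?param=1"):
        print(expression)
```

With at most 5 host expressions and 6 path expressions, the nested loop in `url_expressions` yields at most 30 combinations, which matches the limit given above.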
Hashing
After you compose the host and path combinations, compute the full-length SHA-256 hash for each of them.
Then take the prefix of each full-length hash. The hash prefix length is from 4 to 32 bytes.
Examples from the FIPS 180-4 standard:
- Hashing of the abc string
  - The hash digest is ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad.
  - The 32-bit prefix is ba7816bf.
- Hashing of the abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq string
  - The hash digest is 248d6a61 d20638b8 e5c02693 0c3e6039 a33ce459 64ff2167 f6ecedd4 19db06c1.
  - The 48-bit prefix is 248d6a61 d206.
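The hashing step can be illustrated with Python's standard `hashlib` module. The helper names `hash_prefix` and `matches_local` are hypothetical, and the local database is modeled here as a plain in-memory set of byte strings, which is an assumption made only for this sketch.

```python
import hashlib


def full_hash(expression: str) -> bytes:
    """Return the full 32-byte SHA-256 digest of a host/path expression."""
    return hashlib.sha256(expression.encode("utf-8")).digest()


def hash_prefix(expression: str, length: int = 4) -> bytes:
    """Return the first `length` bytes (4 to 32) of the SHA-256 digest."""
    if not 4 <= length <= 32:
        raise ValueError("prefix length must be between 4 and 32 bytes")
    return full_hash(expression)[:length]


def matches_local(expression: str, local_prefixes: set[bytes]) -> bool:
    """Check every allowed prefix length against a local set of prefixes."""
    digest = full_hash(expression)
    return any(digest[:n] in local_prefixes for n in range(4, 33))


if __name__ == "__main__":
    # The two FIPS 180-4 test strings used in the examples above.
    print(hash_prefix("abc").hex())
    long_input = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq"
    print(hash_prefix(long_input, 6).hex())
```

Because stored prefixes vary in length, `matches_local` tries each length from 4 to 32 bytes rather than assuming a fixed size.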