URL and hashing

Safe Browsing lists consist of variable-length SHA-256 hash prefixes (from 4 to 32 bytes) mapped to URLs of internet resources that pose threats to users. To check whether a URL is on a list (in the local database or on the server), compute its hash prefixes. To do this:

  1. Canonicalize the URL.
  2. Create the host and path expressions and their combinations.
  3. Compute the hash prefix of each combination.

Then search the local Safe Browsing database for the hash prefixes. If you find a hash prefix, send it in your request.

Canonicalized URL

The URL must conform to RFC 2396. Convert URLs that use internationalized domain names (IDN), such as .рф, to ASCII using Punycode. The URL must contain a path component: if the source URL has no path, add a slash after the host, for example https://yandex.ru/.

To canonicalize a URL:

  1. Delete the tab characters (0x09), carriage return characters (0x0d), and line feed characters (0x0a).

  2. Delete the anchor. For example, reduce https://yandex.ru/#anchor to https://yandex.ru/.

  3. Separately process:

    Host
    • Delete the leading and trailing dots.
    • Replace several consecutive dots with one dot.
    • If the host name is an IP address, convert it to dotted-decimal form: four decimal numbers separated by dots.
    • Make it lowercase.
    Path
    Note. Don't apply these actions to the request parameters.
    • Resolve the /../ sequences by deleting each one together with the path component that precedes it, and replace /./ with /.
    • Replace multiple consecutive slashes with one slash.
  4. Escape all characters with ASCII codes of 32 or less and 127 or more, as well as the # and % characters. Use uppercase hexadecimal notation.
Example

Source URL | Canonicalized URL
http://host/%25%32%35 | http://host/%25
http://host/%25%32%35%25%32%35 | http://host/%25%25
http://host/%2525252525252525 | http://host/%25
http://host/asdf%25%32%35asd | http://host/asdf%25asd
http://host/%%%25%32%35asd%% | http://host/%25%25%25asd%25%25
http://%32%31%33%2e%31%38%30%2e%32%31%30%2e5/%2e%73%65%63%75%72%65%2f%77%77%77%2e%6d%6f%69%6b%72%75%67%2e%72%75/ | http://213.180.210.5/.secure/www.moikrug.ru/
http://host%23.com/%257Ea%2521b%2540c%2523d%2524e%25f%255E00%252611%252A22%252833%252944_55%252B | http://host%23.com/~a!b@c%23d$e%25f^00&11*22(33)44_55+
http://3279880203/smth | http://195.127.0.11/smth
http://www.yandex.ru/smth/.. | http://www.yandex.ru/
www.yandex.ru | http://www.yandex.ru/
http://www.zlo.com/smth#ancor | http://www.zlo.com/smth
http://www.YANdex.ru/ | http://www.yandex.ru/
http://www.yandex.ru.../ | http://www.yandex.ru/
http://www.yandex.ru/m\ta\rp\ns | http://www.yandex.ru/maps
http://zlo.com/smth#more#again | http://zlo.com/smth
http://\x01\x80.com/ | http://%01%80.com/
http://www.hostport.com:1234/ | http://www.hostport.com/
"  http://www.yandex.ru/  " (quoted to show the surrounding spaces) | http://www.yandex.ru/
http:// probel.ru/ | http://%20probel.ru/
http://host.com//dvaslesha?more//slashes | http://host.com/dvaslesha?more//slashes

Expressions for the host and path

After you canonicalize the URL, create:
  1. Expressions for the host and path.
  2. Combinations of those expressions.
Expressions for the host
Compose no more than 5 different strings:
  • A fully qualified host name.
  • Up to 4 host names formed by taking the last five components and successively removing the leading component. The top-level domain on its own can be skipped. Don't create these strings if the host name is an IP address.
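
The host rule can be illustrated with a short Go sketch. The function name hostExpressions is hypothetical, and the code assumes the host has already been canonicalized. IP addresses are detected with net.ParseIP from the standard library.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// hostExpressions returns the fully qualified host name followed by up to
// four shorter names built from the last five components.
func hostExpressions(host string) []string {
	// An IP address is used as is, with no shortened variants.
	if net.ParseIP(host) != nil {
		return []string{host}
	}
	exprs := []string{host}
	parts := strings.Split(host, ".")
	// Start from the last five components, or later if the host is shorter.
	start := len(parts) - 5
	if start < 1 {
		start = 1 // index 0 would repeat the full host name
	}
	// Stop before the bare top-level domain, which is skipped here.
	for i := start; i < len(parts)-1; i++ {
		exprs = append(exprs, strings.Join(parts[i:], "."))
	}
	return exprs
}

func main() {
	fmt.Println(hostExpressions("a.b.c"))         // [a.b.c b.c]
	fmt.Println(hostExpressions("a.b.c.d.e.f.g")) // [a.b.c.d.e.f.g c.d.e.f.g d.e.f.g e.f.g f.g]
	fmt.Println(hostExpressions("1.2.3.4"))       // [1.2.3.4]
}
```
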
Expressions for the path
Compose no more than 6 different strings:
  • Full path with the request parameters.
  • Full path without the request parameters.
  • Up to 4 paths formed by starting at the root (/) and successively appending path components, each with a trailing slash.
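
A possible Go sketch of the path rule is shown here. The function name pathExpressions and its two-argument shape (path and request parameters passed separately) are assumptions made for illustration. Duplicates are dropped so that a path such as / yields a single string.

```go
package main

import (
	"fmt"
	"strings"
)

// pathExpressions returns at most six strings: the full path with and
// without the request parameters, then up to four directory prefixes
// starting at the root.
func pathExpressions(path, query string) []string {
	var exprs []string
	seen := map[string]bool{}
	add := func(s string) {
		if !seen[s] {
			seen[s] = true
			exprs = append(exprs, s)
		}
	}
	if query != "" {
		add(path + "?" + query)
	}
	add(path)
	// Directory prefixes: "/", then "/a/", "/a/b/", "/a/b/c/".
	prefix := "/"
	add(prefix)
	segments := strings.Split(strings.Trim(path, "/"), "/")
	// The last segment is covered by the full path, so it is not appended.
	for i := 0; i < len(segments)-1 && i < 3; i++ {
		prefix += segments[i] + "/"
		add(prefix)
	}
	return exprs
}

func main() {
	fmt.Println(pathExpressions("/1/2.html", "param=1")) // [/1/2.html?param=1 /1/2.html / /1/]
	fmt.Println(pathExpressions("/1/", ""))              // [/1/ /]
}
```
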
Combinations

Compose up to 30 different combinations of the host and path expressions. The combinations must include only the host name and path: remove the scheme (protocol), username, password, and port. If the URL contains request parameters, at least one combination must contain the full path with the request parameters.
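
One way to drop the scheme, username, password, and port is to parse the URL with Go's standard net/url package, as in this sketch. The function name splitURL and the sample URL with credentials are hypothetical. The sketch assumes the canonicalized URL is one that url.Parse accepts, which may not hold for every host that contains percent escapes.

```go
package main

import (
	"fmt"
	"net/url"
)

// splitURL returns the host, path and request parameters of a URL.
// The scheme, username, password and port are left out.
func splitURL(raw string) (host, path, query string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", err
	}
	// Hostname strips the port; EscapedPath keeps percent escapes intact.
	return u.Hostname(), u.EscapedPath(), u.RawQuery, nil
}

func main() {
	host, path, query, err := splitURL("http://user:secret@a.b.c:8080/1/2.html?param=1")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(host)  // a.b.c
	fmt.Println(path)  // /1/2.html
	fmt.Println(query) // param=1
}
```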

Example
For the URL http://a.b.c/1/2.html?param=1, compose the following strings:
a.b.c/1/2.html?param=1
a.b.c/1/2.html
a.b.c/
a.b.c/1/
b.c/1/2.html?param=1
b.c/1/2.html
b.c/
b.c/1/

For the URL http://a.b.c.d.e.f.g/1.html, compose the following strings:

a.b.c.d.e.f.g/1.html
a.b.c.d.e.f.g/
(Skip b.c.d.e.f.g since only the last five host name components and the fully qualified host name are used)
c.d.e.f.g/1.html
c.d.e.f.g/
d.e.f.g/1.html
d.e.f.g/
e.f.g/1.html
e.f.g/
f.g/1.html
f.g/

For the URL http://1.2.3.4/1/, compose the following strings:

1.2.3.4/1/
1.2.3.4/
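
The first example above can be reproduced by pairing every host expression with every path expression, as in this Go sketch. The function name combine is hypothetical, and the two input slices are written out by hand here. With at most 5 host strings and 6 path strings, the result never exceeds 30 combinations.

```go
package main

import "fmt"

// combine joins every host expression with every path expression.
func combine(hosts, paths []string) []string {
	out := make([]string, 0, len(hosts)*len(paths))
	for _, h := range hosts {
		for _, p := range paths {
			out = append(out, h+p)
		}
	}
	return out
}

func main() {
	// Expressions for http://a.b.c/1/2.html?param=1
	hosts := []string{"a.b.c", "b.c"}
	paths := []string{"/1/2.html?param=1", "/1/2.html", "/", "/1/"}
	for _, c := range combine(hosts, paths) {
		fmt.Println(c)
	}
}
```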

Hashing

After you compose the host and path combinations, compute the full-length SHA-256 hash for each of them.

Then compute the prefix for each full-length hash. The hash prefix length is 4-32 bytes.

Examples from the FIPS-180-4 standard:

Hashing of the abc string
  • The hash digest is ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad.
  • The 32-bit prefix is ba7816bf.
Hashing of the abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq string
  • The hash digest is 248d6a61 d20638b8 e5c02693 0c3e6039 a33ce459 64ff2167 f6ecedd4 19db06c1.
  • The 48-bit prefix is 248d6a61 d206.
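
The two FIPS 180-4 examples can be checked with Go's standard crypto/sha256 package. The function name hashPrefix is hypothetical. It rejects prefix lengths outside the 4 to 32 byte range stated above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashPrefix returns the first n bytes of the SHA-256 digest of expr.
func hashPrefix(expr string, n int) ([]byte, error) {
	if n < 4 || n > sha256.Size {
		return nil, fmt.Errorf("prefix length %d is outside 4..%d bytes", n, sha256.Size)
	}
	sum := sha256.Sum256([]byte(expr))
	return sum[:n], nil
}

func main() {
	p, err := hashPrefix("abc", 4)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(hex.EncodeToString(p)) // ba7816bf

	q, err := hashPrefix("abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq", 6)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(hex.EncodeToString(q)) // 248d6a61d206
}
```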