The Yandex search engine responds to user queries with relevant web documents it finds on the internet. However, the size of the internet is currently measured in exabytes: quintillions, or billions of billions, of bytes of information. Needless to say, Yandex Search does not trawl through this enormous pile of data every time it responds to a new search query. The system, so to speak, does its homework in advance.
To perform a search, Yandex uses a search index, which is basically a database of all the words known to the search engine and their locations. A word’s location is a combination of its position on a web page and the web page’s address on the internet. A search index is like a glossary or a telephone directory. Unlike a glossary, which only contains selected terms, a search index registers every word the search engine has ever come across. And, unlike a telephone directory, which lists one address for each name, a search index has more than one ‘registered address’ for every word.
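As a rough illustration of the idea, an index of this kind can be sketched as a mapping from each word to a list of (address, position) pairs. This is a minimal sketch, not Yandex's actual data structure; the function name `build_index` and the example URLs are assumptions made for the example.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to every (url, position) pair where it occurs."""
    index = defaultdict(list)
    for url, text in pages.items():
        # A word's position is its ordinal number within the page text.
        for position, word in enumerate(text.lower().split()):
            index[word].append((url, position))
    return index

# Two hypothetical pages, already reduced to plain text.
pages = {
    "http://example.com/a": "search engines index the web",
    "http://example.com/b": "the web is large",
}
index = build_index(pages)
print(index["web"])
# [('http://example.com/a', 4), ('http://example.com/b', 1)]
```

The word "web" ends up with two 'registered addresses', one for each page on which it appears.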
A web search engine operates in two stages. First, it crawls the web, saving a ‘copy’ of it on its servers. Second, it responds to a user’s search query by retrieving an answer from its servers.
Background work
Before a search engine can start the search, it needs to prepare the information it finds on the internet for searching. This process is called indexing. A special computer system, a web crawler, browses the internet regularly, downloads new web pages and processes them. It creates a kind of ‘carbon copy’ of the internet, which is stored on the search engine’s servers and is updated after every crawl.
Yandex has two crawlers. The main crawler indexes all the web pages it comes across, while the other one, known as Orange, performs express indexing to ensure that the most recent documents, including those that appeared on the web minutes or even seconds before the crawl, are available in the search engine’s index. Both crawlers have ‘waiting lists’ of web pages that need to be indexed. New links that the crawlers find on the pages they visit are continually added to these lists. New links can also appear on the waiting lists when website owners submit their pages for indexing using the Yandex.Webmaster service. Website administrators can also provide additional details, for instance how often their website is updated.
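A waiting list of this kind can be pictured as a queue that remembers which addresses it has already seen, so that the same link is not queued twice. The sketch below is only an illustration under that assumption; the class name `WaitingList` and its methods are hypothetical and do not describe Yandex's implementation.

```python
from collections import deque

class WaitingList:
    """A queue of URLs awaiting indexing that ignores duplicates."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url):
        """Queue a URL unless it was queued before; return True if added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_url(self):
        """Return the oldest waiting URL, or None if the list is empty."""
        return self._queue.popleft() if self._queue else None

waiting = WaitingList()
# A page submitted by a website owner.
waiting.add("http://example.com/")
# Links found by a crawler on a visited page; the duplicate is skipped.
for link in ["http://example.com/about", "http://example.com/"]:
    waiting.add(link)
```

Here both sources of new links, owner submissions and links discovered during crawling, feed the same list.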
Before the crawling process can start, a special program, the scheduler, creates a schedule: the order in which web pages will be visited. Scheduling is based on a number of factors relevant to information retrieval, such as link popularity or page update frequency. After a schedule has been made, another component of the search engine, the spider, takes over. The spider regularly visits pages according to the schedule. If a website is accessible to the spider and is functioning, the program downloads the website’s pages as scheduled. It identifies the format (HTML, PDF, SWF etc.), encoding and language of each downloaded document and then sends the document, together with this information, to the servers for storage.
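To make the scheduling step more concrete, the sketch below orders pages by a score built from the two factors named above. The scoring formula, its weights and the field names are assumptions chosen for the example; the factors Yandex actually uses and how it combines them are not described here.

```python
from dataclasses import dataclass

@dataclass
class PageInfo:
    url: str
    inbound_links: int       # stand-in for link popularity
    updates_per_week: float  # stand-in for page update frequency

def priority(page):
    # Hypothetical weighting: frequently updated pages get a strong boost.
    return page.inbound_links + 10 * page.updates_per_week

def make_schedule(pages):
    """Return URLs in visiting order, highest priority first."""
    return [p.url for p in sorted(pages, key=priority, reverse=True)]

pages = [
    PageInfo("http://example.com/archive", inbound_links=5, updates_per_week=0.1),
    PageInfo("http://example.com/news", inbound_links=40, updates_per_week=7),
    PageInfo("http://example.com/about", inbound_links=12, updates_per_week=0.5),
]
schedule = make_schedule(pages)
print(schedule)
```

With these made-up numbers, the popular and frequently updated news page is visited first and the rarely changing archive last.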
On the storage server, another program strips the HTML markup from the web document, leaving only text. It then extracts information about each word’s location and adds all the words in this web document to the index. The original document is also stored on the server until the next crawl. This allows Yandex to let users view a web document even if its website is temporarily unavailable. If a website shuts down or a web document gets deleted or updated, Yandex removes it from its servers or replaces it with a newer version.
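The markup-stripping and position-extraction steps can be sketched with the HTML parser from Python's standard library. This is a simplified illustration, not the program Yandex runs: the names `TextExtractor` and `word_positions` are hypothetical, and a production system would also have to deal with scripts, styles, punctuation and word forms.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags and discard the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def word_positions(html):
    """Return a mapping from each word to its positions in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).lower().split()
    positions = {}
    for position, word in enumerate(words):
        positions.setdefault(word, []).append(position)
    return positions

doc = "<html><body><h1>Web search</h1><p>Search the web</p></body></html>"
positions = word_positions(doc)
print(positions)
# {'web': [0, 4], 'search': [1, 2], 'the': [3]}
```

Combined with the page's address, each of these positions gives one location of a word for the index.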
The search index, together with copies of all the indexed documents and their type, encoding and language, forms the search database. To keep up with the ever-changing nature of internet content and make sure that the search engine can find the latest and most relevant information in response to user search queries, the search database needs to be updated regularly. Before its contents can be returned to users as search results, each new database update first goes to the ‘basic search’ servers. The basic search servers contain only the essential part of the search database, free from spam, mirror sites and other irrelevant documents. This is the part of the search database that responds to user queries directly.
The search database updates are sent from the main crawler’s storage servers to the basic search servers in ‘packages’ once every few days. This is a very resource-intensive process. To reduce the load on the servers, the data is transferred at night, when search traffic on Yandex is at its lowest. The new portions of the database are compared, using a number of parameters, against the version available from the previous crawl to ensure that the update does not degrade the quality of search results. After a successful quality control check, the old version is replaced with the latest update.
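The compare-then-replace logic described above might look roughly like the following. The metric names, the 5% tolerance and both function names are purely illustrative assumptions; the parameters Yandex actually compares are not specified in this text.

```python
def passes_quality_check(previous, candidate, max_drop=0.05):
    """Accept the candidate only if no metric falls more than
    max_drop (as a fraction) below its previous value."""
    for name, old_value in previous.items():
        new_value = candidate.get(name, 0.0)
        if new_value < old_value * (1 - max_drop):
            return False
    return True

def publish(live_version, previous_metrics, new_version, new_metrics):
    """Return the version that should serve queries after the check."""
    if passes_quality_check(previous_metrics, new_metrics):
        return new_version
    return live_version

# Hypothetical quality metrics for the live database and two updates.
previous = {"relevance": 0.80, "coverage": 0.90}
good_update = {"relevance": 0.82, "coverage": 0.89}
bad_update = {"relevance": 0.70, "coverage": 0.95}

print(publish("v1", previous, "v2", good_update))  # v2
print(publish("v1", previous, "v2", bad_update))   # v1
```

In this sketch, an update that fails the check is simply not published, so the previous version keeps answering queries.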
The Orange crawler is designed for real-time search. Both its scheduler and spider are tuned to find the latest web documents and to pick, from a vast number of pages, those that might be of interest. These documents are processed instantly and sent straight to the basic search servers. As the number of these documents is relatively low, the update can happen in real time, even during the day, without the risk of overloading the servers.
A web search engine, roughly, operates in two stages. The first is crawling the web and indexing pages to prepare them for searching. The second is searching the previously created search database for an answer to a specific user query.
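The second stage can be illustrated with a lookup against a small, previously built index. The sketch assumes the index maps words to (address, position) pairs and that a page matches only if it contains every query word; the function name `search` and the sample data are hypothetical, and real ranking of results is far more involved.

```python
def search(index, query):
    """Return the set of URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = None
    for word in words:
        # All pages on which this word has at least one registered location.
        urls = {url for url, _position in index.get(word, [])}
        results = urls if results is None else results & urls
    return results

# A tiny index prepared during the first stage (hypothetical data).
index = {
    "web": [("http://example.com/a", 4), ("http://example.com/b", 1)],
    "large": [("http://example.com/b", 3)],
}
print(search(index, "large web"))  # {'http://example.com/b'}
```

No page is downloaded at query time: the answer comes entirely from the data prepared in advance.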