blog/ribw/how-does-googles-search-engine-work/index.html (view raw)
1<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta name=description content="Official Lonami's website"><meta name=viewport content="width=device-width, initial-scale=1.0, user-scalable=yes"><title> How does Google’s Search Engine work? | Lonami's Blog </title><link rel=stylesheet href=/style.css><body><article><nav class=sections><ul><li><a href=/>lonami's site</a><li><a href=/blog class=selected>blog</a><li><a href=/golb>golb</a><li><a href=/blog/atom.xml>rss</a></ul></nav><main><h1 class=title>How does Google’s Search Engine work?</h1><div class=time><p>2020-03-18T01:00:00+00:00<p>last updated 2020-03-28T10:17:09+00:00</div><p>The original implementation was written in C/++ for Linux/Solaris.<p>There are three major components in the system’s anatomy, which can be thought as steps to be performed for Google to be what it is today.<p><img src=https://lonami.dev/blog/ribw/how-does-googles-search-engine-work/image-1024x649.png><p>But before we talk about the different components, let’s take a look at how they store all of this information.<h2 id=data-structures>Data structures</h2><p>A «BigFile» is a virtual file addressable by 64 bits.<p>There exists a repository with the full HTML of every page compressed, along with a document identifier, length and URL.<table><tbody><tr><td>sync<td>length<td>compressed packet</table><p>The Document Index has the document identifier, a pointer into the repository, a checksum and various other statistics.<table><tbody><tr><td>doc id<td>ecode<td>url len<td>page len<td>url<td>page</table><p>A Lexicon stores the repository of words, implemented with a hashtable over pointers linking to the barrels (sorted linked lists) of the Inverted Index.<table><tbody><tr><td>word id<td>n docs<tr><td>word id<td>n docs</table><p>The Hit Lists store occurences of a word in a document.<table><tbody><tr><td><strong> plain </strong><td>cap: 1<td>imp: 3<td>pos: 12<tr><td><strong> fancy </strong><td>cap: 1<td>imp: 7<td>type: 4<td>pos: 8<tr><td><strong> anchor </strong><td>cap: 1<td>imp: 7<td>type: 4<td>hash: 4<td>pos: 8</table><p>The Forward Index is a barrel with a range of word identifiers (document identifier and list of word identifiers).<table><tbody><tr><td rowspan=3>doc id<td>word id: 24<td>n hits: 8<td>hit hit hit hit hit hit hit hit<tr><td>word id: 24<td>n hits: 8<td>hit hit hit hit hit hit hit hit<tr><td>null word id</table><p>The Inverted Index can be sorted by either document identifier or by ranking of word occurence.<table><tbody><tr><td>doc id: 23<td>n hits: 5<td>hit hit hit hit hit<tr><td>doc id: 23<td>n hits: 3<td>hit hit hit<tr><td>doc id: 23<td>n hits: 4<td>hit hit hit hit<tr><td>doc id: 23<td>n hits: 2<td>hit hit</table><p>Back in 1998, Google compressed its repository to 53GB and had 24 million pages. The indices, lexicon, and other temporary storage required about 55GB.<h2 id=crawling>Crawling</h2><p>The crawling must be reliable, fast and robust, and also respect the decision of some authors not wanting their pages crawled. Originally, it took a week or more, so simultaneous execution became a must.<p>Back in 1998, Google had between 3 and 4 crawlers running at 100 web pages per second maximum. These were implemented in Python.<p>The crawled pages need parsing to deal with typos or formatting issues.<h2 id=indexing>Indexing</h2><p>Indexing is about putting the pages into barrels, converting words into word identifiers, and occurences into hit lists.<p>Once indexing is done, sorting of the barrels happens to have them ordered by word identifier, producing the inverted index. This process also had to be done in parallel over many machines, or would otherwise have been too slow.<h2 id=searching>Searching</h2><p>We need to find quality results efficiently. Plenty of weights are considered nowadays, but at its heart, PageRank is used. It is the algorithm they use to map the web, which is formally defined as follows:<p><img src=https://lonami.dev/blog/ribw/how-does-googles-search-engine-work/8e1e61b119e107fcb4bdd7e78f649985.png> <em>PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))</em><p>Where:<ul><li><code>A</code> is a given page<li><code>T<sub>n</sub></code> are pages that point to A<li><code>d</code> is the damping factor in the range <code>[0, 1]</code> (often 0.85)<li><code>C(A)</code> is the number of links going out of page <code>A</code><li><code>PR(A)</code> is the page rank of page <code>A</code> This formula indicates the probability that a random surfer visits a certain page, and <code>1 - d</code> is used to indicate when it will «get bored» and stop surfing. More intuitively, the page rank of a page will grow as more pages link to it, or the few that link to it have high page rank.</ul><p>The anchor text in the links also help provide a better description and helps indexing for even better results.<p>While searching, the concern is disk I/O which takes up most of the time. Caching is very important to improve performance up to 30 times.<p>Now, in order to turn user queries into something we can search, we must parse the query and convert the words into word identifiers.<h2 id=conclusion>Conclusion</h2><p>Google is designed to be a efficient, scalable, high-quality search engine. There are still bottlenecks in CPU, memory, disk speed and network I/O, but major data structures are used to make efficient use of the resources.<h2 id=references>References</h2><ul><li><a href=https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf>The anatomy of a large-scale hypertextual Web search engine</a><li><a href=https://www.site.uottawa.ca/%7Ediana/csi4107/Google_SearchEngine.pdf>The Anatomy of a Large-Scale Hypertextual Web Search Engine (slides)</a></ul></main><footer><div><p>Share your thoughts, or simply come hang with me <a href=https://t.me/LonamiWebs><img src=/img/telegram.svg alt=Telegram></a> <a href=mailto:totufals@hotmail.com><img src=/img/mail.svg alt=Mail></a></div></footer></article><p class=abyss>Glaze into the abyss… Oh hi there!