git rekt — gemini-redirect.git (4e76ffb372acbefd1eed19ac74caf7f564084240): blog/ribw/how-does-googles-search-engine-work/index.html

blog/ribw/how-does-googles-search-engine-work/index.html (view raw)
  1<!DOCTYPE html>
  2<html>
  3<head>
  4<meta charset="utf-8" />
  5<meta name="viewport" content="width=device-width, initial-scale=1" />
  6<title>How does Google’s Search Engine work?</title>
  7<link rel="stylesheet" href="../css/style.css">
  8</head>
  9<body>
 10<main>
 11<p>The original implementation was written in C/++ for Linux/Solaris.</p>
 12<div class="date-created-modified">Created 2020-03-18<br>
 13Modified 2020-03-28</div>
 14<p>There are three major components in the system’s anatomy, which can be thought as steps to be performed for Google to be what it is today.</p>
 15<p><img src="image-1024x649.png" alt="" /></p>
 16<p>But before we talk about the different components, let’s take a look at how they store all of this information.</p>
 17<h2 class="title" id="data_structures"><a class="anchor" href="#data_structures">¶</a>Data structures</h2>
 18<p>A «BigFile» is a virtual file addressable by 64 bits.</p>
 19<p>There exists a repository with the full HTML of every page compressed, along with a document identifier, length and URL.</p>
 20<table class="">
 21 <tbody>
 22  <tr>
 23   <td>
 24    sync
 25   </td>
 26   <td>
 27    length
 28   </td>
 29   <td>
 30    compressed packet
 31   </td>
 32  </tr>
 33 </tbody>
 34</table>
 35<p>The Document Index has the document identifier, a pointer into the repository, a checksum and various other statistics.</p>
 36<table class="">
 37 <tbody>
 38  <tr>
 39   <td>
 40    doc id
 41   </td>
 42   <td>
 43    ecode
 44   </td>
 45   <td>
 46    url len
 47   </td>
 48   <td>
 49    page len
 50   </td>
 51   <td>
 52    url
 53   </td>
 54   <td>
 55    page
 56   </td>
 57  </tr>
 58 </tbody>
 59</table>
 60<p>A Lexicon stores the repository of words, implemented with a hashtable over pointers linking to the barrels (sorted linked lists) of the Inverted Index.</p>
 61<table class="">
 62 <tbody>
 63  <tr>
 64   <td>
 65    word id
 66   </td>
 67   <td>
 68    n docs
 69   </td>
 70  </tr>
 71  <tr>
 72   <td>
 73    word id
 74   </td>
 75   <td>
 76    n docs
 77   </td>
 78  </tr>
 79 </tbody>
 80</table>
 81<p>The Hit Lists store occurences of a word in a document.</p>
 82<table class="">
 83 <tbody>
 84  <tr>
 85   <td>
 86    <strong>
 87     plain
 88    </strong>
 89   </td>
 90   <td>
 91    cap: 1
 92   </td>
 93   <td>
 94    imp: 3
 95   </td>
 96   <td>
 97    pos: 12
 98   </td>
 99  </tr>
100  <tr>
101   <td>
102    <strong>
103     fancy
104    </strong>
105   </td>
106   <td>
107    cap: 1
108   </td>
109   <td>
110    imp: 7
111   </td>
112   <td>
113    type: 4
114   </td>
115   <td>
116    pos: 8
117   </td>
118  </tr>
119  <tr>
120   <td>
121    <strong>
122     anchor
123    </strong>
124   </td>
125   <td>
126    cap: 1
127   </td>
128   <td>
129    imp: 7
130   </td>
131   <td>
132    type: 4
133   </td>
134   <td>
135    hash: 4
136   </td>
137   <td>
138    pos: 8
139   </td>
140  </tr>
141 </tbody>
142</table>
143<p>The Forward Index is a barrel with a range of word identifiers (document identifier and list of word identifiers).</p>
144<table class="">
145 <tbody>
146  <tr>
147   <td rowspan="3">
148    doc id
149   </td>
150   <td>
151    word id: 24
152   </td>
153   <td>
154    n hits: 8
155   </td>
156   <td>
157    hit hit hit hit hit hit hit hit
158   </td>
159  </tr>
160  <tr>
161   <td>
162    word id: 24
163   </td>
164   <td>
165    n hits: 8
166   </td>
167   <td>
168    hit hit hit hit hit hit hit hit
169   </td>
170  </tr>
171  <tr>
172   <td>
173    null word id
174   </td>
175  </tr>
176 </tbody>
177</table>
178<p>The Inverted Index can be sorted by either document identifier or by ranking of word occurence.</p>
179<table class="">
180 <tbody>
181  <tr>
182   <td>
183    doc id: 23
184   </td>
185   <td>
186    n hits: 5
187   </td>
188   <td>
189    hit hit hit hit hit
190   </td>
191  </tr>
192  <tr>
193   <td>
194    doc id: 23
195   </td>
196   <td>
197    n hits: 3
198   </td>
199   <td>
200    hit hit hit
201   </td>
202  </tr>
203  <tr>
204   <td>
205    doc id: 23
206   </td>
207   <td>
208    n hits: 4
209   </td>
210   <td>
211    hit hit hit hit
212   </td>
213  </tr>
214  <tr>
215   <td>
216    doc id: 23
217   </td>
218   <td>
219    n hits: 2
220   </td>
221   <td>
222    hit hit
223   </td>
224  </tr>
225 </tbody>
226</table>
227<p>Back in 1998, Google compressed its repository to 53GB and had 24 million pages. The indices, lexicon, and other temporary storage required about 55GB.</p>
228<h2 id="crawling"><a class="anchor" href="#crawling">¶</a>Crawling</h2>
229<p>The crawling must be reliable, fast and robust, and also respect the decision of some authors not wanting their pages crawled. Originally, it took a week or more, so simultaneous execution became a must.</p>
230<p>Back in 1998, Google had between 3 and 4 crawlers running at 100 web pages per second maximum. These were implemented in Python.</p>
231<p>The crawled pages need parsing to deal with typos or formatting issues.</p>
232<h2 id="indexing"><a class="anchor" href="#indexing">¶</a>Indexing</h2>
233<p>Indexing is about putting the pages into barrels, converting words into word identifiers, and occurences into hit lists.</p>
234<p>Once indexing is done, sorting of the barrels happens to have them ordered by word identifier, producing the inverted index. This process also had to be done in parallel over many machines, or would otherwise have been too slow.</p>
235<h2 id="searching"><a class="anchor" href="#searching">¶</a>Searching</h2>
236<p>We need to find quality results efficiently. Plenty of weights are considered nowadays, but at its heart, PageRank is used. It is the algorithm they use to map the web, which is formally defined as follows:</p>
237<p><img src="8e1e61b119e107fcb4bdd7e78f649985.png" alt="" />
238<em>PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))</em></p>
239<p>Where:</p>
240<ul>
241<li><code>A</code> is a given page</li>
242<li><code>T&lt;sub&gt;n&lt;/sub&gt;</code> are pages that point to A</li>
243<li><code>d</code> is the damping factor in the range <code>[0, 1]</code> (often 0.85)</li>
244<li><code>C(A)</code> is the number of links going out of page <code>A</code></li>
245<li><code>PR(A)</code> is the page rank of page <code>A</code>
246This formula indicates the probability that a random surfer visits a certain page, and <code>1 - d</code> is used to indicate when it will «get bored» and stop surfing. More intuitively, the page rank of a page will grow as more pages link to it, or the few that link to it have high page rank.</li>
247</ul>
248<p>The anchor text in the links also help provide a better description and helps indexing for even better results.</p>
249<p>While searching, the concern is disk I/O which takes up most of the time. Caching is very important to improve performance up to 30 times.</p>
250<p>Now, in order to turn user queries into something we can search, we must parse the query and convert the words into word identifiers.</p>
251<h2 id="conclusion"><a class="anchor" href="#conclusion">¶</a>Conclusion</h2>
252<p>Google is designed to be a efficient, scalable, high-quality search engine. There are still bottlenecks in CPU, memory, disk speed and network I/O, but major data structures are used to make efficient use of the resources.</p>
253<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
254<ul>
255<li><a href="https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf">The anatomy of a large-scale hypertextual Web search engine</a></li>
256<li><a href="https://www.site.uottawa.ca/%7Ediana/csi4107/Google_SearchEngine.pdf">The Anatomy of a Large-Scale Hypertextual Web Search Engine (slides)</a></li>
257</ul>
258</main>
259</body>
260</html>
261