git rekt — gemini-redirect.git (12c4c2886f3cb8692493c92fde959f8561c5b8fd): content/blog/ribw/how-does-googles-search-engine-work/post.md

content/blog/ribw/how-does-googles-search-engine-work/post.md (view raw)
  1```meta
  2title: How does Google’s Search Engine work?
  3published: 2020-03-18T01:00:00+00:00
  4updated: 2020-03-28T10:17:09+00:00
  5```
  6
  7The original implementation was written in C/++ for Linux/Solaris.
  8
  9There are three major components in the system’s anatomy, which can be thought as steps to be performed for Google to be what it is today.
 10
 11![](image-1024x649.png)
 12
 13But before we talk about the different components, let’s take a look at how they store all of this information.
 14
 15## Data structures
 16
 17A «BigFile» is a virtual file addressable by 64 bits.
 18
 19There exists a repository with the full HTML of every page compressed, along with a document identifier, length and URL.
 20
 21<table class="">
 22 <tbody>
 23  <tr>
 24   <td>
 25    sync
 26   </td>
 27   <td>
 28    length
 29   </td>
 30   <td>
 31    compressed packet
 32   </td>
 33  </tr>
 34 </tbody>
 35</table>
 36
 37The Document Index has the document identifier, a pointer into the repository, a checksum and various other statistics.
 38
 39<table class="">
 40 <tbody>
 41  <tr>
 42   <td>
 43    doc id
 44   </td>
 45   <td>
 46    ecode
 47   </td>
 48   <td>
 49    url len
 50   </td>
 51   <td>
 52    page len
 53   </td>
 54   <td>
 55    url
 56   </td>
 57   <td>
 58    page
 59   </td>
 60  </tr>
 61 </tbody>
 62</table>
 63
 64A Lexicon stores the repository of words, implemented with a hashtable over pointers linking to the barrels (sorted linked lists) of the Inverted Index.
 65
 66<table class="">
 67 <tbody>
 68  <tr>
 69   <td>
 70    word id
 71   </td>
 72   <td>
 73    n docs
 74   </td>
 75  </tr>
 76  <tr>
 77   <td>
 78    word id
 79   </td>
 80   <td>
 81    n docs
 82   </td>
 83  </tr>
 84 </tbody>
 85</table>
 86
 87The Hit Lists store occurences of a word in a document.
 88
 89<table class="">
 90 <tbody>
 91  <tr>
 92   <td>
 93    <strong>
 94     plain
 95    </strong>
 96   </td>
 97   <td>
 98    cap: 1
 99   </td>
100   <td>
101    imp: 3
102   </td>
103   <td>
104    pos: 12
105   </td>
106  </tr>
107  <tr>
108   <td>
109    <strong>
110     fancy
111    </strong>
112   </td>
113   <td>
114    cap: 1
115   </td>
116   <td>
117    imp: 7
118   </td>
119   <td>
120    type: 4
121   </td>
122   <td>
123    pos: 8
124   </td>
125  </tr>
126  <tr>
127   <td>
128    <strong>
129     anchor
130    </strong>
131   </td>
132   <td>
133    cap: 1
134   </td>
135   <td>
136    imp: 7
137   </td>
138   <td>
139    type: 4
140   </td>
141   <td>
142    hash: 4
143   </td>
144   <td>
145    pos: 8
146   </td>
147  </tr>
148 </tbody>
149</table>
150
151The Forward Index is a barrel with a range of word identifiers (document identifier and list of word identifiers).
152
153<table class="">
154 <tbody>
155  <tr>
156   <td rowspan="3">
157    doc id
158   </td>
159   <td>
160    word id: 24
161   </td>
162   <td>
163    n hits: 8
164   </td>
165   <td>
166    hit hit hit hit hit hit hit hit
167   </td>
168  </tr>
169  <tr>
170   <td>
171    word id: 24
172   </td>
173   <td>
174    n hits: 8
175   </td>
176   <td>
177    hit hit hit hit hit hit hit hit
178   </td>
179  </tr>
180  <tr>
181   <td>
182    null word id
183   </td>
184  </tr>
185 </tbody>
186</table>
187
188The Inverted Index can be sorted by either document identifier or by ranking of word occurence.
189
190<table class="">
191 <tbody>
192  <tr>
193   <td>
194    doc id: 23
195   </td>
196   <td>
197    n hits: 5
198   </td>
199   <td>
200    hit hit hit hit hit
201   </td>
202  </tr>
203  <tr>
204   <td>
205    doc id: 23
206   </td>
207   <td>
208    n hits: 3
209   </td>
210   <td>
211    hit hit hit
212   </td>
213  </tr>
214  <tr>
215   <td>
216    doc id: 23
217   </td>
218   <td>
219    n hits: 4
220   </td>
221   <td>
222    hit hit hit hit
223   </td>
224  </tr>
225  <tr>
226   <td>
227    doc id: 23
228   </td>
229   <td>
230    n hits: 2
231   </td>
232   <td>
233    hit hit
234   </td>
235  </tr>
236 </tbody>
237</table>
238
239Back in 1998, Google compressed its repository to 53GB and had 24 million pages. The indices, lexicon, and other temporary storage required about 55GB.
240
241## Crawling
242
243The crawling must be reliable, fast and robust, and also respect the decision of some authors not wanting their pages crawled. Originally, it took a week or more, so simultaneous execution became a must.
244
245Back in 1998, Google had between 3 and 4 crawlers running at 100 web pages per second maximum. These were implemented in Python.
246
247The crawled pages need parsing to deal with typos or formatting issues.
248
249## Indexing
250
251Indexing is about putting the pages into barrels, converting words into word identifiers, and occurences into hit lists.
252
253Once indexing is done, sorting of the barrels happens to have them ordered by word identifier, producing the inverted index. This process also had to be done in parallel over many machines, or would otherwise have been too slow.
254
255## Searching
256
257We need to find quality results efficiently. Plenty of weights are considered nowadays, but at its heart, PageRank is used. It is the algorithm they use to map the web, which is formally defined as follows:
258
259![](8e1e61b119e107fcb4bdd7e78f649985.png)
260_PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))_
261
262Where:
263
264* `A` is a given page
265* `T<sub>n</sub>` are pages that point to A
266* `d` is the damping factor in the range `[0, 1]` (often 0.85)
267* `C(A)` is the number of links going out of page `A`
268* `PR(A)` is the page rank of page `A`
269This formula indicates the probability that a random surfer visits a certain page, and `1 - d` is used to indicate when it will «get bored» and stop surfing. More intuitively, the page rank of a page will grow as more pages link to it, or the few that link to it have high page rank.
270
271The anchor text in the links also help provide a better description and helps indexing for even better results.
272
273While searching, the concern is disk I/O which takes up most of the time. Caching is very important to improve performance up to 30 times.
274
275Now, in order to turn user queries into something we can search, we must parse the query and convert the words into word identifiers.
276
277## Conclusion
278
279Google is designed to be a efficient, scalable, high-quality search engine. There are still bottlenecks in CPU, memory, disk speed and network I/O, but major data structures are used to make efficient use of the resources.
280
281## References
282
283* [The anatomy of a large-scale hypertextual Web search engine](https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf)
284* [The Anatomy of a Large-Scale Hypertextual Web Search Engine (slides)](https://www.site.uottawa.ca/~diana/csi4107/Google_SearchEngine.pdf)