content/blog/ribw/how-does-googles-search-engine-work/index.md (view raw)
1```meta
2title: How does Google’s Search Engine work?
3published: 2020-03-18T01:00:00+00:00
4updated: 2020-03-28T10:17:09+00:00
5```
6
7The original implementation was written in C/++ for Linux/Solaris.
8
9There are three major components in the system’s anatomy, which can be thought as steps to be performed for Google to be what it is today.
10
11![](image-1024x649.png)
12
13But before we talk about the different components, let’s take a look at how they store all of this information.
14
15## Data structures
16
17A «BigFile» is a virtual file addressable by 64 bits.
18
19There exists a repository with the full HTML of every page compressed, along with a document identifier, length and URL.
20
21<table class="">
22 <tbody>
23 <tr>
24 <td>
25 sync
26 </td>
27 <td>
28 length
29 </td>
30 <td>
31 compressed packet
32 </td>
33 </tr>
34 </tbody>
35</table>
36
37The Document Index has the document identifier, a pointer into the repository, a checksum and various other statistics.
38
39<table class="">
40 <tbody>
41 <tr>
42 <td>
43 doc id
44 </td>
45 <td>
46 ecode
47 </td>
48 <td>
49 url len
50 </td>
51 <td>
52 page len
53 </td>
54 <td>
55 url
56 </td>
57 <td>
58 page
59 </td>
60 </tr>
61 </tbody>
62</table>
63
64A Lexicon stores the repository of words, implemented with a hashtable over pointers linking to the barrels (sorted linked lists) of the Inverted Index.
65
66<table class="">
67 <tbody>
68 <tr>
69 <td>
70 word id
71 </td>
72 <td>
73 n docs
74 </td>
75 </tr>
76 <tr>
77 <td>
78 word id
79 </td>
80 <td>
81 n docs
82 </td>
83 </tr>
84 </tbody>
85</table>
86
87The Hit Lists store occurences of a word in a document.
88
89<table class="">
90 <tbody>
91 <tr>
92 <td>
93 <strong>
94 plain
95 </strong>
96 </td>
97 <td>
98 cap: 1
99 </td>
100 <td>
101 imp: 3
102 </td>
103 <td>
104 pos: 12
105 </td>
106 </tr>
107 <tr>
108 <td>
109 <strong>
110 fancy
111 </strong>
112 </td>
113 <td>
114 cap: 1
115 </td>
116 <td>
117 imp: 7
118 </td>
119 <td>
120 type: 4
121 </td>
122 <td>
123 pos: 8
124 </td>
125 </tr>
126 <tr>
127 <td>
128 <strong>
129 anchor
130 </strong>
131 </td>
132 <td>
133 cap: 1
134 </td>
135 <td>
136 imp: 7
137 </td>
138 <td>
139 type: 4
140 </td>
141 <td>
142 hash: 4
143 </td>
144 <td>
145 pos: 8
146 </td>
147 </tr>
148 </tbody>
149</table>
150
151The Forward Index is a barrel with a range of word identifiers (document identifier and list of word identifiers).
152
153<table class="">
154 <tbody>
155 <tr>
156 <td rowspan="3">
157 doc id
158 </td>
159 <td>
160 word id: 24
161 </td>
162 <td>
163 n hits: 8
164 </td>
165 <td>
166 hit hit hit hit hit hit hit hit
167 </td>
168 </tr>
169 <tr>
170 <td>
171 word id: 24
172 </td>
173 <td>
174 n hits: 8
175 </td>
176 <td>
177 hit hit hit hit hit hit hit hit
178 </td>
179 </tr>
180 <tr>
181 <td>
182 null word id
183 </td>
184 </tr>
185 </tbody>
186</table>
187
188The Inverted Index can be sorted by either document identifier or by ranking of word occurence.
189
190<table class="">
191 <tbody>
192 <tr>
193 <td>
194 doc id: 23
195 </td>
196 <td>
197 n hits: 5
198 </td>
199 <td>
200 hit hit hit hit hit
201 </td>
202 </tr>
203 <tr>
204 <td>
205 doc id: 23
206 </td>
207 <td>
208 n hits: 3
209 </td>
210 <td>
211 hit hit hit
212 </td>
213 </tr>
214 <tr>
215 <td>
216 doc id: 23
217 </td>
218 <td>
219 n hits: 4
220 </td>
221 <td>
222 hit hit hit hit
223 </td>
224 </tr>
225 <tr>
226 <td>
227 doc id: 23
228 </td>
229 <td>
230 n hits: 2
231 </td>
232 <td>
233 hit hit
234 </td>
235 </tr>
236 </tbody>
237</table>
238
239Back in 1998, Google compressed its repository to 53GB and had 24 million pages. The indices, lexicon, and other temporary storage required about 55GB.
240
241## Crawling
242
243The crawling must be reliable, fast and robust, and also respect the decision of some authors not wanting their pages crawled. Originally, it took a week or more, so simultaneous execution became a must.
244
245Back in 1998, Google had between 3 and 4 crawlers running at 100 web pages per second maximum. These were implemented in Python.
246
247The crawled pages need parsing to deal with typos or formatting issues.
248
249## Indexing
250
251Indexing is about putting the pages into barrels, converting words into word identifiers, and occurences into hit lists.
252
253Once indexing is done, sorting of the barrels happens to have them ordered by word identifier, producing the inverted index. This process also had to be done in parallel over many machines, or would otherwise have been too slow.
254
255## Searching
256
257We need to find quality results efficiently. Plenty of weights are considered nowadays, but at its heart, PageRank is used. It is the algorithm they use to map the web, which is formally defined as follows:
258
259![](8e1e61b119e107fcb4bdd7e78f649985.png)
260_PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))_
261
262Where:
263
264* `A` is a given page
265* `T<sub>n</sub>` are pages that point to A
266* `d` is the damping factor in the range `[0, 1]` (often 0.85)
267* `C(A)` is the number of links going out of page `A`
268* `PR(A)` is the page rank of page `A`
269This formula indicates the probability that a random surfer visits a certain page, and `1 - d` is used to indicate when it will «get bored» and stop surfing. More intuitively, the page rank of a page will grow as more pages link to it, or the few that link to it have high page rank.
270
271The anchor text in the links also help provide a better description and helps indexing for even better results.
272
273While searching, the concern is disk I/O which takes up most of the time. Caching is very important to improve performance up to 30 times.
274
275Now, in order to turn user queries into something we can search, we must parse the query and convert the words into word identifiers.
276
277## Conclusion
278
279Google is designed to be a efficient, scalable, high-quality search engine. There are still bottlenecks in CPU, memory, disk speed and network I/O, but major data structures are used to make efficient use of the resources.
280
281## References
282
283* [The anatomy of a large-scale hypertextual Web search engine](https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf)
284* [The Anatomy of a Large-Scale Hypertextual Web Search Engine (slides)](https://www.site.uottawa.ca/~diana/csi4107/Google_SearchEngine.pdf)