content/blog/ribw/about-boolean-retrieval/index.md (view raw)
1+++
2title = "About Boolean Retrieval"
3date = 2020-02-25T00:00:29+00:00
4updated = 2020-03-18T09:38:02+00:00
5+++
6
7This entry will discuss the section on the _[Boolean retrieval](https://nlp.stanford.edu/IR-book/pdf/01bool.pdf)_ section of the book _[An Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf)_.
8
9## Summary on the topic
10
11Boolean retrieval is one of the many ways information retrieval (finding materials that satisfy an information need), often simply called _search_.
12
13A simple way to retrieve information is to _grep_ through the text (term named after the Unix tool `grep`), scanning text linearly and excluding it on certain criteria. However, this falls short when the volume of the data grows, more complex queries are desired, or one seeks some sort of ranking.
14
15To avoid linear scanning, we build an _index_ and record for each document whether it contains each term out of our full dictionary of terms (which may be words in a chapter and words in the book). This results in a binary term-document _incidence matrix_. Such a possible matrix is:
16
17<table class="">
18 <tbody>
19 <tr>
20 <td>
21 <em>
22 word/play
23 </em>
24 </td>
25 <td>
26 <strong>
27 Antony and Cleopatra
28 </strong>
29 </td>
30 <td>
31 <strong>
32 Julius Caesar
33 </strong>
34 </td>
35 <td>
36 <strong>
37 The Tempest
38 </strong>
39 </td>
40 <td>
41 <strong>
42 …
43 </strong>
44 </td>
45 </tr>
46 <tr>
47 <td>
48 <strong>
49 Antony
50 </strong>
51 </td>
52 <td>
53 1
54 </td>
55 <td>
56 1
57 </td>
58 <td>
59 0
60 </td>
61 <td>
62 </td>
63 </tr>
64 <tr>
65 <td>
66 <strong>
67 Brutus
68 </strong>
69 </td>
70 <td>
71 1
72 </td>
73 <td>
74 1
75 </td>
76 <td>
77 0
78 </td>
79 <td>
80 </td>
81 </tr>
82 <tr>
83 <td>
84 <strong>
85 Caesar
86 </strong>
87 </td>
88 <td>
89 1
90 </td>
91 <td>
92 1
93 </td>
94 <td>
95 0
96 </td>
97 <td>
98 </td>
99 </tr>
100 <tr>
101 <td>
102 <strong>
103 Calpurnia
104 </strong>
105 </td>
106 <td>
107 0
108 </td>
109 <td>
110 1
111 </td>
112 <td>
113 0
114 </td>
115 <td>
116 </td>
117 </tr>
118 <tr>
119 <td>
120 <strong>
121 Cleopatra
122 </strong>
123 </td>
124 <td>
125 1
126 </td>
127 <td>
128 0
129 </td>
130 <td>
131 0
132 </td>
133 <td>
134 </td>
135 </tr>
136 <tr>
137 <td>
138 <strong>
139 mercy
140 </strong>
141 </td>
142 <td>
143 1
144 </td>
145 <td>
146 0
147 </td>
148 <td>
149 1
150 </td>
151 <td>
152 </td>
153 </tr>
154 <tr>
155 <td>
156 <strong>
157 worser
158 </strong>
159 </td>
160 <td>
161 1
162 </td>
163 <td>
164 0
165 </td>
166 <td>
167 1
168 </td>
169 <td>
170 </td>
171 </tr>
172 <tr>
173 <td>
174 <strong>
175 …
176 </strong>
177 </td>
178 <td>
179 </td>
180 <td>
181 </td>
182 <td>
183 </td>
184 <td>
185 </td>
186 </tr>
187 </tbody>
188</table>
189
190We can look at this matrix’s rows or columns to obtain a vector for each term indicating where it appears, or a vector for each document indicating the terms it contains.
191
192Now, answering a query such as `Brutus AND Caesar AND NOT Calpurnia` becomes trivial:
193
194```
195VECTOR(Brutus) AND VECTOR(Caesar) AND COMPLEMENT(VECTOR(Calpurnia))
196= 110 AND 110 AND COMPLEMENT(010)
197= 110 AND 110 AND 101
198= 100
199```
200
201The query is only satisfied for our first column.
202
203The _Boolean retrieval model_ is thus a model that treats documents as a set of terms, in which we can perform any query in the form of Boolean expressions of terms, combined with `OR`, `AND`, and `NOT`.
204
205Now, building such a matrix is often not feasible due to the sheer amount of data (say, a matrix with 500,000 terms across 1,000,000 documents, each with roughly 1,000 terms). However, it is important to notice that most of the terms will be _missing_ when examining each document. In our example, this means 99.8% or more of the cells will be 0. We can instead record the _positions_ of the 1’s. This is known as an _inverted index_.
206
207The inverted index is a dictionary of terms, each containing a list that records in which documents it appears (_postings_). Applied to boolean retrieval, we would:
208
2091. Collects the documents to be indexed, assign a unique identifier each
2102. Tokenize the text in the documents into a list of terms
2113. Normalize the tokens, which now become indexing terms
2124. Index the documents
213
214<table class="">
215 <tbody>
216 <tr>
217 <td>
218 <strong>
219 Dictionary
220 </strong>
221 </td>
222 <td>
223 <strong>
224 Postings
225 </strong>
226 </td>
227 </tr>
228 <tr>
229 <td>
230 Brutus
231 </td>
232 <td>
233 1, 2, 4, 11, 31, 45, 173, 174
234 </td>
235 </tr>
236 <tr>
237 <td>
238 Caesar
239 </td>
240 <td>
241 1, 2, 4, 5, 6, 16, 57, 132, …
242 </td>
243 </tr>
244 <tr>
245 <td>
246 Calpurnia
247 </td>
248 <td>
249 2, 31, 54, 101
250 </td>
251 </tr>
252 <tr>
253 <td>
254 …
255 </td>
256 <td>
257 </td>
258 </tr>
259 </tbody>
260</table>
261
262Sort the pairs `(term, document_id)` so that the terms are alphabetical, and merge multiple occurences into one. Group instances of the same term and split again into a sorted list of postings.
263
264<table class="">
265 <tbody>
266 <tr>
267 <td>
268 <strong>
269 term
270 </strong>
271 </td>
272 <td>
273 <strong>
274 document_id
275 </strong>
276 </td>
277 </tr>
278 <tr>
279 <td>
280 I
281 </td>
282 <td>
283 1
284 </td>
285 </tr>
286 <tr>
287 <td>
288 did
289 </td>
290 <td>
291 1
292 </td>
293 </tr>
294 <tr>
295 <td>
296 …
297 </td>
298 <td>
299 </td>
300 </tr>
301 <tr>
302 <td>
303 with
304 </td>
305 <td>
306 2
307 </td>
308 </tr>
309 </tbody>
310</table>
311
312<table class="">
313 <tbody>
314 <tr>
315 <td>
316 <strong>
317 term
318 </strong>
319 </td>
320 <td>
321 <strong>
322 document_id
323 </strong>
324 </td>
325 </tr>
326 <tr>
327 <td>
328 be
329 </td>
330 <td>
331 2
332 </td>
333 </tr>
334 <tr>
335 <td>
336 brutus
337 </td>
338 <td>
339 1
340 </td>
341 </tr>
342 <tr>
343 <td>
344 brutus
345 </td>
346 <td>
347 2
348 </td>
349 </tr>
350 <tr>
351 <td>
352 …
353 </td>
354 <td>
355 </td>
356 </tr>
357 </tbody>
358</table>
359
360<table class="">
361 <tbody>
362 <tr>
363 <td>
364 <strong>
365 term
366 </strong>
367 </td>
368 <td>
369 <strong>
370 frequency
371 </strong>
372 </td>
373 <td>
374 <strong>
375 postings list
376 </strong>
377 </td>
378 </tr>
379 <tr>
380 <td>
381 be
382 </td>
383 <td>
384 1
385 </td>
386 <td>
387 2
388 </td>
389 </tr>
390 <tr>
391 <td>
392 brutus
393 </td>
394 <td>
395 2
396 </td>
397 <td>
398 1, 2
399 </td>
400 </tr>
401 <tr>
402 <td>
403 capitol
404 </td>
405 <td>
406 1
407 </td>
408 <td>
409 1
410 </td>
411 </tr>
412 <tr>
413 <td>
414 …
415 </td>
416 <td>
417 </td>
418 <td>
419 </td>
420 </tr>
421 </tbody>
422</table>
423
424Intersecting posting lists now becomes of transversing both lists in order:
425
426```
427Brutus : 1 -> 2 -> 4 -> 11 -> 31 -> 45 -> 173 -> 174
428Calpurnia: 2 -> 31 -> 54 -> 101
429Intersect: 2 -> 31
430```
431
432A simple conjunctive query (e.g. `Brutus AND Calpurnia`) is executed as follows:
433
4341. Locate `Brutus` in the dictionary
4352. Retrieve its postings
4363. Locate `Calpurnia` in the dictionary
4374. Retrieve its postings
4385. Intersect (_merge_) both postings
439
440Since the lists are sorted, walking both of them can be done in _O(n)_ time. By also storing the frequency, we can optimize the order in which we execute arbitrary queries, although we won’t go into detail.
441
442## Thoughts
443
444The boolean retrieval model can be implemented with relative ease, and can help with storage and efficient querying of the information if we intend to perform boolean queries.
445
446However, the basic design lacks other useful operations, such as a «near» operator, or the ability to rank the results.
447
448All in all, it’s an interesting way to look at the data and query it efficiently.