git rekt — gemini-redirect.git (292825669970b7c009ac202b096f9a6960f7d265): content/blog/ribw/about-boolean-retrieval/index.md

content/blog/ribw/about-boolean-retrieval/index.md (view raw)
  1+++
  2title = "About Boolean Retrieval"
  3date = 2020-02-25T00:00:29+00:00
  4updated = 2020-03-18T09:38:02+00:00
  5+++
  6
  7This entry will discuss the section on the _[Boolean retrieval](https://nlp.stanford.edu/IR-book/pdf/01bool.pdf)_ section of the book _[An Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf)_.
  8
  9## Summary on the topic
 10
 11Boolean retrieval is one of the many ways information retrieval (finding materials that satisfy an information need), often simply called _search_.
 12
 13A simple way to retrieve information is to _grep_ through the text (term named after the Unix tool `grep`), scanning text linearly and excluding it on certain criteria. However, this falls short when the volume of the data grows, more complex queries are desired, or one seeks some sort of ranking.
 14
 15To avoid linear scanning, we build an _index_ and record for each document whether it contains each term out of our full dictionary of terms (which may be words in a chapter and words in the book). This results in a binary term-document _incidence matrix_. Such a possible matrix is:
 16
 17<table class="">
 18 <tbody>
 19  <tr>
 20   <td>
 21    <em>
 22     word/play
 23    </em>
 24   </td>
 25   <td>
 26    <strong>
 27     Antony and Cleopatra
 28    </strong>
 29   </td>
 30   <td>
 31    <strong>
 32     Julius Caesar
 33    </strong>
 34   </td>
 35   <td>
 36    <strong>
 37     The Tempest
 38    </strong>
 39   </td>
 40   <td>
 41    <strong>
 42     …
 43    </strong>
 44   </td>
 45  </tr>
 46  <tr>
 47   <td>
 48    <strong>
 49     Antony
 50    </strong>
 51   </td>
 52   <td>
 53    1
 54   </td>
 55   <td>
 56    1
 57   </td>
 58   <td>
 59    0
 60   </td>
 61   <td>
 62   </td>
 63  </tr>
 64  <tr>
 65   <td>
 66    <strong>
 67     Brutus
 68    </strong>
 69   </td>
 70   <td>
 71    1
 72   </td>
 73   <td>
 74    1
 75   </td>
 76   <td>
 77    0
 78   </td>
 79   <td>
 80   </td>
 81  </tr>
 82  <tr>
 83   <td>
 84    <strong>
 85     Caesar
 86    </strong>
 87   </td>
 88   <td>
 89    1
 90   </td>
 91   <td>
 92    1
 93   </td>
 94   <td>
 95    0
 96   </td>
 97   <td>
 98   </td>
 99  </tr>
100  <tr>
101   <td>
102    <strong>
103     Calpurnia
104    </strong>
105   </td>
106   <td>
107    0
108   </td>
109   <td>
110    1
111   </td>
112   <td>
113    0
114   </td>
115   <td>
116   </td>
117  </tr>
118  <tr>
119   <td>
120    <strong>
121     Cleopatra
122    </strong>
123   </td>
124   <td>
125    1
126   </td>
127   <td>
128    0
129   </td>
130   <td>
131    0
132   </td>
133   <td>
134   </td>
135  </tr>
136  <tr>
137   <td>
138    <strong>
139     mercy
140    </strong>
141   </td>
142   <td>
143    1
144   </td>
145   <td>
146    0
147   </td>
148   <td>
149    1
150   </td>
151   <td>
152   </td>
153  </tr>
154  <tr>
155   <td>
156    <strong>
157     worser
158    </strong>
159   </td>
160   <td>
161    1
162   </td>
163   <td>
164    0
165   </td>
166   <td>
167    1
168   </td>
169   <td>
170   </td>
171  </tr>
172  <tr>
173   <td>
174    <strong>
175     …
176    </strong>
177   </td>
178   <td>
179   </td>
180   <td>
181   </td>
182   <td>
183   </td>
184   <td>
185   </td>
186  </tr>
187 </tbody>
188</table>
189
190We can look at this matrix’s rows or columns to obtain a vector for each term indicating where it appears, or a vector for each document indicating the terms it contains.
191
192Now, answering a query such as `Brutus AND Caesar AND NOT Calpurnia` becomes trivial:
193
194```
195VECTOR(Brutus) AND VECTOR(Caesar) AND COMPLEMENT(VECTOR(Calpurnia))
196= 110 AND 110 AND COMPLEMENT(010)
197= 110 AND 110 AND 101
198= 100
199```
200
201The query is only satisfied for our first column.
202
203The _Boolean retrieval model_ is thus a model that treats documents as a set of terms, in which we can perform any query in the form of Boolean expressions of terms, combined with `OR`, `AND`, and `NOT`.
204
205Now, building such a matrix is often not feasible due to the sheer amount of data (say, a matrix with 500,000 terms across 1,000,000 documents, each with roughly 1,000 terms). However, it is important to notice that most of the terms will be _missing_ when examining each document. In our example, this means 99.8% or more of the cells will be 0. We can instead record the _positions_ of the 1’s. This is known as an _inverted index_.
206
207The inverted index is a dictionary of terms, each containing a list that records in which documents it appears (_postings_). Applied to boolean retrieval, we would:
208
2091. Collects the documents to be indexed, assign a unique identifier each
2102. Tokenize the text in the documents into a list of terms
2113. Normalize the tokens, which now become indexing terms
2124. Index the documents
213
214<table class="">
215 <tbody>
216  <tr>
217   <td>
218    <strong>
219     Dictionary
220    </strong>
221   </td>
222   <td>
223    <strong>
224     Postings
225    </strong>
226   </td>
227  </tr>
228  <tr>
229   <td>
230    Brutus
231   </td>
232   <td>
233    1, 2, 4, 11, 31, 45, 173, 174
234   </td>
235  </tr>
236  <tr>
237   <td>
238    Caesar
239   </td>
240   <td>
241    1, 2, 4, 5, 6, 16, 57, 132, …
242   </td>
243  </tr>
244  <tr>
245   <td>
246    Calpurnia
247   </td>
248   <td>
249    2, 31, 54, 101
250   </td>
251  </tr>
252  <tr>
253   <td>
254    …
255   </td>
256   <td>
257   </td>
258  </tr>
259 </tbody>
260</table>
261
262Sort the pairs `(term, document_id)` so that the terms are alphabetical, and merge multiple occurences into one. Group instances of the same term and split again into a sorted list of postings.
263
264<table class="">
265 <tbody>
266  <tr>
267   <td>
268    <strong>
269     term
270    </strong>
271   </td>
272   <td>
273    <strong>
274     document_id
275    </strong>
276   </td>
277  </tr>
278  <tr>
279   <td>
280    I
281   </td>
282   <td>
283    1
284   </td>
285  </tr>
286  <tr>
287   <td>
288    did
289   </td>
290   <td>
291    1
292   </td>
293  </tr>
294  <tr>
295   <td>
296    …
297   </td>
298   <td>
299   </td>
300  </tr>
301  <tr>
302   <td>
303    with
304   </td>
305   <td>
306    2
307   </td>
308  </tr>
309 </tbody>
310</table>
311
312<table class="">
313 <tbody>
314  <tr>
315   <td>
316    <strong>
317     term
318    </strong>
319   </td>
320   <td>
321    <strong>
322     document_id
323    </strong>
324   </td>
325  </tr>
326  <tr>
327   <td>
328    be
329   </td>
330   <td>
331    2
332   </td>
333  </tr>
334  <tr>
335   <td>
336    brutus
337   </td>
338   <td>
339    1
340   </td>
341  </tr>
342  <tr>
343   <td>
344    brutus
345   </td>
346   <td>
347    2
348   </td>
349  </tr>
350  <tr>
351   <td>
352    …
353   </td>
354   <td>
355   </td>
356  </tr>
357 </tbody>
358</table>
359
360<table class="">
361 <tbody>
362  <tr>
363   <td>
364    <strong>
365     term
366    </strong>
367   </td>
368   <td>
369    <strong>
370     frequency
371    </strong>
372   </td>
373   <td>
374    <strong>
375     postings list
376    </strong>
377   </td>
378  </tr>
379  <tr>
380   <td>
381    be
382   </td>
383   <td>
384    1
385   </td>
386   <td>
387    2
388   </td>
389  </tr>
390  <tr>
391   <td>
392    brutus
393   </td>
394   <td>
395    2
396   </td>
397   <td>
398    1, 2
399   </td>
400  </tr>
401  <tr>
402   <td>
403    capitol
404   </td>
405   <td>
406    1
407   </td>
408   <td>
409    1
410   </td>
411  </tr>
412  <tr>
413   <td>
414    …
415   </td>
416   <td>
417   </td>
418   <td>
419   </td>
420  </tr>
421 </tbody>
422</table>
423
424Intersecting posting lists now becomes of transversing both lists in order:
425
426```
427Brutus   : 1 -> 2 -> 4 -> 11 -> 31 -> 45 ->           173 -> 174
428Calpurnia:      2 ->            31 ->       54 -> 101
429Intersect:      2 ->            31
430```
431
432A simple conjunctive query (e.g. `Brutus AND Calpurnia`) is executed as follows:
433
4341. Locate `Brutus` in the dictionary
4352. Retrieve its postings
4363. Locate `Calpurnia` in the dictionary
4374. Retrieve its postings
4385. Intersect (_merge_) both postings
439
440Since the lists are sorted, walking both of them can be done in _O(n)_ time. By also storing the frequency, we can optimize the order in which we execute arbitrary queries, although we won’t go into detail.
441
442## Thoughts
443
444The boolean retrieval model can be implemented with relative ease, and can help with storage and efficient querying of the information if we intend to perform boolean queries.
445
446However, the basic design lacks other useful operations, such as a «near» operator, or the ability to rank the results.
447
448All in all, it’s an interesting way to look at the data and query it efficiently.