git rekt — gemini-redirect.git (299dad40a464132c708bff0aa139c6ddebc55409): blog/ribw/about-boolean-retrieval/index.html

blog/ribw/about-boolean-retrieval/index.html (view raw)
  1<!DOCTYPE html>
  2<html>
  3<head>
  4<meta charset="utf-8" />
  5<meta name="viewport" content="width=device-width, initial-scale=1" />
  6<title>About Boolean Retrieval</title>
  7<link rel="stylesheet" href="../css/style.css">
  8</head>
  9<body>
 10<main>
 11<p>This entry will discuss the section on the <em><a href="https://nlp.stanford.edu/IR-book/pdf/01bool.pdf">Boolean retrieval</a></em> section of the book <em><a href="https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf">An Introduction to Information Retrieval</a></em>.</p>
 12<div class="date-created-modified">Created 2020-02-25<br>
 13Modified 2020-03-18</div>
 14<h2 class="title" id="summary_on_the_topic"><a class="anchor" href="#summary_on_the_topic">¶</a>Summary on the topic</h2>
 15<p>Boolean retrieval is one of the many ways information retrieval (finding materials that satisfy an information need), often simply called <em>search</em>.</p>
 16<p>A simple way to retrieve information is to <em>grep</em> through the text (term named after the Unix tool <code>grep</code>), scanning text linearly and excluding it on certain criteria. However, this falls short when the volume of the data grows, more complex queries are desired, or one seeks some sort of ranking.</p>
 17<p>To avoid linear scanning, we build an <em>index</em> and record for each document whether it contains each term out of our full dictionary of terms (which may be words in a chapter and words in the book). This results in a binary term-document <em>incidence matrix</em>. Such a possible matrix is:</p>
 18<table class="">
 19 <tbody>
 20  <tr>
 21   <td>
 22    <em>
 23     word/play
 24    </em>
 25   </td>
 26   <td>
 27    <strong>
 28     Antony and Cleopatra
 29    </strong>
 30   </td>
 31   <td>
 32    <strong>
 33     Julius Caesar
 34    </strong>
 35   </td>
 36   <td>
 37    <strong>
 38     The Tempest
 39    </strong>
 40   </td>
 41   <td>
 42    <strong>
 43     …
 44    </strong>
 45   </td>
 46  </tr>
 47  <tr>
 48   <td>
 49    <strong>
 50     Antony
 51    </strong>
 52   </td>
 53   <td>
 54    1
 55   </td>
 56   <td>
 57    1
 58   </td>
 59   <td>
 60    0
 61   </td>
 62   <td>
 63   </td>
 64  </tr>
 65  <tr>
 66   <td>
 67    <strong>
 68     Brutus
 69    </strong>
 70   </td>
 71   <td>
 72    1
 73   </td>
 74   <td>
 75    1
 76   </td>
 77   <td>
 78    0
 79   </td>
 80   <td>
 81   </td>
 82  </tr>
 83  <tr>
 84   <td>
 85    <strong>
 86     Caesar
 87    </strong>
 88   </td>
 89   <td>
 90    1
 91   </td>
 92   <td>
 93    1
 94   </td>
 95   <td>
 96    0
 97   </td>
 98   <td>
 99   </td>
100  </tr>
101  <tr>
102   <td>
103    <strong>
104     Calpurnia
105    </strong>
106   </td>
107   <td>
108    0
109   </td>
110   <td>
111    1
112   </td>
113   <td>
114    0
115   </td>
116   <td>
117   </td>
118  </tr>
119  <tr>
120   <td>
121    <strong>
122     Cleopatra
123    </strong>
124   </td>
125   <td>
126    1
127   </td>
128   <td>
129    0
130   </td>
131   <td>
132    0
133   </td>
134   <td>
135   </td>
136  </tr>
137  <tr>
138   <td>
139    <strong>
140     mercy
141    </strong>
142   </td>
143   <td>
144    1
145   </td>
146   <td>
147    0
148   </td>
149   <td>
150    1
151   </td>
152   <td>
153   </td>
154  </tr>
155  <tr>
156   <td>
157    <strong>
158     worser
159    </strong>
160   </td>
161   <td>
162    1
163   </td>
164   <td>
165    0
166   </td>
167   <td>
168    1
169   </td>
170   <td>
171   </td>
172  </tr>
173  <tr>
174   <td>
175    <strong>
176     …
177    </strong>
178   </td>
179   <td>
180   </td>
181   <td>
182   </td>
183   <td>
184   </td>
185   <td>
186   </td>
187  </tr>
188 </tbody>
189</table>
190<p>We can look at this matrix’s rows or columns to obtain a vector for each term indicating where it appears, or a vector for each document indicating the terms it contains.</p>
191<p>Now, answering a query such as <code>Brutus AND Caesar AND NOT Calpurnia</code> becomes trivial:</p>
192<pre><code>VECTOR(Brutus) AND VECTOR(Caesar) AND COMPLEMENT(VECTOR(Calpurnia))
193= 110 AND 110 AND COMPLEMENT(010)
194= 110 AND 110 AND 101
195= 100
196</code></pre>
197<p>The query is only satisfied for our first column.</p>
198<p>The <em>Boolean retrieval model</em> is thus a model that treats documents as a set of terms, in which we can perform any query in the form of Boolean expressions of terms, combined with <code>OR</code>, <code>AND</code>, and <code>NOT</code>.</p>
199<p>Now, building such a matrix is often not feasible due to the sheer amount of data (say, a matrix with 500,000 terms across 1,000,000 documents, each with roughly 1,000 terms). However, it is important to notice that most of the terms will be <em>missing</em> when examining each document. In our example, this means 99.8% or more of the cells will be 0. We can instead record the <em>positions</em> of the 1’s. This is known as an <em>inverted index</em>.</p>
200<p>The inverted index is a dictionary of terms, each containing a list that records in which documents it appears (<em>postings</em>). Applied to boolean retrieval, we would:</p>
201<ol>
202<li>Collects the documents to be indexed, assign a unique identifier each</li>
203<li>Tokenize the text in the documents into a list of terms</li>
204<li>Normalize the tokens, which now become indexing terms</li>
205<li>Index the documents</li>
206</ol>
207<table class="">
208 <tbody>
209  <tr>
210   <td>
211    <strong>
212     Dictionary
213    </strong>
214   </td>
215   <td>
216    <strong>
217     Postings
218    </strong>
219   </td>
220  </tr>
221  <tr>
222   <td>
223    Brutus
224   </td>
225   <td>
226    1, 2, 4, 11, 31, 45, 173, 174
227   </td>
228  </tr>
229  <tr>
230   <td>
231    Caesar
232   </td>
233   <td>
234    1, 2, 4, 5, 6, 16, 57, 132, …
235   </td>
236  </tr>
237  <tr>
238   <td>
239    Calpurnia
240   </td>
241   <td>
242    2, 31, 54, 101
243   </td>
244  </tr>
245  <tr>
246   <td>
247    …
248   </td>
249   <td>
250   </td>
251  </tr>
252 </tbody>
253</table>
254<p>Sort the pairs <code>(term, document_id)</code> so that the terms are alphabetical, and merge multiple occurences into one. Group instances of the same term and split again into a sorted list of postings.</p>
255<table class="">
256 <tbody>
257  <tr>
258   <td>
259    <strong>
260     term
261    </strong>
262   </td>
263   <td>
264    <strong>
265     document_id
266    </strong>
267   </td>
268  </tr>
269  <tr>
270   <td>
271    I
272   </td>
273   <td>
274    1
275   </td>
276  </tr>
277  <tr>
278   <td>
279    did
280   </td>
281   <td>
282    1
283   </td>
284  </tr>
285  <tr>
286   <td>
287    …
288   </td>
289   <td>
290   </td>
291  </tr>
292  <tr>
293   <td>
294    with
295   </td>
296   <td>
297    2
298   </td>
299  </tr>
300 </tbody>
301</table>
302<table class="">
303 <tbody>
304  <tr>
305   <td>
306    <strong>
307     term
308    </strong>
309   </td>
310   <td>
311    <strong>
312     document_id
313    </strong>
314   </td>
315  </tr>
316  <tr>
317   <td>
318    be
319   </td>
320   <td>
321    2
322   </td>
323  </tr>
324  <tr>
325   <td>
326    brutus
327   </td>
328   <td>
329    1
330   </td>
331  </tr>
332  <tr>
333   <td>
334    brutus
335   </td>
336   <td>
337    2
338   </td>
339  </tr>
340  <tr>
341   <td>
342    …
343   </td>
344   <td>
345   </td>
346  </tr>
347 </tbody>
348</table>
349<table class="">
350 <tbody>
351  <tr>
352   <td>
353    <strong>
354     term
355    </strong>
356   </td>
357   <td>
358    <strong>
359     frequency
360    </strong>
361   </td>
362   <td>
363    <strong>
364     postings list
365    </strong>
366   </td>
367  </tr>
368  <tr>
369   <td>
370    be
371   </td>
372   <td>
373    1
374   </td>
375   <td>
376    2
377   </td>
378  </tr>
379  <tr>
380   <td>
381    brutus
382   </td>
383   <td>
384    2
385   </td>
386   <td>
387    1, 2
388   </td>
389  </tr>
390  <tr>
391   <td>
392    capitol
393   </td>
394   <td>
395    1
396   </td>
397   <td>
398    1
399   </td>
400  </tr>
401  <tr>
402   <td>
403    …
404   </td>
405   <td>
406   </td>
407   <td>
408   </td>
409  </tr>
410 </tbody>
411</table>
412<p>Intersecting posting lists now becomes of transversing both lists in order:</p>
413<pre><code>Brutus   : 1 -&gt; 2 -&gt; 4 -&gt; 11 -&gt; 31 -&gt; 45 -&gt;           173 -&gt; 174
414Calpurnia:      2 -&gt;            31 -&gt;       54 -&gt; 101
415Intersect:      2 -&gt;            31
416</code></pre>
417<p>A simple conjunctive query (e.g. <code>Brutus AND Calpurnia</code>) is executed as follows:</p>
418<ol>
419<li>Locate <code>Brutus</code> in the dictionary</li>
420<li>Retrieve its postings</li>
421<li>Locate <code>Calpurnia</code> in the dictionary</li>
422<li>Retrieve its postings</li>
423<li>Intersect (<em>merge</em>) both postings</li>
424</ol>
425<p>Since the lists are sorted, walking both of them can be done in <em>O(n)</em> time. By also storing the frequency, we can optimize the order in which we execute arbitrary queries, although we won’t go into detail.</p>
426<h2 id="thoughts"><a class="anchor" href="#thoughts">¶</a>Thoughts</h2>
427<p>The boolean retrieval model can be implemented with relative ease, and can help with storage and efficient querying of the information if we intend to perform boolean queries.</p>
428<p>However, the basic design lacks other useful operations, such as a «near» operator, or the ability to rank the results.</p>
429<p>All in all, it’s an interesting way to look at the data and query it efficiently.</p>
430</main>
431</body>
432</html>
433