git rekt — gemini-redirect.git (e10a8c515048915835c5dd29bd2614a277fb6045): blog/ribw/about-boolean-retrieval/index.html

blog/ribw/about-boolean-retrieval/index.html (view raw)
1<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta name=description content="Official Lonami's website"><meta name=viewport content="width=device-width, initial-scale=1.0, user-scalable=yes"><title> About Boolean Retrieval | Lonami's Blog </title><link rel=stylesheet href=/style.css><body><article><nav class=sections><ul class=left><li><a href=/>lonami's site</a><li><a href=/blog class=selected>blog</a><li><a href=/golb>golb</a></ul><div class=right><a href=https://github.com/LonamiWebs><img src=/img/github.svg alt=github></a><a href=/blog/atom.xml><img src=/img/rss.svg alt=rss></a></div></nav><main><h1 class=title>About Boolean Retrieval</h1><div class=time><p>2020-02-25T00:00:29+00:00<p>last updated 2020-03-18T09:38:02+00:00</div><p>This entry will discuss the section on the <em><a href=https://nlp.stanford.edu/IR-book/pdf/01bool.pdf>Boolean retrieval</a></em> section of the book <em><a href=https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf>An Introduction to Information Retrieval</a></em>.<h2 id=summary-on-the-topic>Summary on the topic</h2><p>Boolean retrieval is one of the many ways information retrieval (finding materials that satisfy an information need), often simply called <em>search</em>.<p>A simple way to retrieve information is to <em>grep</em> through the text (term named after the Unix tool <code>grep</code>), scanning text linearly and excluding it on certain criteria. However, this falls short when the volume of the data grows, more complex queries are desired, or one seeks some sort of ranking.<p>To avoid linear scanning, we build an <em>index</em> and record for each document whether it contains each term out of our full dictionary of terms (which may be words in a chapter and words in the book). This results in a binary term-document <em>incidence matrix</em>. Such a possible matrix is:<table><tbody><tr><td><em> word/play </em><td><strong> Antony and Cleopatra </strong><td><strong> Julius Caesar </strong><td><strong> The Tempest </strong><td><strong> … </strong><tr><td><strong> Antony </strong><td>1<td>1<td>0<td><tr><td><strong> Brutus </strong><td>1<td>1<td>0<td><tr><td><strong> Caesar </strong><td>1<td>1<td>0<td><tr><td><strong> Calpurnia </strong><td>0<td>1<td>0<td><tr><td><strong> Cleopatra </strong><td>1<td>0<td>0<td><tr><td><strong> mercy </strong><td>1<td>0<td>1<td><tr><td><strong> worser </strong><td>1<td>0<td>1<td><tr><td><strong> … </strong><td><td><td><td></table><p>We can look at this matrix’s rows or columns to obtain a vector for each term indicating where it appears, or a vector for each document indicating the terms it contains.<p>Now, answering a query such as <code>Brutus AND Caesar AND NOT Calpurnia</code> becomes trivial:<pre><code>VECTOR(Brutus) AND VECTOR(Caesar) AND COMPLEMENT(VECTOR(Calpurnia))
2= 110 AND 110 AND COMPLEMENT(010)
3= 110 AND 110 AND 101
4= 100
5</code></pre><p>The query is only satisfied for our first column.<p>The <em>Boolean retrieval model</em> is thus a model that treats documents as a set of terms, in which we can perform any query in the form of Boolean expressions of terms, combined with <code>OR</code>, <code>AND</code>, and <code>NOT</code>.<p>Now, building such a matrix is often not feasible due to the sheer amount of data (say, a matrix with 500,000 terms across 1,000,000 documents, each with roughly 1,000 terms). However, it is important to notice that most of the terms will be <em>missing</em> when examining each document. In our example, this means 99.8% or more of the cells will be 0. We can instead record the <em>positions</em> of the 1’s. This is known as an <em>inverted index</em>.<p>The inverted index is a dictionary of terms, each containing a list that records in which documents it appears (<em>postings</em>). Applied to boolean retrieval, we would:<ol><li>Collects the documents to be indexed, assign a unique identifier each<li>Tokenize the text in the documents into a list of terms<li>Normalize the tokens, which now become indexing terms<li>Index the documents</ol><table><tbody><tr><td><strong> Dictionary </strong><td><strong> Postings </strong><tr><td>Brutus<td>1, 2, 4, 11, 31, 45, 173, 174<tr><td>Caesar<td>1, 2, 4, 5, 6, 16, 57, 132, …<tr><td>Calpurnia<td>2, 31, 54, 101<tr><td>…<td></table><p>Sort the pairs <code>(term, document_id)</code> so that the terms are alphabetical, and merge multiple occurences into one. Group instances of the same term and split again into a sorted list of postings.<table><tbody><tr><td><strong> term </strong><td><strong> document_id </strong><tr><td>I<td>1<tr><td>did<td>1<tr><td>…<td><tr><td>with<td>2</table><table><tbody><tr><td><strong> term </strong><td><strong> document_id </strong><tr><td>be<td>2<tr><td>brutus<td>1<tr><td>brutus<td>2<tr><td>…<td></table><table><tbody><tr><td><strong> term </strong><td><strong> frequency </strong><td><strong> postings list </strong><tr><td>be<td>1<td>2<tr><td>brutus<td>2<td>1, 2<tr><td>capitol<td>1<td>1<tr><td>…<td><td></table><p>Intersecting posting lists now becomes of transversing both lists in order:<pre><code>Brutus   : 1 -> 2 -> 4 -> 11 -> 31 -> 45 ->           173 -> 174
6Calpurnia:      2 ->            31 ->       54 -> 101
7Intersect:      2 ->            31
8</code></pre><p>A simple conjunctive query (e.g. <code>Brutus AND Calpurnia</code>) is executed as follows:<ol><li>Locate <code>Brutus</code> in the dictionary<li>Retrieve its postings<li>Locate <code>Calpurnia</code> in the dictionary<li>Retrieve its postings<li>Intersect (<em>merge</em>) both postings</ol><p>Since the lists are sorted, walking both of them can be done in <em>O(n)</em> time. By also storing the frequency, we can optimize the order in which we execute arbitrary queries, although we won’t go into detail.<h2 id=thoughts>Thoughts</h2><p>The boolean retrieval model can be implemented with relative ease, and can help with storage and efficient querying of the information if we intend to perform boolean queries.<p>However, the basic design lacks other useful operations, such as a «near» operator, or the ability to rank the results.<p>All in all, it’s an interesting way to look at the data and query it efficiently.</main><footer><div><p>Share your thoughts, or simply come hang with me <a href=https://t.me/LonamiWebs><img src=/img/telegram.svg alt=Telegram></a> <a href=mailto:totufals@hotmail.com><img src=/img/mail.svg alt=Mail></a></div></footer></article><p class=abyss>Glaze into the abyss… Oh hi there!