1<feed xmlns="http://www.w3.org/2005/Atom"><title>pagong</title><id>pagong</id><updated>2020-07-02T22:00:00+00:00</updated><entry><title>Information Retrieval and Web Search</title><id>dist/index/index.html</id><updated>2020-07-02T22:00:00+00:00</updated><published>2020-07-02T22:00:00+00:00</published><summary>During 2020 at university, this subject ("Recuperación de la Información y Búsqueda en la Web")</summary><content type="html" src="dist/index/index.html"><!DOCTYPE html>
2<html>
3<head>
4<meta charset="utf-8" />
5<meta name="viewport" content="width=device-width, initial-scale=1" />
6<title>Information Retrieval and Web Search</title>
7<link rel="stylesheet" href="../css/style.css">
8</head>
9<body>
10<main>
11<h1 class="title" id="information_retrieval_and_web_search"><a class="anchor" href="#information_retrieval_and_web_search">¶</a>Information Retrieval and Web Search</h1>
12<div class="date-created-modified">2020-07-03</div>
<p>During 2020 at university, this subject (&quot;Recuperación de la Información y Búsqueda en la Web&quot;)
had us write blog posts as assignments. I thought it was really fun, and I wanted to preserve
that work here, in the hope that it's interesting to someone.</p>
16<p>The posts were auto-generated from the original HTML files and manually anonymized later.</p>
17</main>
18</body>
19</html>
20 </content></entry><entry><title>Privado: Final NoSQL evaluation</title><id>dist/final-nosql-evaluation/index.html</id><updated>2020-05-13T22:00:00+00:00</updated><published>2020-05-12T22:00:00+00:00</published><summary>This evaluation is a bit different to my </summary><content type="html" src="dist/final-nosql-evaluation/index.html"><!DOCTYPE html>
21<html>
22<head>
23<meta charset="utf-8" />
24<meta name="viewport" content="width=device-width, initial-scale=1" />
25<title>Privado: Final NoSQL evaluation</title>
26<link rel="stylesheet" href="../css/style.css">
27</head>
28<body>
29<main>
30<p>This evaluation is a bit different to my <a href="/blog/ribw/16/nosql-evaluation/">previous one</a> because this time I have been tasked to evaluate the student <code>a(i - 2)</code>, and because I am <code>a = 9</code> that happens to be <code>a(7) =</code> Classmate.</p>
31<div class="date-created-modified">Created 2020-05-13<br>
32Modified 2020-05-14</div>
33<p>Unfortunately for Classmate, the only entry related to NoSQL I have found in their blog is Prima y segunda Actividad: Base de datos NoSQL which does not develop an application as requested for the third entry (as of 14th of May).</p>
34<p>This means that, instead, I will evaluate <code>a(i - 3)</code> which happens to be <code>a(6) =</code> Classmate and they do have an entry.</p>
35<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s Evaluation</h2>
36<p><strong>Grading: B.</strong></p>
37<p>The post I have evaluated is BB.DD. NoSQL RethinkDB 3ª Fase. Aplicación.</p>
38<p>It starts with an introduction, properly explaining what database they have chosen and why, but not what application they will be making.</p>
39<p>This is detailed just below in the next section, although it’s a bit vague.</p>
40<p>The next section talks about the Python dependencies that are required, but they never said they would be making a Python application or that we need to install Python!</p>
<p>The next section talks about the file structure of the project, and they detail what every part does, although I missed having some code snippets.</p>
<p>The final result is pretty cool and contains many interesting graphs. They provide a download of the source code and list all the relevant references used.</p>
43<p>Except for a weird «necesario falta» in the text, it’s otherwise well-written, although given the issues above I cannot grade it with the highest score.</p>
44</main>
45</body>
46</html>
47 </content></entry><entry><title>Developing a Python application for MongoDB</title><id>dist/developing-a-python-application-for-mongodb/index.html</id><updated>2020-04-15T22:00:00+00:00</updated><published>2020-03-24T23:00:00+00:00</published><summary>This is the third and last post in the MongoDB series, where we will develop a Python application to process and store OpenData inside Mongo.</summary><content type="html" src="dist/developing-a-python-application-for-mongodb/index.html"><!DOCTYPE html>
48<html>
49<head>
50<meta charset="utf-8" />
51<meta name="viewport" content="width=device-width, initial-scale=1" />
52<title>Developing a Python application for MongoDB</title>
53<link rel="stylesheet" href="../css/style.css">
54</head>
55<body>
56<main>
57<p>This is the third and last post in the MongoDB series, where we will develop a Python application to process and store OpenData inside Mongo.</p>
58<div class="date-created-modified">Created 2020-03-25<br>
59Modified 2020-04-16</div>
60<p>Other posts in this series:</p>
61<ul>
62<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a></li>
63<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a></li>
64<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a> (this post)</li>
65</ul>
<p>This post is co-authored with a Classmate.</p>
67<hr />
68<h2 class="title" id="what_are_we_making_"><a class="anchor" href="#what_are_we_making_">¶</a>What are we making?</h2>
69<p>We are going to develop a web application that renders a map, in this case, the town of Cáceres, with which users can interact. When the user clicks somewhere on the map, the selected location will be sent to the server to process. This server will perform geospatial queries to Mongo and once the results are ready, the information is presented back at the webpage.</p>
70<p>The data used for the application comes from <a href="https://opendata.caceres.es/">Cáceres’ OpenData</a>, and our goal is that users will be able to find information about certain areas in a quick and intuitive way, such as precise coordinates, noise level, and such.</p>
71<h2 id="what_are_we_using_"><a class="anchor" href="#what_are_we_using_">¶</a>What are we using?</h2>
72<p>The web application will be using <a href="https://python.org/">Python</a> for the backend, <a href="https://svelte.dev/">Svelte</a> for the frontend, and <a href="https://www.mongodb.com/">Mongo</a> as our storage database and processing center.</p>
73<ul>
74<li><strong>Why Python?</strong> It’s a comfortable language to write and to read, and has a great ecosystem with <a href="https://pypi.org/">plenty of libraries</a>.</li>
75<li><strong>Why Svelte?</strong> Svelte is the New Thing<strong>™</strong> in the world of component frameworks for JavaScript. It is similar to React or Vue, but compiled and with a lot less boilerplate. Check out their <a href="https://svelte.dev/blog/svelte-3-rethinking-reactivity">Svelte post</a> to learn more.</li>
76<li><strong>Why Mongo?</strong> We believe NoSQL is the right approach for doing the kind of processing and storage that we expect, and it’s <a href="https://docs.mongodb.com/">very easy to use</a>. In addition, we will be making Geospatial Queries which <a href="https://docs.mongodb.com/manual/geospatial-queries/">Mongo supports</a>.</li>
77</ul>
78<p>Why didn’t we choose to make a smaller project, you may ask? You will be shocked to hear that we do not have an answer for that!</p>
79<p>Note that we will not be embedding <strong>all</strong> the code of the project in this post, or it would be too long! We will include only the relevant snippets needed to understand the core ideas of the project, and not the unnecessary parts of it (for example, parsing configuration files to easily change the port where the server runs is not included).</p>
80<h2 id="python_dependencies"><a class="anchor" href="#python_dependencies">¶</a>Python dependencies</h2>
81<p>Because we will program it in Python, you need Python installed. You can install it using a package manager of your choice or heading over to the <a href="https://www.python.org/downloads/">Python downloads section</a>, but if you’re on Linux, chances are you have it installed already.</p>
82<p>Once Python 3.7 or above is installed, install <a href="https://motor.readthedocs.io/en/stable/"><code>motor</code> (Asynchronous Python driver for MongoDB)</a> and the <a href="https://docs.aiohttp.org/en/stable/web.html"><code>aiohttp</code> server</a> through <code>pip</code>:</p>
83<pre><code>pip install aiohttp motor
84</code></pre>
85<p>Make sure that Mongo is running in the background (this has been described in previous posts), and we should be able to get to work.</p>
86<h2 id="web_dependencies"><a class="anchor" href="#web_dependencies">¶</a>Web dependencies</h2>
<p>To work with Svelte and its dependencies, we will need <a href="https://www.npmjs.com/"><code>npm</code></a>, which comes with <a href="https://nodejs.org/en/">NodeJS</a>, so go and <a href="https://nodejs.org/en/download/">install Node from their site</a>. The download will be different depending on your operating system.</p>
88<p>Following <a href="https://svelte.dev/blog/the-easiest-way-to-get-started">the easiest way to get started with Svelte</a>, we will put our project in a <code>client/</code> folder (because this is what the clients see, the frontend). Feel free to tinker a bit with the configuration files to change the name and such, although this isn’t relevant for the rest of the post.</p>
89<h2 id="finding_the_data"><a class="anchor" href="#finding_the_data">¶</a>Finding the data</h2>
90<p>We are going to work with the JSON files provided by <a href="http://opendata.caceres.es/">OpenData Cáceres</a>. In particular, we want information about the noise, census, vias and trees. To save you the time from <a href="http://opendata.caceres.es/dataset">searching each of these</a>, we will automate the download with code.</p>
91<p>If you want to save the data offline or just know what data we’ll be using for other purposes though, you can right click on the following links and select «Save Link As…» with the name of the link:</p>
92<ul>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:MedicionRuido&amp;format=json"><code>noise.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:InformacionPadron&amp;year=2017&amp;format=json"><code>census.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:InformacionPadron&amp;year=2017&amp;format=json"><code>vias.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:Arbol&amp;format=json"><code>trees.json</code></a></li>
97</ul>
98<h2 id="backend"><a class="anchor" href="#backend">¶</a>Backend</h2>
99<p>It’s time to get started with some code! We will put it in a <code>server/</code> folder because it will contain the Python server, that is, the backend of our application.</p>
<p>We are using <code>aiohttp</code> because we would like our server to be <code>async</code>. We don’t expect a lot of users at the same time, but it’s good to know our server would be well-designed for that use-case. As a bonus, it makes IO points clear in the code, which can help reason about it. The implicit synchronization between <code>await</code> points is also nice to have.</p>
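<p>To illustrate the point about implicit synchronization, here is a minimal sketch that uses only the standard library's <code>asyncio</code>. The <code>worker</code> and <code>counter</code> names are made up for this example and are not part of the project. Because a coroutine can only be suspended at an <code>await</code>, the read-modify-write below is never interrupted half-way:</p>

```python
import asyncio

counter = 0

async def worker(iterations):
    global counter
    for _ in range(iterations):
        # No `await` between the read and the write, so no other task
        # can run in between and the update cannot be lost.
        value = counter
        counter = value + 1
        # Only here may the event loop switch to another task.
        await asyncio.sleep(0)

async def main():
    # Ten tasks interleave on a single thread.
    await asyncio.gather(*(worker(1000) for _ in range(10)))

asyncio.run(main())
print(counter)
```

<p>With threads, the same read-modify-write would generally need a lock to be safe.</p>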
101<h3 id="saving_the_data_in_mongo"><a class="anchor" href="#saving_the_data_in_mongo">¶</a>Saving the data in Mongo</h3>
102<p>Before running the server, we must ensure that the data we need is already stored and indexed in Mongo. Our <code>server/data.py</code> will take care of downloading the files, cleaning them up a little (Cáceres’ OpenData can be a bit awkward sometimes), inserting them into Mongo and indexing them.</p>
<p>Downloading the JSON data can be done with <a href="https://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.ClientSession.get"><code>ClientSession.get</code></a>. We also take this opportunity to clean up the messy encoding from the JSON, which does not seem to be UTF-8 in some cases.</p>
<pre><code>import json

async def load_json(session, url):
    # Byte sequences left behind by the broken encoding, paired with
    # the character each one was meant to be.
    fixes = [(old, new.encode('utf-8')) for old, new in [
        (b'\xc3\x83\\u2018', 'Ñ'),
        (b'\xc3\x83\\u0081', 'Á'),
        (b'\xc3\x83\\u2030', 'É'),
        (b'\xc3\x83\\u008D', 'Í'),
        (b'\xc3\x83\\u201C', 'Ó'),
        (b'\xc3\x83\xc5\xa1', 'Ú'),
        (b'\xc3\x83\xc2\xa1', 'á'),
    ]]

    async with session.get(url) as resp:
        data = await resp.read()

    # Yes, this feels inefficient, but it's not really worth improving.
    for old, new in fixes:
        data = data.replace(old, new)

    data = data.decode('utf-8')
    return json.loads(data)
</code></pre>
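<p>To see what one of those replacements does, here is a small offline sketch that needs no network access. The <code>clean</code> helper and the sample payload are invented for illustration; only the byte pattern for <code>Ñ</code> is taken from the table above:</p>

```python
import json

# One entry from the fix table: the mangled bytes and the intended character.
FIXES = [(b'\xc3\x83\\u2018', 'Ñ'.encode('utf-8'))]

def clean(raw: bytes) -> dict:
    # Patch the raw bytes first, then decode and parse as usual.
    for old, new in FIXES:
        raw = raw.replace(old, new)
    return json.loads(raw.decode('utf-8'))

# Hypothetical response body containing the mangled form of "CAÑADA".
sample = b'{"name": "CA\xc3\x83\\u2018ADA"}'
print(clean(sample))
```

<p>Doing the replacement on bytes, before decoding, means the JSON parser never sees the mangled sequence.</p>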
125<p>Later on, it can be reused for the various different URLs:</p>
<pre><code>import aiohttp

NOISE_URL = 'http://opendata.caceres.es/GetData/GetData?dataset=om:MedicionRuido&amp;format=json'
# (...other needed URLs here)

async def insert_to_db(db):
    async with aiohttp.ClientSession() as session:
        data = await load_json(session, NOISE_URL)
        # now we have the JSON data cleaned up, ready to be parsed
</code></pre>
136<h3 id="data_model"><a class="anchor" href="#data_model">¶</a>Data model</h3>
137<p>With the JSON data in our hands, it’s time to parse it. Always remember to <a href="https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/">parse, don’t validate</a>. With <a href="https://docs.python.org/3/library/dataclasses.html">Python 3.7 <code>dataclasses</code></a> it’s trivial to define classes that will store only the fields we care about, typed, and with proper names:</p>
<pre><code>from dataclasses import dataclass
from typing import Tuple

Longitude = float
Latitude = float

@dataclass
class GSON:
    type: str
    coordinates: Tuple[Longitude, Latitude]

@dataclass
class Noise:
    id: int
    geo: GSON
    level: float
</code></pre>
<p>This makes it really easy to see that, if we have a <code>Noise</code>, we can access its <code>geo</code> data which is a <code>GSON</code> with a <code>type</code> and <code>coordinates</code>, having <code>Longitude</code> and <code>Latitude</code> respectively. <code>dataclasses</code> and <a href="https://docs.python.org/3/library/typing.html"><code>typing</code></a> make dealing with this very easy and clear.</p>
<p>Every dataclass will be in its own collection inside Mongo, and these are:</p>
<ul>
<li>
<p>Noise</p>
<ul>
<li>Integer <code>id</code></li>
<li>GeoJSON <code>geo</code>
<ul>
<li>String <code>type</code></li>
<li>Longitude-latitude pair <code>coordinates</code></li>
</ul>
</li>
<li>Floating-point number <code>level</code></li>
</ul>
</li>
<li>
<p>Tree</p>
<ul>
<li>String <code>name</code></li>
<li>String <code>gender</code></li>
<li>Integer <code>units</code></li>
<li>Floating-point number <code>height</code></li>
<li>Floating-point number <code>cup_diameter</code></li>
<li>Floating-point number <code>trunk_diameter</code></li>
<li>Optional string <code>variety</code></li>
<li>Optional string <code>distribution</code></li>
<li>GeoJSON <code>geo</code></li>
<li>Optional string <code>irrigation</code></li>
</ul>
</li>
<li>
<p>Census</p>
<ul>
<li>Integer <code>year</code></li>
<li>Via <code>via</code>
<ul>
<li>String <code>name</code></li>
<li>String <code>kind</code></li>
<li>Integer <code>code</code></li>
<li>Optional string <code>history</code></li>
<li>Optional string <code>old_name</code></li>
<li>Optional floating-point number <code>length</code></li>
<li>Optional GeoJSON <code>start</code></li>
<li>GeoJSON <code>middle</code></li>
<li>Optional GeoJSON <code>end</code></li>
<li>Optional list with geometry pairs <code>geometry</code></li>
</ul>
</li>
<li>Integer <code>count</code></li>
<li>Mapping year-to-count <code>count_per_year</code></li>
<li>Mapping gender-to-count <code>count_per_gender</code></li>
<li>Mapping nationality-to-count <code>count_per_nationality</code></li>
<li>Integer <code>time_year</code></li>
</ul>
</li>
</ul>
263<p>Now, let’s define a method to actually parse the JSON and yield instances from these new data classes:</p>
<pre><code>@classmethod
def iter_from_json(cls, data):
    for row in data['results']['bindings']:
        noise_id = int(row['uri']['value'].split('/')[-1])
        long = float(row['geo_long']['value'])
        lat = float(row['geo_lat']['value'])
        level = float(row['om_nivelRuido']['value'])

        yield cls(
            id=noise_id,
            geo=GSON(type='Point', coordinates=[long, lat]),
            level=level
        )
</code></pre>
278<p>Here we iterate over the input JSON <code>data</code> bindings and <code>yield cls</code> instances with more consistent naming than the original one. We also extract the data from the many unnecessary nested levels of the JSON and have something a lot flatter to work with.</p>
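<p>As a rough, self-contained sketch of how this fits together, the snippet below runs the parser on a made-up payload. The key names (<code>uri</code>, <code>geo_long</code>, <code>geo_lat</code>, <code>om_nivelRuido</code>) come from the code above, but the sample values and the <code>example.org</code> URI are invented, and <code>coordinates</code> is annotated as a plain list here for brevity:</p>

```python
from dataclasses import dataclass, asdict

@dataclass
class GSON:
    type: str
    coordinates: list  # [longitude, latitude]

@dataclass
class Noise:
    id: int
    geo: GSON
    level: float

    @classmethod
    def iter_from_json(cls, data):
        for row in data['results']['bindings']:
            yield cls(
                # The numeric id is the last path segment of the URI.
                id=int(row['uri']['value'].split('/')[-1]),
                geo=GSON(type='Point', coordinates=[
                    float(row['geo_long']['value']),
                    float(row['geo_lat']['value']),
                ]),
                level=float(row['om_nivelRuido']['value']),
            )

# Hypothetical payload: same nesting as the real bindings, invented values.
sample = {'results': {'bindings': [{
    'uri': {'value': 'http://example.org/noise/42'},
    'geo_long': {'value': '-6.37'},
    'geo_lat': {'value': '39.47'},
    'om_nivelRuido': {'value': '55.5'},
}]}}

# asdict() turns nested dataclasses into plain dicts, ready to be stored.
documents = [asdict(noise) for noise in Noise.iter_from_json(sample)]
print(documents)
```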
<p>For those of you who don’t know what <code>yield</code> does (after all, not everyone is used to seeing generators), here are two functions that work nearly the same:</p>
<pre><code>def squares_return(n):
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result

def squares_yield(n):
    for i in range(n):
        yield i ** 2
</code></pre>
290<p>The difference is that the one with <code>yield</code> is «lazy» and doesn’t need to do all the work up-front. It will generate (yield) more values as they are needed when you use a <code>for</code> loop. Generally, it’s a better idea to create generator functions than do all the work early which may be unnecessary. See <a href="https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do">What does the «yield» keyword do?</a> if you still have questions.</p>
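<p>A quick way to observe that laziness is to record how much work the generator actually performs. In this sketch the <code>calls</code> list is an addition made purely for illustration:</p>

```python
from itertools import islice

calls = []

def squares_yield(n):
    for i in range(n):
        calls.append(i)  # record each step that really runs
        yield i ** 2

# Ask for a million squares, but only consume the first three.
first_three = list(islice(squares_yield(1_000_000), 3))
print(first_three)
print(len(calls))
```

<p>Only three iterations run, even though the generator was asked for a million values.</p>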
291<p>With everything parsed, it’s time to insert the data into Mongo. If the data was not present yet (0 documents), then we will download the file, parse it, insert it as documents into the given Mongo <code>db</code>, and index it:</p>
<pre><code>from dataclasses import asdict

async def insert_to_db(db):
    async with aiohttp.ClientSession() as session:
        if await db.noise.estimated_document_count() == 0:
            data = await load_json(session, NOISE_URL)

            await db.noise.insert_many(asdict(noise) for noise in Noise.iter_from_json(data))
            await db.noise.create_index([('geo', '2dsphere')])
</code></pre>
302<p>We repeat this process for all the other data, and just like that, Mongo is ready to be used in our server.</p>
303<h3 id="indices"><a class="anchor" href="#indices">¶</a>Indices</h3>
<p>In order to execute our geospatial queries we have to create an index on the attribute that represents the location, because the operators that we will use require it. This attribute can be a <a href="https://docs.mongodb.com/manual/reference/geojson/">GeoJSON object</a> or a legacy coordinate pair.</p>
305<p>We have decided to use a GeoJSON object because we want to avoid legacy features that may be deprecated in the future.</p>
306<p>The attribute is called <code>geo</code> for the <code>Tree</code> and <code>Noise</code> objects and <code>start</code>, <code>middle</code> or <code>end</code> for the <code>Via</code> class. In the <code>Via</code> we are going to index the attribute <code>middle</code> because it is the most representative field for us. Because the <code>Via</code> is inside the <code>Census</code> and it doesn’t have its own collection, we create the index on the <code>Census</code> collection.</p>
<p>The index type we use is <code>2dsphere</code> because it supports queries that work on geometries on an earth-like sphere. Another option is the <code>2d</code> index, but it’s not a good fit for our use case because it is meant for queries that calculate geometries on a two-dimensional plane.</p>
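<p>To get a feeling for why a flat plane is a poor model at this latitude, here is a small sketch that uses the haversine formula with a mean Earth radius of 6371 km. This is an illustration only: it is not how Mongo computes distances internally, and the coordinates are just a point roughly inside the area we cover.</p>

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two longitude/latitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

lon, lat = -6.37, 39.47
north = haversine_m(lon, lat, lon, lat + 0.01)  # 0.01 degrees of latitude
east = haversine_m(lon, lat, lon + 0.01, lat)   # 0.01 degrees of longitude
print(round(north), round(east))
```

<p>The same 0.01 degrees covers noticeably less ground from east to west than from north to south, so treating degrees as planar units would distort a circular search area.</p>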
308<h3 id="running_the_server"><a class="anchor" href="#running_the_server">¶</a>Running the server</h3>
<p>If we ignore the configuration part of the server creation, our <code>server.py</code> file is pretty simple. Its job is to create a <a href="https://aiohttp.readthedocs.io/en/stable/web.html">server application</a>, set up Mongo and return it to the caller so that they can run it:</p>
<pre><code>import asyncio
import os
import subprocess
import sys

import motor.motor_asyncio
from aiohttp import web

from . import rest, data

def create_app():
    # Build the frontend first, and give up if the build fails.
    ret = subprocess.run('npm run build', cwd='../client', shell=True).returncode
    if ret != 0:
        sys.exit(ret)

    db = motor.motor_asyncio.AsyncIOMotorClient().opendata
    loop = asyncio.get_event_loop()
    loop.run_until_complete(data.insert_to_db(db))

    app = web.Application()
    app['db'] = db

    app.router.add_routes([
        web.get('/', lambda r: web.HTTPSeeOther('/index.html')),
        *rest.ROUTES,
        # `config` comes from the configuration parsing omitted in this post.
        web.static('/', os.path.join(config['www']['root'], 'public')),
    ])

    return app
</code></pre>
338<p>There’s a bit going on here, but it’s nothing too complex:</p>
339<ul>
340<li>We automatically run <code>npm run build</code> on the frontend because it’s very comfortable to have the frontend built automatically before the server runs.</li>
341<li>We create a Motor client and access the <code>opendata</code> database. Into it, we load the data, effectively saving it in Mongo for the server to use.</li>
342<li>We create the server application and save a reference to the Mongo database in it, so that it can be used later on any endpoint without needing to recreate it.</li>
<li>We define the routes of our app: root, REST and static (where the frontend files live). We’ll get to the <code>rest</code> part soon.</li>
</ul>
<p>Running the server is now simple:</p>
<pre><code>def main():
    from aiohttp import web
    from . import server

    app = server.create_app()
    web.run_app(app)

if __name__ == '__main__':
    main()
</code></pre>
356<h3 id="rest_endpoints"><a class="anchor" href="#rest_endpoints">¶</a>REST endpoints</h3>
<p>The frontend will communicate with the backend via <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">REST</a> calls, so that it can ask for things like «give me the information associated with this area», and the web server can query the Mongo server to reply with an HTTP response. This little diagram should help:</p>
358<p><img src="bitmap.png" alt="" /></p>
359<p>What we need to do, then, is define those REST endpoints we mentioned earlier when creating the server. We will process the HTTP request, ask Mongo for the data, and return the HTTP response:</p>
<pre><code>import asyncio
import pymongo

from aiohttp import web

async def get_area_info(request):
    try:
        long = float(request.query['long'])
        lat = float(request.query['lat'])
        distance = float(request.query['distance'])
    except KeyError as e:
        raise web.HTTPBadRequest(reason=f'a required parameter was missing: {e.args[0]}')
    except ValueError:
        raise web.HTTPBadRequest(reason='one of the parameters was not a valid float')

    geo_avg_noise_pipeline = [{
        '$geoNear': {
            'near': {'type': 'Point', 'coordinates': [long, lat]},
            'maxDistance': distance,
            'minDistance': 0,
            'spherical': True,
            'distanceField': 'distance'
        }
    }]

    db = request.app['db']

    try:
        noise_count, sum_noise, avg_noise = 0, 0, 0
        async for item in db.noise.aggregate(geo_avg_noise_pipeline):
            noise_count += 1
            sum_noise += item['level']

        if noise_count != 0:
            avg_noise = sum_noise / noise_count
        else:
            avg_noise = None

    except pymongo.errors.ConnectionFailure:
        raise web.HTTPServiceUnavailable(reason='no connection to database')

    # tree_count, trees_per_type and census_count are computed by the
    # real code in a similar way; that part is left out of this snippet.
    return web.json_response({
        'tree_count': tree_count,
        'trees_per_type': [[k, v] for k, v in trees_per_type.items()],
        'census_count': census_count,
        'avg_noise': avg_noise,
    })

ROUTES = [
    web.get('/rest/get-area-info', get_area_info)
]
</code></pre>
412<p>In this code, we’re only showing how to return the average noise because that’s the simplest we can do. The real code also fetches tree count, tree count per type, and census count.</p>
413<p>Again, there’s quite a bit to go through, so let’s go step by step:</p>
414<ul>
415<li>We parse the frontend’s <code>request.query</code> into <code>float</code> that we can use. In particular, the frontend is asking us for information at a certain latitude, longitude, and distance. If the query is malformed, we return a proper error.</li>
416<li>We create our query for Mongo outside, just so it’s clearer to read.</li>
417<li>We access the database reference we stored earlier when creating the server with <code>request.app['db']</code>. Handy!</li>
418<li>We try to query Mongo. It may fail if the Mongo server is not running, so we should handle that and tell the client what’s happening. If it succeeds though, we will gather information about the average noise.</li>
<li>We return a <code>json_response</code> with Mongo results for the frontend to present to the user.</li>
</ul>
<p>You may have noticed we defined a <code>ROUTES</code> list at the bottom. This will make it easier to expand in the future, and the server creation won’t need to change anything in its code, because it’s already unpacking all the routes we define here.</p>
422<h3 id="geospatial_queries"><a class="anchor" href="#geospatial_queries">¶</a>Geospatial queries</h3>
<p>In order to retrieve the information from the Mongo database we have defined two geospatial queries:</p>
<pre><code>geo_query = {
    '$nearSphere': {
        '$geometry': {
            'type': 'Point',
            'coordinates': [long, lat]
        },
        '$maxDistance': distance,
        '$minDistance': 0
    }
}
</code></pre>
<p>This query uses <a href="https://docs.mongodb.com/manual/reference/operator/query/nearSphere/#op._S_nearSphere">the operator <code>$nearSphere</code></a>, which returns geospatial objects in proximity to a point on a sphere.</p>
<p>The point on the sphere is given by the <code>$geometry</code> operator, which specifies the type of geometry and the coordinates (taken from the HTTP request).</p>
<p>The maximum and minimum distance are represented by <code>$maxDistance</code> and <code>$minDistance</code> respectively. We specify that the maximum distance is the radius selected by the user.</p>
<pre><code>geo_avg_noise_pipeline = [{
    '$geoNear': {
        'near': {'type': 'Point', 'coordinates': [long, lat]},
        'maxDistance': distance,
        'minDistance': 0,
        'spherical': True,
        'distanceField': 'distance'
    }
}]
</code></pre>
448<p>This query uses the <a href="https://docs.mongodb.com/manual/core/aggregation-pipeline/">aggregation pipeline</a> stage <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/geoNear/#pipe._S_geoNear"><code>$geoNear</code></a> which returns an ordered stream of documents based on the proximity to a geospatial point. The output documents include an additional distance field.</p>
<p>The <code>near</code> field is mandatory and is the point for which to find the closest documents. This field specifies the type of geometry and the coordinates (taken from the HTTP request).</p>
450<p>The <code>distanceField</code> field is also mandatory and is the output field that will contain the calculated distance. In this case we’ve just called it <code>distance</code>.</p>
451<p>Some other fields are <code>maxDistance</code> that indicates the maximum allowed distance from the center of the point, <code>minDistance</code> for the minimum distance, and <code>spherical</code> which tells MongoDB how to calculate the distance between two points.</p>
<p>We specify the maximum distance as the radius selected by the user in the frontend.</p>
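<p>As a possible variation (not what the project's code does), the averaging could be left to Mongo itself by appending a <code>$group</code> stage with the <code>$avg</code> and <code>$sum</code> accumulators after <code>$geoNear</code>. The helper name <code>build_avg_noise_pipeline</code> and the output field names are assumptions made for this sketch:</p>

```python
def build_avg_noise_pipeline(long, lat, distance):
    """Hypothetical helper: build a pipeline that averages `level` in Mongo."""
    return [
        # $geoNear has to be the first stage of the pipeline.
        {'$geoNear': {
            'near': {'type': 'Point', 'coordinates': [long, lat]},
            'maxDistance': distance,
            'minDistance': 0,
            'spherical': True,
            'distanceField': 'distance',
        }},
        # Collapse every matching document into a single summary document.
        {'$group': {
            '_id': None,
            'avg_noise': {'$avg': '$level'},
            'noise_count': {'$sum': 1},
        }},
    ]

pipeline = build_avg_noise_pipeline(-6.37, 39.47, 500)
print([next(iter(stage)) for stage in pipeline])
```

<p>When nothing falls inside the radius, the <code>$group</code> stage produces no document at all, so the caller would still have to handle an empty result, much like the <code>None</code> case in the handler above.</p>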
453<h2 id="frontend"><a class="anchor" href="#frontend">¶</a>Frontend</h2>
454<p>As said earlier, our frontend will use Svelte. We already downloaded the template, so we can start developing. For some, this is the most fun part, because they can finally see and interact with some of the results. But for this interaction to work, we needed a functional backend which we now have!</p>
455<h3 id="rest_queries"><a class="anchor" href="#rest_queries">¶</a>REST queries</h3>
<p>The frontend has to query the server to get any meaningful data to show on the page. The <a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API">Fetch API</a> does not throw an exception if the server doesn’t respond with HTTP OK, but we would like one if things go wrong, so that we can handle them gracefully. The first thing we’ll do is define our own exception <a href="https://stackoverflow.com/a/27724419">which is not pretty</a>:</p>
<pre><code>function NetworkError(message, status) {
    var instance = new Error(message);
    instance.name = 'NetworkError';
    instance.status = status;
    Object.setPrototypeOf(instance, Object.getPrototypeOf(this));
    if (Error.captureStackTrace) {
        Error.captureStackTrace(instance, NetworkError);
    }
    return instance;
}

NetworkError.prototype = Object.create(Error.prototype, {
    constructor: {
        value: Error,
        enumerable: false,
        writable: true,
        configurable: true
    }
});
Object.setPrototypeOf(NetworkError, Error);
</code></pre>
<p>But hey, now we have a proper and reusable <code>NetworkError</code>! Next, let’s make a proper and reusable <code>query</code> function that deals with <code>fetch</code> for us:</p>
<pre><code>async function query(endpoint) {
    const res = await fetch(endpoint, {
        // if we ever use cookies, this is important
        credentials: 'include'
    });
    if (res.ok) {
        return await res.json();
    } else {
        throw new NetworkError(await res.text(), res.status);
    }
}
</code></pre>
491<p>At last, we can query our web server. The export here tells Svelte that this function should be visible to outer modules (public) as opposed to being private:</p>
<pre><code>export function get_area_info(long, lat, distance) {
    return query(`/rest/get-area-info?long=${long}&amp;lat=${lat}&amp;distance=${distance}`);
}
</code></pre>
<p>The attentive reader will have noticed that <code>query</code> is <code>async</code>, but <code>get_area_info</code> is not. This is intentional, because we don’t need to <code>await</code> for anything inside of it. We can just return the <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise"><code>Promise</code></a> that <code>query</code> created and let the caller <code>await</code> it as they see fit. The <code>await</code> here would have been redundant.</p>
497<p>For those of you who don’t know what a JavaScript promise is, think of it as an object that represents «an eventual result». The result may not be there yet, but we promised it will be present in the future, and we can <code>await</code> for it. You can also find the same concept in other languages like Python under a different name, such as <a href="https://docs.python.org/3/library/asyncio-future.html#asyncio.Future"><code>Future</code></a>.</p>
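<p>For readers more at home in Python, the same pattern can be sketched with <code>asyncio</code>. This is only an analogy: the <code>query</code> below fakes the network call with <code>asyncio.sleep</code> and simply echoes the endpoint back, and it returns a coroutine object rather than a <code>Future</code>, but the idea of a plain function handing back something awaitable is the same:</p>

```python
import asyncio

async def query(endpoint):
    # Stand-in for real network I/O; just echoes the endpoint back.
    await asyncio.sleep(0)
    return {'endpoint': endpoint}

def get_area_info(long, lat, distance):
    # Not async: it builds the URL and returns the awaitable untouched.
    return query(f'/rest/get-area-info?long={long}&lat={lat}&distance={distance}')

async def main():
    # The caller decides when to await the eventual result.
    return await get_area_info(-6.37, 39.47, 500)

result = asyncio.run(main())
print(result)
```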
498<h3 id="map_component"><a class="anchor" href="#map_component">¶</a>Map component</h3>
<p>In Svelte, we can define self-contained components that are isolated from the rest. This makes it really easy to create a modular application. Think of a Svelte component as your own HTML tag, which you can customize however you want, building upon the already-existing components HTML has to offer.</p>
<p>The main thing that our map needs to do is render the map as an image and overlay the selection area as the user hovers over the map with their mouse. We could render the image in the canvas itself, but instead we’ll use the HTML <code>&lt;img&gt;</code> tag for that and put a transparent <code>&lt;canvas&gt;</code> on top with some CSS. This should make it cheaper and easier to render things on the canvas.</p>
501<p>The <code>Map</code> component will thus render as the user moves the mouse over it, and produce an event when they click so that whatever component is using a <code>Map</code> knows that it was clicked. Here’s the final CSS and HTML:</p>
502<pre><code>&lt;style&gt;
503div {
504 position: relative;
505}
506canvas {
507 position: absolute;
508 left: 0;
509 top: 0;
510 cursor: crosshair;
511}
512&lt;/style&gt;
513
514&lt;div&gt;
515 &lt;img bind:this={img} on:load={handleLoad} {height} src=&quot;caceres-municipality.svg&quot; alt=&quot;Cáceres (municipality)&quot;/&gt;
516 &lt;canvas
517 bind:this={canvas}
518 on:mousemove={handleMove}
519 on:wheel={handleWheel}
520 on:mouseup={handleClick}/&gt;
521&lt;/div&gt;
522</code></pre>
523<p>We hardcode a map source here, but ideally this would be provided by the server. The project is already complex enough, so we tried to avoid more complexity than necessary.</p>
524<p>We bind the tags to some variables declared in the JavaScript code of the component, along with some functions and parameters to let the users of <code>Map</code> customize it just a little.</p>
525<p>Here’s the gist of the JavaScript code:</p>
526<pre><code>&lt;script&gt;
527 import { createEventDispatcher, onMount } from 'svelte';
528
529 export let height = 200;
530
531 const dispatch = createEventDispatcher();
532
533 let img;
534 let canvas;
535
536 const LONG_WEST = -6.426881;
537 const LONG_EAST = -6.354143;
538 const LAT_NORTH = 39.500064;
539 const LAT_SOUTH = 39.443201;
540
541 let x = 0;
542 let y = 0;
543 let clickInfo = null; // [x, y, radius]
544 let radiusDelta = 0.005 * height;
545 let maxRadius = 0.2 * height;
546 let minRadius = 0.01 * height;
547 let radius = 0.05 * height;
548
549 function handleLoad() {
550 canvas.width = img.width;
551 canvas.height = img.height;
552 }
553
554 function handleMove(event) {
555 const { left, top } = this.getBoundingClientRect();
556 x = Math.round(event.clientX - left);
557 y = Math.round(event.clientY - top);
558 }
559
560 function handleWheel(event) {
561 if (event.deltaY &lt; 0) {
562 if (radius &lt; maxRadius) {
563 radius += radiusDelta;
564 }
565 } else {
566 if (radius &gt; minRadius) {
567 radius -= radiusDelta;
568 }
569 }
570 event.preventDefault();
571 }
572
573 function handleClick(event) {
574 dispatch('click', {
575 // the real code here maps the x/y/radius values to the right range, here omitted
576 x: ...,
577 y: ...,
578 radius: ...,
579 });
580 }
581
582 onMount(() =&gt; {
583 const ctx = canvas.getContext('2d');
584 let frame;
585
586 (function loop() {
587 frame = requestAnimationFrame(loop);
588
589 // the real code renders mouse area/selection, here omitted for brevity
590 ...
591 }());
592
593 return () =&gt; {
594 cancelAnimationFrame(frame);
595 };
596 });
597&lt;/script&gt;
598</code></pre>
<p>Let’s go through it bit by bit:</p>
600<ul>
601<li>We define a few variables and constants for later use in the final code.</li>
602<li>We define the handlers to react to mouse movement and clicks. On click, we dispatch an event to outer components.</li>
<li>We set up the render loop with animation frames, and cancel the pending frame appropriately if the component disappears.</li>
604</ul>
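<p>The listing above omits how the pixel position is mapped to the right range. One plausible way to do it, shown here as a sketch only, is linear interpolation between the bounding-box constants. The helpers <code>lerp</code> and <code>pixelToCoords</code> are hypothetical names, and the sketch assumes the image covers exactly that bounding box and that the area is small enough for a linear mapping to be acceptable:</p>

```javascript
const LONG_WEST = -6.426881;
const LONG_EAST = -6.354143;
const LAT_NORTH = 39.500064;
const LAT_SOUTH = 39.443201;

// Linear interpolation: maps t in [0, 1] onto [from, to].
function lerp(from, to, t) {
  return from + (to - from) * t;
}

// Hypothetical helper: converts a pixel position on the canvas to (longitude, latitude).
function pixelToCoords(x, y, width, height) {
  return {
    long: lerp(LONG_WEST, LONG_EAST, x / width),
    // Canvas y grows downwards, while latitude grows northwards.
    lat: lerp(LAT_NORTH, LAT_SOUTH, y / height),
  };
}

// The top-left pixel maps to (LONG_WEST, LAT_NORTH).
console.log(pixelToCoords(0, 0, 400, 400));
```

<p>Converting the radius from pixels to meters would need a similar scale factor, which we do not attempt to guess here.</p>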
605<h3 id="app_component"><a class="anchor" href="#app_component">¶</a>App component</h3>
<p>Time to put everything together! We will include our function to make REST queries along with our <code>Map</code> component to render things on screen.</p>
607<pre><code>&lt;script&gt;
608 import Map from './Map.svelte';
609 import { get_area_info } from './rest.js'
610 let selection = null;
611 let area_info_promise = null;
612 function handleMapSelection(event) {
613 selection = event.detail;
614 area_info_promise = get_area_info(selection.x, selection.y, selection.radius);
615 }
616 function format_avg_noise(avg_noise) {
617 if (avg_noise === null) {
618 return '(no data)';
619 } else {
620 return `${avg_noise.toFixed(2)} dB`;
621 }
622 }
623&lt;/script&gt;
624
625&lt;div class=&quot;container-fluid&quot;&gt;
626 &lt;div class=&quot;row&quot;&gt;
627 &lt;div class=&quot;col-3&quot; style=&quot;max-width: 300em;&quot;&gt;
628 &lt;div class=&quot;text-center&quot;&gt;
629 &lt;h1&gt;Caceres Data Consultory&lt;/h1&gt;
630 &lt;/div&gt;
631 &lt;Map height={400} on:click={handleMapSelection}/&gt;
632 &lt;div class=&quot;text-center mt-4&quot;&gt;
633 {#if selection === null}
634 &lt;p class=&quot;m-1 p-3 border border-bottom-0 bg-info text-white&quot;&gt;Click on the map to select the area you wish to see details for.&lt;/p&gt;
635 {:else}
636 &lt;h2 class=&quot;bg-dark text-white&quot;&gt;Selected area&lt;/h2&gt;
637 &lt;p&gt;&lt;b&gt;Coordinates:&lt;/b&gt; ({selection.x}, {selection.y})&lt;/p&gt;
638 &lt;p&gt;&lt;b&gt;Radius:&lt;/b&gt; {selection.radius} meters&lt;/p&gt;
639 {/if}
640 &lt;/div&gt;
641 &lt;/div&gt;
642 &lt;div class=&quot;col-sm-4&quot;&gt;
643 &lt;div class=&quot;row&quot;&gt;
644 {#if area_info_promise !== null}
645 {#await area_info_promise}
646 &lt;p&gt;Fetching area information…&lt;/p&gt;
647 {:then area_info}
648 &lt;div class=&quot;col&quot;&gt;
649 &lt;div class=&quot;text-center&quot;&gt;
650 &lt;h2 class=&quot;m-1 bg-dark text-white&quot;&gt;Area information&lt;/h2&gt;
651 &lt;ul class=&quot;list-unstyled&quot;&gt;
652 &lt;li&gt;There are &lt;b&gt;{area_info.tree_count} trees &lt;/b&gt; within the area&lt;/li&gt;
653 &lt;li&gt;The &lt;b&gt;average noise&lt;/b&gt; is &lt;b&gt;{format_avg_noise(area_info.avg_noise)}&lt;/b&gt;&lt;/li&gt;
654 &lt;li&gt;There are &lt;b&gt;{area_info.census_count} persons &lt;/b&gt; within the area&lt;/li&gt;
655 &lt;/ul&gt;
656 &lt;/div&gt;
657 {#if area_info.trees_per_type.length &gt; 0}
658 &lt;div class=&quot;text-center&quot;&gt;
659 &lt;h2 class=&quot;m-1 bg-dark text-white&quot;&gt;Tree count per type&lt;/h2&gt;
660 &lt;/div&gt;
661 &lt;ul class=&quot;list-group&quot;&gt;
662 {#each area_info.trees_per_type as [type, count]}
663 &lt;li class=&quot;list-group-item&quot;&gt;{type} &lt;span class=&quot;badge badge-dark float-right&quot;&gt;{count}&lt;/span&gt;&lt;/li&gt;
664 {/each}
665 &lt;/ul&gt;
666 {/if}
667 &lt;/div&gt;
668 {:catch error}
669 &lt;p&gt;Failed to fetch area information: {error.message}&lt;/p&gt;
670 {/await}
671 {/if}
672 &lt;/div&gt;
673 &lt;/div&gt;
674 &lt;/div&gt;
675&lt;/div&gt;
676</code></pre>
677<ul>
678<li>We import the <code>Map</code> component and REST function so we can use them.</li>
<li>We define a listener for the events that the <code>Map</code> produces. Each such event triggers a REST call to the server, and we keep the resulting promise around to use it later in the markup.</li>
680<li>We’re using Bootstrap for the layout because it’s a lot easier. In the body we add our <code>Map</code> and another column to show the selection information.</li>
681<li>We make use of Svelte’s <code>{#await}</code> to nicely notify the user when the call is being made, when it was successful, and when it failed. If it’s successful, we display the info.</li>
682</ul>
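<p>For readers unfamiliar with <code>{#await}</code>, the three branches correspond to the three states a promise can be in. The following plain JavaScript sketch mimics that behaviour; <code>trackPromise</code> is a hypothetical helper written for illustration, and it is not how Svelte implements the block:</p>

```javascript
// Hypothetical helper mirroring the three branches of {#await}: it reports
// 'pending' immediately, then 'fulfilled' or 'rejected' once the promise settles.
function trackPromise(promise, onChange) {
  onChange({ status: 'pending' });
  return promise.then(
    (value) => onChange({ status: 'fulfilled', value }),
    (error) => onChange({ status: 'rejected', error })
  );
}

const states = [];
trackPromise(Promise.resolve({ tree_count: 12 }), (state) => states.push(state.status))
  .then(() => console.log(states)); // [ 'pending', 'fulfilled' ]
```

<p>In the component, Svelte does this bookkeeping for us and re-renders the right branch whenever <code>area_info_promise</code> is reassigned.</p>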
683<h2 id="results"><a class="anchor" href="#results">¶</a>Results</h2>
684<p>Lo and behold, watch our application run!</p>
685<p><video controls="controls" src="sr-2020-04-14_09-28-25.mp4"></video></p>
686<p>In this video you can see our application running, but let’s describe what is happening in more detail.</p>
687<p>When the application starts running (by opening it in your web browser of choice), you can see a map with the town of Cáceres. Then you, the user, can click to retrieve the information within the selected area.</p>
<p>It is important to note that one can make the selection area larger or smaller by scrolling up or down, respectively.</p>
<p>Once an area is selected, it is colored green in order to let the user know which area they have selected. Under the map, the selected coordinates and the radius (in meters) are also shown for the curious. On the right side, the information concerning the selected area is shown, such as the number of trees, the average noise and the number of persons. If there are trees in the area, the application also displays the trees per type, sorted by the number of trees.</p>
690<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
691<p>We hope you enjoyed reading this post as much as we enjoyed writing it! Feel free to download the final project and play around with it. Maybe you can adapt it for even more interesting purposes!</p>
692<p><em>download removed</em></p>
693<p>To run the above code:</p>
694<ol>
695<li>Unzip the downloaded file.</li>
696<li>Make a copy of <code>example-server-config.ini</code> and rename it to <code>server-config.ini</code>, then edit the file to suit your needs.</li>
697<li>Run the server with <code>python -m server</code>.</li>
698<li>Open <a href="http://localhost:9000">localhost:9000</a> in your web browser (or whatever port you chose) and enjoy!</li>
699</ol>
700</main>
701</body>
702</html>
703 </content></entry><entry><title>MongoDB: an Introduction</title><id>dist/mongodb-an-introduction/index.html</id><updated>2020-04-07T22:00:00+00:00</updated><published>2020-03-04T23:00:00+00:00</published><summary>This is the first post in the MongoDB series, where we will introduce the MongoDB database system and take a look at its features and installation methods.</summary><content type="html" src="dist/mongodb-an-introduction/index.html"><!DOCTYPE html>
704<html>
705<head>
706<meta charset="utf-8" />
707<meta name="viewport" content="width=device-width, initial-scale=1" />
708<title>MongoDB: an Introduction</title>
709<link rel="stylesheet" href="../css/style.css">
710</head>
711<body>
712<main>
713<p>This is the first post in the MongoDB series, where we will introduce the MongoDB database system and take a look at its features and installation methods.</p>
714<div class="date-created-modified">Created 2020-03-05<br>
715Modified 2020-04-08</div>
716<p>Other posts in this series:</p>
717<ul>
718<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a> (this post)</li>
719<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a></li>
720<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a></li>
721</ul>
<p>This post is co-authored with Classmate.</p>
723<hr />
724<div class="image-container">
725<img src="mongodb.png" alt="NoSQL database – MongoDB – First delivery" />
726<div class="image-caption"></div>
727</div>
729<h2 class="title" id="purpose_of_technology"><a class="anchor" href="#purpose_of_technology">¶</a>Purpose of technology</h2>
730<p>MongoDB is a <strong>general purpose, document-based, distributed database</strong> built for modern application developers and for the cloud era, with the scalability and flexibility that you want with the querying and indexing that you need. It being a document database means it stores data in JSON-like documents.</p>
731<p>The Mongo team believes this is the most natural way to think about data, which is (they claim) much more expressive and powerful than the traditional row/column model, since programmers think in objects.</p>
732<h2 id="how_it_works"><a class="anchor" href="#how_it_works">¶</a>How it works</h2>
733<p>MongoDB’s architecture can be summarized as follows:</p>
734<ul>
735<li>Document data model.</li>
736<li>Distributed systems design.</li>
737<li>Unified experience with freedom to run it anywhere.</li>
738</ul>
739<p>For a more in-depth explanation, MongoDB offers a <a href="https://www.mongodb.com/collateral/mongodb-architecture-guide">download to the MongoDB Architecture Guide</a> with roughly ten pages worth of text.</p>
<p><img src="knGHenfTGA4kzJb1PHmS9EQvtZl2QlhbIPN15M38m8fZfZf7ODwYfhf0Tltr.png" alt="" />
<em>Overview of MongoDB’s architecture</em></p>
<p>Regarding usage, MongoDB comes with a really nice introduction along with code in JavaScript, Python, Java, C++ or C# (whichever we prefer), which describes the steps necessary to make it work. Below we will describe a common workflow.</p>
<p>First, we must <strong>connect</strong> to a running MongoDB instance. Once the connection succeeds, we can access individual «collections», which we can think of as <em>tables</em> where related documents are stored.</p>
744<p>For instance, we could <strong>insert</strong> an arbitrary JSON document into the <code>restaurants</code> collection to store information about a restaurant.</p>
745<p>At any other point in time, we can <strong>query</strong> these collections. The queries range from trivial, empty ones (which would retrieve all the documents and fields) to more rich and complex queries (for instance, using AND and OR operators, checking if data exists, and then looking for a value in a list).</p>
746<p>MongoDB also supports the creation of <strong>indices</strong>, similar to those in other database systems. It allows for the creation of indices on any field or subfields.</p>
747<p>In Mongo, the <strong>aggregation pipeline</strong> allows us to filter and analyze data based on a given set of criteria. For example, we could pull all the documents in the <code>restaurants</code> collection that have a <code>category</code> of <code>Bakery</code> using the <code>$match</code> operator. Then, we can group them by their star rating using the <code>$group</code> operator. Using the accumulator operator, <code>$sum</code>, we can see how many bakeries in our collection have each star rating.</p>
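<p>To get a feeling for what such a pipeline computes, here is a sketch in plain JavaScript that performs the same two steps on an in-memory array. It does not talk to MongoDB at all; the sample data and the <code>stars</code> field name are assumptions made for illustration. In the shell, the corresponding pipeline would look roughly like <code>db.restaurants.aggregate([{$match: {category: &quot;Bakery&quot;}}, {$group: {_id: &quot;$stars&quot;, count: {$sum: 1}}}])</code>.</p>

```javascript
// Hypothetical sample data; the field names `category` and `stars` are assumptions.
const restaurants = [
  { name: 'A', category: 'Bakery', stars: 4 },
  { name: 'B', category: 'Bakery', stars: 5 },
  { name: 'C', category: 'Pizzeria', stars: 4 },
  { name: 'D', category: 'Bakery', stars: 4 },
];

// Equivalent in spirit to {$match: {category: 'Bakery'}}.
const bakeries = restaurants.filter((doc) => doc.category === 'Bakery');

// Equivalent in spirit to {$group: {_id: '$stars', count: {$sum: 1}}}.
const countPerRating = new Map();
for (const doc of bakeries) {
  countPerRating.set(doc.stars, (countPerRating.get(doc.stars) || 0) + 1);
}

console.log([...countPerRating]); // [ [ 4, 2 ], [ 5, 1 ] ]
```

<p>The difference, of course, is that Mongo runs these stages inside the database, so the data does not have to be pulled into the application first.</p>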
748<h2 id="features"><a class="anchor" href="#features">¶</a>Features</h2>
<p>The features are advertised all over their site, since they put a lot of emphasis on them:</p>
750<ul>
751<li>
752<p><strong>Easy development</strong>, thanks to the document data model, something they claim to be «the best way to work with data».</p>
753</li>
754<li>
755<p>Data is stored in flexible JSON-like documents.</p>
756</li>
757<li>
758<p>This model directly maps to the objects in the application’s code.</p>
759</li>
760<li>
761<p>Ad hoc queries, indexing, and real time aggregation provide powerful ways to access and analyze the data.</p>
762</li>
763<li>
<p><strong>Powerful query language</strong>, rich and expressive, which allows filtering and sorting by any field, no matter how nested it may be within a document. The queries are themselves JSON, and thus easily composable.</p>
765</li>
766<li>
767<p><strong>Support for aggregations</strong> and other modern use-cases such as geo-based search, graph search, and text search.</p>
768</li>
769<li>
770<p><strong>A distributed systems design</strong>, which allows developers to intelligently put data where they want it. High availability, horizontal scaling, and geographic distribution are built in and easy to use.</p>
771</li>
772<li>
773<p><strong>A unified experience</strong> with the freedom to run anywhere, which allows developers to future-proof their work and eliminate vendor lock-in.</p>
774</li>
775</ul>
776<h2 id="corner_in_cap_theorem"><a class="anchor" href="#corner_in_cap_theorem">¶</a>Corner in CAP theorem</h2>
777<p>MongoDB’s position in the CAP theorem (Consistency, Availability, Partition Tolerance) depends on the database and driver configurations, and the type of disaster.</p>
778<ul>
779<li>With <strong>no partitions</strong>, the main focus is <strong>CA</strong>.</li>
<li>If there are <strong>partitions</strong> but the system is <strong>strongly connected</strong>, the main focus is <strong>AP</strong>: non-synchronized writes from the old primary are ignored.</li>
<li>If there are <strong>partitions</strong> but the system is <strong>not strongly connected</strong>, the main focus is <strong>CP</strong>: only read access is provided to avoid inconsistencies.</li>
</ul>
<p>The general consensus seems to be that Mongo is <strong>CP</strong>.</p>
784<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
785<p>We will be using the apt-based installation.</p>
<p>The Community version can be downloaded by anyone through the <a href="https://www.mongodb.com/download-center/community">MongoDB Download Center</a>, where one can choose the version, operating system and package. MongoDB also seems to be <a href="https://packages.ubuntu.com/eoan/mongodb">available in Ubuntu’s package repositories</a>.</p>
787<h2 id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
788<p>We will be using an Ubuntu-based system, with apt available. To install MongoDB, we open a terminal and run the following command:</p>
789<pre><code>apt install mongodb
790</code></pre>
791<p>After confirming that we do indeed want to install the package, we should be able to run the following command to verify that the installation was successful:</p>
792<pre><code>mongod --version
793</code></pre>
794<p>The output should be similar to the following: </p>
795<pre><code>db version v4.0.16
796git version: 2a5433168a53044cb6b4fa8083e4cfd7ba142221
797OpenSSL version: OpenSSL 1.1.1 11 Sep 2018
798allocator: tcmalloc
799modules: none
800build environment:
801 distmod: ubuntu1804
802 distarch: x86_64
803 target_arch: x86_64
804</code></pre>
805<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
806<ul>
807<li><a href="https://www.mongodb.com/">MongoDB’s official site</a></li>
808<li><a href="https://www.mongodb.com/what-is-mongodb">What is MongoDB?</a></li>
809<li><a href="https://www.mongodb.com/mongodb-architecture">MongoDB Architecture</a></li>
810<li><a href="https://stackoverflow.com/q/11292215/4759433">Where does mongodb stand in the CAP theorem?</a></li>
811<li><a href="https://medium.com/@bikas.katwal10/mongodb-vs-cassandra-vs-rdbms-where-do-they-stand-in-the-cap-theorem-1bae779a7a15">What is the CAP Theorem? MongoDB vs Cassandra vs RDBMS, where do they stand in the CAP theorem?</a></li>
812<li><a href="https://www.quora.com/Why-doesnt-MongoDB-have-availability-in-the-CAP-theorem">Why doesn’t MongoDB have availability in the CAP theorem?</a></li>
813<li><a href="https://docs.mongodb.com/manual/installation/">Install MongoDB</a></li>
814</ul>
815</main>
816</body>
817</html>
818 </content></entry><entry><title>MongoDB: Basic Operations and Architecture</title><id>dist/mongodb-basic-operations-and-architecture/index.html</id><updated>2020-04-07T22:00:00+00:00</updated><published>2020-03-04T23:00:00+00:00</published><summary>This is the second post in the MongoDB series, where we will take a look at the </summary><content type="html" src="dist/mongodb-basic-operations-and-architecture/index.html"><!DOCTYPE html>
819<html>
820<head>
821<meta charset="utf-8" />
822<meta name="viewport" content="width=device-width, initial-scale=1" />
823<title>MongoDB: Basic Operations and Architecture</title>
824<link rel="stylesheet" href="../css/style.css">
825</head>
826<body>
827<main>
828<p>This is the second post in the MongoDB series, where we will take a look at the <a href="https://stackify.com/what-are-crud-operations/">CRUD operations</a> they support, the data model and architecture used.</p>
829<div class="date-created-modified">Created 2020-03-05<br>
830Modified 2020-04-08</div>
831<p>Other posts in this series:</p>
832<ul>
833<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a></li>
834<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a> (this post)</li>
835<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a></li>
836</ul>
<p>This post is co-authored with Classmate, and in it we will take an exploratory approach using the <code>mongo</code> command line shell to execute commands against the database. It even has TAB auto-completion, which is awesome!</p>
838<hr />
839<p>Before creating any documents, we first need to create somewhere for the documents to be in. And before we create anything, the database has to be running, so let’s do that first. If we don’t have a service installed, we can run the <code>mongod</code> command ourselves in some local folder to make things easier:</p>
840<pre><code>$ mkdir -p mongo-database
841$ mongod --dbpath mongo-database
842</code></pre>
843<p>Just like that, we will have Mongo running. Now, let’s connect to it using the <code>mongo</code> command in another terminal (don’t close the terminal where the server is running, we need it!). By default, it connects to localhost, which is just what we need.</p>
844<pre><code>$ mongo
845</code></pre>
846<h2 class="title" id="create"><a class="anchor" href="#create">¶</a>Create</h2>
847<h3 id="create_a_database"><a class="anchor" href="#create_a_database">¶</a>Create a database</h3>
848<p>Let’s list the databases:</p>
849<pre><code>&gt; show databases
850admin 0.000GB
851config 0.000GB
852local 0.000GB
853</code></pre>
<p>Oh, how interesting! There are already some databases, even though we just created the directory where Mongo will store everything. However, they seem empty, which makes sense.</p>
855<p>Creating a new database is done by <code>use</code>-ing a name that doesn’t exist. Let’s call our new database «helloworld».</p>
856<pre><code>&gt; use helloworld
857switched to db helloworld
858</code></pre>
859<p>Good! Now the «local variable» called <code>db</code> points to our <code>helloworld</code> database.</p>
860<pre><code>&gt; db
861helloworld
862</code></pre>
863<p>What happens if we print the databases again? Surely our new database will show up now…</p>
864<pre><code>&gt; show databases
865admin 0.000GB
866config 0.000GB
867local 0.000GB
868</code></pre>
869<p>…maybe not! It seems Mongo won’t create the database until we create some collections and documents in it. Databases contain collections, and inside collections (which you can think of as tables) we can insert new documents (which you can think of as rows). Like in many programming languages, the dot operator is used to access these «members».</p>
870<h3 id="create_a_document"><a class="anchor" href="#create_a_document">¶</a>Create a document</h3>
871<p>Let’s add a new greeting into the <code>greetings</code> collection:</p>
872<pre><code>&gt; db.greetings.insert({message: &quot;¡Bienvenido!&quot;, lang: &quot;es&quot;})
873WriteResult({ &quot;nInserted&quot; : 1 })
874
875&gt; show collections
876greetings
877
878&gt; show databases
879admin 0.000GB
880config 0.000GB
881helloworld 0.000GB
882local 0.000GB
883</code></pre>
<p>That looks promising! Our new <code>helloworld</code> database now shows up too. The Mongo shell is actually a JavaScript interpreter, which is why we can write documents as JavaScript object literals (note the lack of quotes around the keys, convenient!). Mongo then stores them as BSON, a binary variant of JSON.</p>
885<p>The <a href="https://docs.mongodb.com/manual/reference/method/db.collection.insert/index.html"><code>insert</code></a> method actually supports a list of documents, and by default Mongo will assign a unique identifier to each. If we don’t want that though, all we have to do is add the <code>_id</code> key to our documents.</p>
886<pre><code>&gt; db.greetings.insert([
887... {message: &quot;Welcome!&quot;, lang: &quot;en&quot;},
888... {message: &quot;Bonjour!&quot;, lang: &quot;fr&quot;},
889... ])
890BulkWriteResult({
891 &quot;writeErrors&quot; : [ ],
892 &quot;writeConcernErrors&quot; : [ ],
893 &quot;nInserted&quot; : 2,
894 &quot;nUpserted&quot; : 0,
895 &quot;nMatched&quot; : 0,
896 &quot;nModified&quot; : 0,
897 &quot;nRemoved&quot; : 0,
898 &quot;upserted&quot; : [ ]
899})
900</code></pre>
901<h3 id="create_a_collection"><a class="anchor" href="#create_a_collection">¶</a>Create a collection</h3>
902<p>In this example, we created the collection <code>greetings</code> implicitly, but behind the scenes Mongo made a call to <a href="https://docs.mongodb.com/manual/reference/method/db.createCollection/"><code>createCollection</code></a>. Let’s do just that:</p>
903<pre><code>&gt; db.createCollection(&quot;goodbyes&quot;)
904{ &quot;ok&quot; : 1 }
905
906&gt; show collections
907goodbyes
908greetings
909</code></pre>
<p>The method actually accepts an optional second parameter to configure other options, like the maximum size of the collection or maximum amount of documents in it, validation-related options, and so on. These are all described in more detail in the documentation.</p>
911<h2 id="read"><a class="anchor" href="#read">¶</a>Read</h2>
912<p>To read the contents of a document, we have to <a href="https://docs.mongodb.com/manual/reference/method/db.collection.find/index.html"><code>find</code></a> it.</p>
913<pre><code>&gt; db.greetings.find()
914{ &quot;_id&quot; : ObjectId(&quot;5e74829a0659f802b15f18dd&quot;), &quot;message&quot; : &quot;¡Bienvenido!&quot;, &quot;lang&quot; : &quot;es&quot; }
915{ &quot;_id&quot; : ObjectId(&quot;5e7487b90659f802b15f18de&quot;), &quot;message&quot; : &quot;Welcome!&quot;, &quot;lang&quot; : &quot;en&quot; }
916{ &quot;_id&quot; : ObjectId(&quot;5e7487b90659f802b15f18df&quot;), &quot;message&quot; : &quot;Bonjour!&quot;, &quot;lang&quot; : &quot;fr&quot; }
917</code></pre>
918<p>That’s a bit unreadable for my taste, can we make it more <a href="https://docs.mongodb.com/manual/reference/method/cursor.pretty/index.html"><code>pretty</code></a>?</p>
919<pre><code>&gt; db.greetings.find().pretty()
920{
921 &quot;_id&quot; : ObjectId(&quot;5e74829a0659f802b15f18dd&quot;),
922 &quot;message&quot; : &quot;¡Bienvenido!&quot;,
923 &quot;lang&quot; : &quot;es&quot;
924}
925{
926 &quot;_id&quot; : ObjectId(&quot;5e7487b90659f802b15f18de&quot;),
927 &quot;message&quot; : &quot;Welcome!&quot;,
928 &quot;lang&quot; : &quot;en&quot;
929}
930{
931 &quot;_id&quot; : ObjectId(&quot;5e7487b90659f802b15f18df&quot;),
932 &quot;message&quot; : &quot;Bonjour!&quot;,
933 &quot;lang&quot; : &quot;fr&quot;
934}
935</code></pre>
<p>Gorgeous! We can clearly see Mongo created an identifier for us automatically. The queries are also JSON, and support a bunch of operators (prefixed by <code>$</code>), known as <a href="https://docs.mongodb.com/manual/reference/operator/query/">Query Selectors</a>. Here are a few:</p>
937<table>
938 <thead>
939 <tr>
940 <th>
941 Operation
942 </th>
943 <th>
944 Syntax
945 </th>
946 <th>
947 RDBMS equivalent
948 </th>
949 </tr>
950 </thead>
951 <tbody>
952 <tr>
953 <td>
954 Equals
955 </td>
956 <td>
957 <code>
958 {key: {$eq: value}}
959 </code>
960 <br/>
961 Shorthand:
962 <code>
963 {key: value}
964 </code>
965 </td>
966 <td>
967 <code>
968 where key = value
969 </code>
970 </td>
971 </tr>
972 <tr>
973 <td>
974 Less Than
975 </td>
976 <td>
977 <code>
    {key: {$lt: value}}
979 </code>
980 </td>
981 <td>
982 <code>
983 where key &lt; value
984 </code>
985 </td>
986 </tr>
987 <tr>
988 <td>
989 Less Than or Equal
990 </td>
991 <td>
992 <code>
    {key: {$lte: value}}
994 </code>
995 </td>
996 <td>
997 <code>
998 where key &lt;= value
999 </code>
1000 </td>
1001 </tr>
1002 <tr>
1003 <td>
1004 Greater Than
1005 </td>
1006 <td>
1007 <code>
1008 {key: {$gt: value}}
1009 </code>
1010 </td>
1011 <td>
1012 <code>
1013 where key &gt; value
1014 </code>
1015 </td>
1016 </tr>
1017 <tr>
1018 <td>
1019 Greater Than or Equal
1020 </td>
1021 <td>
1022 <code>
1023 {key: {$gte: value}}
1024 </code>
1025 </td>
1026 <td>
1027 <code>
1028 where key &gt;= value
1029 </code>
1030 </td>
1031 </tr>
1032 <tr>
1033 <td>
1034 Not Equal
1035 </td>
1036 <td>
1037 <code>
1038 {key: {$ne: value}}
1039 </code>
1040 </td>
1041 <td>
1042 <code>
1043 where key != value
1044 </code>
1045 </td>
1046 </tr>
1047 <tr>
1048 <td>
1049 And
1050 </td>
1051 <td>
1052 <code>
1053 {$and: [{k1: v1}, {k2: v2}]}
1054 </code>
1055 </td>
1056 <td>
1057 <code>
1058 where k1 = v1 and k2 = v2
1059 </code>
1060 </td>
1061 </tr>
1062 <tr>
1063 <td>
1064 Or
1065 </td>
1066 <td>
1067 <code>
1068 {$or: [{k1: v1}, {k2: v2}]}
1069 </code>
1070 </td>
1071 <td>
1072 <code>
1073 where k1 = v1 or k2 = v2
1074 </code>
1075 </td>
1076 </tr>
1077 </tbody>
1078</table>
<p>The operations all do what you would expect them to do, and their names are quite intuitive. Conditions can be combined with <code>$and</code> or <code>$or</code> anywhere in the query, nested any number of levels deep.</p>
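<p>If the semantics of the selectors are still unclear, the following sketch may help. It is a heavily simplified, hypothetical matcher written in plain JavaScript (the names <code>COMPARISONS</code> and <code>matches</code> are ours, not Mongo’s), which only supports flat keys with primitive values, the comparison operators listed in the table, and <code>$and</code> / <code>$or</code>:</p>

```javascript
// Comparison operators as plain predicates; $lt is strict, $lte also accepts equality.
const COMPARISONS = {
  $eq: (a, b) => a === b,
  $ne: (a, b) => a !== b,
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
};

// Hypothetical, simplified matcher; unknown operators are not handled.
function matches(doc, query) {
  return Object.entries(query).every(([key, cond]) => {
    if (key === '$and') return cond.every((sub) => matches(doc, sub));
    if (key === '$or') return cond.some((sub) => matches(doc, sub));
    if (cond !== null && typeof cond === 'object') {
      // Operator form, e.g. {key: {$lt: value}}; every listed operator must hold.
      return Object.entries(cond).every(([op, value]) => COMPARISONS[op](doc[key], value));
    }
    // Shorthand form, e.g. {key: value}.
    return doc[key] === cond;
  });
}

const greetings = [
  { message: '¡Bienvenido!', lang: 'es' },
  { message: 'Welcome!', lang: 'en' },
  { message: 'Bonjour!', lang: 'fr' },
];

const query = { $or: [{ lang: 'es' }, { lang: { $eq: 'fr' } }] };
console.log(greetings.filter((doc) => matches(doc, query)).map((doc) => doc.lang)); // [ 'es', 'fr' ]
```

<p>Because <code>$and</code> and <code>$or</code> simply recurse into their sub-queries, nesting them to any depth comes for free.</p>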
1080<h2 id="update"><a class="anchor" href="#update">¶</a>Update</h2>
1081<p>Updating a document can be done by using <a href="https://docs.mongodb.com/manual/reference/method/db.collection.save/index.html"><code>save</code></a> on an already-existing document (that is, the document we want to save has <code>_id</code> and it’s in the collection already). If the document is not in the collection yet, this method will create it.</p>
1082<pre><code>&gt; db.greetings.save({_id: ObjectId(&quot;5e74829a0659f802b15f18dd&quot;), message: &quot;¡Bienvenido, humano!&quot;, &quot;lang&quot; : &quot;es&quot;})
1083WriteResult({ &quot;nMatched&quot; : 1, &quot;nUpserted&quot; : 0, &quot;nModified&quot; : 1 })
1084
1085&gt; db.greetings.find({lang: &quot;es&quot;})
1086{ &quot;_id&quot; : ObjectId(&quot;5e74829a0659f802b15f18dd&quot;), &quot;message&quot; : &quot;¡Bienvenido, humano!&quot;, &quot;lang&quot; : &quot;es&quot; }
1087</code></pre>
<p>Alternatively, the <a href="https://docs.mongodb.com/manual/reference/method/db.collection.update/index.html"><code>update</code></a> method takes a query and an update document. Here, <code>$set</code> changes only the listed fields of the matching document:</p>
1089<pre><code>&gt; db.greetings.update({lang: &quot;en&quot;}, {$set: {message: &quot;Welcome, human!&quot;}})
1090WriteResult({ &quot;nMatched&quot; : 1, &quot;nUpserted&quot; : 0, &quot;nModified&quot; : 1 })
1091
1092&gt; db.greetings.find({lang: &quot;en&quot;})
1093{ &quot;_id&quot; : ObjectId(&quot;5e7487b90659f802b15f18de&quot;), &quot;message&quot; : &quot;Welcome, human!&quot;, &quot;lang&quot; : &quot;en&quot; }
1094</code></pre>
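<p>The practical difference between the two approaches is that saving a document replaces it as a whole, whereas <code>$set</code> only touches the fields we list. The sketch below illustrates this on a plain in-memory array; <code>replaceDoc</code> and <code>setFields</code> are hypothetical helpers written for this post, not Mongo’s implementation:</p>

```javascript
// Mimics the effect of saving a whole document over an existing `_id`.
function replaceDoc(collection, doc) {
  const index = collection.findIndex((d) => d._id === doc._id);
  if (index === -1) {
    collection.push(doc); // not found: insert it instead
  } else {
    collection[index] = doc; // found: the stored document is replaced entirely
  }
}

// Mimics the effect of {$set: fields} on the first matching document.
function setFields(collection, predicate, fields) {
  const doc = collection.find(predicate);
  if (doc !== undefined) {
    Object.assign(doc, fields); // only the listed fields change
  }
}

const greetings = [
  { _id: 1, message: '¡Bienvenido!', lang: 'es' },
  { _id: 2, message: 'Welcome!', lang: 'en' },
];

// `lang` is omitted on purpose: after the replacement, the field is gone.
replaceDoc(greetings, { _id: 1, message: '¡Bienvenido, humano!' });
setFields(greetings, (d) => d.lang === 'en', { message: 'Welcome, human!' });

console.log(greetings);
```

<p>This is why our <code>save</code> call above had to repeat the <code>lang</code> key, while the <code>update</code> call did not.</p>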
1095<h2 id="indexing"><a class="anchor" href="#indexing">¶</a>Indexing</h2>
1096<p>Creating an index is done with <a href="https://docs.mongodb.com/manual/reference/method/db.collection.createIndex/index.html"><code>createIndex</code></a>:</p>
1097<pre><code>&gt; db.greetings.createIndex({lang: +1})
1098{
1099 &quot;createdCollectionAutomatically&quot; : false,
1100 &quot;numIndexesBefore&quot; : 1,
1101 &quot;numIndexesAfter&quot; : 2,
1102 &quot;ok&quot; : 1
1103}
1104</code></pre>
<p>Here, we create an ascending index on the <code>lang</code> key. A descending index is created with <code>-1</code> instead. Now a query for <code>lang</code> in our three documents will be fast… well, with only three documents, a plain scan was probably just as fast.</p>
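<p>To see why an ordered index helps once collections grow, consider this toy sketch: it keeps the keys sorted and finds matches with a binary search instead of scanning every document. Mongo’s real indexes are B-trees; the sorted array and the names <code>buildIndex</code> and <code>lookup</code> are simplifications of ours:</p>

```javascript
// Toy "ascending index": an array of [key, position] pairs sorted by key.
function buildIndex(docs, field) {
  return docs
    .map((doc, position) => [doc[field], position])
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Binary search for the first entry whose key is >= `key`, then collect the equal keys.
function lookup(index, docs, key) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid][0] < key) lo = mid + 1;
    else hi = mid;
  }
  const found = [];
  for (let i = lo; i < index.length && index[i][0] === key; i++) {
    found.push(docs[index[i][1]]);
  }
  return found;
}

const greetings = [
  { message: '¡Bienvenido!', lang: 'es' },
  { message: 'Welcome!', lang: 'en' },
  { message: 'Bonjour!', lang: 'fr' },
];
const langIndex = buildIndex(greetings, 'lang');
console.log(lookup(langIndex, greetings, 'fr')); // [ { message: 'Bonjour!', lang: 'fr' } ]
```

<p>The lookup needs a number of comparisons that grows with the logarithm of the collection size, rather than with the size itself.</p>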
1106<h2 id="delete"><a class="anchor" href="#delete">¶</a>Delete</h2>
1107<h3 id="delete_a_document"><a class="anchor" href="#delete_a_document">¶</a>Delete a document</h3>
<p>I have to confess, I can’t speak French. I learnt it long ago and it’s long forgotten, so let’s remove the translation I copied online from our greetings with <a href="https://docs.mongodb.com/manual/reference/method/db.collection.remove/index.html"><code>remove</code></a>.</p>
1109<pre><code>&gt; db.greetings.remove({lang: &quot;fr&quot;})
1110WriteResult({ &quot;nRemoved&quot; : 1 })
1111</code></pre>
1112<h3 id="delete_a_collection"><a class="anchor" href="#delete_a_collection">¶</a>Delete a collection</h3>
1113<p>We never really used the <code>goodbyes</code> collection. Can we get rid of that?</p>
1114<pre><code>&gt; db.goodbyes.drop()
1115true
1116</code></pre>
1117<p>Yes, it is <code>true</code> that we can <a href="https://docs.mongodb.com/manual/reference/method/db.collection.drop/index.html"><code>drop</code></a> it.</p>
1118<h3 id="delete_a_database"><a class="anchor" href="#delete_a_database">¶</a>Delete a database</h3>
<p>Now, I will be honest, I don’t really like our <code>helloworld</code> database either. It stinks. Let’s get rid of it as well:</p>
1120<pre><code>&gt; db.dropDatabase()
1121{ &quot;dropped&quot; : &quot;helloworld&quot;, &quot;ok&quot; : 1 }
1122</code></pre>
<p>Yeah, take that! The <a href="https://docs.mongodb.com/manual/reference/method/db.dropDatabase/"><code>dropDatabase</code></a> method drops the database we are currently using.</p>
1124<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<p>The examples in this post are all fictional, and the methods used were taken from Classmate’s post and, of course, <a href="https://docs.mongodb.com/manual/reference/method/">Mongo’s documentation</a>.</p>
1126</main>
1127</body>
1128</html>
1129 </content></entry><entry><title>Introduction to Hadoop and its MapReduce</title><id>dist/introduction-to-hadoop-and-its-mapreduce/index.html</id><updated>2020-04-02T22:00:00+00:00</updated><published>2020-03-31T22:00:00+00:00</published><summary>Hadoop is an open-source, free, Java-based programming framework that helps processing large datasets in a distributed environment and the problems that arise when trying to harness the knowledge from BigData, capable of running on thousands of nodes and dealing with petabytes of data. It is based on Google File System (GFS) and originated from the work on the Nutch open-source project on search engines.</summary><content type="html" src="dist/introduction-to-hadoop-and-its-mapreduce/index.html"><!DOCTYPE html>
1130<html>
1131<head>
1132<meta charset="utf-8" />
1133<meta name="viewport" content="width=device-width, initial-scale=1" />
1134<title>Introduction to Hadoop and its MapReduce</title>
1135<link rel="stylesheet" href="../css/style.css">
1136</head>
1137<body>
1138<main>
<p>Hadoop is an open-source, free, Java-based programming framework that helps process large datasets in a distributed environment and tackle the problems that arise when trying to harness the knowledge from Big Data. It is capable of running on thousands of nodes and dealing with petabytes of data. It is based on the Google File System (GFS) and originated from the work on Nutch, an open-source search engine project.</p>
1140<div class="date-created-modified">Created 2020-04-01<br>
1141Modified 2020-04-03</div>
<p>Hadoop also offers a distributed filesystem (HDFS) enabling fast data transfer among nodes, and a way to program with MapReduce.</p>
<p>It aims to address the 4 V’s of Big Data: Volume, Variety, Veracity and Velocity. For veracity, it is a secure environment that can be trusted.</p>
1144<h2 class="title" id="milestones"><a class="anchor" href="#milestones">¶</a>Milestones</h2>
<p>The creators of Hadoop are Doug Cutting and Mike Cafarella, who just wanted to design a search engine, Nutch, and quickly ran into the problems of dealing with large amounts of data. They found their solution in the papers Google published.</p>
<p>The name comes from a plush toy that belonged to Cutting’s child: a yellow elephant.</p>
1147<ul>
1148<li>In July 2005, Nutch used GFS to perform MapReduce operations.</li>
1149<li>In February 2006, Nutch started a Lucene subproject which led to Hadoop.</li>
1150<li>In April 2007, Yahoo used Hadoop in a 1 000-node cluster.</li>
1151<li>In January 2008, Apache took over and made Hadoop a top-level project.</li>
1152<li>In July 2008, Apache tested a 4000-node cluster. The performance was the fastest compared to other technologies that year.</li>
1153<li>In May 2009, Hadoop sorted a petabyte of data in 17 hours.</li>
1154<li>In December 2011, Hadoop reached 1.0.</li>
<li>In May 2012, Hadoop 2.0 was released with the addition of YARN (Yet Another Resource Negotiator) on top of HDFS, splitting MapReduce and other processes into separate components, greatly improving fault tolerance.</li>
1156</ul>
<p>From here onwards, many other projects have been born around the Hadoop ecosystem, like Spark, Hive &amp; Drill, Kafka and HBase.</p>
<p>As of 2017, Amazon has clusters of between 1 and 100 nodes, Yahoo has over 100 000 CPUs running Hadoop, AOL has clusters with 50 machines, and Facebook has a 320-machine cluster (2 560 cores) with 1.3 PB of raw storage.</p>
1159<h2 id="why_not_use_rdbms_"><a class="anchor" href="#why_not_use_rdbms_">¶</a>Why not use RDBMS?</h2>
<p>Relational database management systems are hard to scale horizontally, and vertical scaling requires very expensive servers. Similar to RDBMS, Hadoop has a notion of jobs (analogous to transactions), but without ACID or concurrency control. Hadoop supports any form of data (unstructured or semi-structured) in read-only mode, and although failures are common, it has a simple yet efficient fault tolerance mechanism.</p>
<p>So what problems does Hadoop solve? It changes the way we think about problems and how to distribute them, which is key to doing anything related to Big Data nowadays. We start working with clusters of nodes, and coordinating the jobs between them. Hadoop’s API makes this really easy.</p>
<p>Hadoop also takes data loss very seriously and guards against it with replication: if a node fails, its blocks are replicated again on a different node.</p>
1163<h2 id="major_components"><a class="anchor" href="#major_components">¶</a>Major components</h2>
<p>The previously-mentioned HDFS runs on commodity machines, which are cost-friendly. It is very fault-tolerant and efficient enough to process huge amounts of data, because it splits large files into smaller chunks (or blocks) that can be more easily handled. Multiple nodes can work on multiple chunks at the same time.</p>
<p>NameNode stores the metadata of the various data blocks (a map of blocks) along with their locations, also known as the namespace. It is the brain and the master in Hadoop’s master-slave architecture, and relies on the DataNodes to store the actual data.</p>
<p>A secondary NameNode keeps periodic checkpoints of the NameNode’s metadata, which can be used to restart Hadoop if the primary NameNode dies. It is a helper rather than a hot standby.</p>
<p>DataNodes store the blocks of data, and are the slaves in the architecture. Each file is split into one or more of these blocks. Their only job is to manage access to this data. They are often distributed among racks to avoid data loss.</p>
1168<p>JobTracker creates and schedules jobs from the clients for either map or reduce operations.</p>
1169<p>TaskTracker runs MapReduce tasks assigned to the current data node.</p>
<p>When clients need data, they first interact with the NameNode, which replies with the location of the data in the correct DataNode. The client then proceeds to interact with that DataNode directly.</p>
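<p>The read path described above can be sketched in a few lines of Python. This is a toy model rather than the real HDFS protocol: the dictionaries <code>namenode</code> and <code>datanodes</code>, the block identifiers and the <code>read_file</code> helper are all hypothetical names chosen for illustration.</p>

```python
# Hypothetical metadata held by the NameNode: file path -> list of (block id, DataNode).
namenode = {"/logs/day1.txt": [("blk_1", "datanode-2"), ("blk_2", "datanode-5")]}

# Hypothetical DataNodes, each holding the raw bytes of some blocks.
datanodes = {
    "datanode-2": {"blk_1": b"first half, "},
    "datanode-5": {"blk_2": b"second half"},
}

def read_file(path):
    """Ask the NameNode where the blocks live, then read them from the DataNodes."""
    locations = namenode[path]  # step 1: metadata lookup only, no file contents
    # step 2: the actual bytes come from the DataNodes, block by block
    return b"".join(datanodes[node][block] for block, node in locations)

print(read_file("/logs/day1.txt"))
```

<p>Note how the NameNode never handles the file contents itself; it only answers the question of where the blocks are.</p>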
1171<h2 id="mapreduce"><a class="anchor" href="#mapreduce">¶</a>MapReduce</h2>
1172<p>MapReduce, as the name implies, is split into two steps: the map and the reduce. The map stage is the «divide and conquer» strategy, while the reduce part is about combining and reducing the results.</p>
1173<p>The mapper has to process the input data (normally a file or directory), commonly line-by-line, and produce one or more outputs. The reducer uses all the results from the mapper as its input to produce a new output file itself.</p>
1174<p><img src="bitmap.png" alt="" /></p>
<p>When reading the data, some may be junk that we can choose to ignore. If it is valid data, however, we label it with a particular type that can be useful for the upcoming process. Hadoop is responsible for splitting the data across the many nodes available to execute this process in parallel.</p>
<p>There is another part to MapReduce, known as the Shuffle-and-Sort. In this part, types or categories from one node get moved to a different node. This happens with all nodes, so that every node can work on a complete category. These categories are known as «keys», and they allow Hadoop to scale linearly.</p>
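<p>To make the three parts concrete, here is a small single-process sketch in Python that counts words. It only mimics the data flow of MapReduce and does not use Hadoop at all; the function names <code>map_phase</code>, <code>shuffle_and_sort</code> and <code>reduce_phase</code> and the input lines are made up for this example.</p>

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle-and-sort: group all values that share a key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "The end"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))
print(counts)
```

<p>In a real cluster, each call to the map function could run on a different node, and the shuffle would move every key to the node in charge of reducing it.</p>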
1177<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
1178<ul>
1179<li><a href="https://youtu.be/oT7kczq5A-0">YouTube – Hadoop Tutorial For Beginners | What Is Hadoop? | Hadoop Tutorial | Hadoop Training | Simplilearn</a></li>
1180<li><a href="https://youtu.be/bcjSe0xCHbE">YouTube – Learn MapReduce with Playing Cards</a></li>
1181<li><a href="https://youtu.be/j8ehT1_G5AY?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">YouTube – Video Post #2: Hadoop para torpes (I)-¿Qué es y para qué sirve?</a></li>
<li><a href="https://youtu.be/NQ8mjVPCDvk?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">YouTube – Video Post #3: Hadoop para torpes (II)-¿Cómo funciona? HDFS y MapReduce</a></li>
<li><a href="https://hadoop.apache.org/old/releases.html">Apache Hadoop Releases</a></li>
<li><a href="https://youtu.be/20qWx2KYqYg?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">YouTube – Video Post #4: Hadoop para torpes (III y fin)- Ecosistema y distribuciones</a></li>
<li><a href="http://www.hadoopbook.com/">Chapter 2 – Hadoop: The Definitive Guide, Fourth Edition</a> (<a href="http://grut-computing.com/HadoopBook.pdf">pdf</a>, <a href="http://www.hadoopbook.com/code.html">code</a>)</li>
1186</ul>
1187</main>
1188</body>
1189</html>
1190 </content></entry><entry><title>Google’s BigTable</title><id>dist/googles-bigtable/index.html</id><updated>2020-04-02T22:00:00+00:00</updated><published>2020-03-31T22:00:00+00:00</published><summary>Let’s talk about BigTable, and why it is what it is. But before we get into that, let’s see some important aspects anybody should consider when dealing with a lot of data (something BigTable does!).</summary><content type="html" src="dist/googles-bigtable/index.html"><!DOCTYPE html>
1191<html>
1192<head>
1193<meta charset="utf-8" />
1194<meta name="viewport" content="width=device-width, initial-scale=1" />
1195<title>Google’s BigTable</title>
1196<link rel="stylesheet" href="../css/style.css">
1197</head>
1198<body>
1199<main>
1200<p>Let’s talk about BigTable, and why it is what it is. But before we get into that, let’s see some important aspects anybody should consider when dealing with a lot of data (something BigTable does!).</p>
1201<div class="date-created-modified">Created 2020-04-01<br>
1202Modified 2020-04-03</div>
1203<h2 class="title" id="the_basics"><a class="anchor" href="#the_basics">¶</a>The basics</h2>
<p>Converting a text document into a different format is often a great way to speed up scanning it in the future, because it allows for efficient searches.</p>
1205<p>In addition, you generally want to store everything in a single, giant file. This will save a lot of time opening and closing files, because everything is in the same file! One proposal to make this happen is <a href="https://trec.nist.gov/file_help.html">Web TREC</a> (see also the <a href="https://en.wikipedia.org/wiki/Text_Retrieval_Conference">Wikipedia page on TREC</a>), which is basically HTML but every document is properly delimited from one another.</p>
1206<p>Because we will have a lot of data, it’s often a good idea to compress it. Most text consists of the same words, over and over again. Classic compression techniques such as <code>DEFLATE</code> or <code>LZW</code> do an excellent job here.</p>
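<p>As a quick illustration of how well repetitive text compresses, the following Python sketch uses the standard <code>zlib</code> module, which implements <code>DEFLATE</code>. The sample sentence is made up and far more repetitive than real web pages, so the ratio it prints is only indicative.</p>

```python
import zlib

# Repetitive text, standing in for a large collection of documents.
text = ("the quick brown fox jumps over the lazy dog " * 200).encode("utf-8")

compressed = zlib.compress(text, 9)     # 9 is the highest compression level
restored = zlib.decompress(compressed)  # decompression gives back the original

print(len(text), "bytes before,", len(compressed), "bytes after")
```

<p>Since the compression is lossless, nothing is lost by storing only the compressed form.</p>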
1207<h2 id="so_what_s_bigtable_"><a class="anchor" href="#so_what_s_bigtable_">¶</a>So what’s BigTable?</h2>
1208<p>Okay, enough of an introduction to the basics on storing data. BigTable is what Google uses to store documents, and it’s a customized approach to save, search and update web pages.</p>
<p>BigTable is a distributed storage system for managing structured data, able to scale to petabytes of data across thousands of commodity servers, with wide applicability, scalability, high performance, and high availability.</p>
1210<p>In a way, it’s kind of like databases and shares many implementation strategies with them, like parallel databases, or main-memory databases, but of course, with a different schema.</p>
1211<p>It consists of a big table known as the «Root tablet», with pointers to many other «tablets» (or metadata in between). These are stored in a replicated filesystem accessible by all BigTable servers. Any change to a tablet gets logged (said log also gets stored in a replicated filesystem).</p>
<p>If any of the tablet servers becomes unavailable, a different one can take its place, read the log and recover from the problem.</p>
<p>There’s no query language, and transactions occur at row-level only: every read or write in a row is atomic. Each row stores a single web page, and by combining the row and column keys along with a timestamp, it is possible to retrieve a single cell in the row. More formally, it’s a map that looks like this:</p>
1214<pre><code>fetch(row: string, column: string, time: int64) -&gt; string
1215</code></pre>
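<p>A toy in-memory version of that map could look like the Python sketch below. It is only meant to show the shape of the data, not how BigTable stores anything; the class <code>TinyTable</code>, its <code>put</code> method, the row key and the column name are all assumptions made for the example.</p>

```python
class TinyTable:
    """Toy in-memory stand-in for the (row, column, time) -> string map."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, time, value):
        """Store one version of a cell."""
        self._cells.setdefault((row, column), {})[time] = value

    def fetch(self, row, column, time=None):
        """Return the value stored at `time`, or the most recent one if omitted."""
        versions = self._cells[(row, column)]
        if time is None:
            time = max(versions)
        return versions[time]

table = TinyTable()
table.put("com.example.www", "contents:", 1, "old html")
table.put("com.example.www", "contents:", 2, "new html")
print(table.fetch("com.example.www", "contents:"))
```

<p>Keeping several timestamped versions per cell is what makes it possible to ask for an older copy of a page, for example with <code>table.fetch("com.example.www", "contents:", 1)</code>.</p>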
<p>A row may have as many columns as it needs. Columns are organised into column groups, which are the same for every row (but the columns themselves may vary); this is important to reduce disk read time.</p>
<p>Rows are split into different tablets based on the row keys, which simplifies determining an appropriate server for them. The keys can be up to 64KB big, although most commonly they range from 10 to 100 bytes.</p>
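<p>Because tablets cover ranges of row keys, finding the right one can be done with a binary search over the range boundaries. The Python sketch below assumes a made-up layout of four tablets and server names; the <code>locate</code> helper is hypothetical and only shows the idea.</p>

```python
import bisect

# Hypothetical layout: each tablet holds the rows up to and including its end key.
tablet_end_keys = ["g", "n", "t", "\uffff"]
tablet_servers = ["server-a", "server-b", "server-c", "server-d"]

def locate(row_key):
    """Return the server of the first tablet whose end key is not below row_key."""
    position = bisect.bisect_left(tablet_end_keys, row_key)
    return tablet_servers[position]

print(locate("com.example.www"))
```

<p>Rows with nearby keys end up in the same tablet, so scanning a range of keys usually only involves a handful of servers.</p>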
1218<h2 id="conclusions"><a class="anchor" href="#conclusions">¶</a>Conclusions</h2>
1219<p>BigTable is Google’s way to deal with large amounts of data on many of their services, and the ideas behind it are not too complex to understand.</p>
1220</main>
1221</body>
1222</html>
1223 </content></entry><entry><title>A practical example with Hadoop</title><id>dist/a-practical-example-with-hadoop/index.html</id><updated>2020-04-02T22:00:00+00:00</updated><published>2020-03-31T22:00:00+00:00</published><summary>In our </summary><content type="html" src="dist/a-practical-example-with-hadoop/index.html"><!DOCTYPE html>
1224<html>
1225<head>
1226<meta charset="utf-8" />
1227<meta name="viewport" content="width=device-width, initial-scale=1" />
1228<title>A practical example with Hadoop</title>
1229<link rel="stylesheet" href="../css/style.css">
1230</head>
1231<body>
1232<main>
1233<p>In our <a href="/blog/ribw/introduction-to-hadoop-and-its-mapreduce/">previous Hadoop post</a>, we learnt what it is, how it originated, and how it works, from a theoretical standpoint. Here we will instead focus on a more practical example with Hadoop.</p>
1234<div class="date-created-modified">Created 2020-04-01<br>
1235Modified 2020-04-03</div>
<p>This post will showcase my own implementation of a word counter for any plain text document that you want to analyze.</p>
1237<h2 class="title" id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
1238<p>Before running any piece of software, its executable code must first be downloaded into our computers so that we can run it. Head over to <a href="http://hadoop.apache.org/releases.html">Apache Hadoop’s releases</a> and download the <a href="https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz">latest binary version</a> at the time of writing (3.2.1).</p>
1239<p>We will be using the <a href="https://linuxmint.com/">Linux Mint</a> distribution because I love its simplicity, although the process shown here should work just fine on any similar Linux distribution such as <a href="https://ubuntu.com/">Ubuntu</a>.</p>
1240<p>Once the archive download is complete, extract it with any tool of your choice (graphical or using the terminal) and execute it. Make sure you have a version of Java installed, such as <a href="https://openjdk.java.net/">OpenJDK</a>.</p>
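<p>Here are all three steps in the command line (downloading from the Apache archive, because the <code>closer.cgi</code> link is a mirror selection page and not the archive itself):</p>
<pre><code>wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz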
1243tar xf hadoop-3.2.1.tar.gz
1244hadoop-3.2.1/bin/hadoop version
1245</code></pre>
1246<h2 id="processing_data"><a class="anchor" href="#processing_data">¶</a>Processing data</h2>
<p>To take advantage of Hadoop, we have to design our code to work in the MapReduce model. Both the map and reduce phases work on key-value pairs as input and output, and both have a programmer-defined function.</p>
1248<p>We will use Java, because it’s a dependency that we already have anyway, so might as well.</p>
1249<p>Our map function needs to split each of the lines we receive as input into words, and we will also convert them to lowercase, thus preparing the data for later use (counting words). There won’t be bad records, so we don’t have to worry about that.</p>
1250<p>Copy or reproduce the following code in a file called <code>WordCountMapper.java</code>, using any text editor of your choice:</p>
1251<pre><code>import java.io.IOException;
1252
1253import org.apache.hadoop.io.IntWritable;
1254import org.apache.hadoop.io.LongWritable;
1255import org.apache.hadoop.io.Text;
1256import org.apache.hadoop.mapreduce.Mapper;
1257
1258public class WordCountMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
1259 @Override
1260 public void map(LongWritable key, Text value, Context context)
1261 throws IOException, InterruptedException {
1262 for (String word : value.toString().split(&quot;\\W&quot;)) {
1263 context.write(new Text(word.toLowerCase()), new IntWritable(1));
1264 }
1265 }
1266}
1267</code></pre>
1268<p>Now, let’s create the <code>WordCountReducer.java</code> file. Its job is to reduce the data from multiple values into just one. We do that by summing all the values (our word count so far):</p>
1269<pre><code>import java.io.IOException;
1271
1272import org.apache.hadoop.io.IntWritable;
1273import org.apache.hadoop.io.Text;
1274import org.apache.hadoop.mapreduce.Reducer;
1275
1276public class WordCountReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
1277 @Override
1278 public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context)
1279 throws IOException, InterruptedException {
1280 int count = 0;
1281 for (IntWritable value : values) {
1282 count += value.get();
1283 }
1284 context.write(key, new IntWritable(count));
1285 }
1286}
1287</code></pre>
1288<p>Let’s just take a moment to appreciate how absolutely tiny this code is, and it’s Java! Hadoop’s API is really awesome and lets us write such concise code to achieve what we need.</p>
1289<p>Last, let’s write the <code>main</code> method, or else we won’t be able to run it. In our new file <code>WordCount.java</code>:</p>
1290<pre><code>import org.apache.hadoop.fs.Path;
1291import org.apache.hadoop.io.IntWritable;
1292import org.apache.hadoop.io.Text;
1293import org.apache.hadoop.mapreduce.Job;
1294import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
1295import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
1296
1297public class WordCount {
1298 public static void main(String[] args) throws Exception {
1299 if (args.length != 2) {
1300 System.err.println(&quot;usage: java WordCount &lt;input path&gt; &lt;output path&gt;&quot;);
1301 System.exit(-1);
1302 }
1303
1304 Job job = Job.getInstance();
1305
1306 job.setJobName(&quot;Word count&quot;);
1307 job.setJarByClass(WordCount.class);
1308 job.setMapperClass(WordCountMapper.class);
1309 job.setReducerClass(WordCountReducer.class);
1310 job.setOutputKeyClass(Text.class);
1311 job.setOutputValueClass(IntWritable.class);
1312
1313 FileInputFormat.addInputPath(job, new Path(args[0]));
1314 FileOutputFormat.setOutputPath(job, new Path(args[1]));
1315
1316 boolean result = job.waitForCompletion(true);
1317
1318 System.exit(result ? 0 : 1);
1319 }
1320}
1321</code></pre>
1322<p>And compile by including the required <code>.jar</code> dependencies in Java’s classpath with the <code>-cp</code> switch:</p>
1323<pre><code>javac -cp &quot;hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/mapreduce/*&quot; *.java
1324</code></pre>
1325<p>At last, we can run it (also specifying the dependencies in the classpath, this one’s a mouthful). Let’s run it on the same <code>WordCount.java</code> source file we wrote:</p>
1326<pre><code>java -cp &quot;.:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/mapreduce/*:hadoop-3.2.1/share/hadoop/mapreduce/lib/*:hadoop-3.2.1/share/hadoop/yarn/*:hadoop-3.2.1/share/hadoop/yarn/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*&quot; WordCount WordCount.java results
1327</code></pre>
1328<p>Hooray! We should have a new <code>results/</code> folder along with the following files:</p>
1329<pre><code>$ ls results
1330part-r-00000 _SUCCESS
1331$ cat results/part-r-00000
1332 154
13330 2
13341 3
13352 1
1336addinputpath 1
1337apache 6
1338args 4
1339boolean 1
1340class 6
1341count 1
1342err 1
1343exception 1
1344-snip- (output cut for clarity)
1345</code></pre>
1346<p>It worked! Now this example was obviously tiny, but hopefully enough to demonstrate how to get the basics running on real world data.</p>
1347</main>
1348</body>
1349</html>
1350 </content></entry><entry><title>How does Google’s Search Engine work?</title><id>dist/how-does-googles-search-engine-work/index.html</id><updated>2020-03-27T23:00:00+00:00</updated><published>2020-03-17T23:00:00+00:00</published><summary>The original implementation was written in C/++ for Linux/Solaris.</summary><content type="html" src="dist/how-does-googles-search-engine-work/index.html"><!DOCTYPE html>
1351<html>
1352<head>
1353<meta charset="utf-8" />
1354<meta name="viewport" content="width=device-width, initial-scale=1" />
1355<title>How does Google’s Search Engine work?</title>
1356<link rel="stylesheet" href="../css/style.css">
1357</head>
1358<body>
1359<main>
<p>The original implementation was written in C/C++ for Linux/Solaris.</p>
1361<div class="date-created-modified">Created 2020-03-18<br>
1362Modified 2020-03-28</div>
<p>There are three major components in the system’s anatomy, which can be thought of as the steps to be performed for Google to be what it is today.</p>
1364<p><img src="image-1024x649.png" alt="" /></p>
1365<p>But before we talk about the different components, let’s take a look at how they store all of this information.</p>
1366<h2 class="title" id="data_structures"><a class="anchor" href="#data_structures">¶</a>Data structures</h2>
1367<p>A «BigFile» is a virtual file addressable by 64 bits.</p>
1368<p>There exists a repository with the full HTML of every page compressed, along with a document identifier, length and URL.</p>
1369<table class="">
1370 <tbody>
1371 <tr>
1372 <td>
1373 sync
1374 </td>
1375 <td>
1376 length
1377 </td>
1378 <td>
1379 compressed packet
1380 </td>
1381 </tr>
1382 </tbody>
1383</table>
1384<p>The Document Index has the document identifier, a pointer into the repository, a checksum and various other statistics.</p>
1385<table class="">
1386 <tbody>
1387 <tr>
1388 <td>
1389 doc id
1390 </td>
1391 <td>
1392 ecode
1393 </td>
1394 <td>
1395 url len
1396 </td>
1397 <td>
1398 page len
1399 </td>
1400 <td>
1401 url
1402 </td>
1403 <td>
1404 page
1405 </td>
1406 </tr>
1407 </tbody>
1408</table>
1409<p>A Lexicon stores the repository of words, implemented with a hashtable over pointers linking to the barrels (sorted linked lists) of the Inverted Index.</p>
1410<table class="">
1411 <tbody>
1412 <tr>
1413 <td>
1414 word id
1415 </td>
1416 <td>
1417 n docs
1418 </td>
1419 </tr>
1420 <tr>
1421 <td>
1422 word id
1423 </td>
1424 <td>
1425 n docs
1426 </td>
1427 </tr>
1428 </tbody>
1429</table>
<p>The Hit Lists store occurrences of a word in a document.</p>
1431<table class="">
1432 <tbody>
1433 <tr>
1434 <td>
1435 <strong>
1436 plain
1437 </strong>
1438 </td>
1439 <td>
1440 cap: 1
1441 </td>
1442 <td>
1443 imp: 3
1444 </td>
1445 <td>
1446 pos: 12
1447 </td>
1448 </tr>
1449 <tr>
1450 <td>
1451 <strong>
1452 fancy
1453 </strong>
1454 </td>
1455 <td>
1456 cap: 1
1457 </td>
1458 <td>
1459 imp: 7
1460 </td>
1461 <td>
1462 type: 4
1463 </td>
1464 <td>
1465 pos: 8
1466 </td>
1467 </tr>
1468 <tr>
1469 <td>
1470 <strong>
1471 anchor
1472 </strong>
1473 </td>
1474 <td>
1475 cap: 1
1476 </td>
1477 <td>
1478 imp: 7
1479 </td>
1480 <td>
1481 type: 4
1482 </td>
1483 <td>
1484 hash: 4
1485 </td>
1486 <td>
1487 pos: 8
1488 </td>
1489 </tr>
1490 </tbody>
1491</table>
1492<p>The Forward Index is a barrel with a range of word identifiers (document identifier and list of word identifiers).</p>
1493<table class="">
1494 <tbody>
1495 <tr>
1496 <td rowspan="3">
1497 doc id
1498 </td>
1499 <td>
1500 word id: 24
1501 </td>
1502 <td>
1503 n hits: 8
1504 </td>
1505 <td>
1506 hit hit hit hit hit hit hit hit
1507 </td>
1508 </tr>
1509 <tr>
1510 <td>
1511 word id: 24
1512 </td>
1513 <td>
1514 n hits: 8
1515 </td>
1516 <td>
1517 hit hit hit hit hit hit hit hit
1518 </td>
1519 </tr>
1520 <tr>
1521 <td>
1522 null word id
1523 </td>
1524 </tr>
1525 </tbody>
1526</table>
<p>The Inverted Index can be sorted by either document identifier or by ranking of word occurrence.</p>
1528<table class="">
1529 <tbody>
1530 <tr>
1531 <td>
1532 doc id: 23
1533 </td>
1534 <td>
1535 n hits: 5
1536 </td>
1537 <td>
1538 hit hit hit hit hit
1539 </td>
1540 </tr>
1541 <tr>
1542 <td>
1543 doc id: 23
1544 </td>
1545 <td>
1546 n hits: 3
1547 </td>
1548 <td>
1549 hit hit hit
1550 </td>
1551 </tr>
1552 <tr>
1553 <td>
1554 doc id: 23
1555 </td>
1556 <td>
1557 n hits: 4
1558 </td>
1559 <td>
1560 hit hit hit hit
1561 </td>
1562 </tr>
1563 <tr>
1564 <td>
1565 doc id: 23
1566 </td>
1567 <td>
1568 n hits: 2
1569 </td>
1570 <td>
1571 hit hit
1572 </td>
1573 </tr>
1574 </tbody>
1575</table>
1576<p>Back in 1998, Google compressed its repository to 53GB and had 24 million pages. The indices, lexicon, and other temporary storage required about 55GB.</p>
1577<h2 id="crawling"><a class="anchor" href="#crawling">¶</a>Crawling</h2>
1578<p>The crawling must be reliable, fast and robust, and also respect the decision of some authors not wanting their pages crawled. Originally, it took a week or more, so simultaneous execution became a must.</p>
1579<p>Back in 1998, Google had between 3 and 4 crawlers running at 100 web pages per second maximum. These were implemented in Python.</p>
1580<p>The crawled pages need parsing to deal with typos or formatting issues.</p>
1581<h2 id="indexing"><a class="anchor" href="#indexing">¶</a>Indexing</h2>
<p>Indexing is about putting the pages into barrels, converting words into word identifiers, and occurrences into hit lists.</p>
<p>Once indexing is done, the barrels are sorted by word identifier, producing the inverted index. This process also had to be done in parallel over many machines, or it would otherwise have been too slow.</p>
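<p>The relation between the lexicon, the forward index and the inverted index can be sketched in Python as follows. This is a heavily simplified model (no barrels, no compact hit encoding), and the two sample documents and variable names are made up for illustration.</p>

```python
from collections import defaultdict

documents = {1: "the cat sat", 2: "the dog sat down"}  # made-up doc id -> text

lexicon = {}        # word -> word id
forward_index = {}  # doc id -> list of (word id, position) hits
for doc_id, text in documents.items():
    hits = []
    for position, word in enumerate(text.split()):
        word_id = lexicon.setdefault(word, len(lexicon))
        hits.append((word_id, position))
    forward_index[doc_id] = hits

# Inverting: regroup the same hits by word id instead of by document id.
inverted_index = defaultdict(lambda: defaultdict(list))
for doc_id, hits in forward_index.items():
    for word_id, position in hits:
        inverted_index[word_id][doc_id].append(position)

sat = lexicon["sat"]
print(sorted(inverted_index[sat]))
```

<p>With the inverted index in place, answering «which documents contain this word?» becomes a single lookup by word identifier instead of a scan over every document.</p>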
1584<h2 id="searching"><a class="anchor" href="#searching">¶</a>Searching</h2>
1585<p>We need to find quality results efficiently. Plenty of weights are considered nowadays, but at its heart, PageRank is used. It is the algorithm they use to map the web, which is formally defined as follows:</p>
1586<p><img src="8e1e61b119e107fcb4bdd7e78f649985.png" alt="" />
1587<em>PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))</em></p>
1588<p>Where:</p>
1589<ul>
1590<li><code>A</code> is a given page</li>
<li><code>T1</code> … <code>Tn</code> are the pages that point to <code>A</code></li>
1592<li><code>d</code> is the damping factor in the range <code>[0, 1]</code> (often 0.85)</li>
1593<li><code>C(A)</code> is the number of links going out of page <code>A</code></li>
<li><code>PR(A)</code> is the page rank of page <code>A</code></li>
</ul>
<p>This formula indicates the probability that a random surfer visits a certain page, and <code>1 - d</code> is the probability that the surfer «gets bored» of following links and jumps to a random page instead. More intuitively, the page rank of a page will grow as more pages link to it, or as the few that link to it have high page rank.</p>
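<p>The formula can be evaluated by starting with a guess for every page and applying it repeatedly until the values settle. The Python sketch below does this on a made-up web of three pages; the <code>pagerank</code> function, the fixed number of iterations and the link graph are assumptions for the example, and it expects every page to have at least one outgoing link.</p>

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            incoming = [src for src, targets in links.items() if page in targets]
            new_ranks[page] = (1 - d) + d * sum(
                ranks[src] / len(links[src]) for src in incoming
            )
        ranks = new_ranks
    return ranks

# Made-up web of three pages: A and C link to B, and B links back to A.
links = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

<p>Page <code>B</code> ends up with the highest rank because two pages link to it, <code>A</code> follows closely because the highly ranked <code>B</code> links to it, and <code>C</code> stays at the minimum of <code>1 - d</code> because nobody links to it.</p>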
<p>The anchor text in the links also helps provide a better description of the page they point to, which improves indexing for even better results.</p>
<p>While searching, the main concern is disk I/O, which takes up most of the time. Caching is very important, and can improve performance by up to 30 times.</p>
1599<p>Now, in order to turn user queries into something we can search, we must parse the query and convert the words into word identifiers.</p>
1600<h2 id="conclusion"><a class="anchor" href="#conclusion">¶</a>Conclusion</h2>
<p>Google is designed to be an efficient, scalable, high-quality search engine. There are still bottlenecks in CPU, memory, disk speed and network I/O, but its major data structures are designed to make efficient use of the resources.</p>
1602<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
1603<ul>
1604<li><a href="https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf">The anatomy of a large-scale hypertextual Web search engine</a></li>
1605<li><a href="https://www.site.uottawa.ca/%7Ediana/csi4107/Google_SearchEngine.pdf">The Anatomy of a Large-Scale Hypertextual Web Search Engine (slides)</a></li>
1606</ul>
1607</main>
1608</body>
1609</html>
1610 </content></entry><entry><title>Privado: PC-Crawler evaluation 2</title><id>dist/pc-crawler-evaluation-2/index.html</id><updated>2020-03-27T23:00:00+00:00</updated><published>2020-03-15T23:00:00+00:00</published><summary>As the student </summary><content type="html" src="dist/pc-crawler-evaluation-2/index.html"><!DOCTYPE html>
1611<html>
1612<head>
1613<meta charset="utf-8" />
1614<meta name="viewport" content="width=device-width, initial-scale=1" />
1615<title>Privado: PC-Crawler evaluation 2</title>
1616<link rel="stylesheet" href="../css/style.css">
1617</head>
1618<body>
1619<main>
1620<p>As the student <code>a(i)</code> where <code>i = 9</code>, I have been assigned to evaluate students <code>a(i - 1)</code> and <code>a(i - 2)</code>, these being:</p>
1621<div class="date-created-modified">Created 2020-03-16<br>
1622Modified 2020-03-28</div>
1623<ul>
1624<li>a08: Classmate (username)</li>
1625<li>a07: Classmate (username)</li>
1626</ul>
1627<p>The evaluation is done according to the criteria described in Segunda entrega del PC-Crawler.</p>
1628<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
1629<p><strong>Grading: A.</strong></p>
1630<p>This is the evaluation of Crawler – Thesauro.</p>
1631<p>It’s a well-written post, properly using WordPress code blocks, and they explain the process of improving the code and what it does. Because there are no noticeable issues with the post, they get the highest grading.</p>
1632<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
1633<p><strong>Grading: B.</strong></p>
1634<p>This is the evaluation of Actividad 2-Crawler.</p>
1635<p>They start with an introduction on what they will do.</p>
1636<p>Next, they show the code they have written, also describing what it does, although they don’t explain <em>why</em> they chose the data structures they used.</p>
<p>The style of the code leaves a lot to be desired, and they should have embedded the code in the post instead of taking screenshots. People who rely on screen readers will not be able to read the code.</p>
1638<p>I have graded them B and not A for this last reason.</p>
1639</main>
1640</body>
1641</html>
1642 </content></entry><entry><title>What is ElasticSearch and why should you care?</title><id>dist/what-is-elasticsearch-and-why-should-you-care/index.html</id><updated>2020-03-26T23:00:00+00:00</updated><published>2020-03-17T23:00:00+00:00</published><summary>ElasticSearch is a giant search index with powerful analytics capabilities. It’s like a database and search engine on steroids, really easy and fast to get up and running. One can think of it as your own Google, a search engine with analytics.</summary><content type="html" src="dist/what-is-elasticsearch-and-why-should-you-care/index.html"><!DOCTYPE html>
1643<html>
1644<head>
1645<meta charset="utf-8" />
1646<meta name="viewport" content="width=device-width, initial-scale=1" />
1647<title>What is ElasticSearch and why should you care?</title>
1648<link rel="stylesheet" href="../css/style.css">
1649</head>
1650<body>
1651<main>
1652<p>ElasticSearch is a giant search index with powerful analytics capabilities. It’s like a database and search engine on steroids, really easy and fast to get up and running. One can think of it as your own Google, a search engine with analytics.</p>
1653<div class="date-created-modified">Created 2020-03-18<br>
1654Modified 2020-03-27</div>
<p>ElasticSearch is feature-rich, stable, fast and well maintained, and it can scale to petabytes of any kind of data, whether it’s structured, semi-structured or unstructured. It’s cost-effective and can be used to make business decisions.</p>
1656<p>Or, described in 10 seconds:</p>
1657<blockquote>
1658<p>Schema-free, REST &amp; JSON based distributed document store
1659Open source: Apache License 2.0
1660Zero configuration</p>
1661</blockquote>
1662<p>-- Alex Reelsen</p>
1663<h2 class="title" id="basic_capabilities"><a class="anchor" href="#basic_capabilities">¶</a>Basic capabilities</h2>
<p>ElasticSearch lets you ask questions about your data, not just make queries. You may think SQL can do this too, but what’s important is building a pipeline of facets and feeding the results from query to query.</p>
<p>Instead of changing your data, you can be flexible with your questions, with no need to re-index the data every time the questions change.</p>
<p>ElasticSearch is not limited to full-text search, either. It can search structured data and return more than just the results: it also yields additional data, such as ranking and highlights, and supports pagination.</p>
<p>It doesn’t take a lot of configuration to get running, either, which can be a good boost to productivity.</p>
1668<h2 id="how_does_it_work_"><a class="anchor" href="#how_does_it_work_">¶</a>How does it work?</h2>
1669<p>ElasticSearch depends on Java, and can work in a distributed cluster if you execute multiple instances. Data will be replicated and sharded as needed. The current version at the time of writing is 7.6.1, and it’s being developed fast!</p>
<p>It also has support for plugins, with an ever-growing ecosystem and integrations for many programming languages. Tools are being built around it, too, like Kibana, which helps you visualize your data.</p>
1671<p>The way you use it is through a JSON API, served over HTTP/S.</p>
1672<h2 id="how_can_i_use_it_"><a class="anchor" href="#how_can_i_use_it_">¶</a>How can I use it?</h2>
<p><a href="https://www.elastic.co/downloads/">You can try ElasticSearch out for free on Elastic Cloud</a>; however, it can also be <a href="https://www.elastic.co/downloads/elasticsearch">downloaded and run locally</a>, which is what we’ll do. Download the file corresponding to your operating system, unzip it, and execute the binary. Running it is as simple as that!</p>
<p>Now you can make queries to it over HTTP, for example with <code>curl</code>:</p>
<pre><code>curl -X PUT localhost:9200/orders/order/1 \
    -H 'Content-Type: application/json' -d '
{
    &quot;created_at&quot;: &quot;2013/09/05 15:45:10&quot;,
    &quot;items&quot;: [
        {
            &quot;name&quot;: &quot;HD Monitor&quot;
        }
    ],
    &quot;total&quot;: 249.95
}'
</code></pre>
1686<p>This will create a new order with some information, such as when it was created, what items it contains, and the total cost of the order.</p>
1687<p>You can then query or filter as needed, script it or even create statistics.</p>
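<p>As a rough sketch of what such a query could look like from code, the following Java 11 snippet builds a full-text <code>match</code> query and sends it to the <code>_search</code> endpoint of the <code>orders</code> index. It assumes ElasticSearch is listening on <code>localhost:9200</code>; the class and method names (<code>OrderSearch</code>, <code>matchQuery</code>, <code>searchRequest</code>) are made up for this illustration.</p>

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of any ElasticSearch client library.
final class OrderSearch {
    /** Builds the JSON body of a "match" query on a single field. */
    static String matchQuery(String field, String text) {
        return "{\"query\":{\"match\":{\"" + escape(field) + "\":\"" + escape(text) + "\"}}}";
    }

    // Minimal escaping (backslashes and quotes only); control characters
    // are not handled, so use a JSON library for anything serious.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds a POST request against {baseUrl}/{index}/_search. */
    static HttpRequest searchRequest(String baseUrl, String index, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes a local node with the "orders" index created earlier.
        HttpRequest request = searchRequest("http://localhost:9200", "orders",
                matchQuery("items.name", "monitor"));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

<p>The response is a JSON document whose hits include the matching orders along with their relevance score.</p>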
1688<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
1689<ul>
1690<li><a href="https://youtu.be/sKnkQSec1U0">YouTube – What is Elasticsearch?</a></li>
1691<li><a href="https://youtu.be/yWNiRC_hUAw">YouTube – GOTO 2013 • Elasticsearch – Beyond Full-text Search • Alex Reelsen</a></li>
1692<li><a href="https://www.elastic.co/kibana">Kibana – Your window into the Elastic Stack</a></li>
1693<li><a href="https://www.elastic.co/guide/index.html">Elastic Stack and Product Documentation</a></li>
1694</ul>
1695</main>
1696</body>
1697</html>
1698 </content></entry><entry><title>Privado: NoSQL evaluation</title><id>dist/nosql-evaluation/index.html</id><updated>2020-03-26T23:00:00+00:00</updated><published>2020-03-15T23:00:00+00:00</published><summary>I have decided to evaluate Classmate‘s post and Classmate‘s post, because they review databases I have not seen or used before, and I think it would be interesting to see new ones.</summary><content type="html" src="dist/nosql-evaluation/index.html"><!DOCTYPE html>
1699<html>
1700<head>
1701<meta charset="utf-8" />
1702<meta name="viewport" content="width=device-width, initial-scale=1" />
1703<title>Privado: NoSQL evaluation</title>
1704<link rel="stylesheet" href="../css/style.css">
1705</head>
1706<body>
1707<main>
<p>I have decided to evaluate Classmate’s post and Classmate’s post, because they review databases I have not seen or used before, and I think it would be interesting to see new ones.</p>
1709<div class="date-created-modified">Created 2020-03-16<br>
1710Modified 2020-03-27</div>
1711<p>The evaluation is based on the requirements defined by Trabajos en grupo sobre Bases de Datos NoSQL:</p>
1712<blockquote>
<p><strong>1st post:</strong> Description of the purpose of the technology and how the NoSQL database works, its features, the edge it occupies in the CAP theorem, where it can be downloaded from, and how it is installed.</p>
1714</blockquote>
1715<p>-- Teacher</p>
1716<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
1717<p><strong>Grading: A.</strong></p>
1718<p>The post I have evaluated is BB.DD. NoSQL: Voldemort 1ª Fase.</p>
<p>The post doesn’t start very well, because the first sentence reads (emphasis mine):</p>
1720<blockquote>
<p>In it we will go over what <strong>MongoDB</strong> is, its features, and how it is installed, among other things.</p>
1722</blockquote>
1723<p>-- Classmate</p>
1724<p>…yet the post is about Voldemort!</p>
1725<p>The post does detail how it works, its architecture, corner in the CAP theorem, download and installation.</p>
1726<p>I have graded the post with A because I think it meets all the requirements, even if they slipped a bit in the beginning.</p>
1727<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
1728<p><strong>Grading: A.</strong></p>
<p>The post I have evaluated is Raven.</p>
<p>They have done a good job describing the project’s goals, its corner in the CAP theorem and where to download it, and they provide an extensive installation section.</p>
<p>They don’t seem to use some of WordPress’s features, such as lists, but otherwise the post is good and deserves an A grading.</p>
1732</main>
1733</body>
1734</html>
1735 </content></entry><entry><title>Integrating Apache Tika into our Crawler</title><id>dist/integrating-apache-tika-into-our-crawler/index.html</id><updated>2020-03-24T23:00:00+00:00</updated><published>2020-03-17T23:00:00+00:00</published><summary>In our last crawler post</summary><content type="html" src="dist/integrating-apache-tika-into-our-crawler/index.html"><!DOCTYPE html>
1736<html>
1737<head>
1738<meta charset="utf-8" />
1739<meta name="viewport" content="width=device-width, initial-scale=1" />
1740<title>Integrating Apache Tika into our Crawler</title>
1741<link rel="stylesheet" href="../css/style.css">
1742</head>
1743<body>
1744<main>
1745<p><a href="/blog/ribw/upgrading-our-baby-crawler/">In our last crawler post</a>, we detailed how our crawler worked, and although it did a fine job, it’s time for some extra upgrading.</p>
1746<div class="date-created-modified">Created 2020-03-18<br>
1747Modified 2020-03-25</div>
1748<h2 class="title" id="what_kind_of_upgrades_"><a class="anchor" href="#what_kind_of_upgrades_">¶</a>What kind of upgrades?</h2>
1749<p>A small but useful one. We are adding support for file types that contain text but cannot be processed by normal text editors because they are structured and not just plain text (such as PDF files, Excel, Word documents…).</p>
1750<p>And for this task, we will make use of the help offered by <a href="https://tika.apache.org/">Tika</a>, our friendly Apache tool.</p>
1751<h2 id="what_is_tika_"><a class="anchor" href="#what_is_tika_">¶</a>What is Tika?</h2>
1752<p><a href="https://tika.apache.org/">Tika</a> is a set of libraries offered by <a href="https://en.wikipedia.org/wiki/The_Apache_Software_Foundation">The Apache Software Foundation</a> that we can include in our project in order to extract the text and metadata of files from a <a href="https://tika.apache.org/1.24/formats.html">long list of supported formats</a>.</p>
1753<h2 id="changes_in_the_code"><a class="anchor" href="#changes_in_the_code">¶</a>Changes in the code</h2>
<p>Not much has changed in the structure of the crawler. We have simply added a new method in <code>Utils</code> that uses the class <code>Tika</code> from the previously mentioned library to process and extract the text of more file types.</p>
1755<p>Then, we use this text just like we would for our standard text file (checking the thesaurus and adding it to the word map) and voilà! We have just added support for a big range of file types.</p>
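<p>The second step could look roughly like the sketch below. It assumes the text has already been extracted into a <code>String</code> (for instance with the <code>parseToString</code> method of the <code>Tika</code> class), that the thesaurus is a set of lowercase words, and that the word map simply counts occurrences. <code>TextIndexer</code> and <code>index</code> are hypothetical names for illustration, not the crawler’s actual code.</p>

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the real crawler's data structures may differ.
final class TextIndexer {
    /**
     * Splits the text into words and counts, in wordMap, every word
     * that also appears in the thesaurus. Other words are ignored.
     */
    static void index(String text, Set<String> thesaurus, Map<String, Integer> wordMap) {
        // Split on anything that is not a letter or a decimal digit.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (!token.isEmpty() && thesaurus.contains(token)) {
                wordMap.merge(token, 1, Integer::sum);
            }
        }
    }
}
```

<p>Because the method only sees a <code>String</code>, it does not matter whether the text came from a plain text file or from a PDF processed by Tika.</p>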
1756<h2 id="incorporating_gradle"><a class="anchor" href="#incorporating_gradle">¶</a>Incorporating Gradle</h2>
1757<p>In order for the previous code to work, we need to make use of external libraries. To make this process easier and because the project is growing, we decided to use <a href="https://gradle.org/">Gradle</a>, a build system that can be used for projects in various programming languages, such as Java.</p>
<p>We followed their <a href="https://guides.gradle.org/building-java-applications/">guide to Building Java Applications</a>, and in a few steps added the required <code>.gradle</code> files. Now we can compile and run the code in a single command, without having to worry about juggling Java and external dependencies:</p>
1759<pre><code>./gradlew run
1760</code></pre>
1761<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
1762<p>And here you can download the final result:</p>
1763<p><em>download removed</em></p>
1764</main>
1765</body>
1766</html>
1767 </content></entry><entry><title>Cassandra: Basic Operations and Architecture</title><id>dist/nosql-databases-basic-operations-and-architecture/index.html</id><updated>2020-03-23T23:00:00+00:00</updated><published>2020-03-04T23:00:00+00:00</published><summary>This is the second post in the NoSQL Databases series, with a brief description on the basic operations (such as insertion, retrieval, indexing…), and complete execution along with the data model / architecture.</summary><content type="html" src="dist/nosql-databases-basic-operations-and-architecture/index.html"><!DOCTYPE html>
1768<html>
1769<head>
1770<meta charset="utf-8" />
1771<meta name="viewport" content="width=device-width, initial-scale=1" />
1772<title>Cassandra: Basic Operations and Architecture</title>
1773<link rel="stylesheet" href="../css/style.css">
1774</head>
1775<body>
1776<main>
<p>This is the second post in the NoSQL Databases series, with a brief description of the basic operations (such as insertion, retrieval, indexing…) and a complete walkthrough of their execution, along with the data model / architecture.</p>
1778<div class="date-created-modified">Created 2020-03-05<br>
1779Modified 2020-03-24</div>
1780<p>Other posts in this series:</p>
1781<ul>
1782<li><a href="/blog/ribw/nosql-databases-an-introduction/">Cassandra: an Introduction</a></li>
1783<li><a href="/blog/ribw/nosql-databases-basic-operations-and-architecture/">Cassandra: Basic Operations and Architecture</a> (this post)</li>
1784</ul>
1785<hr />
<p>Cassandra uses its own query language for managing databases, known as <strong>CQL</strong> (<strong>Cassandra Query Language</strong>). Cassandra stores data in <strong><em>tables</em></strong>, as in relational databases, and these tables are grouped in <strong><em>keyspaces</em></strong>. A keyspace defines a number of options that apply to all the tables it contains. The most used option is the <strong>replication strategy</strong>. It is recommended to have only one keyspace per application.</p>
<p>It is important to mention that <strong>table and keyspace names</strong> are <strong>case insensitive</strong>, so myTable is equivalent to mytable, but it is possible to <strong>force case sensitivity</strong> using <strong>double quotes</strong>.</p>
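<p>To make the rule concrete, here is a small Java sketch that models how an identifier is interpreted: unquoted names are folded to lowercase, while double-quoted names keep their case. This is only an illustration of the rule, not Cassandra’s own code, and <code>CqlIdentifiers</code> is a hypothetical name.</p>

```java
import java.util.Locale;

// Hypothetical model of the CQL identifier rule, for illustration only.
final class CqlIdentifiers {
    /** Returns the name Cassandra would actually use for the identifier. */
    static String normalize(String identifier) {
        boolean quoted = identifier.length() >= 2
                && identifier.startsWith("\"")
                && identifier.endsWith("\"");
        if (quoted) {
            // Quoted: keep the case; a doubled quote stands for one quote.
            return identifier.substring(1, identifier.length() - 1).replace("\"\"", "\"");
        }
        // Unquoted: case insensitive, so fold to lowercase.
        return identifier.toLowerCase(Locale.ROOT);
    }
}
```

<p>With this model, <code>myTable</code> and <code>mytable</code> both normalize to <code>mytable</code>, whereas <code>"myTable"</code> (with the quotes) stays <code>myTable</code>.</p>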
1788<p>To begin with the basic operations it is necessary to deploy Cassandra:</p>
1789<ol>
1790<li>Open a terminal in the root of the Apache Cassandra folder downloaded in the previous post.</li>
1791<li>Run the command:</li>
1792</ol>
1793<pre><code>$ bin/cassandra
1794</code></pre>
<p>Once Cassandra is deployed, it is time to open a <strong>CQL Shell</strong> in <strong>another terminal</strong> with the command:</p>
1796<pre><code>$ bin/cqlsh
1797</code></pre>
<p>You can tell that Cassandra is deployed when the CQL Shell prints the following message:</p>
1799<p><img src="uwqQgQte-cuYb_pePFOuY58re23kngrDKNgL1qz4yOfnBDZkqMIH3fFuCrye.png" alt="" />
1800<em>CQL Shell</em></p>
1801<h2 class="title" id="create_insert"><a class="anchor" href="#create_insert">¶</a>Create/Insert</h2>
1802<h3 id="ddl_data_definition_language_"><a class="anchor" href="#ddl_data_definition_language_">¶</a>DDL (Data Definition Language)</h3>
1803<h4 id="create_keyspace"><a class="anchor" href="#create_keyspace">¶</a>Create keyspace</h4>
<p>A keyspace is created using a <strong>CREATE KEYSPACE</strong> statement:</p>
<pre><code>CREATE KEYSPACE [ IF NOT EXISTS ] keyspace_name WITH options;
</code></pre>
<p>Attempting to create an already existing keyspace will return an error unless the <strong>IF NOT EXISTS</strong> directive is used.</p>
<p>The supported “<strong>options</strong>” are:</p>
<ul>
<li>“<strong>replication</strong>”: this is <strong>mandatory</strong> and defines the <strong>replication strategy</strong> and the <strong>replication factor</strong> (the number of nodes that will have a copy of the data). Within this option there is a property called “<strong>class</strong>” in which the <strong>replication strategy</strong> is specified (“SimpleStrategy” or “NetworkTopologyStrategy”).</li>
<li>“<strong>durable_writes</strong>”: this is <strong>not mandatory</strong> and controls whether the <strong>commit log is used for updates</strong> on the keyspace (it is enabled by default).</li>
1812</ul>
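<p>To give an intuition of what the replication factor means with “SimpleStrategy”, the following Java sketch models a ring of nodes: the first replica goes to the node that owns the key’s token, and the remaining replicas go to the next nodes clockwise. It is a deliberately simplified model (one hand-picked token per node, no partitioner, no virtual nodes), and <code>SimpleRing</code> is a hypothetical class, not part of Cassandra.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified, hypothetical model of a token ring with SimpleStrategy.
final class SimpleRing {
    private final TreeMap<Long, String> nodesByToken = new TreeMap<>();

    void addNode(long token, String name) {
        nodesByToken.put(token, name);
    }

    /** Returns the nodes that hold a copy of the key with the given token. */
    List<String> replicasFor(long keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (nodesByToken.isEmpty()) {
            return replicas;
        }
        // There cannot be more replicas than nodes in this model.
        int wanted = Math.min(replicationFactor, nodesByToken.size());
        // Owner: first node whose token is >= the key token, wrapping around.
        Long token = nodesByToken.ceilingKey(keyToken);
        if (token == null) {
            token = nodesByToken.firstKey();
        }
        while (replicas.size() < wanted) {
            replicas.add(nodesByToken.get(token));
            // Walk clockwise to the next node, wrapping around the ring.
            token = nodesByToken.higherKey(token);
            if (token == null) {
                token = nodesByToken.firstKey();
            }
        }
        return replicas;
    }
}
```

<p>For instance, with nodes A, B, C and D at tokens 0, 100, 200 and 300, a key with token 150 and a replication factor of 3 would be stored on C, D and A.</p>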
<p>As an example, the following statement creates a keyspace named “test_keyspace”, with “SimpleStrategy” as the replication “class” and a “replication_factor” of 3.</p>
<pre><code>CREATE KEYSPACE test_keyspace
    WITH replication = {'class': 'SimpleStrategy',
                        'replication_factor' : 3};
</code></pre>
<p>The <strong>USE</strong> statement allows you to <strong>change</strong> the current <strong>keyspace</strong>. The syntax of this statement is very simple:</p>
<pre><code>USE keyspace_name;
</code></pre>
1821<p><img src="RDWIG2RwvEevUFQv6TGFtGzRm4_9ERpxPf0feriflaj3alvWw3FEIAr_ZdF1.png" alt="" />
1822<em>USE statement</em></p>
<p>It is also possible to get the metadata of a keyspace with the <strong>DESCRIBE</strong> statement.</p>
<pre><code>DESCRIBE KEYSPACES | KEYSPACE keyspace_name;
</code></pre>
1826<h4 id="create_table"><a class="anchor" href="#create_table">¶</a>Create table</h4>
<p>Creating a new table uses the <strong>CREATE TABLE</strong> statement:</p>
<pre><code>CREATE TABLE [ IF NOT EXISTS ] table_name
    '('
        column_definition
        ( ',' column_definition )*
        [ ',' PRIMARY KEY '(' primary_key ')' ]
    ')' [ WITH table_options ];
</code></pre>
<p>Where:</p>
<ul>
<li>“column_definition” is: <code>column_name cql_type [ STATIC ] [ PRIMARY KEY ]</code></li>
<li>“primary_key” is: <code>partition_key [ ',' clustering_columns ]</code></li>
<li>“table_options” is: <code>COMPACT STORAGE [ AND table_options ]</code>, or <code>CLUSTERING ORDER BY '(' clustering_order ')' [ AND table_options ]</code>, or “options”.</li>
</ul>
1836<p>Attempting to create an already existing table will return an error unless the <strong>IF NOT EXISTS</strong> directive is used.</p>
1837<p>The <strong>CQL types</strong> are described in the References section.</p>
<p>For example, we are going to create a table called “species_table” in the keyspace “test_keyspace” with a “species” text column (as PRIMARY KEY), a “common_name” text, a “population” varint, an “average_size” int and a “sex” text. We are also going to add a comment to the table: “Some species records”.</p>
<pre><code>CREATE TABLE species_table (
    species text PRIMARY KEY,
    common_name text,
    population varint,
    average_size int,
    sex text
) WITH comment='Some species records';
</code></pre>
<p>It is also possible to get the metadata of a table with the <strong>DESCRIBE</strong> statement.</p>
<pre><code>DESCRIBE TABLES | TABLE [keyspace_name.]table_name;
</code></pre>
1850<h3 id="dml_data_manipulation_language_"><a class="anchor" href="#dml_data_manipulation_language_">¶</a>DML (Data Manipulation Language)</h3>
1851<h4 id="insert_data"><a class="anchor" href="#insert_data">¶</a>Insert data</h4>
<p>Inserting data for a row is done using an <strong>INSERT</strong> statement:</p>
<pre><code>INSERT INTO table_name ( names_values | json_clause )
    [ IF NOT EXISTS ]
    [ USING update_parameter ( AND update_parameter )* ];
</code></pre>
1857<p>Where “names_values” is: names VALUES tuple_literal; “json_clause” is: JSON string [ DEFAULT ( NULL | UNSET ) ]; and “update_parameter” is usually: TTL.</p>
<p>For example, we are going to use both the VALUES and JSON clauses to insert data in the table “species_table”. With the VALUES clause it is necessary to supply the list of columns, whereas with the JSON clause the column names are taken from the keys of the JSON object.</p>
<p>Note: TTL (Time To Live) sets how long, in seconds, the inserted values live before they expire and are removed automatically, while TIMESTAMP sets the write timestamp of the operation.</p>
<p>With the VALUES clause we are going to insert a new species called “White monkey”, with an average size of 3, the common name “Copito de nieve”, a population of 0 and sex “male”.</p>
<pre><code>INSERT INTO species_table (species, common_name, population, average_size, sex)
    VALUES ('White monkey', 'Copito de nieve', 0, 3, 'male');
</code></pre>
<p>With the JSON clause we are going to insert a new species called “Cloned Sheep”, with an average size of 1, the common name “Dolly the Sheep”, a population of 0 and sex “female”.</p>
<pre><code>INSERT INTO species_table JSON '{&quot;species&quot;: &quot;Cloned Sheep&quot;,
    &quot;common_name&quot;: &quot;Dolly the Sheep&quot;,
    &quot;average_size&quot;: 1,
    &quot;population&quot;: 0,
    &quot;sex&quot;: &quot;female&quot;}';
</code></pre>
<p>Note: all updates for an <strong>INSERT</strong> are applied <strong>atomically</strong> and in <strong>isolation</strong>.</p>
1872<h2 id="read"><a class="anchor" href="#read">¶</a>Read</h2>
<p>Querying data from tables is done using a <strong>SELECT</strong> statement:</p>
<pre><code>SELECT [ JSON | DISTINCT ] ( select_clause | '*' )
    FROM table_name
    [ WHERE where_clause ]
    [ GROUP BY group_by_clause ]
    [ ORDER BY ordering_clause ]
    [ PER PARTITION LIMIT (integer | bind_marker) ]
    [ LIMIT (integer | bind_marker) ]
    [ ALLOW FILTERING ];
</code></pre>
<p>The <strong>CQL SELECT</strong> statement is very <strong>similar</strong> to the <strong>SQL SELECT</strong> statement, as both allow filtering (<strong>WHERE</strong>), grouping (<strong>GROUP BY</strong>), ordering (<strong>ORDER BY</strong>) and limiting the number of results (<strong>LIMIT</strong>). In addition, <strong>CQL offers</strong> a <strong>limit per partition</strong> (<strong>PER PARTITION LIMIT</strong>) and an explicit opt-in for queries that require <strong>filtering</strong> (<strong>ALLOW FILTERING</strong>).</p>
<p>Note: as in SQL, it is possible to give aliases to the selected columns with the keyword <strong>AS</strong>.</p>
<p>For example, we are going to retrieve all the information about the rows of the table “species_table” whose “sex” is “male”. ALLOW FILTERING is required here because the WHERE clause restricts “sex”, a column that is neither part of the primary key nor indexed.</p>
<pre><code>SELECT * FROM species_table WHERE sex = 'male' ALLOW FILTERING;
</code></pre>
1888<p><img src="s6GrKIGATvOSD7oGRNScUU5RnLN_-3X1JXvnVi_wDT_hrmPMZdnCdBI8DpIJ.png" alt="" />
1889<em>SELECT statement</em></p>
<p>Furthermore, we are going to test the SELECT JSON statement. For this, we are going to retrieve only the names of the species with a population of 0.</p>
<pre><code>SELECT JSON species FROM species_table WHERE population = 0 ALLOW FILTERING;
</code></pre>
1893<p><img src="Up_eHlqKQp2RI5XIbgPOvj1B5J3gLxz7v7EI0NDRgezQTipecdfDT6AQoso0.png" alt="" />
1894<em>SELECT JSON statement</em></p>
1895<h2 id="update"><a class="anchor" href="#update">¶</a>Update</h2>
1896<h3 id="ddl_data_definition_language__2"><a class="anchor" href="#ddl_data_definition_language__2">¶</a>DDL (Data Definition Language)</h3>
1897<h4 id="alter_keyspace"><a class="anchor" href="#alter_keyspace">¶</a>Alter keyspace</h4>
<p>The <strong>ALTER KEYSPACE</strong> statement allows you to modify the options of a keyspace:</p>
<pre><code>ALTER KEYSPACE keyspace_name WITH options;
</code></pre>
<p>Note: the supported <strong>options</strong> are the same as for creating a keyspace: “<strong>replication</strong>” and “<strong>durable_writes</strong>”.</p>
<p>As an example, the following statement modifies the keyspace named “test_keyspace” and sets a “replication_factor” of 4.</p>
<pre><code>ALTER KEYSPACE test_keyspace
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 4};
</code></pre>
1906<h4 id="alter_table"><a class="anchor" href="#alter_table">¶</a>Alter table</h4>
<p>Altering an existing table uses the <strong>ALTER TABLE</strong> statement:</p>
<pre><code>ALTER TABLE table_name alter_table_instruction;
</code></pre>
<p>Where “alter_table_instruction” can be: <code>ADD column_name cql_type ( ',' column_name cql_type )*</code>; or <code>DROP column_name ( column_name )*</code>; or <code>WITH options</code>.</p>
<p>As an example, the following statement ADDs a new column called “extinct”, of type “boolean”, to the table “species_table”.</p>
<pre><code>ALTER TABLE species_table ADD extinct boolean;
</code></pre>
<p>Another example is to DROP the column called “sex” from the table “species_table”.</p>
<pre><code>ALTER TABLE species_table DROP sex;
</code></pre>
<p>Finally, alter the comment with the WITH clause and set it to “All species records”.</p>
<pre><code>ALTER TABLE species_table WITH comment='All species records';
</code></pre>
<p>These changes can be checked with the <strong>DESCRIBE</strong> statement:</p>
<pre><code>DESCRIBE TABLE species_table;
</code></pre>
1923<p><img src="xebKPqkWkn97YVHpRVXZYWvRUfeRUyCH-vPDs67aFaEeU53YTRbDOFscOlAr.png" alt="" />
1924<em>DESCRIBE table</em></p>
1925<h3 id="dml_data_manipulation_language__2"><a class="anchor" href="#dml_data_manipulation_language__2">¶</a>DML (Data Manipulation Language)</h3>
1926<h4 id="update_data"><a class="anchor" href="#update_data">¶</a>Update data</h4>
<p>Updating a row is done using an <strong>UPDATE</strong> statement:</p>
<pre><code>UPDATE table_name
    [ USING update_parameter ( AND update_parameter )* ]
    SET assignment ( ',' assignment )*
    WHERE where_clause
    [ IF ( EXISTS | condition ( AND condition )*) ];
</code></pre>
<p>Where the update_parameter is: ( TIMESTAMP | TTL ) ( integer | bind_marker )</p>
<p>It is important to mention that the <strong>WHERE</strong> clause is used to select the row to update and <strong>must include all columns</strong> composing the <strong>PRIMARY KEY</strong>.</p>
<p>We are going to test this statement by setting the column “extinct” to true for the row whose species is “White monkey”.</p>
<pre><code>UPDATE species_table SET extinct = true WHERE species='White monkey';
</code></pre>
1939<p><img src="IcaCe6VEC5c0ZQIygz-CiclzFyt491u7xPMg2muJLR8grmqaiUzkoQsVCoHf.png" alt="" />
1940<em>SELECT statement</em></p>
1941<h2 id="delete"><a class="anchor" href="#delete">¶</a>Delete</h2>
1942<h3 id="ddl_data_definition_language__3"><a class="anchor" href="#ddl_data_definition_language__3">¶</a>DDL (Data Definition Language)</h3>
1943<h4 id="drop_keyspace"><a class="anchor" href="#drop_keyspace">¶</a>Drop keyspace</h4>
<p>Dropping a keyspace can be done using the <strong>DROP KEYSPACE</strong> statement:</p>
<pre><code>DROP KEYSPACE [ IF EXISTS ] keyspace_name;
</code></pre>
<p>For example, drop the keyspace called “test_keyspace_2” if it exists:</p>
<pre><code>DROP KEYSPACE IF EXISTS test_keyspace_2;
</code></pre>
<p>As this keyspace does not exist, this statement will do nothing.</p>
1951<h4 id="drop_table"><a class="anchor" href="#drop_table">¶</a>Drop table</h4>
<p>Dropping a table uses the <strong>DROP TABLE</strong> statement:</p>
<pre><code>DROP TABLE [ IF EXISTS ] table_name;
</code></pre>
<p>For example, drop the table called “species_2” if it exists:</p>
<pre><code>DROP TABLE IF EXISTS species_2;
</code></pre>
<p>As this table does not exist, this statement will do nothing.</p>
1959<h4 id="truncate_table_"><a class="anchor" href="#truncate_table_">¶</a>Truncate (table)</h4>
<p>A table can be truncated using the <strong>TRUNCATE</strong> statement:</p>
<pre><code>TRUNCATE [ TABLE ] table_name;
</code></pre>
1963<p>Do not execute this command now, because if you do it, you will need to insert the previous data again.</p>
1964<p>Note: as tables are the only object that can be truncated the keyword TABLE can be omitted.</p>
1965<p><img src="FOkhfpxlWFQCzcdfeWxLTy7wx5inDv0xwVeVhE79Pqtk3yYzWsZJnz_SBhUi.png" alt="" />
1966<em>TRUNCATE statement</em></p>
1967<h3 id="dml_data_manipulation_language__3"><a class="anchor" href="#dml_data_manipulation_language__3">¶</a>DML (Data Manipulation Language)</h3>
1968<h4 id="delete_data"><a class="anchor" href="#delete_data">¶</a>Delete data</h4>
<p>Deleting rows or parts of rows uses the <strong>DELETE</strong> statement:</p>
<pre><code>DELETE [ simple_selection ( ',' simple_selection )* ]
    FROM table_name
    [ USING update_parameter ( AND update_parameter )* ]
    WHERE where_clause
    [ IF ( EXISTS | condition ( AND condition )*) ];
</code></pre>
<p>Now we are going to delete the value of the column “average_size” from “Cloned Sheep”.</p>
<pre><code>DELETE average_size FROM species_table WHERE species = 'Cloned Sheep';
</code></pre>
1979<p><img src="CyuQokVL5J9TAelq-WEWhNl6kFtbIYs0R1AeU5NX4EkG-YQI81mNHdnf2yWN.png" alt="" />
1980<em>DELETE value statement</em></p>
<p>Next, we are going to delete the whole row for that same species.</p>
<pre><code>DELETE FROM species_table WHERE species = 'Cloned Sheep';
</code></pre>
1984<p><img src="jvQ5cXJ5GTVQ6giVhBEpPJmrJw-zwKKyB9nsTm5PRcGSTzkmh-WO4kTeuLpB.png" alt="" />
1985<em>DELETE row statement</em></p>
1986<h2 id="batch"><a class="anchor" href="#batch">¶</a>Batch</h2>
<p>Multiple <strong>INSERT</strong>, <strong>UPDATE</strong> and <strong>DELETE</strong> statements can be executed in a <strong>single statement</strong> by grouping them through a <strong>BATCH</strong> statement.</p>
<pre><code>BEGIN [ UNLOGGED | COUNTER ] BATCH
    [ USING update_parameter ( AND update_parameter )* ]
    modification_statement ( ';' modification_statement )*
    APPLY BATCH;
</code></pre>
<p>Where modification_statement can be an insert_statement, an update_statement or a delete_statement.</p>
<ul>
<li>By default, a batch is logged, which means that either all operations in the batch eventually complete or none will. <strong>UNLOGGED</strong> skips the batch log, so a failed batch may be left only partly applied.</li>
<li><strong>COUNTER</strong> is used for batched counter updates. Unlike other updates, counter updates are not idempotent, so each time we execute the updates in a batch, we will have different results.</li>
</ul>
<p>For example:</p>
<pre><code>BEGIN BATCH
    INSERT INTO species_table (species, common_name, population, average_size, extinct)
        VALUES ('Blue Shark', 'Tiburillo', 30, 10, false);
    INSERT INTO species_table (species, common_name, population, average_size, extinct)
        VALUES ('Cloned sheep', 'Dolly the Sheep', 1, 1, true);
    UPDATE species_table SET population = 2 WHERE species='Cloned sheep';
    DELETE FROM species_table WHERE species = 'White monkey';
APPLY BATCH;
</code></pre>
2008<p><img src="EL9Dac26o0FqkVoeAKmopEKQe0wWq-xYI14b9RzGxtUkFJA3i2eTiR6qkuuJ.png" alt="" />
2009<em>BATCH statement</em></p>
2010<h2 id="index"><a class="anchor" href="#index">¶</a>Index</h2>
<p>CQL supports creating secondary indexes on tables, allowing queries on the table to use those indexes.</p>
<p><strong>Creating</strong> a secondary index on a table uses the <strong>CREATE INDEX</strong> statement:</p>
<pre><code>CREATE [ CUSTOM ] INDEX [ IF NOT EXISTS ] [ index_name ]
    ON table_name '(' index_identifier ')'
    [ USING string [ WITH OPTIONS = map_literal ] ];
</code></pre>
<p>For example, we are going to create an index called “population_idx” on the column “population” of the table “species_table”.</p>
<pre><code>CREATE INDEX population_idx ON species_table (population);
</code></pre>
<p><strong>Dropping</strong> a secondary index uses the <strong>DROP INDEX</strong> statement:</p>
<pre><code>DROP INDEX [ IF EXISTS ] index_name;
</code></pre>
<p>Now, we are going to drop the previous index:</p>
<pre><code>DROP INDEX IF EXISTS population_idx;
</code></pre>
2026<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
2027<ul>
2028<li><a href="https://cassandra.apache.org/doc/latest/cql/ddl.html">Cassandra CQL</a></li>
2029<li><a href="https://techdifferences.com/difference-between-ddl-and-dml-in-dbms.html">Differences between DML and DDL</a></li>
2030<li><a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cqlReferenceTOC.html">Datastax CQL</a></li>
2031<li><a href="https://cassandra.apache.org/doc/latest/cql/types.html#grammar-token-cql-type">Cassandra CQL Types</a></li>
2032<li><a href="https://cassandra.apache.org/doc/latest/cql/indexes.html">Cassandra Index</a></li>
2033</ul>
2034</main>
2035</body>
2036</html>
2037 </content></entry><entry><title>Upgrading our Baby Crawler</title><id>dist/upgrading-our-baby-crawler/index.html</id><updated>2020-03-17T23:00:00+00:00</updated><published>2020-03-10T23:00:00+00:00</published><summary>In our </summary><content type="html" src="dist/upgrading-our-baby-crawler/index.html"><!DOCTYPE html>
2038<html>
2039<head>
2040<meta charset="utf-8" />
2041<meta name="viewport" content="width=device-width, initial-scale=1" />
2042<title>Upgrading our Baby Crawler</title>
2043<link rel="stylesheet" href="../css/style.css">
2044</head>
2045<body>
2046<main>
<p>In our <a href="/blog/ribw/build-your-own-pc/">last post in this series</a>, we presented the code for our Personal Crawler. However, we didn’t quite explain what a crawler even is! We will use this moment to go a bit more in-depth, and make some upgrades to it.</p>
2048<div class="date-created-modified">Created 2020-03-11<br>
2049Modified 2020-03-18</div>
2050<h2 class="title" id="what_is_a_crawler_"><a class="anchor" href="#what_is_a_crawler_">¶</a>What is a Crawler?</h2>
2051<p>A crawler is a program whose job is to analyze documents and extract data from them. For example, search engines like <a href="http://duckduckgo.com/">DuckDuckGo</a>, <a href="https://bing.com/">Bing</a> or <a href="http://google.com/">Google</a> all have crawlers to analyze websites and build a database around them. They are some kind of «trackers», because they keep track of everything they find.</p>
2052<p>Their basic behaviour can be described as follows: given a starting list of URLs, follow them all and identify hyperlinks inside the documents. Add these to the list of links to follow, and repeat <em>ad infinitum</em>.</p>
2053<ul>
2054<li>This lets us create an index to quickly search across them all.</li>
2055<li>We can also identify broken links.</li>
2056<li>We can gather any other type of information that we find.</li>
2057</ul>
2058<p>Our crawler will work offline, on our own computer, scanning the text documents it finds under the root directory we tell it to scan.</p>
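<p>The basic behaviour described above can be sketched as a loop over a queue of pending documents. This is only a minimal illustration and not the code of our crawler: the class name <code>FrontierSketch</code>, the sample file names and the in-memory <code>links</code> map (which stands in for actually opening a document and extracting its hyperlinks) are assumptions made for the example.</p>
<pre><code>import java.util.*;

class FrontierSketch {
    // Visits every document reachable from the seeds, in breadth-first order.
    // `links` is a hypothetical stand-in for &quot;open a document and extract its hyperlinks&quot;.
    static List&lt;String&gt; crawl(List&lt;String&gt; seeds, Map&lt;String, List&lt;String&gt;&gt; links) {
        Queue&lt;String&gt; frontier = new ArrayDeque&lt;&gt;(seeds);
        Set&lt;String&gt; seen = new LinkedHashSet&lt;&gt;(seeds);
        List&lt;String&gt; visited = new ArrayList&lt;&gt;();

        String current;
        while ((current = frontier.poll()) != null) {
            visited.add(current);
            List&lt;String&gt; found = links.get(current);
            if (found == null) {
                continue; // unknown document: nothing to follow from here
            }
            for (String link : found) {
                // Only enqueue links we have not seen yet, or the loop may never end
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map&lt;String, List&lt;String&gt;&gt; links = new HashMap&lt;&gt;();
        links.put(&quot;a.html&quot;, Arrays.asList(&quot;b.html&quot;, &quot;c.html&quot;));
        links.put(&quot;b.html&quot;, Arrays.asList(&quot;a.html&quot;, &quot;c.html&quot;));
        links.put(&quot;c.html&quot;, Collections.&lt;String&gt;emptyList());

        // prints [a.html, b.html, c.html]
        System.out.println(crawl(Arrays.asList(&quot;a.html&quot;), links));
    }
}
</code></pre>
<p>Remembering which links have already been seen is what keeps the loop from revisiting documents that link to each other.</p>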
2059<h2 id="design_decissions"><a class="anchor" href="#design_decissions">¶</a>Design Decisions</h2>
2060<ul>
2061<li>We will use Java. Its runtime is quite ubiquitous, so it should be able to run virtually anywhere. The language is statically typed, which helps catch errors early on.</li>
2062<li>Our solution is iterative. While recursion can be seen as more elegant by some, iterative solutions are often more performant with less need for optimization.</li>
2063</ul>
2064<h2 id="requirements"><a class="anchor" href="#requirements">¶</a>Requirements</h2>
2065<p>If you don’t have Java installed yet, you can <a href="https://java.com/en/download/">Download Free Java Software</a> from Oracle’s site. To compile the code, the <a href="https://www.oracle.com/java/technologies/javase-jdk8-downloads.html">Java Development Kit</a> is also necessary.</p>
2066<p>We don’t depend on any other external libraries, for easier deployment and compilation.</p>
2067<h2 id="implementation"><a class="anchor" href="#implementation">¶</a>Implementation</h2>
2068<p>Because the code was getting pretty large, it has been split into several files, and we have also upgraded it to use a Graphical User Interface instead! We decided to use Swing, based on the Java tutorial <a href="https://docs.oracle.com/javase/tutorial/uiswing/">Creating a GUI With JFC/Swing</a>.</p>
2069<h3 id="app"><a class="anchor" href="#app">¶</a>App</h3>
2070<p>This file is the entry point of our application. Its job is to initialize the components, lay them out in the main panel, and connect the event handlers.</p>
2071<p>Most widgets are pretty standard, and are defined as class variables. However, some variables are notable. The <a href="https://docs.oracle.com/javase/8/docs/api/javax/swing/table/DefaultTableModel.html"><code>DefaultTableModel</code></a> is used because it allows us to <a href="https://stackoverflow.com/a/22550106">dynamically add rows</a>, and we also have a <a href="https://docs.oracle.com/javase/8/docs/api/javax/swing/SwingWorker.html"><code>SwingWorker</code></a> subclass responsible for performing the word analysis (which is quite CPU intensive and should not be run in the UI thread!).</p>
2072<p>There are a few utility methods to ease some common operations, such as <code>updateStatus</code>, which changes the status label in the main window, informing the user of the latest changes.</p>
2073<h3 id="thesaurus"><a class="anchor" href="#thesaurus">¶</a>Thesaurus</h3>
2074<p>A thesaurus is a collection of words or terms used to represent concepts. In literature this is commonly known as a dictionary.</p>
2075<p>For this project, we use a thesaurus based on how relevant a word is to the meaning of a sentence, filtering out those that barely give us any information.</p>
2076<p>This file contains a simple thesaurus implementation, which can trivially be used as a normal or inverted thesaurus. However, we only treat it as inverted, and its job is loading itself and determining if words are valid or should otherwise be ignored.</p>
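<p>To illustrate the difference between the two modes, here is a minimal sketch of such a filter. It is not the file from the download: the class name <code>ThesaurusSketch</code>, its constructor and the sample words are assumptions, and the words are passed in directly instead of being loaded by the thesaurus itself.</p>
<pre><code>import java.util.*;

class ThesaurusSketch {
    private final Set&lt;String&gt; words = new HashSet&lt;&gt;();
    private final boolean inverted;

    ThesaurusSketch(Collection&lt;String&gt; words, boolean inverted) {
        for (String word : words) {
            this.words.add(word.toLowerCase(Locale.ROOT));
        }
        this.inverted = inverted;
    }

    // Normal thesaurus: only the listed words are valid.
    // Inverted thesaurus: the listed words are the ones to ignore.
    boolean isValid(String word) {
        return words.contains(word.toLowerCase(Locale.ROOT)) != inverted;
    }

    public static void main(String[] args) {
        ThesaurusSketch ignored = new ThesaurusSketch(Arrays.asList(&quot;the&quot;, &quot;a&quot;, &quot;of&quot;), true);
        // Prints &quot;meaning&quot; and &quot;sentence&quot;, one per line
        for (String word : &quot;The meaning of a sentence&quot;.split(&quot; &quot;)) {
            if (ignored.isValid(word)) {
                System.out.println(word);
            }
        }
    }
}
</code></pre>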
2077<h3 id="utils"><a class="anchor" href="#utils">¶</a>Utils</h3>
2078<p>Several utility functions used across the codebase.</p>
2079<h3 id="wordmap"><a class="anchor" href="#wordmap">¶</a>WordMap</h3>
2080<p>This file is the important one, and its implementation hasn’t changed much since our last post. Instances of a word map contain… wait for it… a map of words! It stores the mapping <code>word → count</code> in memory, and offers methods to query the count of a word or iterate over the word count entries.</p>
2081<p>It can be loaded from cache or told to analyze a root path. Once an instance is created, additional files could be analyzed one by one if desired.</p>
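<p>As a rough idea of the interface described above, a word map could look like the following sketch. The names <code>WordMapSketch</code>, <code>analyze</code> and <code>count</code> are made up for this example, and it analyzes strings rather than files or a cache to stay self-contained.</p>
<pre><code>import java.util.*;
import java.util.regex.*;

class WordMapSketch implements Iterable&lt;Map.Entry&lt;String, Integer&gt;&gt; {
    private static final Pattern WORDS = Pattern.compile(&quot;\\w+&quot;);

    // word -&gt; count, kept sorted by word
    private final Map&lt;String, Integer&gt; counts = new TreeMap&lt;&gt;();

    // Tokenizes the text and adds one to the count of every word found
    void analyze(String text) {
        Matcher matcher = WORDS.matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
    }

    // Returns how many times the word was seen, or 0 if it never was
    int count(String word) {
        return counts.getOrDefault(word.toLowerCase(Locale.ROOT), 0);
    }

    // Read-only iteration over the (word, count) entries
    @Override
    public Iterator&lt;Map.Entry&lt;String, Integer&gt;&gt; iterator() {
        return Collections.unmodifiableMap(counts).entrySet().iterator();
    }

    public static void main(String[] args) {
        WordMapSketch map = new WordMapSketch();
        map.analyze(&quot;to be or not to be&quot;);
        map.analyze(&quot;To crawl&quot;); // more text can be added later
        for (Map.Entry&lt;String, Integer&gt; entry : map) {
            System.out.println(entry.getKey() + &quot;\t&quot; + entry.getValue());
        }
    }
}
</code></pre>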
2082<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
2083<p>The code was getting a bit too large to embed within the blog post itself, so instead you can download it as a <code>.zip</code> file.</p>
2084<p><em>download removed</em></p>
2085</main>
2086</body>
2087</html>
2088 </content></entry><entry><title>Cassandra: an Introduction</title><id>dist/cassandra-an-introduction/index.html</id><updated>2020-03-17T23:00:00+00:00</updated><published>2020-03-04T23:00:00+00:00</published><summary>This is the first post in the Cassandra series, where we will introduce the Cassandra database system and take a look at its features and installation methods.</summary><content type="html" src="dist/cassandra-an-introduction/index.html"><!DOCTYPE html>
2089<html>
2090<head>
2091<meta charset="utf-8" />
2092<meta name="viewport" content="width=device-width, initial-scale=1" />
2093<title>Cassandra: an Introduction</title>
2094<link rel="stylesheet" href="../css/style.css">
2095</head>
2096<body>
2097<main>
2098<p>This is the first post in the Cassandra series, where we will introduce the Cassandra database system and take a look at its features and installation methods.</p>
2099<div class="date-created-modified">Created 2020-03-05<br>
2100Modified 2020-03-18</div>
2101<p>Other posts in this series:</p>
2102<ul>
2103<li><a href="/blog/ribw/cassandra-an-introduction/">Cassandra: an Introduction</a> (this post)</li>
2104</ul>
2105<p>This post is co-authored with Classmate.</p>
2106<hr />
2107<div class="image-container">
2108<img src="cassandra-database-e1584191543401.jpg" alt="NoSQL database – Apache Cassandra – First delivery" />
2109<div class="image-caption"></div>
2110</div>
2112<h2 class="title" id="purpose_of_technology"><a class="anchor" href="#purpose_of_technology">¶</a>Purpose of technology</h2>
2113<p>Apache Cassandra is a <strong>NoSQL</strong>, <strong>open-source</strong>, <strong>distributed “key-value” database</strong>. It handles <strong>large volumes of distributed data</strong>. The main <strong>goal</strong> is to provide <strong>linear scalability and availability without compromising performance</strong>. Besides, Cassandra <strong>supports replication</strong> across multiple datacenters, providing low latency.</p>
2114<h2 id="how_it_works"><a class="anchor" href="#how_it_works">¶</a>How it works</h2>
2115<p>Cassandra’s distributed <strong>architecture</strong> is based on a series of <strong>equal nodes</strong> that communicate with a <strong>P2P protocol</strong> so that <strong>redundancy is maximum</strong>. It offers robust support for multiple datacenters, with <strong>asynchronous replication</strong> without the need for a master server.</p>
2116<p>Besides, Cassandra’s <strong>data model consists of partitioning the rows</strong>, which are rearranged into <strong>different tables</strong>. The primary keys of each table have a first component that is the <strong>partition key</strong>. Within a partition, the rows are grouped by the remaining columns of the key. The other columns can be indexed separately from the primary key.</p>
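<p>A toy in-memory model may help picture this layout. The sketch below is neither Cassandra’s API nor how it stores data on disk; the class <code>PartitionSketch</code>, its methods and the sensor readings are assumptions made purely for illustration. The outer map plays the role of the partition key, and the inner sorted map keeps the rows of a partition ordered by the rest of the key.</p>
<pre><code>import java.util.*;

class PartitionSketch {
    // partition key -&gt; (remaining key column -&gt; row value), sorted inside each partition
    private final Map&lt;String, SortedMap&lt;Integer, String&gt;&gt; partitions = new HashMap&lt;&gt;();

    void insert(String partitionKey, int clusteringKey, String value) {
        partitions.computeIfAbsent(partitionKey, k -&gt; new TreeMap&lt;&gt;()).put(clusteringKey, value);
    }

    // Reading a whole partition only needs the partition key
    SortedMap&lt;Integer, String&gt; readPartition(String partitionKey) {
        SortedMap&lt;Integer, String&gt; rows = partitions.get(partitionKey);
        return rows == null ? new TreeMap&lt;Integer, String&gt;() : rows;
    }

    public static void main(String[] args) {
        PartitionSketch readings = new PartitionSketch();
        readings.insert(&quot;sensor-1&quot;, 2, &quot;21.5&quot;);
        readings.insert(&quot;sensor-1&quot;, 1, &quot;20.9&quot;);
        readings.insert(&quot;sensor-2&quot;, 1, &quot;18.0&quot;);

        // prints {1=20.9, 2=21.5}
        System.out.println(readings.readPartition(&quot;sensor-1&quot;));
    }
}
</code></pre>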
2117<p>These tables can be <strong>created, deleted, updated and queried at runtime without blocking</strong> each other. However, it does <strong>not support joins or subqueries</strong>; instead, it <strong>emphasizes denormalization</strong> through features like collections.</p>
2118<p>Nowadays, Cassandra uses its own query language called <strong>CQL</strong> (<strong>Cassandra Query Language</strong>), with a <strong>similar syntax to SQL</strong>. It also allows access from <strong>JDBC</strong>.</p>
2119<p><img src="s0GHpggGZXOFcdhypRWV4trU-PkSI6lukEv54pLZnoirh0GlDVAc4LamB1Dy.png" alt="" />
2120<em>Cassandra architecture</em></p>
2121<h2 id="features"><a class="anchor" href="#features">¶</a>Features</h2>
2122<ul>
2123<li><strong>Decentralized</strong>: there are <strong>no single points of failure</strong>. Every <strong>node</strong> in the cluster has the <strong>same role</strong> and there is <strong>no master node</strong>, so each node <strong>can service any request</strong>. Besides, the data is distributed across the cluster.</li>
2124<li>Supports <strong>replication</strong> across multiple <strong>data centers</strong>: the replication strategies are <strong>configurable</strong>.</li>
2125<li><strong>Scalability</strong>: read and write performance increases linearly as new nodes are added. Also, <strong>new nodes</strong> can be <strong>added without interrupting</strong> application <strong>execution</strong>.</li>
2126<li><strong>Fault tolerance</strong>: <strong>data replication</strong> is done <strong>automatically</strong> across several nodes in order to recover from failures. It is possible to <strong>replace failed nodes without downtime or interruptions</strong> to the application.</li>
2127<li><strong>Consistency</strong>: a choice of consistency level is provided for <strong>reading and writing</strong>.</li>
2128<li><strong>MapReduce support</strong>: it is <strong>integrated</strong> with <strong>Apache Hadoop</strong> to support MapReduce.</li>
2129<li><strong>Query language</strong>: it has its own query language called <strong>CQL (Cassandra Query Language)</strong>.</li>
2130</ul>
2131<h2 id="corner_in_cap_theorem"><a class="anchor" href="#corner_in_cap_theorem">¶</a>Corner in CAP theorem</h2>
2132<p><strong>Apache Cassandra</strong> is usually described as an “<strong>AP</strong>” system because it guarantees <strong>availability</strong> and <strong>partition/fault tolerance</strong>. So it errs on the side of ensuring data availability even if this means <strong>sacrificing consistency</strong>. Despite this, Apache Cassandra <strong>seeks to satisfy all three requirements</strong> (Consistency, Availability and Partition tolerance) simultaneously and can be <strong>configured to behave</strong> like a “<strong>CP</strong>” database, guaranteeing <strong>consistency and partition/fault tolerance</strong>.</p>
2133<p><img src="rf3n9LTOKCQVbx4qrn7NPSVcRcwE1LxR_khi-9Qc51Hcbg6BHHPu-0GZjUwD.png" alt="" />
2134<em>Cassandra in CAP Theorem</em></p>
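<p>One way to reason about this configurable behaviour, which is not covered in the text above, is the overlap rule commonly used with replicated data: if a write is acknowledged by <em>W</em> replicas and a read consults <em>R</em> replicas out of <em>N</em> copies, the read is guaranteed to include the latest write when <em>R + W &gt; N</em>. The class and method names in this sketch are assumptions; only the arithmetic matters.</p>
<pre><code>class ConsistencySketch {
    // Number of replicas that form a majority (quorum) for a replication factor
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when every read set must overlap every acknowledged write set
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas &gt; replicationFactor;
    }

    public static void main(String[] args) {
        int replicationFactor = 3;
        int q = quorum(replicationFactor); // 2 out of 3

        // prints: one/one on 3 replicas: false
        System.out.println(&quot;one/one on 3 replicas: &quot; + readsSeeLatestWrite(1, 1, replicationFactor));
        // prints: quorum/quorum on 3 replicas: true
        System.out.println(&quot;quorum/quorum on 3 replicas: &quot; + readsSeeLatestWrite(q, q, replicationFactor));
    }
}
</code></pre>
<p>Reading and writing with a single replica favours availability and latency, while requiring a quorum on both sides trades some of that for consistency.</p>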
2135<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
2136<p>To download the <code>.tar.gz</code> file, visit the <a href="https://cassandra.apache.org/download/">download site</a> and click on the file “<a href="https://ftp.cixug.es/apache/cassandra/3.11.6/apache-cassandra-3.11.6-bin.tar.gz">https://ftp.cixug.es/apache/cassandra/3.11.6/apache-cassandra-3.11.6-bin.tar.gz</a>”. Note that this link points to version 3.11.6.</p>
2137<h2 id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
2138<p>This database can only be installed on Linux distributions and Mac OS X systems, so it is not possible to install it on Microsoft Windows.</p>
2139<p>The first main requirement is having Java 8 installed on <strong>Ubuntu</strong>, the OS that we will use. Therefore, the Java 8 installation is explained below. First, open a terminal and execute the following commands:</p>
2140<pre><code>sudo apt update
2141sudo apt install openjdk-8-jdk openjdk-8-jre
2142</code></pre>
2143<p>To set Java’s environment variable, open the file “~/.bashrc”:</p>
2144<pre><code>nano ~/.bashrc
2145</code></pre>
2146<p>And add at the end of it the path where Java is installed, as follows: </p>
2147<pre><code>export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
2148export PATH=$PATH:$JAVA_HOME/bin
2149</code></pre>
2150<p>At this point, save the file and execute the next command. Note that re-opening the terminal has the same effect:</p>
2151<pre><code>source ~/.bashrc
2152</code></pre>
2153<p>In order to check if the Java environment variable is set correctly, run the next command: </p>
2154<pre><code>echo $JAVA_HOME
2155</code></pre>
2156<p><img src="JUUmX5MIHynJR_K9EdCgKeJcpINeCGRRt2QRu4JLPtRhCVidOhcbWwVTQjyu.png" alt="" />
2157<em>$JAVA_HOME variable</em></p>
2158<p>Afterwards, it is possible to check the installed Java version with the command: </p>
2159<pre><code>java -version
2160</code></pre>
2161<p><img src="z9v1-0hpZwjI4U5UZej9cRGN5-Y4AZl0WUPWyQ_-JlzTAIvZtTFPnKY2xMQ_.png" alt="" />
2162<em>Java version</em></p>
2163<p>The next requirement is having the latest version of Python 2.7 installed. This can be checked with the command:</p>
2164<pre><code>python --version
2165</code></pre>
2166<p>If it is not installed, installing it is as simple as running the next command in the terminal:</p>
2167<pre><code>sudo apt install python
2168</code></pre>
2169<p>Note: it is better to use “python2” instead of “python” because that way you force the use of Python 2.7. Modern distributions will use Python 3 for the «python» command.</p>
2170<p>Afterwards, it is possible to check the installed Python version with the command:</p>
2171<pre><code>python --version
2172</code></pre>
2173<p><img src="Ger5Vw_e1HIK84QgRub-BwGmzIGKasgiYb4jHdfRNRrvG4d6Msp_3Vk62-9i.png" alt="" />
2174<em>Python version</em></p>
2175<p>Once both requirements are ready, the next step is to extract the previously downloaded file. Either right click on the file and select “Extract here”, or run the next command in the directory where the downloaded file is:</p>
2176<pre><code>tar -zxvf apache-cassandra-x.x.x-bin.tar.gz
2177</code></pre>
2178<p>To check that the installation is complete, you can execute the next command in the root folder of the project. This will start Cassandra in a single node.</p>
2179<pre><code>bin/cassandra
2180</code></pre>
2181<p>It is possible to get some data from Cassandra with CQL (Cassandra Query Language). To check this, execute the next command in another terminal, also from the root folder.</p>
2182<pre><code>bin/cqlsh localhost
2183</code></pre>
2184<p>Once the CQL shell is open, type the next statement and check the result:</p>
2185<pre><code>SELECT cluster_name, listen_address from system.local;
2186</code></pre>
2187<p>The output should be:</p>
2188<p><img src="miUO60A-RtyEAOOVFJqlkPRC18H4RKUhot6RWzhO9FmtzgTPOYHFtwxqgZEf.png" alt="" />
2189<em>Statement output</em></p>
2190<p>Finally, the official installation guide is available on the database’s website: <a href="https://cassandra.apache.org/doc/latest/getting_started/installing.html">installation guide</a>.</p>
2191<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
2192<ul>
2193<li><a href="https://es.wikipedia.org/wiki/Apache_Cassandra">Wikipedia</a></li>
2194<li><a href="https://cassandra.apache.org/">Apache Cassandra</a></li>
2195<li><a href="https://www.datastax.com/blog/2019/05/how-apache-cassandratm-balances-consistency-availability-and-performance">Datastax</a></li>
2196<li><a href="https://blog.yugabyte.com/apache-cassandra-architecture-how-it-works-lightweight-transactions/">yugabyte</a></li>
2197</ul>
2198</main>
2199</body>
2200</html>
2201 </content></entry><entry><title>Privado: PC-Crawler evaluation</title><id>dist/pc-crawler-evaluation/index.html</id><updated>2020-03-17T23:00:00+00:00</updated><published>2020-03-03T23:00:00+00:00</published><summary>As the student </summary><content type="html" src="dist/pc-crawler-evaluation/index.html"><!DOCTYPE html>
2202<html>
2203<head>
2204<meta charset="utf-8" />
2205<meta name="viewport" content="width=device-width, initial-scale=1" />
2206<title>Privado: PC-Crawler evaluation</title>
2207<link rel="stylesheet" href="../css/style.css">
2208</head>
2209<body>
2210<main>
2211<p>As the student <code>a(i)</code> where <code>i = 9</code>, I have been assigned to evaluate students <code>a(i + 3)</code> and <code>a(i + 4)</code>, these being:</p>
2212<div class="date-created-modified">Created 2020-03-04<br>
2213Modified 2020-03-18</div>
2214<ul>
2215<li>a12: Classmate (username)</li>
2216<li>a13: Classmate (username)</li>
2217</ul>
2218<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
2219<p><strong>Grading: B.</strong></p>
2220<p>I think they somewhat mix up their considerations with the program’s usage and how it works, without justifying why they chose those considerations or what the alternatives would be.</p>
2221<p>The implementation notes are quite well-written. Even someone without knowledge of Java’s syntax can read the notes and more or less make sense of what’s going on, with the relevant code excerpts on each section.</p>
2222<p>Implementation-wise, some methods could definitely use some improvement:</p>
2223<ul>
2224<li><code>esExtensionTextual</code> is overly complicated. It could use a <code>for</code> loop and Java’s <code>String.endsWith</code>.</li>
2225<li><code>calcularFrecuencia</code> has quite some duplication (e.g. <code>this.getFicherosYDirectorios().remove(0)</code>) and could definitely be cleaned up.</li>
2226</ul>
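<p>As an illustration of the first suggestion, a check along these lines would be enough. The class name, method name and extension list here are hypothetical and are not taken from the evaluated code.</p>
<pre><code>class TextExtensionSketch {
    // Extensions treated as text; the list itself is an assumption for this example
    private static final String[] TEXT_EXTENSIONS = {&quot;.txt&quot;, &quot;.java&quot;, &quot;.html&quot;};

    static boolean isTextFile(String fileName) {
        String name = fileName.toLowerCase();
        for (String extension : TEXT_EXTENSIONS) {
            if (name.endsWith(extension)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isTextFile(&quot;Notes.TXT&quot;)); // prints true
        System.out.println(isTextFile(&quot;photo.png&quot;)); // prints false
    }
}
</code></pre>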
2227<p>However, all the desired functionality is implemented.</p>
2228<p>Style-wise, the placement of some newlines and the omission of braces on <code>if</code> and <code>while</code> could be changed to improve readability.</p>
2229<p>The post is written in Spanish, but uses some words that don’t translate well («remover» could better be said as «eliminar» or «quitar»).</p>
2230<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
2231<p><strong>Grading: B.</strong></p>
2232<p>Their post starts with an explanation of what a crawler is, common uses for them, and what type of crawler they will be developing. This is a very good start. Regarding the post style, it seems they are not properly using some of WordPress’s features, such as lists, and instead rely on paragraphs with special characters prefixing each list item.</p>
2233<p>The post also contains some details on how to install the requirements to run the program, which can be very useful for someone not used to working with Java.</p>
2234<p>They do not explain their implementation and the filename of the download has a typo.</p>
2235<p>Implementation-wise, the code seems to be well-organized, into several packages and files, although the naming is a bit inconsistent. They even designed a GUI, which is quite impressive.</p>
2236<p>Some of the methods are documented, although the code inside them is sparsely commented, and the rationale for the chosen data structures is missing. There also seem to be several other unused main functions, and I’m unsure why they were kept.</p>
2237<p>However, all the desired functionality is implemented.</p>
2238<p>Similar to Classmate, the code style could be improved and settled on some standard, and the code could make use of Java features such as <code>for</code> loops over iterators instead of manual loops.</p>
2239</main>
2240</body>
2241</html>
2242 </content></entry><entry><title>Introduction to NoSQL</title><id>dist/introduction-to-nosql/index.html</id><updated>2020-03-17T23:00:00+00:00</updated><published>2020-02-24T23:00:00+00:00</published><summary>This post will primarily focus on the talk held in the </summary><content type="html" src="dist/introduction-to-nosql/index.html"><!DOCTYPE html>
2243<html>
2244<head>
2245<meta charset="utf-8" />
2246<meta name="viewport" content="width=device-width, initial-scale=1" />
2247<title>Introduction to NoSQL</title>
2248<link rel="stylesheet" href="../css/style.css">
2249</head>
2250<body>
2251<main>
2252<p>This post will primarily focus on the talk held in the <a href="https://youtu.be/qI_g07C_Q5I">GOTO 2012 conference: Introduction to NoSQL by Martin Fowler</a>. It can be seen as an informal, summarized transcript of the talk.</p>
2253<div class="date-created-modified">Created 2020-02-25<br>
2254Modified 2020-03-18</div>
2255<hr />
2256<p>The relational database model is affected by the <em><a href="https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch">impedance mismatch problem</a></em>. This occurs because we have to match our high-level design with the separate columns and rows used by relational databases.</p>
2257<p>Taking the in-memory objects and putting them into a relational database (which were dominant at the time) simply didn’t work out. Why? Relational databases were more than just databases, they served as an integration mechanism across applications, up to the 2000s. For 20 years!</p>
2258<p>With the rise of the Internet and the sheer amount of traffic, databases needed to scale. Unfortunately, relational databases only scale well vertically (by upgrading a <em>single</em> node). This is <em>very</em> expensive, and not something many could afford.</p>
2259<p>The problem is those pesky <code>JOIN</code>s, and their friend <code>GROUP BY</code>. Because our program and reality model don’t match the tables used by SQL, we have to rely on them to query the data. It is because the model doesn’t map directly.</p>
2260<p>Furthermore, graphs don’t map very well at all to relational models.</p>
2261<p>We needed a way to scale horizontally (by increasing the <em>number</em> of nodes), something relational databases were not designed to do.</p>
2262<blockquote>
2263<p><em>We need to do something different, relational across nodes is an unnatural act</em></p>
2264</blockquote>
2265<p>This inspired the NoSQL movement.</p>
2266<blockquote>
2267<p><em>#nosql was only meant to be a hashtag to advertise it, but unfortunately it’s how it is called now</em></p>
2268</blockquote>
2269<p>It is not possible to define NoSQL, but we can identify some of its characteristics:</p>
2270<ul>
2271<li>
2272<p>Non-relational</p>
2273</li>
2274<li>
2275<p><strong>Cluster-friendly</strong> (this was the original spark)</p>
2276</li>
2277<li>
2278<p>Open-source (until now, generally)</p>
2279</li>
2280<li>
2281<p>21st century web culture</p>
2282</li>
2283<li>
2284<p>Schema-less (easier integration or conjugation of several models, structure aggregation)</p>
2285</li></ul>
2286<p>These databases use different data models to those used by the relational model. However, it is possible to identify 4 broad chunks (some may say 3, or even 2!):</p>
2287<ul><li>
2288<p><strong>Key-value store</strong>. With a certain key, you obtain the value corresponding to it. It knows nothing else, nor does it care. We say the data is opaque.</p>
2289</li>
2290<li>
2291<p><strong>Document-based</strong>. It stores an entire mass of documents with complex structure, normally through the use of JSON (XML has been left behind). Then, you can ask for certain fields, structures, or portions. We say the data is transparent.</p>
2292</li>
2293<li>
2294<p><strong>Column-family</strong>. There is a «row key», and within it we store multiple «column families» (columns that fit together, our aggregate). We access by row-key and column-family name.</p>
2295</li>
2296</ul>
2297<p>All of these kind of serve to store documents without any <em>explicit</em> schema. Just shove in anything! This gives a lot of flexibility and ease of migration, except… that’s not really true. There’s an <em>implicit</em> schema when querying.</p>
2298<p>For example, a query where we may do <code>anOrder['price'] * anOrder['quantity']</code> is assuming that <code>anOrder</code> has both a <code>price</code> and a <code>quantity</code>, and that both of these can be multiplied together. «Schema-less» is a fuzzy term.</p>
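<p>The same idea can be shown in a few lines of Java. In this sketch an order is just a map that accepts any field, yet the query only works for documents that follow the implicit schema. The class name <code>ImplicitSchemaSketch</code> and the sample values are assumptions for the example.</p>
<pre><code>import java.util.*;

class ImplicitSchemaSketch {
    // The &quot;store&quot; accepts any document, but this query assumes two numeric fields exist
    static double total(Map&lt;String, Object&gt; anOrder) {
        Object price = anOrder.get(&quot;price&quot;);
        Object quantity = anOrder.get(&quot;quantity&quot;);
        if (!(price instanceof Number) || !(quantity instanceof Number)) {
            throw new IllegalArgumentException(&quot;order does not match the implicit schema: &quot; + anOrder);
        }
        return ((Number) price).doubleValue() * ((Number) quantity).doubleValue();
    }

    public static void main(String[] args) {
        Map&lt;String, Object&gt; order = new HashMap&lt;&gt;();
        order.put(&quot;price&quot;, 2.5);
        order.put(&quot;quantity&quot;, 4);
        System.out.println(total(order)); // prints 10.0

        Map&lt;String, Object&gt; odd = new HashMap&lt;&gt;();
        odd.put(&quot;price&quot;, &quot;cheap&quot;); // nothing stopped us from storing this
        try {
            total(odd);
        } catch (IllegalArgumentException e) {
            System.out.println(&quot;rejected: &quot; + e.getMessage());
        }
    }
}
</code></pre>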
2299<p>However, it is the lack of a <em>fixed</em> schema that gives flexibility.</p>
2300<p>One could argue that the line between key-value and document-based is very fuzzy, and they would be right! Key-value databases often let you include additional metadata that behaves like an index, and in document-based, documents often have an identifier anyway.</p>
2301<p>The common notion between these three types is what matters. They save an entire structure as a <em>unit</em>. We can refer to these as «Aggregate Oriented Databases». Aggregate, because we group things when designing or modeling our systems, as opposed to relational databases that scatter the information across many tables.</p>
2302<p>There exists a notable outlier, though, and that’s:</p>
2303<ul>
2304<li><strong>Graph</strong> databases. They use a node-and-arc graph structure. They are great for moving along relationships between things. Ironically, relational databases are not very good at jumping across relationships! It is possible to perform very interesting queries in graph databases which would be really hard and costly on relational models. Unlike the aggregate-oriented databases, graphs break things into even smaller units.</li>
2305</ul>
2306<p>NoSQL is not <em>the</em> solution. It depends on how you’ll work with your data. Do you need an aggregate database? Will you have a lot of relationships? Or would the relational model be a good fit for you?</p>
2307<p>NoSQL, however, is a good fit for large-scale projects (data will <em>always</em> grow) and faster development (the impedance mismatch is drastically reduced).</p>
2308<p>Regardless of our choice, it is important to remember that NoSQL is a young technology, which is still evolving really fast (SQL has been stable for <em>decades</em>). But <em>polyglot persistence</em> is what matters. One must know the alternatives, and be able to choose.</p>
2309<hr />
2310<p>Relational databases have the well-known ACID properties: Atomicity, Consistency, Isolation and Durability.</p>
2311<p>NoSQL databases (except graph-based ones!) are about being BASE instead: Basically Available, Soft state, Eventual consistency.</p>
2312<p>SQL needs transactions because we don’t want to perform a read while we’re only half-way done with a write! The readers and writers are the problem, and ensuring consistency results in a performance hit, even if the risk is low (two writers are extremely rare but it still must be handled).</p>
2313<p>NoSQL on the other hand doesn’t need ACID because the aggregate <em>is</em> the transaction boundary. Even before NoSQL itself existed! Any update is atomic by nature. When updating many documents it <em>is</em> a problem, but this is very rare.</p>
2314<p>We have to distinguish between logical and replication consistency. If a conflict occurs during an update, it must be resolved to preserve the logical consistency. Replication consistency, on the other hand, is preserved when distributing the data across many machines, for example during sharding or copies.</p>
2315<p>Replication buys us more processing power and resilience (at the cost of more storage) in case some of the nodes die. But what happens if what dies is the communication across the nodes? We could drop the requests and preserve the consistency, or accept the risk to continue and instead preserve the availability.</p>
2316<p>The choice on whether trading consistency for availability is acceptable or not depends on the domain rules. It is the domain’s choice, the business people will choose. If you’re Amazon, you always want to be able to sell, but if you’re a bank, you probably don’t want your clients to have negative numbers in their account!</p>
2317<p>Regardless of what we do, in a distributed system, the CAP theorem always applies: Consistency, Availability, Partition tolerance (error tolerance). It is <strong>impossible</strong> to guarantee all 3 at 100%. Most of the time, it does work, but it is mathematically impossible to guarantee at 100%.</p>
2318<p>A database has to choose what to give up at some point. When designing a distributed system, this must be considered. Normally, the choice is made between consistency or response time.</p>
2319<h2 class="title" id="further_reading"><a class="anchor" href="#further_reading">¶</a>Further reading</h2>
2320<ul>
2321<li><a href="https://www.martinfowler.com/articles/nosql-intro-original.pdf">The future is: <del>NoSQL Databases</del> Polyglot Persistence</a></li>
2322<li><a href="https://www.thoughtworks.com/insights/blog/nosql-databases-overview">NoSQL Databases: An Overview</a></li>
2323</ul>
2324</main>
2325</body>
2326</html>
2327 </content></entry><entry><title>Build your own PC</title><id>dist/build-your-own-pc/index.html</id><updated>2020-03-17T23:00:00+00:00</updated><published>2020-02-24T23:00:00+00:00</published><summary>…where PC obviously stands for Personal Crawler</summary><content type="html" src="dist/build-your-own-pc/index.html"><!DOCTYPE html>
2328<html>
2329<head>
2330<meta charset="utf-8" />
2331<meta name="viewport" content="width=device-width, initial-scale=1" />
2332<title>Build your own PC</title>
2333<link rel="stylesheet" href="../css/style.css">
2334</head>
2335<body>
2336<main>
2337<p><em>…where PC obviously stands for Personal Crawler</em>.</p>
2338<div class="date-created-modified">Created 2020-02-25<br>
2339Modified 2020-03-18</div>
2340<hr />
2341<p>This post contains the source code for a very simple crawler written in Java. You can compile and run it on any file or directory, and it will calculate the frequency of all the words it finds.</p>
2342<h2 class="title" id="source_code"><a class="anchor" href="#source_code">¶</a>Source code</h2>
2343<p>Paste the following code in a new file called <code>Crawl.java</code>:</p>
2344<pre><code>import java.io.*;
2345import java.util.*;
2346import java.util.regex.Matcher;
2347import java.util.regex.Pattern;
2348
2349class Crawl {
2350 // Regex used to tokenize the words from a line of text
2351 private final static Pattern WORDS = Pattern.compile(&quot;\\w+&quot;);
2352
2353 // The file where we will cache our results
2354 private final static File INDEX_FILE = new File(&quot;index.bin&quot;);
2355
2356 // Helper method to determine if a file is a text file or not
2357 private static boolean isTextFile(File file) {
2358 String name = file.getName().toLowerCase();
2359 return name.endsWith(&quot;.txt&quot;)
2360 || name.endsWith(&quot;.java&quot;)
2361 || name.endsWith(&quot;.c&quot;)
2362 || name.endsWith(&quot;.cpp&quot;)
2363 || name.endsWith(&quot;.h&quot;)
2364 || name.endsWith(&quot;.hpp&quot;)
2365 || name.endsWith(&quot;.html&quot;)
2366 || name.endsWith(&quot;.css&quot;)
2367 || name.endsWith(&quot;.js&quot;);
2368 }
2369
2370 // Normalizes a string by converting it to lowercase and removing accents
2371 private static String normalize(String string) {
2372 return string.toLowerCase()
2373 .replace(&quot;á&quot;, &quot;a&quot;)
2374 .replace(&quot;é&quot;, &quot;e&quot;)
2375 .replace(&quot;í&quot;, &quot;i&quot;)
2376 .replace(&quot;ó&quot;, &quot;o&quot;)
2377 .replace(&quot;ú&quot;, &quot;u&quot;);
2378 }
2379
2380 // Iteratively fills the map with the count of words found on all the text files
2381 static void fillWordMap(Map&lt;String, Integer&gt; map, File root) throws IOException {
2382 // Our file queue begins with the root
2383 Queue&lt;File&gt; fileQueue = new ArrayDeque&lt;&gt;();
2384 fileQueue.add(root);
2385
2386 // For as long as the queue is not empty...
2387 File file;
2388 while ((file = fileQueue.poll()) != null) {
2389 if (!file.exists() || !file.canRead()) {
2390 // ...ignore files for which we don't have permission...
2391 System.err.println(&quot;warning: cannot read file: &quot; + file);
2392 } else if (file.isDirectory()) {
2393 // ...else if it's a directory, extend our queue with its files...
2394 File[] files = file.listFiles();
2395 if (files == null) {
2396 System.err.println(&quot;warning: cannot list dir: &quot; + file);
2397 } else {
2398 fileQueue.addAll(Arrays.asList(files));
2399 }
2400 } else if (isTextFile(file)) {
2401 // ...otherwise, count the words in the file.
2402 countWordsInFile(map, file);
2403 }
2404 }
2405 }
2406
2407 // Counts the words in a single file and adds the count to the map.
2408 public static void countWordsInFile(Map&lt;String, Integer&gt; map, File file) throws IOException {
2409 try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
2410
2411 String line;
2412 while ((line = reader.readLine()) != null) {
2413 Matcher matcher = WORDS.matcher(line);
2414 while (matcher.find()) {
2415 String token = normalize(matcher.group());
2416 Integer count = map.get(token);
2417 if (count == null) {
2418 map.put(token, 1);
2419 } else {
2420 map.put(token, count + 1);
2421 }
2422 }
2423 }
2424
2425 } // try-with-resources closes the reader, even if reading fails
2426 }
2427
2428 // Prints the map of word count to the desired output stream.
2429 public static void printWordMap(Map&lt;String, Integer&gt; map, PrintStream writer) {
2430 List&lt;String&gt; keys = new ArrayList&lt;&gt;(map.keySet());
2431 Collections.sort(keys);
2432 for (String key : keys) {
2433 writer.println(key + &quot;\t&quot; + map.get(key));
2434 }
2435 }
2436
2437 @SuppressWarnings(&quot;unchecked&quot;)
2438 public static void main(String[] args) throws IOException, ClassNotFoundException {
2439 // Validate arguments
2440 if (args.length == 1 &amp;&amp; args[0].equals(&quot;--help&quot;)) {
2441 System.err.println(&quot;usage: java Crawl [input]&quot;);
2442 return;
2443 }
2444
2445 File root = new File(args.length &gt; 0 ? args[0] : &quot;.&quot;);
2446
2447 // Loading or generating the map where we aggregate the data {word: count}
2448 Map&lt;String, Integer&gt; map;
2449 if (INDEX_FILE.isFile()) {
2450 System.err.println(&quot;Found existing index file: &quot; + INDEX_FILE);
2451 try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(INDEX_FILE))) {
2452 map = (Map&lt;String, Integer&gt;) ois.readObject();
2453 }
2454 } else {
2455 System.err.println(&quot;Index file not found: &quot; + INDEX_FILE + &quot;; indexing...&quot;);
2456 map = new TreeMap&lt;&gt;();
2457 fillWordMap(map, root);
            // Cache the results to avoid redoing the work next time
2459 try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(INDEX_FILE))) {
2460 out.writeObject(map);
2461 }
2462 }
2463
        // Ask the user in a loop to query for words
        Scanner scanner = new Scanner(System.in);
        while (true) {
            System.out.print(&quot;Enter a word to look up (or press Enter to quit): &quot;);
            System.out.flush();
            // Stop on end of input as well as on an empty line
            if (!scanner.hasNextLine()) {
                break;
            }
            String line = scanner.nextLine().trim();
            if (line.isEmpty()) {
                break;
            }

            line = normalize(line);
            Integer count = map.get(line);
            if (count == null) {
                System.out.printf(&quot;The word \&quot;%s\&quot; is not present%n&quot;, line);
            } else if (count == 1) {
                System.out.printf(&quot;The word \&quot;%s\&quot; is present 1 time%n&quot;, line);
            } else {
                System.out.printf(&quot;The word \&quot;%s\&quot; is present %d times%n&quot;, line, count);
            }
        }
2484 }
2485}
2486</code></pre>
2487<p>It can be compiled and executed as follows:</p>
2488<pre><code>javac Crawl.java
2489java Crawl
2490</code></pre>
2491<p>Instead of copy-pasting the code, you may also download it as a <code>.zip</code>:</p>
2492<p><em>(contents removed)</em></p>
2493<h2 id="addendum"><a class="anchor" href="#addendum">¶</a>Addendum</h2>
2494<p>The following simple function can be used if one desires to print the contents of a file:</p>
2495<pre><code>public static void printFile(File file) {
2496 if (isTextFile(file)) {
2497 System.out.println('\n' + file.getName());
2498 try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
2499 String line;
2500 while ((line = reader.readLine()) != null) {
2501 System.out.println(line);
2502 }
2503 } catch (FileNotFoundException ignored) {
2504 System.err.println(&quot;warning: file disappeared while reading: &quot; + file);
2505 } catch (IOException e) {
2506 e.printStackTrace();
2507 }
2508 }
2509}
2510</code></pre>
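<p>The <code>normalize</code> method in the program above only replaces five lowercase accented vowels. A more general variant, sketched here under the assumption that stripping every diacritic is acceptable, can lean on the standard <code>java.text.Normalizer</code> class. The class name <code>AccentStripper</code> is made up for this example, and this version also turns <code>ñ</code> into <code>n</code>, which the original does not do.</p>

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of the original Crawl.java.
class AccentStripper {
    // Lowercases the input, decomposes accented letters into a base letter
    // plus combining marks (NFD), and then removes the combining marks.
    static String normalize(String string) {
        String lower = string.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // \u00f3 is "o" with an acute accent
        System.out.println(normalize("Informaci\u00f3n")); // prints "informacion"
    }
}
```

<p>If both the index and the queries go through the same function, the lookups stay consistent regardless of which variant is used. An index cached with the old function would need to be rebuilt after switching.</p>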
2511</main>
2512</body>
2513</html>
2514 </content></entry><entry><title>About Boolean Retrieval</title><id>dist/about-boolean-retrieval/index.html</id><updated>2020-03-17T23:00:00+00:00</updated><published>2020-02-24T23:00:00+00:00</published><summary>This entry will discuss the section on the </summary><content type="html" src="dist/about-boolean-retrieval/index.html"><!DOCTYPE html>
2515<html>
2516<head>
2517<meta charset="utf-8" />
2518<meta name="viewport" content="width=device-width, initial-scale=1" />
2519<title>About Boolean Retrieval</title>
2520<link rel="stylesheet" href="../css/style.css">
2521</head>
2522<body>
2523<main>
<p>This entry discusses the <em><a href="https://nlp.stanford.edu/IR-book/pdf/01bool.pdf">Boolean retrieval</a></em> section of the book <em><a href="https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf">An Introduction to Information Retrieval</a></em>.</p>
2525<div class="date-created-modified">Created 2020-02-25<br>
2526Modified 2020-03-18</div>
2527<h2 class="title" id="summary_on_the_topic"><a class="anchor" href="#summary_on_the_topic">¶</a>Summary on the topic</h2>
<p>Boolean retrieval is one of the many ways to perform information retrieval (finding materials that satisfy an information need), often simply called <em>search</em>.</p>
<p>A simple way to retrieve information is to <em>grep</em> through the text (a term named after the Unix tool <code>grep</code>), scanning the text linearly and filtering it by certain criteria. However, this falls short when the volume of data grows, when more complex queries are desired, or when one seeks some sort of ranking.</p>
<p>To avoid linear scanning, we build an <em>index</em> and record, for each document, whether it contains each term out of our full dictionary of terms (which may be the words in a chapter or the words in the whole book). This results in a binary term-document <em>incidence matrix</em>. One such matrix might be:</p>
2531<table class="">
2532 <tbody>
2533 <tr>
2534 <td>
2535 <em>
2536 word/play
2537 </em>
2538 </td>
2539 <td>
2540 <strong>
2541 Antony and Cleopatra
2542 </strong>
2543 </td>
2544 <td>
2545 <strong>
2546 Julius Caesar
2547 </strong>
2548 </td>
2549 <td>
2550 <strong>
2551 The Tempest
2552 </strong>
2553 </td>
2554 <td>
2555 <strong>
2556 …
2557 </strong>
2558 </td>
2559 </tr>
2560 <tr>
2561 <td>
2562 <strong>
2563 Antony
2564 </strong>
2565 </td>
2566 <td>
2567 1
2568 </td>
2569 <td>
2570 1
2571 </td>
2572 <td>
2573 0
2574 </td>
2575 <td>
2576 </td>
2577 </tr>
2578 <tr>
2579 <td>
2580 <strong>
2581 Brutus
2582 </strong>
2583 </td>
2584 <td>
2585 1
2586 </td>
2587 <td>
2588 1
2589 </td>
2590 <td>
2591 0
2592 </td>
2593 <td>
2594 </td>
2595 </tr>
2596 <tr>
2597 <td>
2598 <strong>
2599 Caesar
2600 </strong>
2601 </td>
2602 <td>
2603 1
2604 </td>
2605 <td>
2606 1
2607 </td>
2608 <td>
2609 0
2610 </td>
2611 <td>
2612 </td>
2613 </tr>
2614 <tr>
2615 <td>
2616 <strong>
2617 Calpurnia
2618 </strong>
2619 </td>
2620 <td>
2621 0
2622 </td>
2623 <td>
2624 1
2625 </td>
2626 <td>
2627 0
2628 </td>
2629 <td>
2630 </td>
2631 </tr>
2632 <tr>
2633 <td>
2634 <strong>
2635 Cleopatra
2636 </strong>
2637 </td>
2638 <td>
2639 1
2640 </td>
2641 <td>
2642 0
2643 </td>
2644 <td>
2645 0
2646 </td>
2647 <td>
2648 </td>
2649 </tr>
2650 <tr>
2651 <td>
2652 <strong>
2653 mercy
2654 </strong>
2655 </td>
2656 <td>
2657 1
2658 </td>
2659 <td>
2660 0
2661 </td>
2662 <td>
2663 1
2664 </td>
2665 <td>
2666 </td>
2667 </tr>
2668 <tr>
2669 <td>
2670 <strong>
2671 worser
2672 </strong>
2673 </td>
2674 <td>
2675 1
2676 </td>
2677 <td>
2678 0
2679 </td>
2680 <td>
2681 1
2682 </td>
2683 <td>
2684 </td>
2685 </tr>
2686 <tr>
2687 <td>
2688 <strong>
2689 …
2690 </strong>
2691 </td>
2692 <td>
2693 </td>
2694 <td>
2695 </td>
2696 <td>
2697 </td>
2698 <td>
2699 </td>
2700 </tr>
2701 </tbody>
2702</table>
2703<p>We can look at this matrix’s rows or columns to obtain a vector for each term indicating where it appears, or a vector for each document indicating the terms it contains.</p>
2704<p>Now, answering a query such as <code>Brutus AND Caesar AND NOT Calpurnia</code> becomes trivial:</p>
2705<pre><code>VECTOR(Brutus) AND VECTOR(Caesar) AND COMPLEMENT(VECTOR(Calpurnia))
2706= 110 AND 110 AND COMPLEMENT(010)
2707= 110 AND 110 AND 101
2708= 100
2709</code></pre>
<p>The query is satisfied only by the first column, that is, <em>Antony and Cleopatra</em>.</p>
<p>The <em>Boolean retrieval model</em> is thus a model that treats each document as a set of terms, and in which any query can be expressed as a Boolean expression of terms combined with <code>OR</code>, <code>AND</code>, and <code>NOT</code>.</p>
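<p>The same evaluation can be sketched in Java with <code>java.util.BitSet</code>, where bit <em>i</em> stands for the <em>i</em>-th column of the matrix. The class and method names are assumptions made for this example, and only the three columns shown in the table are modelled.</p>

```java
import java.util.BitSet;

// Illustrative sketch; the names are hypothetical and not taken from the book.
class IncidenceQuery {
    // Builds a BitSet from a string such as "110", where position i is document i.
    static BitSet vector(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        return set;
    }

    // Evaluates "a AND b AND NOT c" over the first `documents` columns.
    static BitSet andAndNot(BitSet a, BitSet b, BitSet c, int documents) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        // The complement needs the number of documents to know where to stop
        BitSet complement = (BitSet) c.clone();
        complement.flip(0, documents);
        result.and(complement);
        return result;
    }

    public static void main(String[] args) {
        BitSet brutus = vector("110");
        BitSet caesar = vector("110");
        BitSet calpurnia = vector("010");
        // Prints "{0}": only the first document (index 0) matches
        System.out.println(andAndNot(brutus, caesar, calpurnia, 3));
    }
}
```

<p>The explicit complement mirrors the steps written above. In practice <code>BitSet.andNot</code> gives the same result without needing to know the number of documents.</p>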
<p>Now, building such a matrix is often not feasible due to the sheer amount of data (say, a matrix with 500,000 terms across 1,000,000 documents, each with roughly 1,000 terms). However, it is important to notice that most of the terms will be <em>missing</em> from any given document. With those sizes, the matrix has 500,000 × 1,000,000 = 5 × 10<sup>11</sup> cells, of which at most 1,000,000 × 1,000 = 10<sup>9</sup> can be 1, so 99.8% or more of the cells will be 0. We can instead record only the <em>positions</em> of the 1’s. This is known as an <em>inverted index</em>.</p>
<p>The inverted index is a dictionary of terms, each mapped to a list recording the documents in which it appears (its <em>postings</em>). To build one for Boolean retrieval, we would:</p>
2714<ol>
<li>Collect the documents to be indexed and assign each a unique identifier</li>
2716<li>Tokenize the text in the documents into a list of terms</li>
2717<li>Normalize the tokens, which now become indexing terms</li>
2718<li>Index the documents</li>
2719</ol>
2720<table class="">
2721 <tbody>
2722 <tr>
2723 <td>
2724 <strong>
2725 Dictionary
2726 </strong>
2727 </td>
2728 <td>
2729 <strong>
2730 Postings
2731 </strong>
2732 </td>
2733 </tr>
2734 <tr>
2735 <td>
2736 Brutus
2737 </td>
2738 <td>
2739 1, 2, 4, 11, 31, 45, 173, 174
2740 </td>
2741 </tr>
2742 <tr>
2743 <td>
2744 Caesar
2745 </td>
2746 <td>
2747 1, 2, 4, 5, 6, 16, 57, 132, …
2748 </td>
2749 </tr>
2750 <tr>
2751 <td>
2752 Calpurnia
2753 </td>
2754 <td>
2755 2, 31, 54, 101
2756 </td>
2757 </tr>
2758 <tr>
2759 <td>
2760 …
2761 </td>
2762 <td>
2763 </td>
2764 </tr>
2765 </tbody>
2766</table>
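<p>The four steps above can be sketched in Java as follows. This is a minimal illustration rather than a real indexer: the class name <code>InvertedIndexSketch</code>, the choice to tokenize by splitting on non-letters, and the choice to normalize by lowercasing are all assumptions made for the example.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of building an in-memory inverted index.
class InvertedIndexSketch {
    // Maps each term to the sorted set of document identifiers containing it.
    static SortedMap<String, SortedSet<Integer>> build(List<String> documents) {
        SortedMap<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < documents.size(); i++) {
            // Step 1: the identifier is the position in the list, starting at 1
            int documentId = i + 1;
            // Step 2: tokenize by splitting on anything that is not a letter
            for (String token : documents.get(i).split("[^\\p{L}]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                // Step 3: normalize the token so it becomes an indexing term
                String term = token.toLowerCase(Locale.ROOT);
                // Step 4: record the document in the term's postings
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(documentId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
            "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
            "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious");
        SortedMap<String, SortedSet<Integer>> index = build(documents);
        System.out.println("brutus  -> " + index.get("brutus"));
        System.out.println("capitol -> " + index.get("capitol"));
    }
}
```

<p>Using a <code>TreeMap</code> of <code>TreeSet</code>s keeps both the dictionary and every postings list sorted, which the intersection of postings relies on.</p>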
<p>To index the documents, sort the pairs <code>(term, document_id)</code> so that the terms are in alphabetical order, and merge multiple occurrences of a term in the same document into one. Then group the instances of the same term, splitting the result into a dictionary and a sorted postings list per term.</p>
2768<table class="">
2769 <tbody>
2770 <tr>
2771 <td>
2772 <strong>
2773 term
2774 </strong>
2775 </td>
2776 <td>
2777 <strong>
2778 document_id
2779 </strong>
2780 </td>
2781 </tr>
2782 <tr>
2783 <td>
2784 I
2785 </td>
2786 <td>
2787 1
2788 </td>
2789 </tr>
2790 <tr>
2791 <td>
2792 did
2793 </td>
2794 <td>
2795 1
2796 </td>
2797 </tr>
2798 <tr>
2799 <td>
2800 …
2801 </td>
2802 <td>
2803 </td>
2804 </tr>
2805 <tr>
2806 <td>
2807 with
2808 </td>
2809 <td>
2810 2
2811 </td>
2812 </tr>
2813 </tbody>
2814</table>
2815<table class="">
2816 <tbody>
2817 <tr>
2818 <td>
2819 <strong>
2820 term
2821 </strong>
2822 </td>
2823 <td>
2824 <strong>
2825 document_id
2826 </strong>
2827 </td>
2828 </tr>
2829 <tr>
2830 <td>
2831 be
2832 </td>
2833 <td>
2834 2
2835 </td>
2836 </tr>
2837 <tr>
2838 <td>
2839 brutus
2840 </td>
2841 <td>
2842 1
2843 </td>
2844 </tr>
2845 <tr>
2846 <td>
2847 brutus
2848 </td>
2849 <td>
2850 2
2851 </td>
2852 </tr>
2853 <tr>
2854 <td>
2855 …
2856 </td>
2857 <td>
2858 </td>
2859 </tr>
2860 </tbody>
2861</table>
2862<table class="">
2863 <tbody>
2864 <tr>
2865 <td>
2866 <strong>
2867 term
2868 </strong>
2869 </td>
2870 <td>
2871 <strong>
2872 frequency
2873 </strong>
2874 </td>
2875 <td>
2876 <strong>
2877 postings list
2878 </strong>
2879 </td>
2880 </tr>
2881 <tr>
2882 <td>
2883 be
2884 </td>
2885 <td>
2886 1
2887 </td>
2888 <td>
2889 2
2890 </td>
2891 </tr>
2892 <tr>
2893 <td>
2894 brutus
2895 </td>
2896 <td>
2897 2
2898 </td>
2899 <td>
2900 1, 2
2901 </td>
2902 </tr>
2903 <tr>
2904 <td>
2905 capitol
2906 </td>
2907 <td>
2908 1
2909 </td>
2910 <td>
2911 1
2912 </td>
2913 </tr>
2914 <tr>
2915 <td>
2916 …
2917 </td>
2918 <td>
2919 </td>
2920 <td>
2921 </td>
2922 </tr>
2923 </tbody>
2924</table>
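<p>The three tables above correspond to collecting the pairs, sorting them, and grouping them. A rough Java sketch of that sort-then-group step is shown below. The <code>Pair</code> class and the method names are hypothetical, and the frequency printed is simply the length of each postings list.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of sort-based index construction.
class SortBasedIndexing {
    // A (term, document_id) pair, as in the first two tables.
    static final class Pair {
        final String term;
        final int documentId;

        Pair(String term, int documentId) {
            this.term = term;
            this.documentId = documentId;
        }
    }

    // Sorts the pairs by term and then by document, drops duplicates,
    // and groups what is left into one postings list per term.
    static SortedMap<String, List<Integer>> group(List<Pair> pairs) {
        List<Pair> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.comparing((Pair p) -> p.term)
                .thenComparingInt(p -> p.documentId));

        SortedMap<String, List<Integer>> postings = new TreeMap<>();
        for (Pair p : sorted) {
            List<Integer> list = postings.computeIfAbsent(p.term, k -> new ArrayList<>());
            // The pairs are sorted, so a duplicate can only be the last element
            if (list.isEmpty() || list.get(list.size() - 1) != p.documentId) {
                list.add(p.documentId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<Pair> pairs = Arrays.asList(
            new Pair("brutus", 2), new Pair("be", 2), new Pair("brutus", 1),
            new Pair("capitol", 1), new Pair("brutus", 1));
        // One line per term: term, frequency, postings list
        for (Map.Entry<String, List<Integer>> e : group(pairs).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue().size() + "\t" + e.getValue());
        }
    }
}
```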
<p>Intersecting postings lists now becomes a matter of traversing both lists in order:</p>
2926<pre><code>Brutus : 1 -&gt; 2 -&gt; 4 -&gt; 11 -&gt; 31 -&gt; 45 -&gt; 173 -&gt; 174
2927Calpurnia: 2 -&gt; 31 -&gt; 54 -&gt; 101
2928Intersect: 2 -&gt; 31
2929</code></pre>
2930<p>A simple conjunctive query (e.g. <code>Brutus AND Calpurnia</code>) is executed as follows:</p>
2931<ol>
2932<li>Locate <code>Brutus</code> in the dictionary</li>
2933<li>Retrieve its postings</li>
2934<li>Locate <code>Calpurnia</code> in the dictionary</li>
2935<li>Retrieve its postings</li>
2936<li>Intersect (<em>merge</em>) both postings</li>
2937</ol>
<p>Since the lists are sorted, walking both of them can be done in <em>O(n)</em> time, where <em>n</em> is the combined length of the two lists. By also storing the frequency, we can optimize the order in which we execute arbitrary queries, although we won’t go into detail.</p>
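<p>As a sketch of that merge, the Java code below walks two sorted postings lists with one cursor each. It also shows one possible use of the stored frequency, namely intersecting the shortest lists first. The class and method names are assumptions for this example, and the <code>Caesar</code> list only contains the postings visible in the table above.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of intersecting sorted postings lists.
class PostingsIntersection {
    // Intersects two ascending lists in a single pass over both.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                result.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++; // advance whichever cursor points at the smaller identifier
            } else {
                j++;
            }
        }
        return result;
    }

    // Intersects one or more lists, shortest first, so that the
    // intermediate results stay as small as possible.
    static List<Integer> intersectAll(List<List<Integer>> lists) {
        List<List<Integer>> ordered = new ArrayList<>(lists);
        ordered.sort(Comparator.comparingInt(List::size));
        List<Integer> result = ordered.get(0);
        for (int k = 1; k < ordered.size() && !result.isEmpty(); k++) {
            result = intersect(result, ordered.get(k));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> brutus = Arrays.asList(1, 2, 4, 11, 31, 45, 173, 174);
        List<Integer> calpurnia = Arrays.asList(2, 31, 54, 101);
        // Only the postings shown in the table; the original list is elided
        List<Integer> caesar = Arrays.asList(1, 2, 4, 5, 6, 16, 57, 132);

        System.out.println(intersect(brutus, calpurnia)); // prints [2, 31]
        System.out.println(intersectAll(Arrays.asList(brutus, caesar, calpurnia))); // prints [2]
    }
}
```

<p>The index-based access assumes array-backed lists. With linked lists, the same walk would use two iterators instead.</p>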
2939<h2 id="thoughts"><a class="anchor" href="#thoughts">¶</a>Thoughts</h2>
<p>The Boolean retrieval model can be implemented with relative ease, and it allows compact storage and efficient querying of the information as long as we only intend to perform Boolean queries.</p>
2941<p>However, the basic design lacks other useful operations, such as a «near» operator, or the ability to rank the results.</p>
2942<p>All in all, it’s an interesting way to look at the data and query it efficiently.</p>
2943</main>
2944</body>
2945</html>
2946 </content></entry></feed>