<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Information Retrieval and Web Search</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<h1 class="title" id="information_retrieval_and_web_search"><a class="anchor" href="#information_retrieval_and_web_search">¶</a>Information Retrieval and Web Search</h1>
<div class="date-created-modified">2020-10-03</div>
<p>During 2020 at university, this subject ("Recuperación de la Información y Búsqueda en la Web")
had us write blog posts as assignments. I thought they were really fun to write, so I wanted to
preserve that work here, in the hope that it's interesting to someone.</p>
<p>The posts were auto-generated from the original HTML files and manually anonymized later.</p>
</main>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Privado: Final NoSQL evaluation</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This evaluation is a bit different to my <a href="/blog/ribw/16/nosql-evaluation/">previous one</a> because this time I have been tasked to evaluate the student <code>a(i - 2)</code>, and because I am <code>a = 9</code> that happens to be <code>a(7) =</code> Classmate.</p>
<div class="date-created-modified">Created 2020-05-13<br>
Modified 2020-05-14</div>
<p>Unfortunately for Classmate, the only entry related to NoSQL I have found in their blog is «Prima y segunda Actividad: Base de datos NoSQL», which does not develop an application as requested for the third entry (as of the 14th of May).</p>
<p>This means that, instead, I will evaluate <code>a(i - 3)</code> which happens to be <code>a(6) =</code> Classmate and they do have an entry.</p>
<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s Evaluation</h2>
<p><strong>Grading: B.</strong></p>
<p>The post I have evaluated is «BB.DD. NoSQL RethinkDB 3ª Fase. Aplicación».</p>
<p>It starts with an introduction, properly explaining what database they have chosen and why, but not what application they will be making.</p>
<p>This is detailed just below in the next section, although it’s a bit vague.</p>
<p>The next section talks about the Python dependencies that are required, but they never said they would be making a Python application or that we need to install Python!</p>
<p>The next section talks about the file structure of the project, and they detail what every part does, although I missed having some code snippets.</p>
<p>The final result is pretty cool and contains many interesting graphs; they provide a download for the source code and list all the relevant references used.</p>
<p>Except for a weird «necesario falta» in the text, it’s otherwise well-written, although given the issues above I cannot grade it with the highest score.</p>
</main>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Developing a Python application for MongoDB</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the third and last post in the MongoDB series, where we will develop a Python application to process and store OpenData inside Mongo.</p>
<div class="date-created-modified">Created 2020-03-25<br>
Modified 2020-04-16</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a></li>
<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a></li>
<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a> (this post)</li>
</ul>
<p>This post is co-authored with a Classmate.</p>
<hr />
<h2 class="title" id="what_are_we_making_"><a class="anchor" href="#what_are_we_making_">¶</a>What are we making?</h2>
<p>We are going to develop a web application that renders a map, in this case of the town of Cáceres, with which users can interact. When the user clicks somewhere on the map, the selected location will be sent to the server to process. The server will perform geospatial queries against Mongo, and once the results are ready, the information is presented back on the webpage.</p>
<p>The data used for the application comes from <a href="https://opendata.caceres.es/">Cáceres’ OpenData</a>, and our goal is that users will be able to find information about certain areas in a quick and intuitive way, such as precise coordinates, noise level, and such.</p>
<h2 id="what_are_we_using_"><a class="anchor" href="#what_are_we_using_">¶</a>What are we using?</h2>
<p>The web application will be using <a href="https://python.org/">Python</a> for the backend, <a href="https://svelte.dev/">Svelte</a> for the frontend, and <a href="https://www.mongodb.com/">Mongo</a> as our storage database and processing center.</p>
<ul>
<li><strong>Why Python?</strong> It’s a comfortable language to write and to read, and has a great ecosystem with <a href="https://pypi.org/">plenty of libraries</a>.</li>
<li><strong>Why Svelte?</strong> Svelte is the New Thing<strong>™</strong> in the world of component frameworks for JavaScript. It is similar to React or Vue, but compiled and with a lot less boilerplate. Check out their <a href="https://svelte.dev/blog/svelte-3-rethinking-reactivity">Svelte post</a> to learn more.</li>
<li><strong>Why Mongo?</strong> We believe NoSQL is the right approach for doing the kind of processing and storage that we expect, and it’s <a href="https://docs.mongodb.com/">very easy to use</a>. In addition, we will be making Geospatial Queries which <a href="https://docs.mongodb.com/manual/geospatial-queries/">Mongo supports</a>.</li>
</ul>
<p>Why didn’t we choose to make a smaller project, you may ask? You will be shocked to hear that we do not have an answer for that!</p>
<p>Note that we will not be embedding <strong>all</strong> the code of the project in this post, or it would be too long! We will include only the relevant snippets needed to understand the core ideas of the project, and not the unnecessary parts of it (for example, parsing configuration files to easily change the port where the server runs is not included).</p>
<h2 id="python_dependencies"><a class="anchor" href="#python_dependencies">¶</a>Python dependencies</h2>
<p>Because we will program it in Python, you need Python installed. You can install it using a package manager of your choice or heading over to the <a href="https://www.python.org/downloads/">Python downloads section</a>, but if you’re on Linux, chances are you have it installed already.</p>
<p>Once Python 3.7 or above is installed, install <a href="https://motor.readthedocs.io/en/stable/"><code>motor</code> (Asynchronous Python driver for MongoDB)</a> and the <a href="https://docs.aiohttp.org/en/stable/web.html"><code>aiohttp</code> server</a> through <code>pip</code>:</p>
<pre><code>pip install aiohttp motor
</code></pre>
<p>Make sure that Mongo is running in the background (this has been described in previous posts), and we should be able to get to work.</p>
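<p>As a quick sanity check, a snippet along these lines should print a successful reply once everything is in place (a minimal sketch, assuming Mongo listens on the default <code>localhost:27017</code>):</p>
<pre><code>import asyncio
import motor.motor_asyncio

async def main():
    client = motor.motor_asyncio.AsyncIOMotorClient()
    # ping the server; this raises if Mongo is not reachable
    print(await client.admin.command('ping'))

asyncio.run(main())
</code></pre>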
<h2 id="web_dependencies"><a class="anchor" href="#web_dependencies">¶</a>Web dependencies</h2>
<p>To work with Svelte and its dependencies, we will need <a href="https://www.npmjs.com/"><code>npm</code></a>, which comes with <a href="https://nodejs.org/en/">NodeJS</a>, so go and <a href="https://nodejs.org/en/download/">install Node from their site</a>. The download will be different depending on your operating system.</p>
<p>Following <a href="https://svelte.dev/blog/the-easiest-way-to-get-started">the easiest way to get started with Svelte</a>, we will put our project in a <code>client/</code> folder (because this is what the clients see, the frontend). Feel free to tinker a bit with the configuration files to change the name and such, although this isn’t relevant for the rest of the post.</p>
<h2 id="finding_the_data"><a class="anchor" href="#finding_the_data">¶</a>Finding the data</h2>
<p>We are going to work with the JSON files provided by <a href="http://opendata.caceres.es/">OpenData Cáceres</a>. In particular, we want information about the noise, census, vias and trees. To save you the time of <a href="http://opendata.caceres.es/dataset">searching for each of these</a>, we will automate the download with code.</p>
<p>If you want to save the data offline, or just want to know what data we’ll be using, you can right-click the following links and select «Save Link As…», using the link’s name as the file name:</p>
<ul>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:MedicionRuido&format=json"><code>noise.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:InformacionPadron&year=2017&format=json"><code>census.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:InformacionPadron&year=2017&format=json"><code>vias.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:Arbol&format=json"><code>trees.json</code></a></li>
</ul>
<h2 id="backend"><a class="anchor" href="#backend">¶</a>Backend</h2>
<p>It’s time to get started with some code! We will put it in a <code>server/</code> folder because it will contain the Python server, that is, the backend of our application.</p>
<p>We are using <code>aiohttp</code> because we would like our server to be <code>async</code>. We don’t expect a lot of users at the same time, but it’s good to know that our server would be well-designed for that use case. As a bonus, it makes IO points explicit in the code, which can help when reasoning about it. The implicit synchronization between <code>await</code> points is also a nice bonus.</p>
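<p>To give a feel for the shape of <code>aiohttp</code> code before diving into the real thing, here is a minimal, self-contained server (a throwaway sketch, unrelated to the project code below):</p>
<pre><code>from aiohttp import web

async def hello(request):
    # handlers are coroutines, so they can await IO without blocking others
    return web.json_response({'message': 'hello'})

app = web.Application()
app.router.add_routes([web.get('/', hello)])

if __name__ == '__main__':
    web.run_app(app)  # serves on port 8080 by default
</code></pre>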
<h3 id="saving_the_data_in_mongo"><a class="anchor" href="#saving_the_data_in_mongo">¶</a>Saving the data in Mongo</h3>
<p>Before running the server, we must ensure that the data we need is already stored and indexed in Mongo. Our <code>server/data.py</code> will take care of downloading the files, cleaning them up a little (Cáceres’ OpenData can be a bit awkward sometimes), inserting them into Mongo and indexing them.</p>
<p>Downloading the JSON data can be done with <a href="https://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.ClientSession.get"><code>ClientSession.get</code></a>. We also take this opportunity to clean up the messy encoding of the JSON, which does not seem to be UTF-8 in some cases.</p>
<pre><code>async def load_json(session, url):
    fixes = [(old, new.encode('utf-8')) for old, new in [
        (b'\xc3\x83\\u2018', 'Ñ'),
        (b'\xc3\x83\\u0081', 'Á'),
        (b'\xc3\x83\\u2030', 'É'),
        (b'\xc3\x83\\u008D', 'Í'),
        (b'\xc3\x83\\u201C', 'Ó'),
        (b'\xc3\x83\xc5\xa1', 'Ú'),
        (b'\xc3\x83\xc2\xa1', 'á'),
    ]]
    async with session.get(url) as resp:
        data = await resp.read()

    # Yes, this feels inefficient, but it's not really worth improving.
    for old, new in fixes:
        data = data.replace(old, new)

    data = data.decode('utf-8')
    return json.loads(data)
</code></pre>
<p>Later on, this function can be reused for the various URLs we need:</p>
<pre><code>import aiohttp

NOISE_URL = 'http://opendata.caceres.es/GetData/GetData?dataset=om:MedicionRuido&format=json'
# (...other needed URLs here)

async def insert_to_db(db):
    async with aiohttp.ClientSession() as session:
        data = await load_json(session, NOISE_URL)
        # now we have the JSON data cleaned up, ready to be parsed
</code></pre>
<h3 id="data_model"><a class="anchor" href="#data_model">¶</a>Data model</h3>
<p>With the JSON data in our hands, it’s time to parse it. Always remember to <a href="https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/">parse, don’t validate</a>. With <a href="https://docs.python.org/3/library/dataclasses.html">Python 3.7 <code>dataclasses</code></a> it’s trivial to define classes that will store only the fields we care about, typed, and with proper names:</p>
<pre><code>from dataclasses import dataclass
from typing import Tuple

Longitude = float
Latitude = float

@dataclass
class GSON:
    type: str
    coordinates: Tuple[Longitude, Latitude]

@dataclass
class Noise:
    id: int
    geo: GSON
    level: float
</code></pre>
<p>This makes it really easy to see that, if we have a <code>Noise</code>, we can access its <code>geo</code> data, which is a <code>GSON</code> with a <code>type</code> and <code>coordinates</code> (holding <code>Longitude</code> and <code>Latitude</code> respectively). <code>dataclasses</code> and <a href="https://docs.python.org/3/library/typing.html"><code>typing</code></a> make dealing with this very easy and clear.</p>
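<p>For example, constructing and accessing one of these values is as straightforward as one would hope (a throwaway snippet with made-up values):</p>
<pre><code>noise = Noise(
    id=1,
    geo=GSON(type='Point', coordinates=(-6.37, 39.47)),
    level=3.5,
)
print(noise.geo.coordinates)  # (-6.37, 39.47)
</code></pre>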
<p>Every dataclass will be on its own collection inside Mongo, and these are:</p>
<ul>
<li>Noise
<ul>
<li>Integer <code>id</code></li>
<li>GeoJSON <code>geo</code>
<ul>
<li>String <code>type</code></li>
<li>Longitude-latitude pair <code>coordinates</code></li>
</ul>
</li>
<li>Floating-point number <code>level</code></li>
</ul>
</li>
<li>Tree
<ul>
<li>String <code>name</code></li>
<li>String <code>gender</code></li>
<li>Integer <code>units</code></li>
<li>Floating-point number <code>height</code></li>
<li>Floating-point number <code>cup_diameter</code></li>
<li>Floating-point number <code>trunk_diameter</code></li>
<li>Optional string <code>variety</code></li>
<li>Optional string <code>distribution</code></li>
<li>GeoJSON <code>geo</code></li>
<li>Optional string <code>irrigation</code></li>
</ul>
</li>
<li>Census
<ul>
<li>Integer <code>year</code></li>
<li>Via <code>via</code>
<ul>
<li>String <code>name</code></li>
<li>String <code>kind</code></li>
<li>Integer <code>code</code></li>
<li>Optional string <code>history</code></li>
<li>Optional string <code>old_name</code></li>
<li>Optional floating-point number <code>length</code></li>
<li>Optional GeoJSON <code>start</code></li>
<li>GeoJSON <code>middle</code></li>
<li>Optional GeoJSON <code>end</code></li>
<li>Optional list with geometry pairs <code>geometry</code></li>
</ul>
</li>
<li>Integer <code>count</code></li>
<li>Mapping year-to-count <code>count_per_year</code></li>
<li>Mapping gender-to-count <code>count_per_gender</code></li>
<li>Mapping nationality-to-count <code>count_per_nationality</code></li>
<li>Integer <code>time_year</code></li>
</ul>
</li>
</ul>
<p>Now, let’s define a method to actually parse the JSON and yield instances from these new data classes:</p>
<pre><code>@classmethod
def iter_from_json(cls, data):
    for row in data['results']['bindings']:
        noise_id = int(row['uri']['value'].split('/')[-1])
        long = float(row['geo_long']['value'])
        lat = float(row['geo_lat']['value'])
        level = float(row['om_nivelRuido']['value'])
        yield cls(
            id=noise_id,
            geo=GSON(type='Point', coordinates=[long, lat]),
            level=level
        )
</code></pre>
<p>Here we iterate over the input JSON <code>data</code> bindings and <code>yield cls</code> instances with more consistent naming than the original one. We also extract the data from the many unnecessary nested levels of the JSON and have something a lot flatter to work with.</p>
<p>For those of you who don’t know what <code>yield</code> does (after all, not everyone is used to seeing generators), here are two functions that work nearly the same:</p>
<pre><code>def squares_return(n):
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result

def squares_yield(n):
    for i in range(n):
        yield i ** 2
</code></pre>
<p>The difference is that the one with <code>yield</code> is «lazy» and doesn’t need to do all the work up-front. It will generate (yield) more values as they are needed, for example when you use a <code>for</code> loop over it. Generally, it’s a better idea to create generator functions than to do all the work early, which may turn out to be unnecessary. See <a href="https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do">What does the «yield» keyword do?</a> if you still have questions.</p>
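<p>A quick way to convince yourself of this laziness is to pull values out of the generator one at a time:</p>
<pre><code>gen = squares_yield(3)
print(next(gen))   # 0 — computed on demand
print(next(gen))   # 1 — nothing beyond this has run yet
print(list(gen))   # [4] — consumes whatever remains
</code></pre>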
<p>With everything parsed, it’s time to insert the data into Mongo. If the data was not present yet (0 documents), then we will download the file, parse it, insert it as documents into the given Mongo <code>db</code>, and index it:</p>
<pre><code>from dataclasses import asdict

async def insert_to_db(db):
    async with aiohttp.ClientSession() as session:
        if await db.noise.estimated_document_count() == 0:
            data = await load_json(session, NOISE_URL)
            await db.noise.insert_many(asdict(noise) for noise in Noise.iter_from_json(data))
            await db.noise.create_index([('geo', '2dsphere')])
</code></pre>
<p>We repeat this process for all the other data, and just like that, Mongo is ready to be used in our server.</p>
<h3 id="indices"><a class="anchor" href="#indices">¶</a>Indices</h3>
<p>In order to execute our geospatial queries, we have to create an index on the attribute that represents the location, because the operators that we will use require it. This attribute can be a <a href="https://docs.mongodb.com/manual/reference/geojson/">GeoJSON object</a> or a legacy coordinate pair.</p>
<p>We have decided to use a GeoJSON object because we want to avoid legacy features that may be deprecated in the future.</p>
<p>The attribute is called <code>geo</code> for the <code>Tree</code> and <code>Noise</code> objects and <code>start</code>, <code>middle</code> or <code>end</code> for the <code>Via</code> class. In the <code>Via</code> we are going to index the attribute <code>middle</code> because it is the most representative field for us. Because the <code>Via</code> is inside the <code>Census</code> and it doesn’t have its own collection, we create the index on the <code>Census</code> collection.</p>
<p>The index type used is <code>2dsphere</code>, because it supports queries that work on geometries over an earth-like sphere. Another option is the <code>2d</code> index, but it’s not a good fit for our use case because it is meant for queries that calculate geometries on a two-dimensional plane.</p>
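<p>Putting this section together, the index creation could look like this with motor (a sketch; the collection names <code>noise</code>, <code>trees</code> and <code>census</code>, and the embedded <code>via.middle</code> path, are assumptions based on the data model above):</p>
<pre><code>async def create_indexes(db):
    await db.noise.create_index([('geo', '2dsphere')])
    await db.trees.create_index([('geo', '2dsphere')])
    # Via has no collection of its own, so the index goes on Census,
    # through the embedded document's path
    await db.census.create_index([('via.middle', '2dsphere')])
</code></pre>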
<h3 id="running_the_server"><a class="anchor" href="#running_the_server">¶</a>Running the server</h3>
<p>If we ignore the configuration part of the server creation, our <code>server.py</code> file is pretty simple. Its job is to create a <a href="https://aiohttp.readthedocs.io/en/stable/web.html">server application</a>, set up Mongo, and return it to the caller so that they can run it:</p>
<pre><code>import asyncio
import os
import subprocess

import motor.motor_asyncio
from aiohttp import web

from . import rest, data

def create_app():
    ret = subprocess.run('npm run build', cwd='../client', shell=True).returncode
    if ret != 0:
        exit(ret)

    db = motor.motor_asyncio.AsyncIOMotorClient().opendata
    loop = asyncio.get_event_loop()
    loop.run_until_complete(data.insert_to_db(db))

    app = web.Application()
    app['db'] = db
    app.router.add_routes([
        web.get('/', lambda r: web.HTTPSeeOther('/index.html')),
        *rest.ROUTES,
        # config comes from the configuration parsing omitted in this post
        web.static('/', os.path.join(config['www']['root'], 'public')),
    ])
    return app
</code></pre>
<p>There’s a bit going on here, but it’s nothing too complex:</p>
<ul>
<li>We automatically run <code>npm run build</code> on the frontend because it’s very comfortable to have the frontend built automatically before the server runs.</li>
<li>We create a Motor client and access the <code>opendata</code> database. Into it, we load the data, effectively saving it in Mongo for the server to use.</li>
<li>We create the server application and save a reference to the Mongo database in it, so that it can be used later on any endpoint without needing to recreate it.</li>
<li>We define the routes of our app: root, REST and static (where the frontend files live). We’ll get to the <code>rest</code> part soon.</li>
</ul>
<p>Running the server is now simple:</p>
<pre><code>def main():
    from aiohttp import web
    from . import server

    app = server.create_app()
    web.run_app(app)

if __name__ == '__main__':
    main()
</code></pre>
<h3 id="rest_endpoints"><a class="anchor" href="#rest_endpoints">¶</a>REST endpoints</h3>
<p>The frontend will communicate with the backend via <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">REST</a> calls, so that it can ask for things like «give me the information associated with this area», and the web server can query the Mongo server to reply with an HTTP response. This little diagram should help:</p>
<p><img src="bitmap.png" alt="" /></p>
<p>What we need to do, then, is define those REST endpoints we mentioned earlier when creating the server. We will process the HTTP request, ask Mongo for the data, and return the HTTP response:</p>
<pre><code>import pymongo
from aiohttp import web

async def get_area_info(request):
    try:
        long = float(request.query['long'])
        lat = float(request.query['lat'])
        distance = float(request.query['distance'])
    except KeyError as e:
        raise web.HTTPBadRequest(reason=f'a required parameter was missing: {e.args[0]}')
    except ValueError:
        raise web.HTTPBadRequest(reason='one of the parameters was not a valid float')

    geo_avg_noise_pipeline = [{
        '$geoNear': {
            'near': {'type': 'Point', 'coordinates': [long, lat]},
            'maxDistance': distance,
            'minDistance': 0,
            'spherical': 'true',
            'distanceField': 'distance'
        }
    }]

    db = request.app['db']
    try:
        noise_count, sum_noise = 0, 0
        async for item in db.noise.aggregate(geo_avg_noise_pipeline):
            noise_count += 1
            sum_noise += item['level']

        if noise_count != 0:
            avg_noise = sum_noise / noise_count
        else:
            avg_noise = None
    except pymongo.errors.ConnectionFailure:
        raise web.HTTPServiceUnavailable(reason='no connection to database')

    # tree_count, trees_per_type and census_count are computed with similar
    # queries in the real code; they are omitted here for brevity.
    return web.json_response({
        'tree_count': tree_count,
        'trees_per_type': [[k, v] for k, v in trees_per_type.items()],
        'census_count': census_count,
        'avg_noise': avg_noise,
    })

ROUTES = [
    web.get('/rest/get-area-info', get_area_info)
]
</code></pre>
<p>In this code, we’re only showing how to return the average noise because that’s the simplest thing we can do. The real code also fetches the tree count, the tree count per type, and the census count.</p>
<p>Again, there’s quite a bit to go through, so let’s go step by step:</p>
<ul>
<li>We parse the frontend’s <code>request.query</code> into <code>float</code> that we can use. In particular, the frontend is asking us for information at a certain latitude, longitude, and distance. If the query is malformed, we return a proper error.</li>
<li>We create our query for Mongo outside, just so it’s clearer to read.</li>
<li>We access the database reference we stored earlier when creating the server with <code>request.app['db']</code>. Handy!</li>
<li>We try to query Mongo. It may fail if the Mongo server is not running, so we should handle that and tell the client what’s happening. If it succeeds though, we will gather information about the average noise.</li>
<li>We return a <code>json_response</code> with Mongo results for the frontend to present to the user.</li>
</ul>
<p>You may have noticed we defined a <code>ROUTES</code> list at the bottom. This will make it easier to expand in the future, and the server creation won’t need to change anything in its code, because it’s already unpacking all the routes we define here.</p>
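<p>As an illustration of one of those omitted computations, the tree count per type could be obtained with a pipeline much like the noise one, plus a <code>$group</code> stage (a hypothetical sketch; it assumes the collection is called <code>trees</code> and groups by the <code>name</code> field from the Tree model):</p>
<pre><code>tree_type_pipeline = [
    {'$geoNear': {
        'near': {'type': 'Point', 'coordinates': [long, lat]},
        'maxDistance': distance,
        'minDistance': 0,
        'spherical': True,
        'distanceField': 'distance',
    }},
    # count one document per distinct tree name
    {'$group': {'_id': '$name', 'count': {'$sum': 1}}},
]

trees_per_type = {}
async for item in db.trees.aggregate(tree_type_pipeline):
    trees_per_type[item['_id']] = item['count']
</code></pre>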
<h3 id="geospatial_queries"><a class="anchor" href="#geospatial_queries">¶</a>Geospatial queries</h3>
<p>In order to retrieve the information from the Mongo database, we have defined two geospatial queries:</p>
<pre><code>geo_query = {
    '$nearSphere': {
        '$geometry': {
            'type': 'Point',
            'coordinates': [long, lat]
        },
        '$maxDistance': distance,
        '$minDistance': 0
    }
}
</code></pre>
<p>This query uses <a href="https://docs.mongodb.com/manual/reference/operator/query/nearSphere/#op._S_nearSphere">the operator <code>$nearSphere</code></a>, which returns geospatial objects in proximity to a point on a sphere.</p>
<p>The sphere point is represented by the <code>$geometry</code> operator, which specifies the type of geometry and the coordinates (given by the HTTP request).</p>
<p>The maximum and minimum distances are represented by <code>$maxDistance</code> and <code>$minDistance</code> respectively. We specify that the maximum distance is the radius selected by the user.</p>
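<p>For reference, this is roughly how the backend could run such a query with motor, here against the <code>geo</code> field of the trees collection (a sketch; the collection name is an assumption):</p>
<pre><code>async def find_near(db, long, lat, distance):
    geo_query = {
        '$nearSphere': {
            '$geometry': {'type': 'Point', 'coordinates': [long, lat]},
            '$maxDistance': distance,
            '$minDistance': 0,
        }
    }
    # results come back ordered from nearest to farthest
    return [doc async for doc in db.trees.find({'geo': geo_query})]
</code></pre>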
<pre><code>geo_avg_noise_pipeline = [{
    '$geoNear': {
        'near': {'type': 'Point', 'coordinates': [long, lat]},
        'maxDistance': distance,
        'minDistance': 0,
        'spherical': 'true',
        'distanceField': 'distance'
    }
}]
</code></pre>
<p>This query uses the <a href="https://docs.mongodb.com/manual/core/aggregation-pipeline/">aggregation pipeline</a> stage <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/geoNear/#pipe._S_geoNear"><code>$geoNear</code></a> which returns an ordered stream of documents based on the proximity to a geospatial point. The output documents include an additional distance field.</p>
<p>The <code>near</code> field is mandatory and is the point for which to find the closest documents. It specifies the type of geometry and the coordinates (given by the HTTP request).</p>
<p>The <code>distanceField</code> field is also mandatory and is the output field that will contain the calculated distance. In this case we’ve just called it <code>distance</code>.</p>
<p>Some other fields are <code>maxDistance</code> that indicates the maximum allowed distance from the center of the point, <code>minDistance</code> for the minimum distance, and <code>spherical</code> which tells MongoDB how to calculate the distance between two points.</p>
<p>We specify the maximum distance as the radius selected by the user in the frontend.</p>
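<p>As an aside, the averaging we do in Python could also be pushed into this pipeline by appending a <code>$group</code> stage with the <code>$avg</code> accumulator. We didn’t take this route, but a sketch of the variation would be:</p>
<pre><code>avg_in_mongo_pipeline = [
    {'$geoNear': {
        'near': {'type': 'Point', 'coordinates': [long, lat]},
        'maxDistance': distance,
        'minDistance': 0,
        'spherical': True,
        'distanceField': 'distance',
    }},
    # a single output document holding the average of all matched levels
    {'$group': {'_id': None, 'avg_noise': {'$avg': '$level'}}},
]
</code></pre>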
<h2 id="frontend"><a class="anchor" href="#frontend">¶</a>Frontend</h2>
<p>As said earlier, our frontend will use Svelte. We already downloaded the template, so we can start developing. For some, this is the most fun part, because they can finally see and interact with some of the results. But for this interaction to work, we needed a functional backend which we now have!</p>
<h3 id="rest_queries"><a class="anchor" href="#rest_queries">¶</a>REST queries</h3>
<p>The frontend has to query the server to get any meaningful data to show on the page. The <a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API">Fetch API</a> does not throw an exception if the server doesn’t respond with HTTP OK, but we would like one if things go wrong so that we can handle it gracefully. The first thing we’ll do is define our own exception, <a href="https://stackoverflow.com/a/27724419">which is not pretty</a>:</p>
<pre><code>function NetworkError(message, status) {
    var instance = new Error(message);
    instance.name = 'NetworkError';
    instance.status = status;
    Object.setPrototypeOf(instance, Object.getPrototypeOf(this));
    if (Error.captureStackTrace) {
        Error.captureStackTrace(instance, NetworkError);
    }
    return instance;
}

NetworkError.prototype = Object.create(Error.prototype, {
    constructor: {
        value: Error,
        enumerable: false,
        writable: true,
        configurable: true
    }
});

Object.setPrototypeOf(NetworkError, Error);
</code></pre>
<p>But hey, now we have a proper and reusable <code>NetworkError</code>! Next, let’s make a proper and reusable <code>query</code> function that deals with <code>fetch</code> for us:</p>
<pre><code>async function query(endpoint) {
    const res = await fetch(endpoint, {
        // if we ever use cookies, this is important
        credentials: 'include'
    });
    if (res.ok) {
        return await res.json();
    } else {
        throw new NetworkError(await res.text(), res.status);
    }
}
</code></pre>
<p>At last, we can query our web server. The export here tells Svelte that this function should be visible to outer modules (public) as opposed to being private:</p>
<pre><code>export function get_area_info(long, lat, distance) {
    return query(`/rest/get-area-info?long=${long}&lat=${lat}&distance=${distance}`);
}
</code></pre>
<p>The attentive reader will have noticed that <code>query</code> is <code>async</code>, but <code>get_area_info</code> is not. This is intentional, because we don’t need to <code>await</code> for anything inside of it. We can just return the <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise"><code>Promise</code></a> that <code>query</code> created and let the caller <code>await</code> it as they see fit. The <code>await</code> here would have been redundant.</p>
<p>For those of you who don’t know what a JavaScript promise is, think of it as an object that represents «an eventual result». The result may not be there yet, but we promised it will be present in the future, and we can <code>await</code> for it. You can also find the same concept in other languages like Python under a different name, such as <a href="https://docs.python.org/3/library/asyncio-future.html#asyncio.Future"><code>Future</code></a>.</p>
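<p>The analogy extends to the «redundant <code>await</code>» point above: in Python, too, a plain function can hand back an awaitable for the caller to <code>await</code>. A contrived sketch:</p>
<pre><code>import asyncio

async def query():
    await asyncio.sleep(0.1)  # stand-in for a network call
    return {'avg_noise': 42.0}

def get_area_info():
    # not async: we simply return the awaitable without awaiting it
    return query()

async def main():
    print(await get_area_info())  # the caller awaits when it needs the value

asyncio.run(main())
</code></pre>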
<h3 id="map_component"><a class="anchor" href="#map_component">¶</a>Map component</h3>
<p>In Svelte, we can define self-contained components that are isolated from the rest. This makes it really easy to create a modular application. Think of a Svelte component as your own HTML tag, which you can customize however you want, building upon the already-existing components HTML has to offer.</p>
<p>The main thing that our map needs to do is render the map as an image and overlay the selection area as the user hovers the map with their mouse. We could render the image in the canvas itself, but instead we’ll use the HTML <code>&lt;img&gt;</code> tag for that and put a transparent <code>&lt;canvas&gt;</code> on top with some CSS. This should make it cheaper and easier to render things on the canvas.</p>
<p>The <code>Map</code> component will thus render as the user moves the mouse over it, and produce an event when they click so that whatever component is using a <code>Map</code> knows that it was clicked. Here’s the final CSS and HTML:</p>
<pre><code><style>
    div {
        position: relative;
    }

    canvas {
        position: absolute;
        left: 0;
        top: 0;
        cursor: crosshair;
    }
</style>

<div>
    <img bind:this={img} on:load={handleLoad} {height} src="caceres-municipality.svg" alt="Cáceres (municipality)"/>
    <canvas
        bind:this={canvas}
        on:mousemove={handleMove}
        on:wheel={handleWheel}
        on:mouseup={handleClick}/>
</div>
</code></pre>
<p>We hardcode a map source here, but ideally this would be provided by the server. The project is already complex enough, so we tried to avoid more complexity than necessary.</p>
<p>We bind the tags to some variables declared in the JavaScript code of the component, along with some functions and parameters to let the users of <code>Map</code> customize it just a little.</p>
<p>Here’s the gist of the JavaScript code:</p>
<pre><code><script>
    import { createEventDispatcher, onMount } from 'svelte';

    export let height = 200;

    const dispatch = createEventDispatcher();

    let img;
    let canvas;

    const LONG_WEST = -6.426881;
    const LONG_EAST = -6.354143;
    const LAT_NORTH = 39.500064;
    const LAT_SOUTH = 39.443201;

    let x = 0;
    let y = 0;
    let clickInfo = null; // [x, y, radius]
    let radiusDelta = 0.005 * height;
    let maxRadius = 0.2 * height;
    let minRadius = 0.01 * height;
    let radius = 0.05 * height;

    function handleLoad() {
        canvas.width = img.width;
        canvas.height = img.height;
    }

    function handleMove(event) {
        const { left, top } = this.getBoundingClientRect();
        x = Math.round(event.clientX - left);
        y = Math.round(event.clientY - top);
    }

    function handleWheel(event) {
        if (event.deltaY < 0) {
            if (radius < maxRadius) {
                radius += radiusDelta;
            }
        } else {
            if (radius > minRadius) {
                radius -= radiusDelta;
            }
        }
        event.preventDefault();
    }

    function handleClick(event) {
        dispatch('click', {
            // the real code here maps the x/y/radius values to the right range, here omitted
            x: ...,
            y: ...,
            radius: ...,
        });
    }

    onMount(() => {
        const ctx = canvas.getContext('2d');
        let frame;

        (function loop() {
            frame = requestAnimationFrame(loop);
            // the real code renders mouse area/selection, here omitted for brevity
            ...
        }());

        return () => {
            cancelAnimationFrame(frame);
        };
    });
</script>
</code></pre>
<p>Let’s go through bit-by-bit:</p>
<ul>
<li>We define a few variables and constants for later use in the final code.</li>
<li>We define the handlers to react to mouse movement and clicks. On click, we dispatch an event to outer components.</li>
<li>We set up the render loop with animation frames, and cancel the current frame appropriately if the component disappears.</li>
</ul>
<h3 id="app_component"><a class="anchor" href="#app_component">¶</a>App component</h3>
<p>Time to put everything together! We will include our function to make REST queries along with our <code>Map</code> component to render things on screen.</p>
<pre><code><script>
    import Map from './Map.svelte';
    import { get_area_info } from './rest.js'

    let selection = null;
    let area_info_promise = null;

    function handleMapSelection(event) {
        selection = event.detail;
        area_info_promise = get_area_info(selection.x, selection.y, selection.radius);
    }

    function format_avg_noise(avg_noise) {
        if (avg_noise === null) {
            return '(no data)';
        } else {
            return `${avg_noise.toFixed(2)} dB`;
        }
    }
</script>

<div class="container-fluid">
    <div class="row">
        <div class="col-3" style="max-width: 300em;">
            <div class="text-center">
                <h1>Caceres Data Consultory</h1>
            </div>
            <Map height={400} on:click={handleMapSelection}/>
            <div class="text-center mt-4">
                {#if selection === null}
                    <p class="m-1 p-3 border border-bottom-0 bg-info text-white">Click on the map to select the area you wish to see details for.</p>
                {:else}
                    <h2 class="bg-dark text-white">Selected area</h2>
                    <p><b>Coordinates:</b> ({selection.x}, {selection.y})</p>
                    <p><b>Radius:</b> {selection.radius} meters</p>
                {/if}
            </div>
        </div>
        <div class="col-sm-4">
            <div class="row">
                {#if area_info_promise !== null}
                    {#await area_info_promise}
                        <p>Fetching area information…</p>
                    {:then area_info}
                        <div class="col">
                            <div class="text-center">
                                <h2 class="m-1 bg-dark text-white">Area information</h2>
                                <ul class="list-unstyled">
                                    <li>There are <b>{area_info.tree_count} trees</b> within the area</li>
                                    <li>The <b>average noise</b> is <b>{format_avg_noise(area_info.avg_noise)}</b></li>
                                    <li>There are <b>{area_info.census_count} persons</b> within the area</li>
                                </ul>
                            </div>
                            {#if area_info.trees_per_type.length > 0}
                                <div class="text-center">
                                    <h2 class="m-1 bg-dark text-white">Tree count per type</h2>
                                </div>
                                <ul class="list-group">
                                    {#each area_info.trees_per_type as [type, count]}
                                        <li class="list-group-item">{type} <span class="badge badge-dark float-right">{count}</span></li>
                                    {/each}
                                </ul>
                            {/if}
                        </div>
                    {:catch error}
                        <p>Failed to fetch area information: {error.message}</p>
                    {/await}
                {/if}
            </div>
        </div>
    </div>
</div>
</code></pre>
<ul>
<li>We import the <code>Map</code> component and REST function so we can use them.</li>
<li>We define a listener for the events that the <code>Map</code> produces. Such an event will trigger a REST call to the server and save the result in a promise used later.</li>
<li>We’re using Bootstrap for the layout because it’s a lot easier. In the body we add our <code>Map</code> and another column to show the selection information.</li>
<li>We make use of Svelte’s <code>{#await}</code> to nicely notify the user when the call is being made, when it was successful, and when it failed. If it’s successful, we display the info.</li>
</ul>
<h2 id="results"><a class="anchor" href="#results">¶</a>Results</h2>
<p>Lo and behold, watch our application run!</p>
<p><video controls="controls" src="sr-2020-04-14_09-28-25.mp4"></video></p>
<p>In this video you can see our application running, but let’s describe what is happening in more detail.</p>
<p>When the application starts running (by opening it in your web browser of choice), you can see a map with the town of Cáceres. Then you, the user, can click to retrieve the information within the selected area.</p>
<p>It is important to note that one can make the selection area larger or smaller by trying to scroll up or down, respectively.</p>
<p>Once an area is selected, it is colored green in order to let the user know which area they have selected. Under the map, the selected coordinates and the radius (in meters) are also shown for the curious. On the right side, the information concerning the selected area is shown, such as the number of trees, the average noise and the number of persons. If there are trees in the area, the application also displays the tree count per type, sorted by the number of trees.</p>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>We hope you enjoyed reading this post as much as we enjoyed writing it! Feel free to download the final project and play around with it. Maybe you can adapt it for even more interesting purposes!</p>
<p><em>download removed</em></p>
<p>To run the above code:</p>
<ol>
<li>Unzip the downloaded file.</li>
<li>Make a copy of <code>example-server-config.ini</code> and rename it to <code>server-config.ini</code>, then edit the file to suit your needs.</li>
<li>Run the server with <code>python -m server</code>.</li>
<li>Open <a href="http://localhost:9000">localhost:9000</a> in your web browser (or whatever port you chose) and enjoy!</li>
</ol>
</main>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>MongoDB: an Introduction</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the first post in the MongoDB series, where we will introduce the MongoDB database system and take a look at its features and installation methods.</p>
<div class="date-created-modified">Created 2020-03-05<br>
Modified 2020-04-08</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a> (this post)</li>
<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a></li>
<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a></li>
</ul>
<p>This post is co-authored with a Classmate.</p>
<hr />
<div class="image-container">
<img src="mongodb.png" alt="NoSQL database – MongoDB – First delivery" />
<div class="image-caption"></div>
</div>
<h2 class="title" id="purpose_of_technology"><a class="anchor" href="#purpose_of_technology">¶</a>Purpose of technology</h2>
<p>MongoDB is a <strong>general purpose, document-based, distributed database</strong> built for modern application developers and for the cloud era, with the scalability and flexibility that you want with the querying and indexing that you need. It being a document database means it stores data in JSON-like documents.</p>
<p>The Mongo team believes this is the most natural way to think about data, which is (they claim) much more expressive and powerful than the traditional row/column model, since programmers think in objects.</p>
<h2 id="how_it_works"><a class="anchor" href="#how_it_works">¶</a>How it works</h2>
<p>MongoDB’s architecture can be summarized as follows:</p>
<ul>
<li>Document data model.</li>
<li>Distributed systems design.</li>
<li>Unified experience with freedom to run it anywhere.</li>
</ul>
<p>For a more in-depth explanation, MongoDB offers a <a href="https://www.mongodb.com/collateral/mongodb-architecture-guide">download to the MongoDB Architecture Guide</a> with roughly ten pages worth of text.</p>
<p><img src="knGHenfTGA4kzJb1PHmS9EQvtZl2QlhbIPN15M38m8fZfZf7ODwYfhf0Tltr.png" alt="" />
_ Overview of MongoDB’s architecture_</p>
<p>Regarding usage, MongoDB comes with a really nice introduction, along with JavaScript, Python, Java, C++ or C# code of our choice, which describes the steps necessary to make it work. Below we describe a common workflow.</p>
<p>First, we must <strong>connect</strong> to a running MongoDB instance. Once the connection succeeds, we can access individual «collections», which we can think of as <em>tables</em> in which documents are stored.</p>
<p>For instance, we could <strong>insert</strong> an arbitrary JSON document into the <code>restaurants</code> collection to store information about a restaurant.</p>
<p>At any other point in time, we can <strong>query</strong> these collections. The queries range from trivial, empty ones (which would retrieve all the documents and fields) to more rich and complex queries (for instance, using AND and OR operators, checking if data exists, and then looking for a value in a list).</p>
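<p>As a sketch of what that looks like in practice, here is such a query from Python with <code>pymongo</code> (the field names are invented for the example):</p>
<pre><code>from pymongo import MongoClient

db = MongoClient().test  # assumes a local Mongo instance

# restaurants with an address that are either bakeries or highly rated
restaurants = db.restaurants.find({
    '$and': [
        {'address': {'$exists': True}},
        {'$or': [{'category': 'Bakery'}, {'stars': {'$gte': 4}}]},
    ]
})
for restaurant in restaurants:
    print(restaurant)
</code></pre>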
<p>MongoDB also supports the creation of <strong>indices</strong>, similar to those in other database systems. It allows for the creation of indices on any field or subfields.</p>
<p>In Mongo, the <strong>aggregation pipeline</strong> allows us to filter and analyze data based on a given set of criteria. For example, we could pull all the documents in the <code>restaurants</code> collection that have a <code>category</code> of <code>Bakery</code> using the <code>$match</code> operator. Then, we can group them by their star rating using the <code>$group</code> operator. Using the accumulator operator, <code>$sum</code>, we can see how many bakeries in our collection have each star rating.</p>
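<p>Expressed with <code>pymongo</code>, that pipeline might look as follows (a sketch reusing the hypothetical <code>category</code> and <code>stars</code> fields):</p>
<pre><code>from pymongo import MongoClient

db = MongoClient().test  # assumes a local Mongo instance

pipeline = [
    {'$match': {'category': 'Bakery'}},                   # keep only bakeries
    {'$group': {'_id': '$stars', 'count': {'$sum': 1}}},  # count per star rating
]
for row in db.restaurants.aggregate(pipeline):
    print(f"{row['count']} bakeries rated {row['_id']} stars")
</code></pre>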
<h2 id="features"><a class="anchor" href="#features">¶</a>Features</h2>
<p>The features can be seen all over their site, because they are something MongoDB puts a lot of emphasis on:</p>
<ul>
<li><strong>Easy development</strong>, thanks to the document data model, something they claim to be «the best way to work with data»:
<ul>
<li>Data is stored in flexible JSON-like documents.</li>
<li>This model directly maps to the objects in the application’s code.</li>
<li>Ad hoc queries, indexing, and real time aggregation provide powerful ways to access and analyze the data.</li>
</ul>
</li>
<li><strong>Powerful query language</strong>, with a rich and expressive query language that allows filtering and sorting by any field, no matter how nested it may be within a document. The queries are themselves JSON, and thus easily composable.</li>
<li><strong>Support for aggregations</strong> and other modern use-cases such as geo-based search, graph search, and text search.</li>
<li><strong>A distributed systems design</strong>, which allows developers to intelligently put data where they want it. High availability, horizontal scaling, and geographic distribution are built in and easy to use.</li>
<li><strong>A unified experience</strong> with the freedom to run anywhere, which allows developers to future-proof their work and eliminate vendor lock-in.</li>
</ul>
<h2 id="corner_in_cap_theorem"><a class="anchor" href="#corner_in_cap_theorem">¶</a>Corner in CAP theorem</h2>
<p>MongoDB’s position in the CAP theorem (Consistency, Availability, Partition Tolerance) depends on the database and driver configurations, and the type of disaster.</p>
<ul>
<li>With <strong>no partitions</strong>, the main focus is <strong>CA</strong>.</li>
<li>If there are <strong>partitions</strong> but the system is <strong>strongly connected</strong>, the main focus is <strong>AP</strong>: non-synchronized writes from the old primary are ignored.</li>
<li>If there are <strong>partitions</strong> but the system is <strong>not strongly connected</strong>, the main focus is <strong>CP</strong>: only read access is provided to avoid inconsistencies.</li>
</ul>
<p>The general consensus seems to be that Mongo is <strong>CP</strong>.</p>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>We will be using the apt-based installation.</p>
<p>The Community version can be downloaded by anyone through the <a href="https://www.mongodb.com/download-center/community">MongoDB Download Center</a>, where one can choose the version, operating system and package. MongoDB also seems to be <a href="https://packages.ubuntu.com/eoan/mongodb">available in Ubuntu’s PPAs</a>.</p>
<h2 id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
<p>We will be using an Ubuntu-based system, with apt available. To install MongoDB, we open a terminal and run the following command:</p>
<pre><code>apt install mongodb
</code></pre>
<p>After confirming that we do indeed want to install the package, we should be able to run the following command to verify that the installation was successful:</p>
<pre><code>mongod --version
</code></pre>
<p>The output should be similar to the following: </p>
<pre><code>db version v4.0.16
git version: 2a5433168a53044cb6b4fa8083e4cfd7ba142221
OpenSSL version: OpenSSL 1.1.1 11 Sep 2018
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1804
distarch: x86_64
target_arch: x86_64
</code></pre>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://www.mongodb.com/">MongoDB’s official site</a></li>
<li><a href="https://www.mongodb.com/what-is-mongodb">What is MongoDB?</a></li>
<li><a href="https://www.mongodb.com/mongodb-architecture">MongoDB Architecture</a></li>
<li><a href="https://stackoverflow.com/q/11292215/4759433">Where does mongodb stand in the CAP theorem?</a></li>
<li><a href="https://medium.com/@bikas.katwal10/mongodb-vs-cassandra-vs-rdbms-where-do-they-stand-in-the-cap-theorem-1bae779a7a15">What is the CAP Theorem? MongoDB vs Cassandra vs RDBMS, where do they stand in the CAP theorem?</a></li>
<li><a href="https://www.quora.com/Why-doesnt-MongoDB-have-availability-in-the-CAP-theorem">Why doesn’t MongoDB have availability in the CAP theorem?</a></li>
<li><a href="https://docs.mongodb.com/manual/installation/">Install MongoDB</a></li>
</ul>
</main>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>MongoDB: Basic Operations and Architecture</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the second post in the MongoDB series, where we will take a look at the <a href="https://stackify.com/what-are-crud-operations/">CRUD operations</a> they support, the data model and architecture used.</p>
<div class="date-created-modified">Created 2020-03-05<br>
Modified 2020-04-08</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a></li>
<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a> (this post)</li>
<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a></li>
</ul>
<p>This post is co-authored with a Classmate, and in it we will take an explorative approach, using the <code>mongo</code> command line shell to execute commands against the database. It even has TAB auto-completion, which is awesome!</p>
<hr />
<p>Before creating any documents, we first need to create somewhere for the documents to be in. And before we create anything, the database has to be running, so let’s do that first. If we don’t have a service installed, we can run the <code>mongod</code> command ourselves in some local folder to make things easier:</p>
<pre><code>$ mkdir -p mongo-database
$ mongod --dbpath mongo-database
</code></pre>
<p>Just like that, we will have Mongo running. Now, let’s connect to it using the <code>mongo</code> command in another terminal (don’t close the terminal where the server is running, we need it!). By default, it connects to localhost, which is just what we need.</p>
<pre><code>$ mongo
</code></pre>
<h2 class="title" id="create"><a class="anchor" href="#create">¶</a>Create</h2>
<h3 id="create_a_database"><a class="anchor" href="#create_a_database">¶</a>Create a database</h3>
<p>Let’s list the databases:</p>
<pre><code>> show databases
admin 0.000GB
config 0.000GB
local 0.000GB
</code></pre>
<p>Oh, how interesting! There are already some databases, even though we just created the directory where Mongo will store everything. However, they seem empty, which makes sense.</p>
<p>Creating a new database is done by <code>use</code>-ing a name that doesn’t exist. Let’s call our new database «helloworld».</p>
<pre><code>> use helloworld
switched to db helloworld
</code></pre>
<p>Good! Now the «local variable» called <code>db</code> points to our <code>helloworld</code> database.</p>
<pre><code>> db
helloworld
</code></pre>
<p>What happens if we print the databases again? Surely our new database will show up now…</p>
<pre><code>> show databases
admin 0.000GB
config 0.000GB
local 0.000GB
</code></pre>
<p>…maybe not! It seems Mongo won’t create the database until we create some collections and documents in it. Databases contain collections, and inside collections (which you can think of as tables) we can insert new documents (which you can think of as rows). Like in many programming languages, the dot operator is used to access these «members».</p>
<h3 id="create_a_document"><a class="anchor" href="#create_a_document">¶</a>Create a document</h3>
<p>Let’s add a new greeting into the <code>greetings</code> collection:</p>
<pre><code>> db.greetings.insert({message: "¡Bienvenido!", lang: "es"})
WriteResult({ "nInserted" : 1 })
> show collections
greetings
> show databases
admin 0.000GB
config 0.000GB
helloworld 0.000GB
local 0.000GB
</code></pre>
<p>That looks promising! We can also see that our new <code>helloworld</code> database now shows up. The Mongo shell actually runs JavaScript-like code, which is why we can use a relaxed variant of JSON (stored by Mongo as BSON) to insert documents (note the lack of quotes around the keys, convenient!).</p>
<p>The <a href="https://docs.mongodb.com/manual/reference/method/db.collection.insert/index.html"><code>insert</code></a> method actually supports a list of documents, and by default Mongo will assign a unique identifier to each. If we don’t want that though, all we have to do is add the <code>_id</code> key to our documents.</p>
<pre><code>> db.greetings.insert([
... {message: "Welcome!", lang: "en"},
... {message: "Bonjour!", lang: "fr"},
... ])
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 2,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]
})
</code></pre>
<h3 id="create_a_collection"><a class="anchor" href="#create_a_collection">¶</a>Create a collection</h3>
<p>In this example, we created the collection <code>greetings</code> implicitly, but behind the scenes Mongo made a call to <a href="https://docs.mongodb.com/manual/reference/method/db.createCollection/"><code>createCollection</code></a>. Let’s do just that:</p>
<pre><code>> db.createCollection("goodbyes")
{ "ok" : 1 }
> show collections
goodbyes
greetings
</code></pre>
<p>The method actually has an optional parameter to configure other options, like the maximum size of the collection or the maximum number of documents in it, validation-related options, and so on. These are all described in more detail in the documentation.</p>
<h2 id="read"><a class="anchor" href="#read">¶</a>Read</h2>
<p>To read the contents of a document, we have to <a href="https://docs.mongodb.com/manual/reference/method/db.collection.find/index.html"><code>find</code></a> it.</p>
<pre><code>> db.greetings.find()
{ "_id" : ObjectId("5e74829a0659f802b15f18dd"), "message" : "¡Bienvenido!", "lang" : "es" }
{ "_id" : ObjectId("5e7487b90659f802b15f18de"), "message" : "Welcome!", "lang" : "en" }
{ "_id" : ObjectId("5e7487b90659f802b15f18df"), "message" : "Bonjour!", "lang" : "fr" }
</code></pre>
<p>That’s a bit unreadable for my taste. Can we make it more <a href="https://docs.mongodb.com/manual/reference/method/cursor.pretty/index.html"><code>pretty</code></a>?</p>
<pre><code>> db.greetings.find().pretty()
{
"_id" : ObjectId("5e74829a0659f802b15f18dd"),
"message" : "¡Bienvenido!",
"lang" : "es"
}
{
"_id" : ObjectId("5e7487b90659f802b15f18de"),
"message" : "Welcome!",
"lang" : "en"
}
{
"_id" : ObjectId("5e7487b90659f802b15f18df"),
"message" : "Bonjour!",
"lang" : "fr"
}
</code></pre>
<p>Gorgeous! We can clearly see Mongo created an identifier for us automatically. The queries are also JSON, and support a bunch of operators (prefixed by <code>$</code>), known as <a href="https://docs.mongodb.com/manual/reference/operator/query/">Query Selectors</a>. Here’s a few:</p>
<table>
<thead>
<tr>
<th>
Operation
</th>
<th>
Syntax
</th>
<th>
RDBMS equivalent
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Equals
</td>
<td>
<code>
{key: {$eq: value}}
</code>
<br/>
Shorthand:
<code>
{key: value}
</code>
</td>
<td>
<code>
where key = value
</code>
</td>
</tr>
<tr>
<td>
Less Than
</td>
<td>
<code>
{key: {$lt: value}}
</code>
</td>
<td>
<code>
where key < value
</code>
</td>
</tr>
<tr>
<td>
Less Than or Equal
</td>
<td>
<code>
{key: {$lte: value}}
</code>
</td>
<td>
<code>
where key <= value
</code>
</td>
</tr>
<tr>
<td>
Greater Than
</td>
<td>
<code>
{key: {$gt: value}}
</code>
</td>
<td>
<code>
where key > value
</code>
</td>
</tr>
<tr>
<td>
Greater Than or Equal
</td>
<td>
<code>
{key: {$gte: value}}
</code>
</td>
<td>
<code>
where key >= value
</code>
</td>
</tr>
<tr>
<td>
Not Equal
</td>
<td>
<code>
{key: {$ne: value}}
</code>
</td>
<td>
<code>
where key != value
</code>
</td>
</tr>
<tr>
<td>
And
</td>
<td>
<code>
{$and: [{k1: v1}, {k2: v2}]}
</code>
</td>
<td>
<code>
where k1 = v1 and k2 = v2
</code>
</td>
</tr>
<tr>
<td>
Or
</td>
<td>
<code>
{$or: [{k1: v1}, {k2: v2}]}
</code>
</td>
<td>
<code>
where k1 = v1 or k2 = v2
</code>
</td>
</tr>
</tbody>
</table>
<p>The operations all do what you would expect them to do, and their names are really intuitive. Aggregating operations with <code>$and</code> or <code>$or</code> can be done anywhere in the query, nested any level deep.</p>
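<p>To see the nesting in action, here is the same kind of query issued from Python with <code>pymongo</code> (the shell accepts the literal equivalent of the same document):</p>
<pre><code>from pymongo import MongoClient

db = MongoClient().helloworld

# greetings in Spanish, or English ones whose message is not empty
cursor = db.greetings.find({
    '$or': [
        {'lang': 'es'},
        {'$and': [{'lang': 'en'}, {'message': {'$ne': ''}}]},
    ]
})
for doc in cursor:
    print(doc)
</code></pre>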
<h2 id="update"><a class="anchor" href="#update">¶</a>Update</h2>
<p>Updating a document can be done by using <a href="https://docs.mongodb.com/manual/reference/method/db.collection.save/index.html"><code>save</code></a> on an already-existing document (that is, the document we want to save has <code>_id</code> and it’s in the collection already). If the document is not in the collection yet, this method will create it.</p>
<pre><code>> db.greetings.save({_id: ObjectId("5e74829a0659f802b15f18dd"), message: "¡Bienvenido, humano!", "lang" : "es"})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.greetings.find({lang: "es"})
{ "_id" : ObjectId("5e74829a0659f802b15f18dd"), "message" : "¡Bienvenido, humano!", "lang" : "es" }
</code></pre>
<p>Alternatively, the <a href="https://docs.mongodb.com/manual/reference/method/db.collection.update/index.html"><code>update</code></a> method takes a query and new value.</p>
<pre><code>> db.greetings.update({lang: "en"}, {$set: {message: "Welcome, human!"}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.greetings.find({lang: "en"})
{ "_id" : ObjectId("5e7487b90659f802b15f18de"), "message" : "Welcome, human!", "lang" : "en" }
</code></pre>
<h2 id="indexing"><a class="anchor" href="#indexing">¶</a>Indexing</h2>
<p>Creating an index is done with <a href="https://docs.mongodb.com/manual/reference/method/db.collection.createIndex/index.html"><code>createIndex</code></a>:</p>
<pre><code>> db.greetings.createIndex({lang: +1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
</code></pre>
<p>Here, we create an ascending index on the <code>lang</code> key. Descending order is done with <code>-1</code>. Now a query for <code>lang</code> in our three documents will be fast… well, maybe iterating over three documents was already faster than using an index.</p>
<h2 id="delete"><a class="anchor" href="#delete">¶</a>Delete</h2>
<h3 id="delete_a_document"><a class="anchor" href="#delete_a_document">¶</a>Delete a document</h3>
<p>I have to confess, I can’t speak French. I learnt it long ago and it’s long forgotten, so let’s remove the translation I copied online from our greetings with <a href="https://docs.mongodb.com/manual/reference/method/db.collection.remove/index.html"><code>remove</code></a>.</p>
<pre><code>> db.greetings.remove({lang: "fr"})
WriteResult({ "nRemoved" : 1 })
</code></pre>
<h3 id="delete_a_collection"><a class="anchor" href="#delete_a_collection">¶</a>Delete a collection</h3>
<p>We never really used the <code>goodbyes</code> collection. Can we get rid of that?</p>
<pre><code>> db.goodbyes.drop()
true
</code></pre>
<p>Yes, it is <code>true</code> that we can <a href="https://docs.mongodb.com/manual/reference/method/db.collection.drop/index.html"><code>drop</code></a> it.</p>
<h3 id="delete_a_database"><a class="anchor" href="#delete_a_database">¶</a>Delete a database</h3>
<p>Now, I will be honest, I don’t really like our <code>greetings</code> database either. It stinks. Let’s get rid of it as well:</p>
<pre><code>> db.dropDatabase()
{ "dropped" : "helloworld", "ok" : 1 }
</code></pre>
<p>Yeah, take that! The <a href="https://docs.mongodb.com/manual/reference/method/db.dropDatabase/"><code>dropDatabase</code></a> method can be used to drop entire databases.</p>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<p>The examples in this post are all fictional, and the methods used were taken from Classmate’s post and, of course, <a href="https://docs.mongodb.com/manual/reference/method/">Mongo’s documentation</a>.</p>
</main>
</body>
</html>
Introduction to Hadoop and its MapReducedist/introduction-to-hadoop-and-its-mapreduce/index.html2020-04-02T22:00:00+00:002020-03-31T22:00:00+00:00Hadoop is an open-source, free, Java-based programming framework that helps processing large datasets in a distributed environment and the problems that arise when trying to harness the knowledge from BigData, capable of running on thousands of nodes and dealing with petabytes of data. It is based on Google File System (GFS) and originated from the work on the Nutch open-source project on search engines.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Introduction to Hadoop and its MapReduce</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>Hadoop is a free, open-source, Java-based programming framework that helps process large datasets in a distributed environment, tackling the problems that arise when trying to harness knowledge from Big Data. It is capable of running on thousands of nodes and dealing with petabytes of data. It is based on the Google File System (GFS) and originated from the work on Nutch, an open-source search engine project.</p>
<div class="date-created-modified">Created 2020-04-01<br>
Modified 2020-04-03</div>
<p>Hadoop also offers a distributed filesystem (HDFS) enabling for fast transfer among nodes, and a way to program with MapReduce.</p>
<p>It aims to address the 4 V’s: Volume, Variety, Veracity and Velocity. For veracity, it provides a secure environment that can be trusted.</p>
<h2 class="title" id="milestones"><a class="anchor" href="#milestones">¶</a>Milestones</h2>
<p>The creators of Hadoop are Doug Cutting and Mike Cafarella, who just wanted to design a search engine, Nutch, and quickly ran into the problems of dealing with large amounts of data. They found their solution in the papers Google published.</p>
<p>The name comes from the plush toy of Cutting’s child, a yellow elephant.</p>
<ul>
<li>In July 2005, Nutch used GFS to perform MapReduce operations.</li>
<li>In February 2006, Nutch started a Lucene subproject which led to Hadoop.</li>
<li>In April 2007, Yahoo used Hadoop in a 1 000-node cluster.</li>
<li>In January 2008, Apache took over and made Hadoop a top-level project.</li>
<li>In July 2008, Apache tested a 4000-node cluster. The performance was the fastest compared to other technologies that year.</li>
<li>In May 2009, Hadoop sorted a petabyte of data in 17 hours.</li>
<li>In December 2011, Hadoop reached 1.0.</li>
<li>In May 2012, Hadoop 2.0 was released with the addition of YARN (Yet Another Resource Negotiator) on top of HDFS, splitting MapReduce and other processes into separate components, greatly improving the fault tolerance.</li>
</ul>
<p>From here onwards, many other alternatives have been born around the Hadoop ecosystem, like Spark, Hive & Drill, Kafka, and HBase.</p>
<p>As of 2017, Amazon has clusters of between 1 and 100 nodes, Yahoo has over 100 000 CPUs running Hadoop, AOL has clusters of 50 machines, and Facebook has a 320-machine cluster (2 560 cores) with 1.3 PB of raw storage.</p>
<h2 id="why_not_use_rdbms_"><a class="anchor" href="#why_not_use_rdbms_">¶</a>Why not use RDBMS?</h2>
<p>Relational database management systems simply cannot scale horizontally, and vertical scaling requires very expensive servers. Similar to RDBMS, Hadoop has a notion of jobs (analogous to transactions), but without ACID or concurrency control. Hadoop supports any form of data (unstructured or semi-structured) in read-only mode, and while failures are common, it offers simple yet efficient fault tolerance.</p>
<p>So what problems does Hadoop solve? It changes the way we should think about problems and how to distribute them, which is key to doing anything related to Big Data nowadays. We start working with clusters of nodes and coordinating the jobs between them, and Hadoop’s API makes this really easy.</p>
<p>Hadoop also takes the loss of data very seriously through replication: if a node fails, its blocks are moved to a different node.</p>
<h2 id="major_components"><a class="anchor" href="#major_components">¶</a>Major components</h2>
<p>The previously-mentioned HDFS runs on commodity machines, which are cost-friendly. It is very fault-tolerant and efficient enough to process huge amounts of data, because it splits large files into smaller chunks (or blocks) that can be handled more easily. Multiple nodes can work on multiple chunks at the same time.</p>
<p>The NameNode stores the metadata of the various data blocks (the map of blocks) along with their location. It is the brain and the master in Hadoop’s master-slave architecture, also known as the namespace, and it makes use of the DataNodes.</p>
<p>A secondary NameNode is a replica that can be used if the first NameNode dies, so that Hadoop doesn’t shut down and can restart.</p>
<p>DataNodes store the blocks of data and are the slaves in the architecture. This data is split into one or more files, and their only job is to manage access to the data. They are often distributed among racks to avoid data loss.</p>
<p>The JobTracker creates and schedules jobs from the clients for either map or reduce operations.</p>
<p>The TaskTracker runs MapReduce tasks assigned to the current data node.</p>
<p>When clients need data, they first interact with the NameNode, which replies with the location of the data in the correct DataNode. The client then proceeds to interact with that DataNode directly.</p>
<h2 id="mapreduce"><a class="anchor" href="#mapreduce">¶</a>MapReduce</h2>
<p>MapReduce, as the name implies, is split into two steps: the map and the reduce. The map stage is the «divide and conquer» strategy, while the reduce part is about combining and reducing the results.</p>
<p>The mapper has to process the input data (normally a file or directory), commonly line-by-line, and produce one or more outputs. The reducer uses all the results from the mapper as its input to produce a new output file itself.</p>
<p><img src="bitmap.png" alt="" /></p>
<p>When reading the data, some of it may be junk that we can choose to ignore. If it is valid data, however, we label it with a particular type that can be useful for the upcoming process. Hadoop is responsible for splitting the data across the many nodes available to execute this process in parallel.</p>
<p>There is another part to MapReduce, known as Shuffle-and-Sort. In this part, types or categories from one node get moved to a different node. This happens with all nodes, so that every node can work on a complete category. These categories are known as «keys», and they are what allows Hadoop to scale linearly.</p>
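<p>As a tiny, made-up illustration of the whole pipeline on a word count job:</p>
<pre><code>input:   "the cat sat on the mat"
map:     (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
shuffle: cat→[1]  mat→[1]  on→[1]  sat→[1]  the→[1,1]
reduce:  (cat,1) (mat,1) (on,1) (sat,1) (the,2)
</code></pre>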
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://youtu.be/oT7kczq5A-0">YouTube – Hadoop Tutorial For Beginners | What Is Hadoop? | Hadoop Tutorial | Hadoop Training | Simplilearn</a></li>
<li><a href="https://youtu.be/bcjSe0xCHbE">YouTube – Learn MapReduce with Playing Cards</a></li>
<li><a href="https://youtu.be/j8ehT1_G5AY?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">YouTube – Video Post #2: Hadoop para torpes (I)-¿Qué es y para qué sirve?</a></li>
<li><a href="https://youtu.be/NQ8mjVPCDvk?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">Video Post #3: Hadoop para torpes (II)-¿Cómo funciona? HDFS y MapReduce</a></li>
<li><a href="https://hadoop.apache.org/old/releases.html">Apache Hadoop Releases</a></li>
<li><a href="https://youtu.be/20qWx2KYqYg?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">Video Post #4: Hadoop para torpes (III y fin)- Ecosistema y distribuciones</a></li>
<li><a href="http://www.hadoopbook.com/">Chapter 2 – Hadoop: The Definitive Guide, Fourth Edition</a> (<a href="http://grut-computing.com/HadoopBook.pdf">pdf,</a><a href="http://www.hadoopbook.com/code.html">code</a>)</li>
</ul>
</main>
</body>
</html>
Google’s BigTabledist/googles-bigtable/index.html2020-04-02T22:00:00+00:002020-03-31T22:00:00+00:00Let’s talk about BigTable, and why it is what it is. But before we get into that, let’s see some important aspects anybody should consider when dealing with a lot of data (something BigTable does!).<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Google’s BigTable</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>Let’s talk about BigTable, and why it is what it is. But before we get into that, let’s see some important aspects anybody should consider when dealing with a lot of data (something BigTable does!).</p>
<div class="date-created-modified">Created 2020-04-01<br>
Modified 2020-04-03</div>
<h2 class="title" id="the_basics"><a class="anchor" href="#the_basics">¶</a>The basics</h2>
<p>Converting a text document into a different format is often a great way to speed up scanning it in the future, and it allows for efficient searches.</p>
<p>In addition, you generally want to store everything in a single, giant file. This will save a lot of time opening and closing files, because everything is in the same file! One proposal to make this happen is <a href="https://trec.nist.gov/file_help.html">Web TREC</a> (see also the <a href="https://en.wikipedia.org/wiki/Text_Retrieval_Conference">Wikipedia page on TREC</a>), which is basically HTML but with every document properly delimited from the others.</p>
<p>Because we will have a lot of data, it’s often a good idea to compress it. Most text consists of the same words, over and over again. Classic compression techniques such as <code>DEFLATE</code> or <code>LZW</code> do an excellent job here.</p>
<h2 id="so_what_s_bigtable_"><a class="anchor" href="#so_what_s_bigtable_">¶</a>So what’s BigTable?</h2>
<p>Okay, enough of an introduction to the basics on storing data. BigTable is what Google uses to store documents, and it’s a customized approach to save, search and update web pages.</p>
<p>BigTable is a distributed storage system for managing structured data, able to scale to petabytes of data across thousands of commodity servers, with wide applicability, scalability, high performance, and high availability.</p>
<p>In a way, it’s kind of like a database: it shares many implementation strategies with parallel databases and main-memory databases, but of course with a different schema.</p>
<p>It consists of a big table known as the «Root tablet», with pointers to many other «tablets» (or metadata in between). These are stored in a replicated filesystem accessible by all BigTable servers. Any change to a tablet gets logged (said log also gets stored in a replicated filesystem).</p>
<p>If any of the tablet servers gets locked, a different one can take its place, read the log and deal with the problem.</p>
<p>There’s no query language, transactions occur at row-level only. Every read or write in a row is atomic. Each row stores a single web page, and by combining the row and column keys along with a timestamp, it is possible to retrieve a single cell in the row. More formally, it’s a map that looks like this:</p>
<pre><code>fetch(row: string, column: string, time: int64) -> string
</code></pre>
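<p>For instance, a hypothetical call (the row key and timestamp here are made up; row keys are typically reversed URLs) could be:</p>
<pre><code>fetch("com.example.www/index.html", "contents:", 1584748800)
</code></pre>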
<p>A row may have as many columns as it needs, and the column groups are the same for everyone (but the columns themselves may vary), which is important to reduce disk read time.</p>
<p>Rows are split into different tablets based on the row keys, which simplifies determining an appropriate server for them. The keys can be up to 64KB big, although most commonly they range from 10 to 100 bytes.</p>
<h2 id="conclusions"><a class="anchor" href="#conclusions">¶</a>Conclusions</h2>
<p>BigTable is Google’s way to deal with large amounts of data on many of their services, and the ideas behind it are not too complex to understand.</p>
</main>
</body>
</html>
A practical example with Hadoopdist/a-practical-example-with-hadoop/index.html2020-04-02T22:00:00+00:002020-03-31T22:00:00+00:00In our <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>A practical example with Hadoop</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>In our <a href="/blog/ribw/introduction-to-hadoop-and-its-mapreduce/">previous Hadoop post</a>, we learnt what it is, how it originated, and how it works, from a theoretical standpoint. Here we will instead focus on a more practical example with Hadoop.</p>
<div class="date-created-modified">Created 2020-04-01<br>
Modified 2020-04-03</div>
<p>This post will showcase my own implementation of a word counter for any plain text document that you want to analyze.</p>
<h2 class="title" id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
<p>Before running any piece of software, we must first download it. Head over to <a href="http://hadoop.apache.org/releases.html">Apache Hadoop’s releases</a> and download the <a href="https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz">latest binary version</a> at the time of writing (3.2.1).</p>
<p>We will be using the <a href="https://linuxmint.com/">Linux Mint</a> distribution because I love its simplicity, although the process shown here should work just fine on any similar Linux distribution such as <a href="https://ubuntu.com/">Ubuntu</a>.</p>
<p>Once the archive download is complete, extract it with any tool of your choice (graphical or in the terminal) and verify that it runs. Make sure you have a version of Java installed, such as <a href="https://openjdk.java.net/">OpenJDK</a>.</p>
<p>Here are all three steps in the command line:</p>
<pre><code>wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
hadoop-3.2.1/bin/hadoop version
</code></pre>
<h2 id="processing_data"><a class="anchor" href="#processing_data">¶</a>Processing data</h2>
<p>To take advantage of Hadoop, we have to design our code to work in the MapReduce model. Both the map and reduce phase work on key-value pairs as input and output, and both have a programmer-defined function.</p>
<p>We will use Java, because it’s a dependency that we already have anyway, so might as well.</p>
<p>Our map function needs to split each of the lines we receive as input into words, and we will also convert them to lowercase, thus preparing the data for later use (counting words). There won’t be bad records, so we don’t have to worry about that.</p>
<p>Copy or reproduce the following code in a file called <code>WordCountMapper.java</code>, using any text editor of your choice:</p>
<pre><code>import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the offset of the line in the input file, and the value
        // is the line itself. Split the line on non-word characters and emit
        // a (word, 1) pair for every token. Note that the split can produce
        // empty tokens, which also get counted (see the output below).
        for (String word : value.toString().split("\\W")) {
            context.write(new Text(word.toLowerCase()), new IntWritable(1));
        }
    }
}
</code></pre>
<p>Now, let’s create the <code>WordCountReducer.java</code> file. Its job is to reduce the data from multiple values into just one. We do that by summing all the values (our word count so far):</p>
<pre><code>import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // For every word we receive all the 1's emitted by the mappers;
        // summing them yields the total count for that word.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
</code></pre>
<p>Let’s just take a moment to appreciate how absolutely tiny this code is, and it’s Java! Hadoop’s API is really awesome and lets us write such concise code to achieve what we need.</p>
<p>Last, let’s write the <code>main</code> method, or else we won’t be able to run it. In our new file <code>WordCount.java</code>:</p>
<pre><code>import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java WordCount <input path> <output path>");
            System.exit(-1);
        }

        // Configure the job: the mapper and reducer to use, and the types
        // of the output key-value pairs.
        Job job = Job.getInstance();
        job.setJobName("Word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Run the job and block until it finishes.
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
</code></pre>
<p>And compile by including the required <code>.jar</code> dependencies in Java’s classpath with the <code>-cp</code> switch:</p>
<pre><code>javac -cp "hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/mapreduce/*" *.java
</code></pre>
<p>At last, we can run it (also specifying the dependencies in the classpath, this one’s a mouthful). Let’s run it on the same <code>WordCount.java</code> source file we wrote:</p>
<pre><code>java -cp ".:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/mapreduce/*:hadoop-3.2.1/share/hadoop/mapreduce/lib/*:hadoop-3.2.1/share/hadoop/yarn/*:hadoop-3.2.1/share/hadoop/yarn/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*" WordCount WordCount.java results
</code></pre>
<p>Hooray! We should have a new <code>results/</code> folder along with the following files:</p>
<pre><code>$ ls results
part-r-00000 _SUCCESS
$ cat results/part-r-00000
154
0 2
1 3
2 1
addinputpath 1
apache 6
args 4
boolean 1
class 6
count 1
err 1
exception 1
-snip- (output cut for clarity)
</code></pre>
<p>It worked! Now, this example was obviously tiny, but hopefully it’s enough to demonstrate how to get the basics running on real-world data.</p>
</main>
</body>
</html>
How does Google’s Search Engine work?dist/how-does-googles-search-engine-work/index.html2020-03-27T23:00:00+00:002020-03-17T23:00:00+00:00The original implementation was written in C/++ for Linux/Solaris.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>How does Google’s Search Engine work?</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>The original implementation was written in C/++ for Linux/Solaris.</p>
<div class="date-created-modified">Created 2020-03-18<br>
Modified 2020-03-28</div>
<p>There are three major components in the system’s anatomy, which can be thought of as steps to be performed for Google to be what it is today.</p>
<p><img src="image-1024x649.png" alt="" /></p>
<p>But before we talk about the different components, let’s take a look at how they store all of this information.</p>
<h2 class="title" id="data_structures"><a class="anchor" href="#data_structures">¶</a>Data structures</h2>
<p>A «BigFile» is a virtual file addressable by 64 bits.</p>
<p>There exists a repository with the full HTML of every page compressed, along with a document identifier, length and URL.</p>
<table class="">
<tbody>
<tr>
<td>
sync
</td>
<td>
length
</td>
<td>
compressed packet
</td>
</tr>
</tbody>
</table>
<p>The Document Index has the document identifier, a pointer into the repository, a checksum and various other statistics.</p>
<table class="">
<tbody>
<tr>
<td>
doc id
</td>
<td>
ecode
</td>
<td>
url len
</td>
<td>
page len
</td>
<td>
url
</td>
<td>
page
</td>
</tr>
</tbody>
</table>
<p>A Lexicon stores the repository of words, implemented with a hashtable over pointers linking to the barrels (sorted linked lists) of the Inverted Index.</p>
<table class="">
<tbody>
<tr>
<td>
word id
</td>
<td>
n docs
</td>
</tr>
<tr>
<td>
word id
</td>
<td>
n docs
</td>
</tr>
</tbody>
</table>
<p>The Hit Lists store occurrences of a word in a document.</p>
<table class="">
<tbody>
<tr>
<td>
<strong>
plain
</strong>
</td>
<td>
cap: 1
</td>
<td>
imp: 3
</td>
<td>
pos: 12
</td>
</tr>
<tr>
<td>
<strong>
fancy
</strong>
</td>
<td>
cap: 1
</td>
<td>
imp: 7
</td>
<td>
type: 4
</td>
<td>
pos: 8
</td>
</tr>
<tr>
<td>
<strong>
anchor
</strong>
</td>
<td>
cap: 1
</td>
<td>
imp: 7
</td>
<td>
type: 4
</td>
<td>
hash: 4
</td>
<td>
pos: 8
</td>
</tr>
</tbody>
</table>
<p>The Forward Index is a barrel with a range of word identifiers (document identifier and list of word identifiers).</p>
<table class="">
<tbody>
<tr>
<td rowspan="3">
doc id
</td>
<td>
word id: 24
</td>
<td>
n hits: 8
</td>
<td>
hit hit hit hit hit hit hit hit
</td>
</tr>
<tr>
<td>
word id: 24
</td>
<td>
n hits: 8
</td>
<td>
hit hit hit hit hit hit hit hit
</td>
</tr>
<tr>
<td>
null word id
</td>
</tr>
</tbody>
</table>
<p>The Inverted Index can be sorted by either document identifier or by ranking of word occurrence.</p>
<table class="">
<tbody>
<tr>
<td>
doc id: 23
</td>
<td>
n hits: 5
</td>
<td>
hit hit hit hit hit
</td>
</tr>
<tr>
<td>
doc id: 23
</td>
<td>
n hits: 3
</td>
<td>
hit hit hit
</td>
</tr>
<tr>
<td>
doc id: 23
</td>
<td>
n hits: 4
</td>
<td>
hit hit hit hit
</td>
</tr>
<tr>
<td>
doc id: 23
</td>
<td>
n hits: 2
</td>
<td>
hit hit
</td>
</tr>
</tbody>
</table>
<p>Back in 1998, Google compressed its repository to 53GB and had 24 million pages. The indices, lexicon, and other temporary storage required about 55GB.</p>
<h2 id="crawling"><a class="anchor" href="#crawling">¶</a>Crawling</h2>
<p>The crawling must be reliable, fast and robust, and also respect the decision of some authors not wanting their pages crawled. Originally, it took a week or more, so simultaneous execution became a must.</p>
<p>Back in 1998, Google had between 3 and 4 crawlers running at 100 web pages per second maximum. These were implemented in Python.</p>
<p>The crawled pages need parsing to deal with typos or formatting issues.</p>
<h2 id="indexing"><a class="anchor" href="#indexing">¶</a>Indexing</h2>
<p>Indexing is about putting the pages into barrels, converting words into word identifiers, and occurrences into hit lists.</p>
<p>Once indexing is done, the barrels are sorted by word identifier, producing the inverted index. This process also had to be done in parallel over many machines, or it would otherwise have been too slow.</p>
<h2 id="searching"><a class="anchor" href="#searching">¶</a>Searching</h2>
<p>We need to find quality results efficiently. Plenty of weights are considered nowadays, but at its heart, PageRank is used. It is the algorithm they use to map the web, which is formally defined as follows:</p>
<p><img src="8e1e61b119e107fcb4bdd7e78f649985.png" alt="" />
<em>PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))</em></p>
<p>Where:</p>
<ul>
<li><code>A</code> is a given page</li>
<li><code>T<sub>n</sub></code> are pages that point to A</li>
<li><code>d</code> is the damping factor in the range <code>[0, 1]</code> (often 0.85)</li>
<li><code>C(A)</code> is the number of links going out of page <code>A</code></li>
<li><code>PR(A)</code> is the page rank of page <code>A</code></li>
</ul>
<p>This formula indicates the probability that a random surfer visits a certain page, and <code>1 - d</code> is used to indicate when it will «get bored» and stop surfing. More intuitively, the page rank of a page will grow as more pages link to it, or when the few that link to it have a high page rank themselves.</p>
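<p>To make the formula more concrete, here is a small Java sketch (purely illustrative, not Google’s implementation) of one PageRank iteration over an in-memory link graph:</p>
<pre><code>import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRank {
    // One iteration of PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)).
    // Assumes every page in the graph has at least one outgoing link; a real
    // implementation must handle dangling pages and repeat until convergence.
    static Map<String, Double> iterate(Map<String, List<String>> outLinks,
                                       Map<String, Double> ranks, double d) {
        Map<String, Double> next = new HashMap<>();
        for (String page : outLinks.keySet()) {
            next.put(page, 1 - d);
        }
        for (Map.Entry<String, List<String>> entry : outLinks.entrySet()) {
            // Each page shares its current rank evenly among its out-links.
            double share = ranks.get(entry.getKey()) / entry.getValue().size();
            for (String target : entry.getValue()) {
                next.merge(target, d * share, Double::sum);
            }
        }
        return next;
    }
}
</code></pre>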
<p>The anchor text in the links also helps provide a better description and helps indexing for even better results.</p>
<p>While searching, the main concern is disk I/O, which takes up most of the time. Caching is very important and can improve performance up to 30 times.</p>
<p>Now, in order to turn user queries into something we can search, we must parse the query and convert the words into word identifiers.</p>
<h2 id="conclusion"><a class="anchor" href="#conclusion">¶</a>Conclusion</h2>
<p>Google is designed to be an efficient, scalable, high-quality search engine. There are still bottlenecks in CPU, memory, disk speed and network I/O, but major data structures are used to make efficient use of the resources.</p>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf">The anatomy of a large-scale hypertextual Web search engine</a></li>
<li><a href="https://www.site.uottawa.ca/%7Ediana/csi4107/Google_SearchEngine.pdf">The Anatomy of a Large-Scale Hypertextual Web Search Engine (slides)</a></li>
</ul>
</main>
</body>
</html>
Privado: PC-Crawler evaluation 2dist/pc-crawler-evaluation-2/index.html2020-03-27T23:00:00+00:002020-03-15T23:00:00+00:00As the student <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Privado: PC-Crawler evaluation 2</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>As the student <code>a(i)</code> where <code>i = 9</code>, I have been assigned to evaluate students <code>a(i - 1)</code> and <code>a(i - 2)</code>, these being:</p>
<div class="date-created-modified">Created 2020-03-16<br>
Modified 2020-03-28</div>
<ul>
<li>a08: Classmate (username)</li>
<li>a07: Classmate (username)</li>
</ul>
<p>The evaluation is done according to the criteria described in Segunda entrega del PC-Crawler.</p>
<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: A.</strong></p>
<p>This is the evaluation of Crawler – Thesauro.</p>
<p>It’s a well-written post, properly using WordPress code blocks, and they explain the process of improving the code and what it does. Because there are no noticeable issues with the post, they get the highest grading.</p>
<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: B.</strong></p>
<p>This is the evaluation of Actividad 2-Crawler.</p>
<p>They start with an introduction on what they will do.</p>
<p>Next, they show the code they have written, also describing what it does, although they don’t explain <em>why</em> they chose the data structures they used.</p>
<p>The style of the code leaves a lot to be desired, and they should have embedded the code in the post instead of taking screenshots: people who rely on screen readers will not be able to read the code.</p>
<p>I have graded them B and not A for this last reason.</p>
</main>
</body>
</html>
What is ElasticSearch and why should you care?dist/what-is-elasticsearch-and-why-should-you-care/index.html2020-03-26T23:00:00+00:002020-03-17T23:00:00+00:00ElasticSearch is a giant search index with powerful analytics capabilities. It’s like a database and search engine on steroids, really easy and fast to get up and running. One can think of it as your own Google, a search engine with analytics.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>What is ElasticSearch and why should you care?</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>ElasticSearch is a giant search index with powerful analytics capabilities. It’s like a database and search engine on steroids, really easy and fast to get up and running. One can think of it as your own Google, a search engine with analytics.</p>
<div class="date-created-modified">Created 2020-03-18<br>
Modified 2020-03-27</div>
<p>ElasticSearch is rich, stable, performs well, is well maintained, and able to scale to petabytes of any kind of data, whether it’s structured, semi-structured or not at all. It’s cost-effective and can be used to make business decisions.</p>
<p>Or, described in 10 seconds:</p>
<blockquote>
<p>Schema-free, REST & JSON based distributed document store
Open source: Apache License 2.0
Zero configuration</p>
</blockquote>
<p>-- Alex Reelsen</p>
<h2 class="title" id="basic_capabilities"><a class="anchor" href="#basic_capabilities">¶</a>Basic capabilities</h2>
<p>ElasticSearch lets you ask questions about your data, not just make queries. You may think SQL can do this too, but what’s important is making a pipeline of facets and feeding the results from query to query.</p>
<p>Instead of changing your data, you can be flexible with your questions with no need to re-index it every time the questions change.</p>
<p>ElasticSearch is not just for searching full-text data, either. It can search structured data and return more than just the results: it also yields additional data, such as ranking and highlights, and allows for pagination.</p>
<p>It doesn’t take a lot of configuration to get running, either, which can be a good boost on productivity.</p>
<h2 id="how_does_it_work_"><a class="anchor" href="#how_does_it_work_">¶</a>How does it work?</h2>
<p>ElasticSearch depends on Java, and can work in a distributed cluster if you execute multiple instances. Data will be replicated and sharded as needed. The current version at the time of writing is 7.6.1, and it’s being developed fast!</p>
<p>It also has support for plugins, with an ever-growing ecosystem and integrations for many programming languages. Tools are being built around it, too, like Kibana, which helps you visualize your data.</p>
<p>The way you use it is through a JSON API, served over HTTP/S.</p>
<h2 id="how_can_i_use_it_"><a class="anchor" href="#how_can_i_use_it_">¶</a>How can I use it?</h2>
<p><a href="https://www.elastic.co/downloads/">You can try ElasticSearch out for free on Elastic Cloud</a>, however, it can also be <a href="https://www.elastic.co/downloads/elasticsearch">downloaded and ran offline</a>, which is what we’ll do. Download the file corresponding to your operating system, unzip it, and execute the binary. Running it is as simple as that!</p>
<p>Now you can make queries to it over HTTP, with for example <code>curl</code>:</p>
<pre><code>curl -X PUT localhost:9200/orders/order/1 -H 'Content-Type: application/json' -d '
{
  "created_at": "2013/09/05 15:45:10",
  "items": [
    {
      "name": "HD Monitor"
    }
  ],
  "total": 249.95
}'
</code></pre>
<p>This will create a new order with some information, such as when it was created, what items it contains, and the total cost of the order.</p>
<p>You can then query or filter as needed, script it or even create statistics.</p>
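<p>For instance, assuming the order above was indexed, a simple URI search for it could look like this:</p>
<pre><code>curl 'localhost:9200/orders/_search?q=items.name:monitor'
</code></pre>
<p>The response is a JSON document with the matching hits and their relevance scores.</p>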
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://youtu.be/sKnkQSec1U0">YouTube – What is Elasticsearch?</a></li>
<li><a href="https://youtu.be/yWNiRC_hUAw">YouTube – GOTO 2013 • Elasticsearch – Beyond Full-text Search • Alex Reelsen</a></li>
<li><a href="https://www.elastic.co/kibana">Kibana – Your window into the Elastic Stack</a></li>
<li><a href="https://www.elastic.co/guide/index.html">Elastic Stack and Product Documentation</a></li>
</ul>
</main>
</body>
</html>
Privado: NoSQL evaluationdist/nosql-evaluation/index.html2020-03-26T23:00:00+00:002020-03-15T23:00:00+00:00I have decided to evaluate Classmate‘s post and Classmate‘s post, because they review databases I have not seen or used before, and I think it would be interesting to see new ones.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Privado: NoSQL evaluation</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>I have decided to evaluate Classmate‘s post and Classmate‘s post, because they review databases I have not seen or used before, and I think it would be interesting to see new ones.</p>
<div class="date-created-modified">Created 2020-03-16<br>
Modified 2020-03-27</div>
<p>The evaluation is based on the requirements defined by Trabajos en grupo sobre Bases de Datos NoSQL:</p>
<blockquote>
<p><strong>1st entry:</strong> Description of the purpose of the technology and how the NoSQL database works, its characteristics, the corner it occupies in the CAP theorem, where to download it from, and how to install it.</p>
</blockquote>
<p>-- Teacher</p>
<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: A.</strong></p>
<p>The post I have evaluated is BB.DD. NoSQL: Voldemort 1ª Fase.</p>
<p>The post doesn’t start very well, because the first sentence has (emphasis mine):</p>
<blockquote>
<p>In it, we will review what <strong>MongoDB</strong> consists of, its characteristics, and how to install it, among other things.</p>
</blockquote>
<p>-- Classmate</p>
<p>…yet the post is about Voldemort!</p>
<p>The post does detail how it works, its architecture, corner in the CAP theorem, download and installation.</p>
<p>I have graded the post with A because I think it meets all the requirements, even if they slipped a bit in the beginning.</p>
<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: A.</strong></p>
<p>The post I have evaluated is Raven.</p>
<p>They have done a good job describing the project’s goals, corner in the CAP theorem, download, and provide an extensive installation section.</p>
<p>They don’t seem to use some of WordPress’ features, such as lists, but otherwise the post is good and deserves an A grading.</p>
</main>
</body>
</html>
Integrating Apache Tika into our Crawlerdist/integrating-apache-tika-into-our-crawler/index.html2020-03-24T23:00:00+00:002020-03-17T23:00:00+00:00In our last crawler post<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Integrating Apache Tika into our Crawler</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p><a href="/blog/ribw/upgrading-our-baby-crawler/">In our last crawler post</a>, we detailed how our crawler worked, and although it did a fine job, it’s time for some extra upgrading.</p>
<div class="date-created-modified">Created 2020-03-18<br>
Modified 2020-03-25</div>
<h2 class="title" id="what_kind_of_upgrades_"><a class="anchor" href="#what_kind_of_upgrades_">¶</a>What kind of upgrades?</h2>
<p>A small but useful one. We are adding support for file types that contain text but cannot be processed by normal text editors, because they are structured rather than plain text (such as PDF files, Excel or Word documents…).</p>
<p>And for this task, we will make use of the help offered by <a href="https://tika.apache.org/">Tika</a>, our friendly Apache tool.</p>
<h2 id="what_is_tika_"><a class="anchor" href="#what_is_tika_">¶</a>What is Tika?</h2>
<p><a href="https://tika.apache.org/">Tika</a> is a set of libraries offered by <a href="https://en.wikipedia.org/wiki/The_Apache_Software_Foundation">The Apache Software Foundation</a> that we can include in our project in order to extract the text and metadata of files from a <a href="https://tika.apache.org/1.24/formats.html">long list of supported formats</a>.</p>
<h2 id="changes_in_the_code"><a class="anchor" href="#changes_in_the_code">¶</a>Changes in the code</h2>
<p>Not much has changed in the structure of the crawler; we have simply added a new method in <code>Utils</code> that uses the <code>Tika</code> class from the previously mentioned library to process and extract the text of more file types.</p>
<p>Then, we use this text just like we would for a standard text file (checking the thesaurus and adding it to the word map) and voilà! We have just added support for a big range of file types.</p>
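<p>As a minimal sketch of what that helper could look like (the method name is hypothetical; <code>Tika#parseToString</code> is the library’s facade method):</p>
<pre><code>import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class Utils {
    private static final Tika TIKA = new Tika();

    // Detects the file type and extracts its text content, whether it is a
    // PDF, a Word document, a spreadsheet or just plain text.
    public static String extractText(File file) throws IOException, TikaException {
        return TIKA.parseToString(file);
    }
}
</code></pre>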
<h2 id="incorporating_gradle"><a class="anchor" href="#incorporating_gradle">¶</a>Incorporating Gradle</h2>
<p>In order for the previous code to work, we need to make use of external libraries. To make this process easier and because the project is growing, we decided to use <a href="https://gradle.org/">Gradle</a>, a build system that can be used for projects in various programming languages, such as Java.</p>
<p>We followed their <a href="https://guides.gradle.org/building-java-applications/">guide to Building Java Applications</a>, and in a few steps added the required <code>.gradle</code> files. Now we can compile and run the code without having to worry about juggling with Java and external dependencies in a single command:</p>
<pre><code>./gradlew run
</code></pre>
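<p>For reference, pulling Tika in boils down to a single line in the <code>dependencies</code> block of <code>build.gradle</code> (the exact artifact and version are an assumption based on the Tika release current at the time):</p>
<pre><code>dependencies {
    implementation 'org.apache.tika:tika-parsers:1.24'
}
</code></pre>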
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>And here you can download the final result:</p>
<p><em>download removed</em></p>
</main>
</body>
</html>
Cassandra: Basic Operations and Architecturedist/nosql-databases-basic-operations-and-architecture/index.html2020-03-23T23:00:00+00:002020-03-04T23:00:00+00:00This is the second post in the NoSQL Databases series, with a brief description on the basic operations (such as insertion, retrieval, indexing…), and complete execution along with the data model / architecture.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Cassandra: Basic Operations and Architecture</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the second post in the NoSQL Databases series, with a brief description on the basic operations (such as insertion, retrieval, indexing…), and complete execution along with the data model / architecture.</p>
<div class="date-created-modified">Created 2020-03-05<br>
Modified 2020-03-24</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/nosql-databases-an-introduction/">Cassandra: an Introduction</a></li>
<li><a href="/blog/ribw/nosql-databases-basic-operations-and-architecture/">Cassandra: Basic Operations and Architecture</a> (this post)</li>
</ul>
<hr />
<p>Cassandra uses its own query language for managing databases, known as <strong>CQL</strong> (<strong>Cassandra Query Language</strong>). Cassandra stores data in <strong><em>tables</em></strong>, as in relational databases, and these tables are grouped in <strong><em>keyspaces</em></strong>. A keyspace defines a number of options that apply to all the tables it contains. The most used option is the <strong>replication strategy</strong>. It is recommended to have only one keyspace per application.</p>
<p>It is important to mention that <strong>tables and keyspaces</strong> are <strong>case insensitive</strong>, so myTable is equivalent to mytable, but it is possible to <strong>force case sensitivity</strong> using <strong>double-quotes</strong>.</p>
<p>To begin with the basic operations it is necessary to deploy Cassandra:</p>
<ol>
<li>Open a terminal in the root of the Apache Cassandra folder downloaded in the previous post.</li>
<li>Run the command:</li>
</ol>
<pre><code>$ bin/cassandra
</code></pre>
<p>Once Cassandra is deployed, it is time to open a <strong>CQL Shell</strong> in <strong>another terminal</strong> with the command:</p>
<pre><code>$ bin/cqlsh
</code></pre>
<p>It is possible to check that Cassandra is deployed if the CQL Shell prints the following message:</p>
<p><img src="uwqQgQte-cuYb_pePFOuY58re23kngrDKNgL1qz4yOfnBDZkqMIH3fFuCrye.png" alt="" />
<em>CQL Shell</em></p>
<h2 class="title" id="create_insert"><a class="anchor" href="#create_insert">¶</a>Create/Insert</h2>
<h3 id="ddl_data_definition_language_"><a class="anchor" href="#ddl_data_definition_language_">¶</a>DDL (Data Definition Language)</h3>
<h4 id="create_keyspace"><a class="anchor" href="#create_keyspace">¶</a>Create keyspace</h4>
<p>A keyspace is created using a <strong>CREATE KEYSPACE</strong> statement:</p>
<pre><code>CREATE KEYSPACE [ IF NOT EXISTS ] keyspace_name WITH options;
</code></pre>
<p>The supported “<strong>options</strong>” are:</p>
<ul>
<li>“<strong>replication</strong>”: this is <strong>mandatory</strong> and defines the <strong>replication strategy</strong> and the <strong>replication factor</strong> (the number of nodes that will have a copy of the data). Within this option there is a property called “<strong>class</strong>” in which the <strong>replication strategy</strong> is specified (“SimpleStrategy” or “NetworkTopologyStrategy”)</li>
<li>“<strong>durable_writes</strong>”: this is <strong>not mandatory</strong> and makes it possible to use the <strong>commit logs for updates</strong></li>
</ul>
<p>Attempting to create an already existing keyspace will return an error unless the <strong>IF NOT EXISTS</strong> directive is used.</p>
<p>As an example of this statement, we create a keyspace named “test_keyspace” with “SimpleStrategy” as its replication “class” and a “replication_factor” of 3.</p>
<pre><code>CREATE KEYSPACE test_keyspace
    WITH replication = {'class': 'SimpleStrategy',
                        'replication_factor' : 3};
</code></pre>
<p>The <strong>USE</strong> statement allows changing the current <strong>keyspace</strong>. The syntax of this statement is very simple:</p>
<pre><code>USE keyspace_name;
</code></pre>
<p><img src="RDWIG2RwvEevUFQv6TGFtGzRm4_9ERpxPf0feriflaj3alvWw3FEIAr_ZdF1.png" alt="" />
<em>USE statement</em></p>
<p>It is also possible to get the metadata of a keyspace with the <strong>DESCRIBE</strong> statement.</p>
<pre><code>DESCRIBE KEYSPACES | KEYSPACE keyspace_name;
</code></pre>
<h4 id="create_table"><a class="anchor" href="#create_table">¶</a>Create table</h4>
<p>Creating a new table uses the <strong>CREATE TABLE</strong> statement:</p>
<pre><code>CREATE TABLE [ IF NOT EXISTS ] table_name
    '('
        column_definition
        ( ',' column_definition )*
        [ ',' PRIMARY KEY '(' primary_key ')' ]
    ')' [ WITH table_options ];
</code></pre>
<p>With “column_definition” as: column_name cql_type [ STATIC ] [ PRIMARY KEY]; “primary_key” as: partition_key [ ‘,’ clustering_columns ]; and “table_options” as: COMPACT STORAGE [ AND table_options ] or CLUSTERING ORDER BY ‘(‘ clustering_order ‘)’ [ AND table_options ] or “options”.</p>
<p>Attempting to create an already existing table will return an error unless the <strong>IF NOT EXISTS</strong> directive is used.</p>
<p>The <strong>CQL types</strong> are described in the References section.</p>
<p>For example, we are going to create a table called “species_table” in the keyspace “test_keyspace”, in which we will have a “species” text (as PRIMARY KEY), a “common_name” text, a “population” varint, an “average_size” int and a “sex” text. Besides, we are going to add a comment to the table: “Some species records”.</p>
<pre><code>CREATE TABLE species_table (
    species text PRIMARY KEY,
    common_name text,
    population varint,
    average_size int,
    sex text
) WITH comment='Some species records';
</code></pre>
<p>It is also possible to get the metadata of a table with the <strong>DESCRIBE</strong> statement.</p>
<pre><code>DESCRIBE TABLES | TABLE [keyspace_name.]table_name;
</code></pre>
<h3 id="dml_data_manipulation_language_"><a class="anchor" href="#dml_data_manipulation_language_">¶</a>DML (Data Manipulation Language)</h3>
<h4 id="insert_data"><a class="anchor" href="#insert_data">¶</a>Insert data</h4>
<p>Inserting data for a row is done using an <strong>INSERT</strong> statement:</p>
<pre><code>INSERT INTO table_name ( names_values | json_clause )
    [ IF NOT EXISTS ]
    [ USING update_parameter ( AND update_parameter )* ];
</code></pre>
<p>Where “names_values” is: names VALUES tuple_literal; “json_clause” is: JSON string [ DEFAULT ( NULL | UNSET ) ]; and “update_parameter” is usually: TTL.</p>
<p>For example, we are going to use both the VALUES and JSON clauses to insert data into the table “species_table”. In the VALUES clause it is necessary to supply the list of columns, unlike in the JSON clause, where it is optional.</p>
<p>Note: TTL (Time To Live) and Timestamp are mechanisms for expiring data: once the time set has passed, the data expires.</p>
<p>In the VALUES clause we are going to insert a new species called “White monkey”, with an average size of 3; its common name is “Copito de nieve”, population 0 and sex “male”.</p>
<pre><code>INSERT INTO species_table (species, common_name, population, average_size, sex)
VALUES ('White monkey', 'Copito de nieve', 0, 3, 'male');
</code></pre>
<p>In the JSON clause we are going to insert a new species called “Cloned Sheep”, with an average size of 1; its common name is “Dolly the Sheep”, population 0 and sex “female”.</p>
<pre><code>INSERT INTO species_table JSON '{"species": "Cloned Sheep",
"common_name": "Dolly the Sheep",
"average_size":1,
"population":0,
"sex": "female"}';
</code></pre>
<p>Note: all updates for an <strong>INSERT</strong> are applied <strong>atomically</strong> and in <strong>isolation</strong>.</p>
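<p>Since the update_parameter accepts a TTL (in seconds), a hypothetical insert whose row expires after one day would be:</p>
<pre><code>INSERT INTO species_table (species, common_name, population, average_size, sex)
VALUES ('Mayfly', 'Ephemeral mayfly', 100, 1, 'female')
USING TTL 86400;
</code></pre>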
<h2 id="read"><a class="anchor" href="#read">¶</a>Read</h2>
<p>Querying data from a table is done using a <strong>SELECT</strong> statement:</p>
<pre><code>SELECT [ JSON | DISTINCT ] ( select_clause | '*' )
    FROM table_name
    [ WHERE where_clause ]
    [ GROUP BY group_by_clause ]
    [ ORDER BY ordering_clause ]
    [ PER PARTITION LIMIT (integer | bind_marker) ]
    [ LIMIT (integer | bind_marker) ]
    [ ALLOW FILTERING ];
</code></pre>
<p>The <strong>CQL SELECT</strong> statement is very <strong>similar</strong> to the <strong>SQL SELECT</strong> statement, as both allow filtering (<strong>WHERE</strong>), grouping data (<strong>GROUP BY</strong>), ordering the data (<strong>ORDER BY</strong>) and limiting the number of rows (<strong>LIMIT</strong>). Besides, <strong>CQL</strong> offers a <strong>limit per partition</strong> and allows explicit <strong>filtering</strong> of <strong>data</strong>.</p>
<p>Note: as in SQL, it is possible to set aliases on the data with the <strong>AS</strong> statement.</p>
<p>For example, we are going to retrieve all the information about those rows of “species_table” whose “sex” is “male”. ALLOW FILTERING is needed here because “sex” is neither part of the primary key nor indexed.</p>
<pre><code>SELECT * FROM species_table WHERE sex = 'male' ALLOW FILTERING;
</code></pre>
<p><img src="s6GrKIGATvOSD7oGRNScUU5RnLN_-3X1JXvnVi_wDT_hrmPMZdnCdBI8DpIJ.png" alt="" />
<em>SELECT statement</em></p>
<p>Furthermore, we are going to test the SELECT JSON statement. For this, we are going to retrieve only the name of the species with a population of 0.</p>
<pre><code>SELECT JSON species FROM species_table WHERE population = 0 ALLOW FILTERING;
</code></pre>
<p><img src="Up_eHlqKQp2RI5XIbgPOvj1B5J3gLxz7v7EI0NDRgezQTipecdfDT6AQoso0.png" alt="" />
<em>SELECT JSON statement</em></p>
<h2 id="update"><a class="anchor" href="#update">¶</a>Update</h2>
<h3 id="ddl_data_definition_language__2"><a class="anchor" href="#ddl_data_definition_language__2">¶</a>DDL (Data Definition Language)</h3>
<h4 id="alter_keyspace"><a class="anchor" href="#alter_keyspace">¶</a>Alter keyspace</h4>
<p>The <strong>ALTER KEYSPACE</strong> statement allows modifying the options of a keyspace:</p>
<pre><code>ALTER KEYSPACE keyspace_name WITH options;
</code></pre>
<p>Note: the supported <strong>options</strong> are the same as for creating a keyspace, “<strong>replication</strong>” and “<strong>durable_writes</strong>”.</p>
<p>As an example, we modify the keyspace named “test_keyspace” to set a “replication_factor” of 4.</p>
<pre><code>ALTER KEYSPACE test_keyspace
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 4};
</code></pre>
<h4 id="alter_table"><a class="anchor" href="#alter_table">¶</a>Alter table</h4>
<p>Altering an existing table uses the <strong>ALTER TABLE</strong> statement:</p>
<pre><code>ALTER TABLE table_name alter_table_instruction;
</code></pre>
<p>Where “alter_table_instruction” can be: ADD column_name cql_type ( ‘,’ column_name cql_type )*; or DROP column_name ( column_name )*; or WITH options.</p>
<p>As an example, we ADD a new column called “extinct”, of type “boolean”, to the table “species_table”.</p>
<pre><code>ALTER TABLE species_table ADD extinct boolean;
</code></pre>
<p>Another example is to DROP the column called “sex” from the table “species_table”.</p>
<pre><code>ALTER TABLE species_table DROP sex;
</code></pre>
<p>Finally, we alter the comment with the WITH clause and set it to “All species records”.</p>
<pre><code>ALTER TABLE species_table WITH comment='All species records';
</code></pre>
<p>These changes can be checked with the <strong>DESCRIBE</strong> statement:</p>
<pre><code>DESCRIBE TABLE species_table;
</code></pre>
<p><img src="xebKPqkWkn97YVHpRVXZYWvRUfeRUyCH-vPDs67aFaEeU53YTRbDOFscOlAr.png" alt="" />
<em>DESCRIBE table</em></p>
<h3 id="dml_data_manipulation_language__2"><a class="anchor" href="#dml_data_manipulation_language__2">¶</a>DML (Data Manipulation Language)</h3>
<h4 id="update_data"><a class="anchor" href="#update_data">¶</a>Update data</h4>
<p>Updating a row is done using an <strong>UPDATE</strong> statement:</p>
<pre><code>UPDATE table_name
    [ USING update_parameter ( AND update_parameter )* ]
    SET assignment ( ',' assignment )*
    WHERE where_clause
    [ IF ( EXISTS | condition ( AND condition )*) ];
</code></pre>
<p>Where the update_parameter is: ( TIMESTAMP | TTL) (integer | bind_marker)</p>
<p>It is important to mention that the <strong>WHERE</strong> clause is used to select the row to update and <strong>must include all columns</strong> composing the <strong>PRIMARY KEY</strong>.</p>
<p>We are going to test this statement by updating the column “extinct” to true in the row whose species is ‘White monkey’.</p>
<pre><code>UPDATE species_table SET extinct = true WHERE species='White monkey';
</code></pre>
<p><img src="IcaCe6VEC5c0ZQIygz-CiclzFyt491u7xPMg2muJLR8grmqaiUzkoQsVCoHf.png" alt="" />
<em>SELECT statement</em></p>
<h2 id="delete"><a class="anchor" href="#delete">¶</a>Delete</h2>
<h3 id="ddl_data_definition_language__3"><a class="anchor" href="#ddl_data_definition_language__3">¶</a>DDL (Data Definition Language)</h3>
<h4 id="drop_keyspace"><a class="anchor" href="#drop_keyspace">¶</a>Drop keyspace</h4>
<p>Dropping a keyspace can be done using the <strong>DROP KEYSPACE</strong> statement:</p>
<pre><code>DROP KEYSPACE [ IF EXISTS ] keyspace_name;
</code></pre>
<p>For example, drop the keyspace called “test_keyspace_2” if it exists:</p>
<pre><code>DROP KEYSPACE IF EXISTS test_keyspace_2;
</code></pre>
<p>As this keyspace does not exist, this statement will do nothing.</p>
<h4 id="drop_table"><a class="anchor" href="#drop_table">¶</a>Drop table</h4>
<p>Dropping a table uses the <strong>DROP TABLE</strong> statement:</p>
<pre><code>DROP TABLE [ IF EXISTS ] table_name;
</code></pre>
<p>For example, drop the table called “species_2” if it exists: </p>
<pre><code>DROP TABLE IF EXISTS species_2;
</code></pre>
<p>As this table does not exist, this statement will do nothing.</p>
<h4 id="truncate_table_"><a class="anchor" href="#truncate_table_">¶</a>Truncate (table)</h4>
<p>A table can be truncated using the <strong>TRUNCATE</strong> statement:</p>
<pre><code>TRUNCATE [ TABLE ] table_name;
</code></pre>
<p>Do not execute this command now, because if you do, you will need to insert the previous data again.</p>
<p>Note: as tables are the only object that can be truncated, the TABLE keyword can be omitted.</p>
<p><img src="FOkhfpxlWFQCzcdfeWxLTy7wx5inDv0xwVeVhE79Pqtk3yYzWsZJnz_SBhUi.png" alt="" />
<em>TRUNCATE statement</em></p>
<h3 id="dml_data_manipulation_language__3"><a class="anchor" href="#dml_data_manipulation_language__3">¶</a>DML (Data Manipulation Language)</h3>
<h4 id="delete_data"><a class="anchor" href="#delete_data">¶</a>Delete data</h4>
<p>Deleting rows or parts of rows uses the <strong>DELETE</strong> statement:</p>
<pre><code>DELETE [ simple_selection ( ',' simple_selection ) ]
    FROM table_name
    [ USING update_parameter ( AND update_parameter )* ]
    WHERE where_clause
    [ IF ( EXISTS | condition ( AND condition )*) ]
</code></pre>
<p>Now we are going to delete the value of the column “average_size” from “Cloned Sheep”.</p>
<pre><code>DELETE average_size FROM species_table WHERE species = 'Cloned Sheep';
</code></pre>
<p><img src="CyuQokVL5J9TAelq-WEWhNl6kFtbIYs0R1AeU5NX4EkG-YQI81mNHdnf2yWN.png" alt="" />
<em>DELETE value statement</em></p>
<p>And we are going to delete the same row as mentioned before.</p>
<pre><code>DELETE FROM species_table WHERE species = 'Cloned Sheep';
</code></pre>
<p><img src="jvQ5cXJ5GTVQ6giVhBEpPJmrJw-zwKKyB9nsTm5PRcGSTzkmh-WO4kTeuLpB.png" alt="" />
<em>DELETE row statement</em></p>
<h2 id="batch"><a class="anchor" href="#batch">¶</a>Batch</h2>
<p>Multiple <strong>INSERT</strong>, <strong>UPDATE</strong> and <strong>DELETE</strong> statements can be executed in a <strong>single statement</strong> by grouping them through a <strong>BATCH</strong> statement.</p>
<pre><code>BEGIN [ UNLOGGED | COUNTER ] BATCH
    [ USING update_parameter ( AND update_parameter )* ]
    modification_statement ( ';' modification_statement )*
APPLY BATCH;
</code></pre>
<p>Where modification_statement can be an insert_statement, an update_statement or a delete_statement.</p>
<ul>
<li><strong>UNLOGGED</strong> means the batch skips the batch log, giving up the guarantee that either all operations in the batch eventually complete or none will.</li>
<li><strong>COUNTER</strong> means that the updates are not idempotent, so each time we execute the updates in a batch we will get different results.</li>
</ul>
<p>For example:</p>
<pre><code>BEGIN BATCH
    INSERT INTO species_table (species, common_name, population, average_size, extinct)
    VALUES ('Blue Shark', 'Tiburillo', 30, 10, false);
    INSERT INTO species_table (species, common_name, population, average_size, extinct)
    VALUES ('Cloned sheep', 'Dolly the Sheep', 1, 1, true);
    UPDATE species_table SET population = 2 WHERE species='Cloned sheep';
    DELETE FROM species_table WHERE species = 'White monkey';
APPLY BATCH;
</code></pre>
<p><img src="EL9Dac26o0FqkVoeAKmopEKQe0wWq-xYI14b9RzGxtUkFJA3i2eTiR6qkuuJ.png" alt="" />
<em>BATCH statement</em></p>
<h2 id="index"><a class="anchor" href="#index">¶</a>Index</h2>
<p>CQL supports creating secondary indexes on tables, allowing queries on the table to use those indexes.</p>
<p><strong>Creating</strong> a secondary index on a table uses the <strong>CREATE INDEX</strong> statement:</p>
<pre><code>CREATE [ CUSTOM ] INDEX [ IF NOT EXISTS ] [ index_name ]
    ON table_name '(' index_identifier ')'
    [ USING string [ WITH OPTIONS = map_literal ] ];
</code></pre>
<p>For example, we are going to create an index called “population_idx” on the column “population” of the table “species_table”.</p>
<pre><code>CREATE INDEX population_idx ON species_table (population);
</code></pre>
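<p>With this index in place, equality queries on “population” no longer require ALLOW FILTERING:</p>
<pre><code>SELECT * FROM species_table WHERE population = 2;
</code></pre>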
<p><strong>Dropping</strong> a secondary index uses the <strong>DROP INDEX</strong> statement:</p>
<pre><code>DROP INDEX [ IF EXISTS ] index_name;
</code></pre>
<p>Now, we are going to drop the previous index:</p>
<pre><code>DROP INDEX IF EXISTS population_idx;
</code></pre>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://cassandra.apache.org/doc/latest/cql/ddl.html">Cassandra CQL</a></li>
<li><a href="https://techdifferences.com/difference-between-ddl-and-dml-in-dbms.html">Differences between DML and DDL</a></li>
<li><a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cqlReferenceTOC.html">Datastax CQL</a></li>
<li><a href="https://cassandra.apache.org/doc/latest/cql/types.html#grammar-token-cql-type">Cassandra CQL Types</a></li>
<li><a href="https://cassandra.apache.org/doc/latest/cql/indexes.html">Cassandra Index</a></li>
</ul>
</main>
</body>
</html>
Upgrading our Baby Crawlerdist/upgrading-our-baby-crawler/index.html2020-03-17T23:00:00+00:002020-03-10T23:00:00+00:00In our <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Upgrading our Baby Crawler</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>In our <a href="/blog/ribw/build-your-own-pc/">last post on this series</a>, we presented the code for our Personal Crawler. However, we didn’t quite explain what a crawler even is! We will use this moment to go a bit more in-depth, and make some upgrades to it.</p>
<div class="date-created-modified">Created 2020-03-11<br>
Modified 2020-03-18</div>
<h2 class="title" id="what_is_a_crawler_"><a class="anchor" href="#what_is_a_crawler_">¶</a>What is a Crawler?</h2>
<p>A crawler is a program whose job is to analyze documents and extract data from them. For example, search engines like <a href="http://duckduckgo.com/">DuckDuckGo</a>, <a href="https://bing.com/">Bing</a> or <a href="http://google.com/">Google</a> all have crawlers to analyze websites and build a database around them. They are some kind of «trackers», because they keep track of everything they find.</p>
<p>Their basic behaviour can be described as follows: given a starting list of URLs, follow them all and identify hyperlinks inside the documents. Add these to the list of links to follow, and repeat <em>ad infinitum</em>.</p>
<ul>
<li>This lets us create an index to quickly search across them all.</li>
<li>We can also identify broken links.</li>
<li>We can gather any other type of information that we find.</li>
</ul>
<p>Our crawler will work offline, within our own computer, scanning the text documents it finds under the root we tell it to scan.</p>
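<p>To make that loop concrete, here is a toy, self-contained Java sketch of the crawl cycle described above. The «web» is an in-memory map and the link syntax is made up; our real crawler works on local files instead:</p>
<pre><code>import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class CrawlLoop {
    // Made-up link syntax so we don't need a real HTML parser
    private static final Pattern LINK = Pattern.compile("link:(\\w+)");
    public static void main(String[] args) {
        // A tiny in-memory "web": page name -> contents
        Map<String, String> web = new HashMap<>();
        web.put("a", "see link:b and link:c");
        web.put("b", "back to link:a");
        web.put("c", "no links here");
        // Start from a list of known pages and follow every new link found
        Queue<String> frontier = new ArrayDeque<>(Collections.singletonList("a"));
        Set<String> seen = new HashSet<>(frontier);
        String page;
        while ((page = frontier.poll()) != null) {
            System.out.println("visiting " + page);
            Matcher matcher = LINK.matcher(web.getOrDefault(page, ""));
            while (matcher.find()) {
                String link = matcher.group(1);
                if (seen.add(link)) { // only enqueue pages we haven't seen yet
                    frontier.add(link);
                }
            }
        }
    }
}
</code></pre>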
<h2 id="design_decissions"><a class="anchor" href="#design_decissions">¶</a>Design Decissions</h2>
<ul>
<li>We will use Java. Its runtime is quite ubiquitous, so it should be able to run in virtually anywhere. The language is typed, which helps catch errors early on.</li>
<li>Our solution is iterative. While recursion can be seen as more elegant by some, iterative solutions are often more performant with less need for optimization.</li>
</ul>
<h2 id="requirements"><a class="anchor" href="#requirements">¶</a>Requirements</h2>
<p>If you don’t have Java installed yet, you can <a href="https://java.com/en/download/">Download Free Java Software</a> from Oracle’s site. To compile the code, the <a href="https://www.oracle.com/java/technologies/javase-jdk8-downloads.html">Java Development Kit</a> is also necessary.</p>
<p>We don’t depend on any other external libraries, for easier deployment and compilation.</p>
<h2 id="implementation"><a class="anchor" href="#implementation">¶</a>Implementation</h2>
<p>Because the code was getting pretty large, it has been split into several files, and we have also upgraded it to use a Graphical User Interface instead! We decided to use Swing, based on the Java tutorial <a href="https://docs.oracle.com/javase/tutorial/uiswing/">Creating a GUI With JFC/Swing</a>.</p>
<h3 id="app"><a class="anchor" href="#app">¶</a>App</h3>
<p>This file is the entry point of our application. Its job is to initialize the components, lay them out in the main panel, and connect the event handlers.</p>
<p>Most widgets are pretty standard, and are defined as class variables. However, some variables are notable. The <a href="https://docs.oracle.com/javase/8/docs/api/javax/swing/table/DefaultTableModel.html"><code>DefaultTableModel</code></a> is used because it allows us to <a href="https://stackoverflow.com/a/22550106">dynamically add rows</a>, and we also have a <a href="https://docs.oracle.com/javase/8/docs/api/javax/swing/SwingWorker.html"><code>SwingWorker</code></a> subclass responsible for performing the word analysis (which is quite CPU-intensive and should not be run in the UI thread!).</p>
<p>There are a few utility methods to ease some common operations, such as <code>updateStatus</code>, which changes the status label in the main window, informing the user of the latest changes.</p>
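<p>As a rough illustration of that pattern (the names here are ours, not the actual project’s), a minimal <code>SwingWorker</code> that runs a slow task off the UI thread and updates a status label when it finishes could look like this:</p>
<pre><code>import javax.swing.*;
class WorkerDemo {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("demo");
            JLabel status = new JLabel("analyzing…");
            frame.add(status);
            frame.setSize(240, 80);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
            new SwingWorker<Integer, Void>() {
                @Override
                protected Integer doInBackground() throws Exception {
                    Thread.sleep(1000); // stand-in for the CPU-intensive word analysis
                    return 42;
                }
                @Override
                protected void done() {
                    try {
                        status.setText("done: " + get() + " words");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }.execute();
        });
    }
}
</code></pre>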
<h3 id="thesaurus"><a class="anchor" href="#thesaurus">¶</a>Thesaurus</h3>
<p>A thesaurus is a collection of words or terms used to represent concepts. In literature this is commonly known as a dictionary.</p>
<p>For this project, we use a thesaurus based on how relevant a word is to the meaning of a sentence, filtering out those that barely give us any information.</p>
<p>This file contains a simple thesaurus implementation, which can trivially be used as a normal or inverted thesaurus. However, we only treat it as inverted, and its job is loading itself and determining if words are valid or should otherwise be ignored.</p>
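<p>A minimal sketch of such an inverted thesaurus (essentially a stop-word list; the word list below is made up) could look like this:</p>
<pre><code>import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
class StopWords {
    // Words in this set carry little meaning and are filtered out
    private final Set<String> stopWords;
    StopWords(String... words) {
        this.stopWords = new HashSet<>(Arrays.asList(words));
    }
    boolean isValid(String word) {
        return !stopWords.contains(word.toLowerCase());
    }
    public static void main(String[] args) {
        StopWords thesaurus = new StopWords("the", "a", "of");
        System.out.println(thesaurus.isValid("crawler")); // true
        System.out.println(thesaurus.isValid("the"));     // false
    }
}
</code></pre>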
<h3 id="utils"><a class="anchor" href="#utils">¶</a>Utils</h3>
<p>Several utility functions used across the codebase.</p>
<h3 id="wordmap"><a class="anchor" href="#wordmap">¶</a>WordMap</h3>
<p>This file is the important one, and its implementation hasn’t changed much since our last post. Instances of a word map contain… wait for it… a map of words! It stores the mapping <code>word → count</code> in memory, and offers methods to query the count of a word or iterate over the word count entries.</p>
<p>It can be loaded from cache or told to analyze a root path. Once an instance is created, additional files could be analyzed one by one if desired.</p>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>The code was getting a bit too large to embed within the blog post itself, so instead you can download it as a <code>.zip</code> file.</p>
<p><em>download removed</em></p>
</main>
</body>
</html>
Cassandra: an Introductiondist/cassandra-an-introduction/index.html2020-03-17T23:00:00+00:002020-03-04T23:00:00+00:00This is the first post in the Cassandra series, where we will introduce the Cassandra database system and take a look at its features and installation methods.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Cassandra: an Introduction</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the first post in the Cassandra series, where we will introduce the Cassandra database system and take a look at its features and installation methods.</p>
<div class="date-created-modified">Created 2020-03-05<br>
Modified 2020-03-18</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/cassandra-an-introduction/">Cassandra: an Introduction</a> (this post)</li>
</ul>
<p>This post is co-authored with Classmate.</p>
<hr />
<div class="image-container">
<img src="cassandra-database-e1584191543401.jpg" alt="NoSQL database – Apache Cassandra – First delivery" />
<div class="image-caption"></div>
</div>
<h2 class="title" id="purpose_of_technology"><a class="anchor" href="#purpose_of_technology">¶</a>Purpose of technology</h2>
<p>Apache Cassandra is a <strong>NoSQL</strong>, <strong>open-source</strong>, <strong>distributed “key-value” database</strong>. It handles <strong>large volumes of distributed data</strong>. The main <strong>goal</strong> is to provide <strong>linear scalability and availability without compromising performance</strong>. Besides, Cassandra <strong>supports replication</strong> across multiple datacenters, providing low latency.</p>
<h2 id="how_it_works"><a class="anchor" href="#how_it_works">¶</a>How it works</h2>
<p>Cassandra’s distributed <strong>architecture</strong> is based on a series of <strong>equal nodes</strong> that communicate with a <strong>P2P protocol</strong> so that <strong>redundancy is maximal</strong>. It offers robust support for multiple datacenters, with <strong>asynchronous replication</strong> and no need for a master server.</p>
<p>Besides, Cassandra’s <strong>data model consists of partitioning the rows</strong>, which are rearranged into <strong>different tables</strong>. The primary keys of each table have a first component that is the <strong>partition key</strong>. Within a partition, the rows are grouped by the remaining columns of the key. The other columns can be indexed separately from the primary key.</p>
<p>These tables can be <strong>created, deleted, updated and queried at runtime without blocking</strong> each other. However, it does <strong>not support joins or subqueries</strong>; instead, it <strong>emphasizes denormalization</strong> through features like collections.</p>
<p>Nowadays, Cassandra uses its own query language called <strong>CQL</strong> (<strong>Cassandra Query Language</strong>), with a <strong>similar syntax to SQL</strong>. It also allows access from <strong>JDBC</strong>.</p>
<p><img src="s0GHpggGZXOFcdhypRWV4trU-PkSI6lukEv54pLZnoirh0GlDVAc4LamB1Dy.png" alt="" />
<em>Cassandra architecture</em></p>
<h2 id="features"><a class="anchor" href="#features">¶</a>Features</h2>
<ul>
<li><strong>Decentralized</strong>: there are <strong>no single points of failure</strong>; every <strong>node</strong> in the cluster has the <strong>same role</strong> and there is <strong>no master node</strong>, so each node <strong>can service any request</strong>, and the data is distributed across the cluster.</li>
<li>Supports <strong>replication</strong>, including <strong>multi-datacenter</strong> replication: the replication strategies are <strong>configurable</strong>.</li>
<li><strong>Scalability</strong>: read and write performance increases linearly as new nodes are added, and <strong>new nodes</strong> can be <strong>added without interrupting</strong> application <strong>execution</strong>.</li>
<li><strong>Fault tolerance</strong>: <strong>data replication</strong> is done <strong>automatically</strong> on several nodes in order to recover from failures. It is possible to <strong>replace failed nodes without downtime or interruptions</strong> to the application.</li>
<li><strong>Consistency</strong>: a choice of consistency level is provided for <strong>reading and writing</strong>.</li>
<li><strong>MapReduce support</strong>: it is <strong>integrated</strong> with <strong>Apache Hadoop</strong> to support MapReduce.</li>
<li><strong>Query language</strong>: it has its own query language called <strong>CQL (Cassandra Query Language)</strong>.</li>
</ul>
<h2 id="corner_in_cap_theorem"><a class="anchor" href="#corner_in_cap_theorem">¶</a>Corner in CAP theorem</h2>
<p><strong>Apache Cassandra</strong> is usually described as an “<strong>AP</strong>” system because it guarantees <strong>availability</strong> and <strong>partition/fault tolerance</strong>, erring on the side of data availability even if this means <strong>sacrificing consistency</strong>. Despite this, Apache Cassandra <strong>seeks to satisfy all three requirements</strong> (Consistency, Availability and Partition tolerance) simultaneously, and can be <strong>configured to behave</strong> like a “<strong>CP</strong>” database, guaranteeing <strong>consistency and partition/fault tolerance</strong>.</p>
<p><img src="rf3n9LTOKCQVbx4qrn7NPSVcRcwE1LxR_khi-9Qc51Hcbg6BHHPu-0GZjUwD.png" alt="" />
<em>Cassandra in CAP Theorem</em></p>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>To download the file (a <code>.tar.gz</code> archive), visit the <a href="https://cassandra.apache.org/download/">download site</a> and click on the file “<a href="https://ftp.cixug.es/apache/cassandra/3.11.6/apache-cassandra-3.11.6-bin.tar.gz">https://ftp.cixug.es/apache/cassandra/3.11.6/apache-cassandra-3.11.6-bin.tar.gz</a>”. Note that this link points to version 3.11.6.</p>
<h2 id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
<p>This database can only be installed on Linux distributions and Mac OS X systems; it is not possible to install it on Microsoft Windows.</p>
<p>The first requirement is having Java 8 installed on <strong>Ubuntu</strong>, the OS we will use. The Java 8 installation is explained below. First, open a terminal and execute the following commands:</p>
<pre><code>sudo apt update
sudo apt install openjdk-8-jdk openjdk-8-jre
</code></pre>
<p>To make Java available as an environment variable, open the file “~/.bashrc”:</p>
<pre><code>nano ~/.bashrc
</code></pre>
<p>And add, at the end of it, the path where Java is installed, as follows:</p>
<pre><code>export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export PATH=$PATH:$JAVA_HOME/bin
</code></pre>
<p>At this point, save the file and execute the next command (re-opening the terminal has the same effect):</p>
<pre><code>source ~/.bashrc
</code></pre>
<p>In order to check if the Java environment variable is set correctly, run the next command: </p>
<pre><code>echo $JAVA_HOME
</code></pre>
<p><img src="JUUmX5MIHynJR_K9EdCgKeJcpINeCGRRt2QRu4JLPtRhCVidOhcbWwVTQjyu.png" alt="" />
<em>$JAVA_HOME variable</em></p>
<p>Afterwards, it is possible to check the installed Java version with the command: </p>
<pre><code>java -version
</code></pre>
<p><img src="z9v1-0hpZwjI4U5UZej9cRGN5-Y4AZl0WUPWyQ_-JlzTAIvZtTFPnKY2xMQ_.png" alt="" />
<em>Java version</em></p>
<p>The next requirement is the latest version of Python 2.7. Whether it is installed can be checked with the command:</p>
<pre><code>python --version
</code></pre>
<p>If it is not installed, installing it is as simple as running the next command in the terminal:</p>
<pre><code>sudo apt install python
</code></pre>
<p>Note: it is better to use “python2” instead of “python”, because that way you force the use of Python 2.7. Modern distributions use Python 3 for the «python» command.</p>
<p>Afterwards, it is possible to check the installed Python version with the command:</p>
<pre><code>python --version
</code></pre>
<p><img src="Ger5Vw_e1HIK84QgRub-BwGmzIGKasgiYb4jHdfRNRrvG4d6Msp_3Vk62-9i.png" alt="" />
<em>Python version</em></p>
<p>Once both requirements are ready, the next step is to unzip the previously downloaded file: right-click on it and select “Extract here”, or run the next command in the directory containing the download.</p>
<pre><code>tar -zxvf apache-cassandra-x.x.x-bin.tar.gz
</code></pre>
<p>In order to check that the installation is complete, you can execute the next command from the root folder of the extracted project. This will start Cassandra as a single node.</p>
<pre><code>bin/cassandra
</code></pre>
<p>It is possible to get some data from Cassandra with CQL (Cassandra Query Language). To check this, execute the next command in another terminal.</p>
<pre><code>bin/cqlsh localhost
</code></pre>
<p>Once cqlsh is open, type the next statement and check the result:</p>
<pre><code>SELECT cluster_name, listen_address from system.local;
</code></pre>
<p>The output should be:</p>
<p><img src="miUO60A-RtyEAOOVFJqlkPRC18H4RKUhot6RWzhO9FmtzgTPOYHFtwxqgZEf.png" alt="" />
<em>Sentence output</em></p>
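<p>As an aside, the same query can also be issued from Java. This is only a sketch, assuming the DataStax Java driver (the <code>cassandra-driver-core</code> 3.x artifact) is on the classpath; it is not part of the installation steps above:</p>
<pre><code>import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
class HelloCassandra {
    public static void main(String[] args) {
        // Connect to the single local node started earlier
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            ResultSet rs = session.execute(
                "SELECT cluster_name, listen_address FROM system.local");
            System.out.println(rs.one()); // prints the single resulting row
        }
    }
}
</code></pre>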
<p>Finally, the <a href="https://cassandra.apache.org/doc/latest/getting_started/installing.html">installation guide</a> provided on the database’s website covers this process as well.</p>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://es.wikipedia.org/wiki/Apache_Cassandra">Wikipedia</a></li>
<li><a href="https://cassandra.apache.org/">Apache Cassandra</a></li>
<li><a href="https://www.datastax.com/blog/2019/05/how-apache-cassandratm-balances-consistency-availability-and-performance">Datastax</a></li>
<li><a href="https://blog.yugabyte.com/apache-cassandra-architecture-how-it-works-lightweight-transactions/">yugabyte</a></li>
</ul>
</main>
</body>
</html>
Privado: PC-Crawler evaluationdist/pc-crawler-evaluation/index.html2020-03-17T23:00:00+00:002020-03-03T23:00:00+00:00As the student <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Privado: PC-Crawler evaluation</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>As the student <code>a(i)</code> where <code>i = 9</code>, I have been assigned to evaluate students <code>a(i + 3)</code> and <code>a(i + 4)</code>, these being:</p>
<div class="date-created-modified">Created 2020-03-04<br>
Modified 2020-03-18</div>
<ul>
<li>a12: Classmate (username)</li>
<li>a13: Classmate (username)</li>
</ul>
<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: B.</strong></p>
<p>I think they mix up their design considerations with program usage and inner workings a bit, without justifying why the considerations are the ones they chose, or what the alternatives would be.</p>
<p>The implementation notes are quite well-written. Even someone without knowledge of Java’s syntax can read the notes and more or less make sense of what’s going on, with the relevant code excerpts on each section.</p>
<p>Implementation-wise, some methods could definitely use some improvement:</p>
<ul>
<li><code>esExtensionTextual</code> is overly complicated. It could use a <code>for</code> loop and Java’s <code>String.endsWith</code> (see the sketch after this list).</li>
<li><code>calcularFrecuencia</code> has quite some duplication (e.g. <code>this.getFicherosYDirectorios().remove(0)</code>) and could definitely be cleaned up.</li>
</ul>
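<p>A sketch of the suggested simplification (the extension list here is made up, since I’m not reproducing their code):</p>
<pre><code>class ExtensionCheck {
    private static final String[] TEXT_EXTENSIONS = { ".txt", ".html", ".java" };
    static boolean esExtensionTextual(String fileName) {
        for (String extension : TEXT_EXTENSIONS) {
            if (fileName.toLowerCase().endsWith(extension)) {
                return true;
            }
        }
        return false;
    }
    public static void main(String[] args) {
        System.out.println(esExtensionTextual("Notas.TXT")); // true
        System.out.println(esExtensionTextual("foto.png"));  // false
    }
}
</code></pre>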
<p>However, all the desired functionality is implemented.</p>
<p>Style-wise, some of the newlines and avoiding braces on <code>if</code> and <code>while</code> could be changed to improve the readability.</p>
<p>The post is written in Spanish, but uses some words that don’t translate well («remover» could better be said as «eliminar» or «quitar»).</p>
<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: B.</strong></p>
<p>Their post starts with an explanation of what a crawler is, common uses for them, and what type of crawler they will be developing. This is a very good start. Regarding the post style, it seems they are not properly using some of WordPress’s features, such as lists, and instead rely on paragraphs with special characters prefixing each list item.</p>
<p>The post also contains some details on how to install the requirements to run the program, which can be very useful for someone not used to working with Java.</p>
<p>They do not explain their implementation and the filename of the download has a typo.</p>
<p>Implementation-wise, the code seems to be well-organized, into several packages and files, although the naming is a bit inconsistent. They even designed a GUI, which is quite impressive.</p>
<p>Some of the methods are documented, although the code inside them is sparsely commented, and the rationale for the chosen data structures is missing. There also seem to be several unused <code>main</code> functions, and I’m unsure why they were kept.</p>
<p>However, all the desired functionality is implemented.</p>
<p>Similar to Classmate, the code style could be improved and settled on some standard, as well as making use of Java features such as <code>for</code> loops over iterators instead of manual loops.</p>
</main>
</body>
</html>
Introduction to NoSQLdist/introduction-to-nosql/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00This post will primarly focus on the talk held in the <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Introduction to NoSQL</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This post will primarily focus on the talk held at the <a href="https://youtu.be/qI_g07C_Q5I">GOTO 2012 conference: Introduction to NoSQL by Martin Fowler</a>. It can be seen as an informal, summarized transcript of the talk.</p>
<div class="date-created-modified">Created 2020-02-25<br>
Modified 2020-03-18</div>
<hr />
<p>The relational database model is affected by the <em><a href="https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch">impedance mismatch problem</a></em>. This occurs because we have to match our high-level design with the separate columns and rows used by relational databases.</p>
<p>Taking the in-memory objects and putting them into a relational database (which were dominant at the time) simply didn’t work out. Why? Relational databases were more than just databases; they served as an integration mechanism across applications, up to the 2000s. For 20 years!</p>
<p>With the rise of the Internet and the sheer amount of traffic, databases needed to scale. Unfortunately, relational databases only scale well vertically (by upgrading a <em>single</em> node). This is <em>very</em> expensive, and not something many could afford.</p>
<p>The problem is those pesky <code>JOIN</code>s, and their friend <code>GROUP BY</code>. Because our program and reality models don’t match the tables used by SQL, we have to rely on these operations to query the data, since the model doesn’t map directly.</p>
<p>Furthermore, graphs don’t map very well at all to relational models.</p>
<p>We needed a way to scale horizontally (by increasing the <em>amount</em> of nodes), something relational databases were not designed to do.</p>
<blockquote>
<p><em>We need to do something different, relational across nodes is an unnatural act</em></p>
</blockquote>
<p>This inspired the NoSQL movement.</p>
<blockquote>
<p><em>#nosql was only meant to be a hashtag to advertise it, but unfortunately it’s how it is called now</em></p>
</blockquote>
<p>It is not possible to define NoSQL, but we can identify some of its characteristics:</p>
<ul>
<li>
<p>Non-relational</p>
</li>
<li>
<p><strong>Cluster-friendly</strong> (this was the original spark)</p>
</li>
<li>
<p>Open-source (until now, generally)</p>
</li>
<li>
<p>21st century web culture</p>
</li>
<li>
<p>Schema-less (easier integration or conjugation of several models, structure aggregation)</p>
</li>
</ul>
<p>These databases use data models different from those of the relational model. However, it is possible to identify 4 broad chunks (some may say 3, or even 2!):</p>
<ul>
<li>
<p><strong>Key-value store</strong>. With a certain key, you obtain the value corresponding to it. It knows nothing else, nor does it care. We say the data is opaque.</p>
</li>
<li>
<p><strong>Document-based</strong>. It stores an entire mass of documents with complex structure, normally through the use of JSON (XML has been left behind). Then, you can ask for certain fields, structures, or portions. We say the data is transparent.</p>
</li>
<li>
<p><strong>Column-family</strong>. There is a «row key», and within it we store multiple «column families» (columns that fit together, our aggregate). We access data by row key and column-family name.</p>
</li>
</ul>
<p>All three of these serve to store documents without any <em>explicit</em> schema. Just shove in anything! This gives a lot of flexibility and ease of migration, except… that’s not really true. There’s an <em>implicit</em> schema when querying.</p>
<p>For example, a query where we may do <code>anOrder['price'] * anOrder['quantity']</code> is assuming that <code>anOrder</code> has both a <code>price</code> and a <code>quantity</code>, and that both of these can be multiplied together. «Schema-less» is a fuzzy term.</p>
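<p>A small self-contained sketch of that implicit schema in Java (the order document here is just a stand-in map):</p>
<pre><code>import java.util.HashMap;
import java.util.Map;
class ImplicitSchema {
    public static void main(String[] args) {
        // Stand-in for a document fetched from a document store
        Map<String, Object> anOrder = new HashMap<>();
        anOrder.put("price", 10.0);
        anOrder.put("quantity", 3);
        // This "schema-less" code still assumes both fields exist
        // and are numeric: an implicit schema
        double total = ((Number) anOrder.get("price")).doubleValue()
                     * ((Number) anOrder.get("quantity")).doubleValue();
        System.out.println(total); // 30.0
    }
}
</code></pre>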
<p>However, it is the lack of a <em>fixed</em> schema that gives flexibility.</p>
<p>One could argue that the line between key-value and document-based is very fuzzy, and they would be right! Key-value databases often let you include additional metadata that behaves like an index, and in document-based, documents often have an identifier anyway.</p>
<p>The common notion between these three types is what matters. They save an entire structure as an <em>unit</em>. We can refer to these as «Aggregate Oriented Databases». Aggregate, because we group things when designing or modeling our systems, as opposed to relational databases that scatter the information across many tables.</p>
<p>There exists a notable outlier, though, and that’s:</p>
<ul>
<li><strong>Graph</strong> databases. They use a node-and-arc graph structure. They are great for traversing relationships across things. Ironically, relational databases are not very good at jumping across relationships! It is possible to perform very interesting queries in graph databases which would be really hard and costly on relational models. Unlike the aggregate-oriented databases, graphs break things into even smaller units.</li>
</ul>
<p>NoSQL is not <em>the</em> solution. It depends on how you’ll work with your data. Do you need an aggregate database? Will you have a lot of relationships? Or would the relational model be a good fit for you?</p>
<p>NoSQL, however, is a good fit for large-scale projects (data will <em>always</em> grow) and faster development (the impedance mismatch is drastically reduced).</p>
<p>Regardless of our choice, it is important to remember that NoSQL is a young technology, which is still evolving really fast (SQL has been stable for <em>decades</em>). But the <em>polyglot persistence</em> is what matters. One must know the alternatives, and be able to choose.</p>
<hr />
<p>Relational databases have the well-known ACID properties: Atomicity, Consistency, Isolation and Durability.</p>
<p>NoSQL (except graph-based!) are about being BASE instead: Basically Available, Soft state, Eventual consistency.</p>
<p>SQL needs transactions because we don’t want to perform a read while we’re only half-way done with a write! The readers and writers are the problem, and ensuring consistency results in a performance hit, even if the risk is low (two writers are extremely rare but it still must be handled).</p>
<p>NoSQL on the other hand doesn’t need ACID because the aggregate <em>is</em> the transaction boundary. Even before NoSQL itself existed! Any update is atomic by nature. When updating many documents it <em>is</em> a problem, but this is very rare.</p>
<p>We have to distinguish between logical and replication consistency. During an update, if a conflict occurs, it must be resolved to preserve logical consistency. Replication consistency, on the other hand, is preserved when distributing the data across many machines, for example during sharding or copies.</p>
<p>Replication buys us more processing power and resilience (at the cost of more storage) in case some of the nodes die. But what happens if what dies is the communication across the nodes? We could drop the requests and preserve consistency, or accept the risk, continue, and instead preserve availability.</p>
<p>The choice on whether trading consistency for availability is acceptable or not depends on the domain rules. It is the domain’s choice, the business people will choose. If you’re Amazon, you always want to be able to sell, but if you’re a bank, you probably don’t want your clients to have negative numbers in their account!</p>
<p>Regardless of what we do, in a distributed system the CAP theorem always applies: Consistency, Availability, Partition tolerance. It is <strong>impossible</strong> to guarantee all three at 100%. Most of the time things do work out, but guaranteeing all three at 100% is mathematically impossible.</p>
<p>A database has to choose what to give up at some point. When designing a distributed system, this must be considered. Normally, the choice is made between consistency or response time.</p>
<h2 class="title" id="further_reading"><a class="anchor" href="#further_reading">¶</a>Further reading</h2>
<ul>
<li><a href="https://www.martinfowler.com/articles/nosql-intro-original.pdf">The future is: <del>NoSQL Databases</del> Polyglot Persistence</a></li>
<li><a href="https://www.thoughtworks.com/insights/blog/nosql-databases-overview">NoSQL Databases: An Overview</a></li>
</ul>
</main>
</body>
</html>
Build your own PCdist/build-your-own-pc/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00…where PC obviously stands for Personal Crawler<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Build your own PC</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p><em>…where PC obviously stands for Personal Crawler</em>.</p>
<div class="date-created-modified">Created 2020-02-25<br>
Modified 2020-03-18</div>
<hr />
<p>This post contains the source code for a very simple crawler written in Java. You can compile and run it on any file or directory, and it will calculate the frequency of all the words it finds.</p>
<h2 class="title" id="source_code"><a class="anchor" href="#source_code">¶</a>Source code</h2>
<p>Paste the following code in a new file called <code>Crawl.java</code>:</p>
<pre><code>import java.io.*;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Crawl {
// Regex used to tokenize the words from a line of text
private final static Pattern WORDS = Pattern.compile("\\w+");
// The file where we will cache our results
private final static File INDEX_FILE = new File("index.bin");
// Helper method to determine if a file is a text file or not
private static boolean isTextFile(File file) {
String name = file.getName().toLowerCase();
return name.endsWith(".txt")
|| name.endsWith(".java")
|| name.endsWith(".c")
|| name.endsWith(".cpp")
|| name.endsWith(".h")
|| name.endsWith(".hpp")
|| name.endsWith(".html")
|| name.endsWith(".css")
|| name.endsWith(".js");
}
// Normalizes a string by converting it to lowercase and removing accents
private static String normalize(String string) {
return string.toLowerCase()
.replace("á", "a")
.replace("é", "e")
.replace("í", "i")
.replace("ó", "o")
.replace("ú", "u");
}
// Recursively fills the map with the count of words found on all the text files
static void fillWordMap(Map<String, Integer> map, File root) throws IOException {
// Our file queue begins with the root
Queue<File> fileQueue = new ArrayDeque<>();
fileQueue.add(root);
// For as long as the queue is not empty...
File file;
while ((file = fileQueue.poll()) != null) {
if (!file.exists() || !file.canRead()) {
// ...ignore files for which we don't have permission...
System.err.println("warning: cannot read file: " + file);
} else if (file.isDirectory()) {
// ...else if it's a directory, extend our queue with its files...
File[] files = file.listFiles();
if (files == null) {
System.err.println("warning: cannot list dir: " + file);
} else {
fileQueue.addAll(Arrays.asList(files));
}
} else if (isTextFile(file)) {
// ...otherwise, count the words in the file.
countWordsInFile(map, file);
}
}
}
// Counts the words in a single file and adds the count to the map.
public static void countWordsInFile(Map<String, Integer> map, File file) throws IOException {
BufferedReader reader = new BufferedReader(new FileReader(file));
String line;
while ((line = reader.readLine()) != null) {
Matcher matcher = WORDS.matcher(line);
while (matcher.find()) {
String token = normalize(matcher.group());
Integer count = map.get(token);
if (count == null) {
map.put(token, 1);
} else {
map.put(token, count + 1);
}
}
}
reader.close();
}
// Prints the map of word count to the desired output stream.
public static void printWordMap(Map<String, Integer> map, PrintStream writer) {
List<String> keys = new ArrayList<>(map.keySet());
Collections.sort(keys);
for (String key : keys) {
writer.println(key + "\t" + map.get(key));
}
}
@SuppressWarnings("unchecked")
public static void main(String[] args) throws IOException, ClassNotFoundException {
// Validate arguments
if (args.length == 1 && args[0].equals("--help")) {
System.err.println("usage: java Crawl [input]");
return;
}
File root = new File(args.length > 0 ? args[0] : ".");
// Loading or generating the map where we aggregate the data {word: count}
Map<String, Integer> map;
if (INDEX_FILE.isFile()) {
System.err.println("Found existing index file: " + INDEX_FILE);
try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(INDEX_FILE))) {
map = (Map<String, Integer>) ois.readObject();
}
} else {
System.err.println("Index file not found: " + INDEX_FILE + "; indexing...");
map = new TreeMap<>();
fillWordMap(map, root);
// Cache the results to avoid doing the work a next time
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(INDEX_FILE))) {
out.writeObject(map);
}
}
// Ask the user in a loop to query for words
Scanner scanner = new Scanner(System.in);
while (true) {
System.out.print("Escriba palabra a consultar (o Enter para salir): ");
System.out.flush();
String line = scanner.nextLine().trim();
if (line.isEmpty()) {
break;
}
line = normalize(line);
Integer count = map.get(line);
if (count == null) {
System.out.println(String.format("La palabra \"%s\" no está presente", line));
} else if (count == 1) {
System.out.println(String.format("La palabra \"%s\" está presente 1 vez", line));
} else {
System.out.println(String.format("La palabra \"%s\" está presente %d veces", line, count));
}
}
}
}
</code></pre>
<p>It can be compiled and executed as follows:</p>
<pre><code>javac Crawl.java
java Crawl
</code></pre>
<p>Instead of copy-pasting the code, you may also download it as a <code>.zip</code>:</p>
<p><em>(contents removed)</em></p>
<h2 id="addendum"><a class="anchor" href="#addendum">¶</a>Addendum</h2>
<p>The following simple function can be used if one desires to print the contents of a file:</p>
<pre><code>public static void printFile(File file) {
if (isTextFile(file)) {
System.out.println('\n' + file.getName());
try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
} catch (FileNotFoundException ignored) {
System.err.println("warning: file disappeared while reading: " + file);
} catch (IOException e) {
e.printStackTrace();
}
}
}
</code></pre>
</main>
</body>
</html>
About Boolean Retrievaldist/about-boolean-retrieval/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00This entry will discuss the section on the <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>About Boolean Retrieval</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This entry will discuss the section on the <em><a href="https://nlp.stanford.edu/IR-book/pdf/01bool.pdf">Boolean retrieval</a></em> section of the book <em><a href="https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf">An Introduction to Information Retrieval</a></em>.</p>
<div class="date-created-modified">Created 2020-02-25<br>
Modified 2020-03-18</div>
<h2 class="title" id="summary_on_the_topic"><a class="anchor" href="#summary_on_the_topic">¶</a>Summary on the topic</h2>
<p>Boolean retrieval is one of the many ways of performing information retrieval (finding materials that satisfy an information need), which is often simply called <em>search</em>.</p>
<p>A simple way to retrieve information is to <em>grep</em> through the text (term named after the Unix tool <code>grep</code>), scanning text linearly and excluding it on certain criteria. However, this falls short when the volume of the data grows, more complex queries are desired, or one seeks some sort of ranking.</p>
<p>To avoid linear scanning, we build an <em>index</em> and record for each document whether it contains each term out of our full dictionary of terms (which may be words in a chapter and words in the book). This results in a binary term-document <em>incidence matrix</em>. Such a possible matrix is:</p>
<table class="">
<tbody>
<tr>
<td>
<em>
word/play
</em>
</td>
<td>
<strong>
Antony and Cleopatra
</strong>
</td>
<td>
<strong>
Julius Caesar
</strong>
</td>
<td>
<strong>
The Tempest
</strong>
</td>
<td>
<strong>
…
</strong>
</td>
</tr>
<tr>
<td>
<strong>
Antony
</strong>
</td>
<td>
1
</td>
<td>
1
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
Brutus
</strong>
</td>
<td>
1
</td>
<td>
1
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
Caesar
</strong>
</td>
<td>
1
</td>
<td>
1
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
Calpurnia
</strong>
</td>
<td>
0
</td>
<td>
1
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
Cleopatra
</strong>
</td>
<td>
1
</td>
<td>
0
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
mercy
</strong>
</td>
<td>
1
</td>
<td>
0
</td>
<td>
1
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
worser
</strong>
</td>
<td>
1
</td>
<td>
0
</td>
<td>
1
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
…
</strong>
</td>
<td>
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
<p>We can look at this matrix’s rows or columns to obtain a vector for each term indicating where it appears, or a vector for each document indicating the terms it contains.</p>
<p>Now, answering a query such as <code>Brutus AND Caesar AND NOT Calpurnia</code> becomes trivial:</p>
<pre><code>VECTOR(Brutus) AND VECTOR(Caesar) AND COMPLEMENT(VECTOR(Calpurnia))
= 110 AND 110 AND COMPLEMENT(010)
= 110 AND 110 AND 101
= 100
</code></pre>
<p>The query is only satisfied for our first column.</p>
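<p>This is exactly the kind of bitwise work that Java’s <code>java.util.BitSet</code> can do for us. A minimal sketch of the query above (the string encoding of the vectors is just for illustration):</p>
<pre><code>import java.util.BitSet;
class BooleanQuery {
    // Turns "110" into a BitSet with bits 0 and 1 set (one bit per play)
    static BitSet bits(String s) {
        BitSet b = new BitSet(s.length());
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '1') b.set(i);
        }
        return b;
    }
    public static void main(String[] args) {
        int docs = 3;
        BitSet result = bits("110");          // Brutus
        result.and(bits("110"));              // AND Caesar
        BitSet notCalpurnia = bits("010");
        notCalpurnia.flip(0, docs);           // NOT Calpurnia -> 101
        result.and(notCalpurnia);
        System.out.println(result);           // {0}: only the first play matches
    }
}
</code></pre>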
<p>The <em>Boolean retrieval model</em> is thus a model that treats documents as a set of terms, in which we can perform any query in the form of Boolean expressions of terms, combined with <code>OR</code>, <code>AND</code>, and <code>NOT</code>.</p>
<p>Now, building such a matrix is often not feasible due to the sheer amount of data (say, a matrix with 500,000 terms across 1,000,000 documents, each with roughly 1,000 terms). However, it is important to notice that most of the terms will be <em>missing</em> when examining each document. In our example, this means 99.8% or more of the cells will be 0. We can instead record the <em>positions</em> of the 1’s. This is known as an <em>inverted index</em>.</p>
<p>The inverted index is a dictionary of terms, each containing a list that records in which documents it appears (<em>postings</em>). Applied to boolean retrieval, we would:</p>
<ol>
<li>Collect the documents to be indexed and assign a unique identifier to each</li>
<li>Tokenize the text in the documents into a list of terms</li>
<li>Normalize the tokens, which now become indexing terms</li>
<li>Index the documents (see the sketch below)</li>
</ol>
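<p>A minimal Java sketch of steps 2 to 4 (the two toy documents are loosely based on the tables below):</p>
<pre><code>import java.util.*;
class TinyIndexer {
    public static void main(String[] args) {
        String[] docs = {
            "I did enact Julius Caesar",  // document 1
            "So let it be with Caesar",   // document 2
        };
        // Dictionary of terms, each mapping to a sorted postings list
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String token : docs[id].split("\\W+")) {
                String term = token.toLowerCase(); // normalization
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id + 1);
            }
        }
        index.forEach((term, postings) ->
            System.out.println(term + " -> " + postings));
    }
}
</code></pre>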
<table class="">
<tbody>
<tr>
<td>
<strong>
Dictionary
</strong>
</td>
<td>
<strong>
Postings
</strong>
</td>
</tr>
<tr>
<td>
Brutus
</td>
<td>
1, 2, 4, 11, 31, 45, 173, 174
</td>
</tr>
<tr>
<td>
Caesar
</td>
<td>
1, 2, 4, 5, 6, 16, 57, 132, …
</td>
</tr>
<tr>
<td>
Calpurnia
</td>
<td>
2, 31, 54, 101
</td>
</tr>
<tr>
<td>
…
</td>
<td>
</td>
</tr>
</tbody>
</table>
<p>Sort the pairs <code>(term, document_id)</code> so that the terms are alphabetical, and merge multiple occurrences into one. Group instances of the same term and split again into a sorted list of postings.</p>
<table class="">
<tbody>
<tr>
<td>
<strong>
term
</strong>
</td>
<td>
<strong>
document_id
</strong>
</td>
</tr>
<tr>
<td>
I
</td>
<td>
1
</td>
</tr>
<tr>
<td>
did
</td>
<td>
1
</td>
</tr>
<tr>
<td>
…
</td>
<td>
</td>
</tr>
<tr>
<td>
with
</td>
<td>
2
</td>
</tr>
</tbody>
</table>
<table class="">
<tbody>
<tr>
<td>
<strong>
term
</strong>
</td>
<td>
<strong>
document_id
</strong>
</td>
</tr>
<tr>
<td>
be
</td>
<td>
2
</td>
</tr>
<tr>
<td>
brutus
</td>
<td>
1
</td>
</tr>
<tr>
<td>
brutus
</td>
<td>
2
</td>
</tr>
<tr>
<td>
…
</td>
<td>
</td>
</tr>
</tbody>
</table>
<table class="">
<tbody>
<tr>
<td>
<strong>
term
</strong>
</td>
<td>
<strong>
frequency
</strong>
</td>
<td>
<strong>
postings list
</strong>
</td>
</tr>
<tr>
<td>
be
</td>
<td>
1
</td>
<td>
2
</td>
</tr>
<tr>
<td>
brutus
</td>
<td>
2
</td>
<td>
1, 2
</td>
</tr>
<tr>
<td>
capitol
</td>
<td>
1
</td>
<td>
1
</td>
</tr>
<tr>
<td>
…
</td>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
<p>Intersecting postings lists now becomes a matter of traversing both lists in order:</p>
<pre><code>Brutus : 1 -> 2 -> 4 -> 11 -> 31 -> 45 -> 173 -> 174
Calpurnia: 2 -> 31 -> 54 -> 101
Intersect: 2 -> 31
</code></pre>
<p>A simple conjunctive query (e.g. <code>Brutus AND Calpurnia</code>) is executed as follows:</p>
<ol>
<li>Locate <code>Brutus</code> in the dictionary</li>
<li>Retrieve its postings</li>
<li>Locate <code>Calpurnia</code> in the dictionary</li>
<li>Retrieve its postings</li>
<li>Intersect (<em>merge</em>) both postings</li>
</ol>
<p>Since the lists are sorted, walking both of them can be done in <em>O(n)</em> time. By also storing the frequency, we can optimize the order in which we execute arbitrary queries, although we won’t go into detail.</p>
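<p>A minimal sketch of that linear-time merge in Java, using the postings from the example above: advance the pointer of whichever list has the smaller document identifier, emitting the identifiers found in both.</p>
<pre><code>import java.util.ArrayList;
import java.util.List;
class Intersect {
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {        // present in both lists
                result.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {  // advance the smaller pointer
                i++;
            } else {
                j++;
            }
        }
        return result;
    }
    public static void main(String[] args) {
        int[] brutus = {1, 2, 4, 11, 31, 45, 173, 174};
        int[] calpurnia = {2, 31, 54, 101};
        System.out.println(intersect(brutus, calpurnia)); // [2, 31]
    }
}
</code></pre>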
<h2 id="thoughts"><a class="anchor" href="#thoughts">¶</a>Thoughts</h2>
<p>The boolean retrieval model can be implemented with relative ease, and can help with storage and efficient querying of the information if we intend to perform boolean queries.</p>
<p>However, the basic design lacks other useful operations, such as a «near» operator, or the ability to rank the results.</p>
<p>All in all, it’s an interesting way to look at the data and query it efficiently.</p>
</main>
</body>
</html>