<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Information Retrieval and Web Search</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<h1 class="title" id="information_retrieval_and_web_search"><a class="anchor" href="#information_retrieval_and_web_search">¶</a>Information Retrieval and Web Search</h1>
<div class="date-created-modified">2020-10-03</div>
<p>During 2020 at university, this subject ("Recuperación de la Información y Búsqueda en la Web")
had us write blog posts as assignments. I thought they were really fun to write, so I wanted to
preserve that work here, in the hope that it's interesting to someone.</p>
<p>The posts were auto-generated from the original HTML files and manually anonymized later.</p>
</main>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Privado: Final NoSQL evaluation</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This evaluation is a bit different to my <a href="/blog/ribw/16/nosql-evaluation/">previous one</a> because this time I have been tasked to evaluate the student <code>a(i - 2)</code>, and because I am <code>a = 9</code> that happens to be <code>a(7) =</code> Classmate.</p>
<div class="date-created-modified">Created 2020-05-13<br>
Modified 2020-05-14</div>
<p>Unfortunately for Classmate, the only entry related to NoSQL I have found in their blog is «Prima y segunda Actividad: Base de datos NoSQL», which does not develop an application as requested for the third entry (as of the 14th of May).</p>
<p>This means that, instead, I will evaluate <code>a(i - 3)</code> which happens to be <code>a(6) =</code> Classmate and they do have an entry.</p>
<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s Evaluation</h2>
<p><strong>Grading: B.</strong></p>
<p>The post I have evaluated is «BB.DD. NoSQL RethinkDB 3ª Fase. Aplicación».</p>
<p>It starts with an introduction, properly explaining what database they have chosen and why, but not what application they will be making.</p>
<p>This is detailed just below in the next section, although it’s a bit vague.</p>
<p>The next section talks about the Python dependencies that are required, but they never said they would be making a Python application or that we need to install Python!</p>
<p>The next section talks about the file structure of the project, and they detail what every part does, although I missed having some code snippets.</p>
<p>The final result is pretty cool and contains many interesting graphs; they provide a download for the source code and list all the relevant references used.</p>
<p>Except for a weird «necesario falta» in the text, it’s otherwise well-written, although given the issues above I cannot grade it with the highest score.</p>
</main>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Developing a Python application for MongoDB</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the third and last post in the MongoDB series, where we will develop a Python application to process and store OpenData inside Mongo.</p>
<div class="date-created-modified">Created 2020-03-25<br>
Modified 2020-04-16</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a></li>
<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a></li>
<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a> (this post)</li>
</ul>
<p>This post is co-authored with a Classmate.</p>
<hr />
<h2 class="title" id="what_are_we_making_"><a class="anchor" href="#what_are_we_making_">¶</a>What are we making?</h2>
<p>We are going to develop a web application that renders a map, in this case of the town of Cáceres, with which users can interact. When the user clicks somewhere on the map, the selected location will be sent to the server to process. The server will perform geospatial queries against Mongo, and once the results are ready, the information is presented back on the webpage.</p>
<p>The data used for the application comes from <a href="https://opendata.caceres.es/">Cáceres’ OpenData</a>, and our goal is that users will be able to find information about certain areas in a quick and intuitive way, such as precise coordinates, noise level, and such.</p>
<h2 id="what_are_we_using_"><a class="anchor" href="#what_are_we_using_">¶</a>What are we using?</h2>
<p>The web application will be using <a href="https://python.org/">Python</a> for the backend, <a href="https://svelte.dev/">Svelte</a> for the frontend, and <a href="https://www.mongodb.com/">Mongo</a> as our storage database and processing center.</p>
<ul>
<li><strong>Why Python?</strong> It’s a comfortable language to write and to read, and has a great ecosystem with <a href="https://pypi.org/">plenty of libraries</a>.</li>
<li><strong>Why Svelte?</strong> Svelte is the New Thing<strong>™</strong> in the world of component frameworks for JavaScript. It is similar to React or Vue, but compiled and with a lot less boilerplate. Check out their <a href="https://svelte.dev/blog/svelte-3-rethinking-reactivity">Svelte post</a> to learn more.</li>
<li><strong>Why Mongo?</strong> We believe NoSQL is the right approach for doing the kind of processing and storage that we expect, and it’s <a href="https://docs.mongodb.com/">very easy to use</a>. In addition, we will be making Geospatial Queries which <a href="https://docs.mongodb.com/manual/geospatial-queries/">Mongo supports</a>.</li>
</ul>
<p>Why didn’t we choose to make a smaller project, you may ask? You will be shocked to hear that we do not have an answer for that!</p>
<p>Note that we will not be embedding <strong>all</strong> the code of the project in this post, or it would be too long! We will include only the relevant snippets needed to understand the core ideas of the project, and not the unnecessary parts of it (for example, parsing configuration files to easily change the port where the server runs is not included).</p>
<h2 id="python_dependencies"><a class="anchor" href="#python_dependencies">¶</a>Python dependencies</h2>
<p>Because we will program it in Python, you need Python installed. You can install it using a package manager of your choice or heading over to the <a href="https://www.python.org/downloads/">Python downloads section</a>, but if you’re on Linux, chances are you have it installed already.</p>
<p>Once Python 3.7 or above is installed, install <a href="https://motor.readthedocs.io/en/stable/"><code>motor</code> (Asynchronous Python driver for MongoDB)</a> and the <a href="https://docs.aiohttp.org/en/stable/web.html"><code>aiohttp</code> server</a> through <code>pip</code>:</p>
<pre><code>pip install aiohttp motor
</code></pre>
<p>Make sure that Mongo is running in the background (this has been described in previous posts), and we should be able to get to work.</p>
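<p>As a quick sanity check, a snippet along these lines should print a successful reply once everything is in place (a minimal sketch, assuming Mongo listens on the default <code>localhost:27017</code>):</p>
<pre><code>import asyncio
import motor.motor_asyncio

async def main():
    client = motor.motor_asyncio.AsyncIOMotorClient()
    # ping the server; this raises if Mongo is not reachable
    print(await client.admin.command('ping'))

asyncio.run(main())
</code></pre>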
<h2 id="web_dependencies"><a class="anchor" href="#web_dependencies">¶</a>Web dependencies</h2>
<p>To work with Svelte and its dependencies, we will need <a href="https://www.npmjs.com/"><code>npm</code></a>, which comes with <a href="https://nodejs.org/en/">NodeJS</a>, so go and <a href="https://nodejs.org/en/download/">install Node from their site</a>. The download will be different depending on your operating system.</p>
<p>Following <a href="https://svelte.dev/blog/the-easiest-way-to-get-started">the easiest way to get started with Svelte</a>, we will put our project in a <code>client/</code> folder (because this is what the clients see, the frontend). Feel free to tinker a bit with the configuration files to change the name and such, although this isn’t relevant for the rest of the post.</p>
<h2 id="finding_the_data"><a class="anchor" href="#finding_the_data">¶</a>Finding the data</h2>
<p>We are going to work with the JSON files provided by <a href="http://opendata.caceres.es/">OpenData Cáceres</a>. In particular, we want information about the noise, census, vias and trees. To save you the time of <a href="http://opendata.caceres.es/dataset">searching for each of these</a>, we will automate the download with code.</p>
<p>If you want to save the data offline, or just want to know what data we’ll be using, you can right-click the following links and select «Save Link As…», using the link’s name as the file name:</p>
<ul>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:MedicionRuido&format=json"><code>noise.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:InformacionPadron&year=2017&format=json"><code>census.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:InformacionPadron&year=2017&format=json"><code>vias.json</code></a></li>
<li><a href="http://opendata.caceres.es/GetData/GetData?dataset=om:Arbol&format=json"><code>trees.json</code></a></li>
</ul>
<h2 id="backend"><a class="anchor" href="#backend">¶</a>Backend</h2>
<p>It’s time to get started with some code! We will put it in a <code>server/</code> folder because it will contain the Python server, that is, the backend of our application.</p>
<p>We are using <code>aiohttp</code> because we would like our server to be <code>async</code>. We don’t expect a lot of users at the same time, but it’s good to know that our server would be well-designed for that use case. As a bonus, it makes IO points explicit in the code, which can help when reasoning about it. The implicit synchronization between <code>await</code> points is also a nice bonus.</p>
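<p>To give a feel for the shape of <code>aiohttp</code> code before diving into the real thing, here is a minimal, self-contained server (a throwaway sketch, unrelated to the project code below):</p>
<pre><code>from aiohttp import web

async def hello(request):
    # handlers are coroutines, so they can await IO without blocking others
    return web.json_response({'message': 'hello'})

app = web.Application()
app.router.add_routes([web.get('/', hello)])

if __name__ == '__main__':
    web.run_app(app)  # serves on port 8080 by default
</code></pre>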
<h3 id="saving_the_data_in_mongo"><a class="anchor" href="#saving_the_data_in_mongo">¶</a>Saving the data in Mongo</h3>
<p>Before running the server, we must ensure that the data we need is already stored and indexed in Mongo. Our <code>server/data.py</code> will take care of downloading the files, cleaning them up a little (Cáceres’ OpenData can be a bit awkward sometimes), inserting them into Mongo and indexing them.</p>
<p>Downloading the JSON data can be done with <a href="https://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.ClientSession.get"><code>ClientSession.get</code></a>. We also take this opportunity to clean up the messy encoding of the JSON, which does not seem to be UTF-8 in some cases.</p>
<pre><code>async def load_json(session, url):
    fixes = [(old, new.encode('utf-8')) for old, new in [
        (b'\xc3\x83\\u2018', 'Ñ'),
        (b'\xc3\x83\\u0081', 'Á'),
        (b'\xc3\x83\\u2030', 'É'),
        (b'\xc3\x83\\u008D', 'Í'),
        (b'\xc3\x83\\u201C', 'Ó'),
        (b'\xc3\x83\xc5\xa1', 'Ú'),
        (b'\xc3\x83\xc2\xa1', 'á'),
    ]]
    async with session.get(url) as resp:
        data = await resp.read()

    # Yes, this feels inefficient, but it's not really worth improving.
    for old, new in fixes:
        data = data.replace(old, new)

    data = data.decode('utf-8')
    return json.loads(data)
</code></pre>
<p>Later on, this function can be reused for the various URLs we need:</p>
<pre><code>import aiohttp

NOISE_URL = 'http://opendata.caceres.es/GetData/GetData?dataset=om:MedicionRuido&format=json'
# (...other needed URLs here)

async def insert_to_db(db):
    async with aiohttp.ClientSession() as session:
        data = await load_json(session, NOISE_URL)
        # now we have the JSON data cleaned up, ready to be parsed
</code></pre>
<h3 id="data_model"><a class="anchor" href="#data_model">¶</a>Data model</h3>
<p>With the JSON data in our hands, it’s time to parse it. Always remember to <a href="https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/">parse, don’t validate</a>. With <a href="https://docs.python.org/3/library/dataclasses.html">Python 3.7 <code>dataclasses</code></a> it’s trivial to define classes that will store only the fields we care about, typed, and with proper names:</p>
<pre><code>from dataclasses import dataclass
from typing import Tuple

Longitude = float
Latitude = float

@dataclass
class GSON:
    type: str
    coordinates: Tuple[Longitude, Latitude]

@dataclass
class Noise:
    id: int
    geo: GSON
    level: float
</code></pre>
<p>This makes it really easy to see that, if we have a <code>Noise</code>, we can access its <code>geo</code> data, which is a <code>GSON</code> with a <code>type</code> and <code>coordinates</code> (holding <code>Longitude</code> and <code>Latitude</code> respectively). <code>dataclasses</code> and <a href="https://docs.python.org/3/library/typing.html"><code>typing</code></a> make dealing with this very easy and clear.</p>
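<p>For example, constructing and accessing one of these values is as straightforward as one would hope (a throwaway snippet with made-up values):</p>
<pre><code>noise = Noise(
    id=1,
    geo=GSON(type='Point', coordinates=(-6.37, 39.47)),
    level=3.5,
)
print(noise.geo.coordinates)  # (-6.37, 39.47)
</code></pre>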
<p>Every dataclass will be on its own collection inside Mongo, and these are:</p>
<ul>
<li>Noise
<ul>
<li>Integer <code>id</code></li>
<li>GeoJSON <code>geo</code>
<ul>
<li>String <code>type</code></li>
<li>Longitude-latitude pair <code>coordinates</code></li>
</ul>
</li>
<li>Floating-point number <code>level</code></li>
</ul>
</li>
<li>Tree
<ul>
<li>String <code>name</code></li>
<li>String <code>gender</code></li>
<li>Integer <code>units</code></li>
<li>Floating-point number <code>height</code></li>
<li>Floating-point number <code>cup_diameter</code></li>
<li>Floating-point number <code>trunk_diameter</code></li>
<li>Optional string <code>variety</code></li>
<li>Optional string <code>distribution</code></li>
<li>GeoJSON <code>geo</code></li>
<li>Optional string <code>irrigation</code></li>
</ul>
</li>
<li>Census
<ul>
<li>Integer <code>year</code></li>
<li>Via <code>via</code>
<ul>
<li>String <code>name</code></li>
<li>String <code>kind</code></li>
<li>Integer <code>code</code></li>
<li>Optional string <code>history</code></li>
<li>Optional string <code>old_name</code></li>
<li>Optional floating-point number <code>length</code></li>
<li>Optional GeoJSON <code>start</code></li>
<li>GeoJSON <code>middle</code></li>
<li>Optional GeoJSON <code>end</code></li>
<li>Optional list with geometry pairs <code>geometry</code></li>
</ul>
</li>
<li>Integer <code>count</code></li>
<li>Mapping year-to-count <code>count_per_year</code></li>
<li>Mapping gender-to-count <code>count_per_gender</code></li>
<li>Mapping nationality-to-count <code>count_per_nationality</code></li>
<li>Integer <code>time_year</code></li>
</ul>
</li>
</ul>
<p>Now, let’s define a method to actually parse the JSON and yield instances from these new data classes:</p>
<pre><code>@classmethod
def iter_from_json(cls, data):
    for row in data['results']['bindings']:
        noise_id = int(row['uri']['value'].split('/')[-1])
        long = float(row['geo_long']['value'])
        lat = float(row['geo_lat']['value'])
        level = float(row['om_nivelRuido']['value'])
        yield cls(
            id=noise_id,
            geo=GSON(type='Point', coordinates=[long, lat]),
            level=level
        )
</code></pre>
<p>Here we iterate over the input JSON <code>data</code> bindings and <code>yield cls</code> instances with more consistent naming than the original one. We also extract the data from the many unnecessary nested levels of the JSON and have something a lot flatter to work with.</p>
<p>For those of you who don’t know what <code>yield</code> does (after all, not everyone is used to seeing generators), here are two functions that work nearly the same:</p>
<pre><code>def squares_return(n):
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result

def squares_yield(n):
    for i in range(n):
        yield i ** 2
</code></pre>
<p>The difference is that the one with <code>yield</code> is «lazy» and doesn’t need to do all the work up-front. It will generate (yield) more values as they are needed, for example when you use a <code>for</code> loop over it. Generally, it’s a better idea to create generator functions than to do all the work early, which may turn out to be unnecessary. See <a href="https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do">What does the «yield» keyword do?</a> if you still have questions.</p>
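<p>A quick way to convince yourself of this laziness is to pull values out of the generator one at a time:</p>
<pre><code>gen = squares_yield(3)
print(next(gen))   # 0 — computed on demand
print(next(gen))   # 1 — nothing beyond this has run yet
print(list(gen))   # [4] — consumes whatever remains
</code></pre>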
<p>With everything parsed, it’s time to insert the data into Mongo. If the data was not present yet (0 documents), then we will download the file, parse it, insert it as documents into the given Mongo <code>db</code>, and index it:</p>
<pre><code>from dataclasses import asdict

async def insert_to_db(db):
    async with aiohttp.ClientSession() as session:
        if await db.noise.estimated_document_count() == 0:
            data = await load_json(session, NOISE_URL)
            await db.noise.insert_many(asdict(noise) for noise in Noise.iter_from_json(data))
            await db.noise.create_index([('geo', '2dsphere')])
</code></pre>
<p>We repeat this process for all the other data, and just like that, Mongo is ready to be used in our server.</p>
<h3 id="indices"><a class="anchor" href="#indices">¶</a>Indices</h3>
<p>In order to execute our geospatial queries, we have to create an index on the attribute that represents the location, because the operators that we will use require it. This attribute can be a <a href="https://docs.mongodb.com/manual/reference/geojson/">GeoJSON object</a> or a legacy coordinate pair.</p>
<p>We have decided to use a GeoJSON object because we want to avoid legacy features that may be deprecated in the future.</p>
<p>The attribute is called <code>geo</code> for the <code>Tree</code> and <code>Noise</code> objects and <code>start</code>, <code>middle</code> or <code>end</code> for the <code>Via</code> class. In the <code>Via</code> we are going to index the attribute <code>middle</code> because it is the most representative field for us. Because the <code>Via</code> is inside the <code>Census</code> and it doesn’t have its own collection, we create the index on the <code>Census</code> collection.</p>
<p>The index type used is <code>2dsphere</code>, because it supports queries that work on geometries over an earth-like sphere. Another option is the <code>2d</code> index, but it’s not a good fit for our use case because it is meant for queries that calculate geometries on a two-dimensional plane.</p>
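<p>Putting this section together, the index creation could look like this with motor (a sketch; the collection names <code>noise</code>, <code>trees</code> and <code>census</code>, and the embedded <code>via.middle</code> path, are assumptions based on the data model above):</p>
<pre><code>async def create_indexes(db):
    await db.noise.create_index([('geo', '2dsphere')])
    await db.trees.create_index([('geo', '2dsphere')])
    # Via has no collection of its own, so the index goes on Census,
    # through the embedded document's path
    await db.census.create_index([('via.middle', '2dsphere')])
</code></pre>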
<h3 id="running_the_server"><a class="anchor" href="#running_the_server">¶</a>Running the server</h3>
<p>If we ignore the configuration part of the server creation, our <code>server.py</code> file is pretty simple. Its job is to create a <a href="https://aiohttp.readthedocs.io/en/stable/web.html">server application</a>, set up Mongo, and return it to the caller so that they can run it:</p>
<pre><code>import asyncio
import os
import subprocess

import motor.motor_asyncio
from aiohttp import web

from . import rest, data

def create_app():
    ret = subprocess.run('npm run build', cwd='../client', shell=True).returncode
    if ret != 0:
        exit(ret)

    db = motor.motor_asyncio.AsyncIOMotorClient().opendata
    loop = asyncio.get_event_loop()
    loop.run_until_complete(data.insert_to_db(db))

    app = web.Application()
    app['db'] = db
    app.router.add_routes([
        web.get('/', lambda r: web.HTTPSeeOther('/index.html')),
        *rest.ROUTES,
        # config comes from the configuration parsing omitted in this post
        web.static('/', os.path.join(config['www']['root'], 'public')),
    ])
    return app
</code></pre>
<p>There’s a bit going on here, but it’s nothing too complex:</p>
<ul>
<li>We automatically run <code>npm run build</code> on the frontend because it’s very comfortable to have the frontend built automatically before the server runs.</li>
<li>We create a Motor client and access the <code>opendata</code> database. Into it, we load the data, effectively saving it in Mongo for the server to use.</li>
<li>We create the server application and save a reference to the Mongo database in it, so that it can be used later on any endpoint without needing to recreate it.</li>
<li>We define the routes of our app: root, REST and static (where the frontend files live). We’ll get to the <code>rest</code> part soon.</li>
</ul>
<p>Running the server is now simple:</p>
<pre><code>def main():
    from aiohttp import web
    from . import server

    app = server.create_app()
    web.run_app(app)

if __name__ == '__main__':
    main()
</code></pre>
<h3 id="rest_endpoints"><a class="anchor" href="#rest_endpoints">¶</a>REST endpoints</h3>
<p>The frontend will communicate with the backend via <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">REST</a> calls, so that it can ask for things like «give me the information associated with this area», and the web server can query the Mongo server to reply with an HTTP response. This little diagram should help:</p>
<p><img src="bitmap.png" alt="" /></p>
<p>What we need to do, then, is define those REST endpoints we mentioned earlier when creating the server. We will process the HTTP request, ask Mongo for the data, and return the HTTP response:</p>
<pre><code>import pymongo
from aiohttp import web

async def get_area_info(request):
    try:
        long = float(request.query['long'])
        lat = float(request.query['lat'])
        distance = float(request.query['distance'])
    except KeyError as e:
        raise web.HTTPBadRequest(reason=f'a required parameter was missing: {e.args[0]}')
    except ValueError:
        raise web.HTTPBadRequest(reason='one of the parameters was not a valid float')

    geo_avg_noise_pipeline = [{
        '$geoNear': {
            'near': {'type': 'Point', 'coordinates': [long, lat]},
            'maxDistance': distance,
            'minDistance': 0,
            'spherical': 'true',
            'distanceField': 'distance'
        }
    }]

    db = request.app['db']
    try:
        noise_count, sum_noise = 0, 0
        async for item in db.noise.aggregate(geo_avg_noise_pipeline):
            noise_count += 1
            sum_noise += item['level']

        if noise_count != 0:
            avg_noise = sum_noise / noise_count
        else:
            avg_noise = None
    except pymongo.errors.ConnectionFailure:
        raise web.HTTPServiceUnavailable(reason='no connection to database')

    # tree_count, trees_per_type and census_count are computed with similar
    # queries in the real code; they are omitted here for brevity.
    return web.json_response({
        'tree_count': tree_count,
        'trees_per_type': [[k, v] for k, v in trees_per_type.items()],
        'census_count': census_count,
        'avg_noise': avg_noise,
    })

ROUTES = [
    web.get('/rest/get-area-info', get_area_info)
]
</code></pre>
<p>In this code, we’re only showing how to return the average noise because that’s the simplest thing we can do. The real code also fetches the tree count, the tree count per type, and the census count.</p>
<p>Again, there’s quite a bit to go through, so let’s go step by step:</p>
<ul>
<li>We parse the frontend’s <code>request.query</code> into <code>float</code> that we can use. In particular, the frontend is asking us for information at a certain latitude, longitude, and distance. If the query is malformed, we return a proper error.</li>
<li>We create our query for Mongo outside, just so it’s clearer to read.</li>
<li>We access the database reference we stored earlier when creating the server with <code>request.app['db']</code>. Handy!</li>
<li>We try to query Mongo. It may fail if the Mongo server is not running, so we should handle that and tell the client what’s happening. If it succeeds though, we will gather information about the average noise.</li>
<li>We return a <code>json_response</code> with Mongo results for the frontend to present to the user.</li>
</ul>
<p>You may have noticed we defined a <code>ROUTES</code> list at the bottom. This will make it easier to expand in the future, and the server creation won’t need to change anything in its code, because it’s already unpacking all the routes we define here.</p>
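<p>As an illustration of one of those omitted computations, the tree count per type could be obtained with a pipeline much like the noise one, plus a <code>$group</code> stage (a hypothetical sketch; it assumes the collection is called <code>trees</code> and groups by the <code>name</code> field from the Tree model):</p>
<pre><code>tree_type_pipeline = [
    {'$geoNear': {
        'near': {'type': 'Point', 'coordinates': [long, lat]},
        'maxDistance': distance,
        'minDistance': 0,
        'spherical': True,
        'distanceField': 'distance',
    }},
    # count one document per distinct tree name
    {'$group': {'_id': '$name', 'count': {'$sum': 1}}},
]

trees_per_type = {}
async for item in db.trees.aggregate(tree_type_pipeline):
    trees_per_type[item['_id']] = item['count']
</code></pre>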
<h3 id="geospatial_queries"><a class="anchor" href="#geospatial_queries">¶</a>Geospatial queries</h3>
<p>In order to retrieve the information from the Mongo database, we have defined two geospatial queries:</p>
<pre><code>geo_query = {
    '$nearSphere': {
        '$geometry': {
            'type': 'Point',
            'coordinates': [long, lat]
        },
        '$maxDistance': distance,
        '$minDistance': 0
    }
}
</code></pre>
<p>This query uses <a href="https://docs.mongodb.com/manual/reference/operator/query/nearSphere/#op._S_nearSphere">the operator <code>$nearSphere</code></a>, which returns geospatial objects in proximity to a point on a sphere.</p>
<p>The sphere point is represented by the <code>$geometry</code> operator, which specifies the type of geometry and the coordinates (given by the HTTP request).</p>
<p>The maximum and minimum distances are represented by <code>$maxDistance</code> and <code>$minDistance</code> respectively. We specify that the maximum distance is the radius selected by the user.</p>
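<p>For reference, this is roughly how the backend could run such a query with motor, here against the <code>geo</code> field of the trees collection (a sketch; the collection name is an assumption):</p>
<pre><code>async def find_near(db, long, lat, distance):
    geo_query = {
        '$nearSphere': {
            '$geometry': {'type': 'Point', 'coordinates': [long, lat]},
            '$maxDistance': distance,
            '$minDistance': 0,
        }
    }
    # results come back ordered from nearest to farthest
    return [doc async for doc in db.trees.find({'geo': geo_query})]
</code></pre>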
<pre><code>geo_avg_noise_pipeline = [{
    '$geoNear': {
        'near': {'type': 'Point', 'coordinates': [long, lat]},
        'maxDistance': distance,
        'minDistance': 0,
        'spherical': 'true',
        'distanceField': 'distance'
    }
}]
</code></pre>
<p>This query uses the <a href="https://docs.mongodb.com/manual/core/aggregation-pipeline/">aggregation pipeline</a> stage <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/geoNear/#pipe._S_geoNear"><code>$geoNear</code></a> which returns an ordered stream of documents based on the proximity to a geospatial point. The output documents include an additional distance field.</p>
<p>The <code>near</code> field is mandatory and is the point for which to find the closest documents. It specifies the type of geometry and the coordinates (given by the HTTP request).</p>
<p>The <code>distanceField</code> field is also mandatory and is the output field that will contain the calculated distance. In this case we’ve just called it <code>distance</code>.</p>
<p>Some other fields are <code>maxDistance</code> that indicates the maximum allowed distance from the center of the point, <code>minDistance</code> for the minimum distance, and <code>spherical</code> which tells MongoDB how to calculate the distance between two points.</p>
<p>We specify the maximum distance as the radius selected by the user in the frontend.</p>
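<p>As an aside, the averaging we do in Python could also be pushed into this pipeline by appending a <code>$group</code> stage with the <code>$avg</code> accumulator. We didn’t take this route, but a sketch of the variation would be:</p>
<pre><code>avg_in_mongo_pipeline = [
    {'$geoNear': {
        'near': {'type': 'Point', 'coordinates': [long, lat]},
        'maxDistance': distance,
        'minDistance': 0,
        'spherical': True,
        'distanceField': 'distance',
    }},
    # a single output document holding the average of all matched levels
    {'$group': {'_id': None, 'avg_noise': {'$avg': '$level'}}},
]
</code></pre>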
<h2 id="frontend"><a class="anchor" href="#frontend">¶</a>Frontend</h2>
<p>As said earlier, our frontend will use Svelte. We already downloaded the template, so we can start developing. For some, this is the most fun part, because they can finally see and interact with some of the results. But for this interaction to work, we needed a functional backend which we now have!</p>
<h3 id="rest_queries"><a class="anchor" href="#rest_queries">¶</a>REST queries</h3>
<p>The frontend has to query the server to get any meaningful data to show on the page. The <a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API">Fetch API</a> does not throw an exception if the server doesn’t respond with HTTP OK, but we would like one if things go wrong so that we can handle it gracefully. The first thing we’ll do is define our own exception, <a href="https://stackoverflow.com/a/27724419">which is not pretty</a>:</p>
<pre><code>function NetworkError(message, status) {
    var instance = new Error(message);
    instance.name = 'NetworkError';
    instance.status = status;
    Object.setPrototypeOf(instance, Object.getPrototypeOf(this));
    if (Error.captureStackTrace) {
        Error.captureStackTrace(instance, NetworkError);
    }
    return instance;
}

NetworkError.prototype = Object.create(Error.prototype, {
    constructor: {
        value: Error,
        enumerable: false,
        writable: true,
        configurable: true
    }
});

Object.setPrototypeOf(NetworkError, Error);
</code></pre>
<p>But hey, now we have a proper and reusable <code>NetworkError</code>! Next, let’s make a proper and reusable <code>query</code> function that deals with <code>fetch</code> for us:</p>
<pre><code>async function query(endpoint) {
    const res = await fetch(endpoint, {
        // if we ever use cookies, this is important
        credentials: 'include'
    });
    if (res.ok) {
        return await res.json();
    } else {
        throw new NetworkError(await res.text(), res.status);
    }
}
</code></pre>
<p>At last, we can query our web server. The export here tells Svelte that this function should be visible to outer modules (public) as opposed to being private:</p>
<pre><code>export function get_area_info(long, lat, distance) {
    return query(`/rest/get-area-info?long=${long}&lat=${lat}&distance=${distance}`);
}
</code></pre>
<p>The attentive reader will have noticed that <code>query</code> is <code>async</code>, but <code>get_area_info</code> is not. This is intentional, because we don’t need to <code>await</code> for anything inside of it. We can just return the <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise"><code>Promise</code></a> that <code>query</code> created and let the caller <code>await</code> it as they see fit. The <code>await</code> here would have been redundant.</p>
<p>For those of you who don’t know what a JavaScript promise is, think of it as an object that represents «an eventual result». The result may not be there yet, but we promised it will be present in the future, and we can <code>await</code> for it. You can also find the same concept in other languages like Python under a different name, such as <a href="https://docs.python.org/3/library/asyncio-future.html#asyncio.Future"><code>Future</code></a>.</p>
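<p>The analogy extends to the «redundant <code>await</code>» point above: in Python, too, a plain function can hand back an awaitable for the caller to <code>await</code>. A contrived sketch:</p>
<pre><code>import asyncio

async def query():
    await asyncio.sleep(0.1)  # stand-in for a network call
    return {'avg_noise': 42.0}

def get_area_info():
    # not async: we simply return the awaitable without awaiting it
    return query()

async def main():
    print(await get_area_info())  # the caller awaits when it needs the value

asyncio.run(main())
</code></pre>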
<h3 id="map_component"><a class="anchor" href="#map_component">¶</a>Map component</h3>
<p>In Svelte, we can define self-contained components that are isolated from the rest. This makes it really easy to create a modular application. Think of a Svelte component as your own HTML tag, which you can customize however you want, building upon the already-existing components HTML has to offer.</p>
<p>The main thing that our map needs to do is render the map as an image and overlay the selection area as the user hovers the map with their mouse. We could render the image in the canvas itself, but instead we’ll use the HTML <code>&lt;img&gt;</code> tag for that and put a transparent <code>&lt;canvas&gt;</code> on top with some CSS. This should make it cheaper and easier to render things on the canvas.</p>
<p>The <code>Map</code> component will thus render as the user moves the mouse over it, and produce an event when they click so that whatever component is using a <code>Map</code> knows that it was clicked. Here’s the final CSS and HTML:</p>
<pre><code><style>
    div {
        position: relative;
    }

    canvas {
        position: absolute;
        left: 0;
        top: 0;
        cursor: crosshair;
    }
</style>

<div>
    <img bind:this={img} on:load={handleLoad} {height} src="caceres-municipality.svg" alt="Cáceres (municipality)"/>
    <canvas
        bind:this={canvas}
        on:mousemove={handleMove}
        on:wheel={handleWheel}
        on:mouseup={handleClick}/>
</div>
</code></pre>
<p>We hardcode a map source here, but ideally this would be provided by the server. The project is already complex enough, so we tried to avoid more complexity than necessary.</p>
<p>We bind the tags to some variables declared in the JavaScript code of the component, along with some functions and parameters to let the users of <code>Map</code> customize it just a little.</p>
<p>Here’s the gist of the JavaScript code:</p>
<pre><code><script>
    import { createEventDispatcher, onMount } from 'svelte';

    export let height = 200;

    const dispatch = createEventDispatcher();

    let img;
    let canvas;

    const LONG_WEST = -6.426881;
    const LONG_EAST = -6.354143;
    const LAT_NORTH = 39.500064;
    const LAT_SOUTH = 39.443201;

    let x = 0;
    let y = 0;
    let clickInfo = null; // [x, y, radius]
    let radiusDelta = 0.005 * height;
    let maxRadius = 0.2 * height;
    let minRadius = 0.01 * height;
    let radius = 0.05 * height;

    function handleLoad() {
        canvas.width = img.width;
        canvas.height = img.height;
    }

    function handleMove(event) {
        const { left, top } = this.getBoundingClientRect();
        x = Math.round(event.clientX - left);
        y = Math.round(event.clientY - top);
    }

    function handleWheel(event) {
        if (event.deltaY < 0) {
            if (radius < maxRadius) {
                radius += radiusDelta;
            }
        } else {
            if (radius > minRadius) {
                radius -= radiusDelta;
            }
        }
        event.preventDefault();
    }

    function handleClick(event) {
        dispatch('click', {
            // the real code here maps the x/y/radius values to the right range, here omitted
            x: ...,
            y: ...,
            radius: ...,
        });
    }

    onMount(() => {
        const ctx = canvas.getContext('2d');
        let frame;

        (function loop() {
            frame = requestAnimationFrame(loop);
            // the real code renders mouse area/selection, here omitted for brevity
            ...
        }());

        return () => {
            cancelAnimationFrame(frame);
        };
    });
</script>
</code></pre>
<p>Let’s go through bit-by-bit:</p>
<ul>
<li>We define a few variables and constants for later use in the final code.</li>
<li>We define the handlers to react to mouse movement and clicks. On click, we dispatch an event to outer components.</li>
<li>We set up the render loop with animation frames, and cancel the current frame appropriately if the component disappears.</li>
</ul>
<h3 id="app_component"><a class="anchor" href="#app_component">¶</a>App component</h3>
<p>Time to put everything together! We will include our function to make REST queries along with our <code>Map</code> component to render things on screen.</p>
<pre><code><script>
    import Map from './Map.svelte';
    import { get_area_info } from './rest.js'

    let selection = null;
    let area_info_promise = null;

    function handleMapSelection(event) {
        selection = event.detail;
        area_info_promise = get_area_info(selection.x, selection.y, selection.radius);
    }

    function format_avg_noise(avg_noise) {
        if (avg_noise === null) {
            return '(no data)';
        } else {
            return `${avg_noise.toFixed(2)} dB`;
        }
    }
</script>

<div class="container-fluid">
    <div class="row">
        <div class="col-3" style="max-width: 300em;">
            <div class="text-center">
                <h1>Caceres Data Consultory</h1>
            </div>
            <Map height={400} on:click={handleMapSelection}/>
            <div class="text-center mt-4">
                {#if selection === null}
                    <p class="m-1 p-3 border border-bottom-0 bg-info text-white">Click on the map to select the area you wish to see details for.</p>
                {:else}
                    <h2 class="bg-dark text-white">Selected area</h2>
                    <p><b>Coordinates:</b> ({selection.x}, {selection.y})</p>
                    <p><b>Radius:</b> {selection.radius} meters</p>
                {/if}
            </div>
        </div>
        <div class="col-sm-4">
            <div class="row">
                {#if area_info_promise !== null}
                    {#await area_info_promise}
                        <p>Fetching area information…</p>
                    {:then area_info}
                        <div class="col">
                            <div class="text-center">
                                <h2 class="m-1 bg-dark text-white">Area information</h2>
                                <ul class="list-unstyled">
                                    <li>There are <b>{area_info.tree_count} trees</b> within the area</li>
                                    <li>The <b>average noise</b> is <b>{format_avg_noise(area_info.avg_noise)}</b></li>
                                    <li>There are <b>{area_info.census_count} persons</b> within the area</li>
                                </ul>
                            </div>
                            {#if area_info.trees_per_type.length > 0}
                                <div class="text-center">
                                    <h2 class="m-1 bg-dark text-white">Tree count per type</h2>
                                </div>
                                <ul class="list-group">
                                    {#each area_info.trees_per_type as [type, count]}
                                        <li class="list-group-item">{type} <span class="badge badge-dark float-right">{count}</span></li>
                                    {/each}
                                </ul>
                            {/if}
                        </div>
                    {:catch error}
                        <p>Failed to fetch area information: {error.message}</p>
                    {/await}
                {/if}
            </div>
        </div>
    </div>
</div>
</code></pre>
<ul>
<li>We import the <code>Map</code> component and REST function so we can use them.</li>
<li>We define a listener for the events that the <code>Map</code> produces. Such an event will trigger a REST call to the server and save the result in a promise used later.</li>
<li>We’re using Bootstrap for the layout because it’s a lot easier. In the body we add our <code>Map</code> and another column to show the selection information.</li>
<li>We make use of Svelte’s <code>{#await}</code> to nicely notify the user when the call is being made, when it was successful, and when it failed. If it’s successful, we display the info.</li>
</ul>
<h2 id="results"><a class="anchor" href="#results">¶</a>Results</h2>
<p>Lo and behold, watch our application run!</p>
<p><video controls="controls" src="sr-2020-04-14_09-28-25.mp4"></video></p>
<p>In this video you can see our application running, but let’s describe what is happening in more detail.</p>
<p>When the application starts running (by opening it in your web browser of choice), you can see a map with the town of Cáceres. Then you, the user, can click to retrieve the information within the selected area.</p>
<p>It is important to note that one can make the selection area larger or smaller by trying to scroll up or down, respectively.</p>
<p>Once an area is selected, it is colored green in order to let the user know which area they have selected. Under the map, the selected coordinates and the radius (in meters) are also shown for the curious. On the right side, the information concerning the selected area is shown, such as the number of trees, the average noise and the number of persons. If there are trees in the area, the application also displays the tree count per type, sorted by the number of trees.</p>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>We hope you enjoyed reading this post as much as we enjoyed writing it! Feel free to download the final project and play around with it. Maybe you can adapt it for even more interesting purposes!</p>
<p><em>download removed</em></p>
<p>To run the above code:</p>
<ol>
<li>Unzip the downloaded file.</li>
<li>Make a copy of <code>example-server-config.ini</code> and rename it to <code>server-config.ini</code>, then edit the file to suit your needs.</li>
<li>Run the server with <code>python -m server</code>.</li>
<li>Open <a href="http://localhost:9000">localhost:9000</a> in your web browser (or whatever port you chose) and enjoy!</li>
</ol>
</main>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>MongoDB: an Introduction</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the first post in the MongoDB series, where we will introduce the MongoDB database system and take a look at its features and installation methods.</p>
<div class="date-created-modified">Created 2020-03-05<br>
Modified 2020-04-08</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a> (this post)</li>
<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a></li>
<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a></li>
</ul>
<p>This post is co-authored with a Classmate.</p>
<hr />
<div class="image-container">
<img src="mongodb.png" alt="NoSQL database – MongoDB – First delivery" />
<div class="image-caption"></div>
</div>
<h2 class="title" id="purpose_of_technology"><a class="anchor" href="#purpose_of_technology">¶</a>Purpose of technology</h2>
<p>MongoDB is a <strong>general purpose, document-based, distributed database</strong> built for modern application developers and for the cloud era, with the scalability and flexibility that you want with the querying and indexing that you need. It being a document database means it stores data in JSON-like documents.</p>
<p>The Mongo team believes this is the most natural way to think about data, which is (they claim) much more expressive and powerful than the traditional row/column model, since programmers think in objects.</p>
<h2 id="how_it_works"><a class="anchor" href="#how_it_works">¶</a>How it works</h2>
<p>MongoDB’s architecture can be summarized as follows:</p>
<ul>
<li>Document data model.</li>
<li>Distributed systems design.</li>
<li>Unified experience with freedom to run it anywhere.</li>
</ul>
<p>For a more in-depth explanation, MongoDB offers a <a href="https://www.mongodb.com/collateral/mongodb-architecture-guide">download to the MongoDB Architecture Guide</a> with roughly ten pages worth of text.</p>
<p><img src="knGHenfTGA4kzJb1PHmS9EQvtZl2QlhbIPN15M38m8fZfZf7ODwYfhf0Tltr.png" alt="" />
_ Overview of MongoDB’s architecture_</p>
<p>Regarding usage, MongoDB comes with a really nice introduction, along with JavaScript, Python, Java, C++ or C# code of our choice, which describes the steps necessary to make it work. Below we describe a common workflow.</p>
<p>First, we must <strong>connect</strong> to a running MongoDB instance. Once the connection succeeds, we can access individual «collections», which we can think of as <em>tables</em> in which documents are stored.</p>
<p>For instance, we could <strong>insert</strong> an arbitrary JSON document into the <code>restaurants</code> collection to store information about a restaurant.</p>
<p>At any other point in time, we can <strong>query</strong> these collections. The queries range from trivial, empty ones (which would retrieve all the documents and fields) to more rich and complex queries (for instance, using AND and OR operators, checking if data exists, and then looking for a value in a list).</p>
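<p>As a sketch of what that looks like in practice, here is such a query from Python with <code>pymongo</code> (the field names are invented for the example):</p>
<pre><code>from pymongo import MongoClient

db = MongoClient().test  # assumes a local Mongo instance

# restaurants with an address that are either bakeries or highly rated
restaurants = db.restaurants.find({
    '$and': [
        {'address': {'$exists': True}},
        {'$or': [{'category': 'Bakery'}, {'stars': {'$gte': 4}}]},
    ]
})
for restaurant in restaurants:
    print(restaurant)
</code></pre>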
<p>MongoDB also supports the creation of <strong>indices</strong>, similar to those in other database systems. It allows for the creation of indices on any field or subfields.</p>
<p>In Mongo, the <strong>aggregation pipeline</strong> allows us to filter and analyze data based on a given set of criteria. For example, we could pull all the documents in the <code>restaurants</code> collection that have a <code>category</code> of <code>Bakery</code> using the <code>$match</code> operator. Then, we can group them by their star rating using the <code>$group</code> operator. Using the accumulator operator, <code>$sum</code>, we can see how many bakeries in our collection have each star rating.</p>
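<p>Expressed with <code>pymongo</code>, that pipeline might look as follows (a sketch reusing the hypothetical <code>category</code> and <code>stars</code> fields):</p>
<pre><code>from pymongo import MongoClient

db = MongoClient().test  # assumes a local Mongo instance

pipeline = [
    {'$match': {'category': 'Bakery'}},                   # keep only bakeries
    {'$group': {'_id': '$stars', 'count': {'$sum': 1}}},  # count per star rating
]
for row in db.restaurants.aggregate(pipeline):
    print(f"{row['count']} bakeries rated {row['_id']} stars")
</code></pre>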
<h2 id="features"><a class="anchor" href="#features">¶</a>Features</h2>
<p>The features can be seen all over their site, because they are something MongoDB puts a lot of emphasis on:</p>
<ul>
<li><strong>Easy development</strong>, thanks to the document data model, something they claim to be «the best way to work with data»:
<ul>
<li>Data is stored in flexible JSON-like documents.</li>
<li>This model directly maps to the objects in the application’s code.</li>
<li>Ad hoc queries, indexing, and real time aggregation provide powerful ways to access and analyze the data.</li>
</ul>
</li>
<li><strong>Powerful query language</strong>, with a rich and expressive query language that allows filtering and sorting by any field, no matter how nested it may be within a document. The queries are themselves JSON, and thus easily composable.</li>
<li><strong>Support for aggregations</strong> and other modern use-cases such as geo-based search, graph search, and text search.</li>
<li><strong>A distributed systems design</strong>, which allows developers to intelligently put data where they want it. High availability, horizontal scaling, and geographic distribution are built in and easy to use.</li>
<li><strong>A unified experience</strong> with the freedom to run anywhere, which allows developers to future-proof their work and eliminate vendor lock-in.</li>
</ul>
<h2 id="corner_in_cap_theorem"><a class="anchor" href="#corner_in_cap_theorem">¶</a>Corner in CAP theorem</h2>
<p>MongoDB’s position in the CAP theorem (Consistency, Availability, Partition Tolerance) depends on the database and driver configurations, and the type of disaster.</p>
<ul>
<li>With <strong>no partitions</strong>, the main focus is <strong>CA</strong>.</li>
<li>If there are <strong>partitions</strong> but the system is <strong>strongly connected</strong>, the main focus is <strong>AP</strong>: non-synchronized writes from the old primary are ignored.</li>
<li>If there are <strong>partitions</strong> but the system is <strong>not strongly connected</strong>, the main focus is <strong>CP</strong>: only read access is provided to avoid inconsistencies.</li>
</ul>
<p>The general consensus seems to be that Mongo is <strong>CP</strong>.</p>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>We will be using the apt-based installation.</p>
<p>The Community version can be downloaded by anyone through the <a href="https://www.mongodb.com/download-center/community">MongoDB Download Center</a>, where one can choose the version, operating system and package. MongoDB also seems to be <a href="https://packages.ubuntu.com/eoan/mongodb">available in Ubuntu’s PPAs</a>.</p>
<h2 id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
<p>We will be using an Ubuntu-based system, with apt available. To install MongoDB, we open a terminal and run the following command:</p>
<pre><code>apt install mongodb
</code></pre>
<p>After confirming that we do indeed want to install the package, we should be able to run the following command to verify that the installation was successful:</p>
<pre><code>mongod --version
</code></pre>
<p>The output should be similar to the following: </p>
<pre><code>db version v4.0.16
git version: 2a5433168a53044cb6b4fa8083e4cfd7ba142221
OpenSSL version: OpenSSL 1.1.1 11 Sep 2018
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1804
distarch: x86_64
target_arch: x86_64
</code></pre>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://www.mongodb.com/">MongoDB’s official site</a></li>
<li><a href="https://www.mongodb.com/what-is-mongodb">What is MongoDB?</a></li>
<li><a href="https://www.mongodb.com/mongodb-architecture">MongoDB Architecture</a></li>
<li><a href="https://stackoverflow.com/q/11292215/4759433">Where does mongodb stand in the CAP theorem?</a></li>
<li><a href="https://medium.com/@bikas.katwal10/mongodb-vs-cassandra-vs-rdbms-where-do-they-stand-in-the-cap-theorem-1bae779a7a15">What is the CAP Theorem? MongoDB vs Cassandra vs RDBMS, where do they stand in the CAP theorem?</a></li>
<li><a href="https://www.quora.com/Why-doesnt-MongoDB-have-availability-in-the-CAP-theorem">Why doesn’t MongoDB have availability in the CAP theorem?</a></li>
<li><a href="https://docs.mongodb.com/manual/installation/">Install MongoDB</a></li>
</ul>
</main>
</body>
</html>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>MongoDB: Basic Operations and Architecture</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the second post in the MongoDB series, where we will take a look at the <a href="https://stackify.com/what-are-crud-operations/">CRUD operations</a> they support, the data model and architecture used.</p>
<div class="date-created-modified">Created 2020-03-05<br>
Modified 2020-04-08</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/mongodb-an-introduction/">MongoDB: an Introduction</a></li>
<li><a href="/blog/ribw/mongodb-basic-operations-and-architecture/">MongoDB: Basic Operations and Architecture</a> (this post)</li>
<li><a href="/blog/ribw/developing-a-python-application-for-mongodb/">Developing a Python application for MongoDB</a></li>
</ul>
<p>This post is co-authored with a Classmate, and in it we will take an explorative approach, using the <code>mongo</code> command line shell to execute commands against the database. It even has TAB auto-completion, which is awesome!</p>
<hr />
<p>Before creating any documents, we first need to create somewhere for the documents to be in. And before we create anything, the database has to be running, so let’s do that first. If we don’t have a service installed, we can run the <code>mongod</code> command ourselves in some local folder to make things easier:</p>
<pre><code>$ mkdir -p mongo-database
$ mongod --dbpath mongo-database
</code></pre>
<p>Just like that, we will have Mongo running. Now, let’s connect to it using the <code>mongo</code> command in another terminal (don’t close the terminal where the server is running, we need it!). By default, it connects to localhost, which is just what we need.</p>
<pre><code>$ mongo
</code></pre>
<h2 class="title" id="create"><a class="anchor" href="#create">¶</a>Create</h2>
<h3 id="create_a_database"><a class="anchor" href="#create_a_database">¶</a>Create a database</h3>
<p>Let’s list the databases:</p>
<pre><code>> show databases
admin 0.000GB
config 0.000GB
local 0.000GB
</code></pre>
<p>Oh, how interesting! There are already some databases, even though we just created the directory where Mongo will store everything. However, they seem empty, which makes sense.</p>
<p>Creating a new database is done by <code>use</code>-ing a name that doesn’t exist. Let’s call our new database «helloworld».</p>
<pre><code>> use helloworld
switched to db helloworld
</code></pre>
<p>Good! Now the «local variable» called <code>db</code> points to our <code>helloworld</code> database.</p>
<pre><code>> db
helloworld
</code></pre>
<p>What happens if we print the databases again? Surely our new database will show up now…</p>
<pre><code>> show databases
admin 0.000GB
config 0.000GB
local 0.000GB
</code></pre>
<p>…maybe not! It seems Mongo won’t create the database until we create some collections and documents in it. Databases contain collections, and inside collections (which you can think of as tables) we can insert new documents (which you can think of as rows). Like in many programming languages, the dot operator is used to access these «members».</p>
<h3 id="create_a_document"><a class="anchor" href="#create_a_document">¶</a>Create a document</h3>
<p>Let’s add a new greeting into the <code>greetings</code> collection:</p>
<pre><code>> db.greetings.insert({message: "¡Bienvenido!", lang: "es"})
WriteResult({ "nInserted" : 1 })
> show collections
greetings
> show databases
admin 0.000GB
config 0.000GB
helloworld 0.000GB
local 0.000GB
</code></pre>
<p>That looks promising! We can also see that our new <code>helloworld</code> database now shows up. The Mongo shell actually runs JavaScript-like code, which is why we can use a relaxed variant of JSON (stored by Mongo as BSON) to insert documents (note the lack of quotes around the keys, convenient!).</p>
<p>The <a href="https://docs.mongodb.com/manual/reference/method/db.collection.insert/index.html"><code>insert</code></a> method actually supports a list of documents, and by default Mongo will assign a unique identifier to each. If we don’t want that though, all we have to do is add the <code>_id</code> key to our documents.</p>
<pre><code>> db.greetings.insert([
... {message: "Welcome!", lang: "en"},
... {message: "Bonjour!", lang: "fr"},
... ])
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 2,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]
})
</code></pre>
<h3 id="create_a_collection"><a class="anchor" href="#create_a_collection">¶</a>Create a collection</h3>
<p>In this example, we created the collection <code>greetings</code> implicitly, but behind the scenes Mongo made a call to <a href="https://docs.mongodb.com/manual/reference/method/db.createCollection/"><code>createCollection</code></a>. Let’s do just that:</p>
<pre><code>> db.createCollection("goodbyes")
{ "ok" : 1 }
> show collections
goodbyes
greetings
</code></pre>
<p>The method actually has an optional parameter to configure other options, like the maximum size of the collection or the maximum number of documents in it, validation-related options, and so on. These are all described in more detail in the documentation.</p>
<h2 id="read"><a class="anchor" href="#read">¶</a>Read</h2>
<p>To read the contents of a document, we have to <a href="https://docs.mongodb.com/manual/reference/method/db.collection.find/index.html"><code>find</code></a> it.</p>
<pre><code>> db.greetings.find()
{ "_id" : ObjectId("5e74829a0659f802b15f18dd"), "message" : "¡Bienvenido!", "lang" : "es" }
{ "_id" : ObjectId("5e7487b90659f802b15f18de"), "message" : "Welcome!", "lang" : "en" }
{ "_id" : ObjectId("5e7487b90659f802b15f18df"), "message" : "Bonjour!", "lang" : "fr" }
</code></pre>
<p>That’s a bit unreadable for my taste. Can we make it more <a href="https://docs.mongodb.com/manual/reference/method/cursor.pretty/index.html"><code>pretty</code></a>?</p>
<pre><code>> db.greetings.find().pretty()
{
"_id" : ObjectId("5e74829a0659f802b15f18dd"),
"message" : "¡Bienvenido!",
"lang" : "es"
}
{
"_id" : ObjectId("5e7487b90659f802b15f18de"),
"message" : "Welcome!",
"lang" : "en"
}
{
"_id" : ObjectId("5e7487b90659f802b15f18df"),
"message" : "Bonjour!",
"lang" : "fr"
}
</code></pre>
<p>Gorgeous! We can clearly see Mongo created an identifier for us automatically. The queries are also JSON, and support a bunch of operators (prefixed by <code>$</code>), known as <a href="https://docs.mongodb.com/manual/reference/operator/query/">Query Selectors</a>. Here’s a few:</p>
<table>
<thead>
<tr>
<th>
Operation
</th>
<th>
Syntax
</th>
<th>
RDBMS equivalent
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Equals
</td>
<td>
<code>
{key: {$eq: value}}
</code>
<br/>
Shorthand:
<code>
{key: value}
</code>
</td>
<td>
<code>
where key = value
</code>
</td>
</tr>
<tr>
<td>
Less Than
</td>
<td>
<code>
{key: {$lt: value}}
</code>
</td>
<td>
<code>
where key < value
</code>
</td>
</tr>
<tr>
<td>
Less Than or Equal
</td>
<td>
<code>
{key: {$lte: value}}
</code>
</td>
<td>
<code>
where key <= value
</code>
</td>
</tr>
<tr>
<td>
Greater Than
</td>
<td>
<code>
{key: {$gt: value}}
</code>
</td>
<td>
<code>
where key > value
</code>
</td>
</tr>
<tr>
<td>
Greater Than or Equal
</td>
<td>
<code>
{key: {$gte: value}}
</code>
</td>
<td>
<code>
where key >= value
</code>
</td>
</tr>
<tr>
<td>
Not Equal
</td>
<td>
<code>
{key: {$ne: value}}
</code>
</td>
<td>
<code>
where key != value
</code>
</td>
</tr>
<tr>
<td>
And
</td>
<td>
<code>
{$and: [{k1: v1}, {k2: v2}]}
</code>
</td>
<td>
<code>
where k1 = v1 and k2 = v2
</code>
</td>
</tr>
<tr>
<td>
Or
</td>
<td>
<code>
{$or: [{k1: v1}, {k2: v2}]}
</code>
</td>
<td>
<code>
where k1 = v1 or k2 = v2
</code>
</td>
</tr>
</tbody>
</table>
<p>The operations all do what you would expect them to do, and their names are really intuitive. Aggregating operations with <code>$and</code> or <code>$or</code> can be done anywhere in the query, nested any level deep.</p>
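<p>To see the nesting in action, here is the same kind of query issued from Python with <code>pymongo</code> (the shell accepts the literal equivalent of the same document):</p>
<pre><code>from pymongo import MongoClient

db = MongoClient().helloworld

# greetings in Spanish, or English ones whose message is not empty
cursor = db.greetings.find({
    '$or': [
        {'lang': 'es'},
        {'$and': [{'lang': 'en'}, {'message': {'$ne': ''}}]},
    ]
})
for doc in cursor:
    print(doc)
</code></pre>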
<h2 id="update"><a class="anchor" href="#update">¶</a>Update</h2>
<p>Updating a document can be done by using <a href="https://docs.mongodb.com/manual/reference/method/db.collection.save/index.html"><code>save</code></a> on an already-existing document (that is, the document we want to save has <code>_id</code> and it’s in the collection already). If the document is not in the collection yet, this method will create it.</p>
<pre><code>> db.greetings.save({_id: ObjectId("5e74829a0659f802b15f18dd"), message: "¡Bienvenido, humano!", "lang" : "es"})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.greetings.find({lang: "es"})
{ "_id" : ObjectId("5e74829a0659f802b15f18dd"), "message" : "¡Bienvenido, humano!", "lang" : "es" }
</code></pre>
<p>Alternatively, the <a href="https://docs.mongodb.com/manual/reference/method/db.collection.update/index.html"><code>update</code></a> method takes a query and new value.</p>
<pre><code>> db.greetings.update({lang: "en"}, {$set: {message: "Welcome, human!"}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.greetings.find({lang: "en"})
{ "_id" : ObjectId("5e7487b90659f802b15f18de"), "message" : "Welcome, human!", "lang" : "en" }
</code></pre>
<h2 id="indexing"><a class="anchor" href="#indexing">¶</a>Indexing</h2>
<p>Creating an index is done with <a href="https://docs.mongodb.com/manual/reference/method/db.collection.createIndex/index.html"><code>createIndex</code></a>:</p>
<pre><code>> db.greetings.createIndex({lang: +1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
</code></pre>
<p>Here, we create an ascending index on the <code>lang</code> key. Descending order is done with <code>-1</code>. Now a query for <code>lang</code> in our three documents will be fast… well, maybe iterating over three documents was already faster than using an index.</p>
<h2 id="delete"><a class="anchor" href="#delete">¶</a>Delete</h2>
<h3 id="delete_a_document"><a class="anchor" href="#delete_a_document">¶</a>Delete a document</h3>
<p>I have to confess, I can’t speak French. I learnt it long ago and it’s long forgotten, so let’s remove the translation I copied online from our greetings with <a href="https://docs.mongodb.com/manual/reference/method/db.collection.remove/index.html"><code>remove</code></a>.</p>
<pre><code>> db.greetings.remove({lang: "fr"})
WriteResult({ "nRemoved" : 1 })
</code></pre>
<h3 id="delete_a_collection"><a class="anchor" href="#delete_a_collection">¶</a>Delete a collection</h3>
<p>We never really used the <code>goodbyes</code> collection. Can we get rid of that?</p>
<pre><code>> db.goodbyes.drop()
true
</code></pre>
<p>Yes, it is <code>true</code> that we can <a href="https://docs.mongodb.com/manual/reference/method/db.collection.drop/index.html"><code>drop</code></a> it.</p>
<h3 id="delete_a_database"><a class="anchor" href="#delete_a_database">¶</a>Delete a database</h3>
<p>Now, I will be honest, I don’t really like our <code>greetings</code> database either. It stinks. Let’s get rid of it as well:</p>
<pre><code>> db.dropDatabase()
{ "dropped" : "helloworld", "ok" : 1 }
</code></pre>
<p>Yeah, take that! The <a href="https://docs.mongodb.com/manual/reference/method/db.dropDatabase/"><code>dropDatabase</code></a> method can be used to drop entire databases.</p>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<p>The examples in this post are all fictional, and the methods used were taken from Classmate’s post and, of course, <a href="https://docs.mongodb.com/manual/reference/method/">Mongo’s documentation</a>.</p>
</main>
</body>
</html>
Introduction to Hadoop and its MapReducedist/introduction-to-hadoop-and-its-mapreduce/index.html2020-04-02T22:00:00+00:002020-03-31T22:00:00+00:00Hadoop is an open-source, free, Java-based programming framework that helps processing large datasets in a distributed environment and the problems that arise when trying to harness the knowledge from BigData, capable of running on thousands of nodes and dealing with petabytes of data. It is based on Google File System (GFS) and originated from the work on the Nutch open-source project on search engines.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Introduction to Hadoop and its MapReduce</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>Hadoop is a free, open-source, Java-based programming framework that helps process large datasets in a distributed environment, tackling the problems that arise when trying to harness knowledge from Big Data. It is capable of running on thousands of nodes and dealing with petabytes of data. It is based on the Google File System (GFS) and originated from the work on Nutch, an open-source search engine project.</p>
<div class="date-created-modified">Created 2020-04-01<br>
Modified 2020-04-03</div>
<p>Hadoop also offers a distributed filesystem (HDFS) enabling for fast transfer among nodes, and a way to program with MapReduce.</p>
<p>It aims to address the 4 V’s: Volume, Variety, Veracity and Velocity. For veracity, it provides a secure environment that can be trusted.</p>
<h2 class="title" id="milestones"><a class="anchor" href="#milestones">¶</a>Milestones</h2>
<p>The creators of Hadoop are Doug Cutting and Mike Cafarella, who just wanted to design a search engine, Nutch, and quickly ran into the problems of dealing with large amounts of data. They found their solution in the papers Google published.</p>
<p>The name comes from the plush toy of Cutting’s child, a yellow elephant.</p>
<ul>
<li>In July 2005, Nutch used GFS to perform MapReduce operations.</li>
<li>In February 2006, Nutch started a Lucene subproject which led to Hadoop.</li>
<li>In April 2007, Yahoo used Hadoop in a 1 000-node cluster.</li>
<li>In January 2008, Apache took over and made Hadoop a top-level project.</li>
<li>In July 2008, Apache tested a 4000-node cluster. The performance was the fastest compared to other technologies that year.</li>
<li>In May 2009, Hadoop sorted a petabyte of data in 17 hours.</li>
<li>In December 2011, Hadoop reached 1.0.</li>
<li>In May 2012, Hadoop 2.0 was released with the addition of YARN (Yet Another Resource Negotiator) on top of HDFS, splitting MapReduce and other processes into separate components, greatly improving the fault tolerance.</li>
</ul>
<p>From here onwards, many other alternatives have been born around the Hadoop ecosystem, like Spark, Hive & Drill, Kafka, and HBase.</p>
<p>As of 2017, Amazon has clusters of between 1 and 100 nodes, Yahoo has over 100 000 CPUs running Hadoop, AOL has clusters of 50 machines, and Facebook has a 320-machine cluster (2 560 cores) with 1.3 PB of raw storage.</p>
<h2 id="why_not_use_rdbms_"><a class="anchor" href="#why_not_use_rdbms_">¶</a>Why not use RDBMS?</h2>
<p>Relational database management systems simply cannot scale horizontally, and vertical scaling requires very expensive servers. Similar to RDBMS, Hadoop has a notion of jobs (analogous to transactions), but without ACID or concurrency control. Hadoop supports any form of data (unstructured or semi-structured) in read-only mode, and while failures are common, it offers simple yet efficient fault tolerance.</p>
<p>So what problems does Hadoop solve? It changes the way we should think about problems and how to distribute them, which is key to doing anything related to Big Data nowadays. We start working with clusters of nodes and coordinating the jobs between them, and Hadoop’s API makes this really easy.</p>
<p>Hadoop also takes the loss of data very seriously through replication: if a node fails, its blocks are moved to a different node.</p>
<h2 id="major_components"><a class="anchor" href="#major_components">¶</a>Major components</h2>
<p>The previously-mentioned HDFS runs on commodity machines, which are cost-friendly. It is very fault-tolerant and efficient enough to process huge amounts of data, because it splits large files into smaller chunks (or blocks) that can be handled more easily. Multiple nodes can work on multiple chunks at the same time.</p>
<p>The NameNode stores the metadata of the various data blocks (the map of blocks) along with their location. It is the brain and the master in Hadoop’s master-slave architecture, also known as the namespace, and it makes use of the DataNodes.</p>
<p>A secondary NameNode is a replica that can be used if the first NameNode dies, so that Hadoop doesn’t shut down and can restart.</p>
<p>DataNodes store the blocks of data and are the slaves in the architecture. This data is split into one or more files, and their only job is to manage access to the data. They are often distributed among racks to avoid data loss.</p>
<p>The JobTracker creates and schedules jobs from the clients for either map or reduce operations.</p>
<p>The TaskTracker runs MapReduce tasks assigned to the current data node.</p>
<p>When clients need data, they first interact with the NameNode, which replies with the location of the data in the correct DataNode. The client then proceeds to interact with that DataNode directly.</p>
<h2 id="mapreduce"><a class="anchor" href="#mapreduce">¶</a>MapReduce</h2>
<p>MapReduce, as the name implies, is split into two steps: the map and the reduce. The map stage is the «divide and conquer» strategy, while the reduce part is about combining and reducing the results.</p>
<p>The mapper has to process the input data (normally a file or directory), commonly line-by-line, and produce one or more outputs. The reducer uses all the results from the mapper as its input to produce a new output file itself.</p>
<p><img src="bitmap.png" alt="" /></p>
<p>When reading the data, some of it may be junk that we can choose to ignore. If it is valid data, however, we label it with a particular type that can be useful for the upcoming process. Hadoop is responsible for splitting the data across the many nodes available to execute this process in parallel.</p>
<p>There is another part to MapReduce, known as Shuffle-and-Sort. In this part, types or categories from one node get moved to a different node. This happens with all nodes, so that every node can work on a complete category. These categories are known as «keys», and they are what allows Hadoop to scale linearly.</p>
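<p>As a tiny, made-up illustration of the whole pipeline on a word count job:</p>
<pre><code>input:   "the cat sat on the mat"
map:     (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
shuffle: cat→[1]  mat→[1]  on→[1]  sat→[1]  the→[1,1]
reduce:  (cat,1) (mat,1) (on,1) (sat,1) (the,2)
</code></pre>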
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://youtu.be/oT7kczq5A-0">YouTube – Hadoop Tutorial For Beginners | What Is Hadoop? | Hadoop Tutorial | Hadoop Training | Simplilearn</a></li>
<li><a href="https://youtu.be/bcjSe0xCHbE">YouTube – Learn MapReduce with Playing Cards</a></li>
<li><a href="https://youtu.be/j8ehT1_G5AY?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">YouTube – Video Post #2: Hadoop para torpes (I)-¿Qué es y para qué sirve?</a></li>
<li><a href="https://youtu.be/NQ8mjVPCDvk?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">Video Post #3: Hadoop para torpes (II)-¿Cómo funciona? HDFS y MapReduce</a></li>
<li><a href="https://hadoop.apache.org/old/releases.html">Apache Hadoop Releases</a></li>
<li><a href="https://youtu.be/20qWx2KYqYg?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">Video Post #4: Hadoop para torpes (III y fin)- Ecosistema y distribuciones</a></li>
<li><a href="http://www.hadoopbook.com/">Chapter 2 – Hadoop: The Definitive Guide, Fourth Edition</a> (<a href="http://grut-computing.com/HadoopBook.pdf">pdf,</a><a href="http://www.hadoopbook.com/code.html">code</a>)</li>
</ul>
</main>
</body>
</html>
Google’s BigTabledist/googles-bigtable/index.html2020-04-02T22:00:00+00:002020-03-31T22:00:00+00:00Let’s talk about BigTable, and why it is what it is. But before we get into that, let’s see some important aspects anybody should consider when dealing with a lot of data (something BigTable does!).<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Google’s BigTable</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>Let’s talk about BigTable, and why it is what it is. But before we get into that, let’s see some important aspects anybody should consider when dealing with a lot of data (something BigTable does!).</p>
<div class="date-created-modified">Created 2020-04-01<br>
Modified 2020-04-03</div>
<h2 class="title" id="the_basics"><a class="anchor" href="#the_basics">¶</a>The basics</h2>
<p>Converting a text document into a different format is often a great way to speed up scanning it in the future, and it allows for efficient searches.</p>
<p>In addition, you generally want to store everything in a single, giant file. This will save a lot of time opening and closing files, because everything is in the same file! One proposal to make this happen is <a href="https://trec.nist.gov/file_help.html">Web TREC</a> (see also the <a href="https://en.wikipedia.org/wiki/Text_Retrieval_Conference">Wikipedia page on TREC</a>), which is basically HTML but with every document properly delimited from the others.</p>
<p>Because we will have a lot of data, it’s often a good idea to compress it. Most text consists of the same words, over and over again. Classic compression techniques such as <code>DEFLATE</code> or <code>LZW</code> do an excellent job here.</p>
<h2 id="so_what_s_bigtable_"><a class="anchor" href="#so_what_s_bigtable_">¶</a>So what’s BigTable?</h2>
<p>Okay, enough of an introduction to the basics on storing data. BigTable is what Google uses to store documents, and it’s a customized approach to save, search and update web pages.</p>
<p>BigTable is a distributed storage system for managing structured data, able to scale to petabytes of data across thousands of commodity servers, with wide applicability, scalability, high performance, and high availability.</p>
<p>In a way, it’s kind of like a database: it shares many implementation strategies with parallel databases and main-memory databases, but of course with a different schema.</p>
<p>It consists of a big table known as the «Root tablet», with pointers to many other «tablets» (or metadata in between). These are stored in a replicated filesystem accessible by all BigTable servers. Any change to a tablet gets logged (said log also gets stored in a replicated filesystem).</p>
<p>If any of the tablet servers gets locked, a different one can take its place, read the log and deal with the problem.</p>
<p>There’s no query language, transactions occur at row-level only. Every read or write in a row is atomic. Each row stores a single web page, and by combining the row and column keys along with a timestamp, it is possible to retrieve a single cell in the row. More formally, it’s a map that looks like this:</p>
<pre><code>fetch(row: string, column: string, time: int64) -> string
</code></pre>
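<p>For instance, a hypothetical call (the row key and timestamp here are made up; row keys are typically reversed URLs) could be:</p>
<pre><code>fetch("com.example.www/index.html", "contents:", 1584748800)
</code></pre>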
<p>A row may have as many columns as it needs, and the column groups are the same for everyone (but the columns themselves may vary), which is important to reduce disk read time.</p>
<p>Rows are split into different tablets based on the row keys, which simplifies determining an appropriate server for them. The keys can be up to 64KB big, although most commonly they range from 10 to 100 bytes.</p>
<h2 id="conclusions"><a class="anchor" href="#conclusions">¶</a>Conclusions</h2>
<p>BigTable is Google’s way to deal with large amounts of data on many of their services, and the ideas behind it are not too complex to understand.</p>
</main>
</body>
</html>
A practical example with Hadoopdist/a-practical-example-with-hadoop/index.html2020-04-02T22:00:00+00:002020-03-31T22:00:00+00:00In our <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>A practical example with Hadoop</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>In our <a href="/blog/ribw/introduction-to-hadoop-and-its-mapreduce/">previous Hadoop post</a>, we learnt what it is, how it originated, and how it works, from a theoretical standpoint. Here we will instead focus on a more practical example with Hadoop.</p>
<div class="date-created-modified">Created 2020-04-01<br>
Modified 2020-04-03</div>
<p>This post will showcase my own implementation of a word counter for any plain text document that you want to analyze.</p>
<h2 class="title" id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
<p>Before running any piece of software, we must first download it. Head over to <a href="http://hadoop.apache.org/releases.html">Apache Hadoop’s releases</a> and download the <a href="https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz">latest binary version</a> at the time of writing (3.2.1).</p>
<p>We will be using the <a href="https://linuxmint.com/">Linux Mint</a> distribution because I love its simplicity, although the process shown here should work just fine on any similar Linux distribution such as <a href="https://ubuntu.com/">Ubuntu</a>.</p>
<p>Once the archive download is complete, extract it with any tool of your choice (graphical or in the terminal) and verify that it runs. Make sure you have a version of Java installed, such as <a href="https://openjdk.java.net/">OpenJDK</a>.</p>
<p>Here are all three steps in the command line:</p>
<pre><code>wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
hadoop-3.2.1/bin/hadoop version
</code></pre>
<h2 id="processing_data"><a class="anchor" href="#processing_data">¶</a>Processing data</h2>
<p>To take advantage of Hadoop, we have to design our code to work in the MapReduce model. Both the map and reduce phase work on key-value pairs as input and output, and both have a programmer-defined function.</p>
<p>We will use Java, because it’s a dependency that we already have anyway, so might as well.</p>
<p>Our map function needs to split each of the lines we receive as input into words, and we will also convert them to lowercase, thus preparing the data for later use (counting words). There won’t be bad records, so we don’t have to worry about that.</p>
<p>Copy or reproduce the following code in a file called <code>WordCountMapper.java</code>, using any text editor of your choice:</p>
<pre><code>import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the offset of the line in the input file, and the value
        // is the line itself. Split the line on non-word characters and emit
        // a (word, 1) pair for every token. Note that the split can produce
        // empty tokens, which also get counted (see the output below).
        for (String word : value.toString().split("\\W")) {
            context.write(new Text(word.toLowerCase()), new IntWritable(1));
        }
    }
}
</code></pre>
<p>Now, let’s create the <code>WordCountReducer.java</code> file. Its job is to reduce the data from multiple values into just one. We do that by summing all the values (our word count so far):</p>
<pre><code>import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // For every word we receive all the 1's emitted by the mappers;
        // summing them yields the total count for that word.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
</code></pre>
<p>Let’s just take a moment to appreciate how absolutely tiny this code is, and it’s Java! Hadoop’s API is really awesome and lets us write such concise code to achieve what we need.</p>
<p>Last, let’s write the <code>main</code> method, or else we won’t be able to run it. In our new file <code>WordCount.java</code>:</p>
<pre><code>import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java WordCount <input path> <output path>");
            System.exit(-1);
        }

        // Configure the job: the mapper and reducer to use, and the types
        // of the output key-value pairs.
        Job job = Job.getInstance();
        job.setJobName("Word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Run the job and block until it finishes.
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
</code></pre>
<p>And compile by including the required <code>.jar</code> dependencies in Java’s classpath with the <code>-cp</code> switch:</p>
<pre><code>javac -cp "hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/mapreduce/*" *.java
</code></pre>
<p>At last, we can run it (also specifying the dependencies in the classpath, this one’s a mouthful). Let’s run it on the same <code>WordCount.java</code> source file we wrote:</p>
<pre><code>java -cp ".:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/mapreduce/*:hadoop-3.2.1/share/hadoop/mapreduce/lib/*:hadoop-3.2.1/share/hadoop/yarn/*:hadoop-3.2.1/share/hadoop/yarn/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*" WordCount WordCount.java results
</code></pre>
<p>Hooray! We should have a new <code>results/</code> folder along with the following files:</p>
<pre><code>$ ls results
part-r-00000 _SUCCESS
$ cat results/part-r-00000
154
0 2
1 3
2 1
addinputpath 1
apache 6
args 4
boolean 1
class 6
count 1
err 1
exception 1
-snip- (output cut for clarity)
</code></pre>
<p>It worked! Now, this example was obviously tiny, but hopefully it’s enough to demonstrate how to get the basics running on real-world data.</p>
</main>
</body>
</html>
How does Google’s Search Engine work?dist/how-does-googles-search-engine-work/index.html2020-03-27T23:00:00+00:002020-03-17T23:00:00+00:00The original implementation was written in C/++ for Linux/Solaris.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>How does Google’s Search Engine work?</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>The original implementation was written in C/++ for Linux/Solaris.</p>
<div class="date-created-modified">Created 2020-03-18<br>
Modified 2020-03-28</div>
<p>There are three major components in the system’s anatomy, which can be thought of as steps to be performed for Google to be what it is today.</p>
<p><img src="image-1024x649.png" alt="" /></p>
<p>But before we talk about the different components, let’s take a look at how they store all of this information.</p>
<h2 class="title" id="data_structures"><a class="anchor" href="#data_structures">¶</a>Data structures</h2>
<p>A «BigFile» is a virtual file addressable by 64 bits.</p>
<p>There exists a repository with the full HTML of every page compressed, along with a document identifier, length and URL.</p>
<table class="">
<tbody>
<tr>
<td>
sync
</td>
<td>
length
</td>
<td>
compressed packet
</td>
</tr>
</tbody>
</table>
<p>The Document Index has the document identifier, a pointer into the repository, a checksum and various other statistics.</p>
<table class="">
<tbody>
<tr>
<td>
doc id
</td>
<td>
ecode
</td>
<td>
url len
</td>
<td>
page len
</td>
<td>
url
</td>
<td>
page
</td>
</tr>
</tbody>
</table>
<p>A Lexicon stores the repository of words, implemented with a hashtable over pointers linking to the barrels (sorted linked lists) of the Inverted Index.</p>
<table class="">
<tbody>
<tr>
<td>
word id
</td>
<td>
n docs
</td>
</tr>
<tr>
<td>
word id
</td>
<td>
n docs
</td>
</tr>
</tbody>
</table>
<p>The Hit Lists store occurrences of a word in a document.</p>
<table class="">
<tbody>
<tr>
<td>
<strong>
plain
</strong>
</td>
<td>
cap: 1
</td>
<td>
imp: 3
</td>
<td>
pos: 12
</td>
</tr>
<tr>
<td>
<strong>
fancy
</strong>
</td>
<td>
cap: 1
</td>
<td>
imp: 7
</td>
<td>
type: 4
</td>
<td>
pos: 8
</td>
</tr>
<tr>
<td>
<strong>
anchor
</strong>
</td>
<td>
cap: 1
</td>
<td>
imp: 7
</td>
<td>
type: 4
</td>
<td>
hash: 4
</td>
<td>
pos: 8
</td>
</tr>
</tbody>
</table>
<p>The Forward Index is a barrel with a range of word identifiers (document identifier and list of word identifiers).</p>
<table class="">
<tbody>
<tr>
<td rowspan="3">
doc id
</td>
<td>
word id: 24
</td>
<td>
n hits: 8
</td>
<td>
hit hit hit hit hit hit hit hit
</td>
</tr>
<tr>
<td>
word id: 24
</td>
<td>
n hits: 8
</td>
<td>
hit hit hit hit hit hit hit hit
</td>
</tr>
<tr>
<td>
null word id
</td>
</tr>
</tbody>
</table>
<p>The Inverted Index can be sorted by either document identifier or by ranking of word occurrence.</p>
<table class="">
<tbody>
<tr>
<td>
doc id: 23
</td>
<td>
n hits: 5
</td>
<td>
hit hit hit hit hit
</td>
</tr>
<tr>
<td>
doc id: 23
</td>
<td>
n hits: 3
</td>
<td>
hit hit hit
</td>
</tr>
<tr>
<td>
doc id: 23
</td>
<td>
n hits: 4
</td>
<td>
hit hit hit hit
</td>
</tr>
<tr>
<td>
doc id: 23
</td>
<td>
n hits: 2
</td>
<td>
hit hit
</td>
</tr>
</tbody>
</table>
<p>Back in 1998, Google compressed its repository to 53GB and had 24 million pages. The indices, lexicon, and other temporary storage required about 55GB.</p>
<h2 id="crawling"><a class="anchor" href="#crawling">¶</a>Crawling</h2>
<p>The crawling must be reliable, fast and robust, and also respect the decision of some authors not wanting their pages crawled. Originally, it took a week or more, so simultaneous execution became a must.</p>
<p>Back in 1998, Google had between 3 and 4 crawlers running at 100 web pages per second maximum. These were implemented in Python.</p>
<p>The crawled pages need parsing to deal with typos or formatting issues.</p>
<h2 id="indexing"><a class="anchor" href="#indexing">¶</a>Indexing</h2>
<p>Indexing is about putting the pages into barrels, converting words into word identifiers, and occurrences into hit lists.</p>
<p>Once indexing is done, the barrels are sorted by word identifier, producing the inverted index. This process also had to be done in parallel over many machines, or it would otherwise have been too slow.</p>
<h2 id="searching"><a class="anchor" href="#searching">¶</a>Searching</h2>
<p>We need to find quality results efficiently. Plenty of weights are considered nowadays, but at its heart, PageRank is used. It is the algorithm they use to map the web, which is formally defined as follows:</p>
<p><img src="8e1e61b119e107fcb4bdd7e78f649985.png" alt="" />
<em>PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))</em></p>
<p>Where:</p>
<ul>
<li><code>A</code> is a given page</li>
<li><code>T<sub>n</sub></code> are pages that point to A</li>
<li><code>d</code> is the damping factor in the range <code>[0, 1]</code> (often 0.85)</li>
<li><code>C(A)</code> is the number of links going out of page <code>A</code></li>
<li><code>PR(A)</code> is the page rank of page <code>A</code></li>
</ul>
<p>This formula indicates the probability that a random surfer visits a certain page, and <code>1 - d</code> is used to indicate when it will «get bored» and stop surfing. More intuitively, the page rank of a page will grow as more pages link to it, or when the few that link to it have a high page rank themselves.</p>
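<p>To make the formula more concrete, here is a small Java sketch (purely illustrative, not Google’s implementation) of one PageRank iteration over an in-memory link graph:</p>
<pre><code>import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRank {
    // One iteration of PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)).
    // Assumes every page in the graph has at least one outgoing link; a real
    // implementation must handle dangling pages and repeat until convergence.
    static Map<String, Double> iterate(Map<String, List<String>> outLinks,
                                       Map<String, Double> ranks, double d) {
        Map<String, Double> next = new HashMap<>();
        for (String page : outLinks.keySet()) {
            next.put(page, 1 - d);
        }
        for (Map.Entry<String, List<String>> entry : outLinks.entrySet()) {
            // Each page shares its current rank evenly among its out-links.
            double share = ranks.get(entry.getKey()) / entry.getValue().size();
            for (String target : entry.getValue()) {
                next.merge(target, d * share, Double::sum);
            }
        }
        return next;
    }
}
</code></pre>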
<p>The anchor text in the links also helps provide a better description and helps indexing for even better results.</p>
<p>While searching, the main concern is disk I/O, which takes up most of the time. Caching is very important and can improve performance up to 30 times.</p>
<p>Now, in order to turn user queries into something we can search, we must parse the query and convert the words into word identifiers.</p>
<h2 id="conclusion"><a class="anchor" href="#conclusion">¶</a>Conclusion</h2>
<p>Google is designed to be an efficient, scalable, high-quality search engine. There are still bottlenecks in CPU, memory, disk speed and network I/O, but major data structures are used to make efficient use of the resources.</p>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf">The anatomy of a large-scale hypertextual Web search engine</a></li>
<li><a href="https://www.site.uottawa.ca/%7Ediana/csi4107/Google_SearchEngine.pdf">The Anatomy of a Large-Scale Hypertextual Web Search Engine (slides)</a></li>
</ul>
</main>
</body>
</html>
Privado: PC-Crawler evaluation 2dist/pc-crawler-evaluation-2/index.html2020-03-27T23:00:00+00:002020-03-15T23:00:00+00:00As the student <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Privado: PC-Crawler evaluation 2</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>As the student <code>a(i)</code> where <code>i = 9</code>, I have been assigned to evaluate students <code>a(i - 1)</code> and <code>a(i - 2)</code>, these being:</p>
<div class="date-created-modified">Created 2020-03-16<br>
Modified 2020-03-28</div>
<ul>
<li>a08: Classmate (username)</li>
<li>a07: Classmate (username)</li>
</ul>
<p>The evaluation is done according to the criteria described in Segunda entrega del PC-Crawler.</p>
<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: A.</strong></p>
<p>This is the evaluation of Crawler – Thesauro.</p>
<p>It’s a well-written post, properly using WordPress code blocks, and they explain the process of improving the code and what it does. Because there are no noticeable issues with the post, they get the highest grading.</p>
<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: B.</strong></p>
<p>This is the evaluation of Actividad 2-Crawler.</p>
<p>They start with an introduction on what they will do.</p>
<p>Next, they show the code they have written, also describing what it does, although they don’t explain <em>why</em> they chose the data structures they used.</p>
<p>The style of the code leaves a lot to be desired, and they should have embedded the code in the post instead of taking screenshots: people who rely on screen readers will not be able to read the code.</p>
<p>I have graded them B and not A for this last reason.</p>
</main>
</body>
</html>
What is ElasticSearch and why should you care?dist/what-is-elasticsearch-and-why-should-you-care/index.html2020-03-26T23:00:00+00:002020-03-17T23:00:00+00:00ElasticSearch is a giant search index with powerful analytics capabilities. It’s like a database and search engine on steroids, really easy and fast to get up and running. One can think of it as your own Google, a search engine with analytics.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>What is ElasticSearch and why should you care?</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>ElasticSearch is a giant search index with powerful analytics capabilities. It’s like a database and search engine on steroids, really easy and fast to get up and running. One can think of it as your own Google, a search engine with analytics.</p>
<div class="date-created-modified">Created 2020-03-18<br>
Modified 2020-03-27</div>
<p>ElasticSearch is rich, stable, performs well, is well maintained, and able to scale to petabytes of any kind of data, whether it’s structured, semi-structured or not at all. It’s cost-effective and can be used to make business decisions.</p>
<p>Or, described in 10 seconds:</p>
<blockquote>
<p>Schema-free, REST & JSON based distributed document store
Open source: Apache License 2.0
Zero configuration</p>
</blockquote>
<p>-- Alex Reelsen</p>
<h2 class="title" id="basic_capabilities"><a class="anchor" href="#basic_capabilities">¶</a>Basic capabilities</h2>
<p>ElasticSearch lets you ask questions about your data, not just make queries. You may think SQL can do this too, but what’s important is making a pipeline of facets and feeding the results from query to query.</p>
<p>Instead of changing your data, you can be flexible with your questions with no need to re-index it every time the questions change.</p>
<p>ElasticSearch is not just for searching full-text data, either. It can search structured data and return more than just the results: it also yields additional data, such as ranking and highlights, and allows for pagination.</p>
<p>It doesn’t take a lot of configuration to get running, either, which can be a good boost on productivity.</p>
<h2 id="how_does_it_work_"><a class="anchor" href="#how_does_it_work_">¶</a>How does it work?</h2>
<p>ElasticSearch depends on Java, and can work in a distributed cluster if you execute multiple instances. Data will be replicated and sharded as needed. The current version at the time of writing is 7.6.1, and it’s being developed fast!</p>
<p>It also has support for plugins, with an ever-growing ecosystem and integrations for many programming languages. Tools are being built around it, too, like Kibana, which helps you visualize your data.</p>
<p>The way you use it is through a JSON API, served over HTTP/S.</p>
<h2 id="how_can_i_use_it_"><a class="anchor" href="#how_can_i_use_it_">¶</a>How can I use it?</h2>
<p><a href="https://www.elastic.co/downloads/">You can try ElasticSearch out for free on Elastic Cloud</a>, however, it can also be <a href="https://www.elastic.co/downloads/elasticsearch">downloaded and ran offline</a>, which is what we’ll do. Download the file corresponding to your operating system, unzip it, and execute the binary. Running it is as simple as that!</p>
<p>Now you can make queries to it over HTTP, with for example <code>curl</code>:</p>
<pre><code>curl -X PUT localhost:9200/orders/order/1 -H 'Content-Type: application/json' -d '
{
  "created_at": "2013/09/05 15:45:10",
  "items": [
    {
      "name": "HD Monitor"
    }
  ],
  "total": 249.95
}'
</code></pre>
<p>This will create a new order with some information, such as when it was created, what items it contains, and the total cost of the order.</p>
<p>You can then query or filter as needed, script it or even create statistics.</p>
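<p>For instance, assuming the order above was indexed, a simple URI search for it could look like this:</p>
<pre><code>curl 'localhost:9200/orders/_search?q=items.name:monitor'
</code></pre>
<p>The response is a JSON document with the matching hits and their relevance scores.</p>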
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://youtu.be/sKnkQSec1U0">YouTube – What is Elasticsearch?</a></li>
<li><a href="https://youtu.be/yWNiRC_hUAw">YouTube – GOTO 2013 • Elasticsearch – Beyond Full-text Search • Alex Reelsen</a></li>
<li><a href="https://www.elastic.co/kibana">Kibana – Your window into the Elastic Stack</a></li>
<li><a href="https://www.elastic.co/guide/index.html">Elastic Stack and Product Documentation</a></li>
</ul>
</main>
</body>
</html>
Privado: NoSQL evaluationdist/nosql-evaluation/index.html2020-03-26T23:00:00+00:002020-03-15T23:00:00+00:00I have decided to evaluate Classmate‘s post and Classmate‘s post, because they review databases I have not seen or used before, and I think it would be interesting to see new ones.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Privado: NoSQL evaluation</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>I have decided to evaluate Classmate‘s post and Classmate‘s post, because they review databases I have not seen or used before, and I think it would be interesting to see new ones.</p>
<div class="date-created-modified">Created 2020-03-16<br>
Modified 2020-03-27</div>
<p>The evaluation is based on the requirements defined by Trabajos en grupo sobre Bases de Datos NoSQL:</p>
<blockquote>
<p><strong>1st entry:</strong> Description of the purpose of the technology and how the NoSQL database works, its characteristics, the corner it occupies in the CAP theorem, where to download it from, and how to install it.</p>
</blockquote>
<p>-- Teacher</p>
<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: A.</strong></p>
<p>The post I have evaluated is BB.DD. NoSQL: Voldemort 1ª Fase.</p>
<p>The post doesn’t start very well, because the first sentence has (emphasis mine):</p>
<blockquote>
<p>In it, we will review what <strong>MongoDB</strong> consists of, its characteristics, and how to install it, among other things.</p>
</blockquote>
<p>-- Classmate</p>
<p>…yet the post is about Voldemort!</p>
<p>The post does detail how it works, its architecture, corner in the CAP theorem, download and installation.</p>
<p>I have graded the post with A because I think it meets all the requirements, even if they slipped a bit in the beginning.</p>
<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: A.</strong></p>
<p>The post I have evaluated is Raven.</p>
<p>They have done a good job describing the project’s goals, corner in the CAP theorem, download, and provide an extensive installation section.</p>
<p>They don’t seem to use some of WordPress’ features, such as lists, but otherwise the post is good and deserves an A grading.</p>
</main>
</body>
</html>
Integrating Apache Tika into our Crawlerdist/integrating-apache-tika-into-our-crawler/index.html2020-03-24T23:00:00+00:002020-03-17T23:00:00+00:00In our last crawler post<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Integrating Apache Tika into our Crawler</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p><a href="/blog/ribw/upgrading-our-baby-crawler/">In our last crawler post</a>, we detailed how our crawler worked, and although it did a fine job, it’s time for some extra upgrading.</p>
<div class="date-created-modified">Created 2020-03-18<br>
Modified 2020-03-25</div>
<h2 class="title" id="what_kind_of_upgrades_"><a class="anchor" href="#what_kind_of_upgrades_">¶</a>What kind of upgrades?</h2>
<p>A small but useful one. We are adding support for file types that contain text but cannot be processed by normal text editors, because they are structured rather than plain text (such as PDF files, Excel or Word documents…).</p>
<p>And for this task, we will make use of the help offered by <a href="https://tika.apache.org/">Tika</a>, our friendly Apache tool.</p>
<h2 id="what_is_tika_"><a class="anchor" href="#what_is_tika_">¶</a>What is Tika?</h2>
<p><a href="https://tika.apache.org/">Tika</a> is a set of libraries offered by <a href="https://en.wikipedia.org/wiki/The_Apache_Software_Foundation">The Apache Software Foundation</a> that we can include in our project in order to extract the text and metadata of files from a <a href="https://tika.apache.org/1.24/formats.html">long list of supported formats</a>.</p>
<h2 id="changes_in_the_code"><a class="anchor" href="#changes_in_the_code">¶</a>Changes in the code</h2>
<p>Not much has changed in the structure of the crawler; we have simply added a new method in <code>Utils</code> that uses the <code>Tika</code> class from the previously mentioned library to process and extract the text of more file types.</p>
<p>Then, we use this text just like we would for a standard text file (checking the thesaurus and adding it to the word map) and voilà! We have just added support for a big range of file types.</p>
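<p>As a minimal sketch of what that helper could look like (the method name is hypothetical; <code>Tika#parseToString</code> is the library’s facade method):</p>
<pre><code>import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class Utils {
    private static final Tika TIKA = new Tika();

    // Detects the file type and extracts its text content, whether it is a
    // PDF, a Word document, a spreadsheet or just plain text.
    public static String extractText(File file) throws IOException, TikaException {
        return TIKA.parseToString(file);
    }
}
</code></pre>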
<h2 id="incorporating_gradle"><a class="anchor" href="#incorporating_gradle">¶</a>Incorporating Gradle</h2>
<p>In order for the previous code to work, we need to make use of external libraries. To make this process easier and because the project is growing, we decided to use <a href="https://gradle.org/">Gradle</a>, a build system that can be used for projects in various programming languages, such as Java.</p>
<p>We followed their <a href="https://guides.gradle.org/building-java-applications/">guide to Building Java Applications</a>, and in a few steps added the required <code>.gradle</code> files. Now we can compile and run the code without having to worry about juggling with Java and external dependencies in a single command:</p>
<pre><code>./gradlew run
</code></pre>
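<p>For reference, pulling Tika in boils down to a single line in the <code>dependencies</code> block of <code>build.gradle</code> (the exact artifact and version are an assumption based on the Tika release current at the time):</p>
<pre><code>dependencies {
    implementation 'org.apache.tika:tika-parsers:1.24'
}
</code></pre>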
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>And here you can download the final result:</p>
<p><em>download removed</em></p>
</main>
</body>
</html>
Cassandra: Basic Operations and Architecturedist/nosql-databases-basic-operations-and-architecture/index.html2020-03-23T23:00:00+00:002020-03-04T23:00:00+00:00This is the second post in the NoSQL Databases series, with a brief description on the basic operations (such as insertion, retrieval, indexing…), and complete execution along with the data model / architecture.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Cassandra: Basic Operations and Architecture</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the second post in the NoSQL Databases series, with a brief description on the basic operations (such as insertion, retrieval, indexing…), and complete execution along with the data model / architecture.</p>
<div class="date-created-modified">Created 2020-03-05<br>
Modified 2020-03-24</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/nosql-databases-an-introduction/">Cassandra: an Introduction</a></li>
<li><a href="/blog/ribw/nosql-databases-basic-operations-and-architecture/">Cassandra: Basic Operations and Architecture</a> (this post)</li>
</ul>
<hr />
<p>Cassandra uses its own query language for managing databases, known as <strong>CQL</strong> (<strong>Cassandra Query Language</strong>). Cassandra stores data in <strong><em>tables</em></strong>, as in relational databases, and these tables are grouped in <strong><em>keyspaces</em></strong>. A keyspace defines a number of options that apply to all the tables it contains. The most used option is the <strong>replication strategy</strong>. It is recommended to have only one keyspace per application.</p>
<p>It is important to mention that <strong>tables and keyspaces</strong> are <strong>case insensitive</strong>, so myTable is equivalent to mytable, but it is possible to <strong>force case sensitivity</strong> using <strong>double-quotes</strong>.</p>
<p>To begin with the basic operations it is necessary to deploy Cassandra:</p>
<ol>
<li>Open a terminal in the root of the Apache Cassandra folder downloaded in the previous post.</li>
<li>Run the command:</li>
</ol>
<pre><code>$ bin/cassandra
</code></pre>
<p>Once Cassandra is deployed, it is time to open a <strong>CQL Shell</strong> in <strong>another terminal</strong> with the command:</p>
<pre><code>$ bin/cqlsh
</code></pre>
<p>It is possible to check that Cassandra is deployed if the CQL Shell prints the following message:</p>
<p><img src="uwqQgQte-cuYb_pePFOuY58re23kngrDKNgL1qz4yOfnBDZkqMIH3fFuCrye.png" alt="" />
<em>CQL Shell</em></p>
<h2 class="title" id="create_insert"><a class="anchor" href="#create_insert">¶</a>Create/Insert</h2>
<h3 id="ddl_data_definition_language_"><a class="anchor" href="#ddl_data_definition_language_">¶</a>DDL (Data Definition Language)</h3>
<h4 id="create_keyspace"><a class="anchor" href="#create_keyspace">¶</a>Create keyspace</h4>
<p>A keyspace is created using a <strong>CREATE KEYSPACE</strong> statement:</p>
<pre><code>CREATE KEYSPACE [ IF NOT EXISTS ] keyspace_name WITH options;
</code></pre>
<p>The supported “<strong>options</strong>” are:</p>
<ul>
<li>“<strong>replication</strong>”: this is <strong>mandatory</strong> and defines the <strong>replication strategy</strong> and the <strong>replication factor</strong> (the number of nodes that will have a copy of the data). Within this option there is a property called “<strong>class</strong>” in which the <strong>replication strategy</strong> is specified (“SimpleStrategy” or “NetworkTopologyStrategy”)</li>
<li>“<strong>durable_writes</strong>”: this is <strong>not mandatory</strong> and makes it possible to use the <strong>commit logs for updates</strong></li>
</ul>
<p>Attempting to create an already existing keyspace will return an error unless the <strong>IF NOT EXISTS</strong> directive is used.</p>
<p>As an example of this statement, we create a keyspace named “test_keyspace” with “SimpleStrategy” as its replication “class” and a “replication_factor” of 3.</p>
<pre><code>CREATE KEYSPACE test_keyspace
    WITH replication = {'class': 'SimpleStrategy',
                        'replication_factor' : 3};
</code></pre>
<p>The <strong>USE</strong> statement allows changing the current <strong>keyspace</strong>. The syntax of this statement is very simple:</p>
<pre><code>USE keyspace_name;
</code></pre>
<p><img src="RDWIG2RwvEevUFQv6TGFtGzRm4_9ERpxPf0feriflaj3alvWw3FEIAr_ZdF1.png" alt="" />
<em>USE statement</em></p>
<p>It is also possible to get the metadata of a keyspace with the <strong>DESCRIBE</strong> statement.</p>
<pre><code>DESCRIBE KEYSPACES | KEYSPACE keyspace_name;
</code></pre>
<h4 id="create_table"><a class="anchor" href="#create_table">¶</a>Create table</h4>
<p>Creating a new table uses the <strong>CREATE TABLE</strong> statement:</p>
<pre><code>CREATE TABLE [ IF NOT EXISTS ] table_name
    '('
        column_definition
        ( ',' column_definition )*
        [ ',' PRIMARY KEY '(' primary_key ')' ]
    ')' [ WITH table_options ];
</code></pre>
<p>With “column_definition” as: column_name cql_type [ STATIC ] [ PRIMARY KEY]; “primary_key” as: partition_key [ ‘,’ clustering_columns ]; and “table_options” as: COMPACT STORAGE [ AND table_options ] or CLUSTERING ORDER BY ‘(‘ clustering_order ‘)’ [ AND table_options ] or “options”.</p>
<p>Attempting to create an already existing table will return an error unless the <strong>IF NOT EXISTS</strong> directive is used.</p>
<p>The <strong>CQL types</strong> are described in the References section.</p>
<p>For example, we are going to create a table called “species_table” in the keyspace “test_keyspace”, in which we will have a “species” text (as PRIMARY KEY), a “common_name” text, a “population” varint, an “average_size” int and a “sex” text. Besides, we are going to add a comment to the table: “Some species records”.</p>
<pre><code>CREATE TABLE species_table (
    species text PRIMARY KEY,
    common_name text,
    population varint,
    average_size int,
    sex text
) WITH comment='Some species records';
</code></pre>
<p>It is also possible to get the metadata of a table with the <strong>DESCRIBE</strong> statement.</p>
<pre><code>DESCRIBE TABLES | TABLE [keyspace_name.]table_name;
</code></pre>
<h3 id="dml_data_manipulation_language_"><a class="anchor" href="#dml_data_manipulation_language_">¶</a>DML (Data Manipulation Language)</h3>
<h4 id="insert_data"><a class="anchor" href="#insert_data">¶</a>Insert data</h4>
<p>Inserting data for a row is done using an <strong>INSERT</strong> statement:</p>
<pre><code>INSERT INTO table_name ( names_values | json_clause )
    [ IF NOT EXISTS ]
    [ USING update_parameter ( AND update_parameter )* ];
</code></pre>
<p>Where “names_values” is: names VALUES tuple_literal; “json_clause” is: JSON string [ DEFAULT ( NULL | UNSET ) ]; and “update_parameter” is usually: TTL.</p>
<p>For example, we are going to use both the VALUES and JSON clauses to insert data into the table “species_table”. In the VALUES clause it is necessary to supply the list of columns, unlike in the JSON clause, where it is optional.</p>
<p>Note: TTL (Time To Live) and Timestamp are mechanisms for expiring data: once the time set has passed, the data expires.</p>
<p>In the VALUES clause we are going to insert a new species called “White monkey”, with an average size of 3; its common name is “Copito de nieve”, population 0 and sex “male”.</p>
<pre><code>INSERT INTO species_table (species, common_name, population, average_size, sex)
VALUES ('White monkey', 'Copito de nieve', 0, 3, 'male');
</code></pre>
<p>In the JSON clause we are going to insert a new species called “Cloned Sheep”, with an average size of 1; its common name is “Dolly the Sheep”, population 0 and sex “female”.</p>
<pre><code>INSERT INTO species_table JSON '{"species": "Cloned Sheep",
"common_name": "Dolly the Sheep",
"average_size":1,
"population":0,
"sex": "female"}';
</code></pre>
<p>Note: all updates for an <strong>INSERT</strong> are applied <strong>atomically</strong> and in <strong>isolation</strong>.</p>
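<p>Since the update_parameter accepts a TTL (in seconds), a hypothetical insert whose row expires after one day would be:</p>
<pre><code>INSERT INTO species_table (species, common_name, population, average_size, sex)
VALUES ('Mayfly', 'Ephemeral mayfly', 100, 1, 'female')
USING TTL 86400;
</code></pre>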
<h2 id="read"><a class="anchor" href="#read">¶</a>Read</h2>
<p>Querying data from a table is done using a <strong>SELECT</strong> statement:</p>
<pre><code>SELECT [ JSON | DISTINCT ] ( select_clause | '*' )
    FROM table_name
    [ WHERE where_clause ]
    [ GROUP BY group_by_clause ]
    [ ORDER BY ordering_clause ]
    [ PER PARTITION LIMIT (integer | bind_marker) ]
    [ LIMIT (integer | bind_marker) ]
    [ ALLOW FILTERING ];
</code></pre>
<p>The <strong>CQL SELECT</strong> statement is very <strong>similar</strong> to the <strong>SQL SELECT</strong> statement, as both allow filtering (<strong>WHERE</strong>), grouping data (<strong>GROUP BY</strong>), ordering the data (<strong>ORDER BY</strong>) and limiting the number of rows (<strong>LIMIT</strong>). Besides, <strong>CQL</strong> offers a <strong>limit per partition</strong> and allows explicit <strong>filtering</strong> of <strong>data</strong>.</p>
<p>Note: as in SQL, it is possible to set aliases on the data with the <strong>AS</strong> statement.</p>
<p>For example, we are going to retrieve all the information about those rows of “species_table” whose “sex” is “male”. ALLOW FILTERING is needed here because “sex” is neither part of the primary key nor indexed.</p>
<pre><code>SELECT * FROM species_table WHERE sex = 'male' ALLOW FILTERING;
</code></pre>
<p><img src="s6GrKIGATvOSD7oGRNScUU5RnLN_-3X1JXvnVi_wDT_hrmPMZdnCdBI8DpIJ.png" alt="" />
<em>SELECT statement</em></p>
<p>Furthermore, we are going to test the SELECT JSON statement. For this, we are going to retrieve only the name of the species with a population of 0.</p>
<pre><code>SELECT JSON species FROM species_table WHERE population = 0 ALLOW FILTERING;
</code></pre>
<p><img src="Up_eHlqKQp2RI5XIbgPOvj1B5J3gLxz7v7EI0NDRgezQTipecdfDT6AQoso0.png" alt="" />
<em>SELECT JSON statement</em></p>
<h2 id="update"><a class="anchor" href="#update">¶</a>Update</h2>
<h3 id="ddl_data_definition_language__2"><a class="anchor" href="#ddl_data_definition_language__2">¶</a>DDL (Data Definition Language)</h3>
<h4 id="alter_keyspace"><a class="anchor" href="#alter_keyspace">¶</a>Alter keyspace</h4>
<p>The <strong>ALTER KEYSPACE</strong> statement allows modifying the options of a keyspace:</p>
<pre><code>ALTER KEYSPACE keyspace_name WITH options;
</code></pre>
<p>Note: the supported <strong>options</strong> are the same as for creating a keyspace, “<strong>replication</strong>” and “<strong>durable_writes</strong>”.</p>
<p>As an example, we modify the keyspace named “test_keyspace” to set a “replication_factor” of 4.</p>
<pre><code>ALTER KEYSPACE test_keyspace
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 4};
</code></pre>
<h4 id="alter_table"><a class="anchor" href="#alter_table">¶</a>Alter table</h4>
<p>Altering an existing table uses the <strong>ALTER TABLE</strong> statement:</p>
<pre><code>ALTER TABLE table_name alter_table_instruction;
</code></pre>
<p>Where “alter_table_instruction” can be: ADD column_name cql_type ( ‘,’ column_name cql_type )*; or DROP column_name ( column_name )*; or WITH options.</p>
<p>As an example, we ADD a new column called “extinct”, of type “boolean”, to the table “species_table”.</p>
<pre><code>ALTER TABLE species_table ADD extinct boolean;
</code></pre>
<p>Another example is to DROP the column called “sex” from the table “species_table”.</p>
<pre><code>ALTER TABLE species_table DROP sex;
</code></pre>
<p>Finally, we alter the comment with the WITH clause and set it to “All species records”.</p>
<pre><code>ALTER TABLE species_table WITH comment='All species records';
</code></pre>
<p>These changes can be checked with the <strong>DESCRIBE</strong> statement:</p>
<pre><code>DESCRIBE TABLE species_table;
</code></pre>
<p><img src="xebKPqkWkn97YVHpRVXZYWvRUfeRUyCH-vPDs67aFaEeU53YTRbDOFscOlAr.png" alt="" />
<em>DESCRIBE table</em></p>
<h3 id="dml_data_manipulation_language__2"><a class="anchor" href="#dml_data_manipulation_language__2">¶</a>DML (Data Manipulation Language)</h3>
<h4 id="update_data"><a class="anchor" href="#update_data">¶</a>Update data</h4>
<p>Updating a row is done using an <strong>UPDATE</strong> statement:</p>
<pre><code>UPDATE table_name
    [ USING update_parameter ( AND update_parameter )* ]
    SET assignment ( ',' assignment )*
    WHERE where_clause
    [ IF ( EXISTS | condition ( AND condition )*) ];
</code></pre>
<p>Where the update_parameter is: ( TIMESTAMP | TTL) (integer | bind_marker)</p>
<p>It is important to mention that the <strong>WHERE</strong> clause is used to select the row to update and <strong>must include all columns</strong> composing the <strong>PRIMARY KEY</strong>.</p>
<p>We are going to test this statement by updating the column “extinct” to true in the row whose species is ‘White monkey’.</p>
<pre><code>UPDATE species_table SET extinct = true WHERE species='White monkey';
</code></pre>
<p><img src="IcaCe6VEC5c0ZQIygz-CiclzFyt491u7xPMg2muJLR8grmqaiUzkoQsVCoHf.png" alt="" />
<em>SELECT statement</em></p>
<h2 id="delete"><a class="anchor" href="#delete">¶</a>Delete</h2>
<h3 id="ddl_data_definition_language__3"><a class="anchor" href="#ddl_data_definition_language__3">¶</a>DDL (Data Definition Language)</h3>
<h4 id="drop_keyspace"><a class="anchor" href="#drop_keyspace">¶</a>Drop keyspace</h4>
<p>Dropping a keyspace can be done using the <strong>DROP KEYSPACE</strong> statement:</p>
<pre><code>DROP KEYSPACE [ IF EXISTS ] keyspace_name;
</code></pre>
<p>For example, drop the keyspace called “test_keyspace_2” if it exists:</p>
<pre><code>DROP KEYSPACE IF EXISTS test_keyspace_2;
</code></pre>
<p>As this keyspace does not exist, this statement will do nothing.</p>
<h4 id="drop_table"><a class="anchor" href="#drop_table">¶</a>Drop table</h4>
<p>Dropping a table uses the <strong>DROP TABLE</strong> statement:</p>
<pre><code>DROP TABLE [ IF EXISTS ] table_name;
</code></pre>
<p>For example, drop the table called “species_2” if it exists: </p>
<pre><code>DROP TABLE IF EXISTS species_2;
</code></pre>
<p>As this table does not exist, this statement will do nothing.</p>
<h4 id="truncate_table_"><a class="anchor" href="#truncate_table_">¶</a>Truncate (table)</h4>
<p>A table can be truncated using the <strong>TRUNCATE</strong> statement:</p>
<pre><code>TRUNCATE [ TABLE ] table_name;
</code></pre>
<p>Do not execute this command now, because if you do, you will need to insert the previous data again.</p>
<p>Note: as tables are the only object that can be truncated, the TABLE keyword can be omitted.</p>
<p><img src="FOkhfpxlWFQCzcdfeWxLTy7wx5inDv0xwVeVhE79Pqtk3yYzWsZJnz_SBhUi.png" alt="" />
<em>TRUNCATE statement</em></p>
<h3 id="dml_data_manipulation_language__3"><a class="anchor" href="#dml_data_manipulation_language__3">¶</a>DML (Data Manipulation Language)</h3>
<h4 id="delete_data"><a class="anchor" href="#delete_data">¶</a>Delete data</h4>
<p>Deleting rows or parts of rows uses the <strong>DELETE</strong> statement:</p>
<pre><code>DELETE [ simple_selection ( ',' simple_selection ) ]
    FROM table_name
    [ USING update_parameter ( AND update_parameter )* ]
    WHERE where_clause
    [ IF ( EXISTS | condition ( AND condition )*) ]
</code></pre>
<p>Now we are going to delete the value of the column “average_size” from “Cloned Sheep”.</p>
<pre><code>DELETE average_size FROM species_table WHERE species = 'Cloned Sheep';
</code></pre>
<p><img src="CyuQokVL5J9TAelq-WEWhNl6kFtbIYs0R1AeU5NX4EkG-YQI81mNHdnf2yWN.png" alt="" />
<em>DELETE value statement</em></p>
<p>And we are going to delete the same row as mentioned before.</p>
<pre><code>DELETE FROM species_table WHERE species = 'Cloned Sheep';
</code></pre>
<p><img src="jvQ5cXJ5GTVQ6giVhBEpPJmrJw-zwKKyB9nsTm5PRcGSTzkmh-WO4kTeuLpB.png" alt="" />
<em>DELETE row statement</em></p>
<h2 id="batch"><a class="anchor" href="#batch">¶</a>Batch</h2>
<p>Multiple <strong>INSERT</strong>, <strong>UPDATE</strong> and <strong>DELETE</strong> statements can be executed in a <strong>single statement</strong> by grouping them through a <strong>BATCH</strong> statement.</p>
<pre><code>BEGIN [ UNLOGGED | COUNTER ] BATCH
    [ USING update_parameter ( AND update_parameter )* ]
    modification_statement ( ';' modification_statement )*
APPLY BATCH;
</code></pre>
<p>Where modification_statement can be an insert_statement, an update_statement or a delete_statement.</p>
<ul>
<li><strong>UNLOGGED</strong> means the batch skips the batch log, giving up the guarantee that either all operations in the batch eventually complete or none will.</li>
<li><strong>COUNTER</strong> means that the updates are not idempotent, so each time we execute the updates in a batch we will get different results.</li>
</ul>
<p>For example:</p>
<pre><code>BEGIN BATCH
    INSERT INTO species_table (species, common_name, population, average_size, extinct)
    VALUES ('Blue Shark', 'Tiburillo', 30, 10, false);
    INSERT INTO species_table (species, common_name, population, average_size, extinct)
    VALUES ('Cloned sheep', 'Dolly the Sheep', 1, 1, true);
    UPDATE species_table SET population = 2 WHERE species='Cloned sheep';
    DELETE FROM species_table WHERE species = 'White monkey';
APPLY BATCH;
</code></pre>
<p><img src="EL9Dac26o0FqkVoeAKmopEKQe0wWq-xYI14b9RzGxtUkFJA3i2eTiR6qkuuJ.png" alt="" />
<em>BATCH statement</em></p>
<h2 id="index"><a class="anchor" href="#index">¶</a>Index</h2>
<p>CQL supports creating secondary indexes on tables, allowing queries on the table to use those indexes.</p>
<p><strong>Creating</strong> a secondary index on a table uses the <strong>CREATE INDEX</strong> statement:</p>
<pre><code>CREATE [ CUSTOM ] INDEX [ IF NOT EXISTS ] [ index_name ]
    ON table_name '(' index_identifier ')'
    [ USING string [ WITH OPTIONS = map_literal ] ];
</code></pre>
<p>For example, we are going to create an index called “population_idx” on the column “population” of the table “species_table”.</p>
<pre><code>CREATE INDEX population_idx ON species_table (population);
</code></pre>
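<p>With this index in place, equality queries on “population” no longer require ALLOW FILTERING:</p>
<pre><code>SELECT * FROM species_table WHERE population = 2;
</code></pre>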
<p><strong>Dropping</strong> a secondary index uses the <strong>DROP INDEX</strong> statement:</p>
<pre><code>DROP INDEX [ IF EXISTS ] index_name;
</code></pre>
<p>Now, we are going to drop the previous index:</p>
<pre><code>DROP INDEX IF EXISTS population_idx;
</code></pre>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://cassandra.apache.org/doc/latest/cql/ddl.html">Cassandra CQL</a></li>
<li><a href="https://techdifferences.com/difference-between-ddl-and-dml-in-dbms.html">Differences between DML and DDL</a></li>
<li><a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cqlReferenceTOC.html">Datastax CQL</a></li>
<li><a href="https://cassandra.apache.org/doc/latest/cql/types.html#grammar-token-cql-type">Cassandra CQL Types</a></li>
<li><a href="https://cassandra.apache.org/doc/latest/cql/indexes.html">Cassandra Index</a></li>
</ul>
</main>
</body>
</html>
Upgrading our Baby Crawlerdist/upgrading-our-baby-crawler/index.html2020-03-17T23:00:00+00:002020-03-10T23:00:00+00:00In our <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Upgrading our Baby Crawler</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>In our <a href="/blog/ribw/build-your-own-pc/">last post on this series</a>, we presented the code for our Personal Crawler. However, we didn’t quite explain what a crawler even is! We will use this moment to go a bit more in-depth, and make some upgrades to it.</p>
<div class="date-created-modified">Created 2020-03-11<br>
Modified 2020-03-18</div>
<h2 class="title" id="what_is_a_crawler_"><a class="anchor" href="#what_is_a_crawler_">¶</a>What is a Crawler?</h2>
<p>A crawler is a program whose job is to analyze documents and extract data from them. For example, search engines like <a href="http://duckduckgo.com/">DuckDuckGo</a>, <a href="https://bing.com/">Bing</a> or <a href="http://google.com/">Google</a> all have crawlers to analyze websites and build a database around them. They are some kind of «trackers», because they keep track of everything they find.</p>
<p>Their basic behaviour can be described as follows: given a starting list of URLs, follow them all and identify hyperlinks inside the documents. Add these to the list of links to follow, and repeat <em>ad infinitum</em>.</p>
<ul>
<li>This lets us create an index to quickly search across them all.</li>
<li>We can also identify broken links.</li>
<li>We can gather any other type of information that we find.</li>
</ul>
<p>Our crawler will work offline, within our own computer, scanning the text documents it finds under the root we tell it to scan.</p>
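<p>To make that loop concrete, here is a toy, self-contained Java sketch of the crawl cycle described above. The «web» is an in-memory map and the link syntax is made up; our real crawler works on local files instead:</p>
<pre><code>import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class CrawlLoop {
    // Made-up link syntax so we don't need a real HTML parser
    private static final Pattern LINK = Pattern.compile("link:(\\w+)");
    public static void main(String[] args) {
        // A tiny in-memory "web": page name -> contents
        Map<String, String> web = new HashMap<>();
        web.put("a", "see link:b and link:c");
        web.put("b", "back to link:a");
        web.put("c", "no links here");
        // Start from a list of known pages and follow every new link found
        Queue<String> frontier = new ArrayDeque<>(Collections.singletonList("a"));
        Set<String> seen = new HashSet<>(frontier);
        String page;
        while ((page = frontier.poll()) != null) {
            System.out.println("visiting " + page);
            Matcher matcher = LINK.matcher(web.getOrDefault(page, ""));
            while (matcher.find()) {
                String link = matcher.group(1);
                if (seen.add(link)) { // only enqueue pages we haven't seen yet
                    frontier.add(link);
                }
            }
        }
    }
}
</code></pre>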
<h2 id="design_decissions"><a class="anchor" href="#design_decissions">¶</a>Design Decissions</h2>
<ul>
<li>We will use Java. Its runtime is quite ubiquitous, so it should be able to run in virtually anywhere. The language is typed, which helps catch errors early on.</li>
<li>Our solution is iterative. While recursion can be seen as more elegant by some, iterative solutions are often more performant with less need for optimization.</li>
</ul>
<h2 id="requirements"><a class="anchor" href="#requirements">¶</a>Requirements</h2>
<p>If you don’t have Java installed yet, you can <a href="https://java.com/en/download/">Download Free Java Software</a> from Oracle’s site. To compile the code, the <a href="https://www.oracle.com/java/technologies/javase-jdk8-downloads.html">Java Development Kit</a> is also necessary.</p>
<p>We don’t depend on any other external libraries, for easier deployment and compilation.</p>
<h2 id="implementation"><a class="anchor" href="#implementation">¶</a>Implementation</h2>
<p>Because the code was getting pretty large, it has been split into several files, and we have also upgraded it to use a Graphical User Interface instead! We decided to use Swing, based on the Java tutorial <a href="https://docs.oracle.com/javase/tutorial/uiswing/">Creating a GUI With JFC/Swing</a>.</p>
<h3 id="app"><a class="anchor" href="#app">¶</a>App</h3>
<p>This file is the entry point of our application. Its job is to initialize the components, lay them out in the main panel, and connect the event handlers.</p>
<p>Most widgets are pretty standard, and are defined as class variables. However, some variables are notable. The <a href="https://docs.oracle.com/javase/8/docs/api/javax/swing/table/DefaultTableModel.html"><code>DefaultTableModel</code></a> is used because it allows us to <a href="https://stackoverflow.com/a/22550106">dynamically add rows</a>, and we also have a <a href="https://docs.oracle.com/javase/8/docs/api/javax/swing/SwingWorker.html"><code>SwingWorker</code></a> subclass responsible for performing the word analysis (which is quite CPU-intensive and should not be run in the UI thread!).</p>
<p>There are a few utility methods to ease some common operations, such as <code>updateStatus</code>, which changes the status label in the main window, informing the user of the latest changes.</p>
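<p>As a rough illustration of that pattern (the names here are ours, not the actual project’s), a minimal <code>SwingWorker</code> that runs a slow task off the UI thread and updates a status label when it finishes could look like this:</p>
<pre><code>import javax.swing.*;
class WorkerDemo {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("demo");
            JLabel status = new JLabel("analyzing…");
            frame.add(status);
            frame.setSize(240, 80);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
            new SwingWorker<Integer, Void>() {
                @Override
                protected Integer doInBackground() throws Exception {
                    Thread.sleep(1000); // stand-in for the CPU-intensive word analysis
                    return 42;
                }
                @Override
                protected void done() {
                    try {
                        status.setText("done: " + get() + " words");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }.execute();
        });
    }
}
</code></pre>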
<h3 id="thesaurus"><a class="anchor" href="#thesaurus">¶</a>Thesaurus</h3>
<p>A thesaurus is a collection of words or terms used to represent concepts. In literature this is commonly known as a dictionary.</p>
<p>For this project, we use a thesaurus based on how relevant a word is to the meaning of a sentence, filtering out those that barely give us any information.</p>
<p>This file contains a simple thesaurus implementation, which can trivially be used as a normal or inverted thesaurus. However, we only treat it as inverted, and its job is loading itself and determining if words are valid or should otherwise be ignored.</p>
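<p>A minimal sketch of such an inverted thesaurus (essentially a stop-word list; the word list below is made up) could look like this:</p>
<pre><code>import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
class StopWords {
    // Words in this set carry little meaning and are filtered out
    private final Set<String> stopWords;
    StopWords(String... words) {
        this.stopWords = new HashSet<>(Arrays.asList(words));
    }
    boolean isValid(String word) {
        return !stopWords.contains(word.toLowerCase());
    }
    public static void main(String[] args) {
        StopWords thesaurus = new StopWords("the", "a", "of");
        System.out.println(thesaurus.isValid("crawler")); // true
        System.out.println(thesaurus.isValid("the"));     // false
    }
}
</code></pre>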
<h3 id="utils"><a class="anchor" href="#utils">¶</a>Utils</h3>
<p>Several utility functions used across the codebase.</p>
<h3 id="wordmap"><a class="anchor" href="#wordmap">¶</a>WordMap</h3>
<p>This file is the important one, and its implementation hasn’t changed much since our last post. Instances of a word map contain… wait for it… a map of words! It stores the mapping <code>word → count</code> in memory, and offers methods to query the count of a word or iterate over the word count entries.</p>
<p>It can be loaded from cache or told to analyze a root path. Once an instance is created, additional files could be analyzed one by one if desired.</p>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>The code was getting a bit too large to embed within the blog post itself, so instead you can download it as a <code>.zip</code> file.</p>
<p><em>download removed</em></p>
</main>
</body>
</html>
Cassandra: an Introductiondist/cassandra-an-introduction/index.html2020-03-17T23:00:00+00:002020-03-04T23:00:00+00:00This is the first post in the Cassandra series, where we will introduce the Cassandra database system and take a look at its features and installation methods.<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Cassandra: an Introduction</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This is the first post in the Cassandra series, where we will introduce the Cassandra database system and take a look at its features and installation methods.</p>
<div class="date-created-modified">Created 2020-03-05<br>
Modified 2020-03-18</div>
<p>Other posts in this series:</p>
<ul>
<li><a href="/blog/ribw/cassandra-an-introduction/">Cassandra: an Introduction</a> (this post)</li>
</ul>
<p>This post is co-authored with Classmate.</p>
<hr />
<div class="image-container">
<img src="cassandra-database-e1584191543401.jpg" alt="NoSQL database – Apache Cassandra – First delivery" />
<div class="image-caption"></div>
</div>
<h2 class="title" id="purpose_of_technology"><a class="anchor" href="#purpose_of_technology">¶</a>Purpose of technology</h2>
<p>Apache Cassandra is a <strong>NoSQL</strong>, <strong>open-source</strong>, <strong>distributed “key-value” database</strong>. It handles <strong>large volumes of distributed data</strong>. The main <strong>goal</strong> is to provide <strong>linear scalability and availability without compromising performance</strong>. Besides, Cassandra <strong>supports replication</strong> across multiple datacenters, providing low latency.</p>
<h2 id="how_it_works"><a class="anchor" href="#how_it_works">¶</a>How it works</h2>
<p>Cassandra’s distributed <strong>architecture</strong> is based on a series of <strong>equal nodes</strong> that communicate with a <strong>P2P protocol</strong> so that <strong>redundancy is maximal</strong>. It offers robust support for multiple datacenters, with <strong>asynchronous replication</strong> and no need for a master server.</p>
<p>Besides, Cassandra’s <strong>data model consists of partitioning the rows</strong>, which are rearranged into <strong>different tables</strong>. The primary keys of each table have a first component that is the <strong>partition key</strong>. Within a partition, the rows are grouped by the remaining columns of the key. The other columns can be indexed separately from the primary key.</p>
<p>These tables can be <strong>created, deleted, updated and queried at runtime without blocking</strong> each other. However, it does <strong>not support joins or subqueries</strong>; instead, it <strong>emphasizes denormalization</strong> through features like collections.</p>
<p>Nowadays, Cassandra uses its own query language called <strong>CQL</strong> (<strong>Cassandra Query Language</strong>), with a <strong>similar syntax to SQL</strong>. It also allows access from <strong>JDBC</strong>.</p>
<p><img src="s0GHpggGZXOFcdhypRWV4trU-PkSI6lukEv54pLZnoirh0GlDVAc4LamB1Dy.png" alt="" />
<em>Cassandra architecture</em></p>
<h2 id="features"><a class="anchor" href="#features">¶</a>Features</h2>
<ul>
<li><strong>Decentralized</strong>: there are <strong>no single points of failure</strong>; every <strong>node</strong> in the cluster has the <strong>same role</strong> and there is <strong>no master node</strong>, so each node <strong>can service any request</strong>, and the data is distributed across the cluster.</li>
<li>Supports <strong>replication</strong>, including <strong>multi-datacenter</strong> replication: the replication strategies are <strong>configurable</strong>.</li>
<li><strong>Scalability</strong>: read and write performance increases linearly as new nodes are added, and <strong>new nodes</strong> can be <strong>added without interrupting</strong> application <strong>execution</strong>.</li>
<li><strong>Fault tolerance</strong>: <strong>data replication</strong> is done <strong>automatically</strong> on several nodes in order to recover from failures. It is possible to <strong>replace failed nodes without downtime or interruptions</strong> to the application.</li>
<li><strong>Consistency</strong>: a choice of consistency level is provided for <strong>reading and writing</strong>.</li>
<li><strong>MapReduce support</strong>: it is <strong>integrated</strong> with <strong>Apache Hadoop</strong> to support MapReduce.</li>
<li><strong>Query language</strong>: it has its own query language called <strong>CQL (Cassandra Query Language)</strong>.</li>
</ul>
<h2 id="corner_in_cap_theorem"><a class="anchor" href="#corner_in_cap_theorem">¶</a>Corner in CAP theorem</h2>
<p><strong>Apache Cassandra</strong> is usually described as an “<strong>AP</strong>” system because it guarantees <strong>availability</strong> and <strong>partition/fault tolerance</strong>, erring on the side of data availability even if this means <strong>sacrificing consistency</strong>. Despite this, Apache Cassandra <strong>seeks to satisfy all three requirements</strong> (Consistency, Availability and Partition tolerance) simultaneously, and can be <strong>configured to behave</strong> like a “<strong>CP</strong>” database, guaranteeing <strong>consistency and partition/fault tolerance</strong>.</p>
<p><img src="rf3n9LTOKCQVbx4qrn7NPSVcRcwE1LxR_khi-9Qc51Hcbg6BHHPu-0GZjUwD.png" alt="" />
<em>Cassandra in CAP Theorem</em></p>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>To download the file (a <code>.tar.gz</code> archive), visit the <a href="https://cassandra.apache.org/download/">download site</a> and click on the file “<a href="https://ftp.cixug.es/apache/cassandra/3.11.6/apache-cassandra-3.11.6-bin.tar.gz">https://ftp.cixug.es/apache/cassandra/3.11.6/apache-cassandra-3.11.6-bin.tar.gz</a>”. Note that this link points to version 3.11.6.</p>
<h2 id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
<p>This database can only be installed on Linux distributions and Mac OS X systems; it is not possible to install it on Microsoft Windows.</p>
<p>The first requirement is having Java 8 installed on <strong>Ubuntu</strong>, the OS we will use. The Java 8 installation is explained below. First, open a terminal and execute the following commands:</p>
<pre><code>sudo apt update
sudo apt install openjdk-8-jdk openjdk-8-jre
</code></pre>
<p>To make Java available as an environment variable, open the file “~/.bashrc”:</p>
<pre><code>nano ~/.bashrc
</code></pre>
<p>And add, at the end of it, the path where Java is installed, as follows:</p>
<pre><code>export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export PATH=$PATH:$JAVA_HOME/bin
</code></pre>
<p>At this point, save the file and execute the next command (re-opening the terminal has the same effect):</p>
<pre><code>source ~/.bashrc
</code></pre>
<p>In order to check if the Java environment variable is set correctly, run the next command: </p>
<pre><code>echo $JAVA_HOME
</code></pre>
<p><img src="JUUmX5MIHynJR_K9EdCgKeJcpINeCGRRt2QRu4JLPtRhCVidOhcbWwVTQjyu.png" alt="" />
<em>$JAVA_HOME variable</em></p>
<p>Afterwards, it is possible to check the installed Java version with the command: </p>
<pre><code>java -version
</code></pre>
<p><img src="z9v1-0hpZwjI4U5UZej9cRGN5-Y4AZl0WUPWyQ_-JlzTAIvZtTFPnKY2xMQ_.png" alt="" />
<em>Java version</em></p>
<p>The next requirement is the latest version of Python 2.7. Whether it is installed can be checked with the command:</p>
<pre><code>python --version
</code></pre>
<p>If it is not installed, installing it is as simple as running the next command in the terminal:</p>
<pre><code>sudo apt install python
</code></pre>
<p>Note: it is better to use “python2” instead of “python”, because that way you force the use of Python 2.7. Modern distributions use Python 3 for the «python» command.</p>
<p>Afterwards, it is possible to check the installed Python version with the command:</p>
<pre><code>python --version
</code></pre>
<p><img src="Ger5Vw_e1HIK84QgRub-BwGmzIGKasgiYb4jHdfRNRrvG4d6Msp_3Vk62-9i.png" alt="" />
<em>Python version</em></p>
<p>Once both requirements are ready, the next step is to unzip the previously downloaded file: right-click on it and select “Extract here”, or run the next command in the directory containing the download.</p>
<pre><code>tar -zxvf apache-cassandra-x.x.x-bin.tar.gz
</code></pre>
<p>In order to check that the installation is complete, you can execute the next command from the root folder of the extracted project. This will start Cassandra as a single node.</p>
<pre><code>bin/cassandra
</code></pre>
<p>It is possible to get some data from Cassandra with CQL (Cassandra Query Language). To check this, execute the next command in another terminal.</p>
<pre><code>bin/cqlsh localhost
</code></pre>
<p>Once cqlsh is open, type the next statement and check the result:</p>
<pre><code>SELECT cluster_name, listen_address from system.local;
</code></pre>
<p>The output should be:</p>
<p><img src="miUO60A-RtyEAOOVFJqlkPRC18H4RKUhot6RWzhO9FmtzgTPOYHFtwxqgZEf.png" alt="" />
<em>Sentence output</em></p>
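<p>As an aside, the same query can also be issued from Java. This is only a sketch, assuming the DataStax Java driver (the <code>cassandra-driver-core</code> 3.x artifact) is on the classpath; it is not part of the installation steps above:</p>
<pre><code>import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
class HelloCassandra {
    public static void main(String[] args) {
        // Connect to the single local node started earlier
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            ResultSet rs = session.execute(
                "SELECT cluster_name, listen_address FROM system.local");
            System.out.println(rs.one()); // prints the single resulting row
        }
    }
}
</code></pre>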
<p>Finally, the <a href="https://cassandra.apache.org/doc/latest/getting_started/installing.html">installation guide</a> provided on the database’s website covers this process as well.</p>
<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
<ul>
<li><a href="https://es.wikipedia.org/wiki/Apache_Cassandra">Wikipedia</a></li>
<li><a href="https://cassandra.apache.org/">Apache Cassandra</a></li>
<li><a href="https://www.datastax.com/blog/2019/05/how-apache-cassandratm-balances-consistency-availability-and-performance">Datastax</a></li>
<li><a href="https://blog.yugabyte.com/apache-cassandra-architecture-how-it-works-lightweight-transactions/">yugabyte</a></li>
</ul>
</main>
</body>
</html>
Privado: PC-Crawler evaluationdist/pc-crawler-evaluation/index.html2020-03-17T23:00:00+00:002020-03-03T23:00:00+00:00As the student <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Privado: PC-Crawler evaluation</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>As the student <code>a(i)</code> where <code>i = 9</code>, I have been assigned to evaluate students <code>a(i + 3)</code> and <code>a(i + 4)</code>, these being:</p>
<div class="date-created-modified">Created 2020-03-04<br>
Modified 2020-03-18</div>
<ul>
<li>a12: Classmate (username)</li>
<li>a13: Classmate (username)</li>
</ul>
<h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: B.</strong></p>
<p>I think they mix up their design considerations with program usage and inner workings a bit, without justifying why the considerations are the ones they chose, or what the alternatives would be.</p>
<p>The implementation notes are quite well-written. Even someone without knowledge of Java’s syntax can read the notes and more or less make sense of what’s going on, with the relevant code excerpts on each section.</p>
<p>Implementation-wise, some methods could definitely use some improvement:</p>
<ul>
<li><code>esExtensionTextual</code> is overly complicated. It could use a <code>for</code> loop and Java’s <code>String.endsWith</code> (see the sketch after this list).</li>
<li><code>calcularFrecuencia</code> has quite some duplication (e.g. <code>this.getFicherosYDirectorios().remove(0)</code>) and could definitely be cleaned up.</li>
</ul>
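<p>A sketch of the suggested simplification (the extension list here is made up, since I’m not reproducing their code):</p>
<pre><code>class ExtensionCheck {
    private static final String[] TEXT_EXTENSIONS = { ".txt", ".html", ".java" };
    static boolean esExtensionTextual(String fileName) {
        for (String extension : TEXT_EXTENSIONS) {
            if (fileName.toLowerCase().endsWith(extension)) {
                return true;
            }
        }
        return false;
    }
    public static void main(String[] args) {
        System.out.println(esExtensionTextual("Notas.TXT")); // true
        System.out.println(esExtensionTextual("foto.png"));  // false
    }
}
</code></pre>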
<p>However, all the desired functionality is implemented.</p>
<p>Style-wise, some of the newlines and avoiding braces on <code>if</code> and <code>while</code> could be changed to improve the readability.</p>
<p>The post is written in Spanish, but uses some words that don’t translate well («remover» could better be said as «eliminar» or «quitar»).</p>
<h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s evaluation</h2>
<p><strong>Grading: B.</strong></p>
<p>Their post starts with an explanation of what a crawler is, common uses for them, and what type of crawler they will be developing. This is a very good start. Regarding the post style, it seems they are not properly using some of WordPress’s features, such as lists, and instead rely on paragraphs with special characters prefixing each list item.</p>
<p>The post also contains some details on how to install the requirements to run the program, which can be very useful for someone not used to working with Java.</p>
<p>They do not explain their implementation and the filename of the download has a typo.</p>
<p>Implementation-wise, the code seems to be well-organized, into several packages and files, although the naming is a bit inconsistent. They even designed a GUI, which is quite impressive.</p>
<p>Some of the methods are documented, although the code inside them is sparsely commented, and the rationale for the chosen data structures is missing. There also seem to be several unused <code>main</code> functions, and I’m unsure why they were kept.</p>
<p>However, all the desired functionality is implemented.</p>
<p>Similar to Classmate, the code style could be improved and settled on some standard, as well as making use of Java features such as <code>for</code> loops over iterators instead of manual loops.</p>
</main>
</body>
</html>
Introduction to NoSQLdist/introduction-to-nosql/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00This post will primarly focus on the talk held in the <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Introduction to NoSQL</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This post will primarily focus on the talk held at the <a href="https://youtu.be/qI_g07C_Q5I">GOTO 2012 conference: Introduction to NoSQL by Martin Fowler</a>. It can be seen as an informal, summarized transcript of the talk.</p>
<div class="date-created-modified">Created 2020-02-25<br>
Modified 2020-03-18</div>
<hr />
<p>The relational database model is affected by the <em><a href="https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch">impedance mismatch problem</a></em>. This occurs because we have to match our high-level design with the separate columns and rows used by relational databases.</p>
<p>Taking the in-memory objects and putting them into a relational database (which were dominant at the time) simply didn’t work out. Why? Relational databases were more than just databases; they served as an integration mechanism across applications, up to the 2000s. For 20 years!</p>
<p>With the rise of the Internet and the sheer amount of traffic, databases needed to scale. Unfortunately, relational databases only scale well vertically (by upgrading a <em>single</em> node). This is <em>very</em> expensive, and not something many could afford.</p>
<p>The problem is those pesky <code>JOIN</code>s, and their friend <code>GROUP BY</code>. Because our program and reality models don’t match the tables used by SQL, we have to rely on these operations to query the data, since the model doesn’t map directly.</p>
<p>Furthermore, graphs don’t map very well at all to relational models.</p>
<p>We needed a way to scale horizontally (by increasing the <em>amount</em> of nodes), something relational databases were not designed to do.</p>
<blockquote>
<p><em>We need to do something different, relational across nodes is an unnatural act</em></p>
</blockquote>
<p>This inspired the NoSQL movement.</p>
<blockquote>
<p><em>#nosql was only meant to be a hashtag to advertise it, but unfortunately it’s how it is called now</em></p>
</blockquote>
<p>It is not possible to define NoSQL, but we can identify some of its characteristics:</p>
<ul>
<li>
<p>Non-relational</p>
</li>
<li>
<p><strong>Cluster-friendly</strong> (this was the original spark)</p>
</li>
<li>
<p>Open-source (until now, generally)</p>
</li>
<li>
<p>21st century web culture</p>
</li>
<li>
<p>Schema-less (easier integration or conjugation of several models, structure aggregation)</p>
</li>
</ul>
<p>These databases use data models different from those of the relational model. However, it is possible to identify 4 broad chunks (some may say 3, or even 2!):</p>
<ul>
<li>
<p><strong>Key-value store</strong>. With a certain key, you obtain the value corresponding to it. It knows nothing else, nor does it care. We say the data is opaque.</p>
</li>
<li>
<p><strong>Document-based</strong>. It stores an entire mass of documents with complex structure, normally through the use of JSON (XML has been left behind). Then, you can ask for certain fields, structures, or portions. We say the data is transparent.</p>
</li>
<li>
<p><strong>Column-family</strong>. There is a «row key», and within it we store multiple «column families» (columns that fit together, our aggregate). We access data by row key and column-family name.</p>
</li>
</ul>
<p>All three of these serve to store documents without any <em>explicit</em> schema. Just shove in anything! This gives a lot of flexibility and ease of migration, except… that’s not really true. There’s an <em>implicit</em> schema when querying.</p>
<p>For example, a query where we may do <code>anOrder['price'] * anOrder['quantity']</code> is assuming that <code>anOrder</code> has both a <code>price</code> and a <code>quantity</code>, and that both of these can be multiplied together. «Schema-less» is a fuzzy term.</p>
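<p>A small self-contained sketch of that implicit schema in Java (the order document here is just a stand-in map):</p>
<pre><code>import java.util.HashMap;
import java.util.Map;
class ImplicitSchema {
    public static void main(String[] args) {
        // Stand-in for a document fetched from a document store
        Map<String, Object> anOrder = new HashMap<>();
        anOrder.put("price", 10.0);
        anOrder.put("quantity", 3);
        // This "schema-less" code still assumes both fields exist
        // and are numeric: an implicit schema
        double total = ((Number) anOrder.get("price")).doubleValue()
                     * ((Number) anOrder.get("quantity")).doubleValue();
        System.out.println(total); // 30.0
    }
}
</code></pre>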
<p>However, it is the lack of a <em>fixed</em> schema that gives flexibility.</p>
<p>One could argue that the line between key-value and document-based is very fuzzy, and they would be right! Key-value databases often let you include additional metadata that behaves like an index, and in document-based, documents often have an identifier anyway.</p>
<p>The common notion between these three types is what matters. They save an entire structure as an <em>unit</em>. We can refer to these as «Aggregate Oriented Databases». Aggregate, because we group things when designing or modeling our systems, as opposed to relational databases that scatter the information across many tables.</p>
<p>There exists a notable outlier, though, and that’s:</p>
<ul>
<li><strong>Graph</strong> databases. They use a node-and-arc graph structure. They are great for traversing relationships across things. Ironically, relational databases are not very good at jumping across relationships! It is possible to perform very interesting queries in graph databases which would be really hard and costly on relational models. Unlike the aggregate-oriented databases, graphs break things into even smaller units.</li>
</ul>
<p>NoSQL is not <em>the</em> solution. It depends on how you’ll work with your data. Do you need an aggregate database? Will you have a lot of relationships? Or would the relational model be a good fit for you?</p>
<p>NoSQL, however, is a good fit for large-scale projects (data will <em>always</em> grow) and faster development (the impedance mismatch is drastically reduced).</p>
<p>Regardless of our choice, it is important to remember that NoSQL is a young technology, which is still evolving really fast (SQL has been stable for <em>decades</em>). But the <em>polyglot persistence</em> is what matters. One must know the alternatives, and be able to choose.</p>
<hr />
<p>Relational databases have the well-known ACID properties: Atomicity, Consistency, Isolation and Durability.</p>
<p>NoSQL (except graph-based!) are about being BASE instead: Basically Available, Soft state, Eventual consistency.</p>
<p>SQL needs transactions because we don’t want to perform a read while we’re only half-way done with a write! The readers and writers are the problem, and ensuring consistency results in a performance hit, even if the risk is low (two writers are extremely rare but it still must be handled).</p>
<p>NoSQL on the other hand doesn’t need ACID because the aggregate <em>is</em> the transaction boundary. Even before NoSQL itself existed! Any update is atomic by nature. When updating many documents it <em>is</em> a problem, but this is very rare.</p>
<p>We have to distinguish between logical and replication consistency. During an update, if a conflict occurs, it must be resolved to preserve logical consistency. Replication consistency, on the other hand, is preserved when distributing the data across many machines, for example during sharding or copies.</p>
<p>Replication buys us more processing power and resilience (at the cost of more storage) in case some of the nodes die. But what happens if what dies is the communication across the nodes? We could drop the requests and preserve consistency, or accept the risk, continue, and instead preserve availability.</p>
<p>The choice on whether trading consistency for availability is acceptable or not depends on the domain rules. It is the domain’s choice, the business people will choose. If you’re Amazon, you always want to be able to sell, but if you’re a bank, you probably don’t want your clients to have negative numbers in their account!</p>
<p>Regardless of what we do, in a distributed system the CAP theorem always applies: Consistency, Availability, Partition tolerance. It is <strong>impossible</strong> to guarantee all three at 100%. Most of the time things do work out, but guaranteeing all three at 100% is mathematically impossible.</p>
<p>A database has to choose what to give up at some point. When designing a distributed system, this must be considered. Normally, the choice is made between consistency or response time.</p>
<h2 class="title" id="further_reading"><a class="anchor" href="#further_reading">¶</a>Further reading</h2>
<ul>
<li><a href="https://www.martinfowler.com/articles/nosql-intro-original.pdf">The future is: <del>NoSQL Databases</del> Polyglot Persistence</a></li>
<li><a href="https://www.thoughtworks.com/insights/blog/nosql-databases-overview">NoSQL Databases: An Overview</a></li>
</ul>
</main>
</body>
</html>
Build your own PCdist/build-your-own-pc/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00…where PC obviously stands for Personal Crawler<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Build your own PC</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p><em>…where PC obviously stands for Personal Crawler</em>.</p>
<div class="date-created-modified">Created 2020-02-25<br>
Modified 2020-03-18</div>
<hr />
<p>This post contains the source code for a very simple crawler written in Java. You can compile and run it on any file or directory, and it will calculate the frequency of all the words it finds.</p>
<h2 class="title" id="source_code"><a class="anchor" href="#source_code">¶</a>Source code</h2>
<p>Paste the following code in a new file called <code>Crawl.java</code>:</p>
<pre><code>import java.io.*;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Crawl {
// Regex used to tokenize the words from a line of text
private final static Pattern WORDS = Pattern.compile("\\w+");
// The file where we will cache our results
private final static File INDEX_FILE = new File("index.bin");
// Helper method to determine if a file is a text file or not
private static boolean isTextFile(File file) {
String name = file.getName().toLowerCase();
return name.endsWith(".txt")
|| name.endsWith(".java")
|| name.endsWith(".c")
|| name.endsWith(".cpp")
|| name.endsWith(".h")
|| name.endsWith(".hpp")
|| name.endsWith(".html")
|| name.endsWith(".css")
|| name.endsWith(".js");
}
// Normalizes a string by converting it to lowercase and removing accents
private static String normalize(String string) {
return string.toLowerCase()
.replace("á", "a")
.replace("é", "e")
.replace("í", "i")
.replace("ó", "o")
.replace("ú", "u");
}
// Recursively fills the map with the count of words found on all the text files
static void fillWordMap(Map<String, Integer> map, File root) throws IOException {
// Our file queue begins with the root
Queue<File> fileQueue = new ArrayDeque<>();
fileQueue.add(root);
// For as long as the queue is not empty...
File file;
while ((file = fileQueue.poll()) != null) {
if (!file.exists() || !file.canRead()) {
// ...ignore files for which we don't have permission...
System.err.println("warning: cannot read file: " + file);
} else if (file.isDirectory()) {
// ...else if it's a directory, extend our queue with its files...
File[] files = file.listFiles();
if (files == null) {
System.err.println("warning: cannot list dir: " + file);
} else {
fileQueue.addAll(Arrays.asList(files));
}
} else if (isTextFile(file)) {
// ...otherwise, count the words in the file.
countWordsInFile(map, file);
}
}
}
// Counts the words in a single file and adds the count to the map.
public static void countWordsInFile(Map<String, Integer> map, File file) throws IOException {
BufferedReader reader = new BufferedReader(new FileReader(file));
String line;
while ((line = reader.readLine()) != null) {
Matcher matcher = WORDS.matcher(line);
while (matcher.find()) {
String token = normalize(matcher.group());
Integer count = map.get(token);
if (count == null) {
map.put(token, 1);
} else {
map.put(token, count + 1);
}
}
}
reader.close();
}
// Prints the map of word count to the desired output stream.
public static void printWordMap(Map<String, Integer> map, PrintStream writer) {
List<String> keys = new ArrayList<>(map.keySet());
Collections.sort(keys);
for (String key : keys) {
writer.println(key + "\t" + map.get(key));
}
}
@SuppressWarnings("unchecked")
public static void main(String[] args) throws IOException, ClassNotFoundException {
// Validate arguments
if (args.length == 1 && args[0].equals("--help")) {
System.err.println("usage: java Crawl [input]");
return;
}
File root = new File(args.length > 0 ? args[0] : ".");
// Loading or generating the map where we aggregate the data {word: count}
Map<String, Integer> map;
if (INDEX_FILE.isFile()) {
System.err.println("Found existing index file: " + INDEX_FILE);
try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(INDEX_FILE))) {
map = (Map<String, Integer>) ois.readObject();
}
} else {
System.err.println("Index file not found: " + INDEX_FILE + "; indexing...");
map = new TreeMap<>();
fillWordMap(map, root);
// Cache the results to avoid doing the work a next time
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(INDEX_FILE))) {
out.writeObject(map);
}
}
// Ask the user in a loop to query for words
Scanner scanner = new Scanner(System.in);
while (true) {
System.out.print("Escriba palabra a consultar (o Enter para salir): ");
System.out.flush();
String line = scanner.nextLine().trim();
if (line.isEmpty()) {
break;
}
line = normalize(line);
Integer count = map.get(line);
if (count == null) {
System.out.println(String.format("La palabra \"%s\" no está presente", line));
} else if (count == 1) {
System.out.println(String.format("La palabra \"%s\" está presente 1 vez", line));
} else {
System.out.println(String.format("La palabra \"%s\" está presente %d veces", line, count));
}
}
}
}
</code></pre>
<p>It can be compiled and executed as follows:</p>
<pre><code>javac Crawl.java
java Crawl
</code></pre>
<p>Instead of copy-pasting the code, you may also download it as a <code>.zip</code>:</p>
<p><em>(contents removed)</em></p>
<h2 id="addendum"><a class="anchor" href="#addendum">¶</a>Addendum</h2>
<p>The following simple function can be used if one desires to print the contents of a file:</p>
<pre><code>public static void printFile(File file) {
if (isTextFile(file)) {
System.out.println('\n' + file.getName());
try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
} catch (FileNotFoundException ignored) {
System.err.println("warning: file disappeared while reading: " + file);
} catch (IOException e) {
e.printStackTrace();
}
}
}
</code></pre>
</main>
</body>
</html>
About Boolean Retrievaldist/about-boolean-retrieval/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00This entry will discuss the section on the <!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>About Boolean Retrieval</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>This entry will discuss the section on the <em><a href="https://nlp.stanford.edu/IR-book/pdf/01bool.pdf">Boolean retrieval</a></em> section of the book <em><a href="https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf">An Introduction to Information Retrieval</a></em>.</p>
<div class="date-created-modified">Created 2020-02-25<br>
Modified 2020-03-18</div>
<h2 class="title" id="summary_on_the_topic"><a class="anchor" href="#summary_on_the_topic">¶</a>Summary on the topic</h2>
<p>Boolean retrieval is one of the many ways of performing information retrieval (finding materials that satisfy an information need), which is often simply called <em>search</em>.</p>
<p>A simple way to retrieve information is to <em>grep</em> through the text (term named after the Unix tool <code>grep</code>), scanning text linearly and excluding it on certain criteria. However, this falls short when the volume of the data grows, more complex queries are desired, or one seeks some sort of ranking.</p>
<p>To avoid linear scanning, we build an <em>index</em> and record for each document whether it contains each term out of our full dictionary of terms (which may be words in a chapter and words in the book). This results in a binary term-document <em>incidence matrix</em>. Such a possible matrix is:</p>
<table class="">
<tbody>
<tr>
<td>
<em>
word/play
</em>
</td>
<td>
<strong>
Antony and Cleopatra
</strong>
</td>
<td>
<strong>
Julius Caesar
</strong>
</td>
<td>
<strong>
The Tempest
</strong>
</td>
<td>
<strong>
…
</strong>
</td>
</tr>
<tr>
<td>
<strong>
Antony
</strong>
</td>
<td>
1
</td>
<td>
1
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
Brutus
</strong>
</td>
<td>
1
</td>
<td>
1
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
Caesar
</strong>
</td>
<td>
1
</td>
<td>
1
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
Calpurnia
</strong>
</td>
<td>
0
</td>
<td>
1
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
Cleopatra
</strong>
</td>
<td>
1
</td>
<td>
0
</td>
<td>
0
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
mercy
</strong>
</td>
<td>
1
</td>
<td>
0
</td>
<td>
1
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
worser
</strong>
</td>
<td>
1
</td>
<td>
0
</td>
<td>
1
</td>
<td>
</td>
</tr>
<tr>
<td>
<strong>
…
</strong>
</td>
<td>
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
<p>We can look at this matrix’s rows or columns to obtain a vector for each term indicating where it appears, or a vector for each document indicating the terms it contains.</p>
<p>Now, answering a query such as <code>Brutus AND Caesar AND NOT Calpurnia</code> becomes trivial:</p>
<pre><code>VECTOR(Brutus) AND VECTOR(Caesar) AND COMPLEMENT(VECTOR(Calpurnia))
= 110 AND 110 AND COMPLEMENT(010)
= 110 AND 110 AND 101
= 100
</code></pre>
<p>The query is only satisfied for our first column.</p>
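<p>This is exactly the kind of bitwise work that Java’s <code>java.util.BitSet</code> can do for us. A minimal sketch of the query above (the string encoding of the vectors is just for illustration):</p>
<pre><code>import java.util.BitSet;
class BooleanQuery {
    // Turns "110" into a BitSet with bits 0 and 1 set (one bit per play)
    static BitSet bits(String s) {
        BitSet b = new BitSet(s.length());
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '1') b.set(i);
        }
        return b;
    }
    public static void main(String[] args) {
        int docs = 3;
        BitSet result = bits("110");          // Brutus
        result.and(bits("110"));              // AND Caesar
        BitSet notCalpurnia = bits("010");
        notCalpurnia.flip(0, docs);           // NOT Calpurnia -> 101
        result.and(notCalpurnia);
        System.out.println(result);           // {0}: only the first play matches
    }
}
</code></pre>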
<p>The <em>Boolean retrieval model</em> is thus a model that treats documents as a set of terms, in which we can perform any query in the form of Boolean expressions of terms, combined with <code>OR</code>, <code>AND</code>, and <code>NOT</code>.</p>
<p>Now, building such a matrix is often not feasible due to the sheer amount of data (say, a matrix with 500,000 terms across 1,000,000 documents, each with roughly 1,000 terms). However, it is important to notice that most of the terms will be <em>missing</em> when examining each document. In our example, this means 99.8% or more of the cells will be 0. We can instead record the <em>positions</em> of the 1’s. This is known as an <em>inverted index</em>.</p>
<p>The inverted index is a dictionary of terms, each containing a list that records in which documents it appears (<em>postings</em>). Applied to boolean retrieval, we would:</p>
<ol>
<li>Collect the documents to be indexed and assign a unique identifier to each</li>
<li>Tokenize the text in the documents into a list of terms</li>
<li>Normalize the tokens, which now become indexing terms</li>
<li>Index the documents (see the sketch below)</li>
</ol>
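<p>A minimal Java sketch of steps 2 to 4 (the two toy documents are loosely based on the tables below):</p>
<pre><code>import java.util.*;
class TinyIndexer {
    public static void main(String[] args) {
        String[] docs = {
            "I did enact Julius Caesar",  // document 1
            "So let it be with Caesar",   // document 2
        };
        // Dictionary of terms, each mapping to a sorted postings list
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String token : docs[id].split("\\W+")) {
                String term = token.toLowerCase(); // normalization
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id + 1);
            }
        }
        index.forEach((term, postings) ->
            System.out.println(term + " -> " + postings));
    }
}
</code></pre>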
<table class="">
<tbody>
<tr>
<td>
<strong>
Dictionary
</strong>
</td>
<td>
<strong>
Postings
</strong>
</td>
</tr>
<tr>
<td>
Brutus
</td>
<td>
1, 2, 4, 11, 31, 45, 173, 174
</td>
</tr>
<tr>
<td>
Caesar
</td>
<td>
1, 2, 4, 5, 6, 16, 57, 132, …
</td>
</tr>
<tr>
<td>
Calpurnia
</td>
<td>
2, 31, 54, 101
</td>
</tr>
<tr>
<td>
…
</td>
<td>
</td>
</tr>
</tbody>
</table>
<p>Sort the pairs <code>(term, document_id)</code> so that the terms are alphabetical, and merge multiple occurrences into one. Group instances of the same term and split again into a sorted list of postings.</p>
<table class="">
<tbody>
<tr>
<td>
<strong>
term
</strong>
</td>
<td>
<strong>
document_id
</strong>
</td>
</tr>
<tr>
<td>
I
</td>
<td>
1
</td>
</tr>
<tr>
<td>
did
</td>
<td>
1
</td>
</tr>
<tr>
<td>
…
</td>
<td>
</td>
</tr>
<tr>
<td>
with
</td>
<td>
2
</td>
</tr>
</tbody>
</table>
<table class="">
<tbody>
<tr>
<td>
<strong>
term
</strong>
</td>
<td>
<strong>
document_id
</strong>
</td>
</tr>
<tr>
<td>
be
</td>
<td>
2
</td>
</tr>
<tr>
<td>
brutus
</td>
<td>
1
</td>
</tr>
<tr>
<td>
brutus
</td>
<td>
2
</td>
</tr>
<tr>
<td>
…
</td>
<td>
</td>
</tr>
</tbody>
</table>
<table class="">
<tbody>
<tr>
<td>
<strong>
term
</strong>
</td>
<td>
<strong>
frequency
</strong>
</td>
<td>
<strong>
postings list
</strong>
</td>
</tr>
<tr>
<td>
be
</td>
<td>
1
</td>
<td>
2
</td>
</tr>
<tr>
<td>
brutus
</td>
<td>
2
</td>
<td>
1, 2
</td>
</tr>
<tr>
<td>
capitol
</td>
<td>
1
</td>
<td>
1
</td>
</tr>
<tr>
<td>
…
</td>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
<p>Intersecting postings lists now becomes a matter of traversing both lists in order:</p>
<pre><code>Brutus : 1 -> 2 -> 4 -> 11 -> 31 -> 45 -> 173 -> 174
Calpurnia: 2 -> 31 -> 54 -> 101
Intersect: 2 -> 31
</code></pre>
<p>A simple conjunctive query (e.g. <code>Brutus AND Calpurnia</code>) is executed as follows:</p>
<ol>
<li>Locate <code>Brutus</code> in the dictionary</li>
<li>Retrieve its postings</li>
<li>Locate <code>Calpurnia</code> in the dictionary</li>
<li>Retrieve its postings</li>
<li>Intersect (<em>merge</em>) both postings</li>
</ol>
<p>Since the lists are sorted, walking both of them can be done in <em>O(n)</em> time. By also storing the frequency, we can optimize the order in which we execute arbitrary queries, although we won’t go into detail.</p>
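<p>A minimal sketch of that linear-time merge in Java, using the postings from the example above: advance the pointer of whichever list has the smaller document identifier, emitting the identifiers found in both.</p>
<pre><code>import java.util.ArrayList;
import java.util.List;
class Intersect {
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {        // present in both lists
                result.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {  // advance the smaller pointer
                i++;
            } else {
                j++;
            }
        }
        return result;
    }
    public static void main(String[] args) {
        int[] brutus = {1, 2, 4, 11, 31, 45, 173, 174};
        int[] calpurnia = {2, 31, 54, 101};
        System.out.println(intersect(brutus, calpurnia)); // [2, 31]
    }
}
</code></pre>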
<h2 id="thoughts"><a class="anchor" href="#thoughts">¶</a>Thoughts</h2>
<p>The boolean retrieval model can be implemented with relative ease, and can help with storage and efficient querying of the information if we intend to perform boolean queries.</p>
<p>However, the basic design lacks other useful operations, such as a «near» operator, or the ability to rank the results.</p>
<p>All in all, it’s an interesting way to look at the data and query it efficiently.</p>
</main>
</body>
</html>