pagongpagong2020-08-23T22:00:00+00:00Data Mining and Data Warehousingdist/index/index.html2020-08-23T22:00:00+00:002020-08-23T22:00:00+00:00During 2020 at university, this subject ("Minería de Datos y Almacenes de Datos") had us write<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Data Mining and Data Warehousing</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <h1 class="title" id="data_mining_and_data_warehousing"><a class="anchor" href="#data_mining_and_data_warehousing">¶</a>Data Mining and Data Warehousing</h1> <div class="date-created-modified">2020-08-24</div> <p>During 2020 at university, this subject (&quot;Minería de Datos y Almacenes de Datos&quot;) had us write blog posts as assignments. I think it would be really fun and I wanted to preserve that work here, with the hopes it's interesting to someone.</p> <p>The posts were auto-generated from the original HTML files and manually anonymized later.</p> </main> </body> </html> Privado: Final NoSQL evaluationdist/final-nosql-evaluation/index.html2020-05-13T22:00:00+00:002020-05-12T22:00:00+00:00This evaluation is a bit different to my <!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Privado: Final NoSQL evaluation</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>This evaluation is a bit different to my <a href="/blog/mdad/nosql-evaluation/">previous one</a> because this time I have been tasked to evaluate student <code>a(i - 2)</code>, and because I am <code>i = 11</code> that happens to be <code>a(9) =</code> a classmate.</p> <div class="date-created-modified">Created 2020-05-13<br> Modified 2020-05-14</div> <h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s Evaluation</h2> <p><strong>Grading: A.</strong></p> <p>The post I have evaluated is Trabajo en grupo – Bases de datos NoSQL, 3ª entrada: Aplicación con una Base de datos NoSQL seleccionada.</p> <p>It starts with a very brief introduction with who has written the post, what data they will be using, and what database they have chosen.</p> <p>They properly describe their objective, how they will do it and what library will be used.</p> <p>They also explain where they obtain the data from, and what other things the site can do, which is a nice bonus.</p> <p>The post continues listing and briefly explaining all the tools used and what they are for, including commands to execute.</p> <p>At last, they list what files their project uses, what they do, and contains a showcase of images which lets the reader know what the application does.</p> <p>All in all, in my opinion, it’s clear they have put work into this entry and I have not noticed any major flaws, so they deserve the highest grade.</p> </main> </body> </html> A practical example with Hadoopdist/a-practical-example-with-hadoop/index.html2020-04-17T22:00:00+00:002020-03-29T22:00:00+00:00In our <!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>A practical example with Hadoop</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>In our <a href="/blog/mdad/introduction-to-hadoop-and-its-mapreduce/">previous Hadoop post</a>, we learnt what it is, how it originated, and how it works, from a theoretical standpoint. 
Here we will instead focus on a more practical example with Hadoop.</p> <div class="date-created-modified">Created 2020-03-30<br> Modified 2020-04-18</div> <p>This post will reproduce the example from Chapter 2 of the book <a href="http://www.hadoopbook.com/">Hadoop: The Definitive Guide, Fourth Edition</a> (<a href="http://grut-computing.com/HadoopBook.pdf">pdf</a>, <a href="http://www.hadoopbook.com/code.html">code</a>), that is, finding the worldwide maximum temperature for each year.</p> <h2 class="title" id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2> <p>Before we can run any piece of software, we must first download it to our computer. Head over to <a href="http://hadoop.apache.org/releases.html">Apache Hadoop’s releases</a> and download the <a href="https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz">latest binary version</a> at the time of writing (3.2.1).</p> <p>We will be using the <a href="https://linuxmint.com/">Linux Mint</a> distribution because I love its simplicity, although the process shown here should work just fine on any similar Linux distribution such as <a href="https://ubuntu.com/">Ubuntu</a>.</p> <p>Once the archive download is complete, extract it with any tool of your choice (graphical or from the terminal). Make sure you have a version of Java installed, such as <a href="https://openjdk.java.net/">OpenJDK</a>, and check that Hadoop runs.</p> <p>Here are all three steps on the command line:</p> <pre><code>wget https://apache.brunneis.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
hadoop-3.2.1/bin/hadoop version
</code></pre> <p>We will be using the two example data files that they provide in <a href="https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all">their GitHub repository</a>, although the full dataset is offered by the <a href="https://www.ncdc.noaa.gov/">National Climatic Data Center</a> (NCDC).</p> <p>We will also unzip and concatenate both files into a single text file, to make it easier to work with. As a single command pipeline:</p> <pre><code>curl https://raw.githubusercontent.com/tomwhite/hadoop-book/master/input/ncdc/all/190{1,2}.gz | gunzip &gt; 190x
</code></pre> <p>This should create a <code>190x</code> text file in the current directory, which will be our input data.</p> <h2 id="processing_data"><a class="anchor" href="#processing_data">¶</a>Processing data</h2> <p>To take advantage of Hadoop, we have to design our code to work in the MapReduce model. Both the map and reduce phases work on key-value pairs as input and output, and both have a programmer-defined function.</p> <p>We will use Java, because it’s a dependency that we already have anyway, so might as well.</p> <p>Our map function needs to extract the year and air temperature, which will prepare the data for later use (finding the maximum temperature for each year). 
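<p>For reference, each line in <code>190x</code> is one fixed-width NCDC record. As a quick, optional sanity check (plain Python, outside of Hadoop; the column offsets are simply the ones the mapper code below relies on), we can peek at the fields we care about in the first record:</p> <pre><code># Print the fixed-width fields used by the MapReduce job from the first record of 190x.
with open('190x') as f:
    line = f.readline()

print('year:', line[15:19])
print('temperature:', line[87:92])  # signed value, stored in tenths of a degree
print('quality flag:', line[92])
</code></pre>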
We will also drop bad records here (if the temperature is missing, suspect or erroneous).</p> <p>Copy or reproduce the following code in a file called <code>MaxTempMapper.java</code>, using any text editor of your choice:</p> <pre><code>import java.io.IOException; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Mapper; public class MaxTempMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; { private static final int TEMP_MISSING = 9999; private static final String GOOD_QUALITY_RE = &quot;[01459]&quot;; @Override public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); String year = line.substring(15, 19); String temp = line.substring(87, 92).replaceAll(&quot;^\\+&quot;, &quot;&quot;); String quality = line.substring(92, 93); int airTemperature = Integer.parseInt(temp); if (airTemperature != TEMP_MISSING &amp;&amp; quality.matches(GOOD_QUALITY_RE)) { context.write(new Text(year), new IntWritable(airTemperature)); } } } </code></pre> <p>Now, let’s create the <code>MaxTempReducer.java</code> file. Its job is to reduce the data from multiple values into just one. We do that by keeping the maximum out of all the values we receive:</p> <pre><code>import java.io.IOException; import java.util.Iterator; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Reducer; public class MaxTempReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; { @Override public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException { Iterator&lt;IntWritable&gt; iter = values.iterator(); if (iter.hasNext()) { int maxValue = iter.next().get(); while (iter.hasNext()) { maxValue = Math.max(maxValue, iter.next().get()); } context.write(key, new IntWritable(maxValue)); } } } </code></pre> <p>Except for some Java weirdness (…why can’t we just iterate over an <code>Iterator</code>? Or why can’t we just manually call <code>next()</code> on an <code>Iterable</code>?), our code is correct. There can’t be a maximum if there are no elements, and we want to avoid dummy values such as <code>Integer.MIN_VALUE</code>.</p> <p>We can also take a moment to appreciate how absolutely tiny this code is, and it’s Java! Hadoop’s API is really awesome and lets us write such concise code to achieve what we need.</p> <p>Last, let’s write the <code>main</code> method, or else we won’t be able to run it. 
In our new file <code>MaxTemp.java</code>:</p> <pre><code>import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class MaxTemp { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println(&quot;usage: java MaxTemp &lt;input path&gt; &lt;output path&gt;&quot;); System.exit(-1); } Job job = Job.getInstance(); job.setJobName(&quot;Max temperature&quot;); job.setJarByClass(MaxTemp.class); job.setMapperClass(MaxTempMapper.class); job.setReducerClass(MaxTempReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); boolean result = job.waitForCompletion(true); System.exit(result ? 0 : 1); } } </code></pre> <p>And compile by including the required <code>.jar</code> dependencies in Java’s classpath with the <code>-cp</code> switch:</p> <pre><code>javac -cp &quot;hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/mapreduce/*&quot; *.java </code></pre> <p>At last, we can run it (also specifying the dependencies in the classpath, this one’s a mouthful):</p> <pre><code>java -cp &quot;.:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/mapreduce/*:hadoop-3.2.1/share/hadoop/mapreduce/lib/*:hadoop-3.2.1/share/hadoop/yarn/*:hadoop-3.2.1/share/hadoop/yarn/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*&quot; MaxTemp 190x results </code></pre> <p>Hooray! We should have a new <code>results/</code> folder along with the following files:</p> <pre><code>$ ls results part-r-00000 _SUCCESS $ cat results/part-r-00000 1901 317 1902 244 </code></pre> <p>It worked! Now this example was obviously tiny, but hopefully enough to demonstrate how to get the basics running on real world data.</p> </main> </body> </html> Developing a Python application for Cassandradist/developing-a-python-application-for-cassandra/index.html2020-04-15T22:00:00+00:002020-03-22T23:00:00+00:00Warning<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Developing a Python application for Cassandra</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p><em><strong>Warning</strong>: this post is, in fact, a shameless self-plug to my own library. If you continue reading, you accept that you are okay with this. Otherwise, please close the tab, shut down your computer, and set it on fire.__(Also, that was a joke. Please don’t do that.)</em></p> <div class="date-created-modified">Created 2020-03-23<br> Modified 2020-04-16</div> <p>Let’s do some programming! Today we will be making a tiny CLI application in <a href="http://python.org/">Python</a> that queries <a href="https://core.telegram.org/api">Telegram’s API</a> and stores the data in <a href="http://cassandra.apache.org/">Cassandra</a>.</p> <h2 class="title" id="our_goal"><a class="anchor" href="#our_goal">¶</a>Our goal</h2> <p>Our goal is to make a Python console application. This application will connect to <a href="https://telegram.org/">Telegram</a>, and ask for your account credentials. 
Once you have logged in, the application will fetch all of your open conversations and we will store these in Cassandra.</p> <p>With the data saved in Cassandra, we can then very efficiently query information about your conversations offline, given their identifier (no need to query Telegram anymore).</p> <p><strong>In short</strong>, we are making an application that performs efficient offline queries to Cassandra to print out information about your Telegram conversations given the ID you want to query.</p> <h2 id="data_model"><a class="anchor" href="#data_model">¶</a>Data model</h2> <p>The application itself is really simple, and we only need one table to store all the relevant information we will be needing. This table, called <code>users</code>, will contain the following columns:</p> <ul> <li><code>id</code>, of type <code>int</code>. This will also be the <code>primary key</code> and we’ll use it to query the database later on.</li> <li><code>first_name</code>, of type <code>varchar</code>. This field contains the first name of the stored user.</li> <li><code>last_name</code>, of type <code>varchar</code>. This field contains the last name of the stored user.</li> <li><code>username</code>, of type <code>varchar</code>. This field contains the username of the stored user.</li> </ul> <p>Because Cassandra uses a <a href="https://cassandra.apache.org/doc/latest/architecture/overview.html">wide column storage model</a>, direct access through a key is the most efficient way to query the database. In our case, the key is the primary key of the <code>users</code> table, using the <code>id</code> column. The index for the primary key is ready to be used as soon as we create the table, so we don’t need to create it on our own.</p> <h2 id="dependencies"><a class="anchor" href="#dependencies">¶</a>Dependencies</h2> <p>Because we will program it in Python, you need Python installed. You can install it using a package manager of your choice or by heading over to the <a href="https://www.python.org/downloads/">Python downloads section</a>, but if you’re on Linux, chances are you have it installed already.</p> <p>Once Python 3.5 or above is installed, get a copy of the Cassandra driver for Python and Telethon through <code>pip</code>:</p> <pre><code>pip install cassandra-driver telethon
</code></pre> <p>For more details on that, see the <a href="https://docs.datastax.com/en/developer/python-driver/3.22/installation/">installation guide for <code>cassandra-driver</code></a>, or the <a href="https://docs.telethon.dev/en/latest/basic/installation.html">installation guide for <code>telethon</code></a>.</p> <p>As we did in our <a href="/blog/mdad/cassandra-operaciones-basicas-y-arquitectura/">previous post</a>, we will set up a new keyspace for this application with <code>cqlsh</code>. We will also create a table to store the users in. This could all be automated in the Python code, but because it’s a one-time thing, we prefer to use <code>cqlsh</code>.</p> <p>Make sure that Cassandra is running in the background. We can’t make queries to it if it’s not running.</p> <pre><code>$ bin/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh&gt; create keyspace mdad with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
cqlsh&gt; use mdad;
cqlsh:mdad&gt; create table users(id int primary key, first_name varchar, last_name varchar, username varchar);
</code></pre> <p>Python installed? Check. 
Python dependencies? Check. Cassandra ready? Check.</p> <h2 id="the_code"><a class="anchor" href="#the_code">¶</a>The code</h2> <h3 id="getting_users"><a class="anchor" href="#getting_users">¶</a>Getting users</h3> <p>The first step is connecting to <a href="https://core.telegram.org/api">Telegram’s API</a>, for which we’ll use <a href="https://telethon.dev/">Telethon</a>, a wonderful (wink, wink) Python library to interface with it.</p> <p>As with most APIs, we need to supply <a href="https://my.telegram.org/">our API key</a> in order to use it (here <code>API_ID</code> and <code>API_HASH</code>). We will refer to them as constants. At the end, you may download the entire code and use my own key for this example. But please don’t use those values for your other applications!</p> <p>It’s pretty simple: we create a client, and for every dialog (that is, open conversation) we have, do some checks:</p> <ul> <li>If it’s an user, we just store that in a dictionary mapping <code>ID → User</code>.</li> <li>Else if it’s a group, we iterate over the participants and store those users instead.</li> </ul> <pre><code>async def load_users(): from telethon import TelegramClient users = {} async with TelegramClient(SESSION, API_ID, API_HASH) as client: async for dialog in client.iter_dialogs(): if dialog.is_user: user = dialog.entity users[user.id] = user print('found user:', user.id, file=sys.stderr) elif dialog.is_group: async for user in client.iter_participants(dialog): users[user.id] = user print('found member:', user.id, file=sys.stderr) return list(users.values()) </code></pre> <p>With this we have a mapping ID to user, so we know we won’t have duplicates. We simply return the list of user values, because that’s all we care about.</p> <h3 id="saving_users"><a class="anchor" href="#saving_users">¶</a>Saving users</h3> <p>Inserting users into Cassandra is pretty straightforward. We take the list of <code>User</code> objects as input, and prepare a new <code>INSERT</code> statement that we can reuse (because we will be using it in a loop, this is the best way to do it).</p> <p>For each user, execute the statement with the user data as input parameters. Simple as that.</p> <pre><code>def save_users(session, users): insert_stmt = session.prepare( 'INSERT INTO users (id, first_name, last_name, username) ' 'VALUES (?, ?, ?, ?)') for user in users: row = (user.id, user.first_name, user.last_name, user.username) session.execute(insert_stmt, row) </code></pre> <h3 id="fetching_users"><a class="anchor" href="#fetching_users">¶</a>Fetching users</h3> <p>Given a list of users, yield all of them from the database. Similar to before, we prepare a <code>SELECT</code> statement and just execute it repeatedly over the input user IDs.</p> <pre><code>def fetch_users(session, users): select_stmt = session.prepare('SELECT * FROM users WHERE id = ?') for user_id in users: yield session.execute(select_stmt, (user_id,)).one() </code></pre> <h3 id="parsing_arguments"><a class="anchor" href="#parsing_arguments">¶</a>Parsing arguments</h3> <p>We’ll be making a little CLI application, so we need to parse console arguments. It won’t be anything fancy, though. 
For that we’ll be using <a href="https://docs.python.org/3/library/argparse.html">Python’s <code>argparse</code> module</a>:</p> <pre><code>def parse_args():
    import argparse

    parser = argparse.ArgumentParser(
        description='Dump and query Telegram users')

    parser.add_argument('users', type=int, nargs='*',
                        help='one or more user IDs to query for')

    parser.add_argument('--load-users', action='store_true',
                        help='load users from Telegram (do this first run)')

    return parser.parse_args()
</code></pre> <h3 id="all_together"><a class="anchor" href="#all_together">¶</a>All together</h3> <p>Last, the entry point. We import Cassandra’s <code>Cluster</code>, and connect to our keyspace (we called it <code>mdad</code> earlier).</p> <p>If the user wants to load the users into the database, we’ll do just that first.</p> <p>Then, for each user we fetch from the database, we print it. Last names and usernames are optional, so we don’t print those if they’re missing (<code>None</code>).</p> <pre><code>async def main(args):
    from cassandra.cluster import Cluster

    cluster = Cluster(CLUSTER_NODES)
    session = cluster.connect(KEYSPACE)

    if args.load_users:
        users = await load_users()
        save_users(session, users)

    for user in fetch_users(session, args.users):
        print('User', user.id, ':')
        print(' First name:', user.first_name)
        if user.last_name:
            print(' Last name:', user.last_name)
        if user.username:
            print(' Username:', user.username)
        print()


if __name__ == '__main__':
    asyncio.run(main(parse_args()))
</code></pre> <p>Because Telethon is an <a href="https://docs.python.org/3/library/asyncio.html"><code>asyncio</code></a> library, we define it as <code>async def main(...)</code> and run it with <code>asyncio.run(main(...))</code>.</p> <p>Here’s what it looks like in action:</p> <pre><code>$ python data.py --help
usage: data.py [-h] [--load-users] [users [users ...]]

Dump and query Telegram users

positional arguments:
  users         one or more user IDs to query for

optional arguments:
  -h, --help    show this help message and exit
  --load-users  load users from Telegram (do this first run)

$ python data.py --load-users
found user: 487158
found member: 59794114
found member: 487158
found member: 191045991
(...a lot more output)

$ python data.py 487158 59794114
User 487158 :
 First name: Rick
 Last name: Pickle

User 59794114 :
 First name: Peter
 Username: pete
</code></pre> <p>Telegram’s data now persists in Cassandra, and we can efficiently query it whenever we need to! I would’ve shown a video presenting its usage, but I’m afraid that would leak some of the data I want to keep private :-).</p> <p>Feel free to download the code and try it yourself:</p> <p><em>download removed</em></p> <h2 id="references"><a class="anchor" href="#references">¶</a>References</h2> <ul> <li><a href="https://docs.datastax.com/en/developer/python-driver/3.22/getting_started/">DataStax Python Driver for Apache Cassandra – Getting Started</a></li> <li><a href="https://docs.telethon.dev/en/latest/">Telethon’s Documentation</a></li> </ul> </main> </body> </html> 
<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Introduction to Hadoop and its MapReduce</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>Hadoop is an open-source, free, Java-based programming framework that helps process large datasets in a distributed environment, tackling the problems that arise when trying to harness the knowledge in Big Data, and it is capable of running on thousands of nodes and dealing with petabytes of data. It is based on the Google File System (GFS) and originated from the work on Nutch, an open-source search engine project.</p> <div class="date-created-modified">Created 2020-03-30<br> Modified 2020-04-01</div> <p>Hadoop also offers a distributed filesystem (HDFS) enabling fast transfers among nodes, and a way to program with MapReduce.</p> <p>It strives for the 4 V’s: Volume, Variety, Veracity and Velocity. For veracity, it is a secure environment that can be trusted.</p> <h2 class="title" id="milestones"><a class="anchor" href="#milestones">¶</a>Milestones</h2> <p>The creators of Hadoop are Doug Cutting and Mike Cafarella, who just wanted to design a search engine, Nutch, and quickly ran into the problems of dealing with large amounts of data. They found their solution in the papers Google published.</p> <p>The name comes from a plush toy of Cutting’s child, a yellow elephant.</p> <ul> <li>In July 2005, Nutch used GFS to perform MapReduce operations.</li> <li>In February 2006, Nutch started a Lucene subproject which led to Hadoop.</li> <li>In April 2007, Yahoo used Hadoop in a 1,000-node cluster.</li> <li>In January 2008, Apache took over and made Hadoop a top-level project.</li> <li>In July 2008, Apache tested a 4,000-node cluster. The performance was the fastest compared to other technologies that year.</li> <li>In May 2009, Hadoop sorted a petabyte of data in 17 hours.</li> <li>In December 2011, Hadoop reached 1.0.</li> <li>In May 2012, Hadoop 2.0 was released with the addition of YARN (Yet Another Resource Negotiator) on top of HDFS, splitting MapReduce and other processes into separate components and greatly improving fault tolerance.</li> </ul> <p>From here onwards, many other alternatives have been born around the Hadoop ecosystem, like Spark, Hive &amp; Drill, Kafka or HBase.</p> <p>As of 2017, Amazon has clusters between 1 and 100 nodes, Yahoo has over 100,000 CPUs running Hadoop, AOL has clusters with 50 machines, and Facebook has a 320-machine cluster (2,560 cores) with 1.3 PB of raw storage.</p> <h2 id="why_not_use_rdbms_"><a class="anchor" href="#why_not_use_rdbms_">¶</a>Why not use RDBMS?</h2> <p>Relational database management systems simply cannot scale horizontally, and vertical scaling requires very expensive servers. Similar to RDBMS, Hadoop has a notion of jobs (analogous to transactions), but without ACID or concurrency control. Hadoop supports any form of data (unstructured or semi-structured) in read-only mode, and failures are common, but there’s a simple yet efficient fault tolerance mechanism.</p> <p>So what problems does Hadoop solve? It changes the way we should think about problems: we distribute them, which is key to doing anything related to Big Data nowadays. We start working with clusters of nodes and coordinate the jobs between them. 
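<p>To make that idea more concrete before we look at the components, here is a tiny, single-process Python sketch of the map → shuffle → reduce flow described later in this post (a toy word count, not how Hadoop actually executes things):</p> <pre><code>from collections import defaultdict

def map_phase(line):
    # Map: emit (key, value) pairs; here, one (word, 1) pair per word.
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: combine all values that share a key into a single result.
    return word, sum(counts)

lines = ['hadoop maps then reduces', 'hadoop scales out']
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)  # shuffle-and-sort: group values by key

print(dict(reduce_phase(word, counts) for word, counts in grouped.items()))
# {'hadoop': 2, 'maps': 1, 'then': 1, 'reduces': 1, 'scales': 1, 'out': 1}
</code></pre>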
Hadoop’s API makes this model really easy to program.</p> <p>Hadoop also takes the loss of data very seriously and guards against it with replication: if a node falls, its blocks are recovered from a different node.</p> <h2 id="major_components"><a class="anchor" href="#major_components">¶</a>Major components</h2> <p>The previously-mentioned HDFS runs on commodity machines, which are cost-friendly. It is very fault-tolerant and efficient enough to process huge amounts of data, because it splits large files into smaller chunks (or blocks) that can be more easily handled. Multiple nodes can work on multiple chunks at the same time.</p> <p>The NameNode stores the metadata of the various data blocks (the map of blocks) along with their location. It is the brain and the master in Hadoop’s master-slave architecture, also known as the namespace, and it makes use of the DataNodes.</p> <p>A secondary NameNode is a replica that can be used if the first NameNode dies, so that Hadoop doesn’t shut down and can restart.</p> <p>DataNodes store the blocks of data and are the slaves in the architecture. This data is split into one or more files, and their only job is to manage access to it. They are often distributed among racks to avoid data loss.</p> <p>The JobTracker creates and schedules jobs from the clients for either map or reduce operations.</p> <p>The TaskTracker runs the MapReduce tasks assigned to the current data node.</p> <p>When clients need data, they first interact with the NameNode, which replies with the location of the data in the correct DataNode. The client then proceeds to interact with that DataNode.</p> <h2 id="mapreduce"><a class="anchor" href="#mapreduce">¶</a>MapReduce</h2> <p>MapReduce, as the name implies, is split into two steps: the map and the reduce. The map stage is the «divide and conquer» strategy, while the reduce part is about combining and reducing the results.</p> <p>The mapper has to process the input data (normally a file or directory), commonly line-by-line, and produce one or more outputs. The reducer uses all the results from the mapper as its input to produce a new output file itself.</p> <p><img src="bitmap.png" alt="" /></p> <p>When reading the data, some may be junk that we can choose to ignore. If it is valid data, however, we label it with a particular type that can be useful for the upcoming process. Hadoop is responsible for splitting the data across the many nodes available to execute this process in parallel.</p> <p>There is another part to MapReduce, known as the Shuffle-and-Sort. In this part, types or categories from one node get moved to a different node. This happens with all nodes, so that every node can work on a complete category. These categories are known as «keys», and they allow Hadoop to scale linearly.</p> <h2 id="references"><a class="anchor" href="#references">¶</a>References</h2> <ul> <li><a href="https://youtu.be/oT7kczq5A-0">YouTube – Hadoop Tutorial For Beginners | What Is Hadoop? | Hadoop Tutorial | Hadoop Training | Simplilearn</a></li> <li><a href="https://youtu.be/bcjSe0xCHbE">YouTube – Learn MapReduce with Playing Cards</a></li> <li><a href="https://youtu.be/j8ehT1_G5AY?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">YouTube – Video Post #2: Hadoop para torpes (I)-¿Qué es y para qué sirve?</a></li> <li><a href="https://youtu.be/NQ8mjVPCDvk?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">Video Post #3: Hadoop para torpes (II)-¿Cómo funciona? 
HDFS y MapReduce</a></li> <li><a href="https://hadoop.apache.org/old/releases.html">Apache Hadoop Releases</a></li> <li><a href="https://youtu.be/20qWx2KYqYg?list=PLi4tp-TF_qjM_ed4lIzn03w7OnEh0D8Xi">Video Post #4: Hadoop para torpes (III y fin)- Ecosistema y distribuciones</a></li> <li><a href="http://www.hadoopbook.com/">Chapter 2 – Hadoop: The Definitive Guide, Fourth Edition</a> (<a href="http://grut-computing.com/HadoopBook.pdf">pdf,</a><a href="http://www.hadoopbook.com/code.html">code</a>)</li> </ul> </main> </body> </html> Data Warehousing and OLAPdist/data-warehousing-and-olap/index.html2020-03-31T22:00:00+00:002020-03-22T23:00:00+00:00Business intelligence (BI) refers to systems used to gain insights from data, traditionally taken from relational databases and being used to build a data warehouse. Performance and scalability are key aspects of BI systems.<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Data Warehousing and OLAP</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>Business intelligence (BI) refers to systems used to gain insights from data, traditionally taken from relational databases and being used to build a data warehouse. Performance and scalability are key aspects of BI systems.</p> <div class="date-created-modified">Created 2020-03-23<br> Modified 2020-04-01</div> <p>Commonly, the data in the warehouse is a transformation of the original, operational data into a form better suited for reporting and analysis.</p> <p>This whole process is known as Online Analytical Processing (OLAP), and is different to the approach taken by relational databases, which is known as Online Transaction Processing (OLTP) and is optimized for individual transactions. OLAP is based on multidimensional databases simply by the way it works.</p> <p>The Business Intelligence Semantic Model (BISM) refers to the different semantics in which data can be accessed and queried.</p> <p>On the one hand, MDX is the language used for Microsoft’s BISM of multidimensional mode, and on the other, DAX is the language of tabular mode, based on Excel’s formula language and designed to be easy to use by those familiar with Excel.</p> <h2 class="title" id="types_of_data"><a class="anchor" href="#types_of_data">¶</a>Types of data</h2> <p>The business data is often called detail data or <em>fact</em> data, goes in a de-normalized table called the fact table. The term «facts» literally refers to the facts, such as number of products sold and amount received for products sold. Different tables will often represent different dimensions of the data, where «dimensions» simply means different ways to look at the data.</p> <p>Data can also be referred to as measures, because most of it is numbers and subject to aggregations. By measures, we refer to these values and numbers.</p> <p>Multidimensional databases are formed with separate fact and dimension tables, grouped to create a «cube» with both facts and dimensions.</p> <h2 id="places_to_store_data"><a class="anchor" href="#places_to_store_data">¶</a>Places to store data</h2> <p>Three different terms are often heard when talking about the places where data is stored: data lakes, data warehouses, and data marts. All of these have different target users, cost, size and growth.</p> <p>The data lake contains <strong>all</strong> the data generated by your business. Nothing is filtered out, not even cancelled or invalid transactions. 
If there are future plans to use the data, or a need to analyze it in various ways, a data lake is often necessary.</p> <p>The data warehouse contains <strong>structured</strong> data that has already been modelled. It’s also multi-purpose, but often at a much smaller scale. Operational users are able to easily evaluate reports or analyze performance here, since it is built for their needs.</p> <p>The data mart contains a <strong>small portion</strong> of the data, and is often part of a data warehouse itself. It can be seen as a subsection built for specific departments, and as a benefit, users get isolated security and performance. The data here is clean, and subject-oriented.</p> <h2 id="ways_to_store_data"><a class="anchor" href="#ways_to_store_data">¶</a>Ways to store data</h2> <p>Data is often stored de-normalized, because it would not be feasible to store it otherwise.</p> <p>There are two main techniques to implement data warehouses, known as the Inmon approach and the Kimball approach. They are named after Bill Inmon <em>et al.</em> for their work on «Corporate Information Factory», and Ralph Kimball <em>et al.</em> for their work on «The Data Warehouse Lifecycle Toolkit», respectively.</p> <p>When several independent systems identify and store data in different ways, we face what’s known as the problem of the stovepipe. Something as simple as trying to connect these systems or use their data in a warehouse results in an overly complicated system.</p> <p>To tackle this issue, Kimball advocates the use of «conformed dimensions», that is, some dimensions will be «of interest», and have the same attributes and rollups (or at least a subset) in different data marts. This way, the warehouse contains dimensional databases to ease analysis in the data marts it is composed of, and users query the warehouse.</p> <p>The Inmon approach on the other hand has the warehouse laid out in third normal form, and users query the data marts, not the warehouse (so the data marts are dimensional in nature).</p> <h2 id="key_takeaways"><a class="anchor" href="#key_takeaways">¶</a>Key takeaways</h2> <ul> <li>«BI» stands for «Business Intelligence» and refers to the systems that <em>perform</em> data analysis.</li> <li>«BISM» stands for «Business Intelligence Semantic Model», and Microsoft has two languages to query data: MDX and DAX.</li> <li>«OLAP» stands for «Online Analytical Processing», and «OLTP» for «Online Transaction Processing».</li> <li>Data mart, warehouse and lake refer to places at different scales and with different needs to store data.</li> <li>Inmon and Kimball are different ways to implement data warehouses.</li> <li>Fact tables contain various measures arranged into different dimensions, which together form a data cube.</li> </ul> <h2 id="references"><a class="anchor" href="#references">¶</a>References</h2> <ul> <li><a href="https://media.wiley.com/product_data/excerpt/03/11181011/1118101103-157.pdf">Chapter 1 – Professional Microsoft SQL Server 2012 Analysis Services with MDX and DAX (Harinath et al., 2012)</a></li> <li><a href="https://youtu.be/m_DzhW-2pWI">YouTube – Data Mining in SQL Server Analysis Services</a></li> <li>Almacenes de Datos y Procesamiento Analítico On-Line (Félix R.)</li> <li><a href="https://youtu.be/qkJOace9FZg">YouTube – What are Dimensions and Measures?</a></li> <li><a href="https://www.holistics.io/blog/data-lake-vs-data-warehouse-vs-data-mart/">Data Lake vs Data Warehouse vs Data Mart</a></li> </ul> </main> </body> </html> Cassandra: 
Introduccióndist/cassandra-introduccion/index.html2020-03-29T22:00:00+00:002020-03-04T23:00:00+00:00Este es el primer post en la serie sobre Cassandra, en el cuál introduciremos dicha bases de datos NoSQL y veremos sus características e instalación.<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Cassandra: Introducción</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p><img src="1200px-Cassandra_logo.png" alt="" /></p> <div class="date-created-modified">Created 2020-03-05<br> Modified 2020-03-30</div> <p>Este es el primer post en la serie sobre Cassandra, en el cuál introduciremos dicha bases de datos NoSQL y veremos sus características e instalación.</p> <p>Otros posts en esta serie:</p> <ul> <li><a href="/blog/mdad/cassandra-introduccion/">Cassandra: Introducción</a> (este post)</li> <li><a href="/blog/mdad/cassandra-operaciones-basicas-y-arquitectura/">Cassandra: Operaciones Básicas y Arquitectura</a></li> </ul> <p>Este post está hecho en colaboración con un compañero.</p> <hr /> <h2 class="title" id="finalidad_de_la_tecnología"><a class="anchor" href="#finalidad_de_la_tecnología">¶</a>Finalidad de la tecnología</h2> <p>Apache Cassandra es una base de datos NoSQL distribuida y de código abierto (<a href="https://github.com/apache/cassandra">con un espejo en GitHub</a>). Su filosofía es de tipo «clave-valor», y puede manejar grandes volúmenes de datos</p> <p>Entre sus objetivos, busca ser escalable horizontalmente (puede replicarse en varios centros manteniendo la latencia baja) y alta disponibilidad sin ceder en rendimiento.</p> <h2 id="cómo_funciona"><a class="anchor" href="#cómo_funciona">¶</a>Cómo funciona</h2> <p>Instancias de Cassandra se distribuyen en nodos iguales (es decir, no hay maestro-esclavo) que se comunican entre sí (P2P). De este modo, da buen soporte entre varios centros de datos, con redundancia y réplicas síncronas.</p> <p><img src="multiple-data-centers-and-data-replication-in-cassandra.jpg" alt="" /></p> <p>Con respecto al modelo de datos, Cassandra particiona las filas con el objetivo de re-organizarla a lo largo distintas tablas. Como clave primaria, se usa un primer componente conocido como «clave de la partición». Dentro de cada partición, las filas se agrupan según el resto de columnas de la clave. Cualquier otra columna se puede indexar independientemente de la clave primaria.</p> <p>Las tablas se pueden crear, borrar, actualizar y consultar sin bloqueos. No hay soporte para JOIN o subconsultas, pero Cassandra prefiere de-normalizar los datos haciendo uso de características como coleciones.</p> <p>Para realizar las operaciones sobre cassandra se usa CQL (Cassandra Query Language), que tiene una sintaxis muy similar a SQL.</p> <h2 id="características"><a class="anchor" href="#características">¶</a>Características</h2> <p>Como ya hemos mencionado antes, la arquitectura de Cassandra es <strong>decentralizada</strong>. 
No tiene un único punto que pudiera fallar porque todos los nodos son iguales (sin maestros), y por lo tanto, cualquiera puede dar servicio a la petición.</p> <p>Los datos se encuentran <strong>replicados</strong> entre los distintos nodos del clúster (lo que ofrece gran <strong>tolerancia a fallos</strong> sin necesidad de interrumpir la aplicación), y es trivial <strong>escalar</strong> añadiendo más nodos al sistema.</p> <p>El nivel de <strong>consistencia</strong> para lecturas y escrituras es configurable.</p> <p>Siendo de la familia Apache, Cassandra ofrece integración con Apache Hadoop para tener soporte MapReduce.</p> <h2 id="arista_dentro_del_teorema_cap"><a class="anchor" href="#arista_dentro_del_teorema_cap">¶</a>Arista dentro del Teorema CAP</h2> <p>Cassandra se encuentra dentro de la esquina «AP» junto con CouchDB y otros, porque garantiza tanto la disponibilidad como la tolerancia a fallos.</p> <p>Sin embargo, puede configurarse como un sistema «CP» si se prefiere respetar la consistencia en todo momento.</p> <p><img src="0.jpeg" alt="" /></p> <h2 id="descarga"><a class="anchor" href="#descarga">¶</a>Descarga</h2> <p>Se pueden seguir las instrucciones de la página oficial para <a href="https://cassandra.apache.org/download/">descargar Cassandra</a>. Para ello, se debe clicar en la <a href="https://www.apache.org/dyn/closer.lua/cassandra/3.11.6/apache-cassandra-3.11.6-bin.tar.gz">última versión para descargar el archivo</a>. En nuestro caso, esto es el enlace nombrado «3.11.6», versión que utilizamos.</p> <h2 id="instalación"><a class="anchor" href="#instalación">¶</a>Instalación</h2> <p>Cassandra no ofrece binarios para Windows, por lo que usaremos Linux para instalarlo. En nuestro caso, tenemos un sistema Linux Mint (derivado de Ubuntu), pero una máquina virtual con cualquier Linux debería funcionar.</p> <p>Debemos asegurarnos de tener Java y Python 2 instalado mediante el siguiente comando:</p> <pre><code>apt install openjdk-8-jdk openjdk-8-jre python2.7 </code></pre> <p>Para verificar que la instalación ha sido correcta, podemos mostrar las versiones de los programas:</p> <pre><code>$ java -version openjdk version &quot;1.8.0_242&quot; OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08) OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode) $ python2 --version Python 2.7.17 </code></pre> <p>Una vez las dependencias estén instaladas, extraemos el fichero descargado o bien mediante la interfaz gráfica de nuestro sistema, o bien mediante un comando:</p> <pre><code>tar xf apache-cassandra-3.11.6-bin.tar.gz </code></pre> <p>Y finalmente, lanzar la ejecución de Cassandra:</p> <pre><code>apache-cassandra-3.11.6/bin/cassandra </code></pre> <p>Es posible que tarde un poco en abrirse, pero luego debería haber muchas líneas de log indicando. 
Para apagar el servidor, simplemente basta con pulsar <code>Ctrl+C</code>.</p> <h2 id="referencias"><a class="anchor" href="#referencias">¶</a>Referencias</h2> <ul> <li><a href="https://blog.yugabyte.com/apache-cassandra-architecture-how-it-works-lightweight-transactions/">Apache Cassandra Architecture Fundamentals – The Distributed SQL Blog</a></li> <li><a href="https://cassandra.apache.org/">Apache Cassandra</a></li> <li><a href="https://www.datastax.com/blog/2019/05/how-apache-cassandratm-balances-consistency-availability-and-performance">How Apache Cassandra™ Balances Consistency, Availability, and Performance – DataStax</a></li> </ul> </main> </body> </html> <!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Privado: NoSQL evaluation</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>This evaluation is based on the criteria for the first delivery described by Trabajos en grupo sobre Bases de Datos NoSQL.</p> <div class="date-created-modified">Created 2020-03-16<br> Modified 2020-03-28</div> <p>I have chosen to evaluate the following people and works:</p> <ul> <li>a12: Classmate (username) with Druid.</li> <li>a21: Classmate (username) with Neo4J.</li> </ul> <h2 class="title" id="classmate_s_evaluation"><a class="anchor" href="#classmate_s_evaluation">¶</a>Classmate’s Evaluation</h2> <p><strong>Grading: A.</strong></p> <p>The post evaluated is Bases de datos NoSQL – Apache Druid – Primera entrega.</p> <p>It is a very well-written, complete post, with each section meeting one of the points in the required criteria. The only thing that bothered me a little is the abuse of strong emphasis in the text, which I found quite distracting. However, the content deserves the highest grading.</p> <h2 id="classmate_s_evaluation_2"><a class="anchor" href="#classmate_s_evaluation_2">¶</a>Classmate’s Evaluation</h2> <p><strong>Grading: A.</strong></p> <p>The post evaluated is Bases de datos NoSQL – Neo4j – Primera entrega.</p> <p>Well-written post, although a bit shorter than Classmate’s; that’s not really an issue. It still talks about everything it should, and it includes photos to go along with the text, which helps. There are no noticeable flaws in it, so it gets the highest grading as well.</p> </main> </body> </html> <!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Mining of Massive Datasets</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>In this post we will talk about Chapter 1 of the book Mining of Massive Datasets by Leskovec, J. 
et al., available online, and I will summarize and share my thoughts on it.</p> <div class="date-created-modified">Created 2020-03-16<br> Modified 2020-03-28</div> <p>Data mining often refers to the discovery of models for data, where the model can be for statistics, machine learning, summarizing, extracting features, or other computational approaches to perform complex queries on the data.</p> <p>Commonly, problems related to data mining involve discovering unusual events hidden in massive data sets. A cautionary example is the attempt to achieve Total Information Awareness (TIA), a project that was proposed by the Bush administration but shut down. The problem is that, if you look at so much data and try to find activities that look like (for example) terrorist behavior, one will inevitably also find other illicit activities that are not terrorism, with bad consequences. So it is important to narrow down the activities we are looking for, in this case.</p> <p>When looking at data, even completely random data, for a certain type of event, the event will likely occur. With more data, it will occur more times. However, these are bogus results. The Bonferroni correction gives a statistically sound way to avoid most of these bogus results; Bonferroni’s Principle can be used as an informal version to achieve the same thing.</p> <p>For that, we calculate the expected number of occurrences of the events we are looking for, on the assumption that the data is random. If this number is way larger than the number of real instances one hoped to find, then nearly everything found will be bogus.</p> <hr /> <p>When analysing documents, some words will be more important than others, and can help determine the topic of the document. One could think the most repeated words are the most important, but that’s far from the truth. The most common words are the stop-words, which carry no meaning, which is why we should remove them prior to processing. We are mostly looking for rare nouns.</p> <p>There are of course formal measures of how concentrated into relatively few documents the occurrences of a given word are, known as TF.IDF (Term Frequency times Inverse Document Frequency). We won’t go into details on how to compute it, because there are multiple ways.</p> <p>Hash functions are also frequently used, because they can turn hash keys into a bucket number (the index of the bucket where this hash key belongs). 
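<p>As a toy illustration (plain Python, not any particular mining library), hashing keys into a fixed number of buckets can look like this:</p> <pre><code># Toy bucketing: map string keys to one of 8 buckets.
NUM_BUCKETS = 8

def bucket_of(key):
    # hash() turns the key into an integer; the modulo picks the bucket index.
    return hash(key) % NUM_BUCKETS

buckets = {}
for word in ['data', 'mining', 'massive', 'datasets']:
    buckets.setdefault(bucket_of(word), []).append(word)

print(buckets)  # e.g. {3: ['data'], 7: ['mining', 'datasets'], 0: ['massive']}
</code></pre>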
They «randomize» and spread the universe of keys into a smaller number of buckets, useful for storage and access.</p> <p>An index is an efficient structure to query for values given a key, and can be built with hash functions and buckets.</p> <p>Having all of these is important when analysing documents when doing data mining, because otherwise it would take far too long.</p> </main> </body> </html> MongoDB: Operaciones Básicas y Arquitecturadist/mongodb-operaciones-basicas-y-arquitectura/index.html2020-03-19T23:00:00+00:002020-03-04T23:00:00+00:00Este es el segundo post en la serie sobre MongoDB, con una breve descripción de las operaciones básicas (tales como inserción, recuperación e indexado), y ejecución por completo junto con el modelo de datos y arquitectura.<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>MongoDB: Operaciones Básicas y Arquitectura</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>Este es el segundo post en la serie sobre MongoDB, con una breve descripción de las operaciones básicas (tales como inserción, recuperación e indexado), y ejecución por completo junto con el modelo de datos y arquitectura.</p> <div class="date-created-modified">Created 2020-03-05<br> Modified 2020-03-20</div> <p>Otros posts en esta serie:</p> <ul> <li><a href="/blog/mdad/mongodb-introduction/">MongoDB: Introducción</a></li> <li><a href="/blog/mdad/mongodb-operaciones-basicas-y-arquitectura/">MongoDB: Operaciones Básicas y Arquitectura</a> (este post)</li> </ul> <p>Este post está hecho en colaboración con un compañero, y en él veremos algunos ejemplos de las operaciones básicas (<a href="https://stackify.com/what-are-crud-operations/">CRUD</a>) sobre MongoDB.</p> <hr /> <p>Empezaremos viendo cómo creamos una nueva base de datos dentro de MongoDB y una nueva colección donde poder insertar nuestros documentos.</p> <h2 class="title" id="creación_de_una_base_de_datos_e_inserción_de_un_primer_documento"><a class="anchor" href="#creación_de_una_base_de_datos_e_inserción_de_un_primer_documento">¶</a>Creación de una base de datos e inserción de un primer documento</h2> <p>Podemos ver las bases de datos que tenemos disponibles ejecutando el comando:</p> <pre><code>&gt; show databases admin 0.000GB config 0.000GB local 0.000GB </code></pre> <p>Para crear una nueva base de datos, o utilizar una de las que tenemos creadas ejecutamos <code>use</code> junto con el nombre que le vamos a dar:</p> <pre><code>&gt; use new_DB switched to db new_DB </code></pre> <p>Una vez hecho esto, podemos ver que si volvemos a ejecutar «show databases», la nueva base de datos no aparece. Esto es porque para que Mongo registre una base de datos en la lista de las existentes, necesitamos insertar al menos un nuevo documento en una colección de esta. Lo podemos hacer de la siguiente forma:</p> <pre><code>&gt; db.movie.insert({&quot;name&quot;:&quot;tutorials point&quot;}) WriteResult({ &quot;nInserted&quot; : 1 }) &gt; show databases admin 0.000GB config 0.000GB local 0.000GB movie 0.000GB </code></pre> <p>Al igual que podemos ver las bases de datos existentes, también podemos consultar las colecciones que existen dentro de estas. 
Siguiendo la anterior ejecución, si ejecutamos:</p> <pre><code>&gt; show collections movie </code></pre> <h3 id="borrar_base_de_datos"><a class="anchor" href="#borrar_base_de_datos">¶</a>Borrar base de datos</h3> <p>Para borrar una base de datos tenemos que ejecutar el siguiente comando:</p> <pre><code>&gt; db.dropDatabase() { &quot;dropped&quot; : &quot;new_DB&quot;, &quot;ok&quot; : 1 } </code></pre> <h3 id="crear_colección"><a class="anchor" href="#crear_colección">¶</a>Crear colección</h3> <p>Para crear una colección podemos hacerlo de dos formas. O bien mediante el comando:</p> <pre><code>db.createCollection(&lt;nombre de la colección&gt;, opciones) </code></pre> <p>Donde el primer parámetro es el nombre que le queremos asignar a la colección, y los siguientes, todos opcionales, pueden ser (entre otros):</p> <table class=""> <thead> <tr> <th> Campo </th> <th> Tipo </th> <th> Descripción </th> </tr> </thead> <tbody> <tr> <td> <code> capped </code> </td> <td> Booleano </td> <td> Si es <code> true </code> , permite una colección limitada. Una colección limitada es una colección de tamaño fijo que sobrescribe automáticamente sus entradas más antiguas cuando alcanza su tamaño máximo. Si especifica <code> true </code> , también debe especificar el parámetro de <code> size </code> . </td> </tr> <tr> <td> <code> autoIndexId </code> </td> <td> Booleano </td> <td> Si es <code> true </code> crea automáticamente un índice en el campo <code> _id </code> . Por defecto es <code> false </code> </td> </tr> <tr> <td> <code> size </code> </td> <td> Número </td> <td> Especifica el tamaño máximo en bytes para una colección limitada. Es obligatorio si el campo <code> capped </code> está a <code> true </code> . </td> </tr> <tr> <td> <code> max </code> </td> <td> Número </td> <td> Especifica el número máximo de documentos que están permitidos en la colección limitada. </td> </tr> </tbody> </table> <pre><code>&gt; use test switched to db test &gt; db.createCollection(&quot;mycollection&quot;) { &quot;ok&quot; : 1 } &gt; db.createCollection(&quot;mycol&quot;, {capped : true, autoIndexId: true, size: 6142800, max: 10000}) { &quot;note&quot; : &quot;the autoIndexId option is deprecated and will be removed in a future release&quot;, &quot;ok&quot; : 1 } &gt; show collections mycol mycollection </code></pre> <p>Como se ha visto anteriormente al crear la base de datos, podemos insertar un documento en una colección sin que la hayamos creado anteriormente. Esto es porque MongoDB crea automáticamente una colección cuando insertas algún documento en ella:</p> <pre><code>&gt; db.tutorialspoint.insert({&quot;name&quot;:&quot;tutorialspoint&quot;}) WriteResult({ &quot;nInserted&quot; : 1 }) &gt; show collections mycol mycollection tutorialspoint </code></pre> <h3 id="borrar_colección"><a class="anchor" href="#borrar_colección">¶</a>Borrar colección</h3> <p>Para borrar una colección basta con situarnos en la base de datos que la contiene, y ejecutar lo siguiente:</p> <pre><code>db.&lt;nombre_de_la_colección&gt;.drop() </code></pre> <pre><code>&gt; db.mycollection.drop() true &gt; show collections mycol tutorialspoint </code></pre> <h3 id="insertar_documento"><a class="anchor" href="#insertar_documento">¶</a>Insertar documento</h3> <p>Para insertar datos en una colección de MongoDB necesitaremos usar el método <code>insert()</code> o <code>save()</code>.</p> <p>Ejemplo del método <code>insert</code>:</p> <pre><code>&gt; db.colection.insert({ ... title: 'Esto es una prueba para MDAD', ... 
description: 'MongoDB es una BD no SQL', ... by: 'Classmate and Me', ... tags: ['mongodb', 'database'], ... likes: 100 ... }) WriteResults({ &quot;nInserted&quot; : 1 }) </code></pre> <p>En este ejemplo solo se ha insertado un único documento, pero podemos insertar los que queramos separándolos de la siguiente forma:</p> <pre><code>db.collection.insert({documento}, {documento2}, {documento3}) </code></pre> <p>No hace falta especificar un ID ya que el propio mongo asigna un ID a cada documento automáticamente, aunque nos da la opción de poder asignarle uno mediante el atributo <code>_id</code> en la inserción de los datos</p> <p>Como se indica en el título de este apartado también se puede insertar mediante el método <code>db.coleccion.save(documento)</code>, funcionando este como el método <code>insert</code>.</p> <h3 id="método_"><a class="anchor" href="#método_">¶</a>Método <code>find()</code></h3> <p>El método find en MongoDB es el que nos permite realizar consultas a las colecciones de nuestra base de datos:</p> <pre><code>db.&lt;nombre_de_la_colección&gt;.find() </code></pre> <p>Este método mostrará de una forma no estructurada todos los documentos de la colección. Si le añadimos la función <code>pretty</code> a este método, se mostrarán de una manera más «bonita».</p> <pre><code>&gt; db.colection.find() { &quot;_id&quot;: ObjectId(&quot;5e738f0989f85a7eafdf044a&quot;), &quot;title&quot; : &quot;Esto es una prueba para MDAD&quot;, &quot;description&quot; : &quot;MongoDB es una BD no SQL&quot;, &quot;by&quot; : &quot;Classmate and Me&quot;, &quot;tags&quot; : [ &quot;mongodb&quot;, &quot;database&quot; ], &quot;likes&quot; : 100 } &gt; db.colection.find().pretty() { &quot;_id&quot;: ObjectId(&quot;5e738f0989f85a7eafdf044a&quot;), &quot;title&quot; : &quot;Esto es una prueba para MDAD&quot;, &quot;description&quot; : &quot;MongoDB es una BD no SQL&quot;, &quot;by&quot; : &quot;Classmate and Me&quot;, &quot;tags&quot; : [ &quot;mongodb&quot;, &quot;database&quot; ], &quot;likes&quot; : 100 } </code></pre> <p>Los equivalentes del <code>where</code> en las bases de datos relacionales son:</p> <table class=""> <thead> <tr> <th> Operación </th> <th> Sintaxis </th> <th> Ejemplo </th> <th> Equivalente en RDBMS </th> </tr> </thead> <tbody> <tr> <td> Igual </td> <td> <code> {&lt;clave&gt;:&lt;valor&gt;} </code> </td> <td> <code> db.mycol.find({"by":"Classmate and Me"}) </code> </td> <td> <code> where by = 'Classmate and Me' </code> </td> </tr> <tr> <td> Menor que </td> <td> <code> {&lt;clave&gt;:{$lt:&lt;valor&gt;}} </code> </td> <td> <code> db.mycol.find({"likes":{$lt:60}}) </code> </td> <td> <code> where likes &lt; 60 </code> </td> </tr> <tr> <td> Menor o igual que </td> <td> <code> {&lt;clave&gt;:{$lte:&lt;valor&gt;}} </code> </td> <td> <code> db.mycol.find({"likes":{$lte:60}}) </code> </td> <td> <code> where likes &lt;= 60 </code> </td> </tr> <tr> <td> Mayor que </td> <td> <code> {&lt;clave&gt;:{$gt:&lt;valor&gt;}} </code> </td> <td> <code> db.mycol.find({"likes":{$gt:60}}) </code> </td> <td> <code> where likes &gt; 60 </code> </td> </tr> <tr> <td> Mayor o igual que </td> <td> <code> {&lt;clave&gt;:{$gte:&lt;valor&gt;}} </code> </td> <td> <code> db.mycol.find({"likes":{$gte:60}}) </code> </td> <td> <code> where likes &gt;= 60 </code> </td> </tr> <tr> <td> No igual </td> <td> <code> {&lt;clave&gt;:{$ne:&lt;valor&gt;}} </code> </td> <td> <code> db.mycol.find({"likes":{$ne:60}}) </code> </td> <td> <code> where likes != 60 </code> </td> </tr> </tbody> </table> <p>En el método <code>find()</code> 
podemos añadir condiciones AND y OR de la siguiente manera:</p> <pre><code>(AND) &gt; db.colection.find({$and:[{&quot;by&quot;:&quot;Classmate and Me&quot;},{&quot;title&quot;: &quot;Esto es una prueba para MDAD&quot;}]}).pretty() (OR) &gt; db.colection.find({$or:[{&quot;by&quot;:&quot;Classmate and Me&quot;},{&quot;title&quot;: &quot;Esto es una prueba para MDAD&quot;}]}).pretty() (Ambos a la vez) &gt; db.colection.find({&quot;likes&quot;: {$gt:10}, $or: [{&quot;by&quot;: &quot;Classmate and Me&quot;}, {&quot;title&quot;: &quot;Esto es una prueba para MDAD&quot;}]}).pretty() </code></pre> <p>La última llamada con ambos a la vez equivalente en una consulta SQL a:</p> <pre><code>where likes&gt;10 AND (by = 'Classmate and Me' OR title = 'Esto es una prueba para MDAD') </code></pre> <h3 id="actualizar_un_documento"><a class="anchor" href="#actualizar_un_documento">¶</a>Actualizar un documento</h3> <p>En MongoDB se hace utilizando el método <code>update</code>:</p> <pre><code>db.&lt;nombre_colección&gt;.update(&lt;criterio_de_selección&gt;, &lt;dato_actualizado&gt;) </code></pre> <p>Para este ejemplo vamos a actualizar el documento que hemos insertado en el apartado anterior:</p> <pre><code>&gt; db.colection.update({'title':'Esto es una prueba para MDAD'},{$set:{'title':'Título actualizado'}}) WriteResult({ &quot;nMatched&quot; : 1, &quot;nUpserted&quot; : 0, &quot;nModified&quot; : 1 }) &gt; db.colection.find().pretty() { &quot;_id&quot;: ObjectId(&quot;5e738f0989f85a7eafdf044a&quot;), &quot;title&quot; : &quot;Título actualizado&quot;, &quot;description&quot; : &quot;MongoDB es una BD no SQL&quot;, &quot;by&quot; : &quot;Classmate and Me&quot;, &quot;tags&quot; : [ &quot;mongodb&quot;, &quot;database&quot; ], &quot;likes&quot; : 100 } </code></pre> <p>Anteriormente se ha mencionado el método <code>save()</code> para la inserción de documentos, pero también podemos utilizarlo para sustituir documentos enteros por uno nuevo:</p> <pre><code>&gt; db.&lt;nombre_de_la_colección&gt;.save({_id:ObjectId(), &lt;nuevo_documento&gt;}) </code></pre> <p>Con nuestro documento:</p> <pre><code>&gt; db.colection.save( ... { ... &quot;_id&quot;: ObjectId(&quot;5e738f0989f85a7eafdf044a&quot;), &quot;title&quot;: &quot;Este es el nuevo título&quot;, &quot;by&quot;: &quot;MDAD&quot; ... } ... 
) WriteResult({ &quot;nMatched&quot; : 1, &quot;nUpserted&quot; : 0, &quot;nModified&quot; : 1 }) &gt; db.colection.find() { &quot;_id&quot;: ObjectId(&quot;5e738f0989f85a7eafdf044a&quot;), &quot;title&quot;: &quot;Este es el nuevo título&quot;, &quot;by&quot;: &quot;MDAD&quot; } </code></pre> <h3 id="borrar_documento"><a class="anchor" href="#borrar_documento">¶</a>Borrar documento</h3> <p>Para borrar un documento utilizaremos el método <code>remove()</code> de la siguiente manera:</p> <pre><code>db.&lt;nombre_de_la_colección&gt;.remove(&lt;criterio_de_borrado&gt;) </code></pre> <p>Considerando la colección del apartado anterior borraremos el único documento que tenemos:</p> <pre><code>&gt; db.colection.remove({'title': 'Este es el nuevo título'}) WriteResult({ &quot;nRemoved&quot; : 1 }) &gt; db.colection.find().pretty() &gt; </code></pre> <p>Para borrar todos los documentos de una colección usamos:</p> <pre><code>db.&lt;colección&gt;.remove({}) </code></pre> <h3 id="indexación"><a class="anchor" href="#indexación">¶</a>Indexación</h3> <p>MongDB nos permite crear índices sobre atributos de una colección de la siguiente forma:</p> <pre><code>db.&lt;colección&gt;.createIndex( {&lt;atributo&gt;:&lt;opciones&gt;}) </code></pre> <p>Como ejemplo:</p> <pre><code>&gt; db.mycol.createIndex({&quot;title&quot;:1}) { &quot;createdCollectionAutomatically&quot; : false, &quot;numIndexesBefore&quot; : 1, &quot;numIndexesAfter&quot; : 2, &quot;ok&quot; : 1 } </code></pre> <p>Si queremos más de un atributo en el índice lo haremos así:</p> <pre><code>&gt; db.mycol.ensureIndex({&quot;title&quot;:1,&quot;description&quot;:-1}) </code></pre> <p>Los valores que puede tomar son <code>+1</code> para ascendente o <code>-1</code> para descendente.</p> <h3 id="referencias"><a class="anchor" href="#referencias">¶</a>Referencias</h3> <ul> <li>Manual MongoDB. (n.d.). <a href="https://docs.mongodb.com/manual/">https://docs.mongodb.com/manual/</a></li> <li>MongoDB Tutorial – Tutorialspoint. (n.d.). – <a href="https://www.tutorialspoint.com/mongodb/index.htm">https://www.tutorialspoint.com/mongodb/index.htm</a></li> </ul> </main> </body> </html> MongoDB: Introduccióndist/mongodb-introduction/index.html2020-03-19T23:00:00+00:002020-03-04T23:00:00+00:00Este es el primer post en la serie sobre Mongo, en el cuál introduciremos dicha bases de datos NoSQL y veremos sus características e instalación.<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>MongoDB: Introducción</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>Este es el primer post en la serie sobre Mongo, en el cuál introduciremos dicha bases de datos NoSQL y veremos sus características e instalación.</p> <div class="date-created-modified">Created 2020-03-05<br> Modified 2020-03-20</div> <p>Otros posts en esta serie:</p> <ul> <li><a href="/blog/mdad/mongodb-introduction/">MongoDB: Introducción</a> (este post)</li> <li><a href="/blog/mdad/mongodb-operaciones-basicas-y-arquitectura/">MongoDB: Operaciones Básicas y Arquitectura</a></li> </ul> <p>Este post está hecho en colaboración con un compañero.</p> <hr /> <p><img src="0LRP4__jIIkJ-0gl8j2RDzWscL1Rto-NwvdqzmYk0jmYBIVbJ78n1ZLByPgV.png" alt="" /></p> <h2 class="title" id="definición"><a class="anchor" href="#definición">¶</a>Definición</h2> <p>MongoDB es una base de datos orientada a documentos. Esto quiere decir que en lugar de guardar los datos en registros, guarda los datos en documentos. 
Estos documentos son almacenados en BSON, que es una representación binaria de JSON. Una de las principales diferencias respecto a las bases de datos relacionales es que no necesita seguir ningún esquema, los documentos de una misma colección pueden tener esquemas diferentes.</p> <p>MongoDB está escrito en C++, aunque las consultas se hacen pasando objetos JSON como parámetro.</p> <pre><code>{ &quot;_id&quot; : ObjectId(&quot;52f602d787945c344bb4bda5&quot;), &quot;name&quot; : &quot;Tyrion&quot;, &quot;hobbies&quot; : [ &quot;books&quot;, &quot;girls&quot;, &quot;wine&quot; ], &quot;friends&quot; : [ { &quot;name&quot; : &quot;Bronn&quot;, &quot;ocuppation&quot; : &quot;sellsword&quot; }, { &quot;name&quot; : &quot;Shae&quot;, &quot;ocuppation&quot; : &quot;handmaiden&quot; } ] } </code></pre> <h2 id="características"><a class="anchor" href="#características">¶</a>Características</h2> <p><img src="WxZenSwSsimGvXVu5XH4cFUd3kr3Is_arrdSZGX8Hi0Ligqgw_ZTvGSIeXZm.png" alt="" /></p> <p>MongoDB alcanza un balance perfecto entre rendimiento y funcionalidad gracias a su sistema de consulta de contenidos. Pero sus características principales no se limitan solo a esto, también cuenta con otras que lo posicionan como el preferido de muchos desarrolladores de aplicaciones como aplicaciones móviles, gaming, logging o e-commerce.</p> <p>Algunas de las principales características de esta base de datos son:</p> <ul> <li>Almacenamiento orientado a documentos (documentos JSON con esquemas dinámicos).</li> <li>Soporte Full index: puede crear índices sobre cualquier atributo y añadir múltiples índices secundarios.</li> <li>Replicación y alta disponibilidad: espejos entre LANs y WANs.</li> <li>Auto-Sharding: escalabilidad horizontal sin comprometer la funcionalidad, está limitada, actualmente, a 20 nodos, aunque el objetivo es alcanzar una cifra cercana a los 1000.</li> <li>Consultas ricas y basadas en documentos.</li> <li>Rápidas actualizaciones en el contexto.</li> <li>Soporte comercial, capacitación y consultoría disponibles.</li> <li>También puede ser utilizada para el almacenamiento de archivos aprovechando la capacidad de MongoDB para el balanceo de carga y la replicación de datos.</li> </ul> <p>En cuanto a la arquitectura, podríamos decir que divide en tres partes: las bases de datos, las colecciones y los documentos (que contienen los campos de cada entrada).</p> <ul> <li><strong>Base de datos</strong>: cada una de las bases de datos tiene un conjunto propio de archivos en el sistema de archivos con diversas bases de datos existentes en un solo servidor.</li> <li><strong>Colección</strong>: un conjunto de documentos de base de datos. El equivalente RDBMS de la colección es una tabla. Toda colección existe dentro de una única base de datos.</li> <li><strong>Documento</strong>: un conjunto de pares clave/valor. Los documentos están asociados con esquemas dinámicos. La ventaja de tener esquemas dinámicos es que el documento en una sola colección no tiene que tener la misma estructura o campos. </li> </ul> <h2 id="arista_dentro_del_teorema_cap"><a class="anchor" href="#arista_dentro_del_teorema_cap">¶</a>Arista dentro del Teorema CAP</h2> <p><img src="t73Q1t-HXfWij-Q1o5AYEnO39Kz2oyLLCdQz6lWQQPaSQWamlDMjmptAn97h.png" alt="" /></p> <p>MongoDB es CP por defecto, es decir, garantiza consistencia y tolerancia a particiones (fallos). Pero también podemos configurar el nivel de consistencia, eligiendo el número de nodos a los que se replicarán los datos. 
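</p> <p>That configurable consistency level is what MongoDB calls the <em>write concern</em>. A minimal sketch of how it could be set from Python with the official <code>pymongo</code> driver follows; the connection string, replica-set name and database/collection names are made-up assumptions, not part of this post:</p> <pre><code>from pymongo import MongoClient, WriteConcern

# Hypothetical connection to a replica set named 'rs0'
client = MongoClient('mongodb://localhost:27017/?replicaSet=rs0')

# Ask MongoDB to acknowledge the write only once it has reached 2 nodes
posts = client.blog.posts.with_options(write_concern=WriteConcern(w=2))
posts.insert_one({'title': 'Esto es una prueba para MDAD', 'likes': 100})
</code></pre> <p>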
O podemos configurar si se pueden leer datos de los nodos secundarios (en MongoDB solo hay un servidor principal, que es el único que acepta inserciones o modificaciones). Si permitimos leer de un nodo secundario mediante la replicación, sacrificamos consistencia, pero ganamos disponibilidad.</p> <h2 id="descarga_e_instalación"><a class="anchor" href="#descarga_e_instalación">¶</a>Descarga e instalación</h2> <h3 id="windows"><a class="anchor" href="#windows">¶</a>Windows</h3> <p>Descargar el archivo desde <a href="https://www.mongodb.com/download-center#production">https://www.mongodb.com/download-center#production</a></p> <ol> <li>Doble clic en el archivo <code>.msi</code></li> <li>El instalador de Windows lo guía a través del proceso de instalación. Si elige la opción de instalación personalizada, puede especificar un directorio de instalación. MongoDB no tiene ninguna otra dependencia del sistema. Puede instalar y ejecutar MongoDB desde cualquier carpeta que elija.</li> <li>Ejecutar el <code>.exe</code> que hemos instalado.</li> </ol> <h3 id="linux"><a class="anchor" href="#linux">¶</a>Linux</h3> <p>Abrimos una terminal y ejecutamos:</p> <pre><code>sudo apt-get update sudo apt install -y mongodb-org </code></pre> <p>Luego comprobamos el estado del servicio:</p> <pre><code>sudo systemctl start mongod sudo systemctl status mongod </code></pre> <p>Finalmente ejecutamos la base de datos con el comando:</p> <pre><code>sudo mongo </code></pre> <h3 id="macos"><a class="anchor" href="#macos">¶</a>macOS</h3> <p>Abrimos una terminal y ejecutamos:</p> <pre><code>brew update brew install mongodb </code></pre> <p>Iniciamos el servicio:</p> <pre><code>brew services start mongodb </code></pre> <h2 id="referencias"><a class="anchor" href="#referencias">¶</a>Referencias</h2> <ul> <li><a href="https://expertoenbigdata.com/que-es-mongodb/#La_arquitectura_de_MongoDB">Todo lo que debes saber sobre MongoDB</a></li> <li><a href="https://www.ecured.cu/MongoDB">MongoDB – EcuRed</a></li> <li><a href="https://mappinggis.com/2014/07/mongodb-y-gis/">Bases de datos NoSQL, MongoDB y GIS – MappingGIS</a></li> <li><a href="https://es.slideshare.net/maxfontana90/caractersticas-mongo-db">Características MONGO DB</a></li> <li><a href="https://openwebinars.net/blog/que-es-mongodb">Qué es MongoDB y características</a></li> <li><a href="https://www.genbeta.com/desarrollo/mongodb-que-es-como-funciona-y-cuando-podemos-usarlo-o-no">MongoDB. 
Qué es, cómo funciona y cuándo podemos usarlo (o no)</a></li> <li><a href="https://docs.mongodb.com/">MongoDB Documentation</a></li> <li><a href="https://www.genbeta.com/desarrollo/nosql-clasificacion-de-las-bases-de-datos-segun-el-teorema-cap">NoSQL: Clasificación de las bases de datos según el teorema CAP</a></li> </ul> </main> </body> </html> Cassandra: Operaciones Básicas y Arquitecturadist/cassandra-operaciones-basicas-y-arquitectura/index.html2020-03-19T23:00:00+00:002020-03-04T23:00:00+00:00Este es el segundo post en la serie sobre Cassandra, con una breve descripción de las operaciones básicas (tales como inserción, recuperación e indexado), y ejecución por completo junto con el modelo de datos y arquitectura.<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Cassandra: Operaciones Básicas y Arquitectura</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>Este es el segundo post en la serie sobre Cassandra, con una breve descripción de las operaciones básicas (tales como inserción, recuperación e indexado), y ejecución por completo junto con el modelo de datos y arquitectura.</p> <div class="date-created-modified">Created 2020-03-05<br> Modified 2020-03-20</div> <p>Otros posts en esta serie:</p> <ul> <li><a href="/blog/mdad/cassandra-introduccion/">Cassandra: Introducción</a></li> <li><a href="/blog/mdad/cassandra-operaciones-basicas-y-arquitectura/">Cassandra: Operaciones Básicas y Arquitectura</a> (este post)</li> </ul> <p>Este post está hecho en colaboración con un compañero.</p> <hr /> <p>Antes de poder ejecutar ninguna consulta, debemos lanzar la base de datos en caso de que no se encuentre en ejecución aún. Para ello, en una terminal, lanzamos el binario de <code>cassandra</code>:</p> <pre><code>$ cassandra-3.11.6/bin/cassandra </code></pre> <p>Sin cerrar esta consola, abrimos otra en la que podamos usar la <a href="https://cassandra.apache.org/doc/latest/tools/cqlsh.html">CQL shell</a>:</p> <pre><code>$ cassandra-3.11.6/bin/cqlsh Connected to Test Cluster at 127.0.0.1:9042. [cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4] Use HELP for help. cqlsh&gt; </code></pre> <h2 class="title" id="crear"><a class="anchor" href="#crear">¶</a>Crear</h2> <h3 id="crear_una_base_de_datos"><a class="anchor" href="#crear_una_base_de_datos">¶</a>Crear una base de datos</h3> <p>Cassandra denomina a las «bases de datos» como «espacio de claves» (keyspace en inglés).</p> <pre><code>cqlsh&gt; create keyspace helloworld with replication = {'class': 'SimpleStrategy', 'replication_factor': 3}; </code></pre> <p>Cuando creamos un nuevo <em>keyspace</em>, indicamos el nombre y la estrategia de replicación a usar. Nosotros usamos la estrategia simple con un factor 3 de replicación.</p> <h3 id="crear_una_tabla"><a class="anchor" href="#crear_una_tabla">¶</a>Crear una tabla</h3> <p>Una vez estemos dentro de un <em>keyspace</em>, podemos crear tablas. 
Vamos a crear una tabla llamada «greetings» con identificador (número entero), mensaje (texto) y lenguaje (<code>varchar</code>).</p> <pre><code>cqlsh&gt; use helloworld; cqlsh:helloworld&gt; create table greetings(id int primary key, message text, lang varchar); </code></pre> <h3 id="crear_una_fila"><a class="anchor" href="#crear_una_fila">¶</a>Crear una fila</h3> <p>Insertar nuevas filas es similar a otros sistemas gestores de datos, mediante la sentencia <code>INSERT</code>:</p> <pre><code>cqlsh:helloworld&gt; insert into greetings(id, message, lang) values(1, '¡Bienvenido!', 'es'); cqlsh:helloworld&gt; insert into greetings(id, message, lang) values(2, 'Welcome!', 'es'); </code></pre> <h2 id="leer"><a class="anchor" href="#leer">¶</a>Leer</h2> <p>La lectura se lleva a cabo mediante la sentencia <code>SELECT</code>:</p> <pre><code>cqlsh:helloworld&gt; select * from greetings; id | lang | message ----+------+-------------- 1 | es | ¡Bienvenido! 2 | es | Welcome! (2 rows) </code></pre> <p><code>cqlsh</code> colorea la salida, lo cuál resulta muy útil para identificar la clave primaria y distintos tipos de datos como texto, cadenas o números:</p> <p><img src="image.png" alt="" /></p> <h2 id="actualizar"><a class="anchor" href="#actualizar">¶</a>Actualizar</h2> <p>La actualización se lleva a cabo con la sentencia <code>UPDATE</code>. Vamos a arreglar el fallo que hemos cometido al insertar «Welcome!» como español:</p> <pre><code>cqlsh:helloworld&gt; update greetings set lang = 'en' where id = 2; </code></pre> <h2 id="indexar"><a class="anchor" href="#indexar">¶</a>Indexar</h2> <pre><code>cqlsh:helloworld&gt; create index langIndex on greetings(lang); </code></pre> <h2 id="borrar"><a class="anchor" href="#borrar">¶</a>Borrar</h2> <p>Finalmente, el borrado se lleva a cabo con la sentencia <code>DELETE</code>. 
Es posible borrar solo campos individuales, lo cuál los pone a nulos:</p> <pre><code>cqlsh:helloworld&gt; delete message from greetings where id = 1; </code></pre> <p>Para eliminar la fila entera, basta con no especificar la columna:</p> <pre><code>cqlsh:helloworld&gt; delete from greetings where id = 1; </code></pre> <h2 id="referencias"><a class="anchor" href="#referencias">¶</a>Referencias</h2> <ul> <li><a href="https://www.tutorialspoint.com/cassandra/cassandra_create_keyspace.htm">tutorialspoint – Creating a Keyspace using Cqlsh</a></li> <li><a href="https://www.tutorialspoint.com/cassandra/cassandra_cql_datatypes.htm">tutorialspoint – Cassandra – CQL Datatypes</a></li> <li><a href="https://www.tutorialspoint.com/cassandra/cassandra_create_table.htm">tutorialspoint – Cassandra – Create Table</a></li> <li><a href="https://data-flair.training/blogs/cassandra-crud-operation/">Data Flair – Cassandra Crud Operation – Create, Update, Read &amp; Delete</a></li> <li><a href="https://cassandra.apache.org/doc/latest/cql/indexes.html">Cassandra Documentation – Secondary Indexes</a></li> </ul> </main> </body> </html> Visualizing Cáceres’ OpenDatadist/visualizing-caceres-opendata/index.html2020-03-18T23:00:00+00:002020-03-08T23:00:00+00:00The city of Cáceres has online services to provide <!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Visualizing Cáceres’ OpenData</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>The city of Cáceres has online services to provide <a href="http://opendata.caceres.es/">Open Data</a> over a wide range of <a href="http://opendata.caceres.es/dataset">categories</a>, all of which are very interesting to explore!</p> <div class="date-created-modified">Created 2020-03-09<br> Modified 2020-03-19</div> <p>We have chosen two different datasets, and will explore four different ways to visualize the data.</p> <p>This post is co-authored with Classmate.</p> <h2 class="title" id="obtain_the_data"><a class="anchor" href="#obtain_the_data">¶</a>Obtain the data</h2> <p>We are interested in the JSON format for the <a href="http://opendata.caceres.es/dataset/informacion-del-padron-de-caceres-2017">census in 2017</a> and those for the <a href="http://opendata.caceres.es/dataset/vias-urbanas-caceres">vias of the city</a>. This way, we can explore the population and their location in interesting ways! You may follow those two links and select the JSON format under Resources to download it.</p> <p>Why JSON? 
We will be using <a href="https://python.org/">Python</a> (3.7 or above) and <a href="https://matplotlib.org/">matplotlib</a> for quick iteration, and loading the data with <a href="https://docs.python.org/3/library/json.html">Python’s <code>json</code> module</a> will be trivial.</p> <h2 id="implementation"><a class="anchor" href="#implementation">¶</a>Implementation</h2> <h3 id="imports_and_constants"><a class="anchor" href="#imports_and_constants">¶</a>Imports and constants</h3> <p>We are going to need a lot of things in this code, such as <code>json</code> to load the data, <code>matplotlib</code> to visualize it, and other data types and type hinting for use in the code.</p> <p>We also want automatic download of the JSON files if they’re missing, so we add their URLs and download paths as constants.</p> <pre><code>import json import re import os import sys import urllib.request import matplotlib.pyplot as plt from dataclasses import dataclass from collections import namedtuple from datetime import date from pathlib import Path from typing import Optional CENSUS_URL = 'http://opendata.caceres.es/GetData/GetData?dataset=om:InformacionCENSUS&amp;year=2017&amp;format=json' VIAS_URL = 'http://opendata.caceres.es/GetData/GetData?dataset=om:Via&amp;format=json' CENSUS_JSON = Path('data/demografia/Padrón_Cáceres_2017.json') VIAS_JSON = Path('data/via/Vías_Cáceres.json') </code></pre> <h3 id="data_classes"><a class="anchor" href="#data_classes">¶</a>Data classes</h3> <p><a href="https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/">Parse, don’t validate</a>. By defining a clear data model, we will be able to tell at a glance what information we have available. It will also be typed, so we won’t be confused as to what is what! Python 3.7 introduces <code>[dataclasses](https://docs.python.org/3/library/dataclasses.html)</code>, which are a wonderful feature to define… well, data classes concisely.</p> <p>We also have a <code>[namedtuple](https://docs.python.org/3/library/collections.html#collections.namedtuple)</code> for points, because it’s extremely common to represent them as tuples.</p> <pre><code>Point = namedtuple('Point', 'long lat') @dataclass class Census: year: int via: int count_per_year: dict count_per_city: dict count_per_gender: dict count_per_nationality: dict time_year: int @dataclass class Via: name: str kind: str code: int history: Optional[str] old_name: Optional[str] length: Optional[float] start: Optional[Point] middle: Optional[Point] end: Optional[Point] geometry: Optional[list] </code></pre> <h3 id="helper_methods"><a class="anchor" href="#helper_methods">¶</a>Helper methods</h3> <p>We will have a little helper method to automatically download the JSON when missing. This is just for convenience, we could as well just download it manually. But it is fun to automate things.</p> <pre><code>def ensure_file(file, url): if not file.is_file(): print('Downloading', file.name, 'because it was missing...', end='', flush=True, file=sys.stderr) file.parent.mkdir(parents=True, exist_ok=True) urllib.request.urlretrieve(url, file) print(' Done.', file=sys.stderr) </code></pre> <h3 id="parsing_the_data"><a class="anchor" href="#parsing_the_data">¶</a>Parsing the data</h3> <p>I will be honest, parsing Cáceres’ OpenData is a pain in the neck! The official descriptions are huge and not all that helpful. Maybe if one needs documentation for a specific field. 
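</p> <p>Before writing the real parser, it helps to load a single record and print it to see which fields are actually there. This is only a throwaway exploration snippet (not part of the original project code), assuming the census file from above has already been downloaded:</p> <pre><code>import json

with CENSUS_JSON.open() as fd:
    data = json.load(fd)

row = data['results']['bindings'][0]
print(sorted(row.keys()))
print(row['schema_birthDate']['value'][:80])
</code></pre> <p>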
But luckily for us, the names are pretty self-descriptive, and we can explore the data to get a feel for what we will find.</p> <p>We define two methods, one to iterate over <code>Census</code> values, and another to iterate over <code>Via</code> values. Here’s where our friend <code>[re](https://docs.python.org/3/library/re.html)</code> comes in, and oh boy the format of the data…</p> <p>For example, the year and via identifier are best extracted from the URI! The information is also available in the <code>rdfs_label</code> field, but that’s just a Spanish text! At least the URI will be more reliable… hopefully.</p> <p>Birth date. They could have used a JSON list, but nah, that would’ve been too simple. Instead, you are given a string separated by semicolons. The values? They could have been dictionaries with names for «year» and «age», but nah! That would’ve been too simple! Instead, you are given strings that look like «2001 (7)», and that’s the year and the count.</p> <p>The birth place? Sometimes it’s «City (Province) (Count)», but sometimes the province is missing. Gender? Semicolon-separated. And there are only two genders. I know a few people who would be upset just reading this, but it’s not my data, it’s theirs. Oh, and plenty of things are optional. That was a lot of <code>AttributeError: 'NoneType' object has no attribute 'foo'</code> to work through!</p> <p>But as a reward, we have nicely typed data, and we no longer have to deal with this mess when trying to visualize it. For brevity, we will only be showing how to parse the census data, and not the data for the vias. This post is already long enough on its own.</p> <pre><code>def iter_census(file): with file.open() as fd: data = json.load(fd) for row in data['results']['bindings']: year, via = map(int, row['uri']['value'].split('/')[-1].split('-')) count_per_year = {} for item in row['schema_birthDate']['value'].split(';'): y, c = map(int, re.match(r'(\d+) \((\d+)\)', item).groups()) count_per_year[y] = c count_per_city = {} for item in row['schema_birthPlace']['value'].split(';'): match = re.match(r'([^(]+) \(([^)]+)\) \((\d+)\)', item) if match: l, _province, c = match.groups() else: l, c = re.match(r'([^(]+) \((\d+)\)', item).groups() count_per_city[l] = int(c) count_per_gender = {} for item in row['foaf_gender']['value'].split(';'): g, c = re.match(r'([^(]+) \((\d+)\)', item).groups() count_per_gender[g] = int(c) count_per_nationality = {} for item in row['schema_nationality']['value'].split(';'): match = re.match(r'([^(]+) \((\d+)\)', item) if match: g, c = match.groups() else: g, _alt_name, c = re.match(r'([^(]+) \(([^)]+)\) \((\d+)\)', item).groups() count_per_nationality[g] = int(c) time_year = int(row['time_year']['value']) yield Census( year=year, via=via, count_per_year=count_per_year, count_per_city=count_per_city, count_per_gender=count_per_gender, count_per_nationality=count_per_nationality, time_year=time_year, ) </code></pre> <h2 id="visualizing_the_data"><a class="anchor" href="#visualizing_the_data">¶</a>Visualizing the data</h2> <p>Here comes the fun part! 
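</p> <p>The aggregation step that turns the parsed <code>Census</code> objects into the dictionaries the plots consume is not shown in the post, so here is a minimal sketch of how the per-gender and per-year tallies could be built. The names <code>genders</code> and <code>years</code> match the plotting calls further down; everything else is an assumption:</p> <pre><code>ensure_file(CENSUS_JSON, CENSUS_URL)

genders = {}  # gender name, mapped to total people
years = {}    # birth year, mapped to total people
for census in iter_census(CENSUS_JSON):
    for gender, count in census.count_per_gender.items():
        genders[gender] = genders.get(gender, 0) + count
    for year, count in census.count_per_year.items():
        years[year] = years.get(year, 0) + count
</code></pre> <p>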
After parsing all the desired data from the mentioned JSON files, we plotted the data in four different graphics making use of Python’s <a href="https://matplotlib.org/"><code>matplotlib</code> library.</a> This powerful library helps with the creation of different visualizations in Python.</p> <h3 id="visualizing_the_genders_in_a_pie_chart"><a class="anchor" href="#visualizing_the_genders_in_a_pie_chart">¶</a>Visualizing the genders in a pie chart</h3> <p>After seeing that there are only two genders in the data of the census, we, displeased, started work in a chart for it. The pie chart was the best option since we wanted to show only the percentages of each gender. The result looks like this:</p> <p><img src="pie_chart.png" alt="" /></p> <p>Pretty straight forward, isn’t it? To display this wonderful graphic, we used the following code:</p> <pre><code>def pie_chart(ax, data): lists = sorted(data.items()) x, y = zip(*lists) ax.pie(y, labels=x, autopct='%1.1f%%', shadow=True, startangle=90) ax.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle. </code></pre> <p>We pass the axis as the input parameter (later we will explain why) and the data collected from the JSON regarding the genders, which are in a dictionary with the key being the labels and the values the tally of each gender. We sort the data and with some unpacking magic we split it into two values: <code>x</code> being the labels and <code>y</code> the amount of each gender.</p> <p>After that we plot the pie chart with the data and labels from <code>y</code> and <code>x</code>, we specify that we want the percentage with one decimal place with the <code>autopct</code> parameter, we enable shadows for the presentation, and specify the start angle at 90º.</p> <h3 id="date_tick_labels"><a class="anchor" href="#date_tick_labels">¶</a>Date tick labels</h3> <p>We wanted to know how many of the living people were born on each year, so we are making a date plot! In the census we have the year each person was born in, and using that information is an easy task after parsing the data (parsing was an important task of this work). The result looks as follows:</p> <p><img src="date_tick.png" alt="" /></p> <p>How did we do this? The following code was used:</p> <pre><code>def date_tick(ax, data): lists = sorted(data.items()) x, y = zip(*lists) x = [date(year, 1, 1) for year in x] ax.plot(x, y) </code></pre> <p>Again, we pass in an axis and the data related with the year born, we sort it, split it into two lists, being the keys the years and the values the number per year. After that, we put the years in a date format for the plot to be more accurate. Finally, we plot the values into that wonderful graphic.</p> <h3 id="stacked_bar_chart"><a class="anchor" href="#stacked_bar_chart">¶</a>Stacked bar chart</h3> <p>We wanted to know if there was any relation between the latitudes and count per gender, so we developed the following code:</p> <pre><code>def stacked_bar_chart(ax, data): labels = [] males = [] females = [] for latitude, genders in data.items(): labels.append(str(latitude)) males.append(genders['Male']) females.append(genders['Female']) ax.bar(labels, males, label='Males') ax.bar(labels, females, bottom=males, label='Females') ax.set_ylabel('Counts') ax.set_xlabel('Latitudes') ax.legend() </code></pre> <p>The key of the data dictionary is the latitude rounded to two decimals, and value is another dictionary, which is composed by the key that is the name of the gender and the value, the number of people per gender. 
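</p> <p>The post never shows how this dictionary is assembled, since that requires joining each census entry with the coordinates of its via. A rough sketch of one way to do it, assuming the project’s (unshown) <code>iter_vias</code> helper yields <code>Via</code> objects whose <code>code</code> matches <code>Census.via</code>, could be:</p> <pre><code>vias_by_code = {via.code: via for via in iter_vias(VIAS_JSON)}

latitudes = {}  # rounded latitude, mapped to a per-gender tally
for census in iter_census(CENSUS_JSON):
    via = vias_by_code.get(census.via)
    if via is None or via.middle is None:
        continue
    key = round(via.middle.lat, 2)
    tally = latitudes.setdefault(key, {'Male': 0, 'Female': 0})
    for gender, count in census.count_per_gender.items():
        tally[gender] = tally.get(gender, 0) + count
</code></pre> <p>The <code>'Male'</code> and <code>'Female'</code> keys simply mirror what <code>stacked_bar_chart</code> expects.</p> <p>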
So, in a single entry of the data dictionary we have the latitude and how many people per gender are in that latitude.</p> <p>We iterate the dictionary to extract the different latitudes and people per gender (because we know only two genders are used, we hardcode it to two lists). Then we plot them putting the <code>males</code> and <code>females</code> lists at the bottom and set the labels of each axis. The result is the following:</p> <p><img src="stacked_bar_chart-1.png" alt="" /></p> <h3 id="scatter_plots"><a class="anchor" href="#scatter_plots">¶</a>Scatter plots</h3> <p>This last graphic was very tricky to get right. It’s incredibly hard to find the extent of a city online! We were getting confused because some of the points were way farther than the centre of Cáceres, and the city background is a bit stretched even if the coordinates appear correct. But in the end, we did a pretty good job on it.</p> <pre><code>def scatter_map(ax, data): xs = [] ys = [] areas = [] for (long, lat), count in data.items(): xs.append(long) ys.append(lat) areas.append(count / 100) if CACERES_MAP.is_file(): ax.imshow(plt.imread(str(CACERES_MAP)), extent=CACERES_EXTENT) else: print('Note:', CACERES_MAP, 'does not exist, not showing it', file=sys.stderr) ax.scatter(xs, ys, areas, alpha=0.1) </code></pre> <p>This time, the keys in the data dictionary are points and the values are the total count of people in that point. We use a normal <code>for</code> loop to create the different lists. For the areas on how big the circles we are going to represent will be, we divide the count of people by some number, like <code>100</code>, or otherwise they would be huge.</p> <p>If the file of the map is present, we render it so that we can get a sense on where the points are, but if the file is missing we print a warning.</p> <p>At last, we draw the scatter plot with some low alpha value (there’s a lot of overlapping points). The result is <em>absolutely gorgeous</em>. (For some definitions of gorgeous, anyway):</p> <p><img src="scatter_map.png" alt="" /></p> <p>Just for fun, here’s what it looks like if we don’t divide the count by 100 and lower the opacity to <code>0.01</code>:</p> <p><img src="scatter_map-2.png" alt="" /></p> <p>That’s a big solid blob, and the opacity is only set to <code>0.01</code>!</p> <h3 id="drawing_all_the_graphs_in_the_same_window"><a class="anchor" href="#drawing_all_the_graphs_in_the_same_window">¶</a>Drawing all the graphs in the same window</h3> <p>To draw all the graphs in the same window instead of getting four different windows we made use of the <a href="https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.subplots.html"><code>subplots</code> function</a>, like this:</p> <pre><code>fig, axes = plt.subplots(2, 2) </code></pre> <p>This will create a matrix of two by two of axes that we store in the axes variable (fitting name!). 
Following this code are the different calls to the methods commented before, where we access each individual axis and pass it to the methods to draw on:</p> <pre><code>pie_chart(axes[0, 0], genders) date_tick(axes[0, 1], years) stacked_bar_chart(axes[1, 0], latitudes) scatter_map(axes[1, 1], positions) </code></pre> <p>Lastly, we plot the different graphics:</p> <pre><code>plt.show() </code></pre> <p>Wrapping everything together, here’s the result:</p> <p><img src="figures-1.png" alt="" /></p> <p>The numbers in some of the graphs are a bit crammed together, but we’ll blame that on <code>matplotlib</code>.</p> <h2 id="closing_words"><a class="anchor" href="#closing_words">¶</a>Closing words</h2> <p>Wow, that was a long journey! We hope that this post helped you pick some interest in data exploration, it’s such a fun world. We also offer the full download for the code below, because we know it’s quite a bit!</p> <p>Which of the graphs was your favourite? I personally like the count per date, I think it’s nice to see the growth. Let us know in the comments below!</p> <p><em>download removed</em></p> </main> </body> </html> What is an algorithm?dist/what-is-an-algorithm/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00Algorithms are a sequence of instructions that can be followed to achieve <!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>What is an algorithm?</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>Algorithms are a sequence of instructions that can be followed to achieve <em>something</em>. That something can be anything, and depends entirely on your problem!</p> <div class="date-created-modified">Created 2020-02-25<br> Modified 2020-03-18</div> <p>For example, a recipe to cook some really nice food is an algorithm: it guides you, step by step, to cook something nice. People dealing with mathemathics also apply algorithms to transform their data. And computers <em>love</em> algorithms, too!</p> <p>In reality, any computer program can basically be thought as an algorithm. It contains a series of instructions for the computer to execute. Running them is a process that takes time, consumes input and produces output. This is also why terms like «procedure» come up when talking about them.</p> <p>Computer programs (their algorithms) are normally written in some more specific language, like Java or Python. The instructions are very clear here, which is what we need! A natural language like English is a lot harder to process, and ambiguous. 
I’m sure you’ve been in arguments because the other person didn’t understand you!</p> <h2 class="title" id="references"><a class="anchor" href="#references">¶</a>References</h2> <ul> <li>algorithm – definition and meaning: <a href="https://www.wordnik.com/words/algorithm">https://www.wordnik.com/words/algorithm</a></li> <li>Algorithm: <a href="https://en.wikipedia.org/wiki/Algorithm">https://en.wikipedia.org/wiki/Algorithm</a></li> <li>What is a «computer algorithm»?: <a href="https://computer.howstuffworks.com/what-is-a-computer-algorithm.htm">https://computer.howstuffworks.com/what-is-a-computer-algorithm.htm</a></li> </ul> </main> </body> </html> Introduction to NoSQLdist/introduction-to-nosql/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00This post will primarily focus on the talk held in the <!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Introduction to NoSQL</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>This post will primarily focus on the talk held in the <a href="https://youtu.be/qI_g07C_Q5I">GOTO 2012 conference: Introduction to NoSQL by Martin Fowler</a>. It can be seen as an informal, summarized transcript of the talk.</p> <div class="date-created-modified">Created 2020-02-25<br> Modified 2020-03-18</div> <hr /> <p>The relational database model is affected by the <em><a href="https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch">impedance mismatch problem</a></em>. This occurs because we have to match our high-level design with the separate columns and rows used by relational databases.</p> <p>Taking the in-memory objects and putting them into a relational database (which were dominant at the time) simply didn’t work out. Why? Relational databases were more than just databases; they served as an integration mechanism across applications, up to the 2000s. For 20 years!</p> <p>With the rise of the Internet and the sheer amount of traffic, databases needed to scale. Unfortunately, relational databases only scale well vertically (by upgrading a <em>single</em> node). This is <em>very</em> expensive, and not something many could afford.</p> <p>The problem is those pesky <code>JOIN</code>s and their friend <code>GROUP BY</code>. Because our program and reality model don’t match the tables used by SQL, we have to rely on them to query the data; the model simply doesn’t map directly.</p> <p>Furthermore, graphs don’t map very well at all to relational models.</p> <p>We needed a way to scale horizontally (by increasing the <em>number</em> of nodes), something relational databases were not designed to do.</p> <blockquote> <p><em>We need to do something different, relational across nodes is an unnatural act</em></p> </blockquote> <p>This inspired the NoSQL movement.</p> <blockquote> <p><em>#nosql was only meant to be a hashtag to advertise it, but unfortunately it’s how it is called now</em></p> </blockquote> <p>It is not possible to define NoSQL precisely, but we can identify some of its characteristics:</p> <ul> <li>Non-relational</li> <li><strong>Cluster-friendly</strong> (this was the original spark)</li> <li>Open-source (until now, generally)</li> <li>21st century web culture</li> <li>Schema-less (easier integration or conjugation of several models, structure aggregation)</li> </ul> <p>These databases use different data models to those used by the relational model.
However, it is possible to identify 4 broad chunks (some may say 3, or even 2!):</p> <ul> <li><strong>Key-value store</strong>. With a certain key, you obtain the value corresponding to it. It knows nothing else, nor does it care. We say the data is opaque.</li> <li><strong>Document-based</strong>. It stores an entire mass of documents with complex structure, normally through the use of JSON (XML has been left behind). Then, you can ask for certain fields, structures, or portions. We say the data is transparent.</li> <li><strong>Column-family</strong>. There is a «row key», and within it we store multiple «column families» (columns that fit together, our aggregate). We access by row-key and column-family name.</li> </ul> <p>All of these kinds serve to store documents without any <em>explicit</em> schema. Just shove in anything! This gives a lot of flexibility and ease of migration, except… that’s not really true. There’s an <em>implicit</em> schema when querying.</p> <p>For example, a query where we may do <code>anOrder['price'] * anOrder['quantity']</code> is assuming that <code>anOrder</code> has both a <code>price</code> and a <code>quantity</code>, and that both of these can be multiplied together. «Schema-less» is a fuzzy term.</p> <p>However, it is the lack of a <em>fixed</em> schema that gives flexibility.</p> <p>One could argue that the line between key-value and document-based is very fuzzy, and they would be right! Key-value databases often let you include additional metadata that behaves like an index, and in document-based databases, documents often have an identifier anyway.</p> <p>The common notion between these three types is what matters. They save an entire structure as a <em>unit</em>. We can refer to these as «Aggregate Oriented Databases»: aggregate, because we group things when designing or modeling our systems, as opposed to relational databases that scatter the information across many tables.</p> <p>There exists a notable outlier, though, and that’s:</p> <ul> <li><strong>Graph</strong> databases. They use a node-and-arc graph structure. They are great for moving along relationships between things. Ironically, relational databases are not very good at jumping across relationships! It is possible to perform very interesting queries in graph databases which would be really hard and costly on relational models. Unlike the aggregate-oriented databases, graphs break things into even smaller units.</li> </ul> <p>NoSQL is not <em>the</em> solution. It depends on how you’ll work with your data. Do you need an aggregate database? Will you have a lot of relationships? Or would the relational model be a good fit for you?</p> <p>NoSQL, however, is a good fit for large-scale projects (data will <em>always</em> grow) and faster development (the impedance mismatch is drastically reduced).</p> <p>Regardless of our choice, it is important to remember that NoSQL is a young technology, which is still evolving really fast (SQL has been stable for <em>decades</em>). But the <em>polyglot persistence</em> is what matters: one must know the alternatives, and be able to choose.</p> <hr /> <p>Relational databases have the well-known ACID properties: Atomicity, Consistency, Isolation and Durability.</p> <p>NoSQL databases (except graph-based ones!) are about being BASE instead: Basically Available, Soft state, Eventual consistency.</p> <p>SQL needs transactions because we don’t want to perform a read while we’re only half-way done with a write!
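</p> <p>As a toy illustration of that transaction boundary, here is a minimal sketch using SQLite (chosen only because it ships with Python; the talk is not about any particular engine, and the table is made up):</p> <pre><code>import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)')
conn.executemany('INSERT INTO accounts VALUES (?, ?)', [(1, 500), (2, 0)])

# The two updates form one transaction: they are committed together,
# or rolled back together if anything inside the block raises.
with conn:
    conn.execute('UPDATE accounts SET balance = balance - 100 WHERE id = 1')
    conn.execute('UPDATE accounts SET balance = balance + 100 WHERE id = 2')
</code></pre> <p>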
The readers and writers are the problem, and ensuring consistency results in a performance hit, even if the risk is low (two writers are extremely rare, but it still must be handled).</p> <p>NoSQL on the other hand doesn’t need ACID because the aggregate <em>is</em> the transaction boundary. Even before NoSQL itself existed! Any update is atomic by nature. When updating many documents it <em>is</em> a problem, but this is very rare.</p> <p>We have to distinguish between logical and replication consistency. If a conflict occurs during an update, it must be resolved to preserve logical consistency. Replication consistency, on the other hand, is preserved when distributing the data across many machines, for example during sharding or replication.</p> <p>Replication buys us more processing power and resilience (at the cost of more storage) in case some of the nodes die. But what happens if what dies is the communication across the nodes? We could drop the requests and preserve the consistency, or accept the risk of continuing and instead preserve the availability.</p> <p>The choice of whether trading consistency for availability is acceptable depends on the domain rules. It is the domain’s choice; the business people will choose. If you’re Amazon, you always want to be able to sell, but if you’re a bank, you probably don’t want your clients to have negative numbers in their account!</p> <p>Regardless of what we do, in a distributed system, the CAP theorem always applies: Consistency, Availability and Partition tolerance (tolerance to network failures). It is <strong>impossible</strong> to guarantee all 3 at 100%. Most of the time it does work, but it is mathematically impossible to guarantee at 100%.</p> <p>A database has to choose what to give up at some point. When designing a distributed system, this must be considered. Normally, the choice is made between consistency and response time.</p> <h2 class="title" id="further_reading"><a class="anchor" href="#further_reading">¶</a>Further reading</h2> <ul> <li><a href="https://www.martinfowler.com/articles/nosql-intro-original.pdf">The future is: <del>NoSQL Databases</del> Polyglot Persistence</a></li> <li><a href="https://www.thoughtworks.com/insights/blog/nosql-databases-overview">NoSQL Databases: An Overview</a></li> </ul> </main> </body> </html> Big Datadist/big-data/index.html2020-03-17T23:00:00+00:002020-02-24T23:00:00+00:00Big Data sounds like a buzzword you may be hearing everywhere, but it’s actually here to stay!<!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>Big Data</title> <link rel="stylesheet" href="../css/style.css"> </head> <body> <main> <p>Big Data sounds like a buzzword you may be hearing everywhere, but it’s actually here to stay!</p> <div class="date-created-modified">Created 2020-02-25<br> Modified 2020-03-18</div> <h2 class="title" id="what_is_big_data_"><a class="anchor" href="#what_is_big_data_">¶</a>What is Big Data?</h2> <p>And why is it so important? We use this term to refer to the large amount of data available, rapidly growing every day, that cannot be processed in conventional ways.
It’s not only about the amount, it’s also about the variety and rate of growth.</p> <p>Thanks to technological advancements, there are new ways to process this insane amount of data, which would otherwise be too costly for processing in traditional database systems.</p> <h2 id="where_does_data_come_from_"><a class="anchor" href="#where_does_data_come_from_">¶</a>Where does data come from?</h2> <p>It can be pictures in your phone, industry transactions, messages in social networks, a sensor in the mountains. It can come from anywhere, which makes the data very varied.</p> <p>Just to give some numbers, over 12TB of data is generated on Twitter <em>daily</em>. If you purchase a laptop today (as of March 2020), the disk will be roughly 1TB, maybe 2TB. Twitter would fill 6 of those drives every day!</p> <p>What about Facebook? It is estimated they store around 100PB of photos and videos. That would be 50000 laptop disks. Not a small number. And let’s not talk about worldwide network traffic…</p> <h2 id="what_data_can_be_exploited_"><a class="anchor" href="#what_data_can_be_exploited_">¶</a>What data can be exploited?</h2> <p>So, we have a lot of data. Should we attempt and process everything? We can distinguish several categories.</p> <ul> <li><strong>Web and Social Media</strong>: Clickstream Data, Twitter Feeds, Facebook Postings, Web content… Stuff coming from social networks.</li> <li><strong>Biometrics</strong>: Facial Recognion, Genetics… Any kind of personal recognition.</li> <li><strong>Machine-to-Machine</strong>: Utility Smart Meter Readings, RFID Readings, Oil Rig Sensor Readings, GPS Signals… Any sensor shared with other machines.</li> <li><strong>Human Generated</strong>: Call Center Voice Recordings, Email, Electronic Medical Records… Even the voice notes one sends over WhatsApp count.</li> <li><strong>Big Transaction Data</strong>: Healthcare Claims, Telecommunications Call Detail Records, Utility Billing Records… Financial transactions.</li> </ul> <p>But asking what to process is asking the wrong question. Instead, one should think about «What problem am I trying to solve?».</p> <h2 id="how_to_exploit_this_data_"><a class="anchor" href="#how_to_exploit_this_data_">¶</a>How to exploit this data?</h2> <p>What are some of the ways to deal with this data? If the problem fits the Map-Reduce paradigm then Hadoop is a great option! Hadoop is inspired by Google File System (GFS), and achieves great parallelism across the nodes of a cluster, and has the following components:</p> <ul> <li><strong>Hadoop Distributed File System</strong>. Data is divided into smaller «blocks» and distributed across the cluster, which makes it possible to execute the mapping and reduction in smaller subsets, and makes it possible to scale horizontally.</li> <li><strong>Hadoop MapReduce</strong>. First, a data set is «mapped» into a different set, and data becomes a list of tuples (key, value). The «reduce» step works on these tuples and combines them into a smaller subset.</li> <li><strong>Hadoop Common</strong>. These are a set of libraries that ease working with Hadoop.</li> </ul> <h2 id="key_insights"><a class="anchor" href="#key_insights">¶</a>Key insights</h2> <p>Big Data is a field whose goal is to extract information from very large sets of data, and find ways to do so. To summarize its different dimensions, we can refer to what’s known as «the Four V’s of Big Data»:</p> <ul> <li><strong>Volume</strong>. Really large quantities.</li> <li><strong>Velocity</strong>. 
Processing response time matters!</li> <li><strong>Variety</strong>. Data comes from plenty of sources.</li> <li><strong>Veracity.</strong> Can we trust all sources, though?</li> </ul> <p>Some sources talk about a fifth V for <strong>Value</strong>; because processing this data is costly, it is important we can get value out of it.</p> <p>…And some other sources go as high as seven V’s, including <strong>Viability</strong> and <strong>Visualization</strong>. Computers can’t take decissions on their own (yet), a human has to. And they can only do so if they’re presented the data (and visualize it) in a meaningful way.</p> <h2 id="infographics"><a class="anchor" href="#infographics">¶</a>Infographics</h2> <p>Let’s see some pictures, we all love pictures:</p> <p><img src="4-Vs-of-big-data.jpg" alt="" /></p> <h2 id="common_patterns"><a class="anchor" href="#common_patterns">¶</a>Common patterns</h2> <h2 id="references"><a class="anchor" href="#references">¶</a>References</h2> <ul> <li>¿Qué es Big Data? – <a href="https://www.ibm.com/developerworks/ssa/local/im/que-es-big-data/">https://www.ibm.com/developerworks/ssa/local/im/que-es-big-data/</a></li> <li>The Four V’s of Big Data – <a href="https://www.ibmbigdatahub.com/infographic/four-vs-big-data">https://www.ibmbigdatahub.com/infographic/four-vs-big-data</a></li> <li>Big data – <a href="https://en.wikipedia.org/wiki/Big_data">https://en.wikipedia.org/wiki/Big_data</a></li> <li>Las 5 V’s del Big Data – <a href="https://www.quanticsolutions.es/big-data/las-5-vs-del-big-data">https://www.quanticsolutions.es/big-data/las-5-vs-del-big-data</a></li> <li>Las 7 V del Big data: Características más importantes – <a href="https://www.iic.uam.es/innovacion/big-data-caracteristicas-mas-importantes-7-v/#viabilidad">https://www.iic.uam.es/innovacion/big-data-caracteristicas-mas-importantes-7-v/</a></li> </ul> </main> </body> </html>