<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Upgrading our Baby Crawler</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p>In the <a href="/blog/ribw/build-your-own-pc/">last post of this series</a>, we presented the code for our Personal Crawler. However, we didn’t quite explain what a crawler even is! We will take this opportunity to go a bit more in-depth and make some upgrades to it.</p>
<div class="date-created-modified">Created 2020-03-11<br>
Modified 2020-03-18</div>
<h2 class="title" id="what_is_a_crawler_"><a class="anchor" href="#what_is_a_crawler_">¶</a>What is a Crawler?</h2>
<p>A crawler is a program whose job is to analyze documents and extract data from them. For example, search engines like <a href="http://duckduckgo.com/">DuckDuckGo</a>, <a href="https://bing.com/">Bing</a> or <a href="http://google.com/">Google</a> all have crawlers that analyze websites and build a database around them. They act as a kind of «tracker», because they keep track of everything they find.</p>
<p>Their basic behaviour can be described as follows: given a starting list of URLs, follow them all and identify the hyperlinks inside the documents. Add these to the list of links to follow, and repeat <em>ad infinitum</em>.</p>
<ul>
<li>This lets us create an index to quickly search across them all.</li>
<li>We can also identify broken links.</li>
<li>We can gather any other type of information that we find.</li>
</ul>
<p>Our crawler will work offline, within our own computer, scanning the text documents it finds under the root we tell it to scan.</p>
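<p>The loop described above can be sketched in a few lines. This is only an illustration, not the code from the download: the class <code>CrawlLoopSketch</code> and its members are hypothetical names, and an in-memory map stands in for actually fetching and parsing documents.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the crawl loop over an in-memory link graph. */
final class CrawlLoopSketch {
    /** Documents that were reached, in visiting order. */
    final Set<String> visited = new LinkedHashSet<>();
    /** Links that point at documents which do not exist. */
    final Set<String> broken = new LinkedHashSet<>();

    /**
     * @param documents maps each document name to the links found inside it
     *                  (an assumption standing in for real fetching and parsing)
     * @param seeds     the starting list of documents to follow
     */
    void crawl(Map<String, List<String>> documents, List<String> seeds) {
        Deque<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String current = frontier.poll();
            if (visited.contains(current) || broken.contains(current)) {
                continue; // already handled, do not follow it twice
            }
            List<String> links = documents.get(current);
            if (links == null) {
                broken.add(current); // nothing to analyze: a broken link
                continue;
            }
            visited.add(current);
            frontier.addAll(links); // newly found links join the list to follow
        }
    }
}
```

<p>Remembering what has already been visited is what keeps the «repeat <em>ad infinitum</em>» step from looping forever when two documents link to each other.</p>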
<h2 id="design_decissions"><a class="anchor" href="#design_decissions">¶</a>Design Decisions</h2>
<ul>
<li>We will use Java. Its runtime is quite ubiquitous, so it should be able to run virtually anywhere. The language is statically typed, which helps catch errors early on.</li>
<li>Our solution is iterative. While recursion can be seen as more elegant by some, iterative solutions are often more performant with less need for optimization.</li>
</ul>
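<p>To illustrate the iterative approach, here is a minimal sketch of a directory scan that keeps its own stack of pending folders instead of calling itself. The class and method names are hypothetical, and treating only <code>.txt</code> files as text documents is an assumption made for the example.</p>

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: scanning a root folder without recursion. */
final class IterativeWalkSketch {
    /** Returns every .txt file found under root (the extension filter is an assumption). */
    static List<Path> findTextFiles(Path root) throws IOException {
        List<Path> found = new ArrayList<>();
        Deque<Path> pending = new ArrayDeque<>(); // explicit stack replaces the call stack
        pending.push(root);
        while (!pending.isEmpty()) {
            Path dir = pending.pop();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.isDirectory(entry)) {
                        pending.push(entry); // visit this folder in a later iteration
                    } else if (entry.getFileName().toString().endsWith(".txt")) {
                        found.add(entry);
                    }
                }
            }
        }
        return found;
    }
}
```

<p>Because the pending folders live in a heap-allocated <code>ArrayDeque</code>, very deep folder hierarchies cannot overflow the call stack.</p>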
<h2 id="requirements"><a class="anchor" href="#requirements">¶</a>Requirements</h2>
<p>If you don’t have Java installed yet, you can <a href="https://java.com/en/download/">Download Free Java Software</a> from Oracle’s site. To compile the code, the <a href="https://www.oracle.com/java/technologies/javase-jdk8-downloads.html">Java Development Kit</a> is also necessary.</p>
<p>We don’t depend on any other external libraries, for easier deployment and compilation.</p>
<h2 id="implementation"><a class="anchor" href="#implementation">¶</a>Implementation</h2>
<p>Because the code was getting pretty large, it has been split into several files, and we have also upgraded it to use a Graphical User Interface instead! We decided to use Swing, based on the Java tutorial <a href="https://docs.oracle.com/javase/tutorial/uiswing/">Creating a GUI With JFC/Swing</a>.</p>
<h3 id="app"><a class="anchor" href="#app">¶</a>App</h3>
<p>This file is the entry point of our application. Its job is to initialize the components, lay them out in the main panel, and connect the event handlers.</p>
<p>Most widgets are pretty standard, and are defined as class variables. However, some variables are notable. The <a href="https://docs.oracle.com/javase/8/docs/api/javax/swing/table/DefaultTableModel.html"><code>DefaultTableModel</code></a> is used because it allows rows to be <a href="https://stackoverflow.com/a/22550106">added dynamically</a>, and we also have a <a href="https://docs.oracle.com/javase/8/docs/api/javax/swing/SwingWorker.html"><code>SwingWorker</code></a> subclass responsible for performing the word analysis (which is quite CPU intensive and should not be run in the UI thread!).</p>
<p>There are a few utility methods to ease some common operations, such as <code>updateStatus</code>, which changes the status label in the main window to inform the user of the latest changes.</p>
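<p>As a rough idea of how these two classes cooperate, the sketch below counts words on a background thread and only touches the table model once the work is done. <code>AnalysisWorkerSketch</code> and <code>countWords</code> are hypothetical names; the worker in the download may be organised differently.</p>

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import javax.swing.SwingWorker;
import javax.swing.table.DefaultTableModel;

/** Hypothetical sketch of a worker that fills a table once the analysis ends. */
final class AnalysisWorkerSketch extends SwingWorker<Map<String, Integer>, Void> {
    private final List<String> words;
    private final DefaultTableModel model;

    AnalysisWorkerSketch(List<String> words, DefaultTableModel model) {
        this.words = words;
        this.model = model;
    }

    /** The CPU-heavy part, kept free of any Swing calls. */
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    @Override
    protected Map<String, Integer> doInBackground() {
        return countWords(words); // runs on a background thread, not the UI thread
    }

    @Override
    protected void done() {
        // Runs on the event dispatch thread, so updating the model here is safe.
        try {
            get().forEach((word, count) -> model.addRow(new Object[] {word, count}));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            // A real application would report e.getCause() to the user here.
        }
    }
}
```

<p>It would be started with something like <code>new AnalysisWorkerSketch(words, model).execute()</code>, where <code>model</code> was created as <code>new DefaultTableModel(new Object[] {"Word", "Count"}, 0)</code>.</p>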
<h3 id="thesaurus"><a class="anchor" href="#thesaurus">¶</a>Thesaurus</h3>
<p>A thesaurus is a collection of words or terms used to represent concepts. In literature, this is commonly known as a dictionary.</p>
<p>In this project, our thesaurus is based on how relevant a word is to the meaning of a sentence, filtering out those words that barely give us any information.</p>
<p>This file contains a simple thesaurus implementation, which can trivially be used as a normal or inverted thesaurus. However, we only treat it as inverted, and its job is to load itself and determine whether words are valid or should otherwise be ignored.</p>
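<p>A minimal sketch of that idea follows, assuming that «inverted» means the thesaurus lists the words to ignore and that it is stored with one word per line. Both are assumptions, as are the names <code>InvertedThesaurusSketch</code>, <code>load</code> and <code>isValid</code>.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a thesaurus holding the words to ignore. */
final class InvertedThesaurusSketch {
    private final Set<String> ignored = new HashSet<>();

    /** Loads one word per line (assumed format), skipping blank lines. */
    void load(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            String word = normalize(line);
            if (!word.isEmpty()) {
                ignored.add(word);
            }
        }
    }

    /** A word is valid when it is non-empty and not listed in the thesaurus. */
    boolean isValid(String word) {
        String normalized = normalize(word);
        return !normalized.isEmpty() && !ignored.contains(normalized);
    }

    private static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }
}
```

<p>Using it as a «normal» thesaurus would only require flipping the check in <code>isValid</code>, so that listed words are the ones accepted.</p>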
<h3 id="utils"><a class="anchor" href="#utils">¶</a>Utils</h3>
<p>This file contains several utility functions used across the codebase.</p>
<h3 id="wordmap"><a class="anchor" href="#wordmap">¶</a>WordMap</h3>
<p>This file is the important one, and its implementation hasn’t changed much since our last post. Instances of a word map contain… wait for it… a map of words! It stores the mapping <code>word → count</code> in memory, and offers methods to query the count of a word or iterate over the word count entries.</p>
<p>It can be loaded from cache or told to analyze a root path. Once an instance is created, additional files can be analyzed one by one if desired.</p>
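<p>The core of such a structure might look like the sketch below. It is not the actual <code>WordMap</code> from the download: the class name, its methods, and the rule used to split text into words are all assumptions, and cache loading is left out.</p>

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of an in-memory word → count mapping. */
final class WordCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Splits the text on anything that is not a letter or digit and counts each word. */
    void analyze(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Returns how many times the word was seen, or 0 if it never appeared. */
    int count(String word) {
        return counts.getOrDefault(word.toLowerCase(Locale.ROOT), 0);
    }

    /** Read-only view of the entries, for iterating over every word and its count. */
    Set<Map.Entry<String, Integer>> entries() {
        return Collections.unmodifiableMap(counts).entrySet();
    }
}
```

<p>Calling <code>analyze</code> again with the contents of another file simply adds to the existing counts, which matches the idea of analyzing additional files one by one.</p>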
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>The code was getting a bit too large to embed within the blog post itself, so instead you can download it as a <code>.zip</code> file.</p>
<p><em>download removed</em></p>
</main>
</body>
</html>