<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Integrating Apache Tika into our Crawler</title>
<link rel="stylesheet" href="../css/style.css">
</head>
<body>
<main>
<p><a href="/blog/ribw/upgrading-our-baby-crawler/">In our last crawler post</a>, we detailed how our crawler worked, and although it did a fine job, it’s time for some extra upgrading.</p>
<div class="date-created-modified">Created 2020-03-18<br>
Modified 2020-03-25</div>
<h2 class="title" id="what_kind_of_upgrades_"><a class="anchor" href="#what_kind_of_upgrades_">¶</a>What kind of upgrades?</h2>
<p>A small but useful one. We are adding support for file types that contain text but cannot be read as plain text, because the text is embedded in a structured format (such as PDF files, Excel spreadsheets, or Word documents).</p>
<p>For this task, we will rely on <a href="https://tika.apache.org/">Tika</a>, our friendly Apache tool.</p>
<h2 id="what_is_tika_"><a class="anchor" href="#what_is_tika_">¶</a>What is Tika?</h2>
<p><a href="https://tika.apache.org/">Tika</a> is a set of libraries offered by <a href="https://en.wikipedia.org/wiki/The_Apache_Software_Foundation">The Apache Software Foundation</a> that we can include in our project in order to extract the text and metadata of files from a <a href="https://tika.apache.org/1.24/formats.html">long list of supported formats</a>.</p>
<h2 id="changes_in_the_code"><a class="anchor" href="#changes_in_the_code">¶</a>Changes in the code</h2>
<p>Not much has changed in the structure of the crawler. We have simply added a new method in <code>Utils</code> that uses the <code>Tika</code> class from the previously mentioned library to extract the text from more file types.</p>
<p>Then, we use this text just like we would for a standard text file (checking the thesaurus and adding it to the word map) and voilà! We have just added support for a wide range of file types.</p>
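<p>As a rough illustration of that flow, here is a minimal sketch, not the crawler's actual code. The names <code>WordMapSketch</code>, <code>addToWordMap</code> and <code>index</code> are hypothetical, the thesaurus check is left out, and the text extractor is passed in as a function so the example runs without Tika on the classpath. In the real crawler, the extractor would presumably wrap a call such as <code>new Tika().parseToString(file)</code>.</p>

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: the post does not show the crawler's real Utils class.
class WordMapSketch {
    /** Splits text into lowercase words and adds their counts to wordMap. */
    static void addToWordMap(String text, Map<String, Integer> wordMap) {
        // Split on anything that is not a letter or a decimal digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                wordMap.merge(word, 1, Integer::sum);
            }
        }
    }

    /** Extracts text with the given extractor, then indexes it like plain text. */
    static Map<String, Integer> index(String fileName, Function<String, String> extractor) {
        Map<String, Integer> wordMap = new HashMap<>();
        addToWordMap(extractor.apply(fileName), wordMap);
        return wordMap;
    }

    public static void main(String[] args) {
        // Stand-in extractor (an assumption for this sketch). With Tika available,
        // it could instead wrap Tika's parseToString and handle its checked exceptions.
        Function<String, String> fakeExtractor = name -> "Tika extracts text. Text is indexed.";
        System.out.println(index("report.pdf", fakeExtractor));
    }
}
```

<p>Keeping the extraction step behind a function means the word map logic stays the same whether the text comes from a plain text file or from a document processed by Tika.</p>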
<h2 id="incorporating_gradle"><a class="anchor" href="#incorporating_gradle">¶</a>Incorporating Gradle</h2>
<p>In order for the previous code to work, we need to make use of external libraries. To make this process easier, and because the project is growing, we decided to use <a href="https://gradle.org/">Gradle</a>, a build system that can be used for projects in various programming languages, such as Java.</p>
<p>We followed their <a href="https://guides.gradle.org/building-java-applications/">guide to Building Java Applications</a>, and in a few steps added the required <code>.gradle</code> files. Now we can compile and run the code with a single command, without having to juggle Java and external dependencies by hand:</p>
<pre><code>./gradlew run
</code></pre>
<h2 id="download"><a class="anchor" href="#download">¶</a>Download</h2>
<p>And here you can download the final result:</p>
<p><em>download removed</em></p>
</main>
</body>
</html>