<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta name=description content="Official Lonami's website"><meta name=viewport content="width=device-width, initial-scale=1.0, user-scalable=yes"><title> Integrating Apache Tika into our Crawler | Lonami's Blog </title><link rel=stylesheet href=/style.css><body><article><nav class=sections><ul><li><a href=/>lonami's site</a><li><a href=/blog class=selected>blog</a><li><a href=/golb>golb</a></ul></nav><main><h1 class=title>Integrating Apache Tika into our Crawler</h1><div class=time><p>2020-03-18T00:00:00+00:00<p>last updated 2020-03-25T17:38:07+00:00</div><p><a href=/blog/ribw/upgrading-our-baby-crawler/>In our last crawler post</a>, we detailed how our crawler worked, and although it did a fine job, it’s time for some extra upgrading.<h2 id=what-kind-of-upgrades>What kind of upgrades?</h2><p>A small but useful one. We are adding support for file types that contain text but cannot be read as plain text because they have their own structure (such as PDF files, Excel spreadsheets, Word documents…).<p>For this task, we will rely on <a href=https://tika.apache.org/>Tika</a>, our friendly Apache tool.<h2 id=what-is-tika>What is Tika?</h2><p><a href=https://tika.apache.org/>Tika</a> is a set of libraries offered by <a href=https://en.wikipedia.org/wiki/The_Apache_Software_Foundation>The Apache Software Foundation</a> that we can include in our project in order to extract the text and metadata of files from a <a href=https://tika.apache.org/1.24/formats.html>long list of supported formats</a>.<h2 id=changes-in-the-code>Changes in the code</h2><p>Not much has changed in the structure of the crawler: we have simply added a new method in <code>Utils</code> that uses the <code>Tika</code> class from the previously mentioned library to process and extract the text of more file types.<p>Then, we use this text just like we would for a standard text file (checking the 
thesaurus and adding it to the word map) and voilà! We have just added support for a wide range of file types.<h2 id=incorporating-gradle>Incorporating Gradle</h2><p>In order for the previous code to work, we need to make use of external libraries. To make this process easier, and because the project is growing, we decided to use <a href=https://gradle.org/>Gradle</a>, a build system that can be used for projects in various programming languages, such as Java.<p>We followed their <a href=https://guides.gradle.org/building-java-applications/>guide to Building Java Applications</a>, and in a few steps added the required <code>.gradle</code> files. Now we can compile and run the code in a single command, without having to worry about juggling Java and external dependencies:<pre><code>./gradlew run
</code></pre><h2 id=download>Download</h2><p>And here you can download the final result:<p><em>download removed</em></main><footer><div><p>Share your thoughts, or simply come hang with me <a href=https://t.me/LonamiWebs><img src=img/telegram.svg alt=Telegram></a> <a href=mailto:totufals@hotmail.com><img src=img/mail.svg alt=Mail></a></div></footer></article><p class=abyss>Glaze into the abyss… Oh hi there!