+++
title = "Mining of Massive Datasets"
date = 2020-03-16T01:00:00+00:00
updated = 2020-03-28T19:09:44+00:00
+++

In this post we will talk about Chapter 1 of the book Mining of Massive Datasets by Leskovec, J. et al., which is available online; I will summarize it and share my thoughts on it.

Data mining often refers to the discovery of models for data, where the model can be a statistical model, a machine learning model, a summary of the data, a set of extracted features, or some other computational approach used to answer complex queries about the data.

Commonly, problems related to data mining involve discovering unusual events hidden in massive data sets. Trying to achieve Total Information Awareness (TIA), a project proposed by the Bush administration and later shut down, raises another problem, though: if you look at that much data and try to find activities that look like (for example) terrorist behavior, you will inevitably also find other illicit activities that are not terrorism, with bad consequences. So in this case it is important to narrow down the activities we are looking for.

When looking for a certain type of event in data, even completely random data, the event will likely occur, and with more data it will occur more times. However, these are bogus results. The Bonferroni correction gives a statistically sound way to avoid most of these bogus results, and Bonferroni's Principle can be used as an informal version to achieve the same thing.

For that, we calculate the expected number of occurrences of the events we are looking for, on the assumption that the data is random. If this number is way larger than the number of real instances we hope to find, then nearly everything we find will be bogus.

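To make that concrete, here is a minimal sketch of such a calculation, with invented parameters in the spirit of the book's hotel example (pairs of people who happen to be at the same hotel on two different days get flagged as suspicious); the numbers are assumptions for illustration, not data.

```python
from math import comb

# All numbers below are illustrative assumptions, not figures from the book.
PEOPLE = 1_000_000_000   # population under surveillance
DAYS = 1_000             # days of hotel records examined
P_HOTEL = 0.01           # chance a given person visits a hotel on a given day
HOTELS = 100_000         # number of hotels

# Probability that two specific people are in the same hotel on one given day:
# both visit some hotel, and they happen to pick the same one.
p_meet_one_day = P_HOTEL * P_HOTEL / HOTELS

# Probability that the same pair "meets" on two different days.
p_meet_twice = p_meet_one_day ** 2

# Expected number of suspicious-looking pairs if the data is completely random:
# (pairs of people) x (pairs of days) x (probability a pair meets twice)
expected_bogus = comb(PEOPLE, 2) * comb(DAYS, 2) * p_meet_twice
print(f"expected bogus pairs: {expected_bogus:,.0f}")  # on the order of 250,000
```

With these made-up parameters, pure chance already produces hundreds of thousands of "suspicious" pairs, so if only a handful of real events exist, nearly every hit will be bogus.
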
----------

When analysing documents, some words will be more important than others and can help determine the topic of the document. One could think the most repeated words are the most important, but that is far from the truth. The most common words are the stop-words, which carry no meaning, which is why we should remove them prior to processing. We are mostly looking for rare nouns.

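As a toy illustration of that preprocessing step, here is a minimal sketch with a tiny, hand-picked stop-word list; a real pipeline would use a proper list from an NLP library.

```python
from collections import Counter

# Tiny, hand-picked stop-word list, purely for illustration.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "are"}

def content_word_counts(text: str) -> Counter:
    """Count words after lowercasing and dropping stop-words."""
    words = text.lower().split()
    return Counter(word for word in words if word not in STOP_WORDS)

print(content_word_counts("The cat and the dog are in the garden"))
# Counter({'cat': 1, 'dog': 1, 'garden': 1})
```
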
There are, of course, formal measures of how concentrated into relatively few documents the occurrences of a given word are, known as TF.IDF (Term Frequency times Inverse Document Frequency). We won't go into details on how to compute it, because there are multiple ways.

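Just to give an idea of what such a measure can look like, here is a minimal sketch of one possible variant (term frequency in a document times the log of the inverse document frequency); this is only one of the multiple ways mentioned above.

```python
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """One possible TF.IDF variant: frequency of the term in the document
    times the log of (number of documents / documents containing the term)."""
    tf = doc.count(term) / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0
    return tf * math.log(len(corpus) / docs_with_term)

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["stocks", "fell", "sharply", "today"],
]
print(tf_idf("cat", corpus[0], corpus))     # appears in 2 of 3 documents, low score
print(tf_idf("stocks", corpus[2], corpus))  # rare word, higher score
```
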
Hash functions are also frequently used, because they turn hash keys into a bucket number (the index of the bucket where that hash key belongs). They «randomize» and spread the universe of keys across a smaller number of buckets, which is useful for storage and access.

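As a small sketch of that idea, one can map any key to one of B buckets by hashing it and taking the remainder (here using Python's built-in hash, which is just one possible choice of hash function).

```python
NUM_BUCKETS = 8  # illustrative; real systems use many more buckets

def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an arbitrary key to a bucket index in [0, num_buckets)."""
    # Note: Python randomizes string hashes between runs, so the exact
    # bucket numbers will differ from run to run.
    return hash(key) % num_buckets

for key in ("alice", "bob", "carol"):
    print(key, "->", bucket_of(key))
```
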
An index is an efficient structure to query for values given a key, and can be built with hash functions and buckets.

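Combining the two previous ideas, a minimal sketch of such an index could look like this (the class and method names are just illustrative): each bucket holds (key, value) pairs, so a lookup only scans the one bucket the key hashes to instead of the whole collection.

```python
class HashIndex:
    """Toy hash-based index: keys are spread over buckets, and a lookup
    only scans the single bucket the key hashes to."""

    def __init__(self, num_buckets: int = 16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        self._bucket(key).append((key, value))

    def lookup(self, key):
        return [value for k, value in self._bucket(key) if k == key]

index = HashIndex()
index.insert("doc42", "Mining of Massive Datasets")
index.insert("doc7", "some other document")
print(index.lookup("doc42"))  # ['Mining of Massive Datasets']
```
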
Having all of these tools is important when analysing documents for data mining, because otherwise the processing would take far too long.