Big Data

2020-02-25T01:00:30+00:00

last updated 2020-03-18T09:51:17+00:00

Big Data sounds like a buzzword you may be hearing everywhere, but it’s actually here to stay!

What is Big Data?

And why is it so important? We use this term to refer to the large amount of data available, rapidly growing every day, that cannot be processed in conventional ways. It’s not only about the amount, it’s also about the variety and rate of growth.

Thanks to technological advancements, there are new ways to process this insane amount of data, which would otherwise be too costly to process in traditional database systems.

Where does data come from?

It can be pictures on your phone, industry transactions, messages on social networks, readings from a sensor in the mountains. It can come from anywhere, which makes the data very varied.

Just to give some numbers, over 12TB of data is generated on Twitter daily. If you purchase a laptop today (as of March 2020), the disk will be roughly 1TB, maybe 2TB. Even with the larger 2TB disks, Twitter would fill 6 of those drives every day!

What about Facebook? It is estimated they store around 100PB of photos and videos. That would be 50,000 laptop disks of 2TB each. Not a small number. And let’s not talk about worldwide network traffic…
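
To double-check the back-of-the-envelope numbers above, here is a minimal sketch in Python. It assumes decimal units (1TB = 10^12 bytes, 1PB = 10^15 bytes), and the helper name `drives_needed` is made up for this example.

```python
# Decimal storage units (an assumption; disk vendors usually count this way).
TB = 10**12
PB = 10**15

def drives_needed(total_bytes, drive_bytes):
    """Return how many whole drives are needed to hold total_bytes."""
    # Ceiling division using integers only, to avoid float rounding.
    return -(-total_bytes // drive_bytes)

twitter_daily = 12 * TB    # daily figure quoted above
facebook_media = 100 * PB  # photo and video estimate quoted above

print(drives_needed(twitter_daily, 2 * TB))   # 6
print(drives_needed(twitter_daily, 1 * TB))   # 12
print(drives_needed(facebook_media, 2 * TB))  # 50000
```

Note how the answer for Twitter doubles to 12 drives a day if you assume the smaller 1TB disk.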

What data can be exploited?

So, we have a lot of data. Should we attempt to process everything? We can distinguish several categories.

But asking what to process is asking the wrong question. Instead, one should think about «What problem am I trying to solve?».

How to exploit this data?

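What are some of the ways to deal with this data? If the problem fits the Map-Reduce paradigm, then Hadoop is a great option! Hadoop is inspired by Google’s papers on MapReduce and the Google File System (GFS), and achieves great parallelism across the nodes of a cluster. Its core components include the Hadoop Distributed File System (HDFS) for storage, YARN for resource management, and MapReduce for processing.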
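
To get a feel for what "fitting the Map-Reduce paradigm" means, here is a minimal sketch of the classic word count example in plain Python. This is not Hadoop's API. It runs in a single process, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster, the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document can be mapped independently of the others,
    # which is what makes the paradigm easy to parallelize.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

documents = ["big data is big", "data is everywhere"]
print(word_count(documents))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

A problem fits the paradigm when, like here, the input can be split into pieces that are processed independently, and the partial results can be combined per key afterwards.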

Key insights

Big Data is a field whose goal is to extract information from very large sets of data, and find ways to do so. To summarize its different dimensions, we can refer to what’s known as «the Four V’s of Big Data»: Volume, Velocity, Variety and Veracity.

Some sources talk about a fifth V for Value; because processing this data is costly, it is important we can get value out of it.

…And some other sources go as high as seven V’s, including Viability and Visualization. Computers can’t make decisions on their own (yet); a human has to. And humans can only do so if the data is presented to them (and visualized) in a meaningful way.

Infographics

Let’s see some pictures, we all love pictures:

Common patterns

References
