all repos — gemini-redirect @ 37719a9ea8780c2f58d72f2b700b90fbae393630

content/blog/mdad/data-warehousing-and-olap/index.md (view raw)

 1+++
 2title = "Data Warehousing and OLAP"
 3date = 2020-03-23T01:00:00+00:00
 4updated = 2020-04-01T09:45:41+00:00
 5+++
 6
 7Business intelligence (BI) refers to systems used to gain insights from data, traditionally taken from relational databases and being used to build a data warehouse. Performance and scalability are key aspects of BI systems.
 8
 9Commonly, the data in the warehouse is a transformation of the original, operational data into a form better suited for reporting and analysis.
10
11This whole process is known as Online Analytical Processing (OLAP), and is different to the approach taken by relational databases, which is known as Online Transaction Processing (OLTP) and is optimized for individual transactions. OLAP is based on multidimensional databases simply by the way it works.
12
13The Business Intelligence Semantic Model (BISM) refers to the different semantics in which data can be accessed and queried.
14
15On the one hand, MDX is the language used for Microsoft’s BISM of multidimensional mode, and on the other, DAX is the language of tabular mode, based on Excel’s formula language and designed to be easy to use by those familiar with Excel.
16
17## Types of data
18
19The business data is often called detail data or _fact_ data, goes in a de-normalized table called the fact table. The term «facts» literally refers to the facts, such as number of products sold and amount received for products sold. Different tables will often represent different dimensions of the data, where «dimensions» simply means different ways to look at the data.
20
21Data can also be referred to as measures, because most of it is numbers and subject to aggregations. By measures, we refer to these values and numbers.
22
23Multidimensional databases are formed with separate fact and dimension tables, grouped to create a «cube» with both facts and dimensions.
24
25## Places to store data
26
27Three different terms are often heard when talking about the places where data is stored: data lakes, data warehouses, and data marts. All of these have different target users, cost, size and growth.
28
29The data lake contains **all** the data generated by your business. Nothing is filtered out, not even cancelled or invalid transactions. If there are future plans to use the data, or a need to analyze it in various ways, a data lake is often necessary.
30
31The data warehouse contains **structured** data, or has already been modelled. It’s also multi-purpose, but often of a lot smaller scale. Operational users are able to easily evaluate reports or analyze performance here, since it is built for their needs.
32
33The data mart contains a **small portion** of the data, and is often part of data warehouses themselves. It can be seen as a subsection built for specific departments, and as a benefit, users get isolated security and performance. The data here is clean, and subject-oriented.
34
35## Ways to store data
36
37Data is often stored de-normalized, because it would not be feasible to store otherwise.
38
39There are two main techniques to implement data warehouses, known as Inmon approach and Kimball approach. They are named after Ralph Kimball _et al._ for their work on «The Data Warehouse Lifecycle Toolkit», and Bill Inmon _et al._ for their work on «Corporate Information Factory» respectively.
40
41When several independent systems identify and store data in different ways, we face what’s known as the problem of the stovepipe. Something as simple as trying to connect these systems or use their data in a warehouse results in an overly complicated system.
42
43To tackle this issue, Kimball advocates the use of «conformed dimensions», that is, some dimensions will be «of interest», and have the same attributes and rollups (or at least a subset) in different data marts. This way, warehouses contain dimensional databases to ease analysis in the data marts it is composed of, and users query the warehouse.
44
45The Inmon approach on the other hand has the warehouse laid out in third normal form, and users query the data marts, not the warehouse (so the data marts are dimensional in nature).
46
47## Key takeaways
48
49* «BI» stands for «Business Intelligence» and refers to the system that _perform_ data analysis.
50* «BISM» stands for «Business Intelligence Semantic Model», and Microsoft has two languages to query data: MDX and DAX.
51* «OLAP» stands for «Online Analytical Processing», and «OLTP» for «Online Transaction Processing».
52* Data mart, warehouse and lake refer to places at different scales and with different needs to store data.
53* Inmon and Kimbal are different ways to implement data warehouses.
54* Data facts contains various measures arranged into different dimensions, which together form a data cube.
55
56## References
57
58* [Chapter 1 – Professional Microsoft SQL Server 2012 Analysis Services with MDX and DAX (Harinath et al., 2012)](https://media.wiley.com/product_data/excerpt/03/11181011/1118101103-157.pdf)
59* [YouTube – Data Mining in SQL Server Analysis Services](https://youtu.be/m_DzhW-2pWI)
60* Almacenes de Datos y Procesamiento Analítico On-Line (Félix R.)
61* [YouTube – What are Dimensions and Measures?](https://youtu.be/qkJOace9FZg)
62* [Data Lake vs Data Warehouse vs Data Mart](https://www.holistics.io/blog/data-lake-vs-data-warehouse-vs-data-mart/)