Data Warehousing and OLAP

2020-03-23T01:00:00+00:00

last updated 2020-04-01T09:45:41+00:00

Business intelligence (BI) refers to systems used to gain insights from data, traditionally taken from relational databases and being used to build a data warehouse. Performance and scalability are key aspects of BI systems.

Commonly, the data in the warehouse is a transformation of the original, operational data into a form better suited for reporting and analysis.

This whole process is known as Online Analytical Processing (OLAP), and is different to the approach taken by relational databases, which is known as Online Transaction Processing (OLTP) and is optimized for individual transactions. OLAP is based on multidimensional databases simply by the way it works.

The Business Intelligence Semantic Model (BISM) refers to the different semantics in which data can be accessed and queried.

On the one hand, MDX is the language used for Microsoft’s BISM of multidimensional mode, and on the other, DAX is the language of tabular mode, based on Excel’s formula language and designed to be easy to use by those familiar with Excel.

Types of data

The business data is often called detail data or fact data, goes in a de-normalized table called the fact table. The term «facts» literally refers to the facts, such as number of products sold and amount received for products sold. Different tables will often represent different dimensions of the data, where «dimensions» simply means different ways to look at the data.

Data can also be referred to as measures, because most of it is numbers and subject to aggregations. By measures, we refer to these values and numbers.

Multidimensional databases are formed with separate fact and dimension tables, grouped to create a «cube» with both facts and dimensions.

Places to store data

Three different terms are often heard when talking about the places where data is stored: data lakes, data warehouses, and data marts. All of these have different target users, cost, size and growth.

The data lake contains all the data generated by your business. Nothing is filtered out, not even cancelled or invalid transactions. If there are future plans to use the data, or a need to analyze it in various ways, a data lake is often necessary.

The data warehouse contains structured data, or has already been modelled. It’s also multi-purpose, but often of a lot smaller scale. Operational users are able to easily evaluate reports or analyze performance here, since it is built for their needs.

The data mart contains a small portion of the data, and is often part of data warehouses themselves. It can be seen as a subsection built for specific departments, and as a benefit, users get isolated security and performance. The data here is clean, and subject-oriented.

Ways to store data

Data is often stored de-normalized, because it would not be feasible to store otherwise.

There are two main techniques to implement data warehouses, known as Inmon approach and Kimball approach. They are named after Ralph Kimball et al. for their work on «The Data Warehouse Lifecycle Toolkit», and Bill Inmon et al. for their work on «Corporate Information Factory» respectively.

When several independent systems identify and store data in different ways, we face what’s known as the problem of the stovepipe. Something as simple as trying to connect these systems or use their data in a warehouse results in an overly complicated system.

To tackle this issue, Kimball advocates the use of «conformed dimensions», that is, some dimensions will be «of interest», and have the same attributes and rollups (or at least a subset) in different data marts. This way, warehouses contain dimensional databases to ease analysis in the data marts it is composed of, and users query the warehouse.

The Inmon approach on the other hand has the warehouse laid out in third normal form, and users query the data marts, not the warehouse (so the data marts are dimensional in nature).

Key takeaways

References

Glaze into the abyss… Oh hi there!