Building a Cloud Data Warehouse

Pedro Marquez
Published in CodeX · 6 min read · Sep 6, 2021

For a Data Engineer, building data warehouses is common practice across all of our projects. Unfortunately, it is not all unicorns and rainbows, since there is a lot of work involved in building the several data layers, ETL processes, and reprocessing methodologies, plus business needs to attend to at the end of the day.

In this article, I will go through a few of my past experiences building Data Warehouses with the help of the several data layers that serve as a base for their creation. Nonetheless, I will give it a taste of Cloud to keep up with the trends.

Why Leverage the Public Cloud?

A cloud data warehouse uses the storage and compute power allocated by a cloud provider to integrate and store data for analytical purposes (querying and reporting). The main benefits:

  • Scalability: Instant up- or down-scaling.
  • Availability: Up to 99.99% of uptime with leading cloud providers (Amazon, Microsoft, Google).
  • Security: The cloud provider ensures infrastructure and data safety.
  • Performance: Serve multiple geographic locations and show great query performance (measured in seconds).
  • Cost-effectiveness: No hardware-related costs, and the possibility to pay only for the storage and computing resources actually used (although for enterprise-level applications I don’t recommend this pay-as-you-go approach).

The Different Data Layers

Before we talk about the Data Warehouse per se, we need to discuss the foundations: the layers that support the creation of a Data Warehouse.

The Data Lake and the Data Quality Layer

With the available technologies, we can ingest virtually all data types (structured, semi-structured, unstructured). Leveraging one of the public cloud storage services (S3 on AWS, Blob Storage on Azure, etc.), we can easily bring this information into our “Data Lake”.

This Data Lake can be composed of several layers; I have defined two: a Raw Zone and a Curated Zone.

My suggestion is to divide these layers into folders and sub-folders, having the data segregated by Source, Year, Month, Day, Hour, etc.

├── raw_zone
│   └── source1
│       └── 2021
│           └── 01
│               └── 01
│                   ├── 08
│                   │   └── tableA_20210101080000.csv
│                   └── 09
│                       ├── tableA_2021010109.csv
│                       └── tableB_2021010109.csv

The downside of ingesting raw data is that it can come in wild formats, contain erroneous information, or simply use inconsistent data types. That is why we need to perform some profiling, cleansing, and ETL on this raw data, so we can create a Curated Zone that stores all ingested data with the right quality and the ability to cross or join multiple data sources without mismatches or errors. In the end, and under the same storage (if we want to), it will look something like this:

├── raw_zone
│   └── source1
│       └── 2021
│           └── 01
│               └── 01
│                   ├── 08
│                   │   └── tableA_20210101080000.csv
│                   └── 09
│                       ├── tableA_2021010109.csv
│                       └── tableB_2021010109.csv
└── curated_zone
    └── source1
        └── 2021
            └── 01
                └── 01
                    ├── 08
                    │   └── tableA_cur_20210101080000.csv
                    └── 09
                        ├── tableA_cur_2021010109.csv
                        └── tableB_cur_2021010109.csv

Note: These folders should be created by our ingestion process. It can be an ETL tool doing it, or simply PySpark or another language.
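
To make this concrete, here is a minimal PySpark sketch of an ingestion step that lands a source extract in the Raw Zone and writes a lightly cleansed copy to the Curated Zone. The bucket, source, table, and column names are hypothetical, and the actual cleansing rules would depend on the source.

from datetime import datetime
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest_source1").getOrCreate()

now = datetime.utcnow()  # partition folders derived from the load time

# 1. Land the extract as-is in the Raw Zone (Spark writes a folder of part files;
#    a rename/move step could follow if single files are required).
raw_path = f"s3://my-data-lake/raw_zone/source1/{now:%Y/%m/%d/%H}/tableA_{now:%Y%m%d%H}.csv"
df_raw = spark.read.option("header", True).csv("s3://my-landing-bucket/source1/tableA.csv")
df_raw.write.mode("overwrite").option("header", True).csv(raw_path)

# 2. Light cleansing: trim keys, cast types, drop fully empty rows.
df_cur = (
    df_raw
    .withColumn("account_id", F.trim(F.col("account_id")))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .dropna(how="all")
)

# 3. Write the curated version under the same date/hour structure.
cur_path = f"s3://my-data-lake/curated_zone/source1/{now:%Y/%m/%d/%H}/tableA_cur_{now:%Y%m%d%H}.csv"
df_cur.write.mode("overwrite").option("header", True).csv(cur_path)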

At the same time, while doing this cleansing, we can create data quality layers, which can feed dashboards that effectively support our processing and data correction processes.

Example of a Data Quality Dashboard for Data Pipelines
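
As a hint of what could feed such a dashboard, here is a small PySpark sketch that computes a few basic checks (row count, null keys, duplicate keys) for one curated file and appends them to a data quality layer. Paths, table names, and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()

# One curated file for a given load hour (hypothetical path).
df = spark.read.option("header", True).csv(
    "s3://my-data-lake/curated_zone/source1/2021/01/01/09/tableA_cur_2021010109.csv"
)

# Basic profiling metrics for the dashboard.
dq = (
    df.agg(
        F.count(F.lit(1)).alias("row_count"),
        F.sum(F.col("account_id").isNull().cast("int")).alias("null_account_ids"),
        (F.count(F.lit(1)) - F.countDistinct("account_id")).alias("duplicate_account_ids"),
    )
    .withColumn("table_name", F.lit("tableA"))
    .withColumn("load_hour", F.lit("2021010109"))
)

# Append to the data quality layer that the dashboard reads from.
dq.write.mode("append").option("header", True).csv("s3://my-data-lake/dq_layer/source1/")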

The Data Model & The Data Warehouse

To build our Data Warehouse, we will need an operational data model as the basis. Normally, our Curated Zone has data from more than one system (which can usually be connected or linked), but for simplicity let us imagine just one source.

Simple Operational Data Model

To maintain historical data or keep track of changes, we can use CDC (change data capture) or full snapshots.

With CDC, we will only ingest updated or inserted records; if that is not possible, we need to ingest the full table. The first approach also applies to event-driven architectures, where each event becomes a new line/payload in our data model, which can be segregated into several records and spread across several tables (if we decide to move from semi-structured to relational).
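
A minimal sketch of applying a CDC extract, assuming a hypothetical accounts table with an account_id business key and an updated_at timestamp: union the change records with the current state and keep only the most recent version of each key.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("cdc_apply").getOrCreate()

current = spark.read.parquet("s3://my-dw/operational/accounts/")  # current state
changes = spark.read.parquet(
    "s3://my-data-lake/curated_zone/source1/2021/01/01/09/"       # today's CDC extract
)

# Keep the most recent record per business key, based on the update timestamp.
w = Window.partitionBy("account_id").orderBy(F.col("updated_at").desc())
merged = (
    current.unionByName(changes)
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

# Write to a new location to avoid overwriting the path we are still reading from.
merged.write.mode("overwrite").parquet("s3://my-dw/operational/accounts_new/")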

Reprocessing information is also possible, since we bundle our information into “date snapshots”. When reprocessing one or several days, we overwrite a certain load/reference date with newer information, allowing the full replacement of data that was left unprocessed or the correction of data that was not well ingested/exported.
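
Reprocessing then becomes a matter of rewriting the affected snapshot. A sketch under the assumption that the target table is partitioned by a load_date column and that Spark's dynamic partition overwrite is enabled (paths and names are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reprocess_day").getOrCreate()

# Only the partitions present in the incoming DataFrame get replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Re-read the corrected source files for the day being reprocessed.
fixed = (
    spark.read.option("header", True)
    .csv("s3://my-data-lake/curated_zone/source1/2021/01/01/*/")
    .withColumn("load_date", F.lit("2021-01-01"))
)

# Overwrite just the 2021-01-01 snapshot; other load dates stay untouched.
fixed.write.mode("overwrite").partitionBy("load_date").parquet("s3://my-dw/operational/transactions/")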

In a quick-win approach, we can leverage this model to provide reporting capabilities and build several operational reports or simple metrics (e.g. number of accounts, average transactions per account, number of active customers, etc.), but our main goal is to reach a more de-normalized model, which will facilitate insight gathering and allow business users to access the model and run simple queries (using a self-service BI tool).
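
For instance, metrics like the ones above could be pulled straight from the operational model; a quick sketch with hypothetical table and column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quick_win_metrics").getOrCreate()

spark.read.parquet("s3://my-dw/operational/accounts/").createOrReplaceTempView("accounts")
spark.read.parquet("s3://my-dw/operational/transactions/").createOrReplaceTempView("transactions")

# Number of accounts and average transactions per account.
metrics = spark.sql("""
    SELECT COUNT(DISTINCT a.account_id)                           AS nr_accounts,
           COUNT(t.transaction_id) / COUNT(DISTINCT a.account_id) AS avg_tx_per_account
    FROM accounts a
    LEFT JOIN transactions t
      ON t.account_id = a.account_id
""")
metrics.show()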

Moving Into a De-Normalized Model

When we de-normalize the model, it tends to become simpler and the joins tend to be lighter. To do so, we need to “bundle” several columns from different tables into one or more tables and make the information coherent across several data domains. Additionally, using common nomenclatures for our columns will facilitate the crossing process.

Some rules to keep in mind while building these nomenclatures (a short sketch of applying them follows the list):

  • All date columns start with DT, all identifier columns start with ID, etc.
  • Maintain the same names across join columns (e.g. an Account Identifier commonly shared between tables).
  • Do we need to keep track of all historical changes, or can we keep only the latest version?
  • Isolate the most frequently updated columns or attributes and keep them out of our dimensions.
  • Do not overpopulate our tables with columns that are not going to be used.
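
A minimal sketch of bundling two operational tables into one denormalized table while applying these rules (ID_ prefixes for identifiers, DT_ prefixes for dates, and the same name for the join column on both sides). All table and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("build_denormalized").getOrCreate()

accounts = spark.read.parquet("s3://my-dw/operational/accounts/")
customers = spark.read.parquet("s3://my-dw/operational/customers/")

# Apply the naming conventions while selecting only the columns we will actually use.
dim_account = (
    accounts.select(
        F.col("account_id").alias("ID_ACCOUNT"),
        F.col("customer_id").alias("ID_CUSTOMER"),
        F.col("open_date").alias("DT_OPEN"),
        F.col("account_type").alias("DS_ACCOUNT_TYPE"),
    )
    .join(
        customers.select(
            F.col("customer_id").alias("ID_CUSTOMER"),
            F.col("customer_name").alias("NM_CUSTOMER"),
            F.col("birth_date").alias("DT_BIRTH"),
        ),
        on="ID_CUSTOMER",
        how="left",
    )
)

dim_account.write.mode("overwrite").parquet("s3://my-dw/warehouse/dim_account/")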

In our final Data Warehouse model, we will have something cleaner and more condensed, which will allow us to explore the information and some predefined metrics:

Example of De-Normalized Model

The main goal of building this type of model is to accelerate queries and improve the consistency of our data, reducing the number of errors, keeping our data in a more condensed form, and letting us pull some predetermined metrics or insights without overloading our BI tool with data processing.

This is doable on an on-premise database, but to get the proper performance we would need to set up partitions and indexes and take into account every piece of code or procedure written. With the public cloud, we can leverage its processing power for data movement, data profiling, table joins, data cleansing, etc.

Data flow with a Cloud Query Engine (Cloud DW)

In other articles of mine, I mentioned several technologies that can fit the purpose of a Cloud Query Engine.

Examples of Query Engines and Data Warehouses

Nonetheless, it is mandatory to maintain good practices in code development and query performance, but it is less restrictive when handling huge amounts of data.

By modeling data this way, we will build something like a star schema or snowflake model, but other methodologies can be used, such as Data Vault, data lake, or lakehouse, for more agile or rapidly changing models.

I will not get into much detail on building such models, but I leave some links for the more curious:

Disclaimer: This post reflects my opinion, experiences, and architectural views. Therefore, it should be treated accordingly, and I know this is a subject that can generate multiple discussions.
