Having worked with data for the last 8 years, I believe I have seen some structural changes in the data landscape. These changes concern not only architecture, which nowadays is mostly cloud-based, but also the roles of the people involved in teams and implementations.
In terms of architecture, there are several possible approaches, but more and more we see event-driven architectures arise. These architectures are based on microservices and API-driven implementations, which suit real-time or near-real-time ingestion and processing.
The following diagram illustrates the idea of using microservices to ingest data in real time and make it available to a Data Lakehouse or Data Lake, leveraging semi-structured data capabilities.
Microservice Layer & Event Queuing
Most new source systems are API-driven, meaning that they can work with push or pull logic:
- The push logic uses streaming APIs that export events, usually in JSON format, at the moment they happen on the source.
- The pull logic works much like what we call “batch”: our data pipelines are responsible for requesting information from the source system, using parameters to reduce the amount of data moved through the pipelines (deltas).
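As a minimal sketch of the pull logic, the example below requests only the rows changed since the last run. The source system is simulated in memory, and the names `fetch_changes`, `pull_delta`, and the `updated_since` filter are assumptions made for illustration, not a real API.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a source system's API.
SOURCE_ROWS = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]


def fetch_changes(updated_since):
    """Simulate a source endpoint that accepts an `updated_since` filter."""
    return [row for row in SOURCE_ROWS if row["updated_at"] > updated_since]


def pull_delta(watermark):
    """Request only rows changed after the watermark.

    Returns the rows and the new watermark to store for the next run.
    """
    rows = fetch_changes(watermark)
    new_watermark = max((r["updated_at"] for r in rows), default=watermark)
    return rows, new_watermark


watermark = datetime(2024, 1, 1, tzinfo=timezone.utc)
rows, watermark = pull_delta(watermark)
print([r["id"] for r in rows])  # rows changed after the first watermark
rows, watermark = pull_delta(watermark)
print(len(rows))  # a second run finds nothing new
```

Storing the watermark between runs is what keeps each batch small: the pipeline never asks the source for data it has already moved.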
Focusing mainly on the push logic, the microservices, which can be built in many languages (e.g. Python, .NET, Java), are responsible for subscribing to endpoints/webhooks and passing events into the defined topic. Each topic is specific to one event and needs a unique name or a clear naming convention. For this part, technologies such as Apache Kafka, Confluent, or Google Cloud Pub/Sub guarantee queuing, replication, and partitioning.
Nowadays, this code can also run on cloud functions, which are available from every major cloud provider and are normally serverless.
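The push-side microservice described above could look roughly like the sketch below. It uses an in-memory broker as a stand-in for a real one such as Kafka or Pub/Sub, and the `<source>.<entity>.<action>` topic naming convention, the `InMemoryBroker` class, and the payload fields are all assumptions made for illustration.

```python
import json
from collections import defaultdict


class InMemoryBroker:
    """Stand-in for a real broker such as Kafka or Pub/Sub (illustration only)."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)


def topic_for(source, event):
    """Hypothetical naming convention: <source>.<entity>.<action>, lowercase."""
    return f"{source}.{event['entity']}.{event['action']}".lower()


def handle_webhook(broker, source, raw_body):
    """Parse a JSON webhook payload and route it to its event-specific topic."""
    event = json.loads(raw_body)
    topic = topic_for(source, event)
    broker.publish(topic, event)
    return topic


broker = InMemoryBroker()
payload = json.dumps({"entity": "Order", "action": "Created", "data": {"id": 42}})
print(handle_webhook(broker, "shop", payload))  # shop.order.created
```

In a real deployment, `handle_webhook` would be the body of an HTTP handler or cloud function, and `publish` would call the broker's client library instead of appending to a list.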
Data Lake / LakeHouse
A Data Lake is a repository for large amounts of data, which are in a raw format. Organizations and individuals use data lakes to store different types of big data, including structured, unstructured, and semi-structured data.
A data lake uses a flat architecture, with no file or folder hierarchies. Each piece of data is associated with a set of metadata and a unique identifier. Data lakes draw on a wide variety of data sources, including mobile apps, IoT products, websites, and industrial applications.
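To make the flat architecture concrete, here is a toy sketch of a store where every object sits at the same level, identified by a unique ID and described by metadata. The `FlatObjectStore` class and its methods are hypothetical; real object stores expose richer APIs.

```python
import uuid


class FlatObjectStore:
    """Toy flat store: no folders, each object is keyed by a unique ID."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        """Store raw data with its metadata and return the generated ID."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"data": data, "metadata": metadata}
        return object_id

    def get(self, object_id):
        return self._objects[object_id]["data"]

    def find(self, **criteria):
        """Return IDs of objects whose metadata matches all key/value pairs."""
        return [
            oid
            for oid, obj in self._objects.items()
            if all(obj["metadata"].get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
sensor_id = store.put(b'{"temp": 21.5}', source="iot", format="json")
store.put(b"<html></html>", source="website", format="html")
print(store.find(source="iot") == [sensor_id])  # True
```

Because there is no hierarchy to browse, metadata is the only way to locate data, which is why poorly maintained metadata makes a lake hard to use.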
Data Lakes have three main advantages: functionality, scalability, and low cost:
- Functionality: they work well with machine learning tools, AI algorithms, advanced real-time data analytics, and predictive models;
- Scalability: they can manage large volumes of data, which grow and change depending on the data inputs;
- Low cost: they allow the use of open-source technologies.
On the other hand, data lakes can become data swamps, with poor data integrity and security issues. Their disadvantages include:
- The complexity of the data, due to its size, means that only engineers and data scientists can navigate and leverage it for analysis.
- Browsing data lakes can be time-consuming, and the data needs to be organized regularly and its integrity maintained to avoid data quality issues. Without these precautions, a data lake ends up becoming a data swamp with disorganized and unusable data.
- Carelessness in a data lake can bring security risks. With so much data stored, security and access risks can increase. Without the necessary precautions, certain pieces of sensitive data could end up “living” in the data lake and become available to anyone with access to it.
A Data Lakehouse is intended to receive both structured and unstructured data. With this approach, a business can work with unstructured data that only needs a repository, without having to maintain separate warehouse and lake infrastructures.
Lakehouses became possible thanks to a new system design that implements data structures and data manipulation tools similar to those of data warehouses on top of low-cost open formats.
The main features of a Data Lakehouse are:
- Cost-effective storage
- Supports all types of data in all file formats
- Schema is supported through data management mechanisms
- Simultaneous writing and reading of data
- Optimized access to data science and machine learning tools
- A single system to help data teams move data more efficiently and quickly, without needing to access multiple systems
- Support for ACID transactions, which allow data to be read and written simultaneously
- Real-time analytics for data science, machine learning, and data analytics projects
- Scalability and flexibility
- All the advantages of being open source
The advantages of this approach can be:
- Simplified schema
- Better treatment and analysis of data
- Reduced data traffic and redundancy
- Faster and more efficient use of team time
This type of approach can also leverage serverless processing to move data between the bronze, silver, and gold layers. Normally, the gold layer is then used in the Data Warehouse to calculate insights, since all of its data has already been curated through data quality checks, ETL, or other processing such as joining files and tables.
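The movement between layers can be sketched in plain Python as follows. The sample records, the cleaning rules, and the function names `to_silver` and `to_gold` are assumptions made for illustration; a real pipeline would typically run on a processing engine rather than on in-memory lists.

```python
from collections import defaultdict

# Bronze: raw events as they arrived, including a duplicate and an invalid row.
bronze = [
    {"order_id": "1", "country": "pt", "amount": "10.50"},
    {"order_id": "1", "country": "pt", "amount": "10.50"},
    {"order_id": "2", "country": "PT", "amount": "4.50"},
    {"order_id": "3", "country": "es", "amount": None},
    {"order_id": "4", "country": "ES", "amount": "20.00"},
]


def to_silver(records):
    """Clean and type the raw records: drop invalid rows and duplicates."""
    seen, silver = set(), []
    for rec in records:
        if rec["amount"] is None or rec["order_id"] in seen:
            continue
        seen.add(rec["order_id"])
        silver.append(
            {
                "order_id": int(rec["order_id"]),
                "country": rec["country"].upper(),
                "amount": float(rec["amount"]),
            }
        )
    return silver


def to_gold(silver_rows):
    """Aggregate curated rows into a business-level table: revenue per country."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'PT': 15.0, 'ES': 20.0}
```

Each layer only reads from the one before it, so data quality rules live in one place (bronze to silver) and business logic in another (silver to gold).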
A Data Warehouse is a vast collection of business data stored in a structured format to help organizations gain knowledge. Before writing data in a Data Warehouse, it is necessary to know its schema. It takes data from various sources and formats it to match the defined schema.
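The idea of defining the schema first and then conforming each source to it can be sketched with Python's built-in `sqlite3` module as a stand-in for a warehouse. The `sales` table, the `webshop` and `pos` sources, and the `FIELD_MAPS` mapping are hypothetical names used only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is declared before any data is written (schema-on-write).
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        country  TEXT NOT NULL,
        amount   REAL NOT NULL
    )
    """
)

# Hypothetical field mappings: each source names its columns differently.
FIELD_MAPS = {
    "webshop": {"order_id": "id", "country": "country_code", "amount": "total"},
    "pos": {"order_id": "order_id", "country": "country", "amount": "amount"},
}


def conform(source, record):
    """Rename and cast a source record so it matches the warehouse schema."""
    fields = FIELD_MAPS[source]
    return (
        int(record[fields["order_id"]]),
        str(record[fields["country"]]).upper(),
        float(record[fields["amount"]]),
    )


rows = [
    conform("webshop", {"id": "1", "country_code": "pt", "total": "10.5"}),
    conform("pos", {"order_id": 2, "country": "ES", "amount": 20}),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# A row that violates the schema is rejected instead of being stored.
try:
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (3, None, 5.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(conn.execute("SELECT country, amount FROM sales ORDER BY order_id").fetchall())
print(rejected)  # True
```

This is the opposite of the data lake approach, where data is stored as it arrives and a schema is only applied when it is read.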
A Data Warehouse is optimized to store and process data so that queries are efficient. It also offers good data quality and returns query results quickly.
A Data Warehouse cannot handle raw or unstructured data. Its maintenance is very expensive due to constant data growth. Furthermore, it is not the best solution for complex processing such as machine learning or predictive analytics. Its main disadvantages are:
- High costs
- Rigid and inflexible architecture
- High complexity and redundancy
- Slow and degrading performance
- Obsolete technologies
Machine Learning & Data Science
Machine Learning and Data Science can be very useful for certain businesses to build models that predict customer behavior or engagement, detect fraud attempts, customize products, or simply make recommendations.
Since this is not my main focus area, I understand that there is a lot more to discuss and explore here. At this point, I believe that ML Engineers will need data, and that is where data engineers come into play: to provide them with the data they need, using either a simple repository (Data Lake/Lakehouse) or structured data (Data Warehouse).
Insights and KPIs
Insights and KPIs are some of the main drivers for every business. They represent historical or real-time information and can be very useful to track patterns, understand the current status, and make informed decisions based on data that is accurate, non-redundant, and provides direction.
There are also KPIs related to data quality, which show the information that needs to be corrected in source systems or the information that is missing for key insights.
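One simple data quality KPI is completeness: the share of records in which a field is actually filled in. The sketch below computes it on hypothetical customer records and lists the rows that would need correcting at the source; the field names and the `completeness` and `rows_to_fix` helpers are assumptions for illustration.

```python
records = [
    {"customer_id": 1, "email": "a@example.com", "country": "PT"},
    {"customer_id": 2, "email": None, "country": "ES"},
    {"customer_id": 3, "email": "", "country": None},
    {"customer_id": 4, "email": "d@example.com", "country": "PT"},
]


def completeness(rows, field):
    """Share of rows where `field` is present and non-empty (0.0 to 1.0)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def rows_to_fix(rows, field, key="customer_id"):
    """Keys of the rows that need correcting at the source for this field."""
    return [row[key] for row in rows if row.get(field) in (None, "")]


for field in ("email", "country"):
    print(field, completeness(records, field), rows_to_fix(records, field))
```

Tracking this ratio over time shows whether fixes made in the source systems are actually reaching the insights layer.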
This layer is normally fed from the data warehouse in a more traditional way, using a relational model in third normal form (3NF) and one of several data modeling approaches (Kimball, Inmon, Data Vault, etc.).