Data Quality within Lakehouses

Piethein Strengholt
9 min read · Mar 5, 2024

Introduction

I am often asked: how do we ensure data quality within Data Lakehouses? Or: how do I manage data quality for my data products? In this article, we will first look at whether validation is needed at all and, if so, what data must be validated and how to validate it. These are questions that many enterprises deal with at the start of their Data Lakehouse journeys.

Why data quality?

Data quality matters. Failing to validate data quality can lead to serious consequences, both operational and strategic, for any organization.

On a strategic level, poor data quality can result in incorrect insights, leading to wrong decisions and strategies. These inaccuracies can cause loss of revenue, customer dissatisfaction, and damage to the organization’s reputation. In highly regulated industries, severely poor data quality can even have legal and financial consequences; for a bank, for example, it could mean losing its license to operate.

On an operational level, poor data quality causes inefficiencies, as resources are wasted on rectifying errors or reconciling inconsistent data. Additionally, poor data quality can lead to long delays in delivering data and insights, potentially upsetting stakeholders.

So managing data quality is essential, but ensuring quality at scale is not an easy task. Success requires a combination of people, processes, and technology.

Data quality and governance framework

Before implementing any services or solutions, it is essential to first define a data quality and governance framework. This framework should cover several key aspects:

  • The organization’s commitment and strategy for managing data as a valuable asset.
  • The personas involved in managing data quality, including the accountabilities and activities of data and application owners.
  • Whether data quality covers both source systems and analytical data platforms.
  • Agreements on intrusive and non-intrusive data quality processing.
  • Where to implement the different data quality dimensions, such as consistency and completeness… (a minimal sketch of such checks follows below)
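
To make these dimensions more tangible, below is a minimal sketch of how completeness and consistency checks could look in a Lakehouse using PySpark. The table name (bronze.customers), the columns (customer_id, email, country), the reference list of country codes, and the 5% threshold are all hypothetical and only serve to illustrate the idea; a real implementation would typically plug such checks into a pipeline or a dedicated data quality framework.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical customer table in the Lakehouse (illustrative name only)
customers = spark.read.table("bronze.customers")
total_rows = customers.count()

# Completeness: how many rows are missing a customer_id or email?
completeness = customers.select(
    F.sum(F.col("customer_id").isNull().cast("int")).alias("missing_customer_id"),
    F.sum(F.col("email").isNull().cast("int")).alias("missing_email"),
).first()

# Consistency: do all country codes conform to the agreed reference list?
valid_countries = ["NL", "BE", "DE", "FR"]  # assumed reference data
inconsistent_country = customers.filter(~F.col("country").isin(valid_countries)).count()

metrics = {
    "total_rows": total_rows,
    "missing_customer_id": completeness["missing_customer_id"] or 0,
    "missing_email": completeness["missing_email"] or 0,
    "inconsistent_country": inconsistent_country,
}
print(metrics)  # non-intrusive: record the metrics and continue

# Intrusive: stop the pipeline when an agreed threshold is exceeded
if metrics["missing_customer_id"] / max(total_rows, 1) > 0.05:
    raise ValueError("Completeness check failed for customer_id")
```

Frameworks such as Great Expectations, or the expectations built into Delta Live Tables, let you express the same checks declaratively; the difference between non-intrusive and intrusive processing then comes down to whether a failed check is merely recorded or causes records to be dropped or the pipeline to fail.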


Written by Piethein Strengholt

Hands-on Chief Data Officer. Working @Microsoft.

Responses (6)

Responding to the highlight “Furthermore, managing data quality in external locations, such as a Lakehouse, allows for more advanced and flexible data quality processes.”:

Agree, but I think this sort of runs counter to the Data Contract paradigm, which suggests a shift-left approach to handling data quality. As much as possible, one would expect the upstream data sources or applications to define and implement data quality…

Good overview! For ERP and similar systems, data cleaning should be performed as much as possible in the source; otherwise, quality acceptance for the various data products will become a nightmare. One can develop data quality assessment reports on the…

Responding to the highlight “although it requires Python”:

It is possible to build DLT workflows with an extended SQL syntax.
