Data Quality within Lakehouses

Piethein Strengholt
9 min read · Mar 5, 2024

Introduction

I am often asked: how do we ensure data quality within Data Lakehouses? Or: how do I manage data quality for my data products? In this article, we will first look at whether validation is needed at all and, if so, what data must be validated and how to validate it. These are questions that many enterprises deal with at the start of their Data Lakehouse journeys.

Why data quality?

Data quality matters. Failing to validate data quality can lead to serious consequences, both operational and strategic, for any organization.

On a strategic level, poor data quality can result in incorrect insights, leading to wrong decisions and strategies. These inaccuracies can cause loss of revenue, customer dissatisfaction, and damage to the organization’s reputation. In highly regulated industries, severely poor data quality can even have legal and financial consequences; for a bank, for example, it could mean losing its license to operate.

On an operational level, poor data quality causes inefficiencies, as resources are wasted on rectifying errors or reconciling inconsistent data. Additionally, poor data quality can lead to long delays in delivering data and insights, potentially upsetting stakeholders.

So managing data quality is essential, but ensuring quality at scale is not an easy task. Success requires a combination of people, processes, and technology.

Data quality and governance framework

Before implementing any services or solutions, it is essential to first define a data quality and governance framework. This framework should cover several key aspects:

  • The organization’s commitment and strategy for managing data as a valuable asset.
  • The personas involved in managing data quality, including the accountabilities and activities of data and application owners.
  • Whether data quality covers both source systems and analytical data platforms.
  • Agreements on intrusive and non-intrusive data quality processing.
  • Where to implement the different data quality dimensions, such as consistency and completeness… (a minimal sketch of such checks follows below)
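
To make these dimensions more tangible, below is a minimal sketch of how completeness and consistency checks could look in a Lakehouse using PySpark. The table name (bronze.customers), the columns (customer_id, email, country), the reference list of country codes, and the 5% threshold are all hypothetical and only serve to illustrate the idea; a real implementation would typically plug such checks into a pipeline or a dedicated data quality framework.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical customer table in the Lakehouse (illustrative name only)
customers = spark.read.table("bronze.customers")
total_rows = customers.count()

# Completeness: how many rows are missing a customer_id or email?
completeness = customers.select(
    F.sum(F.col("customer_id").isNull().cast("int")).alias("missing_customer_id"),
    F.sum(F.col("email").isNull().cast("int")).alias("missing_email"),
).first()

# Consistency: do all country codes conform to the agreed reference list?
valid_countries = ["NL", "BE", "DE", "FR"]  # assumed reference data
inconsistent_country = customers.filter(~F.col("country").isin(valid_countries)).count()

metrics = {
    "total_rows": total_rows,
    "missing_customer_id": completeness["missing_customer_id"] or 0,
    "missing_email": completeness["missing_email"] or 0,
    "inconsistent_country": inconsistent_country,
}
print(metrics)  # non-intrusive: record the metrics and continue

# Intrusive: stop the pipeline when an agreed threshold is exceeded
if metrics["missing_customer_id"] / max(total_rows, 1) > 0.05:
    raise ValueError("Completeness check failed for customer_id")
```

Frameworks such as Great Expectations, or the expectations built into Delta Live Tables, let you express the same checks declaratively; the difference between non-intrusive and intrusive processing then comes down to whether a failed check is merely recorded or causes records to be dropped or the pipeline to fail.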


Written by Piethein Strengholt

Hands-on Chief Data Officer. Working @Microsoft.

Responses (6)

Responding to the highlight “Furthermore, managing data quality in external locations, such as a Lakehouse, allows for more advanced and flexible data quality processes.”:

Agree, but I think this sort of runs counter to the Data Contract paradigm, which suggests a shift-left approach to handling data quality. As much as possible, one would expect the upstream data sources or applications to define and implement data quality…

Good overview! For ERP and similar systems, data cleaning should be performed as much as possible in the source; otherwise, quality acceptance for the various data products will become a nightmare. One can develop data quality assessment reports on the…

Responding to the highlight “although it requires Python”:

It is possible to build DLT workflows with an extended SQL syntax.
