Image courtesy of NASA.

What is Data Observability?

Gobisan


Context

When dashboards fail and tickets pour in from downstream stakeholders, the data team can find itself scrambling to locate the root cause of the problem.

This is why data engineers need the ability to understand the health of the data on their platform. Data observability provides that ability.

Before jumping into data observability, let us look at the data platform.

The Data Platform

A data platform is an end-to-end solution with an integrated set of technologies that collectively meets an organization’s analytics needs by ingesting, storing, transforming, and serving the data.

Data platform based on Medallion Lakehouse architecture. Image courtesy of author.

Layers of a Data Platform:

  1. Ingestion: Data generated by source systems (batch and stream) is ingested into the data lake.
  2. Storage: Extracted data is moved to a storage layer where it can be further prepared for downstream use cases.
  3. Transformation: Raw data is cleansed and integrated in order to apply the business logic and make it available for further analysis.
  4. Serving: Prepared data is made available for downstream consumers through visualization tools, frontline operations tools, or ML pipelines.
  5. There is a fifth layer; continue reading to find out.

The flow described above is the ideal case; in the real world, anomalies creep in.

Anomalies in the behavior of the pipeline:

  • Is the pipeline ingesting timely and fresh data at an acceptable frequency?
  • Does the data lake (or warehouse) contain too many or too few rows? In other words, is the data complete?
  • Are there any data drops or duplicates while ingesting?
  • Have issues arisen downstream as a result of a source schema change?

Anomalies in the data flowing through the pipeline:

  • Is the data available to consumers accurate and consistent?
  • Are null values within an accepted range?
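The questions above can be turned into automated checks. The sketch below is a minimal, illustrative example of what such checks might look like for a single ingested batch; the table fields (`order_id`, `amount`, `loaded_at`) and thresholds are assumptions, not a prescription.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical batch of ingested rows; field names are illustrative.
batch = [
    {"order_id": 1, "amount": 10.0, "loaded_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"order_id": 2, "amount": None, "loaded_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"order_id": 2, "amount": 20.0, "loaded_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"order_id": 3, "amount": 15.0, "loaded_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]

def check_batch(rows, max_age, min_rows, max_rows, key, max_null_rate):
    """Run basic health checks over one ingested batch."""
    now = datetime.now(timezone.utc)
    newest = max(row["loaded_at"] for row in rows)
    keys = [row[key] for row in rows]
    null_rate = sum(row["amount"] is None for row in rows) / len(rows)
    return {
        "fresh": now - newest <= max_age,                 # timely ingestion?
        "volume_ok": min_rows <= len(rows) <= max_rows,   # row count in band?
        "no_duplicates": len(keys) == len(set(keys)),     # key uniqueness?
        "nulls_ok": null_rate <= max_null_rate,           # completeness?
    }

report = check_batch(batch, timedelta(hours=24), 1, 10_000, "order_id", 0.1)
print(report)
```

In this sample batch the duplicated `order_id` and the 25% null rate would both be flagged, which is exactly the kind of signal an observability layer surfaces before consumers notice.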

Answering these questions is critical to preventing or mitigating the impact of data incidents on downstream consumers. That’s where the fifth layer of the data platform comes into play, which is data observability.

What is Data Observability?

Data observability is the ability to understand the health of the data platform by capturing events, measuring metrics, and correlating them across the platform.

Why Data Observability?

Data observability tools use automated monitoring, root cause analysis, data lineage, and data health insights to proactively detect, resolve, and prevent data anomalies.

This approach results in healthier data pipelines, increased team productivity, enhanced data management practices, less data downtime, and ultimately, higher customer satisfaction.

Key Pillars of Data Observability

1. Freshness:

Freshness refers to the timeliness of data: how recently it was updated and whether it is up to date relative to expectations.

2. Distribution:

Distribution refers to the health of data at the field level: whether values fall within their expected ranges. Null rates are a key metric, and an abnormal share of nulls or out-of-range values signals a distribution issue.

3. Volume:

Volume refers to the amount of data in a file or table. Tracking it confirms that intake is complete and offers insight into the health of the data source.

4. Schema:

Schema, the formal structure of tables and fields in a database management system, is a frequent cause of data downtime: unannounced changes to fields and tables break downstream consumers, so auditing schema changes is crucial for data health.

5. Lineage:

Data lineage is a crucial tool that identifies the upstream sources and downstream consumers impacted, the teams generating and accessing the data, and provides metadata about governance, business, and technical guidelines associated with specific data tables, serving as a single source of truth for all consumers.

Core Capabilities of a Data Observability Solution

1. Detection:

Detect anomalies in the behavior of data and pipelines and send alerts.
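One common detection technique (used here as an illustrative assumption, not as any particular tool's method) is to flag a metric whose latest value deviates too far from its recent history, e.g. a z-score test on daily row counts:

```python
import statistics

# Hypothetical daily row counts for one table; the last value is today's.
history = [10_120, 9_980, 10_050, 10_210, 9_900, 10_080, 2_300]

def is_anomalous(counts, threshold=3.0):
    """Flag the latest count if it deviates from history by more than
    `threshold` standard deviations."""
    *past, latest = counts
    mean = statistics.mean(past)
    stdev = statistics.stdev(past)
    return abs(latest - mean) / stdev > threshold

print(is_anomalous(history))  # today's sudden drop to 2,300 rows is flagged
```

In a real observability tool the flagged metric would feed an alerting channel rather than a print statement.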

2. Prevention:

Prevent incidents by identifying areas for preventive maintenance, such as deteriorating queries or unused tables, and identifying areas with consistent quality issues.

3. Resolution:

Offer tools for resolving issues, including field-level lineage, automated root cause analysis, past incident records, and query logs.
