Modern organizations are producing, collecting, and processing more data than ever before. An IDG survey of data professionals reveals that data volumes are growing at an average rate of 63% per month. As businesses handle more data, the tools used to manage it become more complicated, which makes it hard to see how data is processed. This opacity leads to errors and poor-quality data. Luckily, recent observability tools help companies regain control over data processing. We'll explore the idea of Data Observability and the range of available observability tools.
Observability comes from control theory, where Rudolf Kalman introduced it for linear dynamic systems. It is a measure of how well a system's internal state, and therefore its health, can be inferred from its outputs. Data observability emerged from the development of data pipelines, which move data between systems and are essential for data analysis and data science. About 15-20 years ago, data pipelines were simpler, catering to stable business analytics needs. Data engineers used ETL (Extract, Transform, Load) tools to prepare data for specific uses and load it into data warehouses. Data analysts then created dashboards and reports with BI software, and life was good.
In recent years, the demand for data has grown rapidly. Everyone, from data analysts to business users, needs data.
To help data users understand the data that lives in the data warehouse, we recommend using a data catalog such as Castor or Collibra (depending on your size). You can find a benchmark of all data catalog tools here.
Data pipelines now use a mix of complex tools (Spark, Kubernetes, Airflow), increasing the risk of pipeline failure as the number of connected parts grows. The variety of tools is essential, as it lets data teams choose the best platforms for their data stack. However, this combination makes it very difficult to see the different parts of the pipelines.
Modern data pipelines are not only complex, they also behave like black boxes. You know what goes in and what comes out, but you have no clue what happened in between. That's fine as long as the desired outcome comes out. But when it doesn't, it's highly frustrating.
When datasets come out of the pipeline, you are often left with strange values, missing columns, letters in fields that were meant to be numeric, and so on. As a result, data engineers spend hours scratching their heads about what on earth went wrong, where, and how to fix it. Forrester estimates that data teams spend upwards of 40% of their time on data quality issues instead of working on value-generating activities for the business.
All this happens under considerable pressure, because the data analyst who used this poor-quality data for a client presentation is being beaten up by their manager. And that's the most important part. The whole purpose of collecting, processing, and analyzing data is to create business value. Without visibility into the pipeline, errors accumulate, the data deteriorates, and business value is destroyed. That's where data observability technology comes in. It tells you to sit down and relax, because it's going to shed light on the root cause of every pipeline failure. No more tears, no more head-scratching, and no more beating up. But in practice, how does it work?
Data monitoring is too often mixed up with data observability. The two terms are symbiotic, which explains the blurriness of the line between them. Data monitoring is the first step towards data observability and a subset of observability. Say you are observing a data pipeline system. Monitoring's fundamental purpose is to understand the state of your system using a pre-defined set of system metrics and logs, which are leveraged to raise alerts about incidents. It tells you either "the system is healthy" or "there is an issue there". As such, monitoring applications allow you to detect a known set of failure modes. Monitoring alone can't remedy the black-box issue. With monitoring, you can:
Data monitoring flags your data issues, and that's great. The downside is, identifying the issue is just the beginning of the journey, and monitoring doesn't fare well when it comes to understanding the root cause of a pipeline outage.
Data observability brings a more proactive approach, by offering a deeper view of your system's operations. It uses the data and insights that monitoring produces to create a holistic understanding of your system, its health, and performance.
Say you own a grocery shop. Every morning, grocery items are displayed on the shelves for the clients to grab. Simple. What monitoring does is alert you to any issue that may occur with the products: when items are missing from the shelves, when a product has spilled all over the place, and so on. Observability goes one step further, providing you with a clear view of the supply chain behind the stacking of shelves. Your grocery shop is "observable" if it is easy for you to understand why an item is missing from a shelf, that is, if you can see what happens in the backroom or how employees arrange the items on trolleys.
In the realm of data, observability sheds light on the workflow occurring in your data pipelines, allowing you to easily navigate from an effect to a cause. With observability, you can:
Monitoring is a subset of observability, and you can only monitor an observable system. Monitoring tells you "your pipeline failed", while observability tells you "you have a pipeline outage because a Spark job failed due to an invalid row".
There are three generations of observability tools:
1st generation: Traditional APM (Application Performance Management) tools. These tools were created for software engineers and DevOps, but they were the first tools used for data observability.
2nd generation: Data Observability tools. These tools allow you to drill into the source of your data issues and to understand the root cause of the problem.
3rd generation: Automated Data Observability. These intelligent tools can predict and automatically fix issues before they impact performance.
Observability isn't new; it's well-known in the DevOps world. As organizations shifted from a monolith to a micro-service structure, DevOps emerged, bridging gaps between development and operations teams. DevOps teams constantly monitor system health, ensuring applications and infrastructure work properly. Observability comes from this idea.
DevOps teams use Application Performance Management (APM) tools to watch over their systems and infrastructure. APM aims to find and fix complex performance issues to maintain service levels. These tools combine three monitoring components to identify and address system failures: Metrics, Logs, and Traces. These are often called "the three pillars of observability."
Metrics: A metric is a number measured over a specific time period. It has attributes like time, name, KPI, and value. Because metrics are numeric, mathematical models and forecasts can be applied to them to learn about a system's behavior over time.
Logs: A log is a text record of an event at a certain time. It contains the event's details, the time it happened, and some context around the event.
Traces: A trace shows the full journey of a request through a distributed system. It records each operation performed in different micro-services. Traces are based on logs, and a single trace can show both the request's path and its structure.
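To make the three pillars concrete, here is a minimal sketch using only the Python standard library. The service name, step names, and the `Metric` and `Span` classes are hypothetical, chosen for illustration; real APM tools collect this telemetry through their own agents and SDKs.

```python
import logging
import time
import uuid
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("orders_service")  # hypothetical service name


@dataclass
class Metric:
    """A number attached to a name and the time it was measured."""
    name: str
    value: float
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    """One operation in a trace; spans sharing a trace_id describe one request."""
    trace_id: str
    name: str
    duration_s: float


def handle_request(metrics: list, spans: list) -> str:
    """Simulate one request and emit a log line, a span, and a metric for it."""
    trace_id = uuid.uuid4().hex
    for step in ("auth", "load_order", "render"):  # hypothetical operations
        start = time.perf_counter()
        # ... the real work of the step would happen here ...
        spans.append(Span(trace_id, step, time.perf_counter() - start))
        log.info("step=%s trace_id=%s finished", step, trace_id)  # log: event + context
    metrics.append(Metric("requests_total", 1.0))  # metric: a number over time
    return trace_id


metrics, spans = [], []
trace_id = handle_request(metrics, spans)
```

The three spans share one `trace_id`, which is what lets a tracing tool stitch the separate operations back into the journey of a single request.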
Integrating these three monitoring components within a single solution allows DevOps to gain visibility over their systems. These systems are said to be observable, as it is possible to infer their health based on the monitoring outputs.
As data proliferated and the need for observable pipelines grew, data engineering teams turned to the standard APM tools described above in an attempt to gain observability over their data stack.
However, APM tools were specifically built for software engineers and DevOps, not for monitoring data pipelines. Although some companies use them as data observability tools, they don't always do the trick.
The reason is that technology systems and data pipelines are very different. For instance, it's common for data pipelines to have many failures before working correctly. APM tools can't understand this or other unique features of data pipelines. Data teams using APM tools for their pipelines often get incorrect alerts, causing unneeded confusion.
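One way to picture this mismatch is a pipeline task that is expected to fail a few times before it succeeds. The sketch below is an assumption about how a pipeline-aware alert rule could look, not a description of any specific tool: it raises an alert only once every retry is exhausted, whereas a naive rule would alert on each failed attempt. The helper names are hypothetical.

```python
def make_flaky_task(failures_before_success):
    """Build a hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("upstream file not ready")

    return task


def run_with_retries(task, max_attempts=3):
    """Run the task up to max_attempts times; return (succeeded, attempts, errors)."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True, attempt, errors
        except Exception as exc:
            errors.append(f"attempt {attempt}: {exc}")
    return False, max_attempts, errors


succeeded, attempts, errors = run_with_retries(make_flaky_task(2), max_attempts=3)

naive_alerts = len(errors)                    # one alert per failed attempt
pipeline_aware_alerts = 0 if succeeded else 1  # alert only if the run never succeeds
```

Here the task succeeds on its third attempt, so the naive rule would have fired twice while the pipeline-aware rule stays silent.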
To achieve observability in data pipelines, you need to monitor more aspects than the standard set mentioned earlier. APM tools aren't suitable for data observability, and they've been slowly replaced by more specialized tools that monitor elements relevant to data pipeline health.
Data quality: Are there issues in data quality? Does your dataset have the volume you expect it to have, or is there missing data? Is the distribution of your data correct?
Schedules: Is the pipeline running according to schedule? Is your data up to date, or is there a glitch in the refresh schedule?
Dependencies: How will upstream data issues propagate downstream? How are the different parts of the pipeline related?
This is one way of approaching the pillars of data observability. Barr Moses proposes another, in which she outlines five pillars of data observability. The number of "pillars" of data observability doesn't matter that much. The idea is that you can gain observability over your stack by monitoring a certain number of components that tell you about the health of your data pipeline. Different observability solutions monitor different components of the pipeline according to the kind of observability they want to achieve. That's why it's important to pick the right solution for your needs.
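The three questions above can each be turned into a small check. The sketch below is illustrative only: the table names, row values, and thresholds are made up, and the dependency graph is a plain dictionary mapping each asset to its direct downstream consumers.

```python
from datetime import datetime, timedelta, timezone


def check_volume(row_count, expected_min, expected_max):
    """Data quality: is the dataset roughly the size we expect?"""
    return expected_min <= row_count <= expected_max


def null_rate(rows, column):
    """Data quality: share of rows where the column is missing."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def is_fresh(last_loaded_at, max_age, now):
    """Schedules: was the data refreshed recently enough?"""
    return now - last_loaded_at <= max_age


def downstream_of(asset, dependencies):
    """Dependencies: every asset an upstream issue could propagate to."""
    impacted, stack = set(), [asset]
    while stack:
        current = stack.pop()
        for child in dependencies.get(current, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted


# Hypothetical sample data
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 35.5},
    {"order_id": 4, "amount": 12.0},
]
dependencies = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_revenue", "fct_customers"],
    "fct_revenue": ["revenue_dashboard"],
}
now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
last_loaded_at = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)

volume_ok = check_volume(len(rows), expected_min=3, expected_max=10)
amount_null_rate = null_rate(rows, "amount")
fresh = is_fresh(last_loaded_at, max_age=timedelta(hours=24), now=now)
impacted = downstream_of("stg_orders", dependencies)
```

In this example the volume looks fine, a quarter of the amounts are missing, the table is 27 hours old against a 24-hour limit, and an issue in `stg_orders` would reach two fact tables and one dashboard.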
Third-generation observability tools have extensive automation features. These tools adjust their monitoring approach based on the dataset they're working with, automatically setting alert thresholds. This way, you can observe pipelines without the tedious task of defining thresholds for each data asset.
Modern data observability platforms like Bigeye use machine learning models to analyze trends and automatically recommend data quality metrics for specific assets. You don't need to spend time defining all the metrics and logs important to your business, as it's done automatically. Even better, these tools can predict data quality metrics and auto-adjust alert thresholds based on the forecast. This saves time and ensures teams receive relevant alerts.
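As a rough illustration of automatically set thresholds, the sketch below derives alert bounds from a metric's own history instead of a hand-written rule. It uses a simple mean and standard deviation band as a stand-in; this is an assumption for illustration and not how Bigeye or any other vendor actually models the data. The row counts are made up.

```python
from statistics import mean, stdev


def auto_thresholds(history, k=3.0):
    """Derive (low, high) alert bounds as mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """Flag a new observation that falls outside the bounds learned from history."""
    low, high = auto_thresholds(history, k)
    return not (low <= value <= high)


# Hypothetical daily row counts for one table
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

low, high = auto_thresholds(daily_row_counts)
normal_day = is_anomalous(1015, daily_row_counts)   # inside the learned band
broken_load = is_anomalous(400, daily_row_counts)   # far below the learned band
```

Because the bounds are recomputed from the history, they move as the dataset's normal behavior changes, which is the property that removes the need to define a threshold by hand for each data asset.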
Some solutions, like databand.ai, offer automation features such as dynamic pipeline runs to improve data pipeline performance. This feature allows data teams to run different versions of the pipeline based on changing data inputs, parameters, or model scores.
Below, you will find an observability tools landscape, which can hopefully help you choose an observability tool adapted to the needs of your company. We have classified the solutions across two dimensions:
Real-time data monitoring refers to whether the solution can identify issues as they happen in the data pipeline, which in turn makes it possible to stop the pipeline before bad data is loaded.
Data observability solutions differ, mainly based on whether they use a pipeline testing or anomaly detection framework.
A pipeline testing framework lets data engineers test binary statements, like checking if all values in a column are unique or if the schema matches a specific expression. When tests fail, data is labeled as "bad," helping data engineers diagnose poor quality data and resolve issues. Tests are repeated at various pipeline stages, making it easy to identify where the data problem occurred and who should fix it.
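The sketch below shows the idea in plain Python: binary tests (uniqueness and schema) are run after each stage, and the first failing stage points to where the problem was introduced. The stage names, rows, and helper functions are hypothetical and do not reproduce the API of any particular testing framework.

```python
def expect_unique(rows, column):
    """Binary test: every value in the column appears exactly once."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))


def expect_schema(rows, schema):
    """Binary test: each row has exactly the expected columns and value types."""
    return all(
        set(r) == set(schema) and all(isinstance(r[c], t) for c, t in schema.items())
        for r in rows
    )


def run_stage_tests(stage, rows, tests):
    """Run every test against one stage's output and report which ones failed."""
    failed = [name for name, test in tests.items() if not test(rows)]
    return {"stage": stage, "passed": not failed, "failed_tests": failed}


tests = {
    "id_is_unique": lambda rows: expect_unique(rows, "id"),
    "schema_matches": lambda rows: expect_schema(rows, {"id": int, "amount": float}),
}

# Hypothetical outputs of two pipeline stages; the transform step duplicated an id
extracted = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 12.5}]
transformed = [{"id": 1, "amount": 10.0}, {"id": 1, "amount": 12.5}]

reports = [
    run_stage_tests("extract", extracted, tests),
    run_stage_tests("transform", transformed, tests),
]
first_bad = next((r for r in reports if not r["passed"]), None)
```

Since the extract stage passes and the transform stage fails on `id_is_unique`, the duplicate must have been introduced by the transform step, which narrows down both where to look and who should fix it.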
An anomaly detection framework scans data assets, collects statistics, and monitors changes in these statistics. Alert thresholds are set (automatically, manually, or both), and the solution sends an alert when the statistic's behavior indicates a data issue. These frameworks often find the root cause of data problems using data lineage, retracing all events the data went through.
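Retracing a problem through lineage can be sketched as a walk up a dependency graph. In the example below, the asset names and the set of assets flagged as anomalous are invented for illustration; the function follows anomalous parents upstream and returns the anomalous assets that have no anomalous parent of their own, which are the most likely origins of the issue.

```python
def root_causes(asset, lineage, anomalous):
    """Walk upstream from asset and return the most upstream anomalous ancestors."""
    roots, seen, stack = set(), set(), [asset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        bad_parents = [p for p in lineage.get(current, []) if p in anomalous]
        if current in anomalous and not bad_parents:
            roots.add(current)  # anomalous, but everything feeding it looks healthy
        stack.extend(bad_parents)
    return roots


# Hypothetical lineage: each asset maps to its direct upstream parents
lineage = {
    "revenue_dashboard": ["fct_revenue"],
    "fct_revenue": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw_orders"],
    "stg_payments": ["raw_payments"],
}
# Hypothetical result of anomaly detection on each asset's statistics
anomalous = {"revenue_dashboard", "fct_revenue", "stg_orders", "raw_orders"}

origins = root_causes("revenue_dashboard", lineage, anomalous)
```

Starting from the broken dashboard, the walk ignores the healthy payments branch and ends at `raw_orders`, the point where the bad data entered the pipeline.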
To go one step further, it helps to look at concrete observability tools to better understand the current ecosystem. When assessing which observability tools are best, the key thing to check is whether and how they provide monitoring, end-to-end visibility, and telemetry data that can be collected and used across all parts of the IT infrastructure. In the modern data ecosystem, many of these tools run in cloud-based environments, so both your firm and the tools you use should be adapted to the cloud. For example, firms that use AWS services need effective monitoring and observability, yet many tools can't handle the complexity of the cloud environment or provide the necessary solutions.
Firms must give their developers the right tools and insights to solve important problems, stand out from competitors, and survive in today's complex cloud world. A developer with the right insight can focus on the right problems rather than the wrong ones, which lets your firm use its time and resources more efficiently. Moreover, more accurate identification and telemetry give you a better view of your firm's security, which is a significant plus.