Modern organizations are producing, collecting, and processing more data than ever before. An IDG survey of data professionals reveals that data volumes are growing at an average rate of 63% per month. As data proliferates in businesses, the technologies we use to move this data around have become more intricate and complex, to the point that we completely lose visibility into how data is processed. As a result, mistakes accumulate as data is moved around, and we end up with crap, unusable data. Thankfully, observability tools have flourished in the past few years, helping companies regain control over data processing. Today, we attempt to understand the concept of Data Observability, and to untangle the vibrant ecosystem of observability tools.
The concept of observability originates in control theory, where Rudolf Kalman introduced it for linear dynamic systems. We define it as a measure of how well the health of a system can be deduced from its outputs.
Data observability came into existence as a result of the evolution of data pipelines. Data pipelines refer to any set of processing elements in charge of moving data from one system to another. They are the backbone of data analysis and data science activities, as they generate the data assets used by data scientists and data analysts. Until 15-20 years ago, data pipelines were rather basic, serving stable requirements for business analytics. Business intelligence teams required historical measures of their financial position, inventory levels, sales pipelines, and other operational metrics. Nothing too serious. Data engineers used ETL (Extract, Transform, Load) tools to transform the data for specific use cases and load it into the data warehouse. From this, data analysts created dashboards and reports using BI software, and life was good.
In recent years, we have witnessed a skyrocketing demand for data. Data is everywhere and everybody consumes it, from data analysts to business users.
To ensure data users properly understand the data available in the data warehouse, we recommend using a data catalog such as Castor or Collibra (depending on your size). You can find a benchmark of all data catalog tools here.
Data pipelines now run with a combination of complex tools (Spark, Kubernetes, Airflow), and as the number of interconnected parts increases, so does the risk of pipeline failure. The diversity of tools is necessary, as it allows data teams to choose the best platforms at each layer of their data stack. But the combination of all these engines makes it practically impossible to gain visibility into the different parts of the pipelines.
Modern data pipelines are not only complex, they also behave like black boxes. You know what goes in, you know what comes out, but you have no clue what happened in between. It's fine as long as the desired outcome comes out. But when it doesn't, it's highly frustrating. When data sets come out of the pipeline, you are often left with strange values, missing columns, letters in fields that were meant to be numeric, and so on. As a result, data engineers spend hours scratching their heads about what on earth went wrong, where, and how to fix it. Forrester estimates that data teams spend upwards of 40% of their time on data quality issues instead of working on value-generating activities for the business. All this under considerable pressure, because the data analyst who used this poor-quality data for a client presentation is being beaten up by their manager. And that's the most important part. The whole purpose of collecting, processing and analyzing data is to create business value. Without visibility into the pipeline, errors accumulate, the data deteriorates, and business value is destroyed. That's where data observability technology comes in. It tells you to sit down and relax, because it's going to shed light on the root cause of every pipeline failure. No more tears, no more head-scratching, and no more beating up. But in practice, how does it work?
Data Monitoring is too often mixed up with Data Observability. The two terms are closely related, which explains the blurriness of the line between them. Data monitoring is the first step towards data observability and a subset of observability. Say you are observing a data pipeline system. Monitoring's fundamental purpose is to understand the state of your system using a pre-defined set of system metrics and logs, which are used to raise alerts about incidents. It either tells you "the system is healthy" or "there is an issue there". As such, monitoring applications allow you to detect a known set of failure modes. They can't remedy the black box issue. With monitoring, you can:
Data monitoring flags your data issues, and that's great. The downside is, identifying the issue is just the beginning of the journey, and monitoring doesn't fare well when it comes to understanding the root cause of a pipeline outage.
Data observability brings a more proactive approach, by offering a deeper view of your system's operations. It uses the data and insights that monitoring produces to create a holistic understanding of your system, its health, and performance.
Say you own a grocery shop. Every morning, grocery items are displayed on the shelves for the clients to grab. Simple. What monitoring does is alert you to any issue that may occur with the products: when there are missing items on the shelves, when a product has spilled all over the place, etc. Observability goes one step further, providing you with a clear view of the supply chain behind the stacking of shelves. Your grocery shop is "observable" if it is easy for you to understand why an item is missing from a shelf, that is, if you can see what happens in the backroom or how employees arrange the items on trolleys.
In the realm of data, observability sheds light on the workflow occurring in your data pipelines, allowing you to easily navigate from an effect to a cause. With observability, you can:
Monitoring is a subset of observability, and you can only monitor an observable system. Monitoring tells you "your pipeline failed", while observability tells you "you have a pipeline outage because a Spark job failed due to an invalid row".
There are three generations of observability tools:
1st generation: Traditional APM (Application Performance Management) tools. These tools were created for software engineers and DevOps, but were the first tools used for Data Observability.
2nd generation: Data Observability tools. These tools allow you to drill into the source of your data issues and to understand the root cause of the problem.
3rd generation: Automated Data Observability. These intelligent tools can predict and automatically fix issues before they impact performance.
Observability is not new. It's a very well-established concept in the DevOps world. The transition of modern organizations from a monolithic to a microservice architecture led to the rise of DevOps, teams that remove barriers between the traditionally siloed teams of development and operations. DevOps teams keep a constant pulse on the health of their systems, ensuring that applications and infrastructure are up and running. The concept of observability stems from this development.
DevOps teams used Application Performance Management (APM) tools to monitor their systems and infrastructure. APM seeks to detect and diagnose complex application performance issues to maintain the level of service. These tools bring together three monitoring components to spot and resolve system failures: Metrics, Logs, and Traces. These are often called "the three pillars of observability".
Metrics: A metric is a numeric value measured over a given interval of time. It includes particular attributes such as time, name, KPI, and value. Metrics can use mathematical modeling and prediction to derive knowledge of the behavior of a system over intervals of time.
Logs: A log is a text record of an event that happened at a specific time. In a log, you find a note of the event, the time at which it happened, and some context around the event.
Traces: A trace represents the end-to-end journey of a request through a distributed system. That is, a record of each operation that was performed in the different microservices. Traces are representations of logs, and a single trace can provide visibility into both the path traveled by a request and its structure.
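To make the three pillars concrete, here is a minimal sketch of how a metric, a log, and a trace could be represented. The class names, fields, and sample values (`checkout_latency_ms`, `req-001`, the service names) are assumptions made up for illustration; they are not taken from any particular APM tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Metric:
    """A numeric value measured at a point in time."""
    name: str
    value: float
    timestamp: datetime


@dataclass
class LogRecord:
    """A text record of an event, with its time and some context."""
    message: str
    timestamp: datetime
    context: dict = field(default_factory=dict)


@dataclass
class Span:
    """One operation performed by one service while handling a request."""
    service: str
    operation: str
    duration_ms: float


@dataclass
class Trace:
    """The end-to-end journey of a single request through the system."""
    request_id: str
    spans: list = field(default_factory=list)

    def path(self):
        # The ordered list of services the request travelled through.
        return [span.service for span in self.spans]

    def total_duration_ms(self):
        # Sum of the time spent in every recorded operation.
        return sum(span.duration_ms for span in self.spans)


# Hypothetical sample data for a single checkout request.
now = datetime.now(timezone.utc)
latency = Metric(name="checkout_latency_ms", value=182.0, timestamp=now)
error = LogRecord("payment declined", now, {"order_id": "A-17"})
trace = Trace("req-001", [
    Span("gateway", "POST /checkout", 12.0),
    Span("orders", "create_order", 50.0),
    Span("payments", "charge_card", 120.0),
])

print(trace.path())               # ['gateway', 'orders', 'payments']
print(trace.total_duration_ms())  # 182.0
```

The metric says how slow the request was, the log says what happened, and the trace says where the time went. Having all three is what lets you move from "something is wrong" to "this is the service to look at".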
Integrating these three monitoring components within a single solution allows DevOps to gain visibility over their systems. These systems are said to be observable, as it is possible to infer their health based on the monitoring outputs.
As data proliferated and created a need for observable pipelines, data engineering teams turned to the standard APM tools described above in an attempt to gain observability over their data stack.
However, APM tools were specifically built for software engineers and DevOps, not for monitoring data pipelines. Although some companies use them as data observability tools, they don't always do the trick. The reason is that technology systems and data pipelines are inherently different. For example, with data pipelines, it is totally normal to encounter many failures before a process runs successfully. APM tools can't make sense of this, nor of the other nuances characterizing the logic of data pipelines. Data teams using APM tools to gain visibility over their pipelines usually end up with the wrong kind of alerts, which brings about unnecessary confusion.
To gain observability over data pipelines, it is necessary to monitor a set of dimensions beyond the standard set presented above. APM tools are not cut out for data observability, and they have gradually been replaced by more fine-tuned tools that monitor the components relevant to the health of data pipelines.
Data quality: Are there issues in data quality? Does your dataset have the volume you expect it to have, or is there missing data? Is the distribution of your data correct?
Schedules: Is the pipeline running according to schedule? Is your data up to date or is there a glitch in the refresh schedule?
Dependencies: How will upstream data issues propagate downstream? How are the different parts of the pipeline related?
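The three dimensions above can be sketched as three small checks. This is only an illustration under stated assumptions: the function names, the 20% volume tolerance, the 24-hour freshness window, and the table names (`raw_orders`, `stg_orders`, and so on) are all hypothetical, and real tools implement these checks in far more sophisticated ways.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


def volume_ok(row_count, expected_rows, tolerance=0.2):
    """Data quality: is the row count within +/- tolerance of what we expect?"""
    return abs(row_count - expected_rows) <= tolerance * expected_rows


def is_fresh(last_loaded_at, max_age, now):
    """Schedules: was the table refreshed recently enough?"""
    return now - last_loaded_at <= max_age


def downstream_of(asset, dependents):
    """Dependencies: every asset that would inherit an issue from `asset`."""
    seen, queue = [], deque([asset])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


# Hypothetical pipeline state.
now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
dependents = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_sales", "fct_refunds"],
    "fct_sales": ["revenue_dashboard"],
}

print(volume_ok(row_count=700, expected_rows=1000))    # False: 30% of rows missing
print(is_fresh(last_load, timedelta(hours=24), now))   # False: last load was 27h ago
print(downstream_of("raw_orders", dependents))
# ['stg_orders', 'fct_sales', 'fct_refunds', 'revenue_dashboard']
```

Here `raw_orders` is both short on rows and stale, and the dependency walk shows which assets, down to the dashboard, are exposed to the problem.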
This is one way of approaching the pillars of Data Observability. Barr Moses proposes another, in which she outlines five pillars of data observability. The number of "pillars" of data observability doesn't matter that much. The idea is: you can gain observability over your stack by monitoring a certain number of components that will tell you about the health of your data pipeline. Different observability solutions will monitor different components of the pipeline according to the kind of observability they want to achieve. That's why it's important to pick the right solution for your needs.
Third-generation observability tools possess extensive automation features. This means that the tools adapt their monitoring approach according to the dataset they are presented with, automatically deciding where to set the alert thresholds. This way, you can have observable pipelines without going through the tiresome task of defining an alert threshold for each of your data assets.
Using machine learning models, modern data observability platforms such as Bigeye analyze trends in your data and automatically recommend data quality metrics to start tracking for a particular asset. You don't have to spend time defining all the metrics and logs that matter to your business, as this is achieved automatically. More importantly, modern tools can forecast data quality metrics and automatically adjust the alert thresholds based on this forecast. This saves time, as data teams don't have to adjust the thresholds manually. It also ensures that teams always get relevant alerts.
Some solutions, such as databand.ai, offer automation features like dynamic pipeline runs, which seek to improve the performance of the data pipeline. This feature grants data teams the ability to run different versions of the pipeline based on changing data inputs, parameters or model scores.
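As a rough illustration of thresholds that adjust themselves, the sketch below derives an alert band from the recent history of a metric instead of a hand-picked constant. It is a deliberately naive stand-in (a recent average plus or minus a few standard deviations), not the method used by Bigeye or any other vendor; the function names, the 7-day window, and the factor `k=3.0` are assumptions.

```python
from statistics import mean, stdev


def adaptive_band(history, window=7, k=3.0):
    """Return (low, high) alert thresholds derived from recent values."""
    recent = history[-window:]
    forecast = mean(recent)   # naive forecast: the recent average
    spread = stdev(recent)    # how much the metric usually moves
    return forecast - k * spread, forecast + k * spread


def should_alert(history, new_value, window=7, k=3.0):
    """Alert only when the new value falls outside the adaptive band."""
    low, high = adaptive_band(history, window, k)
    return not (low <= new_value <= high)


# Hypothetical daily row counts for one table.
daily_rows = [1000, 1020, 990, 1010, 1005, 995, 1015]

print(should_alert(daily_rows, 1008))  # False: within normal variation
print(should_alert(daily_rows, 400))   # True: far outside the band
```

Because the band is recomputed from the latest window, it follows the data as volumes grow or shrink, which is the property that removes the need to maintain a threshold per asset by hand.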
Below, you will find an observability tools landscape, which can hopefully help you choose an observability tool adapted to the needs of your company. We have classified the solutions across two dimensions:
Real-time data monitoring refers to whether the solution can identify issues as they are happening in the data pipeline, which in turn makes it possible to stop the pipeline before bad data is loaded.
Not all data observability solutions are the same. We especially differentiate solutions according to whether they use a pipeline testing or an anomaly detection framework.
A pipeline testing framework allows data engineers to test binary statements. For example, whether all the values in a column are unique, or whether the schema matches a certain expression. When tests fail, the data is labelled as "bad". With this information, the data engineering team can diagnose poor quality data and take the necessary steps to resolve issues. Tests are repeated at different steps in the pipeline, which makes it easy for data engineers to see clearly at which stage/layer of the pipeline the data broke, and find the most appropriate person to fix the issue.
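A minimal sketch of the pipeline testing idea, assuming rows are plain dictionaries: each test is a binary statement, the same tests run at several stages, and the first stage labelled "bad" tells you where the data broke. The helper names (`check_unique`, `check_schema`, `run_tests`) and the sample rows are hypothetical and do not correspond to any specific testing framework.

```python
def check_unique(rows, column):
    """Binary test: are all values in `column` unique?"""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def check_schema(rows, expected):
    """Binary test: does every row have exactly the expected columns and types?"""
    return all(
        set(row) == set(expected)
        and all(isinstance(row[col], typ) for col, typ in expected.items())
        for row in rows
    )


def run_tests(stage, rows, tests):
    """Run every test on one pipeline stage and label the data good or bad."""
    failed = [name for name, test in tests.items() if not test(rows)]
    return {"stage": stage, "status": "bad" if failed else "good", "failed": failed}


tests = {
    "order_id is unique": lambda rows: check_unique(rows, "order_id"),
    "schema matches": lambda rows: check_schema(rows, {"order_id": int, "amount": float}),
}

# Hypothetical data at two stages of the same pipeline.
staging = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": 12.0}]
transformed = [{"order_id": 1, "amount": 9.5}, {"order_id": 1, "amount": "12.0"}]

print(run_tests("staging", staging, tests))
# {'stage': 'staging', 'status': 'good', 'failed': []}
print(run_tests("transformed", transformed, tests))
# {'stage': 'transformed', 'status': 'bad', 'failed': ['order_id is unique', 'schema matches']}
```

Since the staging data passes and the transformed data fails, the issue was introduced by the transformation step, which points to the person who owns that step.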
In an anomaly detection framework, the solution scans data assets, collects statistics from this data and pays attention to changes in the behaviour of these statistics. Alert thresholds are set (automatically, manually or both), and the solution sends an alert to the platform of your choice when a statistic's behaviour reveals an issue with the data. With this framework, solutions usually find the root cause of data issues using data lineage, allowing them to retrace all the events the data went through.
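The two halves of this framework, statistics that are watched for changes and lineage that is walked back to a root cause, can be sketched as follows. Everything here is an assumption for illustration: the statistic (null rate), the z-score rule with a threshold of 3, the asset names, and the lineage map are hypothetical, and commercial tools use richer statistics and models.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` standard deviations from the history."""
    spread = stdev(history)
    if spread == 0:
        return latest != mean(history)
    return abs(latest - mean(history)) / spread > threshold


def root_causes(asset, upstream, anomalous):
    """Walk lineage upstream and return the anomalous assets that have no anomalous parent."""
    parents = [p for p in upstream.get(asset, []) if p in anomalous]
    if not parents:
        return [asset]
    found = []
    for parent in parents:
        for cause in root_causes(parent, upstream, anomalous):
            if cause not in found:
                found.append(cause)
    return found


# Hypothetical statistic collected on each scan: the share of null values per asset.
null_rate_history = {
    "raw_orders": [0.01, 0.02, 0.01, 0.02, 0.01],
    "raw_customers": [0.00, 0.01, 0.00, 0.01, 0.00],
    "stg_orders": [0.01, 0.02, 0.01, 0.02, 0.01],
    "fct_sales": [0.02, 0.02, 0.03, 0.02, 0.03],
}
latest_null_rate = {
    "raw_orders": 0.40,
    "raw_customers": 0.01,
    "stg_orders": 0.41,
    "fct_sales": 0.45,
}
# Lineage: each asset mapped to the assets it is built from.
upstream = {"fct_sales": ["stg_orders", "raw_customers"], "stg_orders": ["raw_orders"]}

anomalous = {
    asset for asset, history in null_rate_history.items()
    if is_anomalous(history, latest_null_rate[asset])
}
print(sorted(anomalous))                             # ['fct_sales', 'raw_orders', 'stg_orders']
print(root_causes("fct_sales", upstream, anomalous)) # ['raw_orders']
```

Three assets raise an alert, but lineage shows that two of them merely inherited the problem: the one to fix is `raw_orders`.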
To go one step further, we need to look at some observability tools to understand the current ecosystem better. When trying to work out which observability tools are the best, the key thing to check is whether and how they provide monitoring, end-to-end visibility, and telemetry data that can be collected and used across all parts of the IT infrastructure. In the modern data ecosystem, many of these tools and their key features run in cloud-based environments, so both your firm and the tools you use should be adapted to the cloud. For example, firms using AWS services need to be able to rely on effective monitoring and observability, yet this is not always feasible with tools that can't handle the complexity of the cloud environment.
Firms must provide their developers with the right tools and insights in order to solve important problems, stand out from competitors, and survive in today's complex cloud world. A developer who gets the right insight can focus on the right problem rather than the wrong one, which allows your firm to use its time and resources more efficiently. Moreover, with more accurate identification and telemetry, you get a better view of your firm's security, which is a significant plus.