Why Plug a Data Catalog on Top of dbt?
Cross-Tools Data Lineage, Column-Level Impact Analysis, Data Asset Popularity & More
dbt (short for data build tool) is a software tool that enables data analysts and engineers to transform and model data inside the data warehouse. It lets users write, document, and execute SQL-based transformations, making the transformation step of an ETL/ELT pipeline more transparent and maintainable.
dbt has quickly become a cornerstone of the analytics workflow for data teams across the globe. Beyond running transformations, it brings testing, documentation, and version control to SQL models.
A data catalog is a centralized repository that allows organizations to manage their data assets, making it easier to find, access, and manage data across different sources. It provides metadata, descriptions, and data lineage, enhancing data discoverability and governance.
When paired with dbt, a data catalog becomes even more useful: dbt describes how data is transformed, and the catalog makes that knowledge easy to find, understand, and govern across the rest of the data stack.
The Synergy Between dbt and Data Catalogs
dbt has long shipped a baseline documentation and cataloging feature called dbt Docs. It lets users generate, view, and share documentation about their analytics code and datasets, and provides an interactive web interface for exploring the structure, lineage, and descriptions of the data models built with dbt.
Limitations of dbt Docs
dbt Docs has historically been optimized for small dbt projects and engineering teams. It was more of an add-on to the dbt product than a real data catalog. Its main limitations:
- Dependency on dbt: dbt Docs is tightly integrated with dbt. If you're not using dbt for your data transformations, dbt Docs isn't applicable.
- Complexity: For larger projects with many models, the visualizations can become cluttered or overwhelming.
- Static Snapshots: The generated documentation is a static snapshot. It won't reflect real-time changes unless you regenerate and redeploy it.
- Database Limitations: While dbt supports many databases, not all features or integrations may be available on every platform.
- Customizability: The look and feel of the documentation is relatively fixed, and deep customization might be challenging.
- Performance: With a very large number of models and dependencies, generating or navigating the docs can become slow.
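To make the "static snapshot" limitation concrete, here is a minimal sketch of a freshness check on the artifact behind the docs. It assumes the `metadata.generated_at` timestamp that dbt writes into manifest.json. The 24-hour threshold, the function names, and the sample values are illustrative assumptions, not part of dbt.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def docs_age(manifest: dict, now: datetime) -> timedelta:
    """Return how old the artifact is, based on metadata.generated_at."""
    raw = manifest["metadata"]["generated_at"]
    # The timestamp is ISO 8601 in UTC, usually ending in "Z"; normalize it
    # so datetime.fromisoformat accepts it on older Python versions too.
    generated_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if generated_at.tzinfo is None:
        generated_at = generated_at.replace(tzinfo=timezone.utc)
    return now - generated_at


def is_stale(manifest: dict,
             max_age: timedelta = timedelta(hours=24),  # assumed threshold
             now: Optional[datetime] = None) -> bool:
    """True when the docs were generated longer ago than max_age."""
    now = now or datetime.now(timezone.utc)
    return docs_age(manifest, now) > max_age


# Hand-written stand-in for a real target/manifest.json.
sample = {"metadata": {"generated_at": "2023-10-17T08:30:00.000000Z"}}
check_time = datetime(2023, 10, 19, 8, 30, tzinfo=timezone.utc)
print(is_stale(sample, now=check_time))
```

A check like this could run in CI or a scheduler to flag when the published docs lag behind production and need to be regenerated.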
Upgrade to dbt Explorer (NEW - released Oct 23)
dbt Explorer is a tool embedded in dbt Cloud, aimed at making data projects easier to manage and understand. Here is a breakdown of its features:
- Interface for Lineage and Documentation: dbt Explorer provides an organized, user-friendly interface for quick access to the lineage and documentation of data projects, all housed in one central location.
- Project Resource Viewing: It lets users view the resources of their dbt projects, such as models, tests, and metrics. It also provides a lineage view that reflects the latest production state of their projects.
- Project Management: Beyond viewing, dbt Explorer supports navigating and managing dbt projects within dbt Cloud. This helps data developers, analysts, and other data consumers discover and use the resources available in their dbt projects.
- Understanding and Improvement: It helps users get a better grasp of how their data is managed and where their dbt projects can be improved.
- Data Lineage Tracking: dbt Explorer lets users track and view data lineage, including across different domains in a data mesh architecture. This is essential for understanding how data is consumed and for ensuring it is used correctly across an organization.
Overall, dbt Explorer is designed to streamline the data transformation process, reduce manual errors, and make data projects within dbt Cloud easier to manage and understand.
Yet although dbt Explorer is a clear improvement over dbt Docs, it still lacks the depth that a fully fledged data catalog brings.
Benefits of Plugging a Data Catalog on Top of dbt
Improved Data Discovery: data catalogs leverage the popularity of each data asset to improve search relevance. Some advanced catalogs go further and let you navigate data the way you would browse Amazon, in the spirit of "people who bought this product also bought this one". Modern data catalogs like CastorDoc show the past SQL queries and dashboards consumed by each person and team. This streamlines data discovery, so you're not drowned in unrelated data and can dive straight into what matters.
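As a rough illustration of popularity-aware search, the sketch below ranks matching assets by combining a text-match weight with a log-scaled usage count. The `Asset` fields, the weights, and the sample data are all hypothetical. This is not how CastorDoc or any particular catalog computes relevance.

```python
import math
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    description: str
    query_count: int  # hypothetical: queries hitting the asset recently


def score(asset: Asset, term: str) -> float:
    """Text-match weight multiplied by a log-scaled popularity factor."""
    term = term.lower()
    if term in asset.name.lower():
        text_weight = 2.0  # a match in the name counts double (assumption)
    elif term in asset.description.lower():
        text_weight = 1.0
    else:
        return 0.0
    # log1p dampens the gap between popular and extremely popular assets.
    return text_weight * (1.0 + math.log1p(asset.query_count))


def search(assets: list, term: str) -> list:
    """Return matching assets, most relevant first."""
    matches = [a for a in assets if score(a, term) > 0]
    return sorted(matches, key=lambda a: score(a, term), reverse=True)


assets = [
    Asset("orders_backup_2021", "Old copy of the orders table", 3),
    Asset("customer_revenue", "Revenue aggregated from orders", 900),
    Asset("orders", "One row per order", 1200),
    Asset("web_sessions", "Sessionized clickstream events", 5000),
]
for asset in search(assets, "orders"):
    print(asset.name)
```

With these made-up numbers, a heavily used table that only mentions "orders" in its description ranks above a rarely queried backup whose name matches exactly, which is the behavior popularity signals are meant to produce.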
Cross-System Data Lineage Visualization: data catalogs are not just connected to dbt. They are connected to the entire data stack. CastorDoc, for example, can connect to data sources like Salesforce, ETL tools like Fivetran, data lakes like S3, data warehouses like Snowflake, and most BI tools, such as Looker. It can rebuild lineage at the column level across those tools.
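Column-level impact analysis boils down to walking a lineage graph downstream from the column you plan to change. The sketch below does this with a breadth-first traversal over a small, hypothetical edge map spanning a source system, a warehouse, and a BI tool. The column identifiers and the graph itself are invented for illustration, and a real catalog would build them from parsed SQL and tool metadata.

```python
from collections import deque

# Hypothetical column-level edges: upstream column -> columns derived from it.
LINEAGE = {
    "salesforce.opportunity.amount": [
        "snowflake.raw.opportunity.amount",
    ],
    "snowflake.raw.opportunity.amount": [
        "snowflake.analytics.fct_deals.amount_usd",
    ],
    "snowflake.analytics.fct_deals.amount_usd": [
        "looker.revenue_dashboard.total_revenue",
        "snowflake.analytics.agg_monthly_revenue.revenue",
    ],
    "snowflake.analytics.agg_monthly_revenue.revenue": [
        "looker.board_report.mrr",
    ],
}


def downstream(column: str, lineage: dict) -> list:
    """Return every column reachable from `column`, in breadth-first order."""
    seen = set()
    ordered = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guards against revisiting shared nodes
                seen.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered


for impacted in downstream("snowflake.raw.opportunity.amount", LINEAGE):
    print(impacted)
```

Running the traversal before renaming or retyping a column gives the list of downstream models and dashboards to check, including those that live outside dbt.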
Enhanced Metadata Management: because data catalogs focus solely on metadata, they can go further than dbt on metadata management. They integrate with data quality tools, propagate tags, sync documentation across tools, launch data tasks based on metadata workflows, and more.
Increased Collaboration: one of the core values of data catalogs lies in their ability to be used across the whole organization. Most of their UI is designed for both technical and business teams. CastorDoc, for example, has a business mode and an expert mode. It also offers a Chrome extension and a Slack integration to ease access to documentation.
Compliance and Data Governance: this is the north star. Strong data governance and compliance practices are key to navigating the complexity created by dbt transformations.
How to Integrate a Data Catalog with dbt
Integration of CastorDoc with dbt Cloud
Requirements: You must be a dbt Cloud administrator to provide the necessary details.
dbt Cloud x CastorDoc: To integrate dbt Cloud with CastorDoc:
- Provide a service token. You can refer to this dbt guide to generate it.
- Set up the required permissions:
  - If you are on a Team plan: add a Read-Only permission to all relevant projects.
  - If you are on an Enterprise plan: add a Stakeholder permission to all relevant projects.
  Note: A token can only be viewed at the time of its creation. After that, it is no longer visible, so make sure you save it securely.
- Provide the ID of your main 'run' job in production. Among the various jobs you run in production, one typically updates all tables of your project. To find its ID:
  - Visit dbt Cloud.
  - Navigate to Deploy > Jobs and select the job that fits the above description.
  - The ID appears in the URL, right after ‘jobs/’.
- Enter your service token and the job ID directly in the CastorDoc app. The initial sync may take up to 48 hours. CastorDoc will notify you once it's completed.
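If you prefer to copy the job page URL and extract the ID programmatically, a small helper like the one below is enough. It relies only on the rule stated above, that the ID follows 'jobs/' in the URL. The example URL, with its account and project numbers, is made up and may not match the exact shape of your dbt Cloud URLs.

```python
import re


def job_id_from_url(url: str) -> int:
    """Extract the numeric job ID that follows 'jobs/' in a dbt Cloud URL."""
    match = re.search(r"jobs/(\d+)", url)
    if match is None:
        raise ValueError(f"No job ID found in URL: {url}")
    return int(match.group(1))


# Illustrative URL only; the account, project, and job numbers are invented.
example_url = "https://cloud.getdbt.com/deploy/1234/projects/5678/jobs/91011"
print(job_id_from_url(example_url))
```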
Going Further: After setting up the dbt Cloud integration, you can leverage the full potential of dbt and GitLab.
dbt Core x CastorDoc
CastorDoc integrates smoothly with dbt, all the more so since dbt Docs has encouraged analytics engineers to document their data. Here's how to integrate CastorDoc with dbt Core:
- Manifest Requirement: CastorDoc requires the manifest.json generated by dbt. It should be sent to Castor whenever it's updated, or on a daily basis.
- Locating the Manifest: The manifest.json is located in the target directory. Make sure you retrieve the manifest from your production environment, the one that points to your actual production tables. Otherwise, the matching done on database.schema.table.column might not work.
- Sending the Manifest:
  - During Trial: During your trial period with Castor, you can simply send the manifest via Slack or email, and Castor will load your descriptions.
  - Scheduled Sync: After the trial, Castor will provide you with a Python script to schedule and push dbt's manifest to a Castor GCP bucket. This involves:
    - A Castor source ID (to send your manifest to).
    - A Google service account with its credentials in JSON format. This account has "write" rights to the bucket. For security reasons, Castor only grants write access: you'll be able to send files but won't see which files you've sent.
- Scheduling the Sync: You can schedule the sync with your usual Airflow workflow. If your dbt project is hosted on GitHub, there's a specific page that provides guidance. Alternatively, you can choose any method you find suitable.
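To see why the production manifest matters for matching, the sketch below builds database.schema.table.column keys from a manifest and maps them to column descriptions. It assumes the `nodes`, `resource_type`, `database`, `schema`, `alias`, `name`, and `columns` fields found in dbt's manifest.json. Lowercasing the keys, the function names, and the sample manifest are illustrative assumptions. This is not the script Castor provides.

```python
import json


def load_manifest(path: str = "target/manifest.json") -> dict:
    """Read a dbt manifest from disk (default path is dbt's target folder)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def column_descriptions(manifest: dict) -> dict:
    """Map database.schema.table.column keys to column descriptions."""
    result = {}
    for node in manifest.get("nodes", {}).values():
        # Only keep resources that materialize as tables or views.
        if node.get("resource_type") not in ("model", "seed", "snapshot"):
            continue
        table = node.get("alias") or node["name"]
        prefix = f"{node.get('database')}.{node.get('schema')}.{table}"
        for column_name, column in node.get("columns", {}).items():
            # Lowercased for case-insensitive matching (an assumption).
            key = f"{prefix}.{column_name}".lower()
            result[key] = column.get("description", "")
    return result


# Hand-written stand-in for a real production manifest.
sample = {
    "nodes": {
        "model.shop.orders": {
            "resource_type": "model",
            "database": "ANALYTICS",
            "schema": "core",
            "name": "orders",
            "alias": "orders",
            "columns": {
                "order_id": {"name": "order_id", "description": "Primary key."},
                "amount": {"name": "amount", "description": ""},
            },
        },
        "test.shop.not_null_orders_order_id": {
            "resource_type": "test",
            "database": "ANALYTICS",
            "schema": "core_tests",
            "name": "not_null_orders_order_id",
            "columns": {},
        },
    }
}
for key, description in column_descriptions(sample).items():
    print(key, "->", description or "(no description)")
```

A manifest compiled against a development target would produce keys with a different database or schema, so they would not line up with the production tables the catalog has already indexed.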
Going Further: Once your dbt integration is set up, you can explore the full potential of dbt and GitLab, including features like dbt Sync back and dbt Owners.
Wrapping Up
Plugging a data catalog on top of dbt can transform how your organization manages and uses data. It doesn't just enhance your data transformation process; it also fosters better collaboration within your data teams. It helps ensure data integrity, facilitates data governance, and aids data discovery. In short, a data catalog takes your dbt projects to a whole other level.
So, whether you're a data analyst wrestling with complex data models, or a data engineer looking to streamline your processes, integrating a data catalog with dbt might just be the step forward you need.