Tracking Data Lineage Across the Organization
Data lineage comes up often in discussions of data management, data governance, data quality, BI, reporting, and analytics. Let’s take a closer look at what it is and how it can be created and maintained.
What is data lineage?
Dataversity provides the following definition: “Data Lineage describes data origins, movements, characteristics, and quality. […] Meaningful Data Lineage needs to contain multiple dimensions: who, what, where, why, and how” or the 5 Ws. We want to know what the dataset is, who created or updated it and why, where it happened, and how to access it. We want to see the complete provenance path of the data from creation all the way to its consumption. If there are any interim repositories, we need to know what happened to the data there (again, the 5 Ws).
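To make these dimensions concrete, here is a minimal Python sketch of a lineage record and a walk along the provenance path. The class, field names, and sample datasets are all hypothetical and chosen for illustration only, and the sketch assumes each dataset has at most one direct upstream source, which real lineage often does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageRecord:
    """One dataset described along the who/what/where/why/how dimensions."""
    what: str                           # name of the dataset
    who: str                            # person or team that created or updated it
    why: str                            # reason for the creation or update
    where: str                          # system or repository where it happened
    how: str                            # how to access the dataset
    derived_from: Optional[str] = None  # upstream dataset; None at the point of creation


def provenance_path(records: dict, dataset: str) -> list:
    """Walk upstream from `dataset` to its origin; return the path, origin first."""
    path, seen = [], set()
    current = dataset
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current!r}")
        seen.add(current)
        path.append(current)
        current = records[current].derived_from
    return list(reversed(path))


# Hypothetical sample data: a CRM export, an interim repository, and a report.
samples = [
    LineageRecord("crm_export", "CRM admin", "nightly extract", "CRM", "SFTP drop"),
    LineageRecord("sales_staging", "Data engineering", "cleansing", "warehouse",
                  "SQL view", derived_from="crm_export"),
    LineageRecord("sales_report", "BI team", "monthly reporting", "BI server",
                  "dashboard", derived_from="sales_staging"),
]
records = {r.what: r for r in samples}

print(provenance_path(records, "sales_report"))
# ['crm_export', 'sales_staging', 'sales_report']
```

Each interim repository gets its own record, so the same questions can be answered at every stop along the path.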
Why do we need data lineage?
- Data Governance: Data lineage provides “understanding and validation of data usage and risks that need to be mitigated.”
- Compliance: Different stakeholders “need to trust reported data”. They need to know “how did the information get there?”
- Data Quality: As data is ingested, integrated, and maintained in the various streams of the organization’s system and application architecture, there are multiple points where the quality of the data can degrade. “A Data Lineage solution provides the ability to know when data has been transformed, what it means, and how the Data Quality varies from one place to another.”
- Business Impact Analysis: “Businesses may need to know what systems and processes could break” if changes are made to data repositories, structures or transformation processes.
Source: Data Lineage Demystified
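The business impact analysis use case can be sketched as a graph traversal: given lineage edges that say which dataset feeds which, list everything downstream of a proposed change. The dataset names and the `downstream_impact` helper below are assumptions made for illustration, not part of any particular tool.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed in its value.
FEEDS = {
    "crm_export": ["sales_staging"],
    "erp_orders": ["sales_staging", "finance_mart"],
    "sales_staging": ["sales_report", "forecast_model"],
    "finance_mart": ["quarterly_filing"],
}


def downstream_impact(feeds, changed):
    """Return every asset reachable downstream of `changed`, breadth-first."""
    impacted = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in feeds.get(node, []):
            if child not in seen:  # visit each asset once, even with shared parents
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


print(downstream_impact(FEEDS, "erp_orders"))
# ['sales_staging', 'finance_mart', 'sales_report', 'forecast_model', 'quarterly_filing']
```

A change to `erp_orders` in this toy graph touches five downstream assets, which is the kind of list a change owner would want before altering a repository, structure, or transformation.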
How can we implement data lineage?
People:
- Data owners/stewards: provide timely updates on the status of their datasets, the source data used, and the transformations performed on them.
- Owners of automated solutions: provide transparent descriptions of the data transformations happening within their solutions.
- Data and business architects: ensure data flows reflect the flow of business processes.
Process: Data lineage can be observed and monitored in one place (via specialized technology), but it is managed in many places: wherever data is created, changed, or moved. It therefore requires a solid, well-controlled process for timely updates to each part of the overall lineage picture, including updates triggered by sequential dependencies across different data management processes.
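One way to handle the sequential dependencies is to review lineage entries in dependency order, so that an upstream entry is always updated before the entries that rely on it. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the dataset names and the `review_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical map: dataset -> the datasets it is built from (its upstream sources).
BUILT_FROM = {
    "sales_staging": {"crm_export", "erp_orders"},
    "sales_report": {"sales_staging"},
    "forecast_model": {"sales_staging"},
}


def review_order(built_from):
    """Order datasets so every upstream entry is reviewed before its dependents."""
    return list(TopologicalSorter(built_from).static_order())


order = review_order(BUILT_FROM)
print(order)  # sources first, then staging, then the report and the model
```

The relative order of independent datasets (for example, the two source extracts) is not significant; only the upstream-before-downstream constraint is. `TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a loop, which is itself a useful signal that the documented lineage is inconsistent.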
Technology: The available technology options differ considerably and can be roughly classified as follows (Disclaimer: software products mentioned below are examples only and cannot be considered as being endorsed or recommended by Info-Tech without due analysis of the member’s requirements and strategy):
Technology Type | Examples | Pros | Cons
Low-tech tools | Info-Tech’s Data Lineage Tool, Dataset Certificate | $0 software cost. Can start today! | Difficult to aggregate and see/analyze the complete picture. 100% manual input.
Specialized tools | TopBraid Enterprise Data Governance, Collibra, IBM InfoSphere Information Governance Catalog, Waterline Data, Alation | Complete data governance suite. Lots of automation in data collection and lineage analysis. | $$ software cost. Disjointed from data management/ETL.*
Architecture tools | | $ software cost. Cost and usage can be shared across architecture & data governance. | Disjointed from data management/ETL.* Optimized for architects.
ETL or data management tools | | 100% accurate lineage, as long as the data is flowing through this platform. Cost and usage can be shared across IT & data governance. | $$$ software cost. Not very business user friendly.
* Some tools can import metadata from data management/ETL but cannot provide input to the data management/ETL processes.
Bottom Line
Determine the scope and depth of data lineage required for your organization before looking at the enabling technology.
Remember that data lineage can also be part of a data catalog or the metadata provided by an ETL or data management tool.
A graphical representation of the data lineage, with the ability to drill down into details, is quite feasible and should be the preferred way to document data lineage.
Want to Know More?
Collibra Announces Its Acquisition of Data Lineage Provider SQLdep
Restore Trust in Your Data Using a Business-Aligned Data Quality Management Approach
Build a Business-Aligned Data Architecture Optimization Strategy