Data Lineage 102: Definition and key components
In ‘Data Lineage 101: What’s so special about data lineage?’ we discussed why data lineage attracts so much interest among business, IT and data management professionals. Now it is time to take a deeper dive into what data lineage actually is.
‘Everyone wants data lineage, but no one can explain what exactly they mean by that and what their expectations are’, said one of my colleagues a few years ago. This sentence stuck in my mind, as this is exactly what is going on with data lineage nowadays.
Today, when someone starts talking to me about data lineage, the first question I ask is: ‘What do you mean by “data lineage”? What is your understanding of data lineage?’ So far, I haven’t met a person whose definition and understanding of data lineage perfectly corresponded with mine, or with others that I have encountered over the years. The reason is that you can hardly find an aligned, unambiguous and widely accepted definition of data lineage. This article will define what data lineage is, as well as its key components. Let’s start with the definitions.
Defining ‘data lineage’
Let’s start by taking a look at reference industry guides and publications issued by DAMA International (DAMA-DMBOK), The Open Group (TOGAF) and the EDM (Enterprise Data Management) Council (DCAM).
There are two key documents published by DAMA International (The Global Data Management Community) that contain definitions and information about data lineage: DAMA-DMBOK2¹ and the DAMA Dictionary². The key challenge is that the definition of data lineage is ambiguous and can be confused with other terms, such as ‘data flow’, ‘integration architecture’ and ‘data & information (value) chain’.
Let’s have a look at the following definitions provided by DAMA Dictionary:
Data flow is ‘the transfer of data between systems, applications, or data sets.’³
‘Data lineage is a description of the pathway from the data source to their current location and the alterations made to the data along the pathway.’⁴
This part was pretty clear: data flow is the process of transferring data, while data lineage describes this process. But then I opened DAMA-DMBOK2 and found the following:
‘[…] data […] has lineage (i.e., a pathway along which it moves from its point of origin to its point of usage, sometimes called the data chain)’⁵.
‘Data flows are a type of data lineage documentation that depicts how data moves through business processes and systems. End-to-end data flows illustrate where the data originated, where it is stored and used, and how it is transformed as it moves inside and between diverse processes and systems’⁶.
After reading all of these definitions, I could hardly see the difference between data flow and data lineage. Can you?
Let’s leave the DAMA world for now and see what the EDM Council (Enterprise Data Management Council) has to say about it.
Data Management Capability Model (DCAM), developed by the EDM Council
The definition of ‘data lineage’ by the EDM Council reads as follows:
Data lineage is ‘documentation of the sequence of movement and/or transformation of data as it flows between the consumer and the source(s).’⁷
This brings me to conclude the following:
- ‘Data flow’, ‘data lineage’ and ‘data chain’ are terms that describe similar concepts of data movement and transformation. Therefore, these terms are often used interchangeably.
- Data lineage is a description of the path along which data flows from the point of its origin to the point of its use.
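To make the second conclusion more tangible, here is a minimal sketch in Python of lineage as a path of hops from the point of origin to the point of use, with the alteration made at each hop. The names `Hop` and `trace_lineage`, the example locations and the rule of one upstream hop per location are my own simplifying assumptions, not something prescribed by DAMA or the EDM Council.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One movement of data between two locations and the alteration applied."""
    source: str
    target: str
    alteration: str


def trace_lineage(hops: list[Hop], current_location: str) -> list[Hop]:
    """Walk backwards from the current location to the point of origin.

    Assumption for this sketch: every location has at most one upstream hop.
    """
    by_target = {hop.target: hop for hop in hops}
    path: list[Hop] = []
    seen: set[str] = set()
    location = current_location
    while location in by_target and location not in seen:
        seen.add(location)  # guards against accidental cycles
        hop = by_target[location]
        path.append(hop)
        location = hop.source
    path.reverse()  # present the path from origin to point of use
    return path


# Illustrative data only: hypothetical locations and alterations.
hops = [
    Hop("CRM", "staging area", "copied as-is"),
    Hop("staging area", "data warehouse", "deduplicated"),
    Hop("data warehouse", "sales report", "aggregated per month"),
]

for hop in trace_lineage(hops, "sales report"):
    print(f"{hop.source} -> {hop.target}: {hop.alteration}")
```

In practice data usually has several sources, so a real lineage is a graph rather than a single chain; the sketch only illustrates the idea of 'a pathway plus the alterations along it'.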
Still, the definitions say nothing about documenting data lineage. In order to understand the way to document it, we need to know which components constitute data lineage.
Data lineage components
To get some clarity on the key components of data lineage, let’s check the aforementioned industry reference guides.
DAMA-DMBOK2 states that:
‘Data flows map and document relationships between data and:
- Application within a business process
- Data stores or databases in an environment
- Network segments (useful for security mapping)
- Business roles, depicting which roles have responsibility for creating, updating, using and deleting data
- Location where local differences occur.’⁸
So, according to DAMA-DMBOK2, the key components of data flow / lineage are IT system components (applications, databases, network segments) and business processes.
The Standard Glossary of the EDM Council states that ‘Data Lineage describes the chronology of ownership, custody and location of data. Data Lineage provides a visual mapping of the movement and changes in data from system to system. The complete lineage will document the full data flow and capture metadata about the movement and transformation of the data element. Lineage may include a mapping of the data controls’⁹.
So, according to the EDM Council, data lineage links such components as systems, data controls, ownership, custody and metadata.
The Open Group Architecture Framework (TOGAF)
TOGAF 9.1 by The Open Group, the leading guide in enterprise architecture, stipulates that ‘The Data Flow view is concerned with storage, retrieval, processing, archiving, and security of data’¹⁰.
The TOGAF 9.1 definition seems to have nothing in common with the definitions of DAMA International and the EDM Council. It rather refers to the concept of the data lifecycle (which is a separate topic and will not be discussed in this article).
So, after the analysis we may conclude:
1. There is no agreed list of components that constitute data lineage.
2. These are the components of data lineage that you should take into account while documenting data lineage:
- IT systems (applications, databases, network segments)
- Data elements
- Business processes, including different functional roles (data- and non-data related)
- Data controls.
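The component list above can be sketched as a small data model. This is only one possible way to link the four component types; the class names, fields and example values below are hypothetical and chosen for illustration, not taken from any of the reference guides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class System:
    name: str
    kind: str  # e.g. "application", "database" or "network segment"


@dataclass(frozen=True)
class DataElement:
    name: str
    system: System  # the IT system in which the element is stored


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    responsible_role: str  # functional role accountable for the step


@dataclass(frozen=True)
class DataControl:
    name: str


@dataclass
class LineageStep:
    """Links a source and a target data element to the other components."""
    source: DataElement
    target: DataElement
    process: BusinessProcess
    controls: list[DataControl] = field(default_factory=list)


def systems_involved(steps: list[LineageStep]) -> list[str]:
    """Return the names of all IT systems touched by the given steps."""
    names = {s.source.system.name for s in steps}
    names |= {s.target.system.name for s in steps}
    return sorted(names)


# Illustrative data only.
crm = System("CRM", "application")
dwh = System("Data warehouse", "database")
step = LineageStep(
    source=DataElement("customer_email", crm),
    target=DataElement("email_address", dwh),
    process=BusinessProcess("Customer onboarding", "Data steward"),
    controls=[DataControl("Email format check")],
)
print(systems_involved([step]))
```

Keeping each component type as its own class makes it easy to document only the components you actually need, for example systems and data elements first and controls later.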
Figure 1 is a visualization of these conclusions:
Figure 1. Key components of data lineage
Once again, the key points that you should keep in mind when it comes to discussing data lineage:
- Data lineage is a representation of the path along which data flows from the point of its origin to the point of its usage.
- Data lineage is used to design and describe processes of data transformation and processing.
- Data lineage is recorded by representing a set of interlinked components such as data (elements), business processes, IT systems and applications, and data controls. These components can be presented at different levels of abstraction and detail.
This kind of lineage is also called ‘horizontal’ data lineage.
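The point about different levels of abstraction can be illustrated with a short Python sketch that rolls detailed, element-level lineage up to a system-level view. The `system.table.column` naming convention, the function name `roll_up` and the sample edges are assumptions I made for this example.

```python
def roll_up(element_edges: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Derive system-level lineage from element-level lineage.

    Assumes each element is named 'system.table.column', so the part
    before the first dot identifies the system.
    """
    system_edges = set()
    for source, target in element_edges:
        source_system = source.split(".", 1)[0]
        target_system = target.split(".", 1)[0]
        if source_system != target_system:  # ignore movement inside one system
            system_edges.add((source_system, target_system))
    return sorted(system_edges)


# Illustrative element-level lineage (hypothetical names).
element_edges = [
    ("crm.customer.email", "dwh.dim_customer.email"),
    ("crm.customer.name", "dwh.dim_customer.name"),
    ("dwh.dim_customer.email", "dwh.report.email_count"),
]

print(roll_up(element_edges))
```

The same lineage is shown twice here: three detailed links between data elements, and a single high-level link between two systems.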
In ‘Data Lineage 101: What’s so special about data lineage?’, we discussed that legislative requirements, business changes, data quality initiatives and audit requirements are motivating companies to start implementing data lineage.
In part 3 of this series, ‘Data Lineage 103: Legislative Requirements’, I discuss which components of data lineage are required by different pieces of legislation.
1 DAMA International. DAMA-DMBOK: Data Management Body of Knowledge, Second Edition. Bradley Beach, N.J.: Technics Publications, 2017.
2 DAMA International. The DAMA Dictionary of Data Management, Second Edition. Technics Publications, 2011.
3 DAMA International. The DAMA Dictionary of Data Management, Second Edition. Technics Publications, 2011, p. 75.
4 DAMA International. The DAMA Dictionary of Data Management, Second Edition. Technics Publications, 2011, p. 78.
5 DAMA International. DAMA-DMBOK: Data Management Body of Knowledge, Second Edition. Bradley Beach, N.J.: Technics Publications, 2017, p. 28.
6 DAMA International. DAMA-DMBOK: Data Management Body of Knowledge, Second Edition. Bradley Beach, N.J.: Technics Publications, 2017, p. 107.
7 Enterprise Data Management Council. The Standard Glossary of Data Management Concepts, version 0.2.1, 2017, p. 9.
8 DAMA International. DAMA-DMBOK: Data Management Body of Knowledge, Second Edition. Bradley Beach, N.J.: Technics Publications, 2017, p. 108.
9 Enterprise Data Management Council. The Standard Glossary of Data Management Concepts, version 0.2.1, 2017, p. 9.
10 The Open Group. “TOGAF Version 9.1, The Open Group Standard”, no. G116, 2011, p. 426.