Lineage Trace Header Options
Lineage Flow Type
The Type in the upper left of the lineage display lets you select one of:
- DATA FLOW - Based upon connection definitions to data stores and the physical transformation rules that transform and move the data.
- SEMANTIC FLOW - Based upon the definition and usage type relationships from a term, concept or logical model to a physical representation.
- OVERVIEW - Based upon a view of the design level lineage limited to the scope of the model you invoked it on (by clicking the Lineage tab). It is therefore not a complete end-to-end lineage picture, only an overview of the lineage for that model.
Both data flow and semantic flow may be present in a diagram.
Lineage Direction
Lineage is generally represented as a “flow”: either a flow of data through a data movement (and possibly transformation) process, or a flow of “meaning” from a defining object, such as a glossary term, to a defined object, such as a column. Each flow has a direction, and these directions correspond to common kinds of lineage analysis:
- Data Flow lineage
  - Forward (also called destination, target or impact) lineage of the data movement and transformation processes. Represented to the right of the point of origin.
  - Reverse (also called source) lineage of the data movement and transformation processes. Represented to the left of the point of origin.
- Semantic Flow lineage
  - Forward (also called target, usage or defined) lineage of the application of meaning, documentation or inheritance. Represented below (and often to the right of) the point of origin.
  - Reverse (also called source, origin or definition) lineage of the application of meaning, documentation or inheritance. Represented above (and often to the left of) the point of origin.
Direction only makes sense for a lineage trace, not a lineage overview.
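The difference between forward and reverse data flow lineage can be sketched as a graph traversal. The following is a minimal illustration only, not how Talend Data Catalog computes lineage: the `LINKS` list, the object names and the `trace` function are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical data flow links (source -> target); names are illustrative only.
LINKS = [
    ("Staging.Customer", "DW.Customer"),
    ("Staging.Payment", "DW.Customer"),
    ("DW.Customer", "Report.Sales"),
    ("DW.Customer", "Report.Churn"),
]


def trace(origin, links, direction):
    """Return every object reachable from origin in the given direction."""
    if direction not in ("forward", "reverse"):
        raise ValueError("direction must be 'forward' or 'reverse'")

    # Forward follows source -> target; reverse follows target -> source.
    neighbors = defaultdict(list)
    for source, target in links:
        if direction == "forward":
            neighbors[source].append(target)
        else:
            neighbors[target].append(source)

    seen = set()
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt != origin and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


impact = trace("DW.Customer", LINKS, "forward")   # objects drawn to the right
sources = trace("DW.Customer", LINKS, "reverse")  # objects drawn to the left
```

In this sketch the origin plays the role of the pinned object in the diagram: forward results correspond to impact analysis and reverse results to source analysis.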
Control Flow
Control flow is lineage that traces from an object used in a selection (a WHERE clause or similar structure) that affects which data is moved but is not itself moved to the target. There are two types of control flow:
- Column control flow, where the control flow directly impacts the values of a column (e.g., a lookup).
- Row control flow, where the control flow does not directly impact the values of columns (e.g., a filter).
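The two types can be illustrated with a small data movement statement. The sketch below uses Python's built-in `sqlite3` module with hypothetical tables (`src_customer`, `region_lookup`, `tgt_customer`); how Talend Data Catalog classifies any particular statement is an assumption here, not something taken from the product.

```python
import sqlite3

# In-memory database with hypothetical source, lookup and target tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (id INTEGER, name TEXT, region_code TEXT, active INTEGER);
    CREATE TABLE region_lookup (code TEXT, region_name TEXT);
    CREATE TABLE tgt_customer (id INTEGER, name TEXT, region_name TEXT);
    INSERT INTO src_customer VALUES (1, 'Ann', 'E', 1), (2, 'Bob', 'W', 0), (3, 'Cy', 'W', 1);
    INSERT INTO region_lookup VALUES ('E', 'East'), ('W', 'West');
""")

# id, name and region_name are moved to the target (data flow).
# region_code and code only select WHICH region_name is written: a lookup,
# so they would be column control flow.
# active only decides WHICH rows are written: a filter, so it would be
# row control flow.
conn.execute("""
    INSERT INTO tgt_customer (id, name, region_name)
    SELECT s.id, s.name, r.region_name
    FROM src_customer AS s
    JOIN region_lookup AS r ON r.code = s.region_code
    WHERE s.active = 1
""")

rows = conn.execute(
    "SELECT id, name, region_name FROM tgt_customer ORDER BY id"
).fetchall()
```

Neither `region_code` nor `active` appears in the target table, yet changing either one changes what the target contains, which is why such columns show up in an impact trace when control flow is enabled.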
Consider a common scenario: you trace data impact from a dimension that is widely used in joins and WHERE clauses, e.g., the time dimension in a warehouse or mart. Just about every report uses that dimension in some way, so the impact lineage includes nearly everything. In this case the diagram quickly grows beyond what your browser can present, let alone what you can navigate and analyze.
For this and other similar reasons, the same menu as above includes options to limit the lineage.
Control Lineage Option | Description | Delay in Presentation |
None | No control flow data impacts are traced | None |
Limited | Show only immediate (adjacent) control flow objects | Maybe slow |
Complete | All control flow impacts are traced | Likely slow |
Steps
- Begin a lineage trace.
- In Control Flow, you may:
  - Click None to hide any objects that are connected only via control flow and to hide all control flow links.
  - Click Limited to show any objects that are directly connected to the origin object via control flow, along with those control flow links.
  - Click Complete to show any objects that are connected via control flow to the origin object or to any subsequent objects, along with those control flow links.
- If Limited control flow display is enabled, go to the lineage Diagram and click a target element. The control flow that the target depends upon then appears.
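The effect of the three options can be sketched as one traversal with different rules for control links. This is a simplified model under stated assumptions, not the product's algorithm: the `LINKS` data, the object names and the `trace_with_control` function are hypothetical, and the sketch assumes that an object reached through a control link is then expanded through its data flow links.

```python
# Hypothetical links: (source, target, kind), where kind is "data" or "control".
LINKS = [
    ("Customer.CustomerID", "Mart.Customer.ID", "data"),
    ("Customer.CustomerID", "Mart.Orders", "control"),    # e.g. used as a join key
    ("Mart.Customer.ID", "Report.Customers", "data"),
    ("Mart.Customer.ID", "Report.Revenue", "control"),    # e.g. used in a filter
]


def trace_with_control(origin, links, mode):
    """Forward trace from origin; mode is 'none', 'limited' or 'complete'."""
    if mode not in ("none", "limited", "complete"):
        raise ValueError("mode must be 'none', 'limited' or 'complete'")

    shown = {origin}
    frontier = [origin]
    while frontier:
        node = frontier.pop()
        for source, target, kind in links:
            if source != node or target in shown:
                continue
            if kind == "data":
                follow = True                 # data flow is always traced
            elif mode == "complete":
                follow = True                 # control flow from any object
            elif mode == "limited":
                follow = node == origin       # control flow from the origin only
            else:
                follow = False                # mode "none": ignore control flow
            if follow:
                shown.add(target)
                frontier.append(target)
    return shown - {origin}


for mode in ("none", "limited", "complete"):
    print(mode, sorted(trace_with_control("Customer.CustomerID", LINKS, mode)))
```

Each step from None to Limited to Complete can only add objects, which matches the increasing presentation delay listed in the table above.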
Example
Search for the Dimensional DW.dbo.Customer table and open it.
Go to the Data Flow tab and ensure that the Type is DATA FLOW and the View is DIAGRAM.
A red “pin” in the diagram marks the point of origin from which lineage is presented, in this case the Customer table.
Finally, ensure that the Control Flow is NONE:
Click Columns HIDE and select the top checkbox to show all the columns in the Customer table.
Then, expand the Details panel at the far right and select the Dimensional DW.dbo.Customer.CustomerID column.
At this point, the diagram does not contain any control lineage artifacts, as specified.
Now, specify Control Flow as Limited:
Many new objects that are not directly connected by data flow links now appear. Selecting Data Flow Settings > Control Flow > Limited shows any objects that are directly connected to the origin object via control flow.
You must click an object to see its control lineage.
Now, expand Customer and again click the Dimensional DW.dbo.Customer.CustomerID column.
The control lineage now appears as distinct (dashed) lines.
Now, update the Data Flow settings with Control Flow as Complete:
Even more objects are now shown in the lineage diagram but are unconnected. Again, one must click on an object to see the control lineage.
Then select Show Mixed Connections from the Display Options menu.
Expand Staging DW.dbo.CustomerPayment and select PaymentID.
Many new objects that are not directly connected by data flow links now appear. Selecting Data Flow Settings > Control Flow > Complete shows any objects that are connected via control flow to the origin object or to any subsequent objects.
Lineage Filter
The ability to present a manageable amount of information targeted for analysis is a critical concern with lineage diagrams. In particular, for larger diagrams (such as those originating in a central fact in a warehouse, or a commonly used time based dimension), filtering is crucial if you do not want spaghetti diagrams, memory faults (remember, the actual lineage diagram must be presented by your local browser and its memory limitations), or simply huge wait times for the diagram to appear.
Several filters are available, each with its own choices.
Next to each filter option is a number showing how many objects would be filtered out of the diagram if that filter were enabled.
- SHOW INTERNAL OBJECTS to show any intermediate schemas/tables/columns between connections in the lineage, such as transformations in an ETL pipeline
- SHOW EXTERNAL OBJECTS to show any external source tables or files from which an object in the lineage is derived, such as data lake files from which HIVE derives tables
- SHOW TEMPORARY OBJECTS to show intermediate temporary tables/columns in the lineage such as temporary data store objects which are created and then deleted as part of the data movement process
- SHOW EXTERNAL TABLE LOCATION OBJECTS to include objects which are only external table locations that require connection resolution.
- EXCLUDE MODEL TYPES to not show specific types of models in the lineage
- EXCLUDE MODELS to not show specifically selected models.
Please see the discussion on handling large diagrams.
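The per-filter counts and the effect of enabling a filter can be sketched as follows. This is an illustrative model only: the `kind` and `model` fields, the sample objects and both functions are hypothetical, and the sketch assumes each object belongs to exactly one category.

```python
from collections import Counter

# Hypothetical lineage objects; "kind" and "model" are illustrative fields.
OBJECTS = [
    {"name": "Staging DW.dbo.CustomerPayment", "kind": "regular", "model": "Staging DW"},
    {"name": "etl.join_step", "kind": "internal", "model": "ETL"},
    {"name": "etl.sort_step", "kind": "internal", "model": "ETL"},
    {"name": "#tmp_customer", "kind": "temporary", "model": "ETL"},
    {"name": "lake/customers.csv", "kind": "external", "model": "Data Lake"},
    {"name": "Dimensional DW.dbo.Customer", "kind": "regular", "model": "Dimensional DW"},
]


def hidden_counts(objects):
    """Count how many objects each SHOW ... filter controls."""
    return Counter(o["kind"] for o in objects if o["kind"] != "regular")


def visible(objects, show=(), exclude_models=()):
    """Names of objects left in the diagram for the given filter settings."""
    return [
        o["name"]
        for o in objects
        if (o["kind"] == "regular" or o["kind"] in show)
        and o["model"] not in exclude_models
    ]


counts = hidden_counts(OBJECTS)
default_view = visible(OBJECTS)
with_internal = visible(OBJECTS, show=("internal",))
without_staging = visible(OBJECTS, exclude_models=("Staging DW",))
```

Checking the counts before enabling a filter gives a rough idea of how much larger the diagram will become.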
Saving Lineage Results
You may save a lineage graph to be shared and referred to later. This reduces the time required to read from the database and regenerate a lineage graph for larger diagrams.