Data Flow Lineage Trace in General
A lineage trace always has a point of origin and a Type (direction). On the Data Flow lineage tab,
a number of common features and tools are available when visualizing a lineage trace. Reporting on lineage brings you to the Lineage Trace page.
First, though, you must choose whether to see the Diagram or the End Objects (list view) by clicking the tabs on the left.
Data Flow Options
Many options are available in the menus of a data flow lineage report.
Control flow is lineage traced from an object that is used in a WHERE clause or similar selection structure: it affects which data is moved but is not itself moved to the target. There are two types of control flow:
- Column control flow, where the control flow directly impacts the values of a column (e.g., a lookup)
- Row control flow, where the control flow does not directly impact the values of columns (e.g., filters).
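To make the distinction concrete, the sketch below hand-classifies the source columns of one hypothetical SQL statement into data flow, column control flow, and row control flow. The statement, the table and column names, and the classification itself are assumptions made for illustration; they are not output from Talend Data Catalog, and the tool may classify a given statement differently.

```python
from collections import defaultdict

# Hypothetical statement, used only to illustrate the three kinds of link.
SQL = """
INSERT INTO mart.PaidOrder (Amount, CustomerName)
SELECT o.Amount,
       (SELECT c.Name FROM stg.Customer c WHERE c.Id = o.CustomerId)
FROM stg.Orders o
WHERE o.Status = 'PAID'
"""

# Hand-written classification (an assumption, not tool output).
# Each entry is (source, target, kind).
LINKS = [
    # Values that are actually moved to the target.
    ("stg.Orders.Amount", "mart.PaidOrder.Amount", "data"),
    ("stg.Customer.Name", "mart.PaidOrder.CustomerName", "data"),
    # Lookup keys: they decide which name is picked, but are not moved.
    ("stg.Customer.Id", "mart.PaidOrder.CustomerName", "column control"),
    ("stg.Orders.CustomerId", "mart.PaidOrder.CustomerName", "column control"),
    # Filter: it decides which rows are moved, not any column value.
    ("stg.Orders.Status", "mart.PaidOrder", "row control"),
]


def group_by_kind(links):
    """Group (source, target) pairs by the kind of lineage link."""
    grouped = defaultdict(list)
    for source, target, kind in links:
        grouped[kind].append((source, target))
    return dict(grouped)


if __name__ == "__main__":
    for kind, pairs in group_by_kind(LINKS).items():
        print(kind)
        for source, target in pairs:
            print(f"  {source} -> {target}")
```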
Consider a common scenario: you trace data impact from a dimension that is widely used in joins and WHERE clauses, e.g., the time dimension in a warehouse or mart. Just about every report uses that dimension in some way, so the impact lineage includes nearly everything. The diagram then quickly grows beyond what your browser can present, let alone what you can navigate and analyze.
For this and other similar reasons, the same menu as above includes options to limit the lineage.
| Control Lineage Option | Description | Delay in Presentation |
| None | No control flow data impacts are traced | None |
| Limited | Limited control flow data impacts are traced | May be slow |
| Complete | All control flow data impacts are traced | Likely slow |
Steps
- Begin a lineage trace.
- In Data Flow Settings, you may:
  - Click Data Flow Settings > Control Flow > None to hide any objects that are connected only via control flow and to hide all control flow links.
  - Click Data Flow Settings > Control Flow > Limited to show any objects that are directly connected to the origin object via control flow, along with those control flow links.
  - Click Data Flow Settings > Control Flow > Complete to show any objects that are connected via control flow to the origin object or to any subsequent objects, along with those control flow links.
- Once control flow display is enabled, go to the lineage Diagram and click a target element; the control flow that the target depends upon then appears.
- Trace data flow lineage.
- Click Data Impact in the Type pull-down in the upper right.
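The three Control Flow settings in the steps above can be read as three traversal rules over a graph of lineage links. The following is a minimal sketch of that reading, not the product's implementation: the `upstream` function, the link tuples, and all object names are hypothetical, and it assumes that Limited shows control-flow neighbours of the origin without expanding them further.

```python
from collections import deque

# Hypothetical links: (source, target, kind). Names are illustrative only.
LINKS = [
    ("src.Customer.Id", "stg.Customer.Id", "data"),
    ("stg.Customer.Id", "dw.Customer.CustomerID", "data"),
    ("stg.Region.Code", "dw.Customer.CustomerID", "control"),
    ("src.Batch.Flag", "stg.Customer.Id", "control"),
]


def upstream(origin, links, control_flow="none"):
    """Return the set of objects shown when tracing upstream from origin.

    control_flow:
      "none"     - follow data flow links only
      "limited"  - also show objects linked to the origin by control flow
      "complete" - follow control flow links from every object reached
    """
    if control_flow not in ("none", "limited", "complete"):
        raise ValueError(f"unknown control_flow setting: {control_flow!r}")

    shown = {origin}
    visited = {origin}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for source, target, kind in links:
            if target != node:
                continue
            if kind == "data" or control_flow == "complete":
                shown.add(source)
                if source not in visited:
                    visited.add(source)
                    queue.append(source)
            elif control_flow == "limited" and node == origin:
                # Assumption: shown, but not expanded any further.
                shown.add(source)
    return shown


if __name__ == "__main__":
    for setting in ("none", "limited", "complete"):
        result = sorted(upstream("dw.Customer.CustomerID", LINKS, setting))
        print(setting, result)
```

Under this reading, each setting shows a superset of the previous one, which is consistent with the increasing presentation delay listed for None, Limited, and Complete.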
Example
Search for the Dimensional DW.dbo.Customer table and open it.
Go to the Data Flow tab and ensure that the Type is Data Lineage and the View is Diagram.
A red “pin” in the diagram marks the point of origin from which lineage is presented, in this case the Customer table.
Finally, ensure that the Display Options are all unchecked, the Data Flow Settings are None, and the Lineage Filters are all No:
End Objects gives:
Return to the Diagram view and select Dimensional DW.dbo.Customer.CustomerID and expand the Details at the right.
At this time, the diagram does not contain any control lineage artifacts, as we specified.
Now, update the Data Flow Settings with Control Flow as Limited:
Many new objects that are not directly connected by data flow links now appear. Selecting Data Flow Settings > Control Flow > Limited shows any objects that are directly connected to the origin object via control flow.
One must click on an object to see the control lineage.
Now, expand Customer and again click the Dimensional DW.dbo.Customer.CustomerID column.
The control lineage now appears as distinct (dashed) lines.
Now, update the Data Flow Settings with Control Flow as Complete:
Even more objects are now shown in the lineage diagram but are unconnected. Again, one must click on an object to see the control lineage.
Then select Show Mixed Connections from the Display Options menu.
Expand Staging DW.dbo.CustomerPayment and select PaymentID.
Many new objects that are not directly connected by data flow links now appear. Selecting Data Flow Settings > Control Flow > Complete shows any objects that are connected via control flow to the origin object or to any subsequent objects.
One may include or filter out various object types in order to focus only on specific types of objects in the lineage.
Click Edit Filters and specify:
- SHOW TEMPORARY OBJECTS to show intermediate temporary tables/columns in the lineage
- SHOW INTERNAL OBJECTS to show any intermediate schemas/tables/columns between connections in the lineage
- SHOW EXTERNAL OBJECTS to show any external source tables or files from which an object in the lineage is derived
- SHOW EXTERNAL TABLE LOCATION OBJECTS to include objects which are only external table locations that require connection resolution.
- EXCLUDE MODEL TYPES to not show specific types of models in the lineage
- EXCLUDE MODELS to not show specifically selected models.
- DEPTH to allow a specific number for the depth into objects in the lineage trace.
In some cases you may see that a lineage diagram is taking an excessive amount of time to display or that you are presented with the message:
This large diagram has xxxxx objects and xxxxx links which may require more resources than what your browser can handle.
You may use the PROCEED ANYWAY button to try to visualize the diagram.
You may also save these settings as the defaults for future lineage traces.
Steps
- Begin a lineage trace.
- Click Edit Filters and specify:
- SHOW TEMPORARY OBJECTS to show intermediate temporary tables/columns in the lineage
- SHOW INTERNAL OBJECTS to show any intermediate schemas/tables/columns between connections in the lineage
- SHOW EXTERNAL OBJECTS to show any external source tables or files from which an object in the lineage is derived
- SHOW EXTERNAL TABLE LOCATION OBJECTS to include objects which are only external table locations that require connection resolution.
- EXCLUDE MODEL TYPES to not show specific types of models in the lineage
- EXCLUDE MODELS to not show specifically selected models.
- DEPTH to allow a specific number for the depth into objects in the lineage trace.
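The filter options listed above can be thought of as a predicate applied to each object in the trace. The sketch below models that idea under stated assumptions: `LineageObject`, `LineageFilters`, and every object, model, and model type name are hypothetical and do not correspond to a Talend Data Catalog API. DEPTH is left out because its exact semantics are not spelled out here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LineageObject:
    """Hypothetical description of one object in a lineage trace."""
    name: str
    model: str
    model_type: str
    temporary: bool = False
    internal: bool = False
    external: bool = False


@dataclass
class LineageFilters:
    """Hypothetical mirror of the Edit Filters options (DEPTH omitted)."""
    show_temporary: bool = False
    show_internal: bool = False
    show_external: bool = False
    exclude_model_types: set = field(default_factory=set)
    exclude_models: set = field(default_factory=set)

    def keeps(self, obj):
        """Return True if obj should remain visible under these filters."""
        if obj.temporary and not self.show_temporary:
            return False
        if obj.internal and not self.show_internal:
            return False
        if obj.external and not self.show_external:
            return False
        if obj.model_type in self.exclude_model_types:
            return False
        if obj.model in self.exclude_models:
            return False
        return True


def apply_filters(objects, filters):
    """Return the names of the objects that stay visible."""
    return [obj.name for obj in objects if filters.keeps(obj)]


# Illustrative objects only; none of these names come from a real catalog.
OBJECTS = [
    LineageObject("dbo.Customer", "Dimensional DW", "Database"),
    LineageObject("#tmp_customer", "ETL Job", "ETL", temporary=True),
    LineageObject("customer.csv", "Hive", "Database", external=True),
    LineageObject("Customer Report", "Reports", "BI"),
]

if __name__ == "__main__":
    print(apply_filters(OBJECTS, LineageFilters()))
    print(apply_filters(
        OBJECTS,
        LineageFilters(show_temporary=True, exclude_model_types={"BI"}),
    ))
```

Keeping every option as an independent check means each one narrows or widens the result on its own, which matches how the options are presented as separate toggles.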
Show Internal/External Objects
Lineage reporting may:
- either Show Internal Objects within a model (e.g., interim steps in transformations) or show only the objects stitched to other model objects.
- either Show External Objects that are not directly material to the lineage trace (such as the link from files in HDFS to the tables representing them in Hive) or hide these objects.
Show Temporary Objects
Big data solutions and other ETL/DI processes routinely use temporary files and tables. When harvesting, Talend Data Catalog detects temporary files and marks them as TEMPORARY in their lineage characteristics. You can therefore distinguish temporary objects from permanent/stitchable ones in a lineage diagram and optionally hide or show them.
Show External Table Location Objects
Models may refer to external tables that require connection resolution. By default, these table location objects are not shown. You may use this option to explicitly show them.
Default View
This option allows you to save the current filter settings as the default for future trace reports.