Setting up data lineage with Cloudera Navigator
Talend Spark Jobs support Cloudera Navigator.
If you are using Cloudera V5.5+ to run your Jobs, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how it was generated by a Spark Job, including the components used in the Job and the schema changes between those components.
For example, assume that you have designed the following Job and you want to generate lineage information about it:
Procedure
Results
The connection to Cloudera Navigator is now set up. When you run this Job, the lineage is automatically generated in Cloudera Navigator.
Note that you still need to configure the other parameters in the Spark configuration tab in order to successfully run the Job.
Once the Job has finished running, search Cloudera Navigator for the data written by this Job to view the lineage of this data.
If you compare this lineage graph with the Job in the Studio, you can see that every component is represented in the graph, and you can expand the icon of each component to read the schema it uses.
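The search step described above is normally done in the Cloudera Navigator web interface, but it can also be scripted. The following is a minimal sketch that only builds a search URL for the Navigator REST API; it does not send any request. The host name, the port 7187, the API version `v9`, the `originalName` search field, and the dataset name `customers_out` are all assumptions for illustration and are not taken from this documentation, so check the API documentation of your own Navigator installation before relying on them.

```python
from urllib.parse import urlencode


def build_entity_search_url(base_url, query, api_version="v9", limit=10):
    """Build a Navigator entity search URL (hypothetical helper).

    base_url    -- root URL of the Navigator server (assumed value)
    query       -- search expression, for example a name filter (assumed syntax)
    api_version -- API version segment; "v9" is an assumption
    limit       -- maximum number of entities to return
    """
    # urlencode percent-encodes reserved characters such as ":" in the query.
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url.rstrip('/')}/api/{api_version}/entities?{params}"


# Hypothetical server and dataset name, used only for illustration.
url = build_entity_search_url(
    "http://navigator.example.com:7187",
    "originalName:customers_out",
)
print(url)
```

The resulting URL could then be requested with the credentials of a Navigator user, and the identifiers of the returned entities used to look up their lineage.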
Cloudera Navigator relies on a Cloudera SDK library to provide its functionality, so the version of Navigator must be compatible with the version of this SDK library. The version of your Cloudera Navigator is determined by the Cloudera Manager installed with your Cloudera distribution, and the compatible SDK is selected automatically based on the version of your Navigator.
However, not every Cloudera Navigator version has a compatible SDK version. For more details about the Cloudera SDK versions and their compatible Navigator versions, see the Cloudera documentation about Cloudera Navigator SDK Version Compatibility.
For information about Cloudera Navigator versions supported by the Studio, see the supported Cloudera Navigator versions.