Defining the Azure Databricks connection parameters for Spark Jobs
Complete the Databricks connection configuration in the Spark Configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.
The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.
Before you begin
- When running a Spark Streaming Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box. If you clear this check box, the Job fails during execution with the following error: run failed with error message Driver of the cluster (01234-56789-cluster) was restarted during the run.
- When running a Spark Batch Job, you can likewise send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box. Otherwise, because each run automatically restarts the cluster, Jobs launched in parallel interrupt one another and cause execution failure. A sketch of a submission against an existing cluster follows this list.
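For context, the sketch below shows what a submission against an existing (not restarted) cluster looks like through the Databricks Jobs REST API (runs/submit, API version 2.0), which is the mode that allows several runs to share one cluster. The workspace URL, cluster ID, and main class are hypothetical placeholders; in practice Talend performs the equivalent submission for you based on the check box described above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitRunSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical values: workspace URL and personal access token.
        String workspace = "https://adb-1234567890123456.7.azuredatabricks.net";
        String token = System.getenv("DATABRICKS_TOKEN");

        // Targeting an existing cluster via existing_cluster_id (instead of
        // creating or restarting one) is what lets several runs share the
        // same Databricks cluster. Cluster ID and class are placeholders.
        String payload = "{"
                + "\"run_name\": \"talend-job-run\","
                + "\"existing_cluster_id\": \"01234-56789-cluster\","
                + "\"spark_jar_task\": {\"main_class_name\": \"demo.MyTalendJob\"}"
                + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(workspace + "/api/2.0/jobs/runs/submit"))
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // contains the run_id on success
    }
}
```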
Procedure
Results
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in the file system of the cluster where Spark stores the checkpoint data, that is, the context of the computation such as its metadata and generated RDDs.
For more information about the Spark checkpointing operation, see the official Spark documentation.
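To illustrate what this option enables at the Spark API level, the following is a minimal sketch in Java, assuming a placeholder checkpoint directory /checkpoint and a trivial RDD; in a Talend Job the equivalent calls are generated for you from the check box and directory field described above.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckpointSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CheckpointSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Directory in the cluster file system where Spark persists
        // checkpoint data (metadata and materialized RDDs).
        // "/checkpoint" is a placeholder path.
        sc.setCheckpointDir("/checkpoint");

        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
        rdd.checkpoint(); // mark the RDD for checkpointing
        rdd.count();      // an action triggers the actual write to the directory
        sc.close();
    }
}
```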