Defining the Databricks-on-AWS connection parameters for Spark Jobs
Complete the Databricks connection configuration in the Spark Configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.
The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.
Before you begin
- When running a Spark Streaming Job, only one Job can run on a given Databricks cluster at a time.
- When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box; otherwise, since each run automatically restarts the cluster, Jobs launched in parallel interrupt each other and cause execution failures.
- Ensure that the AWS account to be used has the proper read/write permissions on the S3 bucket to be used. For this purpose, contact the administrator of your AWS system. One way to verify these permissions is sketched after this list.
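If you want to check these permissions programmatically before running the Job, the following Python sketch uses boto3 to attempt a write, a read, and a delete on the bucket. It assumes boto3 is installed and your AWS credentials are already configured; the bucket name and object key are placeholders, not values taken from this documentation.

# Minimal sketch: check that the AWS credentials in use can read and write
# the S3 bucket intended for the Databricks Job.
import boto3
from botocore.exceptions import ClientError

BUCKET = "my-databricks-bucket"        # placeholder bucket name
KEY = "talend/permission-check.txt"    # placeholder object key

s3 = boto3.client("s3")

try:
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"permission check")  # needs s3:PutObject
    s3.get_object(Bucket=BUCKET, Key=KEY)                            # needs s3:GetObject
    s3.delete_object(Bucket=BUCKET, Key=KEY)                         # needs s3:DeleteObject
    print("Read/write access to the bucket looks OK.")
except ClientError as err:
    print("Missing permission or wrong bucket: %s" % err)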
Procedure
Results
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in the file system of the cluster in which Spark stores the context data of the computations, such as the metadata and the generated RDDs.
For more information about the Spark checkpointing operation, see the official Spark documentation.
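As a point of reference, the sketch below shows what this option corresponds to in plain PySpark code outside Talend; the checkpoint directories are placeholders (on a Databricks cluster they would typically be DBFS or S3 paths reachable from the cluster).

# Minimal sketch of Spark checkpointing in plain PySpark; the directories
# below are placeholders, e.g. dbfs:/... or s3a://... paths on Databricks.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="checkpointing-sketch")

# Batch-style checkpointing: Spark saves the RDD data under this directory
# and truncates its lineage, so a failed computation can be recovered.
sc.setCheckpointDir("/tmp/spark-checkpoints")

squares = sc.parallelize(range(100)).map(lambda x: x * x)
squares.checkpoint()   # mark the RDD for checkpointing
squares.count()        # running an action triggers the actual write

# Streaming-style checkpointing: the StreamingContext stores its context
# data (metadata and generated RDDs) under its own checkpoint directory.
ssc = StreamingContext(sc, batchDuration=10)
ssc.checkpoint("/tmp/spark-streaming-checkpoints")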