Defining Databricks Serverless connection parameters with Spark Universal

About this task

Talend Studio connects to Databricks Serverless compute to run your Spark Batch Job. Serverless compute is fully managed by Databricks and requires no cluster configuration.

Before you begin, ensure the following requirements are met:

Unity Catalog is enabled in the Databricks workspace.
The workspace is in a region that supports serverless compute. For more information, see Serverless compute in the Databricks documentation.
The PCI-DSS compliance profile is not enabled in the workspace.

Important: Databricks Serverless is available on AWS and Azure cloud providers only.

Procedure

Click the Run view beneath the design workspace, then click the Spark Configuration view.
Select Built-in from the Property type drop-down list.
If you have already set up the connection parameters in the Repository as explained in Centralizing a Hadoop connection, you can reuse them. To do this, select Repository from the Property type drop-down list, then click […] to open the Repository Content dialog box and select the Hadoop connection to be used.
Tip: Setting up the connection in the Repository saves you from configuring that connection each time you need it in the Spark Configuration view of your Jobs. The fields are filled in automatically.
Select Universal from the Distribution drop-down list, the Spark version from the Version drop-down list, and Databricks Serverless from the Runtime mode/environment drop-down list.

Enter the basic configuration information:

Parameter	Usage
Use local timezone	Select this check box to let Spark use the local time zone provided by the system. Note: If you clear this check box, Spark uses the UTC time zone. Some components also have the Use local timezone for date check box. If you clear that check box in a component, the component inherits the time zone from the Spark configuration.
Use dataset API in migrated components	Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API. If you select the check box, the components inside the Job run with DS, which improves performance. If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged and backward compatibility is preserved. This check box is selected by default, but if you import a Job from version 7.3 or earlier, the check box is cleared because those Jobs run with RDD. Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
Use timestamp for dataset components	Select this check box to use java.sql.Timestamp for dates. Note: If you leave this check box cleared, either java.sql.Timestamp or java.sql.Date is used, depending on the pattern.
Parallelize output files writing	Select this check box to let the Spark Batch Job use multiple threads to write output files in parallel, which reduces execution time. When you leave this check box cleared, the output files are written sequentially in one thread. SubJobs are always processed sequentially; only the output files inside a subJob are written in parallel. This option is available only for Spark Batch Jobs that contain the following output components: tAvroOutput, tFileOutputDelimited (only when the Use dataset API in migrated components check box is selected), and tFileOutputParquet. Important: To avoid memory problems during Job execution, take into account the size of the files being written and the capacity of the execution environment before using this parameter.
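
The behavior described for Parallelize output files writing (subJobs handled one after another, output files inside a subJob written in parallel) can be pictured with the following minimal Python sketch. It is only an illustration and not how Talend Studio or Spark implements the option. The names write_file and run_subjobs and the sample file names are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_file(path: Path, rows: list[str]) -> Path:
    """Write one output file; stands in for a single output component."""
    path.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return path


def run_subjobs(subjobs: list[dict[str, list[str]]], out_dir: Path, parallel: bool) -> list[Path]:
    """Process subJobs in order; optionally write each subJob's files in parallel."""
    written: list[Path] = []
    for outputs in subjobs:  # subJobs are always processed sequentially
        targets = [(out_dir / name, rows) for name, rows in outputs.items()]
        if parallel:
            # One thread per output file of the current subJob (illustrative choice).
            with ThreadPoolExecutor(max_workers=len(targets) or 1) as pool:
                futures = [pool.submit(write_file, p, r) for p, r in targets]
                written.extend(f.result() for f in futures)
        else:
            # Check box cleared: files are written one at a time in one thread.
            written.extend(write_file(p, r) for p, r in targets)
    return written


if __name__ == "__main__":
    demo_dir = Path(tempfile.mkdtemp())
    demo = [{"a.csv": ["1", "2"], "b.csv": ["3"]}, {"c.csv": ["4"]}]
    for path in run_subjobs(demo, demo_dir, parallel=True):
        print(path.name)
```

Because every file of a subJob is held by its own thread at the same time, memory use grows with the number and size of the files, which is why the table advises checking file sizes and environment capacity first.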

Complete the Databricks configuration parameters:

Parameter	Usage
Catalog	Enter the Unity Catalog catalog name. Unity Catalog is mandatory for Databricks Serverless.
Schema	Enter the schema name within the Unity Catalog catalog.
Volume	Enter the Unity Catalog volume name used to store Job dependencies at runtime.
Endpoint	Enter the URL address of your Databricks workspace.
Authentication mode	Select the method used to authenticate to Databricks. Personal access token: authenticate with a personal access token (PAT); in Authentication token, enter the token generated for your Databricks user account. OAuth2: authenticate with OAuth2; in Client ID and Secret ID, enter the OAuth client credentials generated for your Databricks service principal.
Dependencies folder	Enter the directory used to store Job-related dependencies at runtime, with a trailing slash. For example, /jars/.
Poll interval when retrieving Job status (in ms)	Enter the time interval in milliseconds at which Talend Studio polls Databricks for the status of your Job.
Job timeout (in ms)	Enter the maximum time in milliseconds that a Job is allowed to run before it is terminated. Set to `0` to disable the timeout.
Parallelism for joins/aggregations	Enter the number of partitions to use for joins and aggregations. Use `auto` to let Databricks determine the optimal value automatically.
Max partition size for file reads	Enter the maximum size in bytes of a partition when reading files.
Fetch driver logs after run	Select this check box to retrieve the Spark driver logs from Databricks after the Job finishes running.
Enable job queueing	Select this check box to allow Jobs to be queued when the serverless compute capacity is not immediately available.
Set a budget policy	Select this check box to apply a Databricks budget policy to control the cost of your serverless workloads. When selected, enter the policy ID in the field that appears.
Enable ACL	Select this check box to use access control lists (ACLs) to configure permissions for workspace-level or account-level objects.
Environment name	Enter the name of the serverless environment to use for your Job.
Environment version	Enter the version of the serverless environment. Supported versions are `4` and `5`.
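
Several rows in the table above state constraints on the values you enter: a trailing slash on the dependencies folder, `0` to disable the Job timeout, `auto` for parallelism, and environment versions `4` and `5`. The following Python sketch shows how such values could be checked before you type them into Talend Studio. It is a hypothetical helper, not part of Talend Studio or Databricks. The class ServerlessSettings, the function validate, and the rule that the poll interval must be positive are assumptions made for illustration.

```python
from dataclasses import dataclass

# Supported versions as listed in the Environment version row.
SUPPORTED_ENV_VERSIONS = {"4", "5"}


@dataclass
class ServerlessSettings:
    """Hypothetical container for a few of the fields in the table above."""
    catalog: str
    dependencies_folder: str
    poll_interval_ms: int
    job_timeout_ms: int
    parallelism: str
    environment_version: str


def validate(settings: ServerlessSettings) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    if not settings.catalog.strip():
        problems.append("Catalog is required: Unity Catalog is mandatory.")
    if not settings.dependencies_folder.endswith("/"):
        problems.append("Dependencies folder needs a trailing slash, for example /jars/.")
    if settings.poll_interval_ms <= 0:  # assumption: the interval must be positive
        problems.append("Poll interval must be a positive number of milliseconds.")
    if settings.job_timeout_ms < 0:  # 0 is valid and disables the timeout
        problems.append("Job timeout must be 0 (disabled) or a positive number of milliseconds.")
    value = settings.parallelism.strip()
    if value != "auto" and not (value.isdecimal() and int(value) > 0):
        problems.append("Parallelism must be 'auto' or a positive integer.")
    if settings.environment_version not in SUPPORTED_ENV_VERSIONS:
        problems.append("Environment version must be 4 or 5.")
    return problems
```

For example, `validate(ServerlessSettings("main", "/jars", 1000, 0, "auto", "5"))` reports only the missing trailing slash; the catalog name `main` is a placeholder value.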

Results

The connection details to Databricks Serverless are complete. You are ready to run your Spark Batch Job from serverless compute.
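
When the Job runs, the Poll interval when retrieving Job status and Job timeout settings work together: the status is checked at each interval until the Job finishes or the timeout elapses, and a timeout of `0` means no limit. The following Python sketch illustrates that interaction under stated assumptions. It is not Talend Studio's implementation: the function wait_for_job, the get_status callback, and the state names are hypothetical, and a real client would also ask Databricks to terminate the run on timeout.

```python
import time
from typing import Callable

# Hypothetical state names used only for this illustration.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELED"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval_ms: int,
    job_timeout_ms: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until a terminal state is seen or the timeout elapses."""
    # A timeout of 0 disables the deadline entirely.
    deadline = None if job_timeout_ms == 0 else clock() + job_timeout_ms / 1000
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if deadline is not None and clock() >= deadline:
            # A real client would request termination of the run here.
            return "TIMED_OUT"
        sleep(poll_interval_ms / 1000)
```

A shorter poll interval reports completion sooner at the cost of more status requests; the timeout is only evaluated at polling time, so the effective limit can overshoot by up to one interval.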
