
Defining Databricks connection parameters with Spark Universal

About this task

Talend Studio connects to an all-purpose Databricks cluster to run the Job on that cluster.

Procedure

  1. Click the Run view beneath the design workspace, then click the Spark configuration view.
  2. Select Built-in from the Property type drop-down list.
    If you have already set up the connection parameters in the Repository as explained in Centralizing a Hadoop connection, you can easily reuse them. To do this, select Repository from the Property type drop-down list, then click the […] button to open the Repository Content dialog box and select the Hadoop connection to be used.
    Tip: Setting up the connection in the Repository allows you to avoid configuring that connection each time you need it in the Spark configuration view of your Jobs. The fields are automatically filled.
  3. Select Universal from the Distribution drop-down list, the Spark version from the Version drop-down list, and Databricks from the Runtime mode/environment drop-down list.
  4. Enter the basic configuration information:
    Parameter Usage
    Use local timezone Select this check box to let Spark use the local time zone provided by the system (see the sketch after this table).
    Note:
    • If you clear this check box, Spark uses the UTC time zone.
    • Some components also have the Use local timezone for date check box. If you clear that check box in a component, it inherits the time zone from the Spark configuration.
    Use dataset API in migrated components Select this check box to let the components use Dataset (DS) API instead of Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.

    This check box is selected by default, but if you import a Job from version 7.3 or earlier, the check box is cleared, as those Jobs run with RDD.

    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Use timestamp for dataset components Select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box cleared, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
    Batch size (ms) Enter the time interval at the end of which the Spark Streaming Job reviews the source data to identify changes and processes the new micro batches.
    Define a streaming timeout (ms) Select this check box and in the field that is displayed, enter the time frame at the end of which the Spark Streaming Job automatically stops running.
    Note: If you are using Windows 10, it is recommended to set a reasonable timeout to prevent the Windows Service Wrapper from having issues when sending the termination signal from Java applications. If you face such an issue, you can also manually cancel the Job from your Databricks workspace.
    Parallelize output files writing Select this check box to enable the Spark Batch Job to run multiple threads in parallel when writing output files. This option improves execution time.

    When you leave this check box cleared, the output files are written sequentially in one thread.

    At the subJob level, each subJob is processed sequentially. Only the writing of output files inside a subJob is parallelized.

    This option is only available for Spark Batch Jobs containing the following output components:
    • tAvroOutput
    • tFileOutputDelimited (only when the Use dataset API in migrated components check box is selected)
    • tFileOutputParquet
    Important: To avoid memory problems during Job execution, take into account the size of the files being written and the capacity of the execution environment before using this parameter.
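    For illustration only, the Use local timezone option maps to Spark's spark.sql.session.timeZone setting. The following minimal PySpark sketch is not code generated by Talend Studio; the application name, time zone, and epoch value are assumptions chosen for the example. It shows how the same instant is rendered under a local time zone and under UTC:

      # Minimal sketch of the session time zone behavior that the
      # "Use local timezone" check box controls. The application name,
      # time zone, and epoch value are assumptions for this example.
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("timezone_demo").getOrCreate()

      # 1704110400 is 2024-01-01 12:00:00 UTC, used here as a fixed instant.
      df = spark.range(1).select(F.timestamp_seconds(F.lit(1704110400)).alias("ts"))

      # With a local time zone (check box selected), the instant is shown in that zone.
      spark.conf.set("spark.sql.session.timeZone", "Europe/Paris")
      df.show(truncate=False)   # 2024-01-01 13:00:00

      # With UTC (check box cleared), the same instant is shown in UTC.
      spark.conf.set("spark.sql.session.timeZone", "UTC")
      df.show(truncate=False)   # 2024-01-01 12:00:00

    Whether a given component follows this setting also depends on its own Use local timezone for date check box, as described in the table above.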
  5. Complete the Databricks configuration parameters:
    Parameter Usage
    Cloud provider Select the cloud provider to be used: AWS, Azure, or GCP.
    Run mode Select the mode you want to use to run your Job on the Databricks cluster when you execute it in Talend Studio. With Create and run now, a new Job is created and run immediately on Databricks; with Runs submit, a one-time run is submitted without creating a Job on Databricks.
    Enable Unity Catalog Select this check box to leverage Unity Catalog. Then, you need to specify the Unity Catalog related information in the Catalog, Schema, and Volume parameters.
    Important: All these objects must be created on Databricks, with permissions granted to all authorized users, before you use them in Talend Studio.
    Use pool Select this check box to leverage a Databricks pool. If you do, you must indicate the Pool ID instead of the Cluster ID, and select Job clusters from the Cluster type drop-down list.
    Endpoint Enter the URL of your Databricks workspace.
    Cluster ID Enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property in the properties list on the Environment tab in the Spark UI view of your cluster. The sketch after this table shows one way to check this value together with the Endpoint and Authentication token.
    Authentication mode Select the authentication method you want to use from the drop-down list:
    Authentication token Enter the authentication token generated for your Databricks user account.
    Dependencies folder Enter the directory used to store your Job-related dependencies on the Databricks Filesystem (DBFS) at runtime, adding a slash (/) at the end of this directory. For example, enter /jars/ to store the dependencies in a folder named jars. If this folder does not exist, it is created on the fly.

    From Databricks 15.4 LTS onwards, the default library location is WORKSPACE instead of DBFS.

    Project ID Enter the ID of your Google Cloud Platform project where the Databricks project is located.

    This field is only available when you select GCP from the Cloud provider drop-down list.

    Bucket Enter the name of the bucket you use for Databricks on Google Cloud Platform.

    This field is only available when you select GCP from the Cloud provider drop-down list.

    Workspace ID Enter the ID of your Google Cloud Platform workspace, respecting the following format: databricks-workspaceid.

    This field is only available when you select GCP from the Cloud provider drop-down list.

    Google credentials Enter the directory on the Talend JobServer machine in which the JSON file containing your service account key is stored.

    This field is only available when you select GCP from the Cloud provider drop-down list.

    Poll interval when retrieving Job status (in ms) Enter the time interval (in milliseconds) at the end of which you want Talend Studio to ask Spark for the status of your Job.
    Cluster type From the drop-down list, select the type of cluster you want to use. For more information, see About Databricks clusters.
    Note: When you run the Job using Talend Studio with Java 17, you need to set the JNAME=zulu17-ca-amd64 environment variable:
    • on Databricks side for job clusters
    • in Init scripts using the set_java17_dbr.sh script on S3 for all-purpose clusters

    DBFS is no longer supported as the Init scripts location. For all versions of Databricks, it is replaced by WORKSPACE.

    Use policy Select this check box to enter the name of the policy to be used by your Job cluster. You can use a policy to limit the ability to configure clusters based on a set of rules.

    For more information about cluster policies, see Manage cluster policies from the official Databricks documentation.

    Enable ACL

    Select this check box to use access control lists (ACLs) to configure permissions to access workspace or account-level objects.

    In ACL permission, you can configure permission to access workspace objects with CAN_MANAGE, CAN_MANAGE_RUN, IS_OWNER, or CAN_VIEW.

    In ACL type, you can configure permission to use account-level objects with User, Group, or Service Principal.

    In Name, enter the name you were given by the administrator.

    This option is available when Cluster type is set to Job clusters. For more information, see the Databricks documentation.

    Autoscale Select or clear this check box to define the number of workers to be used by your Job cluster.
    • If you select this check box, autoscaling is enabled. Then define the minimum number of workers in Min workers and the maximum number of workers in Max workers. Your Job cluster is scaled up and down within this range based on its workload.

      According to the Databricks documentation, autoscaling works best with Databricks Runtime versions 3.0 and later.

    • If you clear this check box, autoscaling is deactivated. Then define the number of workers a Job cluster is expected to have. This number does not include the Spark driver node.
    Node type and Driver node type Select the node types for the workers and the Spark driver node. These types determine the capacity of your nodes and their pricing by Databricks.

    For more information about these node types and the Databricks Units they use, see Supported Instance Types from the Databricks documentation.

    Disable credentials passthrough Select this check box to disable user credential passthrough when connecting to Databricks with Spark Universal. When this option is selected, users' individual credentials are not used for authentication to data sources.
    Number of on-demand Select this check box to specify the maximum number of on-demand compute resources (such as virtual machines or worker nodes).
    Spot with fall back to On-demand Select this check box to allow the use of spot clusters with automatic fallback to on-demand clusters if spot resources are unavailable.
    Availability zone Select this check box to specify the availability zone in which your Databricks resources will be deployed.
    Max spot price Select this check box to specify the maximum price you are willing to pay per hour for spot instances when Databricks provisions compute resources.
    EBS volume type Select the type of EBS volume from the drop-down list: None, General purpose SSD, or Throughput optimized HDD.
    Configure instance profile ARN Select this check box to specify the instance profile ARN to assign custom permissions to your Databricks resources, enabling secure access to AWS services as needed.
    Elastic disk Select this check box to enable your Job cluster to automatically scale up its disk space when its Spark workers are running low on disk space.

    For more details about this elastic disk feature, search for the section about autoscaling local storage from your Databricks documentation.

    SSH public key If SSH access has been set up for your cluster, enter the public key of the generated SSH key pair. This public key is automatically added to each node of your Job cluster. If no SSH access has been set up, ignore this field.

    For more information about SSH access to your cluster, see SSH access to clusters from the official Databricks documentation.

    Configure cluster logs Select this check box to define where to store your Spark logs for the long term.
    Custom tags Select this check box to add custom tags as key-value pairs to your Databricks resources.
    Init scripts DBFS is no longer supported as the Init scripts location. For all versions of Databricks, it is replaced by WORKSPACE.
    Do not restart the cluster when submitting Select this check box to prevent Talend Studio from restarting the cluster when submitting your Jobs. However, if you make changes to your Jobs, clear this check box so that Talend Studio restarts your cluster to take these changes into account.
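    Before running the Job, you can sanity-check the Endpoint, Authentication token, and Cluster ID values from outside Talend Studio. The following Python sketch is only one possible check, based on the Databricks Clusters REST API; the endpoint, token, and cluster ID values are placeholders (assumptions) to replace with your own:

      # Quick connectivity check for the values entered in the Endpoint,
      # Authentication token, and Cluster ID fields. All values below are
      # placeholders, not defaults provided by Talend Studio.
      import requests  # third-party library: pip install requests

      endpoint = "https://adb-1234567890123456.7.azuredatabricks.net"  # Endpoint
      token = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                        # Authentication token
      cluster_id = "0123-456789-abcdefgh"                               # Cluster ID

      # The Clusters API returns the cluster definition and its current state.
      response = requests.get(
          f"{endpoint}/api/2.0/clusters/get",
          headers={"Authorization": f"Bearer {token}"},
          params={"cluster_id": cluster_id},
          timeout=30,
      )
      response.raise_for_status()
      cluster = response.json()
      print(cluster.get("state"), cluster.get("spark_version"))

    If you are working in a notebook attached to the cluster, spark.conf.get("spark.databricks.clusterUsageTags.clusterId") returns the same Cluster ID value, and dbutils.fs.ls("dbfs:/jars/") lets you check the content of the Dependencies folder after a first run (assuming /jars/ is the folder you entered).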
  6. Select the Set tuning properties check box to define the tuning parameters by following the process explained in Tuning Spark for Apache Spark Batch Jobs.
    Important: You must define the tuning parameters; otherwise, you can get a 400 - Bad request error.
  7. In the Spark "scratch" directory field, enter the local path where Talend Studio stores temporary files, like JARs to transfer.
    If you run the Job on Windows, the default disk is C:. Leaving /tmp in this field will use C:/tmp as the directory.
  8. To make your Job resilient to failure, select Activate checkpointing to enable Spark’s checkpointing operation.
    In the Checkpoint directory field, enter the cluster file system path where Spark saves context data, such as metadata and generated RDDs (see the sketch below).
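    At the Spark level, these two settings correspond to enabling checkpointing and pointing it at a directory on the cluster file system. The following minimal PySpark sketch illustrates that mechanism only; the DBFS path and application name are assumptions, not Talend Studio defaults:

      # Illustration of the Spark checkpointing mechanism that the
      # "Activate checkpointing" and "Checkpoint directory" options rely on.
      # The directory below is an example path; use one your Job can write to.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("checkpoint_demo").getOrCreate()
      spark.sparkContext.setCheckpointDir("dbfs:/tmp/talend_checkpoints/")

      rdd = spark.sparkContext.parallelize(range(10))
      rdd.checkpoint()             # mark the RDD for checkpointing
      rdd.count()                  # materialize the RDD; the checkpoint is written
      print(rdd.isCheckpointed())  # True once the checkpoint data has been saved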
  9. In the Advanced properties table, add any Spark properties you need in order to override the defaults set by Talend Studio.
  10. Select the Use Atlas check box to trace data lineage, view Spark Job components, and track schema changes between components.
    This option is only available for Spark Universal 3.3.x.

    With this option activated, you need to set the following parameters:

    • Atlas URL: Enter the address of your Atlas instance, such as http://name_of_your_atlas_node:port.

    • In the Username and Password fields, enter the authentication information for access to Atlas.

    • Set Atlas configuration folder: Select this check box if your Atlas cluster uses custom properties like SSL or read timeout. In the field that appears, enter the path to a local directory containing your atlas-application.properties file. Your Job will then use these custom properties.

      Ask the administrator of your cluster for this configuration file. For more information, see the Client Configs section in Atlas configuration.

    • Die on error: Select this check box to stop Job execution if Atlas-related issues occur, such as connection errors. Clear it to let your Job continue running.

Results

The connection details to the Databricks cluster are complete. You are ready to schedule executions of your Job or to run it immediately from this cluster.
