Standalone

- Use pool: select this check box to leverage a Databricks pool. If you do, you must indicate the pool ID instead of the cluster ID in the Spark Configuration, and you must select Job clusters from the Cluster type drop-down list.
- In the Endpoint field, enter the URL address of your Azure Databricks workspace. You can find this URL in the Overview blade of your Databricks workspace page on the Azure portal. For example, this URL could look like https://adb-$workspaceId.$random.azuredatabricks.net.
- In the Cluster ID field, enter the ID of the Databricks cluster to be used. This ID is the value of the spark.databricks.clusterUsageTags.clusterId property of your Spark cluster. You can find this property in the properties list on the Environment tab of the Spark UI view of your cluster. You can also find this ID easily in the URL of your Databricks cluster: it appears immediately after clusters/ in that URL.
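For illustration, here is a minimal sketch for retrieving this property, assuming you run it in a Databricks notebook, where a SparkSession named spark is already provided:

```python
# Run inside a Databricks notebook, where `spark` is predefined.
# Prints the ID of the cluster the notebook is attached to.
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
print(cluster_id)
```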
- If you selected the Use pool option, in the Pool ID field, enter the ID of the Databricks pool to be used. This ID is the value of the DatabricksInstancePoolId key of your pool. You can find this key under Tags on the Configuration tab of your pool; it is also available in the tags of the clusters that are using the pool. You can also find this ID easily in the URL of your Databricks pool: it appears immediately after clusters/instance-pools/view/ in that URL.
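If you prefer to look up pool IDs programmatically, here is a minimal sketch using the Databricks Instance Pools REST API. This is only an illustration, not the mechanism Talend Studio itself uses; the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are hypothetical placeholders for your Endpoint and Token values:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]   # hypothetical placeholder, e.g. https://adb-....azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"] # hypothetical placeholder for your personal access token

# List the instance pools of the workspace (Instance Pools API 2.0)
# and print the ID you would enter in the Pool ID field.
resp = requests.get(
    f"{host}/api/2.0/instance-pools/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for pool in resp.json().get("instance_pools", []):
    print(pool["instance_pool_name"], pool["instance_pool_id"])
```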
- Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Personal access tokens in the official Azure documentation.
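As a quick way to check that the endpoint and token work together, here is a minimal sketch that lists your personal access tokens through the Databricks Token REST API. Again an illustration only; the environment variables are hypothetical placeholders:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]   # hypothetical placeholders, as above
token = os.environ["DATABRICKS_TOKEN"]

# If the endpoint and token are valid, this call succeeds and lists
# your personal access tokens (Token API 2.0).
resp = requests.get(
    f"{host}/api/2.0/token/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for info in resp.json().get("token_infos", []):
    print(info["token_id"], info.get("comment", ""))
```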
- In the Dependencies folder field, enter the directory used to store your Job-related dependencies on the Databricks Filesystem (DBFS) at runtime, putting a slash (/) at the end of this directory. For example, enter /jars/ to store the dependencies in a folder named jars. This folder is created on the fly if it does not already exist. From Databricks Runtime 15.4 LTS onwards, the default library location is WORKSPACE instead of DBFS.
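Talend Studio creates this folder for you, so the following is purely to illustrate that the value is a DBFS path: a minimal sketch using the DBFS REST API to create and inspect a /jars/ folder, with the same hypothetical placeholders as above:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]   # hypothetical placeholders, as above
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}

# Create the dependencies folder (DBFS API 2.0); mkdirs is a no-op if
# the directory already exists, mirroring the on-the-fly creation above.
requests.post(f"{host}/api/2.0/dbfs/mkdirs",
              headers=headers, json={"path": "/jars"}).raise_for_status()

# List its contents, for example after a Job has uploaded dependencies.
resp = requests.get(f"{host}/api/2.0/dbfs/list",
                    headers=headers, params={"path": "/jars"})
resp.raise_for_status()
for entry in resp.json().get("files", []):
    print(entry["path"])
```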
- Poll interval when retrieving Job status (in ms): enter the time interval (in milliseconds) at the end of which you want Talend Studio to ask Spark for the status of your Job, for example Pending or Running. The default value is 300000, meaning 5 minutes. This interval is recommended by Databricks to correctly retrieve the Job status.
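To make the polling behavior concrete, here is a minimal sketch of such a status loop against the Databricks Jobs REST API, under the assumption that a Job run was already submitted. The run ID and environment variables are hypothetical placeholders; Talend Studio performs this polling for you:

```python
import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]   # hypothetical placeholders, as above
token = os.environ["DATABRICKS_TOKEN"]
run_id = 12345                         # hypothetical ID of a submitted run

POLL_INTERVAL_MS = 300000              # the default interval described above

# Ask Databricks for the run status once per interval (Jobs API 2.1)
# until the run leaves the PENDING/RUNNING life cycle states.
while True:
    resp = requests.get(
        f"{host}/api/2.1/jobs/runs/get",
        headers={"Authorization": f"Bearer {token}"},
        params={"run_id": run_id},
    )
    resp.raise_for_status()
    state = resp.json()["state"]["life_cycle_state"]
    print(state)
    if state not in ("PENDING", "RUNNING"):
        break
    time.sleep(POLL_INTERVAL_MS / 1000)
```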
- Cluster type: select the type of cluster to be used, either Job clusters or All-purpose clusters. The custom properties you defined in the Advanced properties table are automatically taken into account by Job clusters at runtime.
- Use policy: select this check box to enter the name of the policy to be used by your Job cluster. You can use a policy to limit the ability to configure clusters based on a set of rules. For more information about cluster policies, see Manage cluster policies in the official Databricks documentation.
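To find the exact policy name to enter, you can list the policies defined in your workspace. Here is a minimal sketch using the Cluster Policies REST API, with the same hypothetical placeholders as above:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]   # hypothetical placeholders, as above
token = os.environ["DATABRICKS_TOKEN"]

# List the cluster policies of the workspace (Cluster Policies API 2.0)
# and print the names you can enter for the Use policy option.
resp = requests.get(
    f"{host}/api/2.0/policies/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for policy in resp.json().get("policies", []):
    print(policy["name"], policy["policy_id"])
```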
- Enable ACL: select this check box to use access control lists (ACLs) to configure permission to access workspace or account-level objects. In ACL permission, you can configure permission to access workspace objects with CAN_MANAGE, CAN_MANAGE_RUN, IS_OWNER, or CAN_VIEW. In ACL type, you can configure permission to use account-level objects with User, Group, or Service Principal. In Name, enter the name you were given by the administrator. This option is available when Cluster type is set to Job clusters. For more information, see the Databricks documentation.
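The permission levels above are the same ones exposed by the Databricks Permissions REST API. As an illustration of what such an ACL amounts to, here is a minimal sketch granting a user CAN_MANAGE_RUN on a job; the job ID, user name, and environment variables are hypothetical placeholders:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]   # hypothetical placeholders, as above
token = os.environ["DATABRICKS_TOKEN"]
job_id = 12345                         # hypothetical ID of an existing job

# Grant a user CAN_MANAGE_RUN on the job (Permissions API). The
# permission levels match the ACL permission values listed above.
resp = requests.patch(
    f"{host}/api/2.0/permissions/jobs/{job_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "access_control_list": [
            {
                "user_name": "jane.doe@example.com",  # hypothetical user
                "permission_level": "CAN_MANAGE_RUN",
            }
        ]
    },
)
resp.raise_for_status()
print(resp.json())
```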
- Autoscale: select or clear this check box to define the number of workers to be used by your Job cluster.
  - If you select this check box, autoscaling is enabled. Then define the minimum number of workers in Min workers and the maximum number of workers in Max workers. Your Job cluster is scaled up and down within this range based on its workload. According to the Databricks documentation, autoscaling works best with Databricks Runtime versions 3.0 onwards.
  - If you clear this check box, autoscaling is deactivated. Then define the number of workers your Job cluster is expected to have. This number does not include the Spark driver node.
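To make the two modes concrete, here is a sketch of the matching fragments of a Databricks cluster specification, as used by the Clusters REST API; the runtime version and node type values are hypothetical examples:

```python
# Check box selected: Databricks scales the cluster between
# min_workers and max_workers based on its workload.
autoscaling_spec = {
    "spark_version": "15.4.x-scala2.12",  # hypothetical runtime version
    "node_type_id": "Standard_DS3_v2",    # hypothetical node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

# Check box cleared: a fixed number of workers, which does not
# include the Spark driver node.
fixed_spec = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
}
```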
- Node type and Driver node type: select the node types for the workers and the Spark driver node. These types determine the capacity of your nodes and their pricing by Databricks. For more information about these node types and the Databricks Units they use, see Supported Instance Types in the Databricks documentation.
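To compare the capacities of the node types offered by your workspace, you can also list them programmatically. Here is a minimal sketch using the Clusters REST API, with the same hypothetical placeholders as above:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]   # hypothetical placeholders, as above
token = os.environ["DATABRICKS_TOKEN"]

# List the node types available in the workspace (Clusters API 2.0),
# with their memory size and core count.
resp = requests.get(
    f"{host}/api/2.0/clusters/list-node-types",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
for nt in resp.json()["node_types"]:
    print(nt["node_type_id"], nt["memory_mb"], "MB,", nt["num_cores"], "cores")
```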
- Elastic disk: select this check box to enable your Job cluster to automatically scale up its disk space when its Spark workers are running low on disk space. For more details about this elastic disk feature, search for the section about autoscaling local storage in your Databricks documentation. This setting, together with the SSH, cluster log, and init script settings below, is illustrated in the sketch after the Init scripts item.
- SSH public key: if SSH access has been set up for your cluster, enter the public key of the generated SSH key pair. This public key is automatically added to each node of your Job cluster. If no SSH access has been set up, ignore this field. For more information about SSH access to your cluster, see SSH access to clusters in the official Databricks documentation.
- Configure cluster log: select this check box to define where to store your Spark logs for the long term. This storage system can be S3 or DBFS.
- Init scripts: DBFS is no longer supported as the Init scripts location. For all versions of Databricks, it has been replaced by WORKSPACE.
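For orientation, here is a sketch of how the Elastic disk, SSH public key, cluster log, and init script settings map onto a Databricks cluster specification (Clusters REST API); the key, paths, and destinations are hypothetical examples:

```python
# Hypothetical fragment of a cluster specification showing the fields
# behind the four options described above.
cluster_settings = {
    # Elastic disk: let workers scale up local disk space automatically.
    "enable_elastic_disk": True,
    # SSH public key added to each node of the cluster (hypothetical key).
    "ssh_public_keys": ["ssh-rsa AAAAB3... user@example.com"],
    # Long-term Spark log storage on DBFS (an S3 destination also works).
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
    # Init scripts now live in the workspace, not on DBFS.
    "init_scripts": [
        {"workspace": {"destination": "/Shared/init/setup.sh"}}
    ],
}
print(cluster_settings)
```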
- Do not restart the cluster when submitting: this option is available when Cluster type is set to All-purpose clusters. Select this check box to prevent Talend Studio from restarting the cluster when submitting your Jobs. However, if you make changes in your Jobs, clear this check box so that Talend Studio restarts your cluster to take these changes into account.