Defining Kubernetes connection parameters with Spark Universal

Procedure

Click the Run view beneath the design workspace, then click the Spark configuration view.
Select Built-in from the Property type drop-down list.
If you have already set up the connection parameters in the Repository, as explained in Centralizing a Hadoop connection, you can reuse them. To do this, select Repository from the Property type drop-down list, then click the […] button to open the Repository Content dialog box and select the Hadoop connection to be used.
Tip: Setting up the connection in the Repository saves you from configuring it each time you need it in the Spark configuration view of your Jobs. The fields are filled in automatically.
Select Universal from the Distribution drop-down list, the Spark version from the Version drop-down list, and Kubernetes from the Runtime mode/environment drop-down list.

Complete the Kubernetes configuration parameters:

Parameter	Usage
Kubernetes submit mode	Choose whether Spark Jobs are submitted directly with Spark submit or through the Livy REST API for execution on a Kubernetes cluster.
Kubernetes master	Enter the API server address in the following format: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>. You can retrieve it by running the kubectl config view --minify command in your command line interface.
Number of executor instances	Enter the number of executors to be used for the Job execution.
Use registry secret	Enter the password to access the Docker image, if needed.
Docker image	Enter the name of the Docker image to be used for the execution.
Namespace	Enter the namespace to use on the Kubernetes cluster.
Service account	Enter the name of the service account to be used. The service account must have sufficient rights on the Kubernetes cluster.
Cloud storage	Select the Cloud provider you want to use from the drop-down list and enter the information and credentials in the corresponding fields.
Cloud storage > S3	Set the following parameters to connect to S3: Bucket, Path to folder, Credentials type, Access key, and Secret key.
Cloud storage > Blob	Set the following parameters to connect to Azure Blob Storage: Path to folder, Storage account, Container name, and Secret key.
Cloud storage > Adls gen 2	Set the following parameters to connect to ADLS Gen 2: Path to folder, Storage account, Credentials type, Container name, and Secret key.
Cloud storage > HDFS	Set the following parameters to connect to HDFS: Use Kerberos, HDFS address, User, and Path to folder.
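
As an illustration of the Kubernetes master format described above, the following Python sketch converts an API server URL into the k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> form and rejects values that lack a host or port. The function name to_spark_master and the host k8s.example.com are hypothetical. The sketch assumes the input is the server value reported by kubectl config view --minify for the current cluster.

```python
from urllib.parse import urlsplit


def to_spark_master(server_url: str) -> str:
    """Build a k8s://https://<host>:<port> value from an API server URL.

    Hypothetical helper for illustration; it is not part of Talend Studio.
    """
    parts = urlsplit(server_url)
    # The documented format requires an https API server address.
    if parts.scheme != "https":
        raise ValueError(f"expected an https URL, got: {server_url!r}")
    # Both the host and the port must be present in the documented format.
    if not parts.hostname or parts.port is None:
        raise ValueError(f"host and port are both required: {server_url!r}")
    return f"k8s://https://{parts.hostname}:{parts.port}"


if __name__ == "__main__":
    # k8s.example.com is a placeholder host, not a real cluster.
    print(to_spark_master("https://k8s.example.com:6443"))
```

The value printed by the sketch is what you would paste into the Kubernetes master field.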

Enter the basic Configuration information:

Parameter	Usage
Use local timezone	Select this check box to let Spark use the local time zone provided by the system. Note: If you clear this check box, Spark uses the UTC time zone. Some components also have the Use local timezone for date check box. If you clear that check box in a component, the component inherits the time zone from the Spark configuration.
Use dataset API in migrated components	Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API. If you select the check box, the components inside the Job run with DS, which improves performance. If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged and backward compatibility is preserved. This check box is selected by default, but if you import a Job from version 7.3 or earlier, the check box is cleared because those Jobs run with RDD. Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
Use timestamp for dataset components	Select this check box to use java.sql.Timestamp for dates. Note: If you leave this check box cleared, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
Batch size (ms)	Enter the time interval at the end of which the Spark Streaming Job reviews the source data to identify changes and processes the new micro batches.
Define a streaming timeout (ms)	Select this check box and, in the field that is displayed, enter the time frame at the end of which the Spark Streaming Job automatically stops running. Note: If you are using Windows 10, it is recommended to set a reasonable timeout to prevent Windows Service Wrapper from having issues when sending termination signals from Java applications. If you encounter such an issue, you can also manually cancel the Job from your Azure Synapse workspace.
Parallelize output files writing	Select this check box to enable the Spark Batch Job to run multiple threads in parallel when writing output files. This option improves execution performance. When you leave this check box cleared, the output files are written sequentially in one thread. At the subJob level, subJobs are processed sequentially; only the output files inside a subJob are written in parallel. This option is only available for Spark Batch Jobs containing the following output components: tAvroOutput, tFileOutputDelimited (only when the Use dataset API in migrated components check box is selected), and tFileOutputParquet. Important: To avoid memory problems during the execution of the Job, take into account the size of the files being written and the capacity of the execution environment before using this parameter.
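
The behavior of the Parallelize output files writing option can be pictured with the following Python sketch. It is a conceptual model only, using plain threads rather than Spark, and all names in it (write_output, run_subjob, run_job) are hypothetical. It shows subJobs handled one after another, while the output files inside a subJob are written either sequentially or in parallel threads.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def write_output(path: Path, rows: List[str]) -> Path:
    """Write one output file and return its path."""
    path.write_text("\n".join(rows), encoding="utf-8")
    return path


def run_subjob(outputs: Dict[Path, List[str]], parallel: bool) -> List[Path]:
    """Write all output files of one subJob."""
    if not parallel:
        # Option cleared: files are written one by one in a single thread.
        return [write_output(path, rows) for path, rows in outputs.items()]
    # Option selected: each output file is written in its own worker thread.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(write_output, path, rows)
                   for path, rows in outputs.items()]
        return [future.result() for future in futures]


def run_job(subjobs: List[Dict[Path, List[str]]], parallel: bool) -> List[Path]:
    """Process subJobs sequentially; only writes inside a subJob may overlap."""
    written: List[Path] = []
    for outputs in subjobs:
        written.extend(run_subjob(outputs, parallel))
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        subjobs = [
            {base / "a.csv": ["1;x"], base / "b.csv": ["2;y"]},
            {base / "c.csv": ["3;z"]},
        ]
        print([p.name for p in run_job(subjobs, parallel=True)])
```

Because every file in a subJob is held by a worker at the same time, memory use grows with the number and size of the files, which is why the capacity of the execution environment matters.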

Select the Set tuning properties check box to define the tuning parameters, by following the process explained in Tuning Spark for Apache Spark Batch Jobs.

Important: You must define the tuning parameters; otherwise, you may get an error (400 - Bad request).
In the Spark "scratch" directory field, enter the local path where Talend Studio stores temporary files, such as the JAR files to transfer.
If you run the Job on Windows, the default drive is C:. If you leave /tmp in this field, C:/tmp is used as the directory.
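
The following Python sketch illustrates how a drive-less path such as /tmp ends up on the default drive under Windows path rules. It only models the behavior described above with the standard pathlib module; it is not Talend Studio's own logic, and the function name resolve_scratch_dir is hypothetical.

```python
from pathlib import PureWindowsPath


def resolve_scratch_dir(field_value: str, default_drive: str = "C:") -> str:
    """Apply the default drive to a scratch path that has no drive letter.

    Hypothetical helper: a value that already names a drive keeps it.
    """
    # Joining a drive with a rooted path keeps the drive and uses the new root.
    path = PureWindowsPath(default_drive, field_value)
    return path.as_posix()


if __name__ == "__main__":
    print(resolve_scratch_dir("/tmp"))          # default drive is applied
    print(resolve_scratch_dir("D:/spark/tmp"))  # explicit drive is kept
```

To store the temporary files on another drive, enter a path that includes the drive letter.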
To make your Job resilient to failure, select Activate checkpointing to enable Spark’s checkpointing operation.
In the Checkpoint directory field, enter the cluster file system path where Spark saves context data, such as metadata and generated RDDs.
In the Advanced properties table, add any Spark properties you want to use to override the defaults set by Talend Studio.
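
The override behavior of the Advanced properties table can be sketched as a simple key-by-key merge, as in the Python example below. The property names are standard Spark configuration keys, but the default values shown and the function name effective_properties are assumptions made for illustration; the actual defaults set by Talend Studio are not listed here.

```python
from typing import Dict


def effective_properties(defaults: Dict[str, str],
                         advanced: Dict[str, str]) -> Dict[str, str]:
    """Return the defaults with every advanced property applied on top."""
    merged = dict(defaults)   # copy so the defaults stay untouched
    merged.update(advanced)   # an advanced entry wins over a default
    return merged


if __name__ == "__main__":
    # Hypothetical defaults; the real values depend on your configuration.
    studio_defaults = {
        "spark.executor.instances": "2",
        "spark.kubernetes.namespace": "default",
    }
    # Rows you might add in the Advanced properties table.
    advanced_properties = {
        "spark.executor.instances": "4",
        "spark.executor.memory": "2g",
    }
    for key, value in sorted(
            effective_properties(studio_defaults, advanced_properties).items()):
        print(f"{key}={value}")
```

A property that appears only in the table is added, and a property that appears in both places takes the value from the table.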

Results

The connection details to the Kubernetes cluster are complete. You are ready to schedule executions of your Job or to run it immediately from this cluster.
