Defining HDInsight connection parameters with Spark Universal
Procedure
-
Enter the basic configuration information to connect to HDInsight:
- Username: Enter your HDInsight cluster username.
- Password: Enter your HDInsight cluster password.
-
Enter the basic configuration information for Livy:
- Hostname: Enter the URL of your HDInsight cluster.
- Port: Enter the port number. The default one is 443.
- Username: Enter the username you defined when creating your cluster. You can find it in the SSH + Cluster login blade of your cluster.
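For reference, HDInsight exposes Livy through the cluster's HTTPS gateway, so the Hostname, Port, and Username above combine into a URL of the form https://your_cluster_name.azurehdinsight.net:443/livy. The following Scala sketch, with hypothetical cluster and credential values, checks that this endpoint answers before you run a Job; it is a convenience check, not something Talend Studio requires.
```scala
import java.net.{HttpURLConnection, URL}
import java.util.Base64

object LivyConnectionCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical values: replace with your own cluster name and credentials.
    val clusterName = "your_hdinsight_cluster"
    val user        = "your_livy_username"
    val password    = "your_password"

    // HDInsight exposes Livy over HTTPS on port 443, behind the cluster gateway.
    val endpoint = new URL(s"https://$clusterName.azurehdinsight.net:443/livy/sessions")

    val connection  = endpoint.openConnection().asInstanceOf[HttpURLConnection]
    val credentials = Base64.getEncoder.encodeToString(s"$user:$password".getBytes("UTF-8"))
    connection.setRequestMethod("GET")
    connection.setRequestProperty("Authorization", s"Basic $credentials")

    // HTTP 200 means the gateway accepted the credentials and Livy answered.
    println(s"Livy responded with HTTP ${connection.getResponseCode}")
    connection.disconnect()
  }
}
```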
-
Enter the Job status polling configuration:
- Poll interval when retrieving Job status (in ms): Enter the time interval (in milliseconds) at the end of which you want Talend Studio to ask Spark for the status of your Job.
- Maximum number of consecutive statuses missing: Enter the maximum number of times Talend Studio should retry to get a status when there is no status response.
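To make the interplay of these two settings concrete, here is a simplified, illustrative polling loop. Talend Studio's actual implementation may differ, but the idea is that a status request is sent at every poll interval, and the Job is considered lost only after the configured number of consecutive missing responses.
```scala
import scala.annotation.tailrec

// Illustrative sketch only: a simplified polling loop showing how the two
// settings interact. Talend Studio's actual implementation may differ.
object JobStatusPolling {
  val pollIntervalMs     = 3000 // "Poll interval when retrieving Job status (in ms)"
  val maxMissingStatuses = 10   // "Maximum number of consecutive statuses missing"

  // Hypothetical helper standing in for the Livy status call.
  def fetchStatus(): Option[String] = None

  @tailrec
  def poll(missingInARow: Int = 0): Unit = {
    if (missingInARow >= maxMissingStatuses) {
      println("Too many consecutive missing statuses: giving up.")
    } else {
      Thread.sleep(pollIntervalMs)
      fetchStatus() match {
        case Some(state) =>
          println(s"Job state: $state")
          if (state != "success" && state != "dead") poll(0) // reset the counter
        case None =>
          poll(missingInARow + 1)                            // one more missing status
      }
    }
  }
}
```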
-
Enter the configuration information for Windows Azure Storage:
- Primary storage: Select from the drop-down list the type of storage where you want to deploy your Job: ADLS Gen2 or Azure Storage.
- Authentication mode: Select from the drop-down list the authentication type you want to use: Azure Active Directory, Secret key, or Shared Access Signature.
- Hostname: Enter the Primary Blob Service Endpoint of your Azure Storage account. You can find this endpoint in the Properties blade of the storage account.
- Container: Enter the name of the container to be used. You can find the available containers in the Blob blade of the Azure Storage account to be used.
- Directory ID: Enter the directory ID. This parameter is only available when you select Azure Active Directory from the Authentication mode drop-down list.
- Application ID: Enter the application ID. This parameter is only available when you select Azure Active Directory from the Authentication mode drop-down list.
- Client key: Enter the client key. This parameter is only available when you select Azure Active Directory from the Authentication mode drop-down list (see the illustrative sketch after this list).
- SAS Token: Enter the shared access signature (SAS) token for your storage container. For more information on how to generate the SAS token, see Create SAS tokens for your storage containers from the Microsoft documentation. When you are using the SAS token, you need to configure your cluster. For more information, see Use Azure Blob storage Shared Access Signatures to restrict access to data in HDInsight from the Microsoft documentation. This parameter is only available when you select Shared Access Signature from the Authentication mode drop-down list.
- Deployment Blob: Enter the location in which you want to store the current Job and its dependent libraries in the storage account.
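When Azure Active Directory is selected, the Directory ID, Application ID, and Client key identify an Azure AD service principal. As an illustration only, the following sketch, with hypothetical account, container, and credential values, shows how such values are typically wired into Spark's Hadoop configuration for ADLS Gen2 through the hadoop-azure (ABFS) connector. Talend Studio handles this configuration for you based on the fields above, so the sketch is a reference for the mapping, not something you need to code.
```scala
import org.apache.spark.sql.SparkSession

object AdlsGen2OAuthExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical values: storage account, container, and Azure AD application.
    val account     = "mystorageaccount"
    val container   = "mycontainer"
    val directoryId = "00000000-0000-0000-0000-000000000000" // Directory ID
    val appId       = "11111111-1111-1111-1111-111111111111" // Application ID
    val clientKey   = "your_client_key"                      // Client key

    val spark = SparkSession.builder()
      .appName("adls-gen2-oauth-example")
      .master("local[*]")
      .getOrCreate()

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    // ABFS OAuth settings used by the hadoop-azure connector for ADLS Gen2.
    hadoopConf.set(s"fs.azure.account.auth.type.$account.dfs.core.windows.net", "OAuth")
    hadoopConf.set(s"fs.azure.account.oauth.provider.type.$account.dfs.core.windows.net",
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    hadoopConf.set(s"fs.azure.account.oauth2.client.id.$account.dfs.core.windows.net", appId)
    hadoopConf.set(s"fs.azure.account.oauth2.client.secret.$account.dfs.core.windows.net", clientKey)
    hadoopConf.set(s"fs.azure.account.oauth2.client.endpoint.$account.dfs.core.windows.net",
      s"https://login.microsoftonline.com/$directoryId/oauth2/token")

    // Files are then addressed with an abfss:// URI on the container.
    val basePath = s"abfss://$container@$account.dfs.core.windows.net/deployment"
    println(s"Deployment path: $basePath")

    spark.stop()
  }
}
```
-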
If you run your Spark Job on Windows, specify the location of the
winutils.exe program:
- If you want to use your own winutils.exe file, select the Define the Hadoop home directory check box and enter its folder path.
- Otherwise, leave the Define the Hadoop home directory check box clear. Talend Studio will generate and use a directory automatically for this Job.
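If you provide your own winutils.exe, the Hadoop home directory you enter must contain a bin folder with winutils.exe inside, because Hadoop's Windows shims resolve the program from %HADOOP_HOME%\bin or from the hadoop.home.dir system property. The sketch below, with a hypothetical path, shows the JVM-level equivalent of that setting.
```scala
object WindowsHadoopHome {
  def main(args: Array[String]): Unit = {
    // Hypothetical path: the folder must contain bin\winutils.exe.
    val hadoopHome = "C:\\hadoop"

    // Equivalent of defining the Hadoop home directory for the local JVM:
    // Hadoop's Windows shims resolve winutils.exe from %HADOOP_HOME%\bin
    // or from the hadoop.home.dir system property.
    System.setProperty("hadoop.home.dir", hadoopHome)

    println(s"winutils.exe expected at: $hadoopHome\\bin\\winutils.exe")
  }
}
```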
-
Enter the basic configuration information:
- Use local timezone: Select this check box to let Spark use the local time zone provided by the system.
  Note: If you clear this check box, Spark uses the UTC time zone. Some components also have a Use local timezone for date check box. If you clear that check box in the component, the component inherits the time zone from the Spark configuration. The sketch after this list illustrates the underlying Spark session time zone behavior.
- Use dataset API in migrated components: Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
  - If you select the check box, the components inside the Job run with DS, which improves performance.
  - If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backwards compatibility.
  This check box is selected by default, but if you import a Job created in version 7.3 or earlier, the check box is cleared because those Jobs run with RDD.
  Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
- Use timestamp for dataset components: Select this check box to use java.sql.Timestamp for dates.
  Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
- Batch size (ms): Enter the time interval at the end of which the Spark Streaming Job reviews the source data to identify changes and processes the new micro batches.
- Define a streaming timeout (ms): Select this check box and, in the field that is displayed, enter the time frame at the end of which the Spark Streaming Job automatically stops running.
  Note: If you are using Windows 10, it is recommended to set a reasonable timeout to prevent the Windows Service Wrapper from having issues when sending the termination signal to the Java application. If you face this issue, you can also cancel the Job manually from your Azure Synapse workspace.
- Parallelize output files writing: Select this check box to let the Spark Batch Job run multiple threads in parallel when writing output files, which improves execution time. When you leave this check box cleared, the output files are written sequentially in one thread.
  At the subJob level, each subJob is treated sequentially; only the output files inside a subJob are written in parallel.
  This option is only available for Spark Batch Jobs containing the following output components:
  - tAvroOutput
  - tFileOutputDelimited (only when the Use dataset API in migrated components check box is selected)
  - tFileOutputParquet
  Important: To avoid memory problems during the execution of the Job, take into account the size of the files being written and the capacity of the execution environment before using this parameter.
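For context only, the Use local timezone behavior can be related to Spark's own session time zone setting, spark.sql.session.timeZone. The sketch below, which assumes a local Spark session, shows the difference between rendering timestamps in the JVM's local time zone and in UTC; it illustrates the underlying Spark behavior rather than the exact properties Talend Studio sets.
```scala
import java.util.TimeZone
import org.apache.spark.sql.SparkSession

object SessionTimeZoneExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("session-timezone-example")
      .master("local[*]")
      .getOrCreate()

    // With the local time zone: timestamps are rendered in the system zone.
    spark.conf.set("spark.sql.session.timeZone", TimeZone.getDefault.getID)
    spark.sql("SELECT current_timestamp() AS local_ts").show(truncate = false)

    // With UTC: the same instant is rendered in UTC.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    spark.sql("SELECT current_timestamp() AS utc_ts").show(truncate = false)

    spark.stop()
  }
}
```
-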
Select the Set tuning properties check box to define the tuning parameters by following the process explained in Tuning Spark for Apache Spark Batch Jobs.
Important: You must define the tuning parameters; otherwise, you may get an error (400 - Bad request).
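As an illustration of what these tuning parameters typically cover, the following sketch builds a SparkConf with hypothetical resource values; size them to your cluster rather than copying them as-is.
```scala
import org.apache.spark.SparkConf

object TuningPropertiesExample {
  // Illustrative values only; size them to your cluster.
  def tunedConf(): SparkConf =
    new SparkConf()
      .set("spark.driver.memory", "4g")
      .set("spark.executor.memory", "4g")
      .set("spark.executor.cores", "2")
      .set("spark.executor.instances", "4")
}
```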
-
In the Spark "scratch" directory field, enter the local
path where Talend Studio stores temporary files, like JARs to transfer.
If you run the Job on Windows, the default disk is C:. Leaving /tmp in this field will use C:/tmp as the directory.
-
To make your Job resilient to failure, select Activate
checkpointing to enable Spark’s checkpointing operation.
In the Checkpoint directory field, enter the cluster file system path where Spark saves context data, such as metadata and generated RDDs.
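A minimal sketch of the underlying Spark mechanism, assuming a hypothetical checkpoint path: checkpointing saves intermediate data to the given directory and truncates the lineage so that recovery does not have to start from the original source.
```scala
import org.apache.spark.sql.SparkSession

object CheckpointingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpointing-example")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical cluster file system path; matches the "Checkpoint directory" field.
    sc.setCheckpointDir("/user/talend/checkpoints")

    // Checkpointing truncates the RDD lineage so a failed stage can be
    // recomputed from the saved data instead of from the original source.
    val numbers = sc.parallelize(1 to 1000).map(_ * 2)
    numbers.checkpoint()
    println(s"Sum after checkpoint: ${numbers.sum()}")

    spark.stop()
  }
}
```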
- In the Advanced properties table, add any Spark properties you need in order to override the defaults set by Talend Studio.
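Each row of the table is a property name and value pair. The sketch below uses arbitrary example properties to show what such overrides amount to when Spark builds its configuration; the property names here are illustrative, not a required set.
```scala
import org.apache.spark.SparkConf

object AdvancedPropertiesExample {
  // Each row of the Advanced properties table is a property name / value pair.
  // Arbitrary example overrides; any Spark property can be listed this way.
  def withOverrides(base: SparkConf): SparkConf =
    base
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.sql.shuffle.partitions", "48")
}
```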
-
Select the Use Atlas check box to trace data lineage,
view Spark Job components, and track schema changes between components.
This option is only available for Spark Universal 3.3.x.
With this option activated, you need to set the following parameters:
-
Atlas URL: Enter the address of your Atlas instance, such as http://name_of_your_atlas_node:port.
-
In the Username and Password fields, enter the authentication information for access to Atlas.
-
Set Atlas configuration folder: Select this check box if your Atlas cluster uses custom properties like SSL or read timeout. In the field that appears, enter the path to a local directory containing your atlas-application.properties file. Your Job will then use these custom properties.
Ask the administrator of your cluster for this configuration file. For more information, see the Client Configs section in Atlas configuration.
-
Die on error: Select this check box to stop Job execution if Atlas-related issues occur, such as connection errors. Clear it to let your Job continue running.
Results