Defining Azure Synapse Analytics connection parameters with Spark Universal
Complete the Azure Synapse Analytics connection configuration with Spark Universal in the Spark configuration tab of the Run view of your Spark Batch Job. This configuration is effective on a per-Job basis.
Procedure
- Enter the basic configuration information to connect to Azure Synapse:
Endpoint
Enter the Development endpoint from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.

Authorization token
Enter the token generated for your Azure Synapse account.
Note: To generate a token, enter the following command:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'client_id=<YourClientID>&scope=https://dev.azuresynapse.net/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' 'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token'
You can retrieve your Client ID, Client Secret, and Tenant ID from your Azure portal.
The authentication to Azure Synapse is performed using an Azure Active Directory application. For more information on how to register an application with Azure Active Directory, see Use the portal to create an Azure AD application and service principal that can access resources in the official Microsoft documentation.
Important: The token is valid for one hour. After that, regenerate it to avoid a 401 Not authorized error (a scripted example follows this table).

Apache Spark pools
Enter, in double quotation marks, the name of the Apache Spark pool to be used.
Note: On the Azure Synapse workspace side, make sure that:
- the Autoscale option in Basic settings and the Automatic pausing option in Additional settings are enabled when creating an Apache Spark pool
- the selected Apache Spark version is set to 3.0 (preview)

Poll interval when retrieving Job status (in ms)
Enter, without quotation marks, the time interval (in milliseconds) at the end of which you want Talend Studio to ask Spark for the status of your Job. The default value is 3000, that is, 3 seconds.

Maximum number of consecutive statuses missing
Enter the maximum number of times Talend Studio should retry to get a status when there is no status response. The default value is 10.
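Because the token expires after one hour, it can help to script its generation. The following is a minimal sketch, not part of Talend Studio; it assumes a Unix-like shell and the jq JSON parser, and the <YourClientID>, <YourClientSecret>, and <YourTenantID> placeholders come from your Azure portal:

  # Request a fresh token for the https://dev.azuresynapse.net scope (valid for one hour).
  RESPONSE=$(curl -s -X POST -H "Content-Type: application/x-www-form-urlencoded" \
    -d 'client_id=<YourClientID>&scope=https://dev.azuresynapse.net/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' \
    'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token')
  # Print the bearer token to paste into the Authorization token field.
  echo "$RESPONSE" | jq -r '.access_token'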
- Enter the basic storage information of Azure Synapse:
Authentication method
Select the authentication mode to be used from the drop-down list:
- Secret Key
- Azure Active Directory

Storage
Select the storage to be used from the drop-down list. ADLS Gen2 is the default storage for an Azure Synapse Analytics workspace. If you are using Azure Active Directory authentication, make sure the application is linked to the ADLS Gen2 storage with the Storage Blob Data Contributor role granted (a quick way to check this access follows this table).

Hostname
Enter the Primary ADLS Gen2 account from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.

Container
Enter the Primary ADLS Gen2 file system from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.

Username
Enter the storage account name linked to your Azure Synapse workspace. This property is only available when you select Secret Key from the Authentication method drop-down list.

Password
Enter the access key linked to your Azure Synapse workspace. For more information about how to retrieve your access keys, see View account access keys in the official Microsoft documentation.
This property is only available when you select Secret Key from the Authentication method drop-down list.

Directory ID
Enter the directory ID linked to your Azure Active Directory application. You can retrieve it from the Azure Active Directory > Overview tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

Application ID
Enter the application ID linked to your Azure Active Directory application. You can retrieve it from the Azure Active Directory > Overview tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

Use certificate to authenticate
Select this check box to authenticate to your Azure Active Directory application using a certificate, then enter the location where the certificate is stored in the Path to certificate field. Make sure you upload the certificate in the Certificates & secrets > Certificates section of your Azure Active Directory application. For more information about certificates, see the official Microsoft documentation.
This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

Client key
Enter the client key linked to your Azure Active Directory application. You can generate the client key from the Certificates & secrets tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list and when you clear the Use certificate to authenticate check box.

Deployment Blob
Enter the location in which you want to store the current Job and its dependent libraries in your storage.
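Before running the Job with Azure Active Directory authentication, you can verify that the application actually reaches the storage. The following is a hedged sketch using the public ADLS Gen2 REST API, not a Talend Studio feature; <StorageAccount> and <Container> stand for the storage account name and the Container value above, and jq is assumed to be installed:

  # Request a token scoped to Azure Storage (a different scope than the Synapse token above).
  TOKEN=$(curl -s -X POST -H "Content-Type: application/x-www-form-urlencoded" \
    -d 'client_id=<YourClientID>&scope=https://storage.azure.com/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' \
    'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token' | jq -r '.access_token')
  # HEAD request on the container; an HTTP 200 confirms the application
  # (with the Storage Blob Data Contributor role) can access the ADLS Gen2 storage.
  curl -s -o /dev/null -w '%{http_code}\n' -I \
    -H "Authorization: Bearer $TOKEN" -H "x-ms-version: 2020-06-12" \
    "https://<StorageAccount>.dfs.core.windows.net/<Container>?resource=filesystem"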
- If you run your Spark Job on Windows, specify the location of the winutils.exe program:
- If you want to use your own winutils.exe file, select the Define the Hadoop home directory check box and enter its folder path.
- Otherwise, leave the Define the Hadoop home directory check box clear. Talend Studio will generate and use a directory automatically for this Job.
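For example, if you use your own winutils.exe file, Hadoop expects it in a bin subfolder of the Hadoop home directory you enter; the C:\hadoop path below is only an illustration:

  C:\hadoop          (folder path to enter in the Define the Hadoop home directory field)
      bin
          winutils.exe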
- Enter the basic configuration information:
Use local timezone
Select this check box to let Spark use the local time zone provided by the system.
Note:
- If you clear this check box, Spark uses the UTC time zone.
- Some components also have a Use local timezone for date check box. If you clear that check box on the component, it inherits the time zone from the Spark configuration.

Use dataset API in migrated components
Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
- If you select the check box, the components inside the Job run with DS, which improves performance.
- If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
This check box is selected by default, but if you import a Job created in 7.3 or earlier, the check box is cleared, as those Jobs run with RDD.
Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.

Use timestamp for dataset components
Select this check box to use java.sql.Timestamp for dates.
Note: If you leave this check box cleared, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.

Parallelize output files writing
Select this check box to enable the Spark Batch Job to run multiple threads in parallel when writing output files. This option improves execution time. When you leave this check box cleared, the output files are written sequentially in one thread.
At the subJob level, each subJob is treated sequentially; only the output file writing inside the subJob is parallelized.
This option is only available for Spark Batch Jobs containing the following output components:
- tAvroOutput
- tFileOutputDelimited (only when the Use dataset API in migrated components check box is selected)
- tFileOutputParquet
Important: To avoid memory problems during the execution of the Job, take into account the size of the files being written and the capacity of the execution environment before using this parameter.
- Enter the Synapse tuning properties by following the process explained in Tuning Spark for Apache Spark Batch Jobs.
Important: To avoid a 400 Bad Request error, make sure to define the tuning parameters.
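The values set in the tuning step are submitted to Azure Synapse with the Job, which is presumably why leaving them undefined can trigger the 400 Bad Request error. As an illustration only (the exact fields shown in the tuning view may differ, and the sizes must match your Apache Spark pool), the tuning step covers values equivalent to the following standard Spark properties:

  spark.driver.memory       4g
  spark.driver.cores        4
  spark.executor.memory     4g
  spark.executor.cores      4
  spark.executor.instances  2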
- In the Advanced properties table, add any Spark properties you want to use to override the defaults set by Talend Studio.
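For example, the following are standard Spark properties you might add to this table; the values shown are illustrative only:

  spark.sql.session.timeZone     UTC
  spark.sql.shuffle.partitions   200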
- Select the Use Atlas check box to trace data lineage, view Spark Job components, and track schema changes between components.
This option is only available for Spark Universal 3.3.x.
With this option activated, you need to set the following parameters:
- Atlas URL: Enter the address of your Atlas instance, such as http://name_of_your_atlas_node:port.
- In the Username and Password fields, enter the authentication information for access to Atlas.
- Set Atlas configuration folder: Select this check box if your Atlas cluster uses custom properties such as SSL or read timeout. In the field that appears, enter the path to a local directory containing your atlas-application.properties file; your Job will then use these custom properties (a sample file is shown after this list).
Ask the administrator of your cluster for this configuration file. For more information, see the Client Configs section in Atlas configuration.
- Die on error: Select this check box to stop Job execution if Atlas-related issues occur, such as connection errors. Clear it to let your Job continue running.
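The atlas-application.properties file mentioned above is a standard Atlas client configuration file. A minimal sketch, assuming default Atlas client settings (confirm the exact property names and values with your cluster administrator):

  # Address of the Atlas REST endpoint.
  atlas.rest.address=http://name_of_your_atlas_node:21000
  # Client timeouts, in milliseconds.
  atlas.client.connectTimeoutMSecs=60000
  atlas.client.readTimeoutMSecs=60000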