Defining Azure Synapse Analytics connection parameters with Spark Universal
Complete the Azure Synapse Analytics connection configuration with Spark Universal in the Spark configuration tab of the Run view of your Spark Batch Job. This configuration is effective on a per-Job basis.
Procedure
- Enter the basic configuration information to connect to Azure Synapse:
Endpoint
Enter the Development endpoint from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.

Authorization token
Enter the token generated for your Azure Synapse account.
Note: To generate a token, enter the following command:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'client_id=<YourClientID>&scope=https://dev.azuresynapse.net/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' 'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token'
You can retrieve your Client ID, Client Secret, and Tenant ID from your Azure portal.
The authentication to Azure Synapse is performed using an Azure Active Directory application. For more information on how to register an application with Azure Active Directory, see Use the portal to create an Azure AD application and service principal that can access resources in the official Microsoft documentation.
Important: The token is valid for one hour. After that, regenerate it to avoid a 401 Not authorized error (a scripted example follows this table).

Apache Spark pools
Enter, in double quotation marks, the name of the Apache Spark pool to be used.
Note: On the Azure Synapse workspace side, make sure that:
- the Autoscale option in Basic settings and the Automatic pausing option in Additional settings are enabled when creating an Apache Spark pool
- the selected Apache Spark version is set to 3.0 (preview)

Poll interval when retrieving Job status (in ms)
Enter, without quotation marks, the time interval (in milliseconds) at the end of which you want Talend Studio to ask Spark for the status of your Job. The default value is 3000, that is, 3 seconds.

Maximum number of consecutive statuses missing
Enter the maximum number of times Talend Studio should retry to get a status when there is no status response. The default value is 10.
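Because the token expires after one hour, it can help to script its generation. The following is a minimal sketch, not part of Talend Studio; it assumes a Unix-like shell and the jq JSON parser, and the <YourClientID>, <YourClientSecret>, and <YourTenantID> placeholders come from your Azure portal:

  # Request a fresh token for the https://dev.azuresynapse.net scope (valid for one hour).
  RESPONSE=$(curl -s -X POST -H "Content-Type: application/x-www-form-urlencoded" \
    -d 'client_id=<YourClientID>&scope=https://dev.azuresynapse.net/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' \
    'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token')
  # Print the bearer token to paste into the Authorization token field.
  echo "$RESPONSE" | jq -r '.access_token'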
- Enter the basic storage information of Azure Synapse:
Authentication method
Select the authentication mode to be used from the drop-down list:
- Secret Key
- Azure Active Directory

Storage
Select the storage to be used from the drop-down list. ADLS Gen2 is the default storage for an Azure Synapse Analytics workspace. If you are using Azure Active Directory authentication, make sure the application is linked to the ADLS Gen2 storage with the Storage Blob Data Contributor role granted (a quick way to check this access follows this table).

Hostname
Enter the Primary ADLS Gen2 account from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.

Container
Enter the Primary ADLS Gen2 file system from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.

Username
Enter the storage account name linked to your Azure Synapse workspace. This property is only available when you select Secret Key from the Authentication method drop-down list.

Password
Enter the access key linked to your Azure Synapse workspace. For more information about how to retrieve your access keys, see View account access keys in the official Microsoft documentation.
This property is only available when you select Secret Key from the Authentication method drop-down list.

Directory ID
Enter the directory ID linked to your Azure Active Directory application. You can retrieve it from the Azure Active Directory > Overview tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

Application ID
Enter the application ID linked to your Azure Active Directory application. You can retrieve it from the Azure Active Directory > Overview tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

Use certificate to authenticate
Select this check box to authenticate to your Azure Active Directory application using a certificate, then enter the location where the certificate is stored in the Path to certificate field. Make sure you upload the certificate in the Certificates & secrets > Certificates section of your Azure Active Directory application. For more information about certificates, see the official Microsoft documentation.
This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

Client key
Enter the client key linked to your Azure Active Directory application. You can generate the client key from the Certificates & secrets tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list and when you clear the Use certificate to authenticate check box.

Deployment Blob
Enter the location in which you want to store the current Job and its dependent libraries in your storage.
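Before running the Job with Azure Active Directory authentication, you can verify that the application actually reaches the storage. The following is a hedged sketch using the public ADLS Gen2 REST API, not a Talend Studio feature; <StorageAccount> and <Container> stand for the storage account name and the Container value above, and jq is assumed to be installed:

  # Request a token scoped to Azure Storage (a different scope than the Synapse token above).
  TOKEN=$(curl -s -X POST -H "Content-Type: application/x-www-form-urlencoded" \
    -d 'client_id=<YourClientID>&scope=https://storage.azure.com/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' \
    'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token' | jq -r '.access_token')
  # HEAD request on the container; an HTTP 200 confirms the application
  # (with the Storage Blob Data Contributor role) can access the ADLS Gen2 storage.
  curl -s -o /dev/null -w '%{http_code}\n' -I \
    -H "Authorization: Bearer $TOKEN" -H "x-ms-version: 2020-06-12" \
    "https://<StorageAccount>.dfs.core.windows.net/<Container>?resource=filesystem"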
- If you run your Spark Job on Windows, specify the location of the winutils.exe program:
- If you want to use your own winutils.exe file, select the Define the Hadoop home directory check box and enter its folder path.
- Otherwise, leave the Define the Hadoop home directory check box clear. Talend Studio will generate and use a directory automatically for this Job.
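For example, if you use your own winutils.exe file, Hadoop expects it in a bin subfolder of the Hadoop home directory you enter; the C:\hadoop path below is only an illustration:

  C:\hadoop          (folder path to enter in the Define the Hadoop home directory field)
      bin
          winutils.exe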
- Enter the basic configuration information:
Use local timezone
Select this check box to let Spark use the local time zone provided by the system.
Note:
- If you clear this check box, Spark uses the UTC time zone.
- Some components also have a Use local timezone for date check box. If you clear that check box on the component, it inherits the time zone from the Spark configuration.

Use dataset API in migrated components
Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
- If you select the check box, the components inside the Job run with DS, which improves performance.
- If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
This check box is selected by default, but if you import a Job created in 7.3 or earlier, the check box is cleared, as those Jobs run with RDD.
Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.

Use timestamp for dataset components
Select this check box to use java.sql.Timestamp for dates.
Note: If you leave this check box cleared, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.

Parallelize output files writing
Select this check box to enable the Spark Batch Job to run multiple threads in parallel when writing output files. This option improves execution time. When you leave this check box cleared, the output files are written sequentially in one thread.
At the subJob level, each subJob is treated sequentially; only the output file writing inside the subJob is parallelized.
This option is only available for Spark Batch Jobs containing the following output components:
- tAvroOutput
- tFileOutputDelimited (only when the Use dataset API in migrated components check box is selected)
- tFileOutputParquet
Important: To avoid memory problems during the execution of the Job, take into account the size of the files being written and the capacity of the execution environment before using this parameter.
- Enter the Synapse tuning properties by following the process explained in Tuning Spark for Apache Spark Batch Jobs.
Important: To avoid a 400 Bad Request error, make sure to define the tuning parameters.
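The values set in the tuning step are submitted to Azure Synapse with the Job, which is presumably why leaving them undefined can trigger the 400 Bad Request error. As an illustration only (the exact fields shown in the tuning view may differ, and the sizes must match your Apache Spark pool), the tuning step covers values equivalent to the following standard Spark properties:

  spark.driver.memory       4g
  spark.driver.cores        4
  spark.executor.memory     4g
  spark.executor.cores      4
  spark.executor.instances  2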
- In the Advanced properties table, add any Spark properties you want to use to override the defaults set by Talend Studio.
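For example, the following are standard Spark properties you might add to this table; the values shown are illustrative only:

  spark.sql.session.timeZone     UTC
  spark.sql.shuffle.partitions   200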
- Select the Use Atlas check box to trace data lineage, view Spark Job components, and track schema changes between components.
This option is only available for Spark Universal 3.3.x.
With this option activated, you need to set the following parameters:
- Atlas URL: Enter the address of your Atlas instance, such as http://name_of_your_atlas_node:port.
- In the Username and Password fields, enter the authentication information for access to Atlas.
- Set Atlas configuration folder: Select this check box if your Atlas cluster uses custom properties such as SSL or read timeout. In the field that appears, enter the path to a local directory containing your atlas-application.properties file; your Job will then use these custom properties (a sample file is shown after this list).
Ask the administrator of your cluster for this configuration file. For more information, see the Client Configs section in Atlas configuration.
- Die on error: Select this check box to stop Job execution if Atlas-related issues occur, such as connection errors. Clear it to let your Job continue running.
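The atlas-application.properties file mentioned above is a standard Atlas client configuration file. A minimal sketch, assuming default Atlas client settings (confirm the exact property names and values with your cluster administrator):

  # Address of the Atlas REST endpoint.
  atlas.rest.address=http://name_of_your_atlas_node:21000
  # Client timeouts, in milliseconds.
  atlas.client.connectTimeoutMSecs=60000
  atlas.client.readTimeoutMSecs=60000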