Defining Azure Synapse Analytics connection parameters with Spark Universal
Complete the Azure Synapse Analytics connection configuration with Spark Universal in the Spark configuration tab of the Run view of your Spark Batch Job. This configuration is effective on a per-Job basis.
Before you begin
Procedure
-
Enter the basic configuration information to connect to Azure Synapse:
Parameter: Usage
Endpoint: Enter the Development endpoint from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
Authorization token: Enter the token generated for your Azure Synapse account.
Note: To generate a token, run the following command:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'client_id=<YourClientID>&scope=https://dev.azuresynapse.net/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' 'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token'
You can retrieve your Client ID, Client Secret, and Tenant ID from your Azure portal.
Authentication to Azure Synapse is performed through an Azure Active Directory application. For more information on how to register an application in Azure Active Directory, see Use the portal to create an Azure AD application and service principal that can access resources in the official Microsoft documentation.
Important: The token is only valid for one hour. After that, you must generate a new one; otherwise, you could get an error (401 - Not authorized).
Apache Spark pools: Enter, in double quotation marks, the name of the Apache Spark pool to be used.
Note: On the Azure Synapse workspace side, make sure that:
- the Autoscale option in Basic settings and the Automatic pausing option in Additional settings are enabled when creating an Apache Spark pool
- the selected Apache Spark version is set to 3.0 (preview)
Poll interval when retrieving Job status (in ms): Enter, without quotation marks, the time interval (in milliseconds) at which Talend Studio asks Spark for the status of your Job. The default value is 3000, meaning 3 seconds.
Maximum number of consecutive statuses missing: Enter the maximum number of times Talend Studio retries to get a status when there is no status response. The default value is 10.
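The token request and the one-hour validity described above can be sketched in Python. This is a minimal illustration rather than anything Talend Studio runs: the function names, the `margin_seconds` safety margin, and the placeholder credentials are assumptions introduced here. The URL, scope, and form fields are taken from the curl command in the note.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_LIFETIME_SECONDS = 3600  # the token is only valid for one hour


def build_token_request(tenant_id, client_id, client_secret):
    """Build (without sending) the POST request issued by the curl command."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "client_id": client_id,
        "scope": "https://dev.azuresynapse.net/.default",
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    }
    return Request(
        url,
        data=urlencode(form).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def token_needs_refresh(issued_at, now=None, margin_seconds=300):
    """Return True once the token is within margin_seconds of its expiry.

    margin_seconds is a hypothetical safety margin, not a Talend setting.
    """
    now = time.time() if now is None else now
    return now - issued_at >= TOKEN_LIFETIME_SECONDS - margin_seconds


# Placeholder values only; use the IDs and secret from your Azure portal.
request = build_token_request("my-tenant", "my-client", "my-secret")
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen` and reading the `access_token` field of the JSON response is left out so that the sketch runs without network access or real credentials. Checking `token_needs_refresh` before launching a Job is one way to avoid the 401 error mentioned above.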
-
Enter the basic storage information of Azure Synapse:
Parameter: Usage
Authentication method: Select the authentication mode to be used from the drop-down list:
- Secret Key
- Azure Active Directory
Storage: Select the storage to be used from the drop-down list. ADLS Gen2 is the default storage for an Azure Synapse Analytics workspace. If you are using Azure Active Directory authentication, make sure the application is linked to ADLS Gen2 and has been granted the Storage Blob Data Contributor role.
Hostname: Enter the Primary ADLS Gen2 account from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
Container: Enter the Primary ADLS Gen2 file storage from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
Username: Enter the storage account name linked to your Azure Synapse workspace. This property is only available when you select Secret Key from the Authentication method drop-down list.
Password: Enter the access key linked to your Azure Synapse workspace. For more information about how to retrieve your access keys, see View account access keys in the official Microsoft documentation. This property is only available when you select Secret Key from the Authentication method drop-down list.
Directory ID: Enter the directory ID linked to your Azure Active Directory application. You can retrieve your ID from the Azure Active Directory > Overview tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list.
Application ID: Enter the application ID linked to your Azure Active Directory application. You can retrieve your ID from the Azure Active Directory > Overview tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list.
Use certificate to authenticate: Select this check box to authenticate to your Azure Active Directory application using a certificate, and then enter the location in which the certificate is stored in the Path to certificate field. Make sure you upload the certificate in the Certificates & secrets > Certificates section of your Azure Active Directory application. For more information about certificates, see the official Microsoft documentation. This property is only available when you select Azure Active Directory from the Authentication method drop-down list.
Client key: Enter the client key linked to your Azure Active Directory application. You can generate the client key from the Certificates & secrets tab of your Azure portal. This property is only available when you select Azure Active Directory from the Authentication method drop-down list and clear the Use certificate to authenticate check box.
Deployment Blob: Enter the location in which you want to store the current Job and its dependent libraries in your storage.
-
Enter the basic configuration information:
Parameter: Usage
Use local timezone: Select this check box to let Spark use the local time zone provided by the system.
Note:
- If you clear this check box, Spark uses the UTC time zone.
- Some components also have the Use local timezone for date check box. If you clear the check box in the component, it inherits the time zone from the Spark configuration.
Use dataset API in migrated components: Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
- If you select the check box, the components inside the Spark Batch Job run with DS, which improves performance.
- If you clear the check box, the components inside the Spark Batch Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
This check box is selected by default, but if you import a Job from version 7.3 or earlier, the check box is cleared because those Jobs run with RDD.
Important: If your Spark Batch Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
Use timestamp for dataset components: Select this check box to use java.sql.Timestamp for dates.
Note: If you leave this check box cleared, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
Parallelize output files writing: Select this check box to enable the Spark Batch Job to run multiple threads in parallel when writing output files. This option improves execution performance. When you leave this check box cleared, the output files are written sequentially within one thread.
At the subJob level, subJobs are processed sequentially. Only the writing of output files inside a subJob is parallelized.
This option is only available for Spark Batch Jobs containing the following output components:
- tAvroOutput
- tFileOutputDelimited (only when the Use dataset API in migrated components check box is selected)
- tFileOutputParquet
Important: To avoid memory problems during the execution of the Job, take into account the size of the files being written and the capacity of the execution environment before using this parameter.
Batch size (ms): Enter the time interval at the end of which the Spark Streaming Job reviews the source data to identify changes and processes the new micro batches.
Define a streaming timeout (ms): Select this check box and, in the field that is displayed, enter the time frame at the end of which the Spark Streaming Job automatically stops running.
Note: If you are using Windows 10, it is recommended to set a reasonable timeout to prevent Windows Service Wrapper from having issues when sending termination signals from Java applications. If you encounter such an issue, you can also manually cancel the Job from your Azure Synapse workspace.
-
Select the Set tuning properties check box to define the tuning parameters, by following the process explained in Tuning Spark for Apache Spark Batch Jobs.
Important: You must define the tuning parameters; otherwise, you could get an error (400 - Bad request).
- In the Spark "scratch" directory field, enter the directory on the local system in which Talend Studio stores temporary files, such as the JAR files to be transferred. If you launch the Job on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
- Select the Wait for the Job to complete check box to make Talend Studio or, if you use Talend JobServer, your Job JVM keep monitoring the Job until the execution of the Job is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to true. While it is generally useful to select this check box when running a Spark Batch Job, it makes more sense to keep this check box cleared when running a Spark Streaming Job.
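The monitoring enabled by Wait for the Job to complete depends on the poll interval and on the maximum number of consecutive missing statuses set in the first step. The following Python sketch shows one plausible reading of how those two values interact. It is not Talend Studio's implementation: `wait_for_job`, the `get_status` callable, the terminal state names, and the choice to stop as soon as the number of consecutive misses reaches the limit are all assumptions made for illustration.

```python
import time


def wait_for_job(get_status, poll_interval_ms=3000, max_missing=10, sleep=time.sleep):
    """Poll get_status() until a terminal state or too many consecutive misses.

    get_status is a hypothetical callable that returns a status string, or
    None when no status response was received.
    """
    terminal = {"success", "dead", "killed", "error"}  # hypothetical state names
    missing = 0
    while True:
        status = get_status()
        if status is None:
            missing += 1
            if missing >= max_missing:
                raise TimeoutError(f"no status after {missing} consecutive attempts")
        else:
            missing = 0  # any answer resets the consecutive-miss counter
            if status in terminal:
                return status
        sleep(poll_interval_ms / 1000.0)  # the interval is given in milliseconds


# Simulated status responses; a real caller would query the Spark pool instead.
answers = iter([None, "running", None, None, "success"])
print(wait_for_job(lambda: next(answers), max_missing=3, sleep=lambda s: None))
```

With the defaults from the first step (3000 and 10), a loop like this one would give up after ten unanswered polls in a row, while a single answered poll resets the count.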
Results