
Defining EMR Serverless connection parameters with Spark Universal

About this task

Talend Studio connects to EMR Serverless to run the Job in this serverless environment.

Procedure

  1. Click the Run view beneath the design workspace, then click the Spark configuration view.
  2. Select Built-in from the Property type drop-down list.
    If you have already set up the connection parameters in the Repository as explained in Centralizing a Hadoop connection, you can easily reuse them. To do this, select Repository from the Property type drop-down list, then click the […] button to open the Repository Content dialog box and select the Hadoop connection to be used.
    Tip: Setting up the connection in the Repository allows you to avoid configuring that connection each time you need it in the Spark configuration view of your Jobs. The fields are automatically filled.
  3. Select Universal from the Distribution drop-down list, the Spark version from the Version drop-down list, and EMR Serverless from the Runtime mode/environment drop-down list.
  4. If you run your Spark Job on Windows, specify the location of the winutils.exe program:
    • If you want to use your own winutils.exe file, select the Define the Hadoop home directory check box and enter its folder path.
    • Otherwise, leave the Define the Hadoop home directory check box cleared. Talend Studio generates and uses a directory automatically for this Job.
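    For context, on Windows Hadoop locates winutils.exe through the HADOOP_HOME environment variable or the hadoop.home.dir system property, and the check box above corresponds to providing that path. The following minimal sketch shows the equivalent setting in a standalone Spark program; it is not the code Talend Studio generates, and the folder C:\hadoop (with winutils.exe under C:\hadoop\bin) is only an assumed example:

      // Minimal sketch: point Hadoop at a local winutils.exe before creating the Spark session.
      // Assumes winutils.exe is stored at C:\hadoop\bin\winutils.exe (placeholder path).
      import org.apache.spark.sql.SparkSession;

      public class WinutilsHomeSketch {
          public static void main(String[] args) {
              // Equivalent to selecting "Define the Hadoop home directory" in the Spark configuration view.
              System.setProperty("hadoop.home.dir", "C:\\hadoop");

              SparkSession spark = SparkSession.builder()
                      .appName("winutils-home-sketch")
                      .master("local[*]")
                      .getOrCreate();

              spark.range(10).show();
              spark.stop();
          }
      }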
  5. Enter the basic configuration information:
    Parameter Usage
    Use local timezone Select this check box to let Spark use the local time zone provided by the system.
    Note:
    • If you clear this check box, Spark uses the UTC time zone.
    • Some components also have the Use local timezone for date check box. If you clear that check box in the component, it inherits the time zone from the Spark configuration.
    Use dataset API in migrated components Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API (see the sketch after this table):
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.

    This check box is selected by default, but if you import a Job from version 7.3 or earlier, the check box is cleared as those Jobs run with RDD.

    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Use timestamp for dataset components Select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box cleared, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
    Parallelize output files writing Select this check box to enable the Spark Batch Job to run multiple threads in parallel when writing output files. This option improves execution time.

    When you leave this check box cleared, the output files are written sequentially in one thread.

    At the subJob level, each subJob is still handled sequentially; only the writing of the output files inside a subJob is parallelized.

    This option is only available for Spark Batch Jobs containing the following output components:
    • tAvroOutput
    • tFileOutputDelimited (only when the Use dataset API in migrated components check box is selected)
    • tFileOutputParquet
    Important: To avoid memory issues during the execution of the Job, take into account the size of the files being written and the capacity of the execution environment before using this parameter.
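    To make the effect of the Use dataset API in migrated components option more concrete, the sketch below contrasts the two Spark APIs in plain Java. It only illustrates the underlying Spark behaviour and is not the code Talend Studio generates:

      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.SparkSession;

      public class DatasetVersusRddSketch {
          public static void main(String[] args) {
              SparkSession spark = SparkSession.builder()
                      .appName("dataset-vs-rdd-sketch")
                      .master("local[*]")
                      .getOrCreate();

              // Dataset (DS) API: transformations are expressed declaratively, so Spark's
              // optimizer can plan and execute them more efficiently.
              Dataset<Long> ds = spark.range(1_000_000).filter("id % 2 = 0");
              System.out.println("Dataset count: " + ds.count());

              // RDD API: the same transformation as an opaque Java function, which Spark
              // cannot optimize in the same way; this matches the behaviour of Jobs
              // imported from version 7.3 or earlier.
              JavaRDD<Long> rdd = ds.toJavaRDD().filter(id -> id % 2 == 0);
              System.out.println("RDD count: " + rdd.count());

              spark.stop();
          }
      }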
  6. Enter the EMR Serverless configuration information:
    Parameter Usage
    AWS role ARN Specify the ARN of the IAM role that grants your Spark Jobs the necessary permissions to access AWS resources.
    AWS access key Provide the access key ID for authenticating your Spark Jobs with AWS services.
    AWS region Specify the geographic region where your Spark Jobs will run and where AWS resources will be accessed.
    AWS secret key Provide the secret access key for authenticating your Spark Jobs with AWS services.
    AWS session token Provide the temporary session token for authenticating your Spark Jobs with AWS services.
    Thread pool size for deployments tasks Set the maximum number of concurrent threads used for running deployment operations.
    AWS socket timeout in ms Set the maximum amount of time, in milliseconds, that your Spark Jobs will wait for a response from AWS services before timing out.
    AWS connection timeout in ms Set the maximum amount of time, in milliseconds, that your Spark Jobs will wait to establish a connection with AWS services before timing out.
    EMR application deployment timeout in ms Set the maximum amount of time, in milliseconds, that your Spark Jobs will wait for an EMR application to be deployed before timing out.
    S3 JAR upload timeout in ms Set the maximum amount of time, in milliseconds, that your Spark Jobs will wait for JAR files to upload to Amazon S3 before timing out.
    Deploy new application Select this check box to enable the automatic deployment of a new EMR Serverless application for your Spark Jobs, rather than using an existing application.
    Application ID Specify the unique identifier of the EMR Serverless application that will be used to run your Spark Jobs.
    AWS S3 bucket name Specify the name of the Amazon S3 bucket where your Spark Jobs will store and retrieve data.
    AWS S3 key Specify the object key (path and filename) in your Amazon S3 bucket where your Spark Jobs will store or retrieve files.
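    These parameters broadly map to the inputs of the EMR Serverless and Amazon S3 APIs. For reference only, the following sketch starts a job run directly with the AWS SDK for Java v2; it is not the code Talend Studio generates, it skips the application deployment and JAR upload steps, and every identifier (credentials, role ARN, application ID, bucket and key) is a placeholder:

      import java.time.Duration;
      import software.amazon.awssdk.auth.credentials.AwsSessionCredentials;
      import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
      import software.amazon.awssdk.http.apache.ApacheHttpClient;
      import software.amazon.awssdk.regions.Region;
      import software.amazon.awssdk.services.emrserverless.EmrServerlessClient;
      import software.amazon.awssdk.services.emrserverless.model.JobDriver;
      import software.amazon.awssdk.services.emrserverless.model.SparkSubmit;
      import software.amazon.awssdk.services.emrserverless.model.StartJobRunRequest;
      import software.amazon.awssdk.services.emrserverless.model.StartJobRunResponse;

      public class EmrServerlessJobRunSketch {
          public static void main(String[] args) {
              // Placeholder credentials; in Talend Studio these come from the AWS access key,
              // AWS secret key, and AWS session token fields.
              AwsSessionCredentials credentials = AwsSessionCredentials.create(
                      "ACCESS_KEY", "SECRET_KEY", "SESSION_TOKEN");

              EmrServerlessClient emr = EmrServerlessClient.builder()
                      .region(Region.of("us-east-1"))                        // AWS region
                      .credentialsProvider(StaticCredentialsProvider.create(credentials))
                      .httpClientBuilder(ApacheHttpClient.builder()
                              .socketTimeout(Duration.ofMillis(60_000))      // AWS socket timeout in ms
                              .connectionTimeout(Duration.ofMillis(10_000))) // AWS connection timeout in ms
                      .build();

              // Assumes the Job JAR was already uploaded to the S3 bucket and key.
              StartJobRunRequest request = StartJobRunRequest.builder()
                      .applicationId("00abc123def456")                                        // Application ID
                      .executionRoleArn("arn:aws:iam::123456789012:role/emr-serverless-role") // AWS role ARN
                      .jobDriver(JobDriver.builder()
                              .sparkSubmit(SparkSubmit.builder()
                                      .entryPoint("s3://my-bucket/jobs/my-job.jar")           // AWS S3 bucket name and key
                                      .build())
                              .build())
                      .build();

              StartJobRunResponse response = emr.startJobRun(request);
              System.out.println("Started job run: " + response.jobRunId());
          }
      }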
  7. In the Spark "scratch" directory field, enter the local path where Talend Studio stores temporary files, like JARs to transfer.
    If you run the Job on Windows, the default disk is C:. If you leave /tmp in this field, C:/tmp is used as the directory.
  8. To make your Job resilient to failure, select Activate checkpointing to enable Spark’s checkpointing operation.
    In the Checkpoint directory field, enter the cluster file system path where Spark saves context data, such as metadata and generated RDDs.
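    For reference, checkpointing in plain Spark relies on the same two pieces of information: a checkpoint directory and the data to checkpoint. A minimal sketch in Java, assuming a placeholder S3 path for the directory:

      import java.util.Arrays;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.apache.spark.sql.SparkSession;

      public class CheckpointSketch {
          public static void main(String[] args) {
              SparkSession spark = SparkSession.builder()
                      .appName("checkpoint-sketch")
                      .getOrCreate();
              JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

              // Equivalent of the Checkpoint directory field: a reliable file system path.
              jsc.setCheckpointDir("s3://my-bucket/spark-checkpoints"); // placeholder path

              JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
              numbers.checkpoint();                 // save the data so the Job can recover after a failure
              System.out.println(numbers.count());  // the action that triggers the checkpoint

              spark.stop();
          }
      }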
  9. In the Advanced properties table, add any Spark properties you want to use to override the default values set by Talend Studio.
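    As an illustration only, the overrides entered in this table correspond to standard Spark configuration properties, such as the ones set programmatically below; the property values are arbitrary examples, not recommendations:

      import org.apache.spark.sql.SparkSession;

      public class AdvancedPropertiesSketch {
          public static void main(String[] args) {
              // The same kind of overrides that could be entered in the Advanced properties table.
              SparkSession spark = SparkSession.builder()
                      .appName("advanced-properties-sketch")
                      .config("spark.executor.memory", "4g")
                      .config("spark.sql.shuffle.partitions", "200")
                      .getOrCreate();

              spark.range(5).show();
              spark.stop();
          }
      }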
  10. Select the Use Atlas check box to trace data lineage, view Spark Job components, and track schema changes between components.
    This option is only available for Spark Universal 3.3.x.

    With this option activated, you need to set the following parameters:

    • Atlas URL: Enter the address of your Atlas instance, such as http://name_of_your_atlas_node:port.

    • Username and Password: Enter the authentication information for access to Atlas.

    • Set Atlas configuration folder: Select this check box if your Atlas cluster uses custom properties like SSL or read timeout. In the field that appears, enter the path to a local directory containing your atlas-application.properties file. Your Job will then use these custom properties.

      Ask the administrator of your cluster for this configuration file. For more information, see the Client Configs section in Atlas configuration.

    • Die on error: Select this check box to stop Job execution if Atlas-related issues occur, such as connection errors. Clear it to let your Job continue running.

Results

The connection details are complete; you are ready to schedule executions of your Spark Job or to run it immediately.
