The Studio runs the Spark driver to orchestrate how the Job should be
performed and then sends this orchestration to the Yarn service of a given
Hadoop cluster so that the ResourceManager of this Yarn service
requests execution resources accordingly.
If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime
the configuration for this parameter in the Hadoop cluster to be used
is ignored):
-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used.
-
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears.
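These two addresses usually come from the yarn-site.xml file of the cluster. As an illustration only, they might be defined as follows; the host name below is a placeholder and the ports are the common Hadoop defaults, not values taken from your cluster:

```xml
<!-- yarn-site.xml (illustrative host name, default ports) -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>resourcemanager.example.com:8030</value>
</property>
```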
-
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server.
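The JobHistory location is usually defined in the mapred-site.xml file of the cluster. A sketch of such an entry follows; the host name is a placeholder and the port is the common default:

```xml
<!-- mapred-site.xml (illustrative host name, default port) -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory.example.com:10020</value>
</property>
```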
-
Select the Set staging
directory check box and enter the directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in configuration files
such as yarn-site.xml or mapred-site.xml of your distribution.
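For example, such a property entry might look like the following; the directory shown is the common Hadoop default, not necessarily the one configured in your cluster:

```xml
<!-- yarn-site.xml or mapred-site.xml, depending on the distribution -->
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/tmp/hadoop-yarn/staging</value>
</property>
```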
-
If you are accessing a Hadoop cluster running with Kerberos security,
select this check box, then enter the Kerberos principal names for the
ResourceManager service and the JobHistory service in the displayed fields. This
enables you to use your user name to authenticate against the credentials stored in
Kerberos. These principals can be found in the configuration files of your
distribution, such as in yarn-site.xml and in mapred-site.xml.
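As a sketch, the principal entries in those files often take the following form; the service instances and realm below are illustrative, and the _HOST placeholder is expanded by Hadoop to the local host name at runtime:

```xml
<!-- yarn-site.xml / mapred-site.xml (illustrative principals) -->
<property>
  <name>yarn.resourcemanager.principal</name>
  <value>rm/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>mapreduce.jobhistory.principal</name>
  <value>jhs/_HOST@EXAMPLE.COM</value>
</property>
```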
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored on the machine on which your Job actually
runs, for example, on a Talend
Jobserver.
Note that the user that executes a keytab-enabled Job is not necessarily
the one the principal designates, but this user must have the right to read the keytab file being
used. For example, if the user name you are using to execute a Job is user1 and the principal to be used is guest,
ensure that user1 has the right to read the keytab
file to be used.
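As a minimal sketch, the read-permission requirement can be verified before launching the Job. The keytab path below is a placeholder created for illustration, and the klist command mentioned in the comments is only available where Kerberos client tools are installed:

```shell
# Placeholder path standing in for a real keytab file.
KEYTAB=/tmp/example.keytab
touch "$KEYTAB" && chmod 400 "$KEYTAB"   # a keytab is typically owner-readable only

# The executing user (e.g. user1) must be able to read the file,
# even when the principal inside it belongs to another user (e.g. guest).
if [ -r "$KEYTAB" ]; then
  echo "keytab readable by $(id -un)"
else
  echo "keytab NOT readable; adjust ownership or permissions" >&2
fi

# With Kerberos client tools installed, you can list the principals
# a keytab contains:
#   klist -k "$KEYTAB"
```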
-
The User name field is available when you are not using
Kerberos to authenticate. In the User name field, enter the
login user name for your distribution. If you leave it empty, the user name of the machine
hosting the Studio will be used.
-
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine and thus find the Job and its driver.
Note that in this situation, you also need to add the name and the IP
address of this machine to its hosts file.
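For instance, the hosts file entry on the driver machine might look like the following; the address and host name are purely illustrative:

```
192.168.1.25    studio-host.example.com    studio-host
```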