
Defining Amazon EMR connection parameters with Spark Universal

When you run your Spark Jobs on a YARN cluster using the Amazon EMR distribution, you need to distribute the libraries manually, because Amazon EMR does not use the same classpath on the main node and the subordinate nodes.

About this task

Complete the following actions from a command prompt to distribute the libraries between the main and subordinate nodes.

Procedure

  1. Upload the PEM file to the cluster:
    
    scp -i username_EC2.pem username_EC2.pem hadoop@<mainNode>:/home/hadoop
  2. Confirm that the PEM file has the correct permissions:
    ssh -i username_EC2.pem hadoop@<mainNode>
    ls -al
    The correct permissions must be as follows:
     -r--------  1 username username  1674 Apr 11 16:26  username_EC2.pem
  3. Optional: If the PEM file does not have the correct permissions, change the permissions as follows:
    
    chmod a-rwx username_EC2.pem
    chmod u+r username_EC2.pem
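    Alternatively, a single numeric mode sets the same owner-read-only permissions:
    chmod 400 username_EC2.pem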
  4. Go to your Amazon EMR instance, and find the hostnames of the subordinate nodes, for example as shown below.
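    As a minimal sketch, assuming the YARN ResourceManager runs on the main node, you can list the cluster node hostnames with the YARN CLI:
    yarn node -list -all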
  5. Copy the JAR files from the main node to the subordinate nodes:
    scp -i /home/hadoop/username_EC2.pem /usr/lib/spark/jars/*.jar hadoop@<subordinateNode>:/home/hadoop
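    If your cluster has several subordinate nodes, a small shell loop can repeat the copy; the node names below are placeholders for your own hostnames:
    for node in <subordinateNode1> <subordinateNode2>; do
      scp -i /home/hadoop/username_EC2.pem /usr/lib/spark/jars/*.jar hadoop@$node:/home/hadoop
    done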
  6. Connect to each subordinate node from the main node:
    ssh -i /home/hadoop/username_EC2.pem hadoop@<subordinateNode>
  7. On each subordinate node, move the JAR files into the Spark jars directory:
    sudo mv /home/hadoop/*.jar /usr/lib/spark/jars
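    Steps 6 and 7 can also be combined into a single remote command run from the main node; the -t option allocates a terminal in case sudo prompts for one:
    ssh -t -i /home/hadoop/username_EC2.pem hadoop@<subordinateNode> 'sudo mv /home/hadoop/*.jar /usr/lib/spark/jars'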
  8. Open Talend Studio and then open your Spark Job.
  9. Click the Run view beneath the design workspace, then click the Spark configuration tab.
  10. In the Advanced properties table, add the "spark.hadoop.dfs.client.use.datanode.hostname" property with the "true" value.
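    For reference, the same property can be passed to a plain spark-submit command outside Talend Studio; the example application path below is an assumption and may differ by EMR release:
    spark-submit --master yarn --conf spark.hadoop.dfs.client.use.datanode.hostname=true /usr/lib/spark/examples/src/main/python/pi.py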

Results

Your Spark Job is correctly configured to run in YARN cluster mode with Amazon EMR distribution.
