Creating a Spark Job
You can start either from the Job Designs node of the Repository tree view in the Integration perspective or from the Big Data Batch node under the Job Designs node.
The two approaches are similar; the following procedure shows how to create a Spark Job from the Job Designs node.
Procedure
1. In the Repository tree view, right-click the Job Designs node and select Create Big Data Batch Job from the contextual menu.
2. In the New Big Data Batch Job wizard that is displayed, select Spark from the Framework drop-down list.
3. Enter a name for the Job and, optionally, other useful information, then click Finish.
Results
In the Repository tree view, the newly created Spark Job appears automatically under the Big Data Batch node, under Job Designs.
Then place the components you need from the Palette onto the workspace, and link and configure them to design your Spark Job, as you would for a standard Job. You also need to set up the connection to the Spark cluster to be used, in the Spark configuration tab of the Run view.
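For reference only, the following is a minimal sketch of plain Spark code in Java, assuming a YARN cluster and placeholder values; it shows the kind of settings the fields of the Spark configuration tab correspond to, not the code the Studio actually generates.

```java
import org.apache.spark.sql.SparkSession;

public class SparkBatchSketch {
    public static void main(String[] args) {
        // The Studio builds and submits an equivalent Spark application for you;
        // the cluster details entered in the Spark configuration tab play the
        // same role as the settings below (all values are placeholders).
        SparkSession spark = SparkSession.builder()
                .appName("my_spark_batch_job")          // Job name
                .master("yarn")                          // or "local[*]" for Local mode
                .config("spark.executor.memory", "2g")   // tuning property, as in the tab
                .getOrCreate();

        // A trivial read, standing in for the components placed on the workspace
        spark.read().textFile("hdfs:///user/talend/input.txt").show(10);

        spark.stop();
    }
}
```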
You can repeat the same operations to create a Spark Streaming Job. The only difference is that you select Create Big Data Streaming Job from the contextual menu after right-clicking the Job Designs node, and then select Spark Streaming from the Framework drop-down list in the New Big Data Streaming Job wizard that is displayed.
Supported distributions include, for example:
- Amazon EMR 6.2.0
- Cloudera CDH 6.1.1 and other 6.x versions compatible through dynamic distributions
- Cloudera CDP 7.1.1 and other 7.1.x versions compatible through dynamic distributions
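As a point of reference, here is a minimal plain Spark Streaming sketch in Java (not Studio-generated code), assuming a socket source on localhost:9999; it illustrates the micro-batch model on which a Spark Streaming Job is based.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkStreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("my_spark_streaming_job")
                .setMaster("local[2]");   // at least 2 threads: one receiver, one for processing

        // Micro-batch interval, comparable in spirit to the batch size of a Streaming Job
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Read lines from a socket source and print each micro-batch
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        lines.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```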
Note that if you need to run your Spark Job in any mode other than Local, a storage component, typically a tHDFSConfiguration component, is required in the same Job so that Spark can use it to connect to the file system to which the JAR files the Job depends on are transferred.
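To illustrate why such a component is needed, here is a rough plain-Spark equivalent in Java: when a Job runs on a cluster (YARN in this sketch), Spark must know which file system stages the dependency JARs. The property names below are standard Spark and Hadoop ones, but the URIs and paths are placeholders, and the exact properties a tHDFSConfiguration component sets may differ.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class ClusterFileSystemSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("my_spark_batch_job")
                .setMaster("yarn")
                // fs.defaultFS is the standard Hadoop property for the default file system;
                // the "spark.hadoop." prefix forwards it to the Hadoop configuration.
                .set("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
                // spark.yarn.jars tells YARN where the Spark/Job JAR dependencies are staged.
                .set("spark.yarn.jars", "hdfs://namenode.example.com:8020/user/talend/jars/*");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... Job logic ...
        spark.stop();
    }
}
```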
You can also create these types of Jobs by writing their Job scripts in the Jobscript view and then generating the Jobs accordingly. For more information about using Job scripts, see the Job scripts reference guide.