
Using context variables with Cloudera

In this scenario, you want to choose which Cloudera on-premises runtime your Spark Jobs run on: 7.1.7 with Spark 3.2.x or 7.1.9 with Spark 3.3.x.

This approach is also relevant when you mix Cloudera on-premises (7.1.x) and cloud (7.2.x) distributions.

This capability is enabled by the Talend Studio context variable feature combined with the Spark Universal 3.3.x distribution mode (the latest available for Cloudera distributions).

Before you begin

  • Check in the Cloudera documentation whether your target distributions are compatible with Spark 2, Spark 3, or both at the same time.
  • From Cloudera Manager, download the client configuration for each Hadoop service used (HDFS, Hive, HBase, and so on), as illustrated below. For more information, see Downloading Client Configuration Files in the Cloudera documentation.
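For reference, a client configuration downloaded from Cloudera Manager is a zip archive of Hadoop configuration files. The exact contents depend on the service; the listing below is an illustrative example for an HDFS/YARN gateway, not an exhaustive inventory:

  yarn-conf/
    core-site.xml
    hdfs-site.xml
    yarn-site.xml
    mapred-site.xml

A Hive client configuration additionally contains hive-site.xml, and an HBase one contains hbase-site.xml.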

Creating a Metadata Connection to a Hadoop cluster

Procedure

  1. In Talend Studio, navigate to Metadata.
  2. Right-click Hadoop Cluster and select Create Hadoop Cluster.
  3. Enter a name for your cluster and click Next.
  4. Select your distribution, Universal in this example, and select the Spark mode, Yarn cluster in this example.
    Distribution choice.

Importing Hadoop configuration

Procedure

  1. Select Import configuration from local files and click Next.
  2. Specify the location of your client configurations and click Finish.
    Client configuration location.
  3. In the Update connection parameters tab, the default parameters are already filled in.
    However, if needed, you can either:
    • select Use a keytab to authenticate to connect to a kerberized Hadoop cluster with a keytab file,
    • select Use custom classpath to define which Cloudera classpath to use. In this case, point it to the Spark 2 or Spark 3 libraries (see the example after this procedure).
    Update connection parameters tab.
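
To illustrate the Use custom classpath option, the values below sketch what the Spark 2 and Spark 3 library locations could look like with default Cloudera parcel layouts. These paths are assumptions for illustration; verify the actual locations on your cluster nodes:

  # Spark 2 libraries (bundled with the CDH parcel; path is an assumption)
  /opt/cloudera/parcels/CDH/lib/spark/jars/*

  # Spark 3 libraries (installed as a separate parcel; path is an assumption)
  /opt/cloudera/parcels/SPARK3/lib/spark3/jars/*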

Contextualizing the Metadata connection

Context values let you use a single cluster metadata connection with different sets of parameters.

Procedure

  1. To contextualize the metadata connection created in the wizard, click Export as context.
  2. In the Create / Reuse context wizard that opens, select Create a new repository context and click Next.
  3. Type in a name for the context to be created, and add any general information if required.

    The name of the Metadata entry is proposed by the wizard as the context name, and the information you provide in the Description field will appear as a tooltip when you hover over the context in the Repository.

  4. Click Next to create and view the context.
  5. Click Manage environments to create as many environments as necessary and select a default one.

    In this example, click Create to add Spark 2 and Spark 3 environments (see the example values after this procedure).

    Environments creation.
  6. Click Finish.
  7. In your Spark Job, select the context environment you want to run your Job with.
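
To make the two environments concrete, the sketch below shows how the context values could differ between them. The variable name custom_classpath and all values are hypothetical; the wizard derives the actual variable names from your connection parameters:

  # Spark2 environment (hypothetical values)
  custom_classpath = /opt/cloudera/parcels/CDH/lib/spark/jars/*

  # Spark3 environment (hypothetical values)
  custom_classpath = /opt/cloudera/parcels/SPARK3/lib/spark3/jars/*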

Results

You are now able to run your Job with different Cloudera runtimes.
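
When you build the Job and run it outside Talend Studio, the generated launcher script lets you pick the context environment on the command line. A minimal sketch, assuming a Job named my_spark_job and the environments created above (the Job name and variable name are examples):

  # Run against the Spark 3 runtime
  ./my_spark_job_run.sh --context=Spark3

  # Run against the Spark 2 runtime, overriding one variable at run time
  ./my_spark_job_run.sh --context=Spark2 --context_param custom_classpath=/opt/cloudera/parcels/CDH/lib/spark/jars/*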
