
Defining a connection to the Storage Zone

Information note

As the server connection settings for the Landing Zone are derived from the Storage Zone settings, you must define a Storage Zone before defining a Landing Zone.

For more information on adding data sources, see Defining Landing Zones.

To define the Storage Zone connection:

  1. Open your project and click the Manage button in the bottom left of the Databases panel.

    The Manage Databases window opens.

  2. Either click the Add New Database link in the middle of the window.

    -OR-

    Click the New toolbar button.

    The New Storage window opens. The settings displayed depend on the compute platform that you selected when you set up your project. The sections below detail the settings for each of the available compute platforms.

Information note

To use AVRO file format with Hive 3.x, you must set the following parameter:

metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
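This parameter is typically set in the Hive metastore configuration file (commonly hive-site.xml on the metastore host, though the exact file and location depend on your distribution; check your cluster documentation). A minimal property entry might look like this:

```xml
<!-- Illustrative hive-site.xml fragment; file location is an assumption -->
<property>
  <name>metastore.storage.schema.reader.impl</name>
  <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>
```

You will generally need to restart the metastore service for the change to take effect.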

Cloudera Compute Platform

Security

Use SSL - Select to connect using SSL.

  • Use self-signed certificate - Select to connect using a self-signed certificate.
  • Trusted store full path - Enter the full path to the store containing your trusted certificates.
  • Trusted store password - Enter the password for your trusted certificate store.

Authentication Type:

  • User name - Select to connect to the Hadoop cluster with only a user name. Then, in the User name field, specify the name of a user authorized to access the Hadoop cluster. 
  • User name and password - Select to connect to the Hadoop cluster with a user name and password. Then, in the User name and Password fields, specify the name and password of a user authorized to access the Hadoop cluster.

  • Knox - Select this option if you need to access the Hortonworks Hadoop distribution through a Knox Gateway. Then, provide the following information:
    • Host - The FQDN (Fully Qualified Domain Name) of the Knox Gateway host.
    • Knox port - The port number to use to access the host. The default is "8443".
    • Knox Gateway path - The context path for the gateway. The default is "gateway".

      Information note

      The port and path values are set in the gateway-site.xml file. If you are unsure whether the default values have been changed, contact your IT department.

    • Cluster name - The cluster name as configured in Knox. The default is "Default".

    • User name - Enter your user name for accessing the Knox gateway.
    • Password - Enter your password for accessing the Knox gateway.
  • Kerberos - Select to authenticate against the Hadoop cluster using Kerberos. Then, provide the following information:

    • Realm: The name of the realm in which your Hadoop cluster resides.

      For example, if the full principal name is john.doe@EXAMPLE.COM, then EXAMPLE.COM is the realm.

    • Principal: The user name to use for authentication. The principal must be a member of the realm entered above. For example, if the full principal name is john.doe@EXAMPLE.COM, then john.doe is the principal.
    • Keytab file: The full path of the Keytab file. The Keytab file should contain the key of the Principal specified above.

      Information note

      The krb5.ini file should be located in C:\Windows (according to the Java default). However, if Replicate is installed on the same machine as Compose, the file might be in C:\Program Files\MIT\Kerberos. In such a case, simply copy the file to C:\Windows.

    • Host: The FQDN that will be used to locate the correct Principal in Kerberos. This is only required if the IP address of the Hive machine is not known to Kerberos.
    • Service name: The default is "hive". You should only change this if you are sure that the service name is different.
    Information note

    If you encounter an issue with Kerberos authentication, do the following:

    1. Test the connection to the Hive machine with Kerberos.

    2. Check the Kerberos configuration on HDFS.

If you are unsure about any of the above, consult your IT administrator.
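The Realm and Principal fields described above are simply the two halves of the full Kerberos principal name, split at the "@" sign. As a quick illustration (the helper function below is ours, not part of Compose), the split can be sketched as:

```python
def split_principal(full_principal: str) -> tuple[str, str]:
    """Split a full Kerberos principal name into (principal, realm).

    For example, "john.doe@EXAMPLE.COM" yields the principal "john.doe"
    and the realm "EXAMPLE.COM".
    """
    principal, sep, realm = full_principal.partition("@")
    if not sep or not realm:
        raise ValueError(f"Not a full principal name: {full_principal!r}")
    return principal, realm

print(split_principal("john.doe@EXAMPLE.COM"))  # → ('john.doe', 'EXAMPLE.COM')
```

Note that the realm is conventionally written in upper case, and the principal must belong to the realm you enter in the Realm field.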

Hive Access

  • Use ZooKeeper - Select if your Hive machines are managed by Apache ZooKeeper.
  • ZooKeeper hosts - The machines that make up the ZooKeeper ensemble (cluster). These should be specified in the following format:

    host1:port1,host2:port2,host3:port3

  • ZooKeeper namespace - The namespace on ZooKeeper under which the HiveServer2 znodes are located.
  • Host - If you are not using ZooKeeper, specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Cloudera Data Platform Private Cloud or Hortonworks Data Platform target endpoint settings in the Replicate task.

  • Port - If you are not using ZooKeeper, optionally change the default port.
  • Database name - Specify the name of the Hive target database. This must be different from the database specified in the Landing Zone settings.

    Information note

    If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create databases.

  • JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.

    Example:

    KEY=VALUE;KEY1=VALUE1

    Information note
    • You can set Hive parameters in the JDBC parameters. For example:
      • mapred.job.queue.name=<queuename>
      • hive.execution.engine=<enginename>
    • To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter to change the query description, as follows:

      hive.query.name=my_description
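The two string formats above (the ZooKeeper host list and the JDBC parameter string) are easy to get subtly wrong. As an illustration only (the helper names and example values below are ours; Compose assembles the connection string internally), they can be built like this:

```python
def format_zookeeper_hosts(hosts: list[tuple[str, int]]) -> str:
    """Render a ZooKeeper ensemble as "host1:port1,host2:port2,host3:port3"."""
    return ",".join(f"{host}:{port}" for host, port in hosts)

def format_jdbc_parameters(params: dict[str, str]) -> str:
    """Render extra JDBC parameters as "KEY=VALUE;KEY1=VALUE1"."""
    return ";".join(f"{key}={value}" for key, value in params.items())

# 2181 is the conventional ZooKeeper client port; yours may differ.
print(format_zookeeper_hosts([("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]))
# → zk1:2181,zk2:2181,zk3:2181

print(format_jdbc_parameters({
    "mapred.job.queue.name": "etl",       # example Hive queue name
    "hive.query.name": "my_description",  # tag Compose sessions under Tez
}))
# → mapred.job.queue.name=etl;hive.query.name=my_description
```

Note that there is no trailing separator in either string, and no spaces around the "=", ":", "," or ";" characters.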

Amazon EMR Compute Platform

Security

  • Use SSL - Select to connect using SSL.

Authentication type:

  • User name - Select to connect to the Hadoop cluster with only a user name. Then, in the User name field, specify the name of a user authorized to access the Hadoop cluster. 
  • User name and password - Select to connect to the Hadoop cluster with a user name and password. Then, in the User name and Password fields, specify the name and password of a user authorized to access the Hadoop cluster.

If you are unsure about any of the above, consult your IT administrator.

Hive Access

  • Host - Specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Amazon EMR target endpoint settings in the Replicate task.

  • Port - Optionally change the default port.
  • Database name - Specify the name of the Hive target database. This must be different from the database specified in the Landing Zone settings.

    Information note

    If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create databases.

  • JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.

    Example:

    KEY=VALUE;KEY1=VALUE1

    Information note
    • You can set Hive parameters in the JDBC parameters. For example:
      • mapred.job.queue.name=<queuename>
      • hive.execution.engine=<enginename>
    • To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter to change the query description, as follows:
      hive.query.name=my_description
  • Hive metadata storage type - Select one of the following storage mediums for your Hive metadata:
    • Hive Metastore - This is the default metadata storage type.
    • AWS Glue Data Catalog - You can choose to store Hive metadata using the AWS Glue Data Catalog. AWS Glue allows you to store and share metadata in the AWS Cloud in the same way as in a Hive metastore.

      Information note

      When using AWS Glue Data Catalog for metadata storage, Compose control tables will be created with the data type STRING instead of VARCHAR (LENGTH).

Databricks Compute Platform

Information note

From Compose May 2022 SR 01, if you use Replicate November 2022 to land data in Databricks, only the Databricks (Cloud Storage) target endpoint can be used. If you are using an earlier supported version of Replicate, you can continue using the existing Databricks target endpoints.

Security

Information note

All connections to Databricks use SSL.

  • Authentication type - "Databricks Delta". This cannot be changed.
  • User name - The default user name is "token", which requires you to specify your personal access token in the Password field. Although it's strongly recommended to use a token, you can also access your Databricks account using a standard user name and password.
  • Password - If you are using a personal access token, this will be the token value. Otherwise, specify the password for accessing your Databricks account.

  • HTTP Path - The HTTP path to your Databricks compute resources.

    Example:

    sql/protocolv1/o/8388045294310983/0123-xxxxxx-xxxxxxx

If you are unsure about any of the above, consult your IT administrator.

Hive Access

  • Host - Specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Databricks target endpoint settings in the Replicate task.

  • Port - Optionally change the default port.
  • Catalog - If you want the storage tables to be created in Unity Catalog, specify the catalog name.

    Information note

    If the Replicate task is Full Load and Store Changes, you can choose any name for the storage catalog. However, both the catalog name defined in the Replicate Databricks (Cloud Storage) target endpoint and the catalog name defined in the landing settings must be hive_metastore.
  • Database name - Select the Hive target database. This must be different from the database specified in the Landing Zone settings. If you specified a catalog (above), only databases in the catalog will be available for selection.

    Information note
    • If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create databases.
    • To prevent table name conflicts when using Databricks, the Landing Zone and Storage Zone databases should be different.
  • JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.

    Example:

    KEY=VALUE;KEY1=VALUE1

    Information note
    • You can set Hive parameters in the JDBC parameters. For example:
      • mapred.job.queue.name=<queuename>
      • hive.execution.engine=<enginename>
    • To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter to change the query description, as follows:
      hive.query.name=my_description

HDInsight Compute Platform

Security

Information note

All connections to HDInsight use SSL.

  • User name - Specify the name of a user authorized to access the Hadoop cluster. 
  • Password - Specify the password of the user specified in the User name field.

Hive Access

  • Host - Specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Microsoft Azure HDInsight target endpoint settings in the Replicate task.

  • Port - Optionally change the default port.
  • Database name - Specify the name of the Hive target database. This must be different from the database specified in the Landing Zone settings.

    Information note

    If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create databases.

  • JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.

    Example:

    KEY=VALUE;KEY1=VALUE1

    Information note
    • You can set Hive parameters in the JDBC parameters. For example:
      • mapred.job.queue.name=<queuename>
      • hive.execution.engine=<enginename>
    • To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter to change the query description, as follows:

      hive.query.name=my_description

Dataproc Compute Platform

Security

  • Use SSL - Select to connect using SSL.

Authentication type:

  • User name - Select to connect to the Hadoop cluster with only a user name. Then, in the User name field, specify the name of a user authorized to access the Hadoop cluster. 
  • User name and password - Select to connect to the Hadoop cluster with a user name and password. Then, in the User name and Password fields, specify the name and password of a user authorized to access the Hadoop cluster.

If you are unsure about any of the above, consult your IT administrator.

Hive Access

  • Host - Specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Google Dataproc target endpoint settings in the Replicate task.

  • Port - Optionally change the default port.
  • Database name - Specify the name of the Hive target database. This must be different from the database specified in the Landing Zone settings.

    Information note

    If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create databases.

  • JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.

    Example:

    KEY=VALUE;KEY1=VALUE1

    Information note
    • You can set Hive parameters in the JDBC parameters. For example:
      • mapred.job.queue.name=<queuename>
      • hive.execution.engine=<enginename>
    • To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter to change the query description, as follows:
      hive.query.name=my_description

After defining your Storage Zone connection parameters:

  1. Click Test Connection to verify that Compose is able to establish a connection with the specified database.
  2. Click OK to save your settings. The database is added to the list on the left side of the Manage Databases window.
