Defining a connection to the Storage Zone
As the server connection settings for the Landing Zone are derived from the Storage Zone settings, you must define a Storage Zone before defining a Landing Zone.
For more information on adding data sources, see Defining Landing Zones.
To define the Storage Zone connection:
- Open your project and click the Manage button in the bottom left of the Databases panel.
The Manage Databases window opens.
- Either click the Add New Database link in the middle of the window.
-OR-
Click the New toolbar button.
The New Storage window opens. The settings displayed depend on the compute platform you selected when you set up your project. The sections below detail the settings for each of the available compute platforms.
To use the AVRO file format with Hive 3.x, you must set the following parameter:
metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
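This parameter is typically set in hive-site.xml on the Hive Metastore host (for example, through the Cloudera Manager or Ambari configuration pages), and the Metastore service usually needs to be restarted for the change to take effect. A minimal sketch of the corresponding hive-site.xml entry (the file location and editing method depend on your distribution):
<property>
  <name>metastore.storage.schema.reader.impl</name>
  <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>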
Cloudera Compute Platform
Security
- Use SSL - Select to connect using SSL.
- Use self-signed certificate - Select to connect using a self-signed certificate.
- Trusted store full path - Enter the full path to the store containing your trusted certificates.
- Trusted store password - Enter the password for your trusted certificate store.
Authentication type (illustrative connection strings for each option are shown after this list):
- User name - Select to connect to the Hadoop cluster with only a user name. Then, in the User name field, specify the name of a user authorized to access the Hadoop cluster.
- User name and password - Select to connect to the Hadoop cluster with a user name and password. Then, in the User name and Password fields, specify the name and password of a user authorized to access the Hadoop cluster.
- Knox - Select this option if you need to access the Hortonworks Hadoop distribution through a Knox Gateway. Then, provide the following information:
- Host - The FQDN (Fully Qualified Domain Name) of the Knox Gateway host.
- Knox port - The port number to use to access the host. The default is "8443".
- Knox Gateway path - The context path for the gateway. The default is "gateway".
Information note: The port and path values are set in the gateway-site.xml file. If you are unsure whether the default values have been changed, contact your IT department.
- Cluster name - The cluster name as configured in Knox. The default is "Default".
- User name - Enter your user name for accessing the Knox gateway.
- Password - Enter your password for accessing the Knox gateway.
- Kerberos - Select to authenticate against the Hadoop cluster using Kerberos. Then, provide the following information:
- Realm: The name of the realm in which your Hadoop cluster resides.
For example, if the full principal name is john.doe@EXAMPLE.COM, then EXAMPLE.COM is the realm.
- Principal: The user name to use for authentication. The principal must be a member of the realm entered above. For example, if the full principal name is john.doe@EXAMPLE.COM, then john.doe is the principal.
- Keytab file: The full path of the Keytab file. The Keytab file should contain the key of the Principal specified above.
Information note: The krb5.ini file should be located in C:\Windows (according to the Java default). However, if Replicate is installed on the same machine as Compose, the file might be in C:\Program Files\MIT\Kerberos. In such a case, simply copy the file to C:\Windows.
- Host: The FQDN that will be used to locate the correct Principal in Kerberos. This is only required if the IP address of the Hive machine is not known to Kerberos.
- Service name: The default is "hive". You should only change this if you are sure that the service name is different.
Information note: If you encounter an issue with Kerberos authentication, do the following:
1. Test the connection to the Hive machine with Kerberos.
2. Check the Kerberos configuration on HDFS.
If you are unsure about any of the above, consult your IT administrator.
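To give a sense of how the authentication options above translate at the driver level, the following are illustrative Simba Hive JDBC connection strings. These are sketches only: Compose builds the actual string from the fields above, the exact property names can vary with the driver version, and all host names, paths, and credentials shown here are placeholders.
Example (user name and password over SSL with a trusted store):
jdbc:hive2://hive-host:10000/mydb;SSL=1;SSLTrustStore=C:\certs\truststore.jks;SSLTrustStorePwd=mypassword;AuthMech=3;UID=composeuser;PWD=mypassword
Example (Knox Gateway, using the default port, gateway path, and cluster name):
jdbc:hive2://knox-host:8443/;SSL=1;transportMode=http;httpPath=gateway/Default/hive;AuthMech=3;UID=composeuser;PWD=mypassword
Example (Kerberos):
jdbc:hive2://hive-host:10000/mydb;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=hive-host.example.com;KrbServiceName=hive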
Hive Access
- Use ZooKeeper - Select if your Hive machines are managed by Apache ZooKeeper.
- ZooKeeper hosts - The machines that make up the ZooKeeper ensemble (cluster). These should be specified in the following format (an illustrative connection string appears at the end of this section):
host1:port1,host2:port2,host3:port3
- ZooKeeper namespace - The namespace on ZooKeeper under which the HiveServer2 znodes are located.
- Host - If you are not using ZooKeeper, specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Cloudera Data Platform Private Cloud or Hortonworks Data Platform target endpoint settings in the Replicate task.
- Port - If you are not using ZooKeeper, optionally change the default port.
- Database name - Specify the name of the Hive target database. This must be different from the database specified in the Landing Zone settings.
Information note: If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create the database.
- JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.
Example:
KEY=VALUE;KEY1=VALUE1
Information note:
- You can set Hive parameters in the JDBC parameters. For example:
mapred.job.queue.name=<queuename>
hive.execution.engine=<enginename>
- To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter that changes the query description, as follows:
hive.query.name=my_description
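When ZooKeeper discovery is used, the underlying JDBC connection string typically points at the ZooKeeper ensemble rather than at a single Hive host. A hedged sketch, assuming placeholder host names and the common hiveserver2 namespace:
jdbc:hive2://host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2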
Amazon EMR Compute Platform
Security
- Use SSL - Select to connect using SSL.
Authentication type:
- User name - Select to connect to the Hadoop cluster with only a user name. Then, in the User name field, specify the name of a user authorized to access the Hadoop cluster.
- User name and password - Select to connect to the Hadoop cluster with a user name and password. Then, in the User name and Password fields, specify the name and password of a user authorized to access the Hadoop cluster.
If you are unsure about any of the above, consult your IT administrator.
Hive Access
- Host - Specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Amazon EMR target endpoint settings in the Replicate task.
- Port - Optionally change the default port.
- Database name - Specify the name of the Hive target database. This must be different from the database specified in the Landing Zone settings.
Information note: If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create the database.
- JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.
Example:
KEY=VALUE;KEY1=VALUE1
Information note:
- You can set Hive parameters in the JDBC parameters. For example:
mapred.job.queue.name=<queuename>
hive.execution.engine=<enginename>
- To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter that changes the query description, as follows:
hive.query.name=my_description
- Hive metadata storage type - Select one of the following storage mediums for your Hive metadata:
- Hive Metastore - This is the default metadata storage type.
- AWS Glue Data Catalog - You can choose to store Hive metadata using the AWS Glue Data Catalog. AWS Glue allows you to store and share metadata in the AWS Cloud in the same way as in a Hive metastore.
Information note: When using AWS Glue Data Catalog for metadata storage, Compose control tables will be created with the data type STRING instead of VARCHAR (LENGTH).
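Note that selecting AWS Glue Data Catalog here only tells Compose where the Hive metadata lives; the EMR cluster itself must also be configured to use Glue as its metastore. Assuming you provision the cluster yourself, one common way is to set the following property in the cluster's hive-site configuration classification (verify the property name against the AWS documentation for your EMR release):
hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory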
Databricks Compute Platform
Security
All connections to Databricks use SSL.
- Authentication type - "Databricks Delta". This cannot be changed.
- User name - The default user name is "token", which requires you to specify your personal access token in the Password field. Although it's strongly recommended to use a token, you can also access your Databricks account using a standard user name and password.
- Password - If you are using a personal access token, this will be the token value. Otherwise, specify the password for accessing your Databricks account.
- HTTP Path - The HTTP path to your Databricks compute resources.
Example:
sql/protocolv1/o/8388045294310983/0123-xxxxxx-xxxxxxx
If you are unsure about any of the above, consult your IT administrator.
Hive Access
- Host - Specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Databricks target endpoint settings in the Replicate task.
- Port - Optionally change the default port.
- Catalog - If you want the storage tables to be created in Unity Catalog, specify the catalog name.
Information note: If the Replicate task is Full Load and Store Changes, the storage catalog name can be whatever you choose. However, both the catalog name defined in the Replicate Databricks (Cloud Storage) target endpoint and the catalog name defined in the landing settings must be hive_metastore.
- Database name - Select the Hive target database. This must be different from the database specified in the Landing Zone settings. If you specified a catalog (above), only databases in the catalog will be available for selection.
Information note:
- If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create the database.
- To prevent table name conflicts when using Databricks, the Landing Zone and Storage Zone databases should be different.
- JDBC parameters - Additional parameters to add to the default Simba JDBC connection string.
The following parameters are set by default and should not be changed:
- UseNativeQuery=1
This is required for reading Compose tables or views in landing or storage on Databricks from outside of Compose. It is related to the way that the logical partitions are created in the Replicate Change tables (as Databricks does not support partitions).
- spark.sql.crossJoin.enabled=true
This parameter controls Spark SQL’s behavior when encountering queries that could potentially result in a Cartesian product (cross join).
Additional parameters should be added as key=value pairs separated by semicolons.
Example:
KEY=VALUE;KEY1=VALUE1
Information note:
- You can set Hive parameters in the JDBC parameters. For example:
mapred.job.queue.name=<queuename>
hive.execution.engine=<enginename>
- To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter that changes the query description, as follows:
hive.query.name=my_description
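Putting the settings in this section together, the underlying Simba Spark JDBC connection string for Databricks typically looks like the following sketch. All values are placeholders (the HTTP path reuses the example above), and Compose builds the actual string from the fields and default parameters described above:
jdbc:spark://your-workspace.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/8388045294310983/0123-xxxxxx-xxxxxxx;AuthMech=3;UID=token;PWD=<personal-access-token>;UseNativeQuery=1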
HDInsight Compute Platform
Security
All connections to HDInsight use SSL.
- User name - Specify the name of a user authorized to access the Hadoop cluster.
- Password - Specify the password of the user specified in the User name field.
Hive Access
- Host - Specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Microsoft Azure HDInsight target endpoint settings in the Replicate task.
- Port - Optionally change the default port.
- Database name - Specify the name of the Hive target database. This must be different from the database specified in the Landing Zone settings.
Information note: If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create the database.
- JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.
Example:
KEY=VALUE;KEY1=VALUE1
Information note:
- You can set Hive parameters in the JDBC parameters. For example:
mapred.job.queue.name=<queuename>
hive.execution.engine=<enginename>
- To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter that changes the query description, as follows:
hive.query.name=my_description
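For reference, JDBC connections to HDInsight Hive typically go over HTTPS on port 443. A hedged sketch of the underlying connection string, with a placeholder cluster name (Compose builds the actual string from the fields above, and property names can vary with the driver version):
jdbc:hive2://<clustername>.azurehdinsight.net:443/default;ssl=true;transportMode=http;httpPath=/hive2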
Dataproc Compute Platform
Security
- Use SSL - Select to connect using SSL.
Authentication type:
- User name - Select to connect to the Hadoop cluster with only a user name. Then, in the User name field, specify the name of a user authorized to access the Hadoop cluster.
- User name and password - Select to connect to the Hadoop cluster with a user name and password. Then, in the User name and Password fields, specify the name and password of a user authorized to access the Hadoop cluster.
If you are unsure about any of the above, consult your IT administrator.
Hive Access
- Host - Specify the IP address of the Hive machine. This should be the same as the host name or IP address specified in the Google Dataproc target endpoint settings in the Replicate task.
- Port - Optionally change the default port.
- Database name - Specify the name of the Hive target database. This must be different from the database specified in the Landing Zone settings.
Information note: If the database does not exist, Compose will try to create it. This requires the Compose user to be granted the necessary permission to create the database.
- JDBC parameters - Additional parameters to add to the default Simba JDBC connection string. These should be specified as key=value pairs separated by semicolons.
Example:
KEY=VALUE;KEY1=VALUE1
Information note:
- You can set Hive parameters in the JDBC parameters. For example:
mapred.job.queue.name=<queuename>
hive.execution.engine=<enginename>
- To distinguish Compose Hive sessions from other Hive sessions when Tez is being used, you can define a JDBC parameter that changes the query description, as follows:
hive.query.name=my_description
After defining your Storage Zone connection parameters:
- Click Test Connection to verify that Compose is able to establish a connection with the specified database.
- Click OK to save your settings. The database is added to the list on the left side of the Manage Databases window.
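If Test Connection fails and you want to rule out network or driver issues, it can help to probe the Hive endpoint from outside Compose. The following is a minimal sketch using the Apache Hive JDBC driver (hive-jdbc and its dependencies must be on the classpath); the URL, database name, user name, and password are placeholders that must match your own Storage Zone settings:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StorageZoneConnectionTest {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: adjust the host, port, and database to your Storage Zone settings
        String url = "jdbc:hive2://hive-host:10000/compose_storage";
        try (Connection conn = DriverManager.getConnection(url, "composeuser", "mypassword");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            if (rs.next()) {
                System.out.println("Connection OK, SELECT 1 returned " + rs.getInt(1));
            }
        }
    }
}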