Setting general connection properties
This section describes how to configure general connection properties. For an explanation of how to configure advanced connection properties, see Setting advanced connection properties.
Support for Partitions, Buckets and Skews
To load data into tables with partitions, buckets or skews, you first need to perform the procedure described below.
To load data into tables with partitions, buckets or skews:
- Create the tables in Hive with these attributes (partitions, buckets or skews) prior to running the task.
-
Add the following values to the hive.security.authorization.sqlstd.confwhitelist.append property in the Hive configuration file:
If the target tables are partitioned:
|hive.exec.dynamic.partition|hive.exec.dynamic.partition.mode
If the target tables have buckets:
|hive.enforce.bucketing
If the target tables have skews:
|hive.mapred.supports.subdirectories
Note: In some Hadoop distributions, you may need to specify the value without the "hive" prefix. For example, |enforce.bucketing instead of |hive.enforce.bucketing.
Note: If the value(s) already exist in the hive.security.authorization.sqlstd.confwhitelist property, you do not need to add them to the hive.security.authorization.sqlstd.confwhitelist.append property.
- Set the Target Table Preparation task setting to Truncate before loading or Do nothing. For more information on these settings, see Full Load Settings.
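The values listed in step 2 can be assembled with a small script. The following is a minimal sketch, not part of Hive or Qlik Replicate: the function name whitelist_append_value and its parameters are hypothetical, and it assumes the existing whitelist can be treated as a plain pipe-separated list (the real property may contain regular expressions).

```python
# Hypothetical helper for illustration only; not shipped with Hive or Replicate.
WHITELIST_VALUES = {
    "partitions": ["hive.exec.dynamic.partition", "hive.exec.dynamic.partition.mode"],
    "buckets": ["hive.enforce.bucketing"],
    "skews": ["hive.mapred.supports.subdirectories"],
}


def whitelist_append_value(attributes, existing="", strip_hive_prefix=False):
    """Return the pipe-prefixed values to add for the given table attributes.

    attributes        -- any of "partitions", "buckets", "skews"
    existing          -- current confwhitelist content, assumed pipe-separated
    strip_hive_prefix -- drop the "hive." prefix, as some distributions require
    """
    already_listed = {value for value in existing.split("|") if value}
    needed = []
    for attribute in attributes:
        for value in WHITELIST_VALUES[attribute]:
            if strip_hive_prefix and value.startswith("hive."):
                value = value[len("hive."):]
            # Skip values that are already whitelisted or already collected.
            if value not in already_listed and value not in needed:
                needed.append(value)
    return "".join("|" + value for value in needed)


if __name__ == "__main__":
    print(whitelist_append_value(["partitions", "buckets"]))
```

The returned string is what you would add to the hive.security.authorization.sqlstd.confwhitelist.append property; an empty string means nothing needs to be added.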
To add a Hadoop target endpoint to Qlik Replicate:
-
In the Qlik Replicate console, click Manage Endpoint Connections to open the Manage Endpoint Connections dialog box.
For more information on adding an endpoint to Qlik Replicate, see Defining and managing endpoints.
- In the Name field, type a name for your endpoint. This can be any name that will help to identify the endpoint being used.
- In the Description field, type a description that helps to identify the Hadoop endpoint. This is optional.
- Select Hadoop as the endpoint Type.
-
In the Security section, do the following:
-
To encrypt the data between the Replicate machine and HDFS, select Use SSL. To use SSL, first make sure that the SSL prerequisites described in Prerequisites have been met.
In the CA path field, specify either the directory containing the CA certificate or the full path to a specific CA certificate.
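The two accepted forms of the CA path (a directory or a single certificate file) correspond to a distinction that many TLS libraries also make. The sketch below illustrates that distinction with Python's standard ssl module; it does not describe how Replicate itself loads certificates, and the function names are hypothetical.

```python
import os
import ssl


def ca_arguments(ca_path):
    """Map a CA path to the matching ssl keyword argument.

    A directory is passed as capath, anything else as cafile.
    """
    if os.path.isdir(ca_path):
        return {"capath": ca_path}
    return {"cafile": ca_path}


def build_context(ca_path):
    """Create a client-side TLS context that trusts the given CA location."""
    return ssl.create_default_context(**ca_arguments(ca_path))
```

A quick check such as this can help confirm, before configuring the endpoint, whether the path you intend to enter is being treated as a directory or as a file.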
-
Select one of the following authentication types:
-
User name - Select to connect to the Hadoop cluster with only a user name. Then, in the User name field, specify the name of a user authorized to access the Hadoop cluster.
-
Kerberos - Select to authenticate against the Hadoop cluster using Kerberos. Replicate automatically detects whether Qlik Replicate Server is running on Linux or on Windows and displays the appropriate settings.
Note: In order to use Kerberos authentication on Linux, the Kerberos client (workstation) package should be installed.
Qlik Replicate Server on Linux:
When Qlik Replicate Server is running on Linux, select either Ticket or Keytab from the Kerberos options drop-down list.
If you selected Ticket, select one of the following options:
-
Use global Kerberos ticket file - Select this option if you want to use the same ticket for several Hadoop endpoints (source or target). In this case, you must make sure to select this option for each Hadoop endpoint instance that you define.
-
Use specific Kerberos ticket file - Select this option if you want to use a different ticket file for each Hadoop endpoint (source or target). Then specify the ticket file name in the designated field.
This option is especially useful if you need to perform a task-level audit of Replicate activity (using a third-party tool) on the Hadoop NameNode. To set this up, define several instances of the same Hadoop endpoint and specify a unique Kerberos ticket file for each instance. Then, for each task, simply select a different Hadoop endpoint instance.
Note:
-
You need to define a global Kerberos ticket file even if you select the Use specific Kerberos ticket file option. The global Kerberos ticket file is used for authentication when selecting a Hive endpoint, when testing the connection (using the Test Connection button), and when selecting which tables to replicate.
-
When replicating from a Hadoop source endpoint to a Hadoop target endpoint, both endpoints must be configured to use the same ticket file.
For additional steps required to complete setup for Kerberos ticket-based authentication, see Using Kerberos authentication.
If you selected Keytab, provide the following information:
-
Realm: The name of the realm in which your Hadoop cluster resides.
For example, if the full principal name is john.doe@EXAMPLE.COM, then EXAMPLE.COM is the realm.
-
Principal: The user name to use for authentication. The principal must be a member of the realm entered above.
For example, if the full principal name is john.doe@EXAMPLE.COM, then john.doe is the principal.
- Keytab file: The full path of the Keytab file. The Keytab file should contain the key of the Principal specified above.
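The relationship between the full principal name and the Realm and Principal fields can be sketched as follows. This is an illustrative helper only; split_principal is a hypothetical name and is not part of Replicate or any Kerberos library.

```python
def split_principal(full_principal):
    """Split a full principal name of the form 'name@REALM' into (principal, realm)."""
    principal, separator, realm = full_principal.rpartition("@")
    if not separator or not principal or not realm:
        raise ValueError(f"expected 'name@REALM', got {full_principal!r}")
    return principal, realm


if __name__ == "__main__":
    principal, realm = split_principal("john.doe@EXAMPLE.COM")
    print("Principal:", principal)  # john.doe
    print("Realm:", realm)          # EXAMPLE.COM
```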
Qlik Replicate Server on Windows:
When Qlik Replicate Server is running on Windows, select one of the following:
-
Use the following KDC: Select Active Directory (default) if your KDC is Microsoft Active Directory or select MIT if your KDC is MIT KDC running on Linux/UNIX.
Note: When the Replicate KDC and the Hadoop KDC are in different domains, a relationship of trust must exist between the two domains.
- Realm: The name of the realm/domain in which your Hadoop cluster resides (where realm is the MIT term while domain is the Active Directory term).
- Principal: The user name to use for authentication. The principal must be a member of the realm/domain entered above.
- When Active Directory is selected - Password: The password for the principal entered above.
- When MIT is selected - Keytab file: The keytab file containing the principal entered above.
Note: When replicating from a Hadoop source endpoint to a Hadoop target endpoint, both endpoints must be configured to use the same parameters (KDC, realm, principal, and password).
If you are unsure about any of the above, consult your IT/security administrator.
For additional steps required to complete setup for Kerberos authentication, see Using Kerberos authentication.
-
User name and password - Select to connect to the Hadoop NameNode or to the Knox Gateway (when enabled - see below) with a user name and password. Then, in the User name and Password fields, specify the required user name and password.
Note: Consider the following:
-
A user name and password is required to access the MapR Control System.
-
This information is case sensitive.
Note: Make sure that the specified user has the required Hadoop access privileges. For information on how to provide the required privileges, see Security requirements.
-
If you need to access the Hortonworks Hadoop distribution through a Knox Gateway, select Use Knox Gateway. Then provide values for the following fields:
Note: To be able to select this option, first select Use SSL and then select Password from the Authentication type drop-down list.
- Knox Gateway host - The FQDN (Fully Qualified Domain Name) of the Knox Gateway host.
- Knox port - The port number to use to access the host. The default is "8443".
-
Knox Gateway path - The context path for the gateway. The default is "gateway".
Note: The port and path values are set in the gateway-site.xml file. If you are unsure whether the default values have been changed, contact your IT department.
- Cluster name - The cluster name as configured in Knox. The default is "default".
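To see how the four Knox fields fit together, the sketch below composes them into a WebHDFS URL using the URL layout documented by Apache Knox (gateway host, port, gateway path, cluster name). Whether Replicate builds its requests in exactly this way is not stated here, so treat this as an assumption; the function name and the host knox.example.com are hypothetical.

```python
def knox_webhdfs_base(host, port=8443, path="gateway", cluster="default"):
    """Compose a Knox WebHDFS base URL from the four endpoint fields.

    The defaults mirror the documented field defaults: port 8443,
    gateway path "gateway", cluster name "default".
    """
    return f"https://{host}:{port}/{path.strip('/')}/{cluster}/webhdfs/v1"


if __name__ == "__main__":
    # Hypothetical host name, for illustration only.
    print(knox_webhdfs_base("knox.example.com"))
```

If the gateway-site.xml defaults have been changed, pass the actual port and path instead of relying on the defaults.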
-
In the HDFS section, select WebHDFS, HttpFS or NFS as the HDFS access method. If you are accessing MapR, it is recommended to use HttpFS.
Note: When the Use Knox Gateway option is selected, the NameNode, HttpFS Host, and Port fields described below are not relevant (and are therefore hidden).
-
If you selected WebHDFS:
-
In the NameNode field, specify the IP address of the NameNode.
Note: This is the Active node when High Availability is enabled (see below).
-
Replicate supports replication to an HDFS High Availability cluster. In such a configuration, Replicate communicates with the Active node, but switches to the Standby node in the event of failover. To enable this feature, select the High Availability check box. Then, specify the FQDN (Fully Qualified Domain Name) of the Standby NameNode in the Standby NameNode field.
- In the Port field, optionally change the default port (50070).
-
In the Target Folder field, specify where to create the data files on HDFS.
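Before saving the endpoint, it can be useful to confirm that the target folder is reachable over WebHDFS. The sketch below only builds the request URLs for the standard WebHDFS GETFILESTATUS operation, first for the Active NameNode and then for the Standby NameNode if one is given. The function name, the sample addresses, and the use of plain http are assumptions; a cluster that uses SSL would need https and typically a different port.

```python
from urllib.parse import quote, urlencode


def webhdfs_status_urls(namenode, target_folder, standby=None, port=50070, user=None):
    """Return GETFILESTATUS URLs for the Active node and, if set, the Standby node."""
    params = {"op": "GETFILESTATUS"}
    if user:
        params["user.name"] = user
    # Normalize the folder to a single leading slash and escape unsafe characters.
    path = quote("/" + target_folder.strip("/"))
    hosts = [namenode] + ([standby] if standby else [])
    return [
        f"http://{host}:{port}/webhdfs/v1{path}?{urlencode(params)}"
        for host in hosts
    ]


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    for url in webhdfs_status_urls("10.0.0.1", "/data/replicate",
                                   standby="nn2.example.com"):
        print(url)
```

Opening the first URL with a tool of your choice should return the folder status from the Active node; the second URL is the one to try after a failover.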
-
If you selected HttpFS:
- In the HttpFS Host field, specify the IP address of the HttpFS host.
- In the Port field, optionally change the default port (14000).
- In the Target Folder field, specify where to create the data files on HDFS.
-
If you selected NFS:
- In the Target folder field, enter the path to the folder located under the MapR cluster mount point. For example:
/mapr/my.cluster.com/data
- In order to do this, you first need to mount the MapR cluster using NFS. For information on how to do this, refer to the MapR help.
Note: The Target folder name can only contain ASCII characters.
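The two constraints on the NFS target folder (it must sit under the MapR cluster mount point and contain only ASCII characters) can be checked ahead of time. The following is a minimal sketch; the function name is hypothetical, and the default mount point /mapr is an assumption taken from the example path above.

```python
import posixpath


def check_nfs_target_folder(target_folder, mount_point="/mapr"):
    """Return a list of problems found with the target folder (empty if none)."""
    problems = []
    if not target_folder.isascii():
        problems.append("contains non-ASCII characters")
    # Resolve "." and ".." segments before comparing against the mount point.
    normalized = posixpath.normpath(target_folder)
    if not normalized.startswith(mount_point.rstrip("/") + "/"):
        problems.append(f"is not under {mount_point}")
    return problems


if __name__ == "__main__":
    print(check_nfs_target_folder("/mapr/my.cluster.com/data"))
```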
-
In the Hive Access section, do the following:
-
From the Access Hive using drop-down list, select one of the following options:
Note: When the Use Knox Gateway option is selected, the Host and Port fields described below are not relevant (and are therefore hidden).
-
ODBC - Select this option to access Hive using an ODBC driver (the default). Then continue from the Host field.
Note: If you select this option, make sure that the latest 64-bit ODBC driver for your Hadoop distribution is installed on the Qlik Replicate Server machine.
-
HQL scripts - When this option is selected, Replicate will generate HQL table creation scripts in the specified Script folder.
Note: When this option is selected, the target storage format must be set to "Text".
- No Access - When this option is selected, after the data files are created on HDFS, Replicate will take no further action.
- In the Host field, specify the IP address of the Hive machine.
- In the Port field, optionally change the default port.
- In the Database field, specify the name of the Hive target database.
-