Setting general connection properties
This section describes how to configure general connection properties. For an explanation of how to configure advanced connection properties, see Setting advanced connection properties.
To add a Databricks (Cloud Storage) target endpoint to Qlik Replicate:
- In the Qlik Replicate console, click Manage Endpoint Connections to open the Manage Endpoint Connections dialog box.
For more information on adding an endpoint to Qlik Replicate, see Defining and managing endpoints.
- In the Name field, type a name for your endpoint. This can be any name that will help to identify the endpoint being used.
- In the Description field, type a description that helps to identify the endpoint. This is optional.
- Select Databricks (Cloud Storage) as the endpoint Type.
Databricks ODBC Access
Expand the Databricks ODBC Access section and provide the following information:
- Host: The host name of the Databricks workspace where the specified Amazon S3 bucket is mounted.
- Port: The port via which to access the workspace (you can change the default port 443, if required).
- Authentication: Select one of the following:
- Personal Access Token: In the Token field, enter your personal token for accessing the workspace.
- OAuth: Provide the following information:
- Client ID: The client ID of your application.
- Client Secret: The client secret of your application.
Information note: OAuth authentication is supported from Replicate November 2023 Service Release 01 only.
Information note: To use OAuth authentication, your Databricks database must be configured to use OAuth. For instructions, see the vendor's online help.
- HTTP Path: The path to the cluster being used.
- If you want the tables to be created in Unity Catalog, select Use Unity Catalog and then specify the Catalog name.
Information note: When the Use Unity Catalog option is selected, note the following:
- Prerequisite: To allow Replicate to access external (unmanaged) tables, you need to define an external location in Databricks. For more information, see the Databricks online help. An illustrative example is shown below, after the Databricks ODBC Access settings.
- Limitation: Change Data Partitioning is not supported and should be set to “Off”. For more information, see Store Changes Settings.
- Database: The name of the Databricks target database.
- Cluster type: Select either All-purpose (Interactive) or SQL Warehouse according to your cluster type.
Information note: When both Use Unity Catalog and All-purpose clusters are selected (in the endpoint settings' General tab), the Sequence storage format will not be available. Choose Text or Parquet instead (in the Advanced tab).
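The external location prerequisite mentioned in the Use Unity Catalog note above is defined on the Databricks side. The following statements are a minimal sketch only: the location name, bucket URL, storage credential, and principal are placeholders, and the exact definition required depends on your environment. See the Databricks documentation for the full syntax and privileges.

-- Register the cloud storage path that Replicate will write to as an
-- external location in Unity Catalog (placeholder names and paths).
CREATE EXTERNAL LOCATION IF NOT EXISTS replicate_landing
  URL 's3://my-replicate-bucket/replicate/'
  WITH (STORAGE CREDENTIAL my_storage_credential)
  COMMENT 'External location used by Qlik Replicate target tables';

-- Grant the principal used by the Databricks cluster or SQL warehouse the
-- privileges it needs on that location (adjust principal and privileges).
GRANT CREATE EXTERNAL TABLE, READ FILES, WRITE FILES
  ON EXTERNAL LOCATION replicate_landing
  TO `data_engineers`;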
Cloud Storage Access
Select a Storage type and then configure the settings according to the sections below.
Amazon S3
- Bucket name: The name of your Amazon S3 bucket.
- Bucket region: The region where your bucket is located. It is recommended to leave the default (Auto-Detect) as it usually eliminates the need to select a specific region. However, due to security considerations, for some regions (for example, AWS GovCloud) you might need to explicitly select the region. If the region you require does not appear in the regions list, select Other and specify the code in the Region code field.
For a list of region codes, see AWS Regions.
- Access type: Choose one of the following:
- Key pair: Choose this method to authenticate with your Access Key and Secret Key.
- IAM Roles for EC2: Choose this method if the machine on which Qlik Replicate is installed is configured to authenticate itself using an IAM role.
For more information about this access option, see:
http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html
- Access key: The access key information for Amazon S3.
Information note: This option is only available when Key pair is the access option.
- Secret key: The secret key information for Amazon S3.
Information note: This option is only available when Key pair is the access option.
- Target directory: The target folder in your Amazon S3 bucket.
- Databricks storage access method: Choose which method your Databricks cluster uses to access the Amazon S3 storage: Access Directly (the default) or Access through DBFS Mount. AWS S3 storage can be accessed by mounting buckets using DBFS or directly. Replicate needs to know which method Databricks uses to access the storage so that it can set the "location" property when it creates the tables in Databricks. The "location" property enables Databricks to access the storage data using its configured storage access method. For an illustration of how this choice affects the table location, see the example following this list.
- Mount Path: When Access through DBFS Mount is selected, you also need to specify the mount path.
Information note: The mount path cannot contain special characters or spaces.
For more information on configuring Databricks to access the Amazon S3 storage, see https://docs.databricks.com/data/data-sources/aws/amazon-s3.html
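The following sketch illustrates how the Databricks storage access method affects the "location" property of the tables that Replicate creates. It is for illustration only; the bucket name, mount path, schema, table definition, and storage format are placeholders rather than the exact DDL that Replicate generates.

-- Access Directly: the "location" property points at the S3 bucket itself.
CREATE TABLE my_schema.orders (id INT, amount DOUBLE)
USING PARQUET
LOCATION 's3://my-replicate-bucket/target_dir/orders';

-- Access through DBFS Mount: the "location" property points at the mount path.
CREATE TABLE my_schema.orders (id INT, amount DOUBLE)
USING PARQUET
LOCATION 'dbfs:/mnt/my-replicate-mount/target_dir/orders';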
Warning note: All tables being replicated in a single task must be configured to access the storage using the Databricks storage access method defined in the endpoint settings (see earlier). The same is true for Replicate Control Tables, which are common to all tasks. Any tables that are configured to use a different storage access method will not be replicated or, in the case of Control Tables, will not be written to the target database. To prevent this from happening, you must perform the procedures below if you need to do any of the following:
- Change the Databricks storage access method during a task and retain the existing tables
- Define a new task with a Databricks target whose Databricks storage access method differs from existing tasks (with a Databricks target)
To change the Databricks storage access method during a task:
- Stop the task.
- Change the Databricks storage access method.
- For all tables (including Control Tables), execute the ALTER TABLE statement:
ALTER TABLE table_identifier [ partition_spec ] SET LOCATION 'new_location'
For details, see https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#set-location. A worked example is shown after this procedure.
- Start the task.
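The following is a hedged, worked example of the ALTER TABLE step above, assuming a task is being switched from Access through DBFS Mount to Access Directly. All schema, table, bucket, and path names are placeholders; attrep_apply_exceptions is shown only as an example of a Replicate Control Table.

-- Move a replicated table's location from the DBFS mount path to the
-- corresponding direct S3 path (placeholder names and paths).
ALTER TABLE my_schema.orders
SET LOCATION 's3://my-replicate-bucket/target_dir/orders';

-- Repeat for every replicated table and for the Replicate Control Tables,
-- for example:
ALTER TABLE my_schema.attrep_apply_exceptions
SET LOCATION 's3://my-replicate-bucket/target_dir/attrep_apply_exceptions';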
To create a new task with a Databricks endpoint whose "Databricks storage access method" differs from existing tasks:
- When you define the task, specify a dedicated Control Table schema in the task settings' Control tables tab.
Google Cloud Storage
- JSON credentials: The JSON credentials for the service account key used to access the Google Cloud Storage bucket.
For more information about JSON credentials, see the Google Cloud online help.
- Bucket name: The name of the bucket in Google Cloud Storage where you want the data files to be written. This must be the same as the bucket you configured for your Databricks cluster.
- Target directory: Where to create the data files in the specified bucket.
Microsoft Azure Data Lake Storage (ADLS) Gen2
- Storage account: The name of your storage account.
Information note: To connect to an Azure resource on Government Cloud or China Cloud, you need to specify the full resource name of the storage account. For example, assuming the storage account is "myaccount", then the resource name for China Cloud would be myaccount.dfs.core.chinacloudapi.cn
In addition, you also need to specify the login URL using the adlsLoginUrl internal parameter. For China Cloud, this would be https://login.chinacloudapi.cn
For information on setting internal parameters, see Setting advanced connection properties
- Azure Active Directory Tenant ID: The Azure Active Directory tenant ID.
- Application Registration Client ID: The application registration client ID.
- Application Registration Secret: The application registration secret.
- Container: The container in which your files and folders reside.
- Target directory: Specify where to create the data files on ADLS.
Information note:
- The Target folder name can only contain ASCII characters.
- Connecting to a proxy server with a username and password is not supported with Azure Data Lake Storage (ADLS) Gen2 storage.
- Databricks storage access method: Choose which method your Databricks cluster uses to access the Azure storage: Access Directly (the default) or Access through DBFS Mount. The storage can be accessed by mounting it using DBFS or directly. Replicate needs to know which method Databricks uses to access the storage so that it can set the "location" property when it creates the tables in Databricks. The "location" property enables Databricks to access the storage data using its configured storage access method. For an illustration of how this choice affects the table location, see the example following this list.
- Mount Path: When Access through DBFS Mount is selected, you also need to specify the mount path.
Information note: The mount path cannot contain special characters or spaces.
For more information on configuring Databricks to access Azure storage, see https://docs.databricks.com/data/data-sources/azure/azure-storage.html
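As with Amazon S3, the following sketch illustrates how the storage access method affects the "location" property on ADLS Gen2. The container, storage account, mount path, schema, table definition, and storage format are placeholders, not the exact DDL that Replicate generates.

-- Access Directly: the "location" property uses an abfss:// URI.
CREATE TABLE my_schema.orders (id INT, amount DOUBLE)
USING PARQUET
LOCATION 'abfss://mycontainer@myaccount.dfs.core.windows.net/target_dir/orders';

-- Access through DBFS Mount: the "location" property uses the mount path.
CREATE TABLE my_schema.orders (id INT, amount DOUBLE)
USING PARQUET
LOCATION 'dbfs:/mnt/my-adls-mount/target_dir/orders';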
Microsoft Azure Blob Storage
- Storage account: The name of an account with write permissions to the container.
Information note: To connect to an Azure resource on Government Cloud or China Cloud, you need to specify the full resource name of the storage account. For example, assuming the storage account is MyBlobStorage, then the resource name for China Cloud would be MyBlobStorage.blob.core.chinacloudapi.cn
For information on setting internal parameters, see Setting advanced connection properties
- Access key: The account access key.
- Container name: The container name.
- Target directory: Specify where to create the data files on Blob storage.
- Databricks storage access method: Choose which method your Databricks cluster uses to access the Blob storage: Access Directly (the default) or Access through DBFS Mount. The storage can be accessed by mounting it using DBFS or directly. Replicate needs to know which method Databricks uses to access the storage so that it can set the "location" property when it creates the tables in Databricks. The "location" property enables Databricks to access the storage data using its configured storage access method. For an illustration of how this choice affects the table location, see the example following this list.
- Mount Path: When Access through DBFS Mount is selected, you also need to specify the mount path.
Information note: The mount path cannot contain special characters or spaces.
For more information on configuring Databricks to access Azure storage, see https://docs.databricks.com/data/data-sources/azure/azure-storage.html
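Similarly, the following sketch illustrates the "location" property for Blob storage accessed directly (wasbs) versus through a DBFS mount. All names and paths are placeholders, not the exact DDL that Replicate generates.

-- Access Directly: the "location" property uses a wasbs:// URI.
CREATE TABLE my_schema.orders (id INT, amount DOUBLE)
USING PARQUET
LOCATION 'wasbs://mycontainer@myblobstorage.blob.core.windows.net/target_dir/orders';

-- Access through DBFS Mount: the "location" property uses the mount path.
CREATE TABLE my_schema.orders (id INT, amount DOUBLE)
USING PARQUET
LOCATION 'dbfs:/mnt/my-blob-mount/target_dir/orders';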
Warning note: All tables being replicated in a single task must be configured to access the storage using the Databricks storage access method defined in the endpoint settings (see earlier). The same is true for Replicate Control Tables, which are common to all tasks. Any tables that are configured to use a different storage access method will not be replicated or, in the case of Control Tables, will not be written to the target database. To prevent this from happening, you must perform the procedures below if you need to do any of the following:
- Change the Databricks storage access method during a task and retain the existing tables
- Define a new task with a Databricks target whose Databricks storage access method differs from existing tasks (with a Databricks target)
To change the Databricks storage access method during a task:
- Stop the task.
- Change the Databricks storage access method.
- For all tables (including Control Tables), execute the ALTER TABLE statement:
ALTER TABLE table_identifier [ partition_spec ] SET LOCATION 'new_location'
For details, see https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#set-location.
- Start the task.
To create a new task with a Databricks endpoint whose "Databricks storage access method" differs from existing tasks:
- When you define the task, specify a dedicated Control Table schema in the task settings' Control tables tab.