Step 2: Create a lakehouse cluster

A lakehouse cluster defines the compute environment to run Qlik Open Lakehouse storage tasks. Each cluster specifies settings that include the number of instances, machine type, and scaling strategy.

When you create a network integration for a Qlik Open Lakehouse pipeline project, a cluster with a single AWS Spot Instance is created automatically. However, you can create additional clusters in the Administration and Data Integration activity centers.

Lakehouse clusters link pipelines to a group of AWS instances, allowing you to optimize workloads by assigning critical jobs to high-performance clusters, and non-critical workloads to cost-effective machines.

While a cluster is associated with a single VPC, multiple clusters can run within the same VPC. Additionally, a single cluster can run multiple jobs. It is helpful to define the compute requirements of your workloads before you create a lakehouse cluster. You can modify cluster settings, including the scaling strategy, as needed, although some changes may require the cluster to be rolled. For more information on editing cluster settings, see Managing lakehouse clusters.

When you create a lakehouse cluster, you specify the number of Spot and On-Demand instances that Qlik provisions. For more information on how Qlik utilizes Spot and On-Demand instances in your cluster, see Lakehouse cluster (EC2 Auto-Scaling Group).

Using custom images is optional. If you use custom images, an x86 image is required, but providing both Arm and x86 images is recommended to maximize the availability of Spot Instances. For more information, see AMI requirements.

Cluster capabilities

When you create a cluster, you must choose the workload type that the cluster runs: streaming, CDC, or mixed. In general, it is best practice to use separate clusters for streaming sources and CDC (database and SaaS) sources. This ensures accurate and minimal billing charges. However, a mixed workload that shares a single cluster is appropriate in the following cases:

For the testing or evaluation of small-scale projects that have insignificant billing volumes.
If non-streaming usage is minimal and you do not want to configure and maintain a separate cluster.
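
The guidance above can be sketched as a small planning helper. This is an illustration only and not part of any Qlik API: the function `suggest_clusters`, its parameter names, and the `Capability` enum are hypothetical, and the two exception flags simply mirror the two cases listed above.

```python
from enum import Enum


class Capability(Enum):
    # Values mirror the option labels shown in the cluster dialog.
    STREAMING = "Streaming workloads"
    CDC = "CDC workloads"
    MIXED = "Mixed workloads"


def suggest_clusters(
    has_streaming: bool,
    has_cdc: bool,
    small_evaluation: bool = False,
    non_streaming_minimal: bool = False,
) -> list[Capability]:
    """Return one capability per cluster to create (hypothetical helper)."""
    if not (has_streaming or has_cdc):
        raise ValueError("at least one source type is required")
    if has_streaming and has_cdc:
        # Sharing a cluster is reserved for the two exceptions in the text.
        if small_evaluation or non_streaming_minimal:
            return [Capability.MIXED]
        # Default best practice: one cluster per workload type.
        return [Capability.STREAMING, Capability.CDC]
    return [Capability.STREAMING if has_streaming else Capability.CDC]


print([c.value for c in suggest_clusters(True, True)])
print([c.value for c in suggest_clusters(True, True, small_evaluation=True)])
```

The helper returns a list so that the default case, two separate clusters, is explicit rather than implied.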

Prerequisites

To create a lakehouse cluster, you need:

A network integration within the current tenant.
Permission to access the network integration.

Creating a lakehouse cluster

To add a cluster to the current tenant, do the following:

In the Administration activity center, click Lakehouse clusters. Select the Lakehouse clusters tab, click Create new, then Lakehouse cluster, and configure the following settings:

Name: Enter the name of the cluster.
Network integration: Select the network integration where the cluster will be deployed.

Integration space: Select the space that the cluster will belong to. The space is not inherited from the network integration.
Select the cluster capabilities for the workload:
- Streaming workloads: Select this option when ingesting from a streaming data source.
- CDC workloads: Select this option when ingesting from database and SaaS application sources.
- Mixed workloads: Select this option when testing, or when the use of streaming sources is minimal and workloads consist mostly of CDC sources.
Configure the family type:
- Type: Select the instance type.
- Size: Select the instance size.
Configure the instances:

AWS On-Demand Instances: Enter the number of AWS On-Demand Instances for this cluster.
AWS Spot Instances: Enter the Minimum and Maximum number of Spot Instances to use.

Choose an appropriate scaling strategy for your workload from the following options:

Low cost: Optimizes for low cost, though it may lead to occasional periods of high latency.
Low latency: Strives to maintain low latency, while allowing brief, necessary spikes.
Consistent low latency: Proactively scales up to ensure latency remains low.
Manual scaling: Retains a static number of instances with no automatic scaling.

Select how your cluster receives software updates:

Early rollout: Ideal for development and staging clusters, to validate new releases against custom setups and code before they reach production.
Later rollout: Updates are applied after a successful early rollout. Recommended for production environments.

Add a Key and Value for any tags you want to include that help you identify, organize, and manage resources.
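
Before opening the dialog, it can help to write down the intended settings and check them for obvious inconsistencies. The following is a minimal sketch of such a checklist. It does not call any Qlik API. The `ClusterPlan` class, its field names, and the sample values are hypothetical, and the numeric checks (a non-negative On-Demand count, and a Spot minimum no greater than the maximum) are assumptions rather than documented limits.

```python
from dataclasses import dataclass, field

# Option labels copied from the dialog described above.
CAPABILITIES = ("Streaming workloads", "CDC workloads", "Mixed workloads")
SCALING_STRATEGIES = (
    "Low cost",
    "Low latency",
    "Consistent low latency",
    "Manual scaling",
)
UPDATE_ROLLOUTS = ("Early rollout", "Later rollout")


@dataclass
class ClusterPlan:
    """Hypothetical planning record; fields mirror the form labels."""

    name: str
    network_integration: str
    integration_space: str
    capability: str
    instance_type: str
    instance_size: str
    on_demand_instances: int
    spot_min: int
    spot_max: int
    scaling_strategy: str
    update_rollout: str
    tags: dict[str, str] = field(default_factory=dict)

    def problems(self) -> list[str]:
        """Return a list of human-readable issues; empty if none found."""
        issues = []
        for label in ("name", "network_integration", "integration_space"):
            if not getattr(self, label).strip():
                issues.append(f"{label} is empty")
        if self.capability not in CAPABILITIES:
            issues.append(f"unknown capability: {self.capability}")
        if self.scaling_strategy not in SCALING_STRATEGIES:
            issues.append(f"unknown scaling strategy: {self.scaling_strategy}")
        if self.update_rollout not in UPDATE_ROLLOUTS:
            issues.append(f"unknown update rollout: {self.update_rollout}")
        # Assumed sanity checks; the product may enforce different limits.
        if self.on_demand_instances < 0:
            issues.append("on_demand_instances is negative")
        if not 0 <= self.spot_min <= self.spot_max:
            issues.append("spot range must satisfy 0 <= min <= max")
        return issues


# Placeholder values only; replace them with your own settings.
plan = ClusterPlan(
    name="cdc-prod",
    network_integration="example-network-integration",
    integration_space="example-space",
    capability="CDC workloads",
    instance_type="example-type",
    instance_size="example-size",
    on_demand_instances=1,
    spot_min=1,
    spot_max=4,
    scaling_strategy="Low latency",
    update_rollout="Later rollout",
    tags={"team": "data-platform"},
)
print(plan.problems() or "plan looks consistent")
```

Returning a list of issues instead of raising on the first one lets you review every inconsistent setting in a single pass.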
