Onboarding data from Qlik Replicate | Qlik Cloud Help

Onboarding data from Qlik Replicate

You can leverage the full functionality of Qlik Open Lakehouse without abandoning your existing Replicate infrastructure. Use the onboarding data wizard to create a pipeline that processes data from Qlik Replicate Amazon S3 output and stores it as Iceberg tables using Qlik Open Lakehouse.

Prerequisites

Before launching the onboarding data wizard, make sure you have fulfilled the following prerequisites.

Qlik Replicate prerequisites

Set up a Qlik Replicate task to move data to an Amazon S3 bucket (via the Amazon S3 target endpoint).

Qlik Replicate task prerequisites

  • Enable only the Full Load and/or Store Changes replication options. Store Changes is required for keeping the data up to date.

  • In the task settings' Store Changes Settings tab:

    • Configure the Change Table Settings as follows:

      • Suffix: __ct

      • Header column prefix: header__

      • DDL options: Apply to change table

      • On UPDATE: Store after image only

    • Leave Change Data Partitioning disabled (the default).

  • In the task settings' Metadata > Control Tables tab, make sure the Apply Exceptions and DDL History control tables are selected.

For more information on these settings, refer to the Qlik Replicate tasks settings help.
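
The Store Changes requirements above can be captured as a small checklist. The following is a minimal sketch only: the dictionary keys (such as change_table_suffix) are hypothetical names chosen for illustration and do not correspond to a Qlik Replicate API or export format. The expected values are the ones listed in the prerequisites.

```python
# Expected values taken from the prerequisites above.
# The key names are hypothetical and only serve this illustration.
REQUIRED_SETTINGS = {
    "change_table_suffix": "__ct",
    "header_column_prefix": "header__",
    "ddl_options": "Apply to change table",
    "on_update": "Store after image only",
    "change_data_partitioning": False,  # must stay disabled (the default)
}


def find_mismatches(settings):
    """Return {key: (expected, actual)} for every setting that differs."""
    return {
        key: (expected, settings.get(key))
        for key, expected in REQUIRED_SETTINGS.items()
        if settings.get(key) != expected
    }


if __name__ == "__main__":
    # Hand-typed copy of the values seen in the task settings dialog.
    my_task = dict(REQUIRED_SETTINGS, on_update="Store before and after image")
    for key, (expected, actual) in find_mismatches(my_task).items():
        print(f"{key}: expected {expected!r}, found {actual!r}")
```

Filling in the dictionary by hand from the task settings dialog and running the check before launching the wizard can catch a mistyped suffix or prefix early.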

Qlik Replicate Amazon S3 target endpoint prerequisites

In the General tab:

  • In the Metadata Files section, enable the Create metadata files in the target folder (Replicate November 2025 or earlier) or Create a metadata file for each data file option (Replicate May 2026 or later). This will create files with a .dfm extension in the S3 bucket.
  • In the File Attributes section, make sure that:
    • CSV is the selected Format (the default)
    • Compress files using is set to None (the default)
    • Both of the Add metadata header options are enabled (With column names and With data types)
  • When configuring the Amazon S3 storage settings, if your bucket contains non-Replicate files, it is recommended to specify a Target Folder. Doing so will make it easier to locate the required output when onboarding the data with Qlik Open Lakehouse.

For more information on these settings, refer to the Amazon S3 target endpoint help.

Qlik Open Lakehouse prerequisites

An AWS S3 Data Stream connector configured to access the Replicate Amazon S3 output.

For instructions, see AWS S3 Data Stream.

How Replicate organizes data in S3

Familiarity with the structure Qlik Replicate uses to organize data in S3 will make it easier to locate those files in the Directory browser, described in Create an onboarding task below.

  • s3://bucket-name/target-path/schema.table/

    Contains Full Load files and DFM metadata files in sequential format, as in the following example:

    • LOAD00000001.csv

    • LOAD00000001.dfm

    • LOAD00000002.csv

    • LOAD00000002.dfm

  • s3://bucket-name/target-path/schema.table__ct/

    Contains Change Data Capture files and DFM metadata files in timestamp format, as in the following example:

    • 20250121-083015000.csv

    • 20250121-083015000.dfm

    • 20250121-091522000.csv

    • 20250121-091522000.dfm
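
To illustrate the layout above, the following minimal sketch sorts a list of S3 object keys into Full Load and change files per table. The file name patterns are inferred from the examples shown here and are assumptions, not an official specification. The function names are hypothetical, and the sketch works on plain strings rather than calling any S3 API.

```python
import re
from collections import defaultdict

# Patterns inferred from the example file names above (assumptions).
# Hex digits are accepted in case the LOAD counter is hexadecimal.
FULL_LOAD_FILE = re.compile(r"^LOAD[0-9A-Fa-f]{8}\.(csv|dfm)$")
CHANGE_FILE = re.compile(r"^\d{8}-\d{9}\.(csv|dfm)$")


def classify_key(key, change_suffix="__ct"):
    """Return (table, kind, extension) for an object key, or None if unrecognized."""
    parts = key.strip("/").split("/")
    if len(parts) < 2:
        return None
    folder, filename = parts[-2], parts[-1]
    if folder.endswith(change_suffix):
        match = CHANGE_FILE.match(filename)
        if match:
            return folder[: -len(change_suffix)], "change", match.group(1)
        return None
    match = FULL_LOAD_FILE.match(filename)
    if match:
        return folder, "full_load", match.group(1)
    return None


def group_data_files(keys):
    """Group the .csv data files by table name and by kind."""
    tables = defaultdict(lambda: {"full_load": [], "change": []})
    for key in keys:
        info = classify_key(key)
        if info and info[2] == "csv":
            table, kind, _ = info
            tables[table][kind].append(key)
    return dict(tables)


if __name__ == "__main__":
    sample = [
        "target-path/schema.table/LOAD00000001.csv",
        "target-path/schema.table/LOAD00000001.dfm",
        "target-path/schema.table__ct/20250121-083015000.csv",
        "target-path/notes/readme.txt",
    ]
    print(group_data_files(sample))
```

Keys that match neither pattern, such as unrelated files in a shared bucket, are ignored. This is the situation a dedicated Target Folder helps to avoid.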

Create an onboarding task

Follow these steps to configure an onboarding task for Replicate data:

  1. Launch the onboarding wizard as follows:

    • New project: Click Onboard data in the middle of the project window.

    • Existing project: In the top right of your Qlik Open Lakehouse project, click Create new > Onboard data to launch the onboarding wizard.

  2. Enter a task name and description.
  3. Select Qlik Replicate from the Data origin options. Then, click Next.
  4. Select the AWS S3 Data Stream connector you configured earlier. Then, click Next.
  5. In the Directory browser pane on the left, select the root directory containing the Replicate output.

    Tip: If you did not specify a target folder in the Replicate Amazon S3 target endpoint settings (as recommended in the prerequisites above), your bucket might contain many unrelated files. In this case, if you know the exact path, you can paste it directly into the Replicate root directory path field.
  6. Click Load to detect available tables. Any available tables will be shown in the Selected tables to register list.
  7. In the Select data source to discover tables region on the right, the following settings are available:
    • Include all current tables: Toggle on to land all existing tables. When this option is toggled off, you need to select individual tables.
    • Delete the source files following ingestion: Toggle on to permanently remove files from the Replicate S3 bucket after they are successfully processed.

      Once the S3 files are deleted, all connected pipelines that use the same S3 files as a source will become inactive and unable to function. You should therefore only enable this option if you are certain that the files are not required by any other pipeline process.

  8. Click Next.
  9. On the Content Type step, the file format (CSV) is automatically detected from your Replicate data. The following options are also available:
    • Historical Data store (Type 2): Toggle on if you need to maintain historical versions of records and keep an archive of the changes.
    • Verify that events are correctly loaded: You can select a dataset to verify that it contains the correct data.
  10. Click Next.
  11. Review the task configuration and click Create to create the onboarding task.
  12. Click Prepare to catalog the data task and prepare it for execution.

    You can track the prepare progress in the task tile. On completion, Prepare: Success should be displayed.

  13. When you are ready to start onboarding data, click Run.

What happens next

Once activated, the onboarding task:

  1. Starts to monitor your Replicate S3 bucket for data changes
  2. Detects Full Load and CDC (Change Data Capture) files from Replicate
  3. Processes the files and creates Iceberg tables in Qlik Open Lakehouse
  4. Continues to monitor for new data
  5. If enabled, automatically deletes source files after successful ingestion

Task settings

To change the task settings, open the Onboarding_Qlik_Replicate (default name) task and click the Settings toolbar button.

The Settings: <task name> dialog opens.

General tab

These options are described in Step 7 of Create an onboarding task above.

  • Automatically include all current and future tables

  • Delete the source files following ingestion

Runtime tab

Change the Lakehouse cluster

Considerations and limitations

General considerations and limitations

  • Full Load and Reload operations may experience a processing delay before data appears in Iceberg tables. The system waits up to 15 minutes after Full Load completion to ensure the files are complete. As a result, running full loads in Replicate (scheduled or manually) less than 15 minutes apart is not supported.
  • Empty Replicate tables (tables with no data files) are not detected or registered by the system.
  • If you enable Delete the source files following ingestion, the landing task cannot be reloaded. To reload data, you must regenerate it in Replicate.
  • The Qlik Replicate Advanced Run option "Tables are already loaded. Start processing changes from" is not supported. When Type 2 (historical data) is enabled, using this option will corrupt the history table.
  • Schema evolution is not supported.
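
As an illustration of the 15-minute limitation above, the following minimal sketch flags planned Full Load runs that are scheduled too close together. The function name is hypothetical, and the sketch simply compares the timestamps it is given. To stay on the safe side, you might measure from the expected completion of one load to the start of the next.

```python
from datetime import datetime, timedelta

# Wait window stated in the limitation above.
MIN_GAP = timedelta(minutes=15)


def find_runs_too_close(run_times, min_gap=MIN_GAP):
    """Return consecutive pairs of run times that are less than min_gap apart."""
    ordered = sorted(run_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier < min_gap
    ]


if __name__ == "__main__":
    start = datetime(2025, 1, 21, 8, 0)
    planned = [start, start + timedelta(minutes=10), start + timedelta(minutes=40)]
    for earlier, later in find_runs_too_close(planned):
        print(f"Too close: {earlier:%H:%M} and {later:%H:%M}")
```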

CDC vs Replicate Pipeline

In a regular pipeline ingesting from CDC sources, the Data is updated to metric always indicates the current freshness of the data (for example, 1 minute ago). This is because the pipeline has direct access to the database transaction log, which provides real-time knowledge of every INSERT, UPDATE, and DELETE event.

In a pipeline ingesting data from Replicate, however, the Data is updated to metric does not indicate the current freshness of the data, and may give the false impression that the data is stale (for example, 8 hours ago). This is because the Replicate pipeline has no direct access to the transaction log. Instead, Replicate writes files to an S3 bucket, and the pipeline checks the file's creation timestamp to determine data freshness. If there have been no new changes in the source database for a while, Replicate does not create a new file, and the Data is updated to timestamp grows older even though the data is still current.
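
The effect can be illustrated with a minimal sketch. It assumes, as a simplification of the description above, that the metric is the creation time of the most recent file. The function names are hypothetical and do not reflect the product's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def data_updated_to(file_created_times):
    """Return the newest file creation time, or None if there are no files."""
    return max(file_created_times, default=None)


def apparent_age(file_created_times, now):
    """Return how old the data appears, based only on file timestamps."""
    latest = data_updated_to(file_created_times)
    return None if latest is None else now - latest


if __name__ == "__main__":
    now = datetime(2025, 1, 21, 17, 15, tzinfo=timezone.utc)
    # The last change file was written 8 hours ago; the source has been idle since.
    files = [now - timedelta(hours=9), now - timedelta(hours=8)]
    print(apparent_age(files, now))
```

In this sketch the apparent age keeps growing while the source database is idle, although no changes are missing from the Iceberg tables.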
