
Data quality for connection-based datasets

Information note: You need a Qlik Talend Cloud Enterprise subscription.

To benefit from semantic type discovery and data quality readings on your connection-based datasets, you first need to set up a prerequisite on your data connections in the context of data products.

  • Data quality is supported in both pullup and pushdown modes for Snowflake and Databricks datasets.

  • Data quality is supported in pullup mode for datasets based on the following databases:

    • Amazon Athena

    • Amazon Redshift

    • Apache Hive

    • Apache Phoenix

    • Apache Spark

    • Azure SQL Database

    • Azure Synapse Analytics

    • Cassandra

    • Cloudera Impala

    • Couchbase

    • DynamoDB

    • Google BigQuery

    • Marketo

    • Microsoft SQL Server

    • MongoDB

    • MySQL Enterprise Edition

    • Oracle

    • PostgreSQL

    • Presto

    • SAP HANA

    • Snowflake

    • Teradata

Creating connection-based datasets

You can create connection-based datasets from the Catalog, or from a pipeline project.

Creating datasets from a pipeline project lets you perform all your data integration within a project using data tasks. For more information, see Creating a data pipeline project.

Creating datasets from the Catalog

When you do not need a pipeline project, you can create datasets directly from the Catalog, compute their data quality, and consume them through data products.

  1. In Qlik Talend Data Integration > Catalog, click Create new > Dataset.
  2. Select the connection and click Next.
  3. Select the datasets and click Next.

    When a dataset does not appear in the list, it means that it is not in one of the supported formats:

    • Excel files: .xls, .xlsx
    • Delimited text files: .csv, .txt
    • JSON files: .json
    • XML files: .xml
    • Qlik data files: .qvd (QlikView Data), .qvx (QlikView Exchange)
    • Parquet files: .parquet
    • KML files: .kml (Geographic data)

  4. Select the space and click Create datasets. You are redirected to the Catalog, where the new datasets appear in the list.

You can now compute the data quality and add the datasets to data products. For more information, see Configuring data quality computing.

Creating datasets from a pipeline project

  1. In Qlik Talend Data Integration > Connections, click Create connection.

  2. Configure your access to the database using the credentials of a user that has sufficient permissions and access to the tables you want to import.

  3. In Qlik Cloud Analytics, click Create, and then Data connection.

  4. Configure your access to the same database as previously, ideally using the credentials of the same user, or of a user that has at least READ permissions on the tables.

  5. (For Snowflake only) In the Role field, enter an existing role created in the Snowflake database that has the following privileges on these objects (a SQL sketch follows this list):

    • USAGE on WAREHOUSE

    • USAGE on DATABASE

    • USAGE on SCHEMA

    • CREATE TABLE on SCHEMA

    • CREATE FUNCTION on SCHEMA

    • CREATE VIEW on SCHEMA

    • SELECT on TABLE
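
    The following Snowflake SQL is a minimal sketch of how such a role could be created and granted these privileges. All object names (data_quality_role, my_wh, my_db, my_schema, my_user) are hypothetical placeholders; adapt them to your environment. Note that this sketch grants SELECT on all tables in the schema, which you can restrict to specific tables.

      -- Sketch only: a role with the privileges listed above (placeholder names).
      CREATE ROLE IF NOT EXISTS data_quality_role;
      GRANT USAGE ON WAREHOUSE my_wh TO ROLE data_quality_role;
      GRANT USAGE ON DATABASE my_db TO ROLE data_quality_role;
      GRANT USAGE ON SCHEMA my_db.my_schema TO ROLE data_quality_role;
      GRANT CREATE TABLE ON SCHEMA my_db.my_schema TO ROLE data_quality_role;
      GRANT CREATE FUNCTION ON SCHEMA my_db.my_schema TO ROLE data_quality_role;
      GRANT CREATE VIEW ON SCHEMA my_db.my_schema TO ROLE data_quality_role;
      GRANT SELECT ON ALL TABLES IN SCHEMA my_db.my_schema TO ROLE data_quality_role;
      -- Assign the role to the user referenced in your connection.
      GRANT ROLE data_quality_role TO USER my_user;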

  6. (For Databricks only) In Databricks, you must grant the following privileges on the database (a SQL sketch follows this list):

    • CREATE TABLE

    • CREATE VOLUME

    • MODIFY

    • READ VOLUME

    • SELECT

    • USE SCHEMA

    • WRITE VOLUME
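
    The following Databricks (Unity Catalog) SQL is a minimal sketch of granting these privileges to a principal. The catalog, schema, and principal names (my_catalog.my_schema, qlik_user@example.com) are hypothetical placeholders:

      -- Sketch only: grant the listed privileges on the schema used by the connection.
      GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `qlik_user@example.com`;
      GRANT SELECT, MODIFY, CREATE TABLE ON SCHEMA my_catalog.my_schema TO `qlik_user@example.com`;
      GRANT CREATE VOLUME, READ VOLUME, WRITE VOLUME ON SCHEMA my_catalog.my_schema TO `qlik_user@example.com`;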

  7. Back on the Qlik Talend Data Integration homepage, click Add new and then Create data project.

  8. Use your connection from step 2 as source for your project and start building your pipeline. See Creating a data pipeline project for more information.

  9. At any point in your pipeline, select a data task, go to Settings, and then to the Catalog tab, where you can select the Publish to Catalog checkbox.

    This means that this version of the dataset will be published to the Catalog when the data project is prepared and run. You can also select this option at the project level.

  10. Run your data project.

After you run your data project, the new datasets are added to the Catalog, where you can access quality indicators and more details about their content. This configuration also makes it possible to use the datasets as a source for analytics apps.

You can add as many datasets as necessary before building your data product. Since the Catalog can be accessed from both the Qlik Talend Data Integration hub and the Qlik Cloud Analytics Services hub, you can open your datasets in your preferred location, and the right connection is used depending on the context.

Quality compute in pullup/pushdown

Using the Compute or Refresh button on the Overview of your dataset triggers a quality calculation on a sample of 1,000 rows of the database.

This operation happens in pullup mode by default. For Snowflake and Databricks datasets, it can happen either in pullup mode (the default) or in pushdown mode, on the database side.

A sample of 100 rows is then sent back to Qlik Cloud, where you can display it as a preview with up-to-date semantic types and validity and completeness statistics. This sample is then stored in MongoDB.
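
To make the distinction concrete, here is a purely conceptual SQL sketch, not the queries Qlik actually runs: in pullup mode, a row sample is pulled out of the database and profiled in Qlik Cloud, while in pushdown mode the aggregation is delegated to the database and only the results travel back. The table and column names (my_schema.orders, email) are hypothetical.

    -- Conceptual illustration only.
    -- Pullup-style: pull a sample of rows into the engine for profiling.
    SELECT * FROM my_schema.orders LIMIT 1000;

    -- Pushdown-style: let the database compute a completeness statistic,
    -- so only aggregated results leave the warehouse.
    SELECT COUNT(*)                      AS total_rows,
           COUNT(email)                  AS non_null_emails,
           COUNT(email) * 1.0 / COUNT(*) AS email_completeness
    FROM my_schema.orders;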

Information note: Data quality cannot be computed for datasets that have more than 500 columns.

Prerequisites for data quality in pushdown mode on Databricks

To compute data quality in pushdown mode on Databricks, Qlik needs to sync certain quality reference data, such as semantic types, to your Databricks instance. It also leverages some advanced features of Databricks.

For this feature to function properly, the following prerequisites must be met on your Databricks instance:

  • Unity Catalog must be enabled.

  • Users associated with the Databricks connection must have permissions to create a table, create a schema, create a volume, and write a volume.

    Qlik will create a schema named qlik_internal in the database specified in your connection. This schema will not be automatically removed by Qlik. You will need to delete it manually if you stop using SaaS cloud infrastructure (see the cleanup sketch after this list).

  • Collations must be enabled.
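
If you stop using the feature and want to remove the qlik_internal schema mentioned above, a minimal cleanup sketch in Databricks SQL could look like the following, where my_catalog is a hypothetical placeholder for the catalog or database used by your connection:

    -- Manual cleanup sketch: remove the schema Qlik created, including its contents.
    DROP SCHEMA IF EXISTS my_catalog.qlik_internal CASCADE;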

As for limitations, note that date recognition in string columns is limited to the ISO-8601 format (for example, 2024-05-17).
