Creating and managing data marts | Qlik Cloud帮助
跳到主要内容 跳到补充内容

Creating and managing data marts

Once you have onboarded your data, you can then create data marts using the data from the Storage or Transform tasks. You can create any number of data marts depending on your business needs. Ideally, your data marts should contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the Sales department.

信息注释Data mart creation is available with a Qlik Talend Cloud Premium or Qlik Talend Cloud Enterprise subscription.
信息注释数据任务在其所属项目的所有者的上下文中运行。有关所需角色和权限的更多信息,请参阅数据空间角色和权限

除了在数据仓库中存储表外,您还可以将表存储为由数据平台管理的 Iceberg 表。此选项目前仅适用于 Snowflake 项目。在任务设置中的表类型下选择 Snowflake 托管的 Iceberg 表,就可以实现这一点。

Prerequisites

You can use data tasks of the following types as source for a data mart:

  • Storage

  • Transform

Before you can create a data mart, you need to do the following in the source data tasks:

  • Populate the datasets with data that you want to use in your data mart. For more information, see 将数据引入数据仓库.
  • Create a dataset relational model to define the relationships between the source datasets. For more information, see 创建数据模型.

    警告注释All source datasets must have keys.

Creating a data mart

To create a data mart:

  1. Open your project.

  2. Do one of the following:

    • Click Create in the top right and select Create data mart.
    • In the source data task, click in the bottom right corner and then select Create data mart.

    The Create data mart dialog opens.

    创建数据集市对话框

  3. Provide a name for your data mart and, optionally, provide a description as well.

  4. If you want to configure the data mart later, clear the Open check box and then click Create. Otherwise, just click Create.

    The Data mart opens on the Data mart tab.

    数据集市

  5. Select your source data as described in Selecting your source data
  6. If you want the data mart to contain dimensions, add dimensions as described in Adding dimensions and a fact to the data mart
  7. If you want the data mart to contain a fact, add a fact as described in Adding a fact
  8. If the data mart contains both dimensions and a fact, add the dimensions to your star schema as described in Building a star schema
  9. Create the datasets in your data mart and populate them with data as described in Populating your data mart

Selecting your source data

You select your source data from the datasets in the source data task.

To do this:

  1. Click the Select source data button in the middle of the tab or click the Select source data toolbar button.

    The Select source data dialog opens.

  2. From the Pipeline projects drop-down list, select which project to get source data from.

    You can add datasets from the current project, or from another project. To add datasets from another project:

    • You must have at least Can consume role in the space of the consumed project.

    • Both projects must be on the same data platform.

    If the selected project is under version control, you can select which branch to use as the source. For more information about cross-project pipelines, see 构建跨项目管道.

  3. From the Data tasks drop-down list, select a Storage task, or a Transform task if you created transformations.

  4. Either leave the default % to search for all datasets or enter the name of a specific dataset to find. Then click Search.

  5. Select the desired datasets and then click Add selected tables.

  6. Click OK to close the dialog and then proceed to Add a fact and/or Add dimensions.

信息注释You can repeat the process to add datasets from other data tasks.

Adding dimensions and a fact to the data mart

Once you have selected you source data, you can then proceed to build your data mart. A data mart can have a fact dataset, dimensions datasets, or a combination of both (where the dimension datasets are logically related to the fact dataset).

Adding dimensions

To add a dimension.

  1. Click the Add dimension button.

    The Add dimension dialog opens.

  2. The following settings are available:

    • Most granular dataset: Select a dataset.
    • Name: Specify a display name for the dimension. The default is the most granular dataset name.
    • Description: Optionally, provide a description.
    • History type: Select one of the following:
      • Type 1: The existing record in the dimension will be updated whenever the corresponding record in the Storage is updated.
      • Type 2: A new record will be added to the dimension whenever the corresponding record in the Storage is updated.
    • Related dataset to denormalize: Any datasets that can be denormalized in the dimension dataset (according to the relationships in the source data asset model) will be available for selection here.

      Example of a dimension that can be denormalized

      具有相关维度的维度示例

  3. Click OK to save your settings.

    The dimension will be added to the Dimensions list on the left.

See also Role-playing dimensions.

Viewing information about a dimension

When you select a dimension, the Source relational model tab will be shown in the center pane. This tab shows the source datasets that are consolidated in the dimension. Datasets that you chose to denormalize when you added the dimension will be shown selected (and grayed out).

Products dimension denormalized with the Categories and Suppliers dataset

具有相关维度的关系模型示例

Adding a fact

To add a fact:

  1. Click the Add fact button.

    The Add fact dialog opens.

  2. The following settings are available:

    • Fact: Select a dataset to be the fact. The dataset should define the granularity of the fact you are creating.
    • Name: Specify a display name for the fact. The default is the fact name.
    • Description: Optionally, provide a description.
    • Related datasets to denormalize: Any datasets that can be denormalized in your fact dataset will be available for selection here.
    • Advanced
      • Use current data: When selected (the default), the fact will not contain a transaction date column.
      • Choose transaction date: To locate data according to a specific transaction date, select this option and then select a date column. This is useful if your star schema contains type 2 dimensions and you need to find the correct data for a specific transaction. For example, if a customer has multiple addresses, it might be possible to find the correct address according to the order date.

        Example use case:

        A retailer needs to create a data mart to analyze the relationship between orders and customers. The data mart should be able to answer queries such as: Which US state had the highest sum of orders in Q4 2022?

        If the retailer selects the Use current data option, only the most current record version in the Customers table will be included in the calculation.

        Ignoring the transaction date will result in inaccurate data as illustrated below:

        显示选择“使用当前数据”时 Orders 事实表和 Customers 维度之间的关系。

        If the retailer selects the Choose transaction date option, however, the customer's orders will be associated with the correct record version in the Customers table.

        This will allow the retailer to accurately calculate the total sum of orders per state in Q4 2022.

        显示选择“使用交易日期”时 Orders 事实表和 Customers 维度之间的关系。

        提示注释Note that the transaction date can be leveraged differently in each data mart, according to business needs. For example, in one data mart it could be used to analyze order dates, while in another data mart it could be used to analyze shipping dates.
  3. Click OK to save your settings.

    The fact will be added to the Facts list on the left.

Viewing information about the fact

When you select a fact, the following tabs are shown in the center pane:

  • Star schema model (default): Shows a graphical representation of the dataset relationships within the data mart.
  • Fact model: Shows any datasets related to the fact dataset. Datasets that you chose to denormalize when you added the fact will be shown selected (and grayed out).

    Orders Details fact with a denormalized Orders dataset

    具有反规范化数据集的事实示例

  • Transaction date: The name of the transaction column if you selected the Choose transaction date option when adding the fact.

Unknown and late-arriving dimension handling

Every dimension contains -1 and 0 rows, which are the dimension’s business key (object ID). Row -1 is reserved for late-arriving dimensions, while 0 is reserved for unknown dimensions.

Unknown dimensions

“Unknown” is data that was unavailable when the dimension was originally created. For example, let’s say you have an ORDER fact with a SHIPPER column that is the business key for the SHIPPER dimension. If the ORDER fact row has not been shipped yet - and thus has NULL for the SHIPPER column - it will become related to the 0 record (which denotes an Unknown for the dimension). When the ORDER fact is later updated with a SHIPPER value (for example, USPS), the related dimension ID (0) will be updated in turn.

信息注释"Unknown" could also be missing or non-existing data, which might never be added.

Late-arriving dimensions

A late-arriving dimension has a key which exists in the new fact data, but which does not yet exist in the dimension. For example, if the SHIPPER in the ORDER fact has a value of "NEWSHIP" for a new shipper and that business key does not yet exist in the DIM_SHIPPER dimension, the data mart processing will relate that fact to the -1 row. This denotes a missing dimension member in your dimension table. When the "NEWSHIP" business key arrives for the DIM_SHIPPER dimension, its dimension row will be created, and the fact record updated to align with the previously missing dimension.

Building a star schema

Once you have added dimensions to your data mart, you can then proceed to connect them to your fact dataset, thereby creating a star schema.

To do this:

  1. Select your fact in the Facts list on the left.
  2. Select which dimensions to add from the Recommended dimensions list on the right.

    Recommended dimensions are shown connected to the fact dataset with a dotted line.

    In the image below, some of the dimensions were added earlier and are therefore connected with a solid gray line.

    具有推荐维度的星型模式示例

  3. Click Apply to add the dimensions.

    The dimensions will be shown connected to the fact dataset with a solid gray line.

  4. To close the Recommended dimensions panel, click 关联.

Populating your data mart

Once you have designed your data mart, you can then proceed to populate it.

To do this:

  1. Click the Prepare toolbar button in the top right.

    The preparation process includes creating datasets and views in the data mart, and updating the Catalog.

    You can follow the progress under Preparation progress in the lower part of the screen.

    After the preparation is complete, the Prepare button changes to Run.

  2. Optionally, validate the data mart as described in Validating and synchronizing your data mart below.

  3. Click the Run button.

    The window switches to Monitor view, showing the loading progress and status of the datasets in your data mart.

    数据集市监控

Every source record will be processed by the data mart, even deleted records. This is done to ensure that historical information is kept.

信息注释For dimension datasets with type 2 history, the processed records count will include all version rows for a record and show a higher value than the number of actual records processed.

Validating and synchronizing your data mart

Validating the data mart ensures that the data mart metadata is identical to the corresponding metadata in the Storage (or Transform if defined). Validating the data mart also compares the created metadata against the current star schema design. For example, if you run the validation after adding a dimension to an already created data mart, the validation will fail.

To validate the data mart:

  1. Either select Validate datasets from the 更多 menu to the right of the Run button or click the Validate datasets button at the bottom right of the window.

    A Validation is completed message will be shown.

  2. If the metadata is not in sync or if there are star schema design conflicts, the Validate and adjust pane at the bottom of the window will automatically open with the validation report.

    具有星型模式设计冲突的数据集市

    具有验证错误的数据集市

  3. To resolve any Pending design changes issues, click the 更多 button in the top right and select Prepare. If the value of the Can be altered without data loss column is Yes, an ALTER operation will be performed. Otherwise, the data mart tables will be recreated.

    Note that all Validation errors need to be resolved manually.

Managing data marts

This section describes the various options available for managing your datasets and data marts.

Refining facts or dimensions

In the Datasets tab, you can perform various operations to refine your facts and dimensions such as creating transformation rules (for instance, replacing columns values), and adding column-level expressions. The Datasets tab is located to the right of the Data mart tab:

Datasets tab

用于管理事实和维度数据集的数据集选项卡

Adding rules

For an explanation of how to add global rules, see 创建规则以转换数据集

信息注释For more information about rules as well as other operations you can perform on your datasets, see 管理数据集. Note that some of the options such as filtering and renaming datasets are not available for Data Mart data tasks.

Adding new columns

You can add new columns to the target dataset.

  • Click + Add.

    Provide a name for the column, and set an expression to define the column data.

    For more information, see 向数据集中添加列.

  • Click 向下 next to Add and select Add column from source.

    Select a column from the source dataset.

Reordering columns

You can change the ordinal position of a column.

  1. Select a column.

  2. Click 更多 and then Reorder.

  3. Use the arrows to move the column up or down.

  4. Close Change ordinal when you are ready.

Role-playing dimensions

A role-playing dimension is the same dimension used multiple times within the same star schema, but with different meanings. This is commonly seen with the Date and Customer dimensions. For example, your star schema might have two Date entities, one representing the Order Date and the other representing the Received Date.

To add or edit a dimension's role name:

  1. Click the 更多 icon in the dimension node and select Edit dimension name in this star schema.
  2. In the Edit dimension name in this star schema dialog, enter a name (or edit the existing name) in the Dimension name in this star schema field and click OK.

    The new name will appear below the original dimension name.

Additional management options

The following table describes additional management options:

To Do this
Add additional source datasets See Selecting your source data.
Add additional facts See Adding a fact
Add additional dimensions See Adding dimensions and a fact to the data mart.
Delete a dimension Select the dimension in the Dimensions pane, and then select Delete from the menu.
Delete a fact Select the fact in the Facts pane, and then select Delete from the menu.
Recreate a data mart

Click the button in the top right and select Recreate tables. Recreating a data mart might be necessary if, for example, there are changes in the Storage that cannot be automatically synced to the data mart.

信息注释如果个别表存在问题,建议首先尝试重新加载表,而不是重新创建它们。重新创建表可能会导致历史数据丢失。如果存在中断更改,还必须准备使用重新创建的数据任务的下游数据任务,以重新加载数据。
Stop a running data mart task Click the Stop button in the top right.
Prepare a data mart task

Click the button in the top right and select Prepare to sync the datasets with the Transform or Storage assets and resolve any design conflicts. This prepares a task for execution. This includes:

  • Validating that the design is valid.

  • Creating or altering the physical tables and views to match the design.

  • Generating the SQL code for the data task.

  • Creating or altering the catalog entries for the task output datasets.

You can follow the progress under Preparation progress in the lower part of the screen.

信息注释在准备任务之前,停止所有直接下游的任务。

Scheduling a data mart task

You can schedule a data mart task to be updated periodically. You can set a time based schedule, or set the task to run when input data tasks have completed running.

Click ... on a data task and select Scheduling to create a schedule. The default scheduling setting is inherited from the settings in the project. For more information about default settings, see 数据集市默认值.

You need to set Scheduling to On to enable the schedule.

Time based schedules

You can use a time based schedule to run the task regardless of when the different input sources are updated.

  • Select At specific time in Run the data task.

You can set an hourly, daily, weekly or monthly schedule.

Event based schedules

You can use an event based schedule to run the task when input data tasks have completed running.

  • Select On specific event in Run the data task.

You can select if you want to run the task when any of the input tasks have completed successfully, or when any of a selection of input tasks have completed successfully.

信息注释The task will not run if any input task or downstream task is running when the schedule is triggered. The task is skipped until the next scheduled run.

Reloading data

You can perform a manual reload of data. This is useful when there are issues with one or more tables.

  1. Open the data task and select the Monitor tab.

  2. Select the tables that you want to reload.

    When a dimension is selected for reload, all facts that use that dimension will also be reloaded to maintain integrity.

  3. Click Reload tables.

You can cancel the reload for tables that are pending reload by clicking Cancel reload. This will not affect tables that are already reloaded, and reloads that are currently running will be completed.

The reload is performed by:

  1. Truncating the selected dimensions and facts.

  2. Loading the selected dimension tables from the upstream data task.

  3. Loading the fact tables from the upstream data task. This includes:

    • Explicitly selected fact tables.

    • Fact tables that are related to a dimension that is reloaded.

信息注释If you reload dimension x and related fact a, and then create a new fact b using dimension x as well, fact b will not be reloaded automatically. You must reload the new fact b manually.

Deleting a task

如果数据任务未运行,并且同一项目中没有下游任务依赖项,则可以删除该数据任务。

  • 在项目的管道项目视图中,单击任务上的更多,然后选择删除

此任务创建的工件(表和视图)也将被删除,除非您选择保留它们。

信息注释请记住,您保留的工件将不再由该任务更新。

查看任务信息

点击菜单栏上的 信息 以查看任务信息,例如:

  • 所有者

  • 空间

  • 数据平台

  • 项目 ID

  • 数据任务运行时 ID

Data mart settings

Click the Settings toolbar button to open the Settings: <data-mart-name> dialog.

警告注释If the task has already been run, changing a setting other than Runtime settings require that you recreate the datasets.

General settings

In the General tab, the following settings are available:

  • Database: The database in which the data mart will be created
  • Data task schema: The schema in which the datasets will be created
  • Internal schema: The schema in which the internal datasets will be created
  • Default capitalization of schema name

    You can set the default capitalization for all schema names. If your database is configured to force capitalization, this option will not have effect.

  • 所有表格和视图的前缀

    可以为使用此任务创建的所有表和视图设置前缀。

    信息注释如果要在多个数据任务中使用数据库架构,则必须使用唯一的前缀。
  • 发布至目录

    选择此选项可将此版本的数据作为数据集发布到目录。目录内容将在下次准备此任务时更新。

    有关目录的更多信息,请参阅通过目录工具理解您的数据

Runtime settings

In the Runtime tab, the following settings are available:

  • Parallel execution: Enter the maximum number of database connections that Qlik Cloud is allowed to open for the task. The default number is 10.
  • Warehouse: Only relevant for Snowflake. The name of the Snowflake data warehouse.

View type settings

The View type settings are only applicable for Snowflake.

  • Standard views

    Use Standard views for most cases.

  • Snowflake secure views

    Use Snowflake secure views for views designated for data privacy or sensitive information protection, such as views created to limit access to sensitive data that should not be exposed to all users of the underlying tables.

    信息注释 Snowflake secure views can execute more slowly than Standard views.

Table type settings

这些设置仅在以 Snowflake 为数据平台的项目中可用。

  • 表类型

    您可以选择使用哪种表格类型:

    • Snowflake 表格

    • Snowflake-托管 Iceberg 表格

      您必须在 Snowflake 外部卷中设置外部卷的默认名称。

  • 要使用的云存储文件夹

    选择将数据放置到暂存区时要使用的文件夹。

    • 默认文件夹

      这将创建具有该默认名称的文件夹:<project name>/<data task name>

    • 根文件夹

      将数据存储在存储的根文件夹中。

    • 文件夹

      指定要使用的文件夹名称。

  • 与 Snowflake 开放目录同步

    启用此选项可让 Snowflake 开放目录管理云文件存储中的文件。

Best practices

  • It is not possible to change source for a dataset if facts and dimensions have been added. If you want to be flexible, you can add a preceding Transform task with non materialized views that can be used to change sources and also to have a model for all source tasks.

Limitations

使用符合所有这些条件的源数据集时存在局限性:

  • 通过 SQL 转换或转换流创建

  • 非物化

  • 历史数据存储(类型 2)已关闭

这些数据集被认为在每次运行时都会更新,这可能会影响效率和成本。您可以通过以下方式缓解这种情况:

  • 更改要物化的源数据集。

  • 使用显式数据集转换

  • 创建转换多个数据集的全局规则。

关系

  • 无法将两个数据集的数据关联起来。创建一个转换任务,在数据模型中定义关系,并将转换任务用做任务的源。

  • 当两个数据集在数据模型中关联时,即使只选择了其中一个数据集,任务中也会同时提供这两个数据集。

本页面有帮助吗?

如果您发现此页面或其内容有任何问题 – 打字错误、遗漏步骤或技术错误 – 请告诉我们!