Skip to main content Skip to complementary content

Transforming data

You can create reusable and rule-based data transformations as a part of your data pipeline. You can perform transformations as part of your data onboarding, or create reusable transformation data tasks. You can perform row-level transformations and create datasets that are either materialized as tables, or created as views that perform transformations on the fly.

  • You can perform explicit dataset transformations, or create global rules that transform multiple datasets. You can also filter a dataset to create a subset of rows.

  • You can add SQL transformations. A SQL transformation allows you to input a SQL SELECT query into a pipeline to define complex or simple transformations.

  • You can add visual transformation flows with sources, processors and targets to define complex or simple transformations.

Information noteData tasks operate in the context of their owner. For more information about required roles and permissions, see Data space roles and permissions.

Creating a transformation data task

The easiest way to create a transformation data task is to click ... on a storage data task and then selecting Transform data.

You can also click Add new in a data project and select Transform data. In this case you need to define which source data task to use.

  1. Define your source data and targets in Transform.

    You can either:

    • Select source datasets and click Add to target to add them to Target.

      You can then perform basic transformations on the datasets such as filtering data, or adding columns, in Datasets.

      For more information, see Managing datasets.

    • Select source datasets and click Add SQL transformation.

      A SQL transformation allows you to input a SQL SELECT query into a pipeline to define complex or simple transformations.

      For more information, see Adding SQL transformations.

    • Select source datasets and click Add transformation flow.

      The flow designer allows you to create a transformation flow with sources, processors and targets to define complex or simple transformations.

      For more information, see Adding transformation flows.

    Tip noteYou can also add more datasets from other storage data tasks by clicking Select source data.
  2. When you have added the transformations that you want, validate the datasets by clicking Validate datasets. If the validation finds errors, fix the errors before proceeding.

    For more information, see Validating and adjusting the datasets.

  3. Create a data model

    Click Model to set the relationships between the included datasets.

    For more information, see Creating a data model.

  4. Click Prepare to prepare the data task and all required artifacts. This can take a little while.

    You can follow the progress under Preparation progress in the lower part of the screen.

  5. When the status displays Prepared, you can run the data task .

    Click ... and then Run.

The data task will now start creating datasets to transform the data.

Information noteIt is not possible to change which datasets are included when you have started generating datasets.

Scheduling a transformation task

You can schedule a transformation task to be updated periodically. You can set a time based schedule, or set the task to run when input data tasks have completed running.

Click ... on a data task and select Scheduling to create a schedule. The default scheduling setting is inherited from the settings in the data project. For more information about default settings, see Transform default values.

You always need to set Scheduling to On to enable the schedule.

Information noteIf all datasets in the task are non-materialized, there is nothing to run as the transformation is performed on-the-fly with views. You can still create a schedule for a non-materialized transformation ask to run when the schedule condition is met. The task will complete immediately which can trigger a downstream task, for example a data mart. This allows you to build an event based pipeline schedule that includes non-materialized transformations without interrupting the pipeline flow.

Time based schedules

You can use a time based schedule to run the task regardless of when the different input sources are updated.

  • Select At specific time in Run the data task.

You can set an hourly, daily, weekly or monthly schedule.

Event based schedules

You can use an event based schedule to run the task when input data tasks have completed running.

  • Select On specific event in Run the data task.

You can select if you want to run the task when any of the input tasks have completed successfully, or when any of a selection of input tasks have completed successfully.

Information noteThe task will not run if any input task or downstream task is running when the schedule is triggered. The task is skipped until the next scheduled run.

Monitoring a transformation task

You can monitor the status and progress of a transformation task by clicking on Monitor.

For more information, see Monitoring an individual data task.

Reloading data

You can perform a manual reload of tables if the data is materialized as physical tables. This is useful when there are issues with one or more tables.

Information noteIf datasets are non-materialized, you must reload the source datasets in the upstream data task to refresh data.
  1. Open the data task and select the Monitor tab.

  2. Select the tables that you want to reload.

  3. Click Reload tables.

The reload will happen the next time the task is run. The reload process behaves differently depending on the history setting and transformation type of each dataset. This means that the reload process can differ between datasets in a data task.

You can cancel the reload for tables that are pending reload by clicking Cancel reload. This will not affect tables that are already reloaded, and reloads that are currently running will be completed.

Downstream tasks will be reloaded to apply changes, and to avoid backdating.

Downstream impact after reloading a Transform data task

Transformation reload downstream effects

The impact downstream depends on the type of reload operation executed, and the type of immediate downstream dataset. Standard processing means that the dataset will react and process data using the configured method for the specific dataset.

Example: Reloading a dataset via truncate and load

  • If the next dataset uses dataset transformations, it will be reloaded on next execution via a truncate and load.

  • If the next dataset is a SQL transformation or transformation flow, it will be reloaded using compare and apply.

Reloading a dataset with no history

In this case, there is no history to consider. To reduce processing on the target, the reload is performed by:

  1. Truncating the tables.

  2. Loading current data from the upstream data task.

Downstream tasks will be reloaded to apply changes.

Reloading a dataset with history enabled

The reload is performed by:

  1. Truncating current, prior, and changes tables.

  2. Loading data from the upstream data task, including prior tables.

Reloading a dataset based on SQL transformation or transformation flow

  • Truncate and reload

    Information noteThis option may cause loss of history.
    1. Truncating the current and change tables.

    2. Running the query and loading it to the current tables.

  • Reload and compare

    1. Running the query and comparing it with the current tables.

    2. Adding changes.

Information noteWhen a dataset based on SQL transformation or transformation flow is reloaded because of an upstream task reload, it is always reloaded by comparing and applying. If you want to truncate and reload it, you must issue a specific reload for these tables. In this case you must consider the effect on downstream tables as well.

Transformation settings

You can set properties for the transformation data task .

  • Click Settings.

Warning noteIf the task has already been run, changing a setting other than Runtime settings require that you recreate the datasets.

General settings

  • Database

    Database to use in the data source.

  • Data task schema

    You can change the name of the storage data task schema. Default name is store.

  • Internal schema

    You can change the name of the internal storage schema. Default name is store__internal.

  • Prefix for all tables and views

    You can set a prefix for all tables and views created with this task.

    Information noteYou must use a unique prefix when you want to use a database schema in several data tasks.
  • Materialized

    You can select to only create views that perform transformations on the fly (Non-materialized), or create both tables and views(Materialized).

  • History

    You can keep historical change data to let you easily recreate data as it looked at a specific point in time. You can use history views and live history views to see historical data.

Runtime settings

  • Parallel execution

    You can set the maximum number of data connections for full loads to a number from 1 to 5.

  • Warehouse

    The name of the cloud data warehouse.

Limitations

  • It is not possible to change data types in a transformation data task when the Non-materialized option is selected.

  • Non-materialized SQL-based data sets provide the full result set only, incremental filtering is not supported. Non-materialized SQL-based data sets also provide current data only and do not keep historical records.

Learn more

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – let us know how we can improve!