
Preparation versioning

When working on your data, you can decide to capture the state of your preparation by creating a version.

Creating a version can be done at any moment, even when no steps have been applied yet. It allows you to freeze a preparation in a given state, with a timestamp and a short description.

Versions tab opened.

Use the Manage versions button to create a new version of your preparation, or consult previously created versions in read-only mode. Each version can be individually exported.

Adding versions to your preparation is a good way to track the changes made to it over time. More importantly, it ensures that Talend Jobs always use the same state of a preparation, even while the preparation is still being worked on. Versions can be used in Data Integration as well as Big Data Jobs.

Preparation versions are propagated when sharing or moving a preparation across your folder structure, but not when you copy it or apply it to a new dataset.
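
To make the idea of freezing a state more concrete, the following Python sketch models a version as an immutable snapshot of the steps applied so far. The Preparation and Version classes, their fields and the add_version method are hypothetical names chosen for illustration; they are not part of any Talend API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    """Read-only snapshot of a preparation's steps (hypothetical model)."""
    number: int
    description: str
    created_at: datetime
    steps: tuple  # immutable copy of the steps at creation time


@dataclass
class Preparation:
    """Hypothetical stand-in for a preparation: a name and an ordered list of steps."""
    name: str
    steps: list = field(default_factory=list)
    versions: list = field(default_factory=list)

    def add_version(self, description: str) -> Version:
        # Freeze the current steps; later edits do not affect this snapshot.
        version = Version(
            number=len(self.versions) + 1,
            description=description,
            created_at=datetime.now(timezone.utc),
            steps=tuple(self.steps),
        )
        self.versions.append(version)
        return version


prep = Preparation("customers")
empty = prep.add_version("Initial state")  # allowed even when no steps exist yet
prep.steps.append("Remove trailing and leading characters")
prep.steps.append("Change to title case")
v2 = prep.add_version("Fixing formatting errors in names")
prep.steps.append("Delete these filtered rows")  # the current state keeps evolving

print(v2.steps)    # the two steps captured when the version was created
print(prep.steps)  # the current state, including the later step
```

In this model, anything that reads v2 always sees the same two steps, no matter how the current state changes afterwards.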

Creating preparation versions

In the following example, you will perform a few preparation steps on your data, create versions at two different moments, and see how you can switch between your versions, as well as switch back to the current state of your preparation.

The dataset used here contains customer data such as names, occupations, phone numbers and email addresses, but it requires some cleansing. The columns containing the customers' names have formatting inconsistencies, such as leading or trailing whitespace and inconsistent case. In addition, various phone and email entries are invalid.

Dataset containing customer data.

As you progress in your preparation, you are going to create two versions that reflect the state of your preparation at two different times.

Procedure

  1. Click the header of the First_name column, and while pressing the Ctrl key, click the header of the Last_name column.

    The content of the two columns is now selected.

  2. Apply the Remove trailing and leading characters and the Change to title case functions to remove whitespaces and harmonize the case.
    Remove trailing and leading characters and Change to title case functions applied.

    Removing those formatting errors marks the first big step in your preparation, and you are going to create a version to track these changes.

  3. Click the Manage versions button located in the header bar.

    The Functions panel is replaced with the Versions panel. This panel is empty since no versions exist for this preparation yet.

    Versions panel opened.

    Adding new versions via the Manage versions button is only available to Talend Data Preparation users with administrator rights. Other users are only able to consult existing versions in read-only mode.

  4. Click the Add version button.
  5. Enter a quick description of the version in the corresponding field, Fixing formatting errors in names in this example, and click Add version.
    Versions panel opened.

    The version is now listed in the Versions panel with a timestamp, and the description you added before.

    Versions panel opened with one version number.
  6. Click the version to access it in read-only mode.

    You can apply filters and browse the data, but you cannot apply functions on it.

  7. To leave the read-only mode and resume preparing your data, click the Switch to current state button located in the header bar.

    You are now back to the edit mode.

  8. To cleanse the remaining invalid entries from the Phone and Email columns, click the menu icon on the top left corner of the grid, and select Display rows with invalid or empty values.
  9. From the Functions panel, select the Delete these filtered rows function.
    Delete these filtered rows option.

    All the invalid values have been removed from your dataset, and you are going to create another version to capture this state.

  10. Repeat steps 3 to 5 to create a new version, but this time, enter Removing all invalid values as description.

    Your two versions are now listed in the Versions panel and can be accessed in read-only mode.

    Versions panel opened with two version numbers.

Results

You have created two versions of your preparation, in order to capture its state at two different steps of the cleansing process. You can choose to export one of these versions, use it in a Talend Job, or continue to edit the current state of your preparation.
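
For readers who think in code, the cleansing performed in this procedure can be approximated with a short Python sketch. The column names come from the example dataset, but the sample rows and the two validity patterns are assumptions made for illustration; Talend Data Preparation applies its own rules to decide which values are invalid.

```python
import re

# Assumed validity rules, for illustration only.
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def fix_name(value: str) -> str:
    """Trim leading and trailing whitespace, then convert to title case."""
    return value.strip().title()


def is_valid(row: dict) -> bool:
    """Keep a row only if phone and email match the assumed patterns (empty values fail)."""
    return bool(PHONE_RE.match(row["Phone"])) and bool(EMAIL_RE.match(row["Email"]))


def cleanse(rows: list[dict]) -> list[dict]:
    # Step 1: fix formatting in both name columns.
    formatted = [
        {**row, "First_name": fix_name(row["First_name"]), "Last_name": fix_name(row["Last_name"])}
        for row in rows
    ]
    # Step 2: delete rows with invalid or empty phone or email values.
    return [row for row in formatted if is_valid(row)]


rows = [
    {"First_name": "  jane ", "Last_name": "DOE", "Phone": "555-010-1234", "Email": "jane.doe@example.com"},
    {"First_name": "john", "Last_name": " smith", "Phone": "n/a", "Email": "john.smith@example.com"},
    {"First_name": "ANA ", "Last_name": "lopez", "Phone": "555-010-9876", "Email": "ana.lopez"},
]

for row in cleanse(rows):
    print(row)
```

The first version in the procedure corresponds to the state after step 1 of the sketch, and the second version to the state after step 2.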

Using a version in a Talend Job

Preparation versions can be used in data integration or Big Data Jobs in Talend Studio.

In Talend Studio, the tDataprepRun component allows you to reuse a preparation, or any of its versions, and apply it on data with the same model.

Note: To use the tDataprepRun component with Talend Cloud Data Preparation, you need Talend Studio 7.1 or later.

You can still use a preparation in its current state. However, using a specific version ensures that your Jobs always apply the same state of the preparation, even if it is still being worked on, which provides more consistency.

The following example illustrates a Job that applies an existing preparation version to a Salesforce input and writes the result to a Redshift database.

Job illustrated in Talend Studio.

This preparation was made on a dataset containing basic customer information such as names, phone numbers and email addresses. A few steps have been applied to remove formatting errors in the name entries, and to delete invalid values from the phone numbers.

Cleansing steps already made to the preparation.

Two versions have been created during the preparation: one after the first two steps, and another one after the third step.

Versions illustrated.

Before you begin

  • You have created a preparation with at least one version in Talend Cloud Data Preparation. In this case the existing preparation is called contacts cleansing.
  • The data imported from Salesforce must have the same schema as the dataset used to create the preparation in the first place.

Procedure

  1. In Talend Studio, create a new Standard or Spark Job.
  2. In the design workspace of Talend Studio, add a tSalesforceInput, a tDataprepRun, a tRedshiftOutput, and link them together using two Row > Main links.
  3. Select the tSalesforceInput component and click the Component tab to define its basic settings.

    Make sure that the schema of the tSalesforceInput component matches the schema expected by the tDataprepRun component.

  4. Select the tDataprepRun component and click the Component tab to define its basic settings.
    tDataprepRun component properties in Talend Studio.
  5. Enter your Talend Cloud Data Preparation connection information.
  6. Click Choose an existing preparation to display a list of the preparations available in Talend Cloud Data Preparation.
    Select an existing preparation dialog box opened in Talend Studio.
  7. Select the checkbox in front of contacts cleansing, which contains the preparation version that you want to apply, and click OK.
  8. Click Choose a version to select from the list of available versions for your preparation. In this case, select version 1.
    Set the version dialog box opened in Talend Studio.

    By default, the Job uses the current state of the selected preparation. In a collaborative context, using the current state instead of a fixed version means that someone may have changed the preparation without your knowledge, so you cannot know exactly what the outcome of your Job will be. This is why it is safer to use a version in your Jobs.

  9. Click Fetch Schema to retrieve the schema of contacts cleansing.
  10. Select the tRedshiftOutput component and click the Component tab to define its basic settings.
  11. Save your Job and press F6 to run it.

Results

All the preparation steps included in the version of the preparation have been applied to your data, directly in the flow of your Job.
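
Conceptually, the Job checks that the incoming data has the expected schema and then applies the steps frozen in the selected version to each row. The Python sketch below illustrates that flow under stated assumptions: the column list, the VERSIONS registry and the run_version function are hypothetical, and the sketch does not reproduce how tDataprepRun is implemented.

```python
EXPECTED_COLUMNS = ["First_name", "Last_name", "Phone", "Email"]  # assumed schema


def trim_and_title_names(row: dict) -> dict:
    """Stand-in for the two name-formatting steps captured in version 1."""
    return {
        **row,
        "First_name": row["First_name"].strip().title(),
        "Last_name": row["Last_name"].strip().title(),
    }


# Hypothetical registry: version number -> ordered row-level steps.
VERSIONS = {1: [trim_and_title_names]}


def check_schema(row: dict) -> None:
    """Fail fast if the incoming columns differ from what the preparation expects."""
    if list(row) != EXPECTED_COLUMNS:
        raise ValueError(f"schema mismatch: got {list(row)}, expected {EXPECTED_COLUMNS}")


def run_version(rows: list[dict], version: int) -> list[dict]:
    """Apply the steps frozen in the given version to every row."""
    out = []
    for row in rows:
        check_schema(row)
        for step in VERSIONS[version]:
            row = step(row)
        out.append(row)
    return out


incoming = [
    {"First_name": " jane", "Last_name": "DOE ", "Phone": "555-010-1234", "Email": "jane.doe@example.com"},
]
print(run_version(incoming, version=1))
```

Because the steps are looked up by version number, later edits to the current state of the preparation would not change what this call produces.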
