Preparation versioning
When working on your data, you can decide to capture the state of your preparation by creating a version.
Creating a version can be done at any moment, even when no steps have been applied yet. It allows you to freeze a preparation in a given state, with a timestamp and a short description.
Use the Manage versions button to create a new version of your preparation, or to consult previously created versions in read-only mode. Each version can be exported individually.
Adding versions to your preparation is a good way to track the changes made to it over time. More importantly, it ensures that Talend Jobs always use the same state of a preparation, even if the preparation is still being worked on. Versions can be used in Data Integration as well as Big Data Jobs.
Preparation versions are propagated when sharing or moving a preparation across your folder structure, but not when you copy it or apply it to a new dataset.
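The versioning behavior described above can be pictured with a small conceptual model. This is only a sketch, not how Talend Data Preparation is implemented: the Preparation and Version classes, their fields and the step names are all hypothetical. It also assumes that a copy keeps the steps of the original preparation, while the source only states that a copy does not keep its versions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    """Hypothetical frozen snapshot: description, timestamp and steps."""
    description: str
    created_at: datetime
    steps: tuple  # immutable copy of the steps at the time of the freeze


@dataclass
class Preparation:
    """Hypothetical model of a preparation and its versions."""
    name: str
    steps: list = field(default_factory=list)
    versions: list = field(default_factory=list)

    def create_version(self, description):
        # Freeze the current steps; later edits do not affect the snapshot.
        version = Version(description, datetime.now(timezone.utc), tuple(self.steps))
        self.versions.append(version)
        return version

    def copy(self, new_name):
        # Assumption: a copy keeps the steps but starts without any versions.
        return Preparation(new_name, list(self.steps))


prep = Preparation("contacts cleansing")
first = prep.create_version("empty state")      # allowed before any step exists
prep.steps += ["trim names", "normalize case"]  # hypothetical step names
second = prep.create_version("names cleaned")
prep.steps.append("delete invalid phones")      # the current state keeps evolving
duplicate = prep.copy("contacts cleansing copy")
```

In this model, second.steps still holds only the two name-related steps after the third step is added, which is the property that makes a version a stable reference for a Job.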
Creating preparation versions
In the following example, you will perform a few preparation steps on your data, create versions at two different moments, and see how you can switch between your versions, as well as switch back to the current state of your preparation.
The dataset used here contains customer data, such as names, occupations, phone numbers and email addresses, that requires some cleansing. The columns containing the customers' names show formatting inconsistencies, such as leading or trailing whitespace and inconsistent case. In addition, various phone and email entries are invalid.
As you progress in your preparation, you are going to create two versions that reflect the state of your preparation at two different times.
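To make the data issues described above concrete, here is a minimal sketch of equivalent cleansing logic written in plain Python. It does not use or reproduce Talend Data Preparation functions. The function names are hypothetical, and the email and phone patterns are deliberately simple assumptions rather than the validation rules applied by the product.

```python
import re

# Illustrative patterns only (assumptions): real validation rules are stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?[0-9][0-9 .()-]{6,}$")


def clean_name(name):
    """Remove leading and trailing whitespace, then normalize the case."""
    return name.strip().title()


def is_valid_email(value):
    """Return True if the value looks like an email address."""
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_phone(value):
    """Return True if the value looks like a phone number."""
    return bool(PHONE_RE.match(value.strip()))


# Hypothetical sample row showing the kinds of issues described above.
row = {"name": "  jOHN smith ", "phone": "n/a", "email": "john.smith@example.com"}
cleaned = {
    "name": clean_name(row["name"]),
    "phone": row["phone"] if is_valid_phone(row["phone"]) else "",
    "email": row["email"] if is_valid_email(row["email"]) else "",
}
```

Here the name is trimmed and put in title case, the invalid phone entry is cleared, and the valid email address is kept unchanged.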
Procedure
Results
You have created two versions of your preparation, in order to capture its state at two different steps of the cleansing process. You can choose to export one of these versions, use it in a Talend Job, or continue to edit the current state of your preparation.
Using a version in a Talend Job
Preparation versions can be used in Data Integration or Big Data Jobs in Talend Studio.
In Talend Studio, the tDataprepRun component allows you to reuse a preparation, or any of its versions, and apply it to data with the same model.
You can still use a preparation in its current state, but using a specific version ensures that your Jobs always use the same state of the preparation, even if it is still being worked on, which provides more consistency.
The following example illustrates a Job that applies an existing preparation version to a Salesforce input and outputs the result to a Redshift database.
This preparation was made on a dataset containing basic customer information such as names, phone numbers and email addresses. A few steps have been applied to remove formatting errors in the name entries, and to delete invalid values from the phone numbers.
Two versions have been created during the preparation: one after the first two steps, and another one after the third step.
Before you begin
- You have created a preparation with at least one version in Talend Cloud Data Preparation. In this case, the existing preparation is called contacts cleansing.
- The data imported from Salesforce must have the same schema as the dataset used to create the preparation in the first place.
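The schema requirement above can be checked before running the Job. The following is a minimal sketch in plain Python, independent of Talend Studio. The schema_mismatches function and the column names are hypothetical, and treating a different column order as a mismatch is an assumption, since the source only states that the schemas must be the same.

```python
def schema_mismatches(expected, actual):
    """Return a list of readable differences between two lists of column names."""
    problems = []
    missing = [column for column in expected if column not in actual]
    extra = [column for column in actual if column not in expected]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    # Assumption: column order is part of the schema.
    if not missing and not extra and list(expected) != list(actual):
        problems.append("same columns, different order")
    return problems


# Hypothetical column names for the preparation dataset and the Salesforce input.
preparation_columns = ["name", "phone", "email"]
salesforce_columns = ["name", "email"]

issues = schema_mismatches(preparation_columns, salesforce_columns)
```

An empty list means that the two schemas match under these assumptions. In the example above, issues reports that the phone column is missing from the input.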
Procedure
Results
All the preparation steps included in the version of the preparation have been applied to your data, directly in the flow of your Job.