Improving the Talend Trust Score™ of a dataset using Talend Cloud Data Preparation

Talend Cloud Data Preparation, in combination with Talend Cloud Data Inventory, can be used to improve the overall health and quality of your data.

In this example, you are working for a B2B e-commerce company. As a business user, you need to monitor the data quality and overall health of your organization's data, and also actively help improve them. This scenario shows how you can navigate your company's dataset inventory, identify the datasets that need work, and fix different issues in order to improve their quality and their Talend Trust Score™.

Looking at your inventory through the Data Console

Use the Data console for a high-level view of all your data.

After logging in to the Talend Cloud platform to start your work, open Talend Cloud Data Inventory to land on the Data Console view, which gives you visibility into all the datasets across the organization.

Data console view with quality indicators, charts, and information about datasets.

The Data console gives you instant insight into your data health and how to improve it, thanks to different tiles that each cover specific metrics of your dataset inventory, such as the Talend Trust Score™, data quality, semantic types and more. You can start assessing the overall quality and trust by looking at the Talend Trust Score™ tile.

You can see the total score, a radar chart illustrating the five axes that make up the score, and a chart showing the overall and per-axis scores over time compared to the acceptable threshold defined beforehand.

Thresholds can be set up for each aspect of the Talend Trust Score™, as well as for each tile, to define what is considered good or poor according to your organization's standards. Datasets that do not meet the thresholds defined beforehand are accessible directly from the tile so that you can take appropriate action if needed.

Threshold being set for the Trust Score parameters.

You will now try to refine your search using filters, to find datasets that tend to bring down the overall Talend Trust Score™.

Using filters to find datasets to fix

You have heard from your leadership team that there have been some issues with the company billing system and that financial reports are showing abnormal results. Consequently, you will filter your inventory via the Data console to check the datasets containing billing information. Those datasets have been tagged beforehand, and that tag is the criterion that you will use to narrow down your search.

Procedure

  1. At the top of the Data console view, click Add filter.
  2. From the drop-down list that opens, click Tags > Billing.
  3. Click Apply.
    Billing tag being applied to the search.

Results

The Data console view is updated to only reflect the quality of the matching datasets. You can see from the Talend Trust Score™ history chart that the latest datasets you have received do not meet the required threshold in terms of overall score.
Trust Score tile with radar chart and score history chart showing a poor score recently.

Looking at the Data quality tile, you notice that the number of valid values across the datasets is also not acceptable.

Data quality charts showing a number of valid values under the fixed standard.

In conclusion, the root cause for the recent drop in overall Talend Trust Score™ is among these remaining datasets. The next step is to look into the dataset list for more details.
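
The narrowing-down done in this section (filter on a tag, then look for datasets under the threshold) can be pictured outside the product with a short Python sketch. This is only an illustration under stated assumptions: the `Dataset` class, the `below_threshold` helper, the scores, the threshold value, and every dataset name other than customers_billing_dataset are hypothetical and do not reflect how Talend Cloud Data Inventory stores or computes anything.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for an inventory entry."""
    name: str
    trust_score: float
    tags: set = field(default_factory=set)


def below_threshold(datasets, tag, threshold):
    """Return the datasets carrying `tag` whose score is under `threshold`, lowest first."""
    matching = [d for d in datasets if tag in d.tags and d.trust_score < threshold]
    return sorted(matching, key=lambda d: d.trust_score)


# Made-up inventory: the scores and the threshold are illustrative numbers only.
inventory = [
    Dataset("customers_billing_dataset", 2.1, {"Billing"}),
    Dataset("invoices_2023", 3.2, {"Billing"}),
    Dataset("payments_archive", 4.4, {"Billing"}),
    Dataset("web_traffic", 1.9, {"Marketing"}),
]

for dataset in below_threshold(inventory, tag="Billing", threshold=3.5):
    print(dataset.name, dataset.trust_score)
```

Sorting by ascending score puts the dataset that drags the overall score down the most at the top, which is the one you would open first in the dataset list.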

Sharing the dataset to improve with competent users

You have identified that the datasets containing billing information may need to be improved. Because you are not very familiar with datasets referencing financial data, you will take advantage of the collaborative features of Talend Cloud Data Preparation and Talend Cloud Data Inventory. The best course of action is to share the dataset with the lowest score with one of your colleagues from the finance department who has more expertise in this field.

Procedure

  1. Click Datasets from the left side menu to access the dataset list.
    The filter that you have set previously is still active, so only the few datasets with the Billing tag are displayed, and not your full inventory.
    Dataset list, filtered using the Billing tag.
  2. Point your mouse over the customers_billing_dataset dataset, which is the one with the lowest Talend Trust Score™, and in the Sharing column, click the sharing icon.
  3. In the sharing configuration window that opens, point your mouse over the Group finance user group, and click the + icon to add them as collaborators on this dataset.
    By default, they will be added with viewer rights only.
  4. In the Current collaborators column, click the Viewer label next to the user group, and from the drop-down list that opens, change their rights to Editor.
    Sharing window, where Group Finance is given access to the dataset.
  5. Click Share.

Results

The customers_billing_dataset dataset can now be accessed by your colleague from the finance department, and they will be able to take a closer look at the data and eventually fix the quality errors.

Fixing the issues with Talend Cloud Data Preparation

You are now a data analyst from the finance department, tasked with investigating the poor quality of the customers_billing_dataset dataset that you have been given access to. You will look at the data itself and create a new preparation.

Procedure

  1. From the Dataset list, click customers_billing_dataset to open the detailed view of the dataset.
    You can already get a sense of the dataset, with the Talend Trust Score™ diagram showing a downward trend in the last few days, which means that the latest data added to the database contains errors. This is confirmed by the Data quality tile showing a certain percentage of invalid and empty values.
    Detailed view of the customers_billing_dataset with charts and quality indicators.
  2. To take a look at the data itself, click the Sample icon from the left menu.
    The data is displayed in a grid view. You can quickly see discrepancies between valid and invalid values in certain columns. Most notably, the Billing_Country column contains full addresses that should have been split across several columns.
    Sample view of the dataset, showing errors to be fixed in the data.
  3. To start a new preparation on this dataset and fix these errors, click the Preparations > Add button on the top right of the screen.
    Mouse pointing over the Add preparation button.

    Talend Cloud Data Preparation opens and you can now start applying transformation operations on the data sample.

  4. Apply the following functions to correct the billing information:
    1. Apply Split the text in parts to the Billing_Country column, with 4 as Parts and a comma (,) as Separator.
    2. Apply Remove trailing and leading characters to the Billing_Country_Split_2, Billing_Country_Split_3 and Billing_Country_Split_4 columns, to remove whitespace.
    3. Apply Delete the rows that match to the Billing_Country_Split_1 column, with the (FR)|(US)|(GB) regular expression as Value.
    The data from the full addresses has been split into new columns, which you have also cleaned to ensure that the values are in the right format. Deleting the rows whose first part is a country code leaves you only with the rows that initially contained the errors, now with the billing information properly split into dedicated columns for street, city, state, and country.
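
To make the effect of the three functions above more concrete, here is a minimal Python sketch that mimics them on a few rows. It is not how Talend Cloud Data Preparation is implemented: the sample values and the `prepare` helper are hypothetical, and the sketch assumes that the regular expression must match the whole cell value, which the procedure does not state.

```python
import re

# Hypothetical sample rows; the values are illustrative, not taken from the real dataset.
rows = [
    {"Customer_id": 1, "Billing_Country": "FR"},
    {"Customer_id": 2, "Billing_Country": "12 Main St,  Springfield, IL , US"},
    {"Customer_id": 3, "Billing_Country": "GB"},
]

# Same pattern as in the procedure. Matching the whole value is an assumption.
COUNTRY_CODE = re.compile(r"(FR)|(US)|(GB)")


def prepare(rows):
    """Split Billing_Country in 4 parts, trim parts 2 to 4, drop rows that were already valid."""
    prepared = []
    for row in rows:
        # Function 1: split on "," into at most 4 parts, padding with empty strings.
        parts = row["Billing_Country"].split(",", 3)
        parts += [""] * (4 - len(parts))
        # Function 2: trim whitespace on parts 2 to 4 only, as in the procedure.
        parts = [parts[0]] + [part.strip() for part in parts[1:]]
        # Function 3: delete the rows whose first part is a plain country code.
        if COUNTRY_CODE.fullmatch(parts[0]):
            continue
        new_row = {"Customer_id": row["Customer_id"]}
        for index, part in enumerate(parts, start=1):
            new_row[f"Billing_Country_Split_{index}"] = part
        prepared.append(new_row)
    return prepared


for row in prepare(rows):
    print(row)
```

In this sketch, only the row for customer 2 remains, because it is the only one whose Billing_Country value held a full address instead of a country code.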

Results

The preparation now displays cleaner data that can be used to update the source dataset.
Sample view of the dataset, with improved data quality and formatting.

Running the preparation to update the source dataset

You need to send the fixed data from the preparation to the original dataset in order to update it.

Because the splitting function that you used earlier changed the schema of the preparation, you will have to complete a mapping step to reconcile it with the schema of the destination dataset coming from the database.

After running the preparation, you will be able to see the impact of the preparation on the different quality indicators.

Procedure

  1. Click the Run button on the top right of the screen to open the export options.
  2. Select Source dataset to update the source dataset.
  3. Click Next.
  4. Select Update from the Action drop-down list, so that the wrong records from the database are replaced with the ones from the preparation.
  5. Select the Customer_id column in the Operation keys drop-down list.
  6. Click Next.
  7. Use drag and drop to perform the following mappings between the resulting schema of the preparation, and the schema from the destination dataset:
    1. Customer_id with Customer_id
    2. Billing_Country_Split_1 with Billing_Street
    3. Billing_Country_Split_2 with Billing_City
    4. Billing_Country_Split_3 with Billing_State
    5. Billing_Country_Split_4 with Billing_country
    See Mapping the preparation and destination columns for more information on how to map columns.
    Mapping configuration between input and output columns.
  8. Click Next.
  9. Select Standard as run profile, so that the preparation runs on the Cloud Engine for Design.
  10. Click Run.
    The run starts in the background, and you are returned to the preparation screen.
  11. To check the status of the run, click the Run history button on the top right of the screen.
    Run history panel showing metrics and status of the run.
    This screen gives you various information about the current and past runs. For more information, see The run history page.
  12. Once the run is complete and successful, click customers_billing_dataset under the Destination dataset section to directly go back to the detailed view of the updated dataset.
  13. In the Data quality tile, click Select sample type > Refresh head sample in order to retrieve the latest changes made to the content of the database.
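
The combination of the Update action, the Customer_id operation key, and the column mapping chosen in the procedure above can be pictured with the following Python sketch. It is an illustration only, under the assumption that Update overwrites destination rows that share a key with a prepared row and leaves all other rows untouched. The `update_by_key` helper and the sample rows are hypothetical; the column names are copied as they are spelled in the mapping step.

```python
# Preparation column -> destination column, as listed in the mapping step.
COLUMN_MAPPING = {
    "Customer_id": "Customer_id",
    "Billing_Country_Split_1": "Billing_Street",
    "Billing_Country_Split_2": "Billing_City",
    "Billing_Country_Split_3": "Billing_State",
    "Billing_Country_Split_4": "Billing_country",
}


def update_by_key(destination, prepared, key="Customer_id"):
    """Return a copy of `destination` where rows matching a prepared row on `key` are updated."""
    mapped = {}
    for row in prepared:
        # Rename the preparation columns to their destination names.
        renamed = {COLUMN_MAPPING[column]: value for column, value in row.items()}
        mapped[renamed[key]] = renamed
    # Rows without a matching key are copied unchanged: an update does not insert rows.
    return [{**row, **mapped.get(row[key], {})} for row in destination]


# Hypothetical destination rows and preparation output.
destination = [
    {"Customer_id": 1, "Billing_Street": "", "Billing_City": "",
     "Billing_State": "", "Billing_country": "FR"},
    {"Customer_id": 2, "Billing_Street": "", "Billing_City": "",
     "Billing_State": "", "Billing_country": "12 Main St, Springfield, IL, US"},
]
prepared = [
    {"Customer_id": 2, "Billing_Country_Split_1": "12 Main St",
     "Billing_Country_Split_2": "Springfield", "Billing_Country_Split_3": "IL",
     "Billing_Country_Split_4": "US"},
]

for row in update_by_key(destination, prepared):
    print(row)
```

Using Customer_id as the key is what lets the four split columns land on the right record: the row for customer 2 receives its street, city, state, and country, while the row for customer 1 stays as it was.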

Results

After refreshing, you can see that the Talend Trust Score™ of the dataset has significantly increased, as indicated by the differential next to the score itself.
Trust Score icon showing a 1.05 points increase.

Using Talend Cloud Data Inventory and Talend Cloud Data Preparation has allowed you to monitor the datasets of your whole organization, use different indicators to identify potential errors, and fix them accordingly, to improve the health of your data.
