Cleansing data

After you profile the customer data and identify its problems, the next step is to cleanse the data. You can start by generating two Talend Jobs: one that removes duplicates from the email column and another that removes the values that do not match the email pattern.

This helps you see what needs to be resolved, so that you can then decide which tool to use to resolve these address issues.

Removing duplicate values

After you analyze the email and postal columns using simple statistics indicators, the analysis results show the number of duplicate records in each column. From these results, you can generate a ready-to-use Job that removes the duplicate values from the selected column.

You can follow the same procedure to remove duplicates from the Email or Phone columns.
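
To make the duplicate count concrete, the following Python sketch computes a few simple counts for a column outside of Talend Studio. It is an illustration only: the function name `duplicate_stats`, the sample values, and the exact definitions of the counts are assumptions made for this example and may differ from how the Studio indicators are computed.

```python
from collections import Counter

def duplicate_stats(values):
    """Return simple counts for one column (hypothetical helper, not a Talend API)."""
    # Ignore missing values when counting occurrences.
    counts = Counter(v for v in values if v is not None)
    return {
        "row_count": len(values),
        # Number of different non-null values.
        "distinct_count": len(counts),
        # Number of values that occur more than once (assumed definition).
        "duplicate_count": sum(1 for n in counts.values() if n > 1),
    }

# Made-up sample data for illustration.
emails = ["a@example.com", "b@example.com", "a@example.com", None]
stats = duplicate_stats(emails)
print(stats)
```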

Procedure

  1. In the Profiling perspective, click the Analysis Results tab at the bottom of the editor.
  2. In the Simple Statistics results of the Email or Phone column, right-click the duplicate count bar in the chart and select Remove duplicates.

    This example uses the simple statistics results for the Email column.

    The Integration perspective opens showing the generated Job.

    Job generated automatically from the analysis results.

    The database input component and the tUniqRow component are already configured according to your connection and the columns you are analyzing.

  3. Save the Job and press F6 to execute it.

Results

Duplicate values are written to the specified output database and file.
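
The kind of separation that a deduplication step performs can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the generated Job or the tUniqRow implementation: the function `split_duplicates`, the sample rows, and the choice to treat the first occurrence of a key as the unique row are all hypothetical.

```python
def split_duplicates(rows, key):
    """Split rows into first occurrences and later repeats of the same key value."""
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        value = row[key]
        if value in seen:
            # The key value was already encountered: treat the row as a duplicate.
            duplicates.append(row)
        else:
            seen.add(value)
            uniques.append(row)
    return uniques, duplicates

# Made-up sample data for illustration.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
uniques, duplicates = split_duplicates(customers, "email")
print(len(uniques), len(duplicates))
```

In this sketch, the two lists stand in for the two outputs of a Job; writing them to a database or a file is left out.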

What to do next

You can follow the same procedure to remove duplicates from the postal column.

For more information on using the Profiling perspective to identify and remove corrupt, incomplete, or inaccurate data, see Data cleansing in the Talend Studio User Guide.

Removing non-matching values

The email pattern applied to the email column showed that some records do not follow the standard email format. You can generate a ready-to-use Job to retrieve the non-matching rows from the column.

Procedure

  1. In the Profiling perspective, click the Analysis Results tab at the bottom of the editor.
  2. In the Pattern Matching results of the email column, right-click the chart bar or the numerical results and select Generate Job.

    The Integration perspective opens showing the generated Job.

    Job generated automatically from the analysis results.

    This Job uses an Extract, Transform, Load (ETL) process to write the email rows to two separate output files: one for the valid rows that match the pattern and one for the invalid rows that do not.

  3. Save the Job and press F6 to execute it.

Results

The valid and invalid rows of the email column are written to the defined output files.

You can replace the output files with other Talend components to retrieve the valid and invalid email rows and write them, for example, to databases.
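
The split into matching and non-matching rows can be illustrated with a short Python sketch. The regular expression below is a simplified stand-in chosen for this example, not the email pattern shipped with Talend Studio, and the names `EMAIL_PATTERN` and `split_by_pattern` are hypothetical.

```python
import re

# Simplified email pattern (an assumption for illustration only).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def split_by_pattern(values, pattern=EMAIL_PATTERN):
    """Separate values that fully match the pattern from those that do not."""
    valid, invalid = [], []
    for value in values:
        # Missing values cannot match, so they go to the invalid list.
        if value is not None and pattern.fullmatch(value):
            valid.append(value)
        else:
            invalid.append(value)
    return valid, invalid

# Made-up sample data for illustration.
valid, invalid = split_by_pattern(["a@example.com", "not-an-email", "b@example"])
print(valid, invalid)
```

Each returned list corresponds to one of the two outputs described above; a stricter or looser pattern changes which rows land in each list.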

For more information on using the Profiling perspective to identify and remove corrupt, incomplete, or inaccurate data, see Data cleansing in the Talend Studio User Guide.
