Creating a Job to deduplicate data
You can generate a ready-to-use Job to deduplicate the data in a specific file defined in the Talend Studio metadata. Using the component settings of this automatically generated Job, you can choose to write the duplicates and the unique values to two separate files or databases.
Deduplicating the data in a file involves the following steps:
- Selecting the file you want to deduplicate.
- Choosing the columns on which to run the deduplicating Job.
- If required, defining a blocking key to partition the data to be processed. A blocking key is usually needed when the file contains a large amount of data, because records are then compared only with other records in the same partition.
- Choosing where to write the unique and duplicate records.
- Running the generated Job.
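The idea behind these steps can be illustrated with a minimal Python sketch. It is not the code of the generated Job: the function name `deduplicate`, the sample records and the column names are hypothetical, and it assumes exact matching on the chosen columns, whereas the generated Job may apply other matching rules. It also assumes that a record counts as unique only when no other record in its block shares the same values in the chosen columns.

```python
from collections import defaultdict


def deduplicate(records, key_columns, blocking_column=None):
    """Split records into (uniques, duplicates).

    Hypothetical helper for illustration only. Records are compared by
    exact equality on key_columns, and only within the same block.
    """
    # Partition the records by the blocking key. Without a blocking
    # column, all records fall into a single block.
    blocks = defaultdict(list)
    for record in records:
        block_id = record[blocking_column] if blocking_column else None
        blocks[block_id].append(record)

    uniques, duplicates = [], []
    for block in blocks.values():
        # Group the records of this block by their values in the key columns.
        groups = defaultdict(list)
        for record in block:
            key = tuple(record[column] for column in key_columns)
            groups[key].append(record)
        # A group of one record is unique; larger groups are duplicates.
        for group in groups.values():
            if len(group) == 1:
                uniques.extend(group)
            else:
                duplicates.extend(group)
    return uniques, duplicates


# Sample data, invented for this example.
customers = [
    {"name": "Ann Lee", "city": "Paris", "zip": "75001"},
    {"name": "Bob Ray", "city": "Lyon", "zip": "69001"},
    {"name": "Ann Lee", "city": "Paris", "zip": "75001"},
]

uniques, duplicates = deduplicate(
    customers, key_columns=["name", "city"], blocking_column="zip"
)
```

Because records are compared only inside their own block, two records that differ in the blocking column are never matched, even when their key columns are identical. This is the trade-off a blocking key makes to reduce the number of comparisons.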
Procedure
Results
The unique and duplicate values in the file are identified and stored in the defined output files or databases. The generated Job is stored under the Job Designs node in the Repository tree view.