Selecting records of deduplicated Tate gallery artists

A pipeline with a source, a Field selector processor, and a destination.

Before you begin

You have previously added the dataset holding your source data.

Download and extract the file: field_selector-artists.zip. It contains a dataset of artists of the Tate galleries in London (including their name, date of birth, URL of their Tate page, etc.) with some duplicate names.
You have also created the connection and the related dataset that will hold the processed data.

Here, a file stored on a Test connection.

Procedure

Click Add pipeline on the Pipelines page. Your new pipeline opens.
Give the pipeline a meaningful name.
Example
Select deduplicated artists
Click ADD SOURCE to open the panel allowing you to select your source data, here a list of Tate artists with some duplicates.
Select your dataset and click Select in order to add it to the pipeline.
Rename it if needed.
Click the add (+) icon and add a Field selector processor to the pipeline. The configuration panel opens.
Give a meaningful name to the processor.
Example
select fields with distinct
Enable the Distinct option to return only distinct values for the selected fields and remove the duplicates.
Click the Edit icon in the Simple mode to open the Select fields window:
1. Select name in the Input list and enter full_name in the Output list, as you want to select and rename the field containing the artists' names.
2. Select yearOfBirth in the Input list and enter year_of_birth in the Output list, as you want to select and rename the field containing the artists' years of birth.
3. Select yearOfDeath in the Input list and enter year_of_death in the Output list, as you want to select and rename the field containing the artists' years of death.
Click Save to save your configuration.

Look at the preview of the processor to compare your data before and after the select and distinct operations. The artists' names are deduplicated and only the selected fields, with distinct values, are returned.
Click ADD DESTINATION and select the dataset that will hold your reorganized data.
Rename it if needed.
On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.

Results

Your pipeline is being executed. The data is reorganized according to the conditions you have stated, and the output is sent to the target system you have indicated.
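
To reason about what the processor output should look like, the select, rename, and distinct operations configured in this example can be approximated outside the product. The following is a minimal Python sketch, not the processor's actual implementation. The field mapping comes from the procedure above. The select_distinct helper and the sample rows are hypothetical, and the sketch assumes that Distinct compares the combination of the selected fields.

```python
from typing import Any, Dict, Iterable, List

# Input field -> output field, mirroring the Select fields window in this example.
FIELD_MAPPING = {
    "name": "full_name",
    "yearOfBirth": "year_of_birth",
    "yearOfDeath": "year_of_death",
}


def select_distinct(
    records: Iterable[Dict[str, Any]], mapping: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Keep only the mapped fields, rename them, and drop repeated rows.

    Hypothetical helper: assumes duplicates are detected on the combination
    of the selected fields, and keeps the first occurrence of each row.
    """
    seen = set()
    result = []
    for record in records:
        # Select and rename in one pass; missing input fields become None.
        selected = {out: record.get(src) for src, out in mapping.items()}
        key = tuple(selected[out] for out in mapping.values())
        if key not in seen:
            seen.add(key)
            result.append(selected)
    return result


# Hypothetical sample rows, not taken from the Tate dataset.
artists = [
    {"name": "Artist A", "yearOfBirth": 1900, "yearOfDeath": 1980,
     "url": "https://example.com/a"},
    {"name": "Artist A", "yearOfBirth": 1900, "yearOfDeath": 1980,
     "url": "https://example.com/a-duplicate"},
    {"name": "Artist B", "yearOfBirth": 1950, "yearOfDeath": None,
     "url": "https://example.com/b"},
]

deduplicated = select_distinct(artists, FIELD_MAPPING)
for row in deduplicated:
    print(row)
```

In this sketch the two "Artist A" rows differ only in a field that is not selected, so they collapse into a single output row once the url field is dropped.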
