Hashing fields to compare data safely

A pipeline with an S3 source, a Data hashing processor, a Field selector processor, and an S3 destination.

Before you begin

You have previously created a connection to the system storing your source data.

Here, an Amazon S3 connection.
You have previously added the dataset holding your source data.

Download the file: string-crops.csv. It contains data about crops harvested in Mali, including crop types, value of production, and harvested areas.
You have also created the connection and the related dataset that will hold the processed data.

Here, a dataset stored in the same S3 bucket.

Procedure

Click Add pipeline on the Pipelines page. Your new pipeline opens.
Give the pipeline a meaningful name.
Example
Hash fields to compare data safely
Click ADD SOURCE to open the panel where you select your source data, in this example data about crops harvested in Mali in 2005.
Select your dataset and click Select in order to add it to the pipeline.
Rename it if needed.
Add a Data hashing processor to the pipeline. The configuration panel opens.
Give a meaningful name to the processor.
Example
hash fields
In the Configuration area:
1. Select Hash data in the Function name list.
2. Click the icon next to the Fields to process list to select all fields, as you want to hash all values at once.
Click Save to save your configuration.

Look at the preview of the processor to compare your data before and after the operation.

All fields are now hashed and secured. The crop and id fields have the same output value, which means that the original value is the same in both fields.
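The comparison seen in the preview relies on hashing being deterministic: equal inputs always give equal digests. The following minimal Python sketch illustrates this outside the pipeline. It assumes SHA-256 as the algorithm, which this page does not specify, and it uses a made-up record rather than rows from string-crops.csv. The helper names hash_value and hash_record are hypothetical.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return the SHA-256 hex digest of a value (the algorithm is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_record(record: dict) -> dict:
    """Hash every field value in the record, keeping the field names unchanged."""
    return {field: hash_value(str(value)) for field, value in record.items()}


# Hypothetical record: the values are illustrative, not taken from string-crops.csv.
sample = {
    "id": "wheat",
    "crop": "wheat",
    "crop_parent": "cereals",
    "harvested_area": "1200",
    "value_of_production": "3400",
}

hashed = hash_record(sample)

# Equal digests reveal equal original values without exposing the values themselves.
same_source_value = hashed["crop"] == hashed["id"]
print(same_source_value)
```

Because only the digests are compared, the check works even when the original values are no longer visible.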
Add a Field selector processor to the pipeline. The configuration panel opens.
Give a meaningful name to the processor.
Example
merge identical hashed values
In the Selectors area:
1. Select .crop in the Input list and enter crop_id in the Output list, as you know that the .crop and .id fields are identical and you want to merge the two fields.
2. Click the + sign to add a new element and select .crop_parent in the Input list and enter crop_type in the Output list, as you want to keep this field and rename it.
3. Click the + sign to add a new element and select .harvested_area in the Input list and enter harvested_area in the Output list, as you want to keep this field in the output.
4. Click the + sign to add a new element and select .value_of_production in the Input list and enter production_value in the Output list, as you want to keep this field and rename it.
Click Save to save your configuration.

Look at the preview of the processor to compare your data before and after the operation.
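The selectors configured above amount to a mapping from input field names to output field names. The following minimal Python sketch shows the same idea; it is not how the Field selector processor is implemented. The select_fields helper, the SELECTORS name, and the placeholder digests in the record are all hypothetical.

```python
# Input field -> output field, mirroring the four selectors configured above.
SELECTORS = {
    "crop": "crop_id",
    "crop_parent": "crop_type",
    "harvested_area": "harvested_area",
    "value_of_production": "production_value",
}


def select_fields(record: dict, selectors: dict) -> dict:
    """Keep only the selected input fields and store them under their output names."""
    return {output: record[source] for source, output in selectors.items()}


# Hypothetical hashed record: the strings stand in for real digests.
record = {
    "id": "digest_A",
    "crop": "digest_A",
    "crop_parent": "digest_B",
    "harvested_area": "digest_C",
    "value_of_production": "digest_D",
}

selected = select_fields(record, SELECTORS)
print(list(selected))
```

The id field is dropped because no selector references it. Its value is still available as crop_id, since both fields hold the same digest.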
Click ADD DESTINATION and select the dataset that will hold your processed data.
Rename it if needed.
On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.

Results

Your pipeline is being executed: the data is hashed, the identical fields are merged, the remaining fields are reorganized according to the conditions you have stated, and the output is sent to the target system you have indicated.
