
Processing statistics stored on Google Cloud Storage and uploading the data to Amazon S3

This scenario aims at helping you set up and use connectors in a pipeline. You are advised to adapt it to your environment and use case.

Example of a pipeline created from the instructions below.

Before you begin

  • If you want to reproduce this scenario, download the file: gcstorage_s3_nyc_stats.xlsx. This file is an extract of nyc-park-crime-stats-q4-2019.xlsx, a New York City open dataset that is publicly available for anyone to use.
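
If you reproduce this scenario, you can take a quick look at the extract locally before placing it in your Google Cloud Storage bucket. The sketch below is optional and assumes that the pandas and openpyxl Python packages are installed; the field names used later in this scenario (SIZE__ACRES_ and ROBBERY) should appear among the columns.

  # Optional local preview of the downloaded extract (not part of the procedure).
  # Assumes pandas and openpyxl are installed: pip install pandas openpyxl
  import pandas as pd

  df = pd.read_excel("gcstorage_s3_nyc_stats.xlsx")
  print(df.columns.tolist())  # look for the SIZE__ACRES_ and ROBBERY fields
  print(df.head())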

Procedure

  1. Click Connections > Add connection.
  2. In the panel that opens, select the type of connection you want to create.

    Example

    Google Cloud Storage
  3. Select your engine in the Engine list.
    Information note:
    • For advanced data processing, it is recommended to use the Remote Engine Gen2 rather than the Cloud Engine for Design.
    • If no Remote Engine Gen2 has been created from Talend Management Console, or if one exists but appears as unavailable (meaning it is not up and running), you cannot select a Connection type in the list or save the new connection.
    • The list of available connection types depends on the engine you have selected.
  4. Select the type of connection you want to create.
    Here, select Google Cloud Storage.
  5. Fill in the JSON credentials needed to access your Google Cloud account as described in Google Cloud Storage properties, check the connection and click Add dataset.
  6. In the Add a new dataset panel, name your dataset NYC park crime stats.
  7. Fill in the required properties to access the file located in your Google Cloud Storage bucket (bucket name, file name, format, etc.) and click View sample to see a preview of your dataset sample.
    Configuration of a new Google Cloud Storage dataset.
  8. Click Validate to save your dataset.
  9. Do the same to add the S3 connection and dataset that will be used as a destination in your pipeline.
  10. Click Add pipeline on the Pipelines page. Your new pipeline opens.
  11. Click ADD SOURCE to open the panel allowing you to select your source data, here a public dataset of New York park crimes stored in a Google Cloud Storage bucket.
  12. Select your dataset and click Select in order to add it to the pipeline.
    Rename it if needed.
  13. Click add processor and add a Math processor to the pipeline. The configuration panel opens.
  14. Give a meaningful name to the processor.

    Example

    calculate acre square root
  15. Configure the processor:
    1. Select Square root in the Function name list, as you want to calculate the square root of the SIZE__ACRES_ field.
    2. Select .SIZE__ACRES_ in the Fields to process list.
    3. Click Save to save your configuration.
      (Optional) Look at the preview of the processor to see your data after the calculation operation.
      In the Output data preview, the processor has calculated the square root of the SIZE__ACRES_ field.
  16. Click add processor and add a Filter processor to the pipeline. The configuration panel opens.
  17. Give a meaningful name to the processor.

    Example

    filter on robberies
  18. Configure the processor:
    1. Add a new element and select .ROBBERY in the Input list, as you want to keep only the robbery category among the crimes listed in the dataset.
    2. Select None in the Optionally select a function to apply list.
    3. Select >= in the Operator list.
    4. Enter 1 in the Value field, as you want to filter on data that contains at least one robbery case.
    5. Click Save to save your configuration.
  19. (Optional) Look at the preview of the Filter processor to see your data sample after the filtering operation.

    Example

    In the Output data preview, 5 records match the criteria.
  20. Click ADD DESTINATION and select the S3 dataset that will hold your reorganized data.
    Rename it if needed.
  21. In the Configuration tab of the destination, enable the Overwrite option so that the existing file on S3 is overwritten by the file containing your processed data, then click Save to save your configuration. A Python sketch of the equivalent end-to-end processing is given after this procedure.
  22. On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
  23. Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.
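
For reference, the processing that this pipeline performs can be reproduced outside Talend with a few lines of Python. The sketch below is illustrative only and is not part of the procedure: the bucket names, object keys, and credentials path are placeholders, the output is written as CSV for simplicity (your S3 dataset may be configured with a different format), and it assumes the google-cloud-storage, boto3, pandas, and openpyxl packages are installed.

  # Illustrative equivalent of the pipeline, run outside Talend.
  import boto3
  import pandas as pd
  from google.cloud import storage

  # 1. Download the source file from the Google Cloud Storage bucket.
  gcs = storage.Client.from_service_account_json("service-account.json")
  blob = gcs.bucket("my-gcs-bucket").blob("gcstorage_s3_nyc_stats.xlsx")
  blob.download_to_filename("input.xlsx")

  # 2. Math processor equivalent: square root of the SIZE__ACRES_ field.
  df = pd.read_excel("input.xlsx")
  df["SIZE__ACRES_"] = df["SIZE__ACRES_"] ** 0.5

  # 3. Filter processor equivalent: keep records with at least one robbery case.
  df = df[df["ROBBERY"] >= 1]

  # 4. Upload the result to Amazon S3. A PUT to an existing key overwrites it,
  #    which matches the Overwrite option enabled on the destination.
  df.to_csv("output.csv", index=False)
  boto3.client("s3").upload_file("output.csv", "my-s3-bucket", "nyc_park_crime_stats_filtered.csv")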

Results

Your pipeline is being executed and the output flow is sent to the Amazon S3 bucket you have indicated.
Highlight of the pipeline output flow in the Amazon S3 bucket
If you download the output file, you can see that the crime data has been processed and robbery cases have been isolated.
Excel sheet of the crime data with the robbery column.
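
If you prefer to verify the output programmatically rather than downloading it by hand, a minimal check along the lines below confirms that only records with at least one robbery case remain. The bucket and object names are placeholders and assume the same CSV output as in the earlier sketch.

  # Illustrative check of the output stored on Amazon S3.
  import boto3
  import pandas as pd

  boto3.client("s3").download_file("my-s3-bucket", "nyc_park_crime_stats_filtered.csv", "result.csv")
  result = pd.read_csv("result.csv")
  print(len(result), "records kept")
  print((result["ROBBERY"] >= 1).all())  # expected: True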
