Extracting a fixed-size sample from a dataset about drivers

A pipeline with a Test source, a Data sampling processor, and an FTP destination.

Before you begin

  • You have previously created a connection to the system storing your source data.

    Here, a Test connection.

  • You have previously added the dataset holding your source data.

    Download and extract the file: sampling-drivers.zip. It contains a dataset about bad drivers, including the percentage of drivers involved in fatal collisions due to speed, alcohol, or distraction, as well as car insurance information.

  • You have also created the connection and the related dataset that will hold the processed data.

    Here, an output folder stored on an FTP server.

Procedure

  1. Click Add pipeline on the Pipelines page. Your new pipeline opens.
  2. Give the pipeline a meaningful name.

    Example

    Extract a subset of data about drivers
  3. Click ADD SOURCE to open the panel where you select your source data, here the data about drivers involved in fatal collisions and their insurance.

    Example

    Preview of a data sample about driver insurance data.
  4. Select your dataset and click Select in order to add it to the pipeline.
    Rename it if needed.
  5. Click Plus and add a Data sampling processor to the pipeline. The configuration panel opens.
  6. Give a meaningful name to the processor.

    Example

    extract 5 records
  7. In the Configuration area:
    1. Enter 5 in the Number of records field to create a subset of the original dataset that contains only 5 randomly selected records.
  8. Click Save to save your configuration.

    Look at the preview of the processor to compare your data before and after the operation.

    The output now contains a subset of only 5 randomly selected records.

    Preview of the Data sampling processor after extracting 5 random records from the source dataset.
  9. Click ADD DESTINATION and select the FTP folder that will hold your subset of data.
    Rename it if needed.
  10. On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
  11. Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.

Results

Your pipeline is executed: the subset of data is created according to the number of records you specified, and the output is sent to the FTP folder you indicated. Data scientists can then use such subsets for predictive analytics.
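
This page does not describe how the Data sampling processor picks its records. To illustrate what a fixed-size random sample means outside the tool, here is a minimal Python sketch that uses reservoir sampling, one common technique for this task. The function name sample_fixed_size, the seed parameter, and the example records are assumptions made for illustration. They are not part of Talend Cloud Pipeline Designer.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_fixed_size(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random (reservoir sampling).

    Hypothetical helper for illustration only; it is not the processor's
    actual implementation.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# Hypothetical stand-in for the drivers dataset: 50 placeholder records.
drivers = [{"record_id": n} for n in range(50)]

# Equivalent in spirit to entering 5 in the Number of records field.
subset = sample_fixed_size(drivers, 5)
print(len(subset))
```

Reservoir sampling reads the input once and keeps only k records in memory, so it also works when the source is too large to load entirely. If the source holds fewer than k records, the sketch returns all of them.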
