Extracting a fixed-size sample from a dataset about drivers

A pipeline with a Test source, a Data sampling processor, and an FTP destination.

Before you begin

You have previously created a connection to the system storing your source data.

Here, a Test connection.
You have previously added the dataset holding your source data.

Download and extract the file: sampling-drivers.zip. It contains a dataset about bad drivers, including the percentage of drivers involved in fatal collisions due to speeding, alcohol, or distraction, as well as car insurance information.
You have also created the connection and the related dataset that will hold the processed data.

Here, an output folder stored on an FTP server.

Procedure

Click Add pipeline on the Pipelines page. Your new pipeline opens.
Give the pipeline a meaningful name.
Example
Extract a subset of data about drivers
Click ADD SOURCE to open the panel where you select your source data, in this case data about drivers involved in fatal collisions and insurance data.
Select your dataset and click Select to add it to the pipeline.
Rename it if needed.
Add a Data sampling processor to the pipeline. The configuration panel opens.
Give a meaningful name to the processor.
Example
extract 5 records
In the Configuration area:
1. Enter 5 in the Number of records field to create a subset of the original dataset that contains only 5 randomly selected records.
Click Save to save your configuration.

Look at the preview of the processor to compare your data before and after the operation.

The output preview shows a subset that contains only 5 randomly selected records.
Click ADD DESTINATION and select the FTP folder that will hold your subset of data.
Rename it if needed.
On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel where you select your run profile.
Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.

Results

Your pipeline is being executed. The subset of data is created according to the number of records you specified, and the output is sent to the FTP folder you indicated. Subsets like this one can then be used by data scientists for predictive analytics.
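
To make the idea of a fixed-size random sample more concrete, here is a minimal Python sketch of one common technique, reservoir sampling. It is only an illustration and does not claim to reflect how the Data sampling processor is implemented. The function name sample_fixed_size, the seed parameter, and the record fields (state, speeding_pct) are hypothetical and are not taken from the sampling-drivers dataset.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_fixed_size(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random (reservoir sampling).

    Hypothetical helper for illustration; not part of any Talend API.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (index + 1),
            # replacing a uniformly chosen slot in the reservoir.
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = record
    return reservoir


# Hypothetical stand-in rows; the real dataset's columns may differ.
drivers = [{"state": f"state_{i}", "speeding_pct": 10 + i} for i in range(51)]

subset = sample_fixed_size(drivers, k=5, seed=42)
print(len(subset))  # 5 records, matching the Number of records setting
```

Reservoir sampling reads the input once and keeps only k records in memory, which is why it suits streamed data. If the input holds fewer than k records, the sketch simply returns all of them.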
