
Extracting parts of a field based on semantic types

Availability note: Beta
You can use the Extract values by semantic type function to extract the different pieces of information contained in a cell into new columns, according to predefined or custom semantic types.

About this task

The function allows you to select up to five different semantic types that correspond to the type of information you want to extract from a given field. It works with semantic types based on regular expressions or dictionaries, as well as compound semantic types.
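
To make the distinction between the two kinds of semantic types more concrete, here is a minimal Python sketch of how a regular-expression-based type and a dictionary-based type might each pick a value out of free text. It is not how Talend Cloud Data Preparation implements the function. The names `EMAIL_PATTERN`, `COUNTRIES`, `match_regex_type`, and `match_dictionary_type`, the simplified email pattern, and the sample comment are all assumptions made for illustration.

```python
import re

# Hypothetical regex-based semantic type: a deliberately simplified email pattern.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical dictionary-based semantic type: a tiny list of country names.
COUNTRIES = ["France", "Italy", "Spain"]


def match_regex_type(text, pattern):
    """Return the first substring that matches the pattern, or None."""
    found = pattern.search(text)
    return found.group(0) if found else None


def match_dictionary_type(text, entries):
    """Return the first entry found in the text as a whole word, ignoring case, or None."""
    for entry in entries:
        if re.search(r"\b" + re.escape(entry) + r"\b", text, flags=re.IGNORECASE):
            return entry
    return None


comment = "Loved it! Reach me at ana@example.org. The Prado in spain is great too."
print(match_regex_type(comment, EMAIL_PATTERN))
print(match_dictionary_type(comment, COUNTRIES))
```

In this sketch, the regex-based type recognizes a value by its shape, while the dictionary-based type recognizes it by membership in a list of known values.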

For this example, imagine that you work for the Ministry of Culture and need to prepare data from a survey given to museum visitors. The survey gathered basic demographic information about the visitors, such as their age or gender, as well as free-text comments entered in a dedicated field. Visitors could use this comments field to share their experience, leave additional contact information, or recommend museums they visited in other countries. This information could be used, for example, to build future partnerships.

However, after a simple parsing operation, the various pieces of information gathered in the comments field all ended up in a single field in the resulting dataset. You would like to extract the different types of information and sort them into specific columns. To do so, you will use the Extract values by semantic type function, together with the predefined or custom semantic types available in Talend Cloud Data Preparation, to identify the different categories of information left in the comments and extract them to individual columns.

Dataset containing comments.

Procedure

  1. Click the header of the Comments column to select its content.
  2. In the functions panel, type Extract values by semantic type and click the result to open the options for the associated function.
    Extract values by semantic type panel opened.
  3. In the first Semantic type drop-down list select Museum.
    All the semantic types that are available in the drop-down list correspond to either the predefined semantic types, or the custom ones you created using Talend Dictionary Service. Each category will be extracted to a new column.
  4. In the second and third Semantic type drop-down lists, select Country and Email respectively.
    These three categories correspond to the types of information that you hope museum visitors left in the comments field.
  5. Select the Normalize value check box to apply a standardization process to the extracted values based on the default or custom dictionary-based and compound semantic types.
  6. Click Submit.

Results

All the information that matches the selected semantic types, previously contained in a single field, is extracted and displayed in separate new columns. If the original field contains no matching information, the corresponding cells in the new columns are left empty.
Dataset containing comments displayed in separate new columns.
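
The behavior described above can be approximated outside the product with a short Python sketch. It is only an illustration under stated assumptions: the dictionaries `MUSEUMS` and `COUNTRIES`, the simplified `EMAIL` pattern, the helper names, and the `Comments_<type>` column naming are hypothetical, and normalization is modeled here simply as replacing the matched text with the canonical dictionary spelling, which may differ from what the Normalize value option actually does.

```python
import re

# Hypothetical stand-ins for three semantic types; the real ones are managed in the product.
MUSEUMS = {"louvre": "Louvre", "prado": "Prado"}
COUNTRIES = {"france": "France", "spain": "Spain"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

MAX_TYPES = 5  # the function lets you select up to five semantic types


def from_dictionary(text, dictionary, normalize):
    """Return the first dictionary hit, canonical if normalize is set, else as written."""
    for key, canonical in dictionary.items():
        found = re.search(r"\b" + re.escape(key) + r"\b", text, flags=re.IGNORECASE)
        if found:
            return canonical if normalize else found.group(0)
    return ""  # no match: the new cell stays empty


def from_regex(text, pattern):
    """Return the first regex match, or an empty string when nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else ""


def extract_columns(rows, column, extractors):
    """Add one new column per selected type, leaving the original column untouched."""
    if len(extractors) > MAX_TYPES:
        raise ValueError("at most five semantic types can be selected")
    result = []
    for row in rows:
        new_row = dict(row)
        for name, extractor in extractors.items():
            new_row[f"{column}_{name}"] = extractor(row[column])
        result.append(new_row)
    return result


rows = [
    {"Comments": "The prado in SPAIN was better. Write to me: lee@example.com"},
    {"Comments": "Nice visit, thank you."},
]
extractors = {
    "Museum": lambda text: from_dictionary(text, MUSEUMS, normalize=True),
    "Country": lambda text: from_dictionary(text, COUNTRIES, normalize=True),
    "Email": lambda text: from_regex(text, EMAIL),
}
for row in extract_columns(rows, "Comments", extractors):
    print(row)
```

The second sample row contains none of the three types, so its new cells are empty strings, mirroring the empty cells described in the results.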
Tip: This transformation can also be performed by using the Automatically formatting data based on examples function.
