
Adding a new regular expression-based semantic type

You can create a semantic type based on a regular expression in Talend Dictionary Service and add it to the list of recognized data types in Data Stewardship.

In Talend Dictionary Service, not every type of data can currently be matched with and validated against one of the predefined semantic types. For example, Italian social security numbers, also known as codice fiscale, are currently not recognized.

About this task

Let's say that you work for an Italian company that deals only with Italian customers. In this example, you need to manage customer data such as names, email addresses, and social security numbers. When defining the data model in Data Stewardship, you have to set the semantic type of the column containing the social security number to text, because there is no predefined semantic type for Italian social security numbers. To match this type of data more precisely, you can create a dedicated semantic type: codice_fiscale in this case.

You can create this new semantic type in Talend Dictionary Service, and it will be automatically available in Data Stewardship so that your data can be matched with and validated against a proper type.

Important: For security reasons, some regular expression features cannot be used, in particular backreferences. For more information, see the RE2/J documentation.
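As an illustration of this restriction, the following minimal sketch scans a pattern for backreference tokens before you paste it into the Validation pattern field. It is not part of Talend Dictionary Service: the function name `find_backreferences` is hypothetical, and the scan is a rough heuristic that only looks for numbered (`\1` to `\9`) and named (`\k<name>`) backreferences outside character classes.

```python
def find_backreferences(pattern: str) -> list[str]:
    """Return the backreference tokens found in a regular expression string.

    Heuristic only: detects \\1..\\9 and \\k<name> outside [...] classes.
    """
    found = []
    in_class = False  # inside [...] a "\1" is not a backreference
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if not in_class and nxt in "123456789":
                found.append("\\" + nxt)
            elif not in_class and nxt == "k" and pattern[i + 2 : i + 3] == "<":
                end = pattern.find(">", i + 3)
                found.append(pattern[i : end + 1] if end != -1 else "\\k<")
            i += 2  # skip the escaped character so "\\\\1" is not misread
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        i += 1
    return found


# A pattern that repeats a captured group relies on a backreference.
print(find_backreferences(r"(\w)\1"))
# A pattern built only from character classes and counts does not.
print(find_backreferences(r"^[A-Z]{6}[0-9]{2}$"))
```

A pattern for which the function returns a non-empty list would need to be rewritten without backreferences before it can be used.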

Procedure

  1. Select Semantic types > Add semantic type.
  2. Enter a name and a description for the new semantic type.
  3. From the Type list, select the regular expression type.
  4. Keep the Use for validation switch activated.

    Using a regular expression, a dictionary or a compound type for validation means that it is used to define which values are considered valid or invalid in a given column. The result of this validation process can be seen in the quality bar of each column in your datasets.

    In any case, regular expressions and dictionaries of values are used for data discovery, which calculates the matching percentage between the reference values and your data to define the semantic type of each column.

    In this example, if you were to deactivate the switch, the regular expression would only be used for data discovery, and no value would be considered invalid.

  5. From the Content list, select the type of content you want to validate.
    This option helps to optimize performance. Only the data that matches the selected type is validated.
    Option          Description
    --------------  ------------------------------------------------------------
    Any character   The complete string is validated against the regular expression.
    Alphabetic      Strings that contain alphabetical characters and no numeric character are validated against the regular expression.
    Numeric         Strings that contain numeric characters and no alphabetic character are validated against the regular expression.
  6. Enter the regular expression in the Validation pattern field.
    The regular expression used in this example is designed to match the Italian codice fiscale, which is an alphanumeric code of 16 characters.
    Configuration to add a new regular expression-based semantic type.
  7. Click Save and publish to send the semantic type to the Talend Dictionary Service server and make it available to be used by Data Stewardship.
    Clicking Save as draft stores the new type on the server without propagating it to the system. The new type is not usable until it is published. For example, if you have new semantic types to deploy as part of a new project, you can create them and save them as drafts before the go-live of the project, then publish them on the day of the go-live.
  8. Go back to Talend Cloud Data Stewardship and create the data model for the Italian customers data.
    The new semantic category codice_fiscale is now available in the list of semantic types and you can set it for the column containing the social security number.
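To make steps 4 to 6 more concrete, here is a minimal sketch of how a Content option and a validation pattern could work together. Everything in it is an assumption for illustration: the pattern is a commonly used simplified codice fiscale expression (6 letters, 2 digits, 1 letter, 2 digits, 1 letter, 3 digits, 1 letter), not necessarily the one shown in the configuration screenshot, and the helper names `content_matches` and `is_valid` are hypothetical. Returning `None` for values skipped by the content filter is also a choice made for this sketch; the page does not say how such values are reported.

```python
import re

# Assumed simplified pattern for a 16-character codice fiscale.
# It uses no backreferences, in line with the restriction noted earlier.
CODICE_FISCALE = re.compile(r"^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$")


def content_matches(value: str, content: str) -> bool:
    """Mimic the Content option: decide whether a value is validated at all."""
    has_alpha = any(c.isalpha() for c in value)
    has_digit = any(c.isdigit() for c in value)
    if content == "Any character":
        return True
    if content == "Alphabetic":
        return has_alpha and not has_digit
    if content == "Numeric":
        return has_digit and not has_alpha
    raise ValueError(f"unknown content option: {content}")


def is_valid(value: str, content: str = "Any character"):
    """Return True or False for validated values, None for skipped ones."""
    if not content_matches(value, content):
        return None  # assumption: skipped values are neither valid nor invalid
    return CODICE_FISCALE.fullmatch(value) is not None


print(is_valid("RSSMRA85T10A562S"))                # well-formed sample value
print(is_valid("rssmra85t10a562s"))                # lowercase is rejected by this pattern
print(is_valid("RSSMRA85T10A562S", "Alphabetic"))  # skipped: the value contains digits
```

Because a codice fiscale mixes letters and digits, this sketch suggests that Any character is the option to select for this example; with Alphabetic or Numeric, such values would never reach the regular expression.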

Results

When you load the customer data into Talend Cloud Data Stewardship, the data is now matched with and validated against the codice_fiscale semantic type that you created in Talend Dictionary Service.
Data matching the codice fiscale semantic type.
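The matching percentage used for data discovery can be pictured with a short sketch. It is only an approximation of the idea, not the product's actual algorithm: the function name `matching_percentage`, the simplified pattern, and the sample values (made up for illustration, with unverified check letters) are all assumptions.

```python
import re

# Assumed simplified codice fiscale pattern (16 alphanumeric characters).
PATTERN = re.compile(r"[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]")


def matching_percentage(values: list[str]) -> float:
    """Return the share of values, in percent, that fully match the pattern."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if PATTERN.fullmatch(v))
    return 100.0 * hits / len(values)


# Hypothetical column: three well-formed codes and one value that does not match.
column = ["RSSMRA85T10A562S", "VRDLGI90A01H501X", "not-a-code", "BNCGNN75M50F205Z"]
print(matching_percentage(column))  # 75.0
```

In this picture, a column whose values mostly match the pattern would be a candidate for the codice_fiscale semantic type, and the non-matching values are the ones that validation would flag in the quality bar.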
