
Adding a new dictionary-based semantic type

You can create a semantic type based on a dictionary in Talend Dictionary Service and add it to the list of recognized data types in Talend Data Stewardship. Duplicate values are not allowed in a dictionary-based semantic type: they add no information and can slow down processing.

In Talend Data Stewardship, not every type of data can currently be matched with one of the predefined semantic types. The counties of the United Kingdom, for example, are currently not recognized as such.

About this task

Let's say that you work for a British company whose customers all reside in the United Kingdom. In this example, you need to manage some customer data, such as names, email addresses, and the county each customer lives in. When you define the data model in Data Stewardship, no predefined semantic type fits the column containing the counties. You therefore want to add a semantic type specific to your data: the UK_counties semantic type in this case.

You can create this new semantic type in Talend Dictionary Service, and it is automatically made available in Data Stewardship so that your data can be matched with and validated against a proper type.

Procedure

  1. Create a text file where you list the counties of the United Kingdom.
    The file can have one or multiple values per line. The maximum length of a value is 255 characters.

    When you use multiple values on the same line, separate them with commas. In that case, all values on the line are considered synonyms. Enclose non-alphabetical values in quotes, otherwise the file will be rejected.

  2. Select SEMANTIC TYPES > ADD SEMANTIC TYPE.
  3. Enter a name and a description for the new semantic type.
  4. Select the dictionary type from the Type list.
  5. Keep the Use for validation switch activated.

    Using a regular expression, a dictionary or a compound type for validation means that it will be used to define which values are considered right or wrong in a given column. The result of this validation process can be seen in the quality bar of each column in your datasets.

    Regardless of this setting, regular expressions and dictionaries of values are used for data discovery, which calculates the matching percentage between the reference values and your data to define the semantic type of each column.

    In this example, if you were to deactivate the switch, the dictionary would only be used for data discovery, and no value would be considered invalid.

  6. From the Validation criterion list, select the rule to use when matching data against the values in the dictionary:
    Option: Description
    Simplified text: Punctuation, white spaces, case, and accents are ignored during validation. For instance, if Pâté-en-croûte is the reference value in the dictionary, then pate-en-croute and PATE--EN CROUTE are both considered valid, but Pâté n croûte is not.
    Ignore case and accents: Case and accents are ignored during validation. For instance, if Pâté-en-croûte is the reference value in the dictionary, then pate-en-croute is considered valid (despite the case and accent differences), but pate en croute is not, because the dashes have been replaced with spaces.
    Exact value: Very restrictive. Data is considered valid only if it exactly matches the reference value.
  7. Click the icon to the right of Values and import the text file of the counties of the United Kingdom.
    You can use the icon to add values manually and the search icon to search values in the list.
  8. Click SAVE AND PUBLISH to send the semantic type to the Talend Dictionary Service server and make it available to be used by Data Stewardship.
    Clicking SAVE AS DRAFT stores the new type on the server without propagating it to the system. The new type is not usable until it is published. As a use case for this option, let's say that you have new semantic types to deploy as part of a new project. You can prepare the work by creating the semantic types and saving them as drafts before the go-live of the project, then publish them only on the day of go-live.
  9. From the DATA MODELS page, create a data model for the United Kingdom customer data.
    UK_counties is now available in the list of the semantic types and you can set it for the County column.
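
The file rules from step 1 can be checked before the import in step 7. The following is a minimal sketch, not part of Talend: the helper names `quote_if_needed` and `build_dictionary_lines` are hypothetical, "non-alphabetical" is assumed to mean any value containing something other than letters and spaces, and doubling embedded double quotes is an assumed convention. Duplicates are detected by exact string comparison only.

```python
MAX_VALUE_LENGTH = 255  # maximum value length stated in step 1


def quote_if_needed(value):
    """Wrap a value in double quotes unless it holds only letters and spaces.

    Assumption: this is one reading of "non-alphabetical". Embedded double
    quotes are doubled, which is also an assumption, not documented behavior.
    """
    if all(ch.isalpha() or ch == " " for ch in value):
        return value
    return '"' + value.replace('"', '""') + '"'


def build_dictionary_lines(rows):
    """Turn rows of synonyms into lines for a dictionary text file.

    Each row is a list of values that should be treated as synonyms.
    Blank and duplicate values are dropped, and over-long values raise.
    """
    seen = set()
    lines = []
    for row in rows:
        kept = []
        for raw in row:
            value = raw.strip()
            if not value or value in seen:
                continue  # skip blanks and values already listed
            if len(value) > MAX_VALUE_LENGTH:
                raise ValueError(
                    f"value exceeds {MAX_VALUE_LENGTH} characters: {value[:30]}"
                )
            seen.add(value)
            kept.append(value)
        if kept:
            # Synonyms share one line, separated by commas.
            lines.append(",".join(quote_if_needed(v) for v in kept))
    return lines


if __name__ == "__main__":
    sample = [
        ["Kent"],
        ["Durham", "County Durham"],  # two synonyms on one line
        ["Kent"],                     # duplicate, dropped
        ["Bristol, City of"],         # contains a comma, so it is quoted
    ]
    print("\n".join(build_dictionary_lines(sample)))
```

Writing the returned lines to a text file, one per line, gives a file in the shape that step 1 describes.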

Results

When you load data containing the United Kingdom counties to Talend Data Stewardship, they are matched with and validated against the proper semantic type, UK_counties, which you manually created in Talend Dictionary Service.
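
To make the three validation criteria from step 6 and the matching percentage used for data discovery more concrete, here is a minimal sketch. It is an illustration under assumptions, not Talend's implementation: the function names and the criterion labels `"exact"`, `"ignore_case_and_accents"`, and `"simplified"` are hypothetical, and the percentage is assumed to be the share of column values found in the dictionary.

```python
import unicodedata


def strip_accents(text):
    # Decompose characters, then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(value, criterion):
    """Reduce a value to the form compared under the given criterion."""
    if criterion == "exact":
        return value
    folded = strip_accents(value).casefold()
    if criterion == "ignore_case_and_accents":
        return folded
    if criterion == "simplified":
        # Additionally ignore punctuation and white space.
        return "".join(ch for ch in folded if ch.isalnum())
    raise ValueError(f"unknown criterion: {criterion}")


def is_valid(value, dictionary, criterion):
    """Return True if the value matches a dictionary entry."""
    reference = {normalize(entry, criterion) for entry in dictionary}
    return normalize(value, criterion) in reference


def matching_percentage(column, dictionary, criterion):
    """Share of column values (0 to 100) that match the dictionary.

    Assumption: a simple hit ratio stands in for the discovery score.
    """
    if not column:
        return 0.0
    reference = {normalize(entry, criterion) for entry in dictionary}
    hits = sum(1 for value in column if normalize(value, criterion) in reference)
    return 100.0 * hits / len(column)


if __name__ == "__main__":
    reference = ["Pâté-en-croûte"]
    for candidate in ["pate-en-croute", "PATE--EN CROUTE", "pate en croute"]:
        for criterion in ["simplified", "ignore_case_and_accents", "exact"]:
            print(candidate, criterion, is_valid(candidate, reference, criterion))
```

With the Use for validation switch deactivated, only something like `matching_percentage` would apply, and no individual value would be flagged as invalid.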
