Data Class Discovery
Talend Data Catalog has a concept of data classes. A data class may be applied like a tag to a column-level object (e.g., a column in a database or a field in a file) to indicate what kind of data that object holds, e.g., Social Security Number or Gender. In this way, you can categorize objects by data class and then identify, sort, and operate on all objects of the same type.
You may manually assign data classes to an object from the object page or when browsing in grid mode. In addition, as part of the harvesting and data profiling process, Talend Data Catalog will suggest data class assignments that you may confirm to make them permanent.
Data classes were previously referred to as semantic types. With the inclusion of metadata-detected data classes and other improvements, the concept has been generalized into the data class, and all data classification is now based on data classes.
Steps
- Ensure that you have specified the appropriate data sampling and profiling options before harvesting.
- Navigate to the object page for the object you wish to work with.
You may also review and edit data class assignments in grid mode. However, they cannot be assigned in bulk.
- Talend Data Catalog will have proposed data classes.
- To confirm a proposed data class, click the check.
- To reject a data class, click the X.
Rejecting a data class proposal is permanent: it will not be suggested again for that object in future harvests. You may, however, still assign it manually later.
- To specify a data class that is not currently assigned, click in the box and start typing. A pull-down list with options of valid data classes will be provided to pick from.
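The confirm and reject behavior described in the steps above can be pictured as a small state model. The following is a minimal sketch only, not Talend Data Catalog's API or implementation: the class `ColumnClassification` and its method names are hypothetical, and it assumes a manual assignment leaves the earlier rejection record in place.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnClassification:
    """Hypothetical model of the data class state of one column-level object."""
    proposed: set = field(default_factory=set)
    confirmed: set = field(default_factory=set)
    rejected: set = field(default_factory=set)

    def propose(self, name: str) -> bool:
        """Record a harvest suggestion; skipped if already rejected or confirmed."""
        if name in self.rejected or name in self.confirmed:
            return False
        self.proposed.add(name)
        return True

    def confirm(self, name: str) -> None:
        """Clicking the check: a proposal becomes a permanent assignment."""
        if name not in self.proposed:
            raise ValueError(f"{name!r} is not a pending proposal")
        self.proposed.discard(name)
        self.confirmed.add(name)

    def reject(self, name: str) -> None:
        """Clicking the X: the proposal is dropped and never suggested again."""
        self.proposed.discard(name)
        self.rejected.add(name)

    def assign_manually(self, name: str) -> None:
        """Manual assignment is allowed even for a previously rejected class."""
        self.proposed.discard(name)
        self.confirmed.add(name)


gender_field = ColumnClassification()
gender_field.propose("Gender")
gender_field.propose("Civility")
gender_field.confirm("Gender")
gender_field.reject("Civility")
```

In this sketch, a later call to `gender_field.propose("Civility")` returns `False`, mirroring the rule that a rejected class is not proposed again, while `assign_manually("Civility")` still succeeds.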
Example
Navigate to the object page of the Gender field in the Employee.csv file.
There are two suggested data classes. Confirm the Gender type by clicking the check mark next to that type. Then reject the Civility type by clicking the X next to it.
You will receive a warning that this action is permanent.
Are you sure you want to reject Civility data class?
It will not be proposed again for this object if you reject it but you will still be able to manually add it.
And the result is a single confirmed type.
Explore Further
Valid Data Classes
The available set of data classes is strictly controlled, so you may not simply type in a new one when assigning data classes to an object. A data class definition is more than just a name. It includes rules to match against (textual pattern matching rules or a list of valid values).
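To illustrate what "a name plus matching rules" can mean, here is a minimal sketch of a data class defined either by a textual pattern or by a list of valid values. It is an assumption-laden illustration, not the product's definition format: the `DataClass` type, the `match_ratio` helper, the regular expression, and the gender value list are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class DataClass:
    """Hypothetical data class: a name plus one kind of matching rule."""
    name: str
    pattern: Optional[str] = None              # textual pattern rule (regex)
    valid_values: FrozenSet[str] = frozenset()  # list-of-valid-values rule

    def matches(self, value: str) -> bool:
        # A pattern rule takes precedence; otherwise use the value list.
        if self.pattern is not None:
            return re.fullmatch(self.pattern, value) is not None
        return value.strip().upper() in self.valid_values


def match_ratio(data_class: DataClass, samples: Sequence[str]) -> float:
    """Fraction of sampled values that satisfy the data class rule."""
    if not samples:
        return 0.0
    return sum(data_class.matches(v) for v in samples) / len(samples)


# Illustrative definitions only; real definitions live in the product.
SSN = DataClass("US Social Security Number", pattern=r"\d{3}-\d{2}-\d{4}")
GENDER = DataClass("Gender", valid_values=frozenset({"M", "F", "MALE", "FEMALE"}))
```

A profiler built along these lines could propose a data class for a column when `match_ratio` over the sampled values exceeds some threshold; the threshold itself would be another assumption.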
The current set of valid data classes may be reviewed, edited and removed using the manage data classes feature.
Hiding profiling and sample data by data class
You may ensure that sample and profile data are hidden from the casual user by setting a Hide flag on an object.
In addition to manually setting this value, you may also define a data class to hide the data sampled and profiled on subsequent harvests. Thus, e.g., you could define the data class US Social Security Number to be hidden for all objects of that data class. Then, as the data is profiled in subsequent harvests, and Talend Data Catalog determines that an element is of that data class, its flag will be set to hidden. Go to manage data classes to manage this feature.
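The propagation of the hide setting from a data class to the objects detected as that class can be sketched as follows. This is a hypothetical model rather than the product's behavior in detail: the `Column` type and `apply_hide_flags` function are invented for illustration, and the sketch assumes that a harvest only sets the flag and never clears one that was set manually.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Column:
    """Hypothetical column-level object with its data classes and Hide flag."""
    name: str
    data_classes: Set[str] = field(default_factory=set)
    hidden: bool = False


def apply_hide_flags(columns: Iterable[Column], hidden_classes: Set[str]) -> List[str]:
    """Hide every column carrying a data class that is defined as hidden.

    Returns the names of columns whose flag changed. Flags that are already
    set (for example, manually) are left untouched.
    """
    changed = []
    for col in columns:
        if not col.hidden and col.data_classes & hidden_classes:
            col.hidden = True
            changed.append(col.name)
    return changed


columns = [
    Column("ssn", {"US Social Security Number"}),
    Column("gender", {"Gender"}),
    Column("notes", set(), hidden=True),  # hidden manually beforehand
]
newly_hidden = apply_hide_flags(columns, {"US Social Security Number"})
```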