Data Sampling and Profiling

While technical and descriptive metadata contain a great wealth of information about objects, this is only true if the information has been documented on those elements. In many cases, that metadata is incomplete and the best way to determine what that metadata should be (e.g., semantic data type, valid values, etc.) is to look at the data itself.

Talend Data Catalog provides the option to sample and/or profile the actual data contained in files and tables, in addition to the metadata captured from a source format or tool. One may specify the number of records to profile and how many should be maintained as a sample for visualization later, and whether to randomly sample or start at the top.

Data sampling will provide sample rows from the dataset.
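
The difference between sampling from the top and sampling at random can be illustrated with a short Python sketch. This is not how Talend Data Catalog implements sampling; the function name `sample_rows` and its parameters are hypothetical, and the random branch uses standard reservoir sampling so that the whole dataset never needs to be held in memory.

```python
import random
from itertools import islice


def sample_rows(rows, sample_size, randomize=False, seed=None):
    """Return up to sample_size rows, taken from the top or chosen at random.

    Hypothetical helper for illustration only; not a Talend Data Catalog API.
    """
    if sample_size <= 0:
        return []
    if not randomize:
        # "Start at the top": keep the first sample_size rows.
        return list(islice(rows, sample_size))

    # Random sampling: reservoir sampling gives every row an equal chance
    # of being kept while reading the rows only once.
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < sample_size:
            reservoir.append(row)
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < sample_size:
                reservoir[slot] = row
    return reservoir


top_sample = sample_rows(range(100), 5)
random_sample = sample_rows(range(100), 5, randomize=True, seed=1)
```

Sampling from the top is cheaper but can be biased when the data is sorted or loaded in batches; random sampling is usually more representative at the cost of reading more rows.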

Data profiling helps users discover the business knowledge embedded in the data itself, improves their understanding of the data, and enables them to classify data with certainty. The data profiling process creates a summary of the data in a model, consisting mainly of statistics and charts. This summary helps users determine whether the correct data is available at the appropriate level of detail.

That information is then available when one navigates to the file or table’s object page or when looking at individual fields or columns from the file or table. The application can store and display the following data profiling details for table/view and column objects:

Counts (standard and custom counts, like empty and valid rows)
Values (distinct values and their counts)
Patterns (patterns and their counts)
Data types (inferred data types and their counts)
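
To make these four kinds of profiling detail concrete, the following Python sketch computes them for a single column of string values. It is an illustration under stated assumptions, not the product's profiling engine: the function names, the pattern notation (`A`/`a` for letters, `9` for digits), and the type inference rules are all hypothetical, and only row and empty counts are shown (no validity or custom counts).

```python
import re
from collections import Counter


def value_pattern(value):
    """Map a value to a character pattern (assumed notation: A, a, 9)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep punctuation and spaces as-is
    return "".join(out)


def infer_type(value):
    """Guess a data type from the text of one value (simplified rules)."""
    if re.fullmatch(r"[+-]?\d+", value):
        return "integer"
    if re.fullmatch(r"[+-]?\d+\.\d+", value):
        return "decimal"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "date"
    return "string"


def profile_column(values):
    """Summarize one column: counts, distinct values, patterns, data types."""
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    return {
        "counts": {"rows": len(values), "empty": len(values) - len(non_empty)},
        "values": Counter(non_empty),
        "patterns": Counter(value_pattern(v) for v in non_empty),
        "data_types": Counter(infer_type(v) for v in non_empty),
    }


profile = profile_column(["42", "7", "", None, "2024-01-15", "AB-12", "42"])
```

In this sketch each `Counter` pairs an item (a distinct value, a pattern, or an inferred type) with the number of rows that produced it, mirroring the "and their counts" wording in the list above.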

Sampled data and the profiling results are hidden from most users by default. To see them, one must be assigned the Data Viewing capability object role assignment for the model in question. One may also hide the sample data and profiling results for specific models.

For data sampling and profiling, you may:

Schedule the sampling and profiling separately using the Data sampling and profiling operation. This process samples and profiles only what is specified in the MQL Statement.
Sample and profile on demand, e.g. on a schema or table via the user interface at any subset of the model you wish to specify.
Select the Data sample and profile after metadata import checkbox and specify the MQL Statement as part of the Import Options when defining the metadata import (harvesting) of a data source, so that profiling and sampling occur every time the model is imported. However, this is not best practice, as sampling and profiling large databases could take orders of magnitude more time than the metadata import.

In a best practices environment, the application or model administrators should define the data import policy (profiling, sampling, and classification), and the users or scheduled tasks should follow that policy. Not surprisingly, companies use different frequencies for metadata and data imports, and thus it is discouraged to always run the data import (data sampling and profiling) immediately after every import of metadata.

Thus, in the user interface, the option to import data is kept separate from the data import policy: the policy is displayed in the Data Defaults tab, while the option remains in the Import Options tab. In this way, the option and policy are independent. The option is turned off by default, but the administrators can enable sampling and profiling according to policy. In addition, the administrators may turn off sampling and profiling by default but enable data classification and data import. Still, data sampling and profiling are required in order to perform auto-tagging via data-driven data classification.
