
Data Sampling and Profiling

Technical and descriptive metadata can hold a wealth of information about objects, but only if that information has actually been documented on those elements. In many cases the metadata is incomplete, and the best way to determine what it should be (e.g., semantic data type, valid values) is to look at the data itself.

Talend Data Catalog provides the option to sample and/or profile the actual data contained in files and tables, in addition to the metadata captured from a source format or tool. One may specify the number of records to profile and how many should be maintained as a sample for visualization later, and whether to randomly sample or start at the top.

Data sampling will provide sample rows from the dataset.
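
As a rough illustration of the choices described above (how many records to profile, how many to keep as a sample, and whether to pick records randomly or start at the top), here is a minimal Python sketch. It is not how Talend Data Catalog implements sampling; the function name `select_rows` and its parameters are hypothetical and exist only for this example.

```python
import random


def select_rows(rows, profile_limit, sample_size, randomize=False, seed=None):
    """Return (rows_to_profile, rows_kept_as_sample).

    Hypothetical helper for illustration only, not a product API.
    """
    rows = list(rows)
    if randomize:
        # Pick records at random, without replacement.
        rng = random.Random(seed)
        to_profile = rng.sample(rows, min(profile_limit, len(rows)))
    else:
        # Start at the top of the dataset.
        to_profile = rows[:profile_limit]
    # Keep only a smaller subset of the profiled rows for later visualization.
    kept = to_profile[:sample_size]
    return to_profile, kept


rows = list(range(100))
top_profiled, top_kept = select_rows(rows, profile_limit=10, sample_size=3)
rand_profiled, rand_kept = select_rows(
    rows, profile_limit=10, sample_size=3, randomize=True, seed=1
)
```

In this sketch the kept sample is always a subset of the profiled rows, which is one simple way to keep the displayed rows consistent with the profiling statistics.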

Data profiling helps to discover business knowledge embedded in the data itself, improves users' understanding of the data, and enables them to classify data with certainty. The profiling process creates a summary of the data in a model, consisting mainly of statistics and charts. This summary helps users determine whether the correct data is available at the appropriate level of detail.

That information is then available when one navigates to the object page of the file or table, or looks at individual fields or columns from it. The application can store and display the following data profiling details for table/view and column objects:

  • Counts (standard and custom counts, like empty and valid rows)
  • Values (distinct values and their counts)
  • Patterns (patterns and their counts)
  • Data types (inferred data types and their counts)
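
To make the four kinds of profiling details listed above more concrete, the following Python sketch computes them for a single column of string values. It is an illustration under stated assumptions, not the product's algorithm: the names `profile_column`, `value_pattern`, and `infer_type` are hypothetical, and the pattern notation (`A` for uppercase letters, `a` for lowercase letters, `9` for digits) and the three inferred types are simplifications chosen for this example.

```python
from collections import Counter


def value_pattern(value):
    """Map each character to a class symbol (assumed notation: A, a, 9)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep punctuation and spaces as-is
    return "".join(out)


def infer_type(value):
    """Very rough type inference: integer, decimal, or string."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        return "string"


def profile_column(values):
    """Summarize a column: counts, distinct values, patterns, data types."""
    non_empty = [v for v in values if v is not None and v.strip() != ""]
    return {
        "counts": {"rows": len(values), "empty": len(values) - len(non_empty)},
        "values": Counter(non_empty),
        "patterns": Counter(value_pattern(v) for v in non_empty),
        "data_types": Counter(infer_type(v) for v in non_empty),
    }


profile = profile_column(["A12", "B34", "", None, "42", "3.5", "A12"])
```

Here `profile["patterns"]` groups `A12` and `B34` under the same pattern `A99`, which is the kind of signal that can suggest a semantic data type for a column even when its metadata is incomplete.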

Sampled data and profiling results are hidden from most users by default. To see them, one must be assigned the Data Viewing capability object role assignment for the model in question. One may also hide the sample data and profiling results for specific models.

For data sampling and profiling, you may

Information note

In a best-practices environment, the application or model administrators should define the data import policy (profiling, sampling, and classification), and the users or scheduled tasks should follow that policy. Companies typically import metadata and data at different frequencies, so running the data import (data sampling and profiling) immediately after every metadata import is discouraged.

Thus, in the user interface, the option to import data is kept separate from the data import policy: the policy is displayed in the Data Defaults tab, and the option remains in the Import Options tab. In this way, the option and the policy are independent. The option is turned off by default, but the administrators can enable sampling and profiling according to policy. In addition, the administrators may turn off sampling and profiling by default but enable data classification and data import. Still, data sampling and profiling are required in order to perform auto-tagging via data-driven data classification.
