Data Sampling and Profiling Technical Details
Talend Data Catalog reuses the model bridge infrastructure and metamodel for data profiling. Database and file system bridges provide “concealed” support for data profiling. They run in the metadata import mode by default. You can run them in the profiling mode by specifying dedicated Miscellaneous options.
When the bridges run in metadata mode, they import not only basic structural details, such as tables and columns, but also advanced details, such as keys and indexes. When they run in profiling mode, they import the same basic structural details so that these can carry the profiling statistics (e.g. UDPs on MIR Attribute). This allows MM to integrate the profiling statistics into already loaded metadata by matching on the basic structure.
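The idea of using the basic structure as the matching key can be illustrated with a minimal sketch. It is not how MM stores or merges metadata: the dictionaries, the `(table, column)` key, the `profiling` field, and the function name `attach_profiling_stats` are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical key: a column is identified by its table name and column name.
ColumnKey = Tuple[str, str]


def attach_profiling_stats(
    metadata: Dict[ColumnKey, Dict[str, Any]],
    profiling: Dict[ColumnKey, Dict[str, Any]],
) -> List[ColumnKey]:
    """Attach profiling statistics to already loaded column metadata.

    Columns are matched on (table, column) only, so details imported in
    metadata mode (keys, indexes, types) are left untouched.
    Returns the profiling keys that had no matching column.
    """
    unmatched: List[ColumnKey] = []
    for key, stats in profiling.items():
        column = metadata.get(key)
        if column is None:
            unmatched.append(key)
            continue
        # Store statistics under a separate field, similar in spirit to a UDP.
        column.setdefault("profiling", {}).update(stats)
    return unmatched
```

Because the match relies only on table and column names in this sketch, a profiling run does not need to re-import keys or indexes to find the right column.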
The bridges use the data profiling library. The library is derived from, and depends on, the open source Talend data quality library. When a bridge runs in metadata mode, it does not depend on the data profiling library.
The bridge uses one of two queries for data sampling/profiling, depending on the row count of the table:
- the first query, if the table has fewer than 100 000 rows:
SELECT * FROM TableName DISTRIBUTE BY rand() SORT BY rand() LIMIT 100
- the second query, if the table has 100 000 rows or more:
SELECT * FROM TableName TABLESAMPLE(n PERCENT)
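The choice between the two queries can be sketched as a small helper. This is an illustration under stated assumptions, not the bridge's actual code: the function name `build_sampling_query` is hypothetical, and `sample_percent` stands in for the `n` that the text leaves unspecified. Only the 100 000 row threshold and the two query shapes come from the description above.

```python
ROW_COUNT_THRESHOLD = 100_000  # threshold stated in the text
SMALL_TABLE_LIMIT = 100        # LIMIT value used by the first query


def build_sampling_query(table_name: str, row_count: int, sample_percent: float) -> str:
    """Return the sampling query for a table, following the two cases above.

    sample_percent is an assumed parameter replacing the unspecified "n".
    The table name is inserted as-is; no identifier quoting is attempted.
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")

    if row_count < ROW_COUNT_THRESHOLD:
        # Fewer than 100 000 rows: random ordering with a fixed row limit.
        return (
            f"SELECT * FROM {table_name} "
            f"DISTRIBUTE BY rand() SORT BY rand() LIMIT {SMALL_TABLE_LIMIT}"
        )

    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in the range (0, 100]")

    # 100 000 rows or more: percentage-based sampling.
    return f"SELECT * FROM {table_name} TABLESAMPLE({sample_percent:g} PERCENT)"
```

For example, `build_sampling_query("orders", 250_000, 1)` yields the `TABLESAMPLE` form, while the same call with a row count of 5 000 yields the `DISTRIBUTE BY rand() SORT BY rand()` form.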