Data Profiling Properties

The Talend Data Catalog repository, API and UI support the following profiling details (row counts):

Count: number of rows actually profiled, which is either the total number of rows in the source or the limit set when defining the harvesting options
Null: rows which are null
Distinct rows: non-distinct = total - distinct - empty. For example, when there are one million rows and the column has far fewer (e.g. 10) distinct values, the data is considered to be distinct.
Duplicate rows: rows with identical values for this field
Valid rows: rows with valid contents for this field
Empty rows: null in database or empty in files
Invalid rows: rows without valid contents for this field

Whether a row counts as valid or invalid depends on the data type that has been autodetected for the column. For example, if a column was identified as an INTEGER data type but the last record contains the value "a", which is not a valid INTEGER, that row contributes to the invalid counter.
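
The row counts above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names (`profile_row_counts`, `is_valid_integer`) are hypothetical, and it assumes that both `None` and the empty string count as empty and that only non-empty values are checked for validity.

```python
def is_valid_integer(value):
    """Return True if the raw string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def profile_row_counts(values, is_valid=is_valid_integer):
    """Compute illustrative row counts for one column of raw values."""
    total = len(values)
    # Assumption: None (null in a database) and "" (empty in a file) are both "empty".
    non_empty = [v for v in values if v is not None and v != ""]
    empty = total - len(non_empty)
    distinct = len(set(non_empty))
    # Assumption: validity is only evaluated for non-empty values.
    valid = sum(1 for v in non_empty if is_valid(v))
    return {
        "count": total,
        "empty": empty,
        "distinct": distinct,
        # Formula from the text: non-distinct = total - distinct - empty
        "non_distinct": total - distinct - empty,
        "valid": valid,
        "invalid": len(non_empty) - valid,
    }


# A column autodetected as INTEGER whose last record contains "a".
column = ["1", "2", "2", None, "", "a"]
counts = profile_row_counts(column)
print(counts)
```

In this sample the value "a" is the only one that fails the integer check, so it is the single row added to the invalid counter.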

Average length: average of the lengths of each value profiled
Min length: lowest of the lengths of each value profiled
Max length: highest of the lengths of each value profiled
Min value: lowest value
Max value: highest value
Values [value, rows]: distribution of values and their frequency
Patterns [pattern, rows]: list of the different patterns of data presentation discovered in the source and their frequency
Data Types [type, rows]: list of data type matches and their frequency. This is the column data type detected by the profiler. When a column has data of different data types, the profiler picks the most used one. You can overwrite the value manually. The value could contradict the data type declared by the database. For example, when a VARCHAR database column contains only date values, the profiler sets the DATE data type. Here is the list of supported types:
- Text
- Date
- Time
- DateTime
- Geographical
- No Percentiles
- Means, Median
- Variance
- Std. Deviation
- Number
Inferred Data Type: the data type inferred after profiling the object.
Data classes: list of data classes matched and likelihood as a percentage.
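
The length, value, pattern and data type properties can be sketched in Python as follows. This is an illustration under stated assumptions, not the profiler's actual algorithm: the helper names are hypothetical, the pattern notation (`9` for a digit, `A`/`a` for a letter) is an assumed convention, type detection only distinguishes Number, Date (ISO `YYYY-MM-DD`) and Text, and min/max values are compared as plain strings.

```python
from collections import Counter
from datetime import datetime


def to_pattern(value):
    """Assumed notation: digits become 9, letters become A or a, the rest is kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def detect_type(value):
    """Very small type detector: Number, then ISO Date, else Text."""
    try:
        float(value)
        return "Number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "Date"
    except ValueError:
        return "Text"


def profile_values(values):
    """Compute illustrative length, value, pattern and type statistics."""
    lengths = [len(v) for v in values]
    types = Counter(detect_type(v) for v in values)
    return {
        "avg_length": sum(lengths) / len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "min_value": min(values),  # string comparison, an assumption
        "max_value": max(values),
        "values": Counter(values).most_common(),
        "patterns": Counter(to_pattern(v) for v in values).most_common(),
        "data_types": types.most_common(),
        # The most frequent detected type wins, as described in the text.
        "inferred_type": types.most_common(1)[0][0],
    }


# A VARCHAR-like column that mostly holds date values.
column = ["2024-01-15", "2024-02-01", "2024-02-01", "n/a"]
profile = profile_values(column)
print(profile["patterns"])
print(profile["data_types"], profile["inferred_type"])
```

With this sample, three of the four values match the date format, so the sketch reports Date as the inferred type even though the values are stored as strings, which mirrors the VARCHAR example above.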
