Qlik Catalog identifies and flags data-quality problems at the point of ingest by applying a robust set of data validation rules to each data source as it is loaded. These rules and validation filters determine whether the incoming data matches the source and user-configured schema definition. The application does not allow ingest of bad data, for example records with misplaced field or record delimiters, or records whose individual field contents do not match the expected schema. Validation rules compare the contents of each field in each record with a set of parameters and thresholds to determine whether that record contains data of the expected quality. The sections below describe the validations that are applied at the dataset/file, record, and field levels. Many of these validations correspond to configurable object-level properties.
Validation at the dataset/file level
The primary purpose of dataset/file validation is to ensure that the packaging of the file looks correct before processing begins. In production systems, datasets arrive on a daily or other regular basis for ingestion, and the application validates the integrity of the files early in the ingestion process. Changes to file formats may be made in upstream systems without advance notice, and files may be corrupted or truncated during transmission and handling; these changes are identified and flagged. A file may have an optional header and/or an optional trailer, both of which are validated by exact text match or regular expression pattern matching.
File format version numbers can be checked
CSV (text-delimited) files with column name headers can be used to detect changes in upstream systems
Unicode text files may contain a BOM (Byte Order Mark) which identifies the specific character encoding (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE). Qlik Catalog can optionally validate/enforce the presence of a Unicode BOM. Example: if your file always contains a BOM but today's version does not, this change is detected.
When datasets are transmitted, it is important that all records arrive at the destination. Text files may carry a record count in a trailer, Mainframe datasets often have a record count in the trailer, or record counts may be transmitted in a separate manifest file. The total record count recorded during data ingestion is validated against the dataset.expected.record.count value extracted from the header, trailer, or manifest.
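As a sketch of the completeness check described above (the function and trailer format here are illustrative assumptions, not Qlik Catalog internals), the record count observed during ingest can be compared against an expected count parsed from a trailer line:

```python
# Illustrative sketch, not Qlik Catalog code: validate that the record count
# observed during ingest matches the expected count from a trailer/manifest.
def validate_record_count(observed: int, expected: int) -> bool:
    """True when the dataset arrived complete."""
    return observed == expected

# e.g. a hypothetical trailer line such as "TRAILER|000123" carries the
# expected record count in its second field
expected = int("TRAILER|000123".split("|")[1])
print(validate_record_count(123, expected))  # True
```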
Validation at the record level
Validation at the record level primarily corresponds with identifying and partitioning Bad records. If the record itself is considered Bad then no processing of field level data can take place.
There is a fundamental categorization of records based upon record length.
Record length validation is performed by record.min.byte.count and record.max.byte.count properties. These are inclusive min and max values. Note that these are byte counts, not record counts. This becomes a factor when working with multi-byte character sets, including all Unicode encodings.
FIXED_BYTE_LENGTH_TERMINATED records must contain the specified record terminator (usually \n or \r\n) after the specified payload byte count
If a record.validation.regex is specified, then each record is checked for a regular expression pattern match. Failing records are recognized as Bad, sent to the Bad output stream, and excluded from further processing.
TEXT-DELIMITED CSV files use one or more text characters as field delimiters. These delimiters effectively determine the number of fields in the record. For each record, the observedFieldCount is validated against the expectedFieldCount. Records with mismatched field counts are considered Bad, and sent to the Bad output stream, where they are excluded from further processing.
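The field-count check above can be sketched as follows. This is an illustrative Python sketch, not Qlik Catalog code; the delimiter, function name, and sample data are assumptions for the example:

```python
import csv
import io

# Illustrative sketch: records whose observed field count differs from the
# expected count are routed to a "bad" stream and excluded from processing.
def partition_by_field_count(text, expected_field_count, delimiter=","):
    good, bad = [], []
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(record) == expected_field_count:
            good.append(record)
        else:
            bad.append(record)  # mismatched field count -> Bad record
    return good, bad

data = "a,b,c\n1,2\nx,y,z\n"
good, bad = partition_by_field_count(data, expected_field_count=3)
print(good)  # [['a', 'b', 'c'], ['x', 'y', 'z']]
print(bad)   # [['1', '2']]
```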
Validation at the field level
Validation at the field level is what drives the categorization of records as Ugly. A record is considered Ugly if one or more fields fail field level validation.
Some field-level validation is independent of the data type. Other field-level validation is type-specific.
Fields can be tagged as NOT_NULL. The application considers a field of length zero to be NULL (as does Oracle RDBMS). If a field is of length zero and the field is tagged as NOT_NULL then the field is considered Ugly and the record is categorized as Ugly.
Fields may have a regular expression validation pattern. While this can be applied to fields of any type, field regex validation is generally applied to STRING fields. If a field has a regex validation pattern and the specified field value does not fully match the regular expression validation pattern then the field is considered Ugly and the record is categorized as Ugly.
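A full-match regex check of the kind described above might look like the following sketch (the pattern and helper name are hypothetical examples, not Qlik Catalog APIs):

```python
import re

# Hypothetical part-number format used only to illustrate full-match semantics:
# the value must match the entire pattern, or the field is Ugly.
pattern = re.compile(r"[A-Z]{2}-\d{4}")

def field_is_valid(value: str) -> bool:
    return pattern.fullmatch(value) is not None

print(field_is_valid("AB-1234"))   # True
print(field_is_valid("AB-1234X"))  # False: a partial match is not sufficient
```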
Fields may optionally specify a field.min.legal.value and field.max.legal.value as inclusive value limits. Comparison and interpretation is type-specific. If either or both of these limits are exceeded then the field is considered Ugly and the record is categorized as Ugly.
BOOLEAN field validation
- The following values are considered Boolean.true: 't', 'T', '1', 'y', 'Y', case-insensitive "true", case-insensitive "yes"
- The following values are considered Boolean.false: 'f', 'F', '0', 'n', 'N', case-insensitive "false", case-insensitive "no"
- All other values are considered invalid Boolean values; the field is considered Ugly and the record is categorized as Ugly
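The Boolean mapping above can be sketched directly; the sets mirror the documented values, while the helper function itself is illustrative:

```python
# Sketch of the documented BOOLEAN mapping; lowercasing covers the listed
# single characters and the case-insensitive words "true"/"yes"/"false"/"no".
TRUE_VALUES = {"t", "1", "y", "true", "yes"}
FALSE_VALUES = {"f", "0", "n", "false", "no"}

def parse_boolean(value: str):
    """Return True/False for valid Boolean text, or None for invalid (Ugly)."""
    v = value.lower()
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    return None  # invalid Boolean -> field Ugly, record Ugly
```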
INTEGER field validation
- Validation of INTEGER fields in Qlik Catalog is somewhat more flexible than in other numeric parsers, allowing:
- Leading and trailing white space
- Leading or trailing '+' or '-' sign optionally separated from the digits by white space
- Exponential notation for true integer values [important for MS Excel]
- If field.min.legal.value and/or field.max.legal.value are specified and the value falls outside the specified range, then the INTEGER field fails validation and the record is categorized as Ugly. For example, the value Long.MIN_VALUE (-2**63 == -9223372036854775808) is considered invalid and the record is categorized as Ugly.
- All other values fail INTEGER validation; the field is considered Ugly and the record is categorized as Ugly.
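The lenient INTEGER acceptance rules above can be approximated with a short sketch. This is illustrative only, not the actual parser, and exact edge-case behavior may differ:

```python
# Illustrative sketch of the lenient INTEGER rules: surrounding whitespace,
# a leading or trailing sign (optionally spaced), and exponential notation
# that still denotes a whole number are all accepted.
def parse_lenient_int(value: str):
    """Return the integer value, or None when the field is invalid (Ugly)."""
    s = value.strip()                    # leading/trailing whitespace allowed
    sign = 1
    if s and s[0] in "+-":               # leading sign, optionally spaced
        sign = -1 if s[0] == "-" else 1
        s = s[1:].lstrip()
    elif s and s[-1] in "+-":            # trailing sign, optionally spaced
        sign = -1 if s[-1] == "-" else 1
        s = s[:-1].rstrip()
    try:
        f = float(s)                     # accepts exponential notation (1.23E2)
    except ValueError:
        return None
    if not f.is_integer():
        return None                      # not a true integer value
    return sign * int(f)

print(parse_lenient_int("  42  "))   # 42
print(parse_lenient_int("1.23E2"))   # 123
print(parse_lenient_int("123-"))     # -123
```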
DOUBLE field validation
- The characters of a DOUBLE field must represent a valid numeric IEEE 64-bit double in standard computer scientific notation
- If field.min.legal.value and/or field.max.legal.value are specified and the value falls outside the specified range then the DOUBLE field fails validation and the record is categorized as Ugly.
- All other values, including "NaN" (Not a Number), are considered invalid; the field is considered Ugly and the record is categorized as Ugly
- Note the inherent limitation of DOUBLE, which restricts precision to approximately 15 decimal digits
During import, the application optionally supports forcing numbers to comply with a desired precision and scale. This can be useful when upstream processing uses binary floating point (for example, Excel, or IEEE binary floating point in any programming language), which generates inexact results due to binary rounding:
- 1.230000000001 => 1.23
- 4.555555555557 => 4.56
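Using Python's decimal module, forcing a value to a desired scale can be sketched as follows, reproducing the two examples above (the function name and rounding mode are assumptions for illustration):

```python
from decimal import Decimal, ROUND_HALF_UP

# Illustrative sketch: round a numeric string to a fixed scale, cleaning up
# binary floating-point noise such as 1.230000000001.
def force_scale(value: str, scale: int) -> Decimal:
    quantum = Decimal(1).scaleb(-scale)  # scale=2 -> Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

print(force_scale("1.230000000001", 2))  # 1.23
print(force_scale("4.555555555557", 2))  # 4.56
```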
DECIMAL field validation
- If a DECIMAL value has more than 38 digits of precision (when the scale is non-negative) then the DECIMAL is considered out-of-range, the field is Ugly and the record is categorized as Ugly
- If field.min.legal.value and/or field.max.legal.value are specified and the value falls outside the specified range then the DECIMAL field fails validation and the record is categorized as Ugly
- All other (non-numeric) values fail DECIMAL validation; the field is considered Ugly and the record is categorized as Ugly
STRING field validation
- Whitespace may optionally be normalized (NormalizeWhitespace) through entity- and field-level properties prior to validation.
- If field.min.legal.value and/or field.max.legal.value are specified and the observed value falls outside the specified range in a case-sensitive comparison then the observed value fails validation and the record is categorized as Ugly
- If field.min.legal.char.length and/or field.max.legal.char.length are specified and the observed STRING length falls outside the bounds (where UTF-16 surrogate pairs are counted as two chars), then the observed value fails validation and the record is categorized as Ugly
- If field.allow.whitespace == false and the observed value contains whitespace then the field validation fails and the record is categorized as Ugly
- If field.allow.denormalized.whitespace == false and the observed value contains denormalized whitespace (any non-space whitespace chars OR leading space OR trailing space OR more than one sequential space) then the field validation fails and the record is categorized as Ugly
- If field.allow.control.chars == false and the observed value contains control chars (Unicode block C0 or block C1 control chars, or DEL == \u007F) then field validation fails and the record is categorized as Ugly
- If field.allow.non.ascii.chars == false and the observed value contains non-7-bit-US-ASCII characters (Unicode codepoints whose numeric value is >= 0x80) then field validation fails and the record is categorized as Ugly
- All other STRING values are considered valid
- STRING values are stored as UTF-8 with industry-standard backslash-escaping of the control chars, the DEL char, and the backslash char itself. The control chars \b \f \n \r \t use their standard symbolic representation. All other control chars are written using standard \uXXXX Unicode hexadecimal notation.
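The character-class checks described above (control chars and non-ASCII detection) can be sketched as simple predicates; the helper names are illustrative, not Qlik Catalog APIs:

```python
# Illustrative predicates matching the documented character classes.
def has_control_chars(value: str) -> bool:
    # C0 block (U+0000-U+001F), DEL (U+007F), and C1 block (U+0080-U+009F)
    return any(ord(c) <= 0x1F or 0x7F <= ord(c) <= 0x9F for c in value)

def has_non_ascii(value: str) -> bool:
    # any codepoint >= 0x80 is outside 7-bit US-ASCII
    return any(ord(c) >= 0x80 for c in value)

print(has_control_chars("hello\tworld"))  # True: TAB is a C0 control char
print(has_non_ascii("café"))              # True: é is U+00E9
```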
Properties: Order of evaluation rules overview
The following overview and demonstration uses property [default.]field.allow.control.chars, a sister property of [default.]field.allow.non.ascii.chars. It can be run using JDBC loads of Catalog's PostgreSQL table podium_core.pd_workorder (from podium_md database), which has control chars in the status_log field.
Ultimately, it is data in fields that is evaluated and used to determine if a record is ugly. With respect to control characters, when it comes time to evaluate the data in the field, the following rules are applied in the following order. Note that demonstration results may vary based on data in the environment and setting [default.]field.allow.non.ascii.chars may also be required.
- If field property field.allow.control.chars is set to true or false, that is used as the setting;
- If not, and entity property default.field.allow.control.chars is set, that is used;
- If not, and source property default.field.allow.control.chars is set, that is used;
- If not, false is used.*
*There is a small set of properties, such as enable.profiling where instead of false, the default is taken from the core_env.properties file. The set includes: enable.profiling, distribution.excludes, enable.distribution, enable.validation, and enable.archive.
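The order-of-evaluation rules above can be sketched with the properties held in plain dicts (illustrative only; Qlik Catalog stores these on the source, entity, and field objects):

```python
# Sketch of the field -> entity -> source -> hard-default resolution order.
def effective_setting(field_props, entity_props, source_props,
                      name="field.allow.control.chars"):
    if name in field_props:
        return field_props[name]                  # field property wins
    default_name = "default." + name
    if default_name in entity_props:
        return entity_props[default_name]         # then the entity default
    if default_name in source_props:
        return source_props[default_name]         # then the source default
    return False  # hard default (core_env-backed properties differ)

# Mirrors the demonstration: source=true, entity=false, field=true -> true
print(effective_setting({"field.allow.control.chars": True},
                        {"default.field.allow.control.chars": False},
                        {"default.field.allow.control.chars": True}))  # True
```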
When running the demonstration, it is easiest to start with setting a property on the source, then the entity, then the field:
- Without setting any properties, load podium_core.pd_workorder; records will be marked as ugly (view the details).
- On source podium_core, set property default.field.allow.control.chars to true, reload; all records will be good;
- On entity pd_workorder, set property default.field.allow.control.chars to false, reload; records will be marked as ugly;
- On field status_log, set property field.allow.control.chars to true, reload; all records will be good.
Notice that the source and entity properties begin with "default." This is often, but not always, the case for other field validation properties.
Not all properties are applicable to sources, entities, and fields. For example, header.validation.pattern is specific to only an entity, and distribution.excludes (controlling distribution database tables tied to entities) does not make sense in the context of a field.
The Catalog user interface does NOT show the "effective property" used by code after the application of the above order of evaluation rules.