Data profiling
Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data. In short, data profiling helps you understand your data and model it correctly.
Qlik Compose enables you to profile the data in the landing zone tables before it is loaded into the data warehouse. If you discover a problem with certain data, then you can either manually adjust the source tables or create a rule for handling the data in question.
To profile the data:
- Click the Manage button at the bottom of the Data Warehouse panel.
- In the Manage Data Warehouse Tasks window, click the link in the Mapping column for the table you want to profile.
-
In the Edit Mappings - <Name> window, click the Data Profiler toolbar button. The Profile <Table Name> (Landing Zone) window opens. The following columns are displayed:
- Column Name - The name of the table column
- Nulls - The number of null values in the column
- Count - The number of rows in the column.
- Count Distinct - The number of unique rows in the column.
-
Duplicates - The number of duplicate values in the column.
Note that although Compose calculates the number of duplicate values by subtracting Count Distinct from Count, the actual number of records displayed when you click the Duplicates number will be higher. This is because Compose has no way of knowing which of the records that share the same column value are legitimate duplicates (if any). It therefore displays all records that share the same value so you can decide which of them to delete (if any).
For example, in the Employees table, there may be several employees that live in London (the City column). Therefore duplicates of "London" are perfectly acceptable. However, two employees with the same phone number and a different address, for example, may indicate that the phone number in one of the records was entered incorrectly.
Duplicate values are quite common and usually do not indicate a problem. Where this feature is particularly useful however, is for detecting duplicate Primary Key candidate columns.
- Data Type - The column data type
- Max - The highest data value
- Max Length - The longest data value
- Min - The lowest data value
- Min Length - The shortest data value
- For more information about a value, click the link in the column. A window opens showing the record(s) containing the value. To add a Data Quality rule, click the Data Quality button and continue as described in Defining and managing data quality rules.
- To only show columns that are mapped to a logical entity column, select the Only show mapped columns check box.
- To change the number or rows sampled, select a different value from the Rows to sample drop-down list. Note that the table may contain less rows than the selected value. The Sampled records value is the actual number of rows sampled.
-
To see all the table data, click the Show Data button.
Information noteThe table's Full Load data will always be shown, even for a mapping in a Change Processing (CDC) task.
- To recalculate the data, click the Recalculate button. This is useful if the data in the landing zone tables is being constantly updated (for example, due to a Replicate Change Processing task).
- To search for a particular value, start typing the value in the Search box. Only values that match the search term will be shown.