Data quality and data discovery
After opening a dataset, you can review several parts of the overview to learn more about its overall quality, its schema, its quality statistics, and the semantic type of each column.
Quality indicators of the dataset
When you open the overview of a dataset that has just been registered, most of the information is grayed out. To calculate the data quality for the first time, click the Compute button. If the quality has already been computed once before, but you want to make sure that the data is up to date, click the Refresh button.
Each compute or refresh will cost you Snowflake credits. For more information, see Data quality for Snowflake datasets.
There are two main sections where the quality is displayed.
-
The Data quality area, which includes:
-
The distribution of valid, invalid, and empty values across the whole dataset, shown as a three-color quality bar with the respective percentages.
-
A Validity score, expressing the percentage of valid values, without taking empty values into account.
-
A Completeness score, expressing the percentage of values that are not empty.
-
The Schema area, which shows the fields of the dataset, the data type or semantic type applied to each field, and a quality bar for each field.
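The two scores described above can be sketched from the three counts behind the quality bar. This is a minimal illustration that assumes validity is the share of valid values among non-empty values and completeness is the share of non-empty values among all values. The function name quality_scores and the returned keys are hypothetical and not part of the product.

```python
def quality_scores(valid: int, invalid: int, empty: int) -> dict:
    """Return the quality bar shares and both scores, all in percent."""
    total = valid + invalid + empty
    non_empty = valid + invalid
    if total == 0:
        raise ValueError("the dataset has no values")
    return {
        # Shares shown in the three-color quality bar.
        "valid_pct": 100 * valid / total,
        "invalid_pct": 100 * invalid / total,
        "empty_pct": 100 * empty / total,
        # ASSUMPTION: validity ignores empty values, so only
        # non-empty values appear in the denominator.
        "validity": 100 * valid / non_empty if non_empty else 0.0,
        # Completeness is the share of values that are not empty.
        "completeness": 100 * non_empty / total,
    }


scores = quality_scores(valid=70, invalid=10, empty=20)
print(scores["validity"], scores["completeness"])
```

With 70 valid, 10 invalid, and 20 empty values, the quality bar shows 70%, 10%, and 20%, while validity is computed over the 80 non-empty values only.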
Semantic types discovery
Each field of a dataset is automatically assigned a semantic type to better describe its content. Behind the scenes, a data discovery operation determines which type to assign.
The data discovery calculates how many values in a column match each semantic type and, if the result is greater than 40%, it assigns the semantic type to the column.
How is the percentage calculated?
This percentage is the sum of two percentages:
-
One percentage represents the share of values matching the semantic type and contributes up to 100%. To determine whether a value matches a semantic type, the data discovery applies a test that depends on the kind of semantic type:
-
Dictionary: Does the value match a value from the dictionary? Punctuation, case, spaces, and accents are ignored.
-
Regular expression: Does the value match the regular expression?
-
Compound: Does the value match at least one child?
A compound type is a group of existing semantic types, called children.
If the answer is positive, the value is considered valid.
-
The other percentage represents the similarity between the column name and the name of the semantic type and contributes up to 10%.
To compare the names:
-
The Levenshtein algorithm is used. It calculates the minimum number of edits (insertion, deletion, or substitution) required to transform one string into another.
-
The case and accents are ignored.
-
If the strings contain spaces, the word order is ignored. For example, US Phone and Phone US are considered identical.
The maximum percentage is 100%. If all values match a semantic type and the column name is identical to the name of the semantic type, the result is still 100%.
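The calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the documentation does not say how the Levenshtein distance is converted into the name percentage, so the linear scaling in name_bonus is an assumption, as is counting every value passed in (including empty ones) in the denominator. All function names are hypothetical, and the value test is supplied by the caller as a stand-in for the dictionary, regular expression, or compound check.

```python
import unicodedata
from typing import Callable, Iterable


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize_name(name: str) -> str:
    """Ignore case, accents, and word order before comparing names."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(sorted(no_accents.split()))


def name_bonus(column_name: str, type_name: str, max_bonus: float = 10.0) -> float:
    """ASSUMPTION: the bonus shrinks linearly with the normalized edit distance."""
    a, b = normalize_name(column_name), normalize_name(type_name)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return max_bonus * (1 - levenshtein(a, b) / longest)


def discovery_score(values: Iterable[str], column_name: str, type_name: str,
                    matches: Callable[[str], bool]) -> float:
    """Sum of the value-match percentage and the name bonus, capped at 100."""
    values = list(values)
    if not values:
        return 0.0
    match_pct = 100 * sum(1 for v in values if matches(v)) / len(values)
    return min(100.0, match_pct + name_bonus(column_name, type_name))


THRESHOLD = 40.0  # the semantic type is assigned when the score exceeds this

# Hypothetical dictionary type; the real dictionary test also ignores
# punctuation, spaces, and accents, which this lambda does not.
us_states = {"texas", "ohio", "utah"}
score = discovery_score(["Texas", "OHIO", "n/a", "Paris"], "State US", "US State",
                        lambda v: v.casefold() in us_states)
assigned = score > THRESHOLD
print(score, assigned)
```

In this example half of the values match, and the column name equals the type name once word order is ignored, so the name contributes its full 10 points.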
Data types discovery
Instead of semantic types, native data types can also be assigned. If no semantic type obtains more than 40%, the data discovery automatically assigns a data type.
To determine the type of a value, the data discovery runs the following checks in order:
-
Is the value empty?
-
Is the value of type boolean? true and false are the only values considered of type boolean.
-
Is the value of type integer?
-
Is the value of type decimal?
-
Is the value of type date?
-
If the value is not of one of the above types, it is considered a text value.
Because the checks run in sequence and stop at the first match, a value has exactly one type. For example, the value 5 is of type integer. It will not also be considered of type text.
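The ordered checks can be sketched as a chain of tests that returns at the first match. The documentation does not specify the accepted number or date formats, so the patterns below are assumptions: integers and decimals are matched with simple regular expressions, dates are limited to what Python's date.fromisoformat accepts, booleans must be exactly lowercase true or false, and whitespace-only values count as empty. The function name detect_type is hypothetical.

```python
import re
from datetime import date

# ASSUMPTION: simple numeric formats with an optional sign.
INTEGER = re.compile(r"[+-]?\d+")
DECIMAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def is_date(value: str) -> bool:
    """ASSUMPTION: only ISO 8601 dates such as 2024-01-31 are recognized."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def detect_type(value: str) -> str:
    """Run the checks in the documented order and stop at the first match."""
    text = value.strip()
    if text == "":
        return "empty"
    if text in ("true", "false"):
        return "boolean"
    if INTEGER.fullmatch(text):
        return "integer"
    if DECIMAL.fullmatch(text):
        return "decimal"
    if is_date(text):
        return "date"
    # Nothing above matched, so the value falls back to text.
    return "text"


for sample in ["", "true", "5", "5.0", "2024-01-31", "hello"]:
    print(repr(sample), detect_type(sample))
```

Because each check returns immediately, the value 5 is reported as integer and never reaches the decimal, date, or text checks.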