Database Specific Sampling and Profiling Considerations
Data sampling and data profiling is available on all dedicated database bridges (e.g., Oracle, SQL Server, Hive, Teradata, etc.). It is implemented by automatically including the following options in the Miscellaneous bridge parameter:
Miscellaneous = -dp 1000 -dps 10
Which would be data profiling based upon 1000 rows and sampling of 10 rows.
However, the generic database JDBC import (e.g., HP Vertica) cannot predict the proper query to access data, and therefore does not support data profiling and sampling. However, the user may try by specifying the special option in the Miscellaneous bridge parameter. This is as follows:
Miscellaneous = -dp.query-template <SQL Query>
Examples for specific databases (to replace <SQL Query>):
HP Vertica:
SELECT @column FROM @schema.@table LIMIT @recordsLimit
IBM DB2 ISeries (AS/4oo):
SELECT @column FROM @schema.@table FETCH FIRST @recordsLimit ROWS ONLY
Microsoft SQL Server:
SELECT TOP @recordsLimit [@column] FROM [@schema].[@table]
Oracle database:
SELECT "@column" FROM "@schema"."@table" WHERE ROWNUM <= @recordsLimit
Oracle MySQL:
SELECT `@column` FROM `@schema`.`@table` LIMIT @recordsLimit
Data profiling and sampling requires access to the data in the data store. Thus, you must specify a user name / password with the appropriate access rights. For example:
-SAP HANA database:
GRANT the SELECT privileges to a role the user has.
https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.03/en-US/20f674e1751910148a8b990d33efbdc5.html
Steps
- Specify profiling information when harvesting the metadata source
- Execute the harvest.
Example
Table or file sample data
Log in as a user with the Data Viewingcapability object role assignment for the Data Lake.
Viewing the sample data records for the Employee.csv file on its object page in the Sample Data tab.
Sign in as Dan, who is a Data Lake Viewer:
The SSN field is obfuscated. This is due to the fact that the US Social Security Number data class has a Default Sensitivity of Classified and the Classified sensitivity label has the Hide Data flag set.
Field or column profiling data
Now, log in as a user with the Data Managementcapability object role assignment for the Data Lake or one that is a member of the Administrators or Data Administrators group.
Then, click the Fields tab and click Gender
Go to the Employee.csv file object page and then the Fields tab.
Click on the Gender field to go to its object page.
Data classes
A field or column may be assigned one or more data classes. E.g., the field in the example above is assigned the data class Gender. Once that data class is assigned, it is a property of the element and may be searched on, filtered by and further edited (remove the assignment).
Some actions can apply to all objects of a certain data class. In particular the Hide/Show property referenced above.
You may manage the list of data classes with proper admin privileges.
See data classification for more details.