Data Sampling and Profiling

You may configure the harvest to perform data sampling, data profiling, or both when importing the metadata.

Steps

  1. Configure a model to be harvested.
  2. Click the Data Setup tab.
  3. Specify data sampling and data profiling options, as desired.
Information note

Specifying data sampling and profiling options does NOT cause data profiling and/or sampling on every import of the model. Instead, these settings define how the sampling and profiling are performed when they do run.

You may use the Data sample and profile after metadata import checkbox and MQL Statement to cause the profiling and sampling to occur every time the model is imported. However, that is not a best practice, as sampling and profiling large databases can take orders of magnitude more time than the metadata import.

Instead, you may:

- Schedule the sampling and profiling separately using the Data sampling and profiling operation. This operation also samples and profiles only what is specified in the MQL STATEMENT.

- Sample and profile on demand via the user interface, on any subset of the model you wish to specify.

Information note

In a best practices environment, the application or model administrators should define the data import policy (profiling, sampling, and classification), and the users or scheduled tasks should follow that policy. Companies typically import metadata and data at different frequencies, so always running the data import (data sampling and profiling) immediately after every import of metadata is discouraged.

Thus, in the user interface, the option to import data is kept separate from the data import policy: the policy is displayed in the Data Defaults tab and the option is left in the Import Options tab. In this way, the option and the policy are independent. The option is turned off by default, but administrators can enable sampling and profiling according to policy. In addition, administrators may turn off sampling and profiling by default but enable data classification and data import. Still, data sampling and profiling are required in order to perform auto-tagging via data-driven data classification.
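The independence of the policy and the option can be pictured with a small sketch. This is not product code: `DataPolicy`, `ImportOptions`, and `operations_after_import` are hypothetical names invented for illustration, and the model covers only sampling and profiling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataPolicy:
    """Hypothetical stand-in for the administrator-defined policy."""
    sampling: bool = False
    profiling: bool = False


@dataclass(frozen=True)
class ImportOptions:
    """Hypothetical stand-in for the per-import option (off by default)."""
    data_import_after_metadata: bool = False


def operations_after_import(policy: DataPolicy, options: ImportOptions) -> List[str]:
    """Return the data operations that would run right after a metadata import."""
    if not options.data_import_after_metadata:
        # Option unchecked: the policy is stored but nothing is executed.
        return []
    operations = []
    if policy.sampling:
        operations.append("sampling")
    if policy.profiling:
        operations.append("profiling")
    return operations


policy = DataPolicy(sampling=True, profiling=True)
print(operations_after_import(policy, ImportOptions()))
print(operations_after_import(policy, ImportOptions(data_import_after_metadata=True)))
```

With the option left at its default, the first call returns an empty list even though the policy enables both operations; only the second call, with the option checked, returns the operations the policy defines.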

  1. Go to the Import Options tab and select Data sample, profile and classify after metadata import according to the Data Setup tab to sample, profile, or both immediately after the import. Optionally, enter an MQL STATEMENT to define a data request scope, that is, a subset of tables defined by a Metadata Query Language (MQL) statement (for example, tables from a set of schemas, or tables with or without a user-defined data sampling flag).
  2. Click SAVE.
  3. Click IMPORT.
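The data request scope in step 1 is, conceptually, a filter over the tables of the model. The sketch below is plain Python, not MQL, and makes no claim about MQL syntax; it only illustrates the two example scopes mentioned above (tables from a set of schemas, and tables with or without a user-defined sampling flag). `Table`, `select_scope`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Table:
    """Hypothetical table record; the fields are assumptions for illustration."""
    schema: str
    name: str
    sampling_flag: bool = False


def select_scope(
    tables: Iterable[Table],
    schemas: Optional[Set[str]] = None,
    sampling_flag: Optional[bool] = None,
) -> List[Table]:
    """Keep only the tables that match every criterion that is provided."""
    selected = []
    for table in tables:
        if schemas is not None and table.schema not in schemas:
            continue  # outside the requested schemas
        if sampling_flag is not None and table.sampling_flag != sampling_flag:
            continue  # flag does not match the requested value
        selected.append(table)
    return selected


tables = [
    Table("sales", "orders", sampling_flag=True),
    Table("sales", "customers"),
    Table("hr", "employees", sampling_flag=True),
]
in_sales = select_scope(tables, schemas={"sales"})
flagged = select_scope(tables, sampling_flag=True)
print([t.name for t in in_sales])
print([t.name for t in flagged])
```

Leaving both criteria unset selects every table, which corresponds to running without a scope restriction.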

Example

Sign in as Administrator.

Create the folder and configuration (as needed).

Go to MANAGE > Configuration. Click the model named Data Lake, which uses the File System bridge.

In the Data Setup tab, specify Data Sampling and Data Profiling with the default number of rows.

Click SAVE.

Go to the Import Options tab and select Data sample, profile and classify after metadata import according to the Data Setup tab to sample, profile, or both immediately after the import.

Click SAVE and click IMPORT.

After importing, the data profiling and sampling will still not have been executed unless you checked the Data sample, profile and classify after metadata import according to the Data Setup tab option.
