Skip to main content Skip to complementary content

Detecting anomalies in columns (Functional Dependency Analysis)

This type of analysis helps you to detect anomalies in column dependencies through defining columns as either "determinant" or "dependent" and then analyzing values in dependent columns against those in determinant columns. This analysis supports only database tables.

About this task

This type of analysis detects to what extent a value in a determinant column functionally determines another value in a dependent column.

This can help you identify problems in your data, such as values that are not valid. For example, if you analyze the dependency between a column that contains United States Zip Codes and a column that contains states in the United States, the same Zip Code should always have the same state. Running the functional dependency analysis on these two columns will show if there are any violations of this dependency.

Defining the analysis to detect anomalies in columns

Before you begin

At least one database connection is set in the Profiling perspective of Talend Studio.

Procedure

  1. In the DQ repository tree view, expand Data Profiling and right-click Analyses > New analysis.
  2. Select Table > Functional dependency and click Create.
  3. Enter a name.
  4. Optionally, enter the analysis metadata (Purpose, Description and Author).
  5. Click Next.

Selecting the columns as either "determinant" or "dependent"

Procedure

  1. From the Connection menu, select the connection and click Next. You are in the Compare menu.
  2. Click Select column A to open the Column selection dialog box.
    Here you can select the first set of columns against which you want to analyze the values in the dependent columns. You can also drag the columns directly from the DQ repository tree view to the left column panel.
    In this example, you want to evaluate the records present in the city column and those present in the state_province column against each other to see if state names match to the listed city names and vice versa.
    Overview of the Column Selection window.
  3. In the Column Selection dialog box, expand the connection and browse to the columns you want to define as determinant columns.
    You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists will show only the tables/columns that correspond to the text you type in.
  4. Select the check box next to the columns you want to analyze and click OK to proceed to the next step.
    In this example, select the city column as the determinant column.
    Overview of the column selection for A and B.
  5. Click Select column B and follow the same steps to select the second set of columns or drag it to the right column panel.. In this example, select the state_province column as the dependent column. This relation will show if the state names match to the listed city names.
    If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ repository view, the selected column is automatically located under the corresponding connection in the tree view.
  6. Click Switch columns to automatically switch the defined columns and evaluate the reverse relation, what city names match to the listed state names.
    You can select to connect to a different database by selecting another connection from the Connection list in the Data Preview section. This list shows all the connections created in Talend Studio. If the columns listed in the Analyzed Columns Set view do not exist in the new database connection you want to set, you will receive a warning message that enables you to continue or cancel the operation.

Finalizing and executing the functional dependency analysis

Procedure

  1. From the Settings menu, set the number of concurrent connections allowed per analysis in the Number of connections per analysis field.
    You can set this number according to the database available resources, that is the number of concurrent connections each database can support.
  2. Click Save and Run.

    An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the analysis results.

    Graphs showing the results of the functional dependency analysis.
    This functional dependency analysis evaluated the records present in the city column and those present in the state_province column against each other to see if the city names match to the listed state names and vice versa. The returned results, in the %Match column, indicate the functional dependency strength for each determinant column. The records that do not match are indicated in red.
    The #Match column in the result table lists the numbers of the distinct determinant values in each of the analyzed columns. The #row column in the analysis results lists the actual relations between the determinant attribute and the dependent attribute. In this example, #Match in the first row of the result table represents the number of distinct cities, and #row represents the number of distinct pairs (city, state_province). Since these two numbers are not equal, then the functional dependency relationship here is only partial and the ratio of the numbers (%Match) measures the actual dependency strength. When these numbers are equal, you have a "strict" functional dependency relationship, that is to say each city appears only once with each state.
    Information noteNote: The presence of null values in either of the two analyzed columns will lessen the "dependency strength". The system does not ignore null values, but rather calculates them as values that violates the functional dependency.
  3. In the analysis results, right-click any of the dependency lines and select an action to perform:
    • View valid/invalid rows to access a list in the SQL editor of all valid and invalid rows measured according to the functional dependencies analysis.
    • View valid/invalid values to access a list in the SQL editor of all valid and invalid values measured according to the functional dependencies analysis.
    • View detailed valid/detailed invalid values to access a detailed list in the SQL editor of all valid and invalid values measured according to the functional dependencies analysis.
    From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ repository tree view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!