Skip to main content Skip to complementary content

Detecting anomalies in columns (Functional Dependency Analysis)

This type of analysis helps you to detect anomalies in column dependencies through defining columns as either "determinant" or "dependent" and then analyzing values in dependent columns against those in determinant columns. This analysis supports only database tables.

About this task

This type of analysis detects to what extent a value in a determinant column functionally determines another value in a dependent column.

This can help you identify problems in your data, such as values that are not valid. For example, if you analyze the dependency between a column that contains United States Zip Codes and a column that contains states in the United States, the same Zip Code should always have the same state. Running the functional dependency analysis on these two columns will show if there are any violations of this dependency.

Defining the analysis to detect anomalies in columns

Before you begin

At least one database connection is set in the Profiling perspective of Talend Studio.

Procedure

  1. In the DQ Repository tree view, expand Data Profiling.
  2. In the filter field, start typing functional dependency analysis, select Functional Dependency Analysis from the list and click Next.
  3. Enter a name.
  4. Optionally, enter the column analysis metadata (Purpose, Description and Author).
  5. Click Next.

Selecting the columns as either "determinant" or "dependent"

Procedure

  1. Expand DB connections, browse to the database you want to analyze, select it and then click Finish to close the New Analysis wizard.
    The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under Analyses in the DQ Repository tree view.
    Overview of the Analyzed Columns Set section.
    The Data Preview section shows a sample data of all the table columns.
  2. In the Left Column panel, click A Columns Set to open the Column Selection dialog box.
    Here you can select the first set of columns against which you want to analyze the values in the dependent columns. You can also drag the columns directly from the DQ Repository tree view to the left column panel.
    In this example, you want to evaluate the records present in the city column and those present in the state_province column against each other to see if state names match to the listed city names and vice versa.
    Overview of the Column Selection window.
  3. In the Column Selection dialog box, expand DB Connections and browse to the columns you want to define as determinant columns.
    You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists will show only the tables/columns that correspond to the text you type in.
  4. Select the check box next to the columns you want to analyze and click OK to proceed to the next step.
    The selected columns are displayed in the Left Columns panel of the Analyzed Columns Set view. In this example, select the city column as the determinant column.
    Overview of the Left Columns and Right Columns sections.
  5. Do the same to select the dependent columns or drag them from the DQ Repository tree view to the Right Columns panel. In this example, select the state_province column as the dependent column. This relation will show if the state names match to the listed city names.
    If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected column is automatically located under the corresponding connection in the tree view.
  6. Click the Reverse columns tab to automatically reverse the defined columns and evaluate the reverse relation, what city names match to the listed state names.
    You can select to connect to a different database by selecting another connection from the Connection list in the Data Preview section. This list shows all the connections created in Talend Studio. If the columns listed in the Analyzed Columns Set view do not exist in the new database connection you want to set, you will receive a warning message that enables you to continue or cancel the operation.

Finalizing and executing the functional dependency analysis

Procedure

  1. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to the selected database, if required.
    You can set this number according to the database available resources, that is the number of concurrent connections each database can support.
  2. Save the analysis and press F6 to execute it.

    An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis Results view.

    Graphs showing the results of the functional dependency analysis.
    This functional dependency analysis evaluated the records present in the city column and those present in the state_province column against each other to see if the city names match to the listed state names and vice versa. The returned results, in the %Match column, indicate the functional dependency strength for each determinant column. The records that do not match are indicated in red.
    The #Match column in the result table lists the numbers of the distinct determinant values in each of the analyzed columns. The #row column in the analysis results lists the actual relations between the determinant attribute and the dependent attribute. In this example, #Match in the first row of the result table represents the number of distinct cities, and #row represents the number of distinct pairs (city, state_province). Since these two numbers are not equal, then the functional dependency relationship here is only partial and the ratio of the numbers (%Match) measures the actual dependency strength. When these numbers are equal, you have a "strict" functional dependency relationship, that is to say each city appears only once with each state.
    Information noteNote: The presence of null values in either of the two analyzed columns will lessen the "dependency strength". The system does not ignore null values, but rather calculates them as values that violates the functional dependency.
  3. In the Analysis Results view, right-click any of the dependency lines and select an action to perform:
    • View valid/invalid rows to access a list in the SQL editor of all valid and invalid rows measured according to the functional dependencies analysis.
    • View valid/invalid values to access a list in the SQL editor of all valid and invalid values measured according to the functional dependencies analysis.
    • View detailed valid/detailed invalid values to access a detailed list in the SQL editor of all valid and invalid values measured according to the functional dependencies analysis.
    From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!