
Profiling an HDFS file

From the Profiling perspective of Talend Studio, you can generate a column analysis with simple statistics indicators on an HDFS file via a Hive connection.

The sequence to create a profiling analysis on an HDFS file involves the following steps:

  1. Create a connection to a Hadoop cluster.
  2. Create a connection to a Hive server.

    This step is optional: you are prompted to create the Hive connection when you create the connection to an HDFS file.

  3. Create a connection to an HDFS file.

    This step guides you through creating a Hive external table, which leaves the data in the file but creates a table definition in the Hive metastore. This allows Talend Studio to run SQL queries on the data in the file through the Hive connection.

  4. Create a column analysis with simple indicators on the Hive table.

You can then modify the analysis settings and add other indicators as needed. You can also create other analyses later on this HDFS file by using the same Hive table.
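Conceptually, the external table created in step 3 corresponds to a HiveQL statement along the lines of the sketch below. The table name, column names, field delimiter, and HDFS path are all hypothetical; the wizard generates the actual statement from the file schema you confirm.

```sql
-- Hypothetical sketch of the external table behind the analysis.
-- The data stays in the HDFS folder; only the table definition is
-- stored in the Hive metastore, so SQL queries can run against it.
CREATE EXTERNAL TABLE customers_hdfs (
  id      INT,
  name    STRING,
  signup  DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION '/user/talend/customers';  -- an HDFS folder, not a single file
```

Because the LOCATION clause points to a folder rather than a file, every file in that folder must share the same structure.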

Note:
You can profile files of the following formats:
  • TXT
  • CSV
  • Parquet, with a flat structure

Creating a connection to a Hadoop cluster

Before you begin

  • You have selected the Profiling perspective.
  • You have the proper access permission to the Hadoop distribution and its HDFS.

Procedure

  1. In the DQ Repository tree view, expand Metadata, right-click Hadoop Cluster and select Create Hadoop Cluster.
    Contextual menu of the Hadoop Cluster node.
    A wizard opens to guide you through the steps to create a connection to the cluster.
  2. Follow the steps in the wizard to create the connection, and select the option to enter the Hadoop configuration information manually.
    For detailed information about creating connections to Hadoop clusters, see Managing Hadoop metadata.
  3. Click Check Services at the last step in the wizard to verify that the connection is successful, and then click Finish.

Results

The new Hadoop connection is listed under the Hadoop Cluster node in the DQ Repository tree view.

Creating a connection to Hive

You can create a connection to Hive directly from the connection you defined for the Hadoop distribution. Alternatively, you can create the Hive connection while creating the connection to an HDFS file, as outlined in Creating a connection to an HDFS file.

Before you begin

You have selected the Profiling perspective.

You have created a connection to the Hadoop distribution.

Procedure

  1. In the DQ Repository tree view, right-click the Hadoop connection to be used and select Create Hive to open a wizard.
    Contextual menu of a Hadoop connection.
  2. Follow the steps in the wizard to create the connection, and click Check at the last step to verify that the connection is successful.
  3. Click Finish.

Results

The new Hive connection is listed under the Hadoop Cluster and the DB connections nodes in the DQ Repository tree view.
New Hive connection under the Metadata node.

For more information about creating Hive connections, see Centralizing Hive metadata.

Creating a connection to an HDFS file

Before you begin

  • You have selected the Profiling perspective.
  • You have created a connection to the Hadoop distribution.

Procedure

  1. In the DQ Repository tree view, right-click the Hadoop connection to be used and select Create HDFS.
    A wizard opens to guide you through the steps to use a file schema from HDFS.
  2. Follow the steps in the wizard to create the connection, and click Check at the last step to verify that the connection is successful.
  3. Click Finish.

Results

The new HDFS connection is listed under the Hadoop connection in the DQ Repository tree view.
New HDFS connection under the Metadata node.

For more information about creating HDFS connections, see Centralizing HDFS metadata.

Creating a profiling analysis on the HDFS file via a Hive table

Before you begin

  • You have selected the Profiling perspective.
  • You have created a connection to the Hadoop distribution and the HDFS file.

About this task

You can profile files of the following formats:
  • TXT
  • CSV
  • Parquet, with a flat structure

Procedure

  1. In the DQ Repository tree view, right-click the HDFS connection to be used and select Create Simple Analysis.
    A dialog box opens listing the HDFS schemas in the connection.
    Overview of the HDFS schemas in a connection.
  2. Select the check box of the file you want to profile.
    Wait until Success is displayed in the Creation status column.
    Note: The Hive table you create is based on folders, not on individual files, so do not select files that have different structures.
  3. Click Check Connection to verify the connection status, and then click Next to display the schema of the selected file.
    Overview of the schema of a selected file.
  4. Modify the schema if needed.
    If there is a Date column in the schema, make sure the date pattern is set correctly; otherwise, the results may be null.
  5. Click Next to open a new view in the wizard where you can create a table with the HDFS schema on a Hive connection.
  6. Optional: If needed, enter a new name for the table. Use lowercase, as Hive stores table names in lowercase.
    Example of a name in lower case in the New Table Name field.
  7. Either:
    • From the Select one existed Hive Connection list, select the Hive connection on which you want to create the table.

      You must have at least one correctly configured Hive connection before you can create the table; otherwise, the Select one existed Hive Connection option is disabled.

    • Select the Create a new Hive Connection option to first create a Hive connection and then create the table on the new connection.
  8. Click Finish.
    The New Analysis wizard opens.
  9. Set the analysis metadata and click Finish.
    Overview of the Data Preview and the Analyzed Columns sections.

    A new analysis on the selected HDFS file is automatically created and opened in the analysis editor. Simple statistics indicators are automatically assigned for columns.

    The analysis actually applies to the Hive table, but it computes statistics on the data in HDFS through the external table mechanism. External tables keep the data in the original file, outside Hive. If the HDFS file you selected for analysis is deleted, the analysis can no longer run.

  10. Click Refresh Data to display the column content.
    You can use the Select Columns tab to modify the columns to be analyzed.
  11. If needed, click Select Indicators to add more indicators or new patterns to the columns.
  12. Run the analysis to display the results in the Analysis Results section in the editor.
    Tables and graphics for the Simple Statistics indicator.

    For more information on column analysis, see Column analyses.
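The simple statistics indicators assigned in the steps above boil down to SQL aggregates that Hive runs against the external table. The following sketch, using the hypothetical table and column names from earlier, illustrates the kind of query involved; the exact SQL Talend Studio generates may differ.

```sql
-- Hypothetical sketch of simple statistics for one analyzed column:
-- row count, value (non-null) count, null count, distinct count,
-- and blank count, computed over the external table.
SELECT
  COUNT(*)                                          AS row_count,
  COUNT(name)                                       AS value_count,
  COUNT(*) - COUNT(name)                            AS null_count,
  COUNT(DISTINCT name)                              AS distinct_count,
  SUM(CASE WHEN TRIM(name) = '' THEN 1 ELSE 0 END)  AS blank_count
FROM customers_hdfs;
```

Because the table is external, each run reads the data directly from the HDFS folder, so the results always reflect the current file contents.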
