The first step in analyzing the content of one or multiple columns is to
define the columns to be analyzed. The analysis results provide statistics about the
values within each column.
Before you begin
You have defined at least one database connection in the Profiling perspective of Talend Studio.
About this task
When you analyze Date columns
and run the analysis with the Java engine, the date information is stored in
Talend Studio and in the data mart as regular date/time values,
in the format YYYY-MM-DD HH:mm:ss.SSS for date/timestamp and
in the format HH:mm:ss.SSS for time. The date and time formats
are slightly different when you run the analysis with the SQL engine.
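To make the two Java-engine formats concrete, here is a minimal Python sketch that renders values with millisecond precision in the same layouts. It only illustrates what the patterns look like; it is not how Talend Studio stores the values, and the function names `format_timestamp` and `format_time` are hypothetical.

```python
from datetime import datetime, time


def format_timestamp(value: datetime) -> str:
    """Render a date/timestamp in the YYYY-MM-DD HH:mm:ss.SSS layout."""
    # strftime only offers microseconds, so milliseconds are derived by hand.
    return value.strftime("%Y-%m-%d %H:%M:%S.") + f"{value.microsecond // 1000:03d}"


def format_time(value: time) -> str:
    """Render a time in the HH:mm:ss.SSS layout."""
    return value.strftime("%H:%M:%S.") + f"{value.microsecond // 1000:03d}"


print(format_timestamp(datetime(2024, 3, 5, 14, 7, 9, 120000)))  # 2024-03-05 14:07:09.120
print(format_time(time(14, 7, 9, 120000)))  # 14:07:09.120
```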
Defining the basic column analysis
Big Data Platform
Cloud API Services Platform
Cloud Big Data Platform
Cloud Data Fabric
Cloud Data Management Platform
Data Fabric
Data Management Platform
Data Services Platform
MDM Platform
Qlik Talend Cloud Enterprise Edition
Qlik Talend Cloud Premium Edition
Real-Time Big Data Platform
Procedure
In the DQ Repository tree view, expand the Data Profiling folder.
Right-click the Analyses folder and select New
Analysis.
The Create New Analysis wizard
opens.
In the filter field, start typing basic column
analysis, select Basic Column Analysis and
click Next.
In the Name field, enter a name for the current column
analysis.
Important:
Do not use the following special characters in the item names: ~ ! ` # ^ * & \\ / ? : ; \ , . ( ) ¥ ' " « » < >
These characters are all replaced with "_" in the file system and you may end up
creating duplicate items.
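The following Python sketch illustrates why the replacement can produce duplicates: different item names collapse to the same file-system name once each listed character becomes "_". It is an assumption-based illustration of the rule in the note, not Talend Studio's actual implementation; `FORBIDDEN` and `file_system_name` are hypothetical names.

```python
# Characters taken from the note above (hypothetical stand-in for the real rule).
FORBIDDEN = set("~!`#^*&\\/?:;,.()¥'\"«»<>")


def file_system_name(item_name: str) -> str:
    """Replace every listed special character with an underscore."""
    return "".join("_" if ch in FORBIDDEN else ch for ch in item_name)


names = ["sales.2024", "sales:2024", "sales_2024"]
mapped = {name: file_system_name(name) for name in names}

# All three distinct item names end up with the same name on disk.
for original, stored in mapped.items():
    print(f"{original!r} -> {stored!r}")
print("distinct names on disk:", len(set(mapped.values())))  # distinct names on disk: 1
```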
Optional: Set column analysis metadata (Purpose,
Description and Author) in the
corresponding fields and click Next.
Selecting the database columns and setting sample data
Procedure
Expand DB connections and in the desired database, browse
to the columns you want to analyze.
Note: When profiling a DB2 database, if the column names of a
table contain double quotes, the quotes are not retrieved along with the
column names. It is therefore recommended not to use double quotes in column
names in a DB2 database table.
Select the columns and then click Finish to close the
wizard.
A file for the newly created column analysis is listed under the Analysis node in the DQ Repository tree
view, and the analysis editor opens with the analysis metadata.
This example analyzes full names, email addresses and sales
figures.
In the Data preview view, click Refresh Data.
The data in the selected columns is displayed in the table.
In the Data preview view, select the action you want to
perform.
New Connection: open a wizard and create a connection
to the data source from within the editor.
The
Connection field at the top of this section lists all
the connections created in Talend Studio.
Select Columns: open the Column
Selection dialog box where you can select the columns to analyze
or change the selection of the columns listed in the table. From the open
dialog box, you can filter the table or column lists by using the
Table filter or Column filter
fields respectively.
Select Indicators: open the Indicator
Selection dialog box where you can select the indicators to use
for profiling columns.
n first rows or n random rows:
list in the table either the first N data records from the selected columns or N
random records from the selected columns.
Refresh Data: display the data in the selected columns
according to the criteria you set.
Run with sample data: run the analysis only on the
sample dataset whose size is set in the Limit field.
In the Limit field, set the number for the data records
you want to display in the table and use as sample data.
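The difference between the n first rows and n random rows options, combined with the Limit value, can be pictured with the short Python sketch below. It is only a conceptual illustration on an in-memory list; how Talend Studio actually fetches the sample from the database is not described here, and `first_rows`, `random_rows` and the `seed` parameter are hypothetical.

```python
import random


def first_rows(rows, limit):
    """Return the first `limit` records, in their original order."""
    return rows[:limit]


def random_rows(rows, limit, seed=None):
    """Return `limit` records picked at random, without repetition."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(rows, min(limit, len(rows)))


records = [f"customer_{i}" for i in range(1, 11)]

print(first_rows(records, 3))   # always the same three leading records
print(random_rows(records, 3))  # three records that vary from run to run
```

A sample of the first rows is cheap to read but can be biased if the table is sorted, while a random sample is usually more representative of the whole column.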
In the Analyzed Columns section, use the
arrows in the top right corner to open different pages in the view if you analyze a large
number of columns.
You can also drag the columns to be analyzed directly from the DQ Repository tree view to the Analyzed Columns list.
If one of the columns you want to analyze is a primary or a foreign key,
its data mining type is automatically set to Nominal when
you list it in the Analyzed Columns view.
If required, right-click any of the listed columns in the Analyzed Columns section and select Show in DQ Repository view to locate it in the database connection in
the DQ Repository tree view.