Skip to main content Skip to complementary content

Ingest from file system: Profile data

Rich and detailed profile data is captured by Qlik Catalog during ingestion—while the application summarizes this data in the UI, a lot of detail exists in the file system that can be ingested and also profiled. The following example describes the process when the data resides in the file system.

Administrators can leverage this data for reports and apply transforms like any other entity in the cluster.

Ingest of profile data from HDFS is setup using the Add Data wizard in the following example.

Step 1: Setup FILE_HDFS source connection

Create a new source connection by selecting Source Connections from sources grid in source module

Source connection button initiates new source connection wizard

In source connection wizard specify the following fields:

Connection Name: User-defined

Description: User-defined

Connection Type: FILE

Protocol: HDFS

Connection String: Find connection URI value by selecting About link from Support dropdown menu in main navigation bar; locate the Settings tab and copy value for receiving.uri (example: hdfs://host-nameservice:8020/)

Source connection wizard filled out for File from HDFS connection

Upon creation of HDFS FILE Source Connection, initiate Add Data wizard

Step 2. Source Connection

Select Add New Data: To new source

Choose Connection: Select the connection created in Step 1 where Connection Type: FILE Protocol: HDFS

Add New Source Name: Configurable. This is the source that the new entity will be added to.

Default Entity Level: Select Managed (see System Settings: Level Control)

Select Source Hierarchy: Configurable. Choose from dropdown.

Inbound Protocol: Pre-defined inbound protocol auto-populates for the selected source connection (in this case it is HDFS)

Base Directory: Configurable. (This is the directory where data will be stored in file system). Select Folder icon to browse HDFS directories. If creating a new folder, type in a new directory filepath and then select Ok, or select a pre-existing base folder.

Groups: Select group(s) requiring access from the dropdown options. At least one group must be added for the data to be discoverable.

Click Next.

Add data wizard, source specification fields for File on HDFS


Step 3: Entity & Fields

Entity Name: Name the entity

File Path: Location of the profile data on HDFS. The file path follows the following path format: /<podiumbase_directory>/receiving/SOURCE_NAME/ENTITY_NAME/partition/profile/profile.txt

Example: HDFS discovery

[podium@nname-podium ~]$ hadoop fs -ls /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing

Found 3 items

drwxr-xr-x   - podium superuser          0 2018-12-08 02:44 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306

drwxr-xr-x   - podium superuser          0 2018-12-10 06:49 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181210064726

drwxr-xr-x   - podium superuser          0 2018-12-10 06:53 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181210065143

[podium@nname-podium ~]$ hadoop fs -ls /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306

Found 6 items

-rw-r--r--   3 podium superuser          0 2018-12-08 02:43 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/_SUCCESS

drwxr-xr-x   - podium superuser          0 2018-12-08 02:43 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/good

drwxr-xr-x   - podium superuser          0 2018-12-08 02:43 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/log

drwxr-xr-x   - podium superuser          0 2018-12-08 02:44 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/profile

drwxr-xr-x   - podium superuser          0 2018-12-08 02:43 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/sample

drwxr-xr-x   - podium superuser          0 2018-12-08 02:43 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/splitprofile

[podium@name-podium ~]$ hadoop fs -ls /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/profile

Found 2 items

-rw-r--r--   3 podium superuser          0 2018-12-08 02:44 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/profile/_SUCCESS

-rw-r--r--   3 podium superuser      32876 2018-12-08 02:44 /tmp/podium/8080/receiving/INDEXSN/CPI_USHousing/20181208024306/profile/profile.txt

 

Data Format Specification: Select radio button for Source to extract metadata from the entity specified in Entity File Path or use the FDL text located at the end of this topic (copy and paste into a .txt file) and chose FDL Metadata Method.

Once Metadata Method is specified, select Generate Record Layout

Generate Record Layout with FDL Metadata Method

Add data wizard, entity specification with FDL metadata method

Step 4: Modify record layout

Scroll to Record Information and select Field Delimiter value <Tab (\t)> if it is not selected by default. Select Record Terminator value <ANY_NEWLINE> if it is not selected by default.

Save the Entity.

Record layout, field delimiter and recored terminator values

Step 5: Record layout preview

Scroll to the bottom of File Format Specification sections and select Record Layout Preview.

Field metadata displays in preview (field names may not display).Connection Information includes Connection Name, Connection Type, Connection Protocol. Entity metadata columns include: Name, Index, Business Description, Technical Description, Data Type, Required, Encrypted at Source, Encrypt, Key, Validation Regex, Foreign Key.

Savethe entity.

Record layout preview

Step 6: Load data to the entity

Select Load from More dropdown to load data to the entity.

Load data to new entity

Results for this example are shown below (From discover grid on the entity row, select icon sample dataSample Data, then view Data LoadLatest Load).

Profile data latest load of profile values

FDL Text for Save/Upload (example)

If preferred, users can use FDL to define metadata for profile data as an alternative to using the entity filepath. Profile data FDL can be found below: Copy and paste and save on local machine as something like FDL.txt file. Select and upload this file on Step 2 of the Add Data wizard (Data Format Specification).


####################################
############ FIELD INFO #############
####################################
# Enter information about each field in following format - one field description per line
# field.n=FIELD_NAME | INDEX | BUSINESS_DESCRIPTION | DATA_TYPE | LENGTH | SCALE | PRECISION | NOT_NULL | IS_ENCRYPTED_AT_SRC | DO_ENCRYPT | IS_KEY | DO_NOT_PROCESS | TRIM | VALIDATION_REGEX | OPEN_QUOTE | CLOSE_QUOTE
# where n=index i.e.1,2,3…..
# FIELD_NAME: Name of the column; column can be part of a group in which case the multiple parts of name are separated by dot ‘.’ - Required attribute and a value must be provided.
# INDEX: Position or index of field in record - Required attribute and a value must be provided.
# BUSINESS_DESCRIPTION: Business description
# DATA_TYPE: Data type of field. Valid values are: INTEGER, DECIMAL, DATE, SYMBOL, STRING, CLOB, BOOLEAN
# LENGTH: Length of field. Required attribute if the value of RECORD.FIELD_DELIMITER IS NULL.
# PRECISION: Precision
# SCALE: Scale
# NOT_NULL: If the field is required or not. Valid values are TRUE, FALSE. Default value is FALSE.
# IS_ENCRYPTED_AT_SRC: If the field data is encrypted at source or not. Valid values are TRUE, FALSE. Default value is FALSE.
# DO_ENCRYPT: Is the field data to be encrypted by the system or not. Valid values are TRUE, FALSE. Default value is FALSE.
# IS_KEY: Is the field part of the record key. Valid values are TRUE, FALSE. Default value is FALSE.
# DO_NOT_PROCESS: Ignore the field during processing. Valid values are TRUE, FALSE. Default value is FALSE.
# NO_TRIM: No not trim leading & trailing whitespace from the field. Valid values are TRUE, FALSE. Default value is FALSE.
# VALIDATION_REGEX: Regular expression to be used for field value validation.Default value is empty String
# NULL_PROXY: Control value that evaluates and converts qualifying values to null
field.entity_nid  =entity_nid                | 0 |Podium Unique Entity Identifier | INTEGER | 6 | 6 | 0 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
field.field_nid    =field_nid                | 1 |Podium Unique Field Identifier | INTEGER | 6 | 6 | 0 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
field.field_name    =field_name              | 2 |Field Name - Valid for field level statistics     | STRING  | 40|    |    | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
field.statistic_name =statistic_name         | 3 |Statistic Name, includes record and field level stats | STRING  | 80|    |    | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
field.observed_value =observed_value         | 4 |Observation Value - valid for distribution of values only    | STRING  | 40|    |    | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
field.statistic_value =statistic_value       | 5 |Data value associated with statistic name    | STRING  | 90|    |    | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
        

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – let us know how we can improve!