Internal file format
Qlik Catalog provides the option to store internal data in six formats: TEXT_TAB_DELIMITED, AVRO, PARQUET, PARQUET_ALL_STRING, ORC, and ORC_ALL_STRING.
Note that single-node environments support only the TEXT_TAB_DELIMITED file format.
Internal file format can be set at:
- Entity level (for ingestion)
- Prepare dataflow level (for prepare transforms)
TEXT_TAB_DELIMITED: Text format for storing data in a tabular structure. Each record in the table is one line of the text file, with fields separated by tabs. These files store data in human-readable form, which requires more space than the binary formats.
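As a rough illustration of the one-record-per-line layout described above, the following sketch writes and reads tab-delimited text with Python's standard `csv` module. The sample records and the helper names `to_tab_delimited` and `from_tab_delimited` are made up for this example; it does not show how Qlik Catalog itself writes its files.

```python
import csv
import io

# Hypothetical sample records; the values are illustrative only.
records = [
    ["1001", "Alice", "2021-03-04"],
    ["1002", "Bob", "2021-03-05"],
]

def to_tab_delimited(rows):
    """Serialize rows as tab-delimited text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

def from_tab_delimited(text):
    """Parse tab-delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

text = to_tab_delimited(records)
print(text)
```

Because every value is stored as plain characters, the file can be inspected in any text editor, at the cost of a larger size than a binary encoding of the same data.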
AVRO: Avro is a compact and fast binary data format. Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries. Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.
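To make the "schemas are defined with JSON" point concrete, here is a minimal sketch of an Avro record schema built and parsed with Python's standard `json` module. The record name `Customer`, the namespace, and the fields are assumptions invented for this example, not a schema that Qlik Catalog generates.

```python
import json

# Hypothetical record schema; name, namespace, and fields are illustrative only.
customer_schema = {
    "type": "record",
    "name": "Customer",
    "namespace": "example.catalog",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # A union with "null" makes the field optional.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# The schema is ordinary JSON, so any JSON library can write and read it.
schema_json = json.dumps(customer_schema, indent=2)
parsed = json.loads(schema_json)
field_names = [field["name"] for field in parsed["fields"]]
print(field_names)
```

Only the schema handling is shown here; encoding the records themselves into Avro's binary form would need an Avro library.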
PARQUET: Columnar storage format built to support efficient compression and encoding schemes. Parquet is available to any project in the Hadoop ecosystem, regardless of data processing framework, data model or programming language.
PARQUET_ALL_STRING: This option saves all data in Parquet as the string data type. This can resolve issues where, for example, leading zeros are stripped from a numeric value, leaving no record that the original value contained them.
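The leading-zero problem can be reproduced in a few lines of plain Python. This is a sketch of the general effect of numeric typing, using invented sample values; it does not call Parquet or Qlik Catalog.

```python
# Hypothetical source values, such as postal or account codes.
raw_values = ["00123", "02134", "98765"]

# Typed as numbers, the leading zeros are lost and cannot be recovered.
as_numbers = [int(value) for value in raw_values]
round_tripped = [str(number) for number in as_numbers]

# Kept as strings, the values survive unchanged.
as_strings = list(raw_values)

print(round_tripped)
print(as_strings)
```

The trade-off is that string columns lose numeric typing, so any arithmetic or range comparison needs an explicit cast downstream.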
ORC: A self-describing, type-aware columnar file format designed for Hadoop workloads. ORC is optimized for large streaming reads, with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. ORC stores data in stripes and keeps additional information (indexes, aggregates) in the data block.
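The benefit of a columnar layout can be sketched with a toy in-memory example. The code below is only an illustration of the idea, with invented data and a hypothetical helper `to_columns`; it does not implement ORC stripes, indexes, or compression.

```python
# Hypothetical row-oriented records.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 7.25},
]

def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# A query that needs only "amount" touches a single column
# and never reads the "id" or "region" values.
total_amount = sum(columns["amount"])
print(total_amount)
```

In a real columnar file the same principle applies on disk: the reader can skip the bytes of columns that the query does not reference.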
ORC_ALL_STRING: This option saves all data in ORC as the string data type. This can resolve issues where, for example, leading zeros are stripped from a numeric value, leaving no record that the original value contained them.