Azure Cloud Storage
Azure Cloud Storage is Microsoft’s managed object storage service for unstructured data, including text, binary files, media, logs, and application backups. It supports hot, cool, and archive access tiers, offers geo-redundant replication, and integrates with Microsoft Entra ID (formerly Azure Active Directory) for secure access control.
Qlik Talend Cloud connects to Azure Cloud Storage using a Microsoft Entra ID application (service principal) that has read access to the target storage account container. The connector retrieves files from the specified container, automatically discovers schemas by sampling file contents, and performs incremental data replication based on file modification timestamps.
Preparing for authentication
To access your data, you need to authenticate the connection with your account credentials.
To set up your Azure Cloud Storage account, you need:
- An Azure subscription with an Azure Storage account.
- A blob container in the storage account that contains the files to replicate.
- A Microsoft Entra ID application registration with a client secret.
- The Storage Blob Data Reader role assigned to the application's service principal, scoped to the storage account or the specific container. This is the recommended least-privilege role for read-only access.
To register a Microsoft Entra ID application and retrieve your credentials:
- Log into your Azure account.
- Navigate to Microsoft Entra ID > App registrations > New registration.
- Enter the following information for your application:
- Name: Enter a name, for example QlikDataIntegration.
- Supported account types: Select Accounts in this organizational directory only.
- Click Register.
- On the application Overview page, copy both the Application (client) ID and Directory (tenant) ID and save them to a secure file.
- Navigate to Certificates & secrets > Client secrets > New client secret.
- Enter a description and select an expiration period for the client secret.
- Click Add.
- Copy your client secret value and save it to a secure file.
- In the Azure portal, open your storage account, then navigate to Access Control (IAM) > Add > Add role assignment.
- Select the Storage Blob Data Reader role, and assign this role to the application you just registered.
- Click Save.
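If you want to confirm that the registered application can read the container before creating the connection, a quick check with the Azure SDK for Python can help. The sketch below is illustrative only: the tenant ID, client ID, secret, account, and container values are placeholders, and the azure-identity and azure-storage-blob packages are assumed to be installed.

```python
# Minimal sketch: verify that the service principal can list blobs in the
# target container. All values below are placeholders for your own details.
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

credential = ClientSecretCredential(
    tenant_id="<directory-tenant-id>",
    client_id="<application-client-id>",
    client_secret="<client-secret-value>",
)

# The account URL is built from the bare storage account name.
service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential=credential,
)

container = service.get_container_client("my-container")

# Listing a few blobs confirms that the Storage Blob Data Reader role
# assignment has taken effect (role propagation can take a few minutes).
for blob in container.list_blobs():
    print(blob.name, blob.last_modified)
    break
```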
Supported file formats
- Delimited text files: .csv, .tsv, .psv, .txt (with configurable delimiter)
- JSON Lines: .jsonl
- Parquet: .parquet
- Avro: .avro
- Excel: .xlsx (multiple worksheets per workbook are supported; each sheet's rows are replicated, and the sheet name is appended to the _sdc_source_file column)
- Gzip-compressed files: .gz (containing any of the above formats)
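As a rough illustration of how a file's extension determines how its contents are interpreted, the sketch below maps the supported extensions to pandas readers (Avro is omitted for brevity). This is not the connector's own parsing logic, just a way to visualize the format list; pandas and its optional engines (pyarrow, openpyxl) are assumed dependencies.

```python
# Illustrative only: dispatch a local file to a reader based on its extension.
from pathlib import Path
import pandas as pd

def read_supported_file(path: str, delimiter: str = ",") -> pd.DataFrame:
    suffix = Path(path).suffix.lower()
    if suffix in {".csv", ".txt"}:
        return pd.read_csv(path, sep=delimiter)   # configurable delimiter
    if suffix == ".tsv":
        return pd.read_csv(path, sep="\t")
    if suffix == ".psv":
        return pd.read_csv(path, sep="|")
    if suffix == ".jsonl":
        return pd.read_json(path, lines=True)     # one JSON object per line
    if suffix == ".parquet":
        return pd.read_parquet(path)
    if suffix == ".xlsx":
        return pd.read_excel(path)                # reads one worksheet at a time
    raise ValueError(f"Unrecognized extension: {suffix}")
```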
Creating the connection
For more information, see Connecting to SaaS applications.
- Fill in the required connection properties.
- Provide a name for the connection in Connection name.
- Select Open connection metadata to define metadata for the connection when it has been created.
- Click Create.
| Setting | Description |
|---|---|
| Data gateway | Select a Data Movement gateway if required by your use case. Information note: This field is not available with the Qlik Talend Cloud Starter subscription, as it does not support Data Movement gateway. If you have another subscription tier and do not want to use Data Movement gateway, select None. For information on the benefits of Data Movement gateway and use cases that require it, see Qlik Data Gateway - Data Movement. |
| Start Date | Enter the date, in the required format. |
| Storage Account Name | Name of the Azure Storage account, for example mystorageaccount, without https:// or .blob.core.windows.net. |
| Container Name | Blob container name, for example my-container. |
| Tenant ID | The Directory (tenant) ID of the Microsoft Entra ID application registration. |
| Tables | Table configuration determines which files are read and how their contents are interpreted. Each table definition includes a file search pattern, a table name, and optional settings for customizing file handling. |
| Client ID | The Application (client) ID of the Microsoft Entra ID application registration. |
| Client Secret | The client secret value created for the application registration. |
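For orientation, here is a hedged example of what a completed set of connection properties might look like, written as a plain Python dict. The field names mirror the table above; the GUIDs, account, and container names are placeholders, and the dict is illustrative only, not a file or API that the connector reads.

```python
# Hypothetical values only, to show the expected shape of each setting.
connection_properties = {
    "Data gateway": "None",                       # or the name of a Data Movement gateway
    "Start Date": "2024-01-01",                   # assumed ISO-style date; use the format the UI requests
    "Storage Account Name": "mystorageaccount",   # bare name, no https:// or .blob.core.windows.net
    "Container Name": "my-container",
    "Tenant ID": "00000000-0000-0000-0000-000000000000",
    "Client ID": "11111111-1111-1111-1111-111111111111",
    "Client Secret": "<client-secret-value>",
}
```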
Tables configuration
Each entry in the tables configuration represents a logical table derived from files in the container. The following properties can be configured for each table:
| Property | Required or Optional | Description |
|---|---|---|
| Table name | Required | Specify the name of the logical table (for example, my_orders_csv). This becomes the stream name in Qlik Talend Cloud. |
| Search pattern | Required | Provide a regular expression to match file names (for example, .*\.csv$ matches all CSV files). The pattern is applied to file names within the container or, if provided, within the specified directory. |
| Directory | Optional | Enter a folder path prefix within the container to narrow the file search (for example, exports/orders/). Limiting the files scanned improves performance. This is not a regular expression. |
| Primary key | Optional | Define a comma-separated list of column names to use as the primary key (for example, id or id,date). For CSV files, use header field names; for JSONL files, use top-level object keys. Leave empty to use full-table replication, or populate it to enable incremental replication based on file modification time. |
| Specify datetime fields | Optional | List the column names, separated by commas, to treat as datetime fields even if they are not automatically detected during schema discovery (for example, created_at, updated_at). |
| Delimiter | Optional | Indicate the field separator for delimited text files. The default is , (comma). Use \t for TSV files or \| for PSV files. If not specified, the delimiter is auto-detected based on the file extension. |
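To see how the Directory prefix and Search pattern work together, the sketch below filters a list of blob names the way the table definition describes: the directory is a plain prefix, and the pattern is a regular expression applied to the file name. The table settings and blob names are invented for illustration.

```python
import re

# Hypothetical table definition mirroring the properties above.
table = {
    "table_name": "my_orders_csv",
    "directory": "exports/orders/",   # plain prefix, not a regular expression
    "search_pattern": r".*\.csv$",    # regular expression, not a glob
}

# Invented blob names for illustration.
blob_names = [
    "exports/orders/2024-01.csv",
    "exports/orders/readme.txt",
    "exports/customers/2024-01.csv",
]

pattern = re.compile(table["search_pattern"])

matched = [
    name
    for name in blob_names
    if name.startswith(table["directory"])
    and pattern.match(name.rsplit("/", 1)[-1])
]

print(matched)  # ['exports/orders/2024-01.csv']
```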
- Configure .jsonl and .csv files as separate tables to ensure accurate schema handling and data consistency.
- Ensure all .csv files matching a search pattern include a consistent header row with identical column names and order.
- Use consistent object attribute keys across all .jsonl files defined for each table. Key names and structures should align for reliable schema detection.
Tables replicated
Tables are defined in the tables configuration that you provide. Each table corresponds to a set of files in the blob container that match the specified search pattern and, if applicable, the directory prefix. The connector discovers the table schema by sampling up to five files per table, reading every fifth row, and analyzing up to 1,000 records per file.
Replication uses an incremental approach based on file modification timestamps when a primary key is configured. Files modified after the last sync bookmark are processed during each extraction. If no primary key is specified, the entire table is fully replicated on every run.
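The incremental behavior described above can be pictured as a simple bookmark over blob modification times. The sketch below is an assumption-laden illustration, not the connector's code: it lists blobs with the Azure SDK for Python and keeps only those modified after the previous run's bookmark. Credentials, names, and the bookmark value are placeholders.

```python
from datetime import datetime, timezone
from azure.identity import ClientSecretCredential
from azure.storage.blob import ContainerClient

# Placeholder credentials and names; see the authentication section above.
credential = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
container = ContainerClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    container_name="my-container",
    credential=credential,
)

# Bookmark saved from the previous sync; start of time on the first run.
bookmark = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Only blobs modified after the bookmark are picked up when a primary key
# is configured; without one, every matching file is re-read in full.
changed = [b for b in container.list_blobs() if b.last_modified > bookmark]

for blob in changed:
    print("to process:", blob.name, blob.last_modified)

# After a successful run, the bookmark would advance to the newest
# modification time seen.
if changed:
    bookmark = max(b.last_modified for b in changed)
```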
The following system columns are added to each table by default:
| Column | Description |
|---|---|
| _sdc_source_container | The name of the Azure blob container where the record originated. |
| _sdc_source_file | The full path of the file containing the record. For Excel files, the sheet name is appended (for example, exports/q1.xlsx/Sheet1). |
| _sdc_source_lineno | The line number of the record within the file. |
| _sdc_extra | Extra fields parsed that do not match the discovered schema (.jsonl files only). |
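As a way to visualize the system columns, the fragment below shows how a single CSV row might look once they are attached during extraction. It is purely illustrative; the record values and file name are invented, and only the columns listed above are added.

```python
# Illustrative only: a replicated record with the system columns attached.
record = {"id": 42, "amount": 19.99}   # parsed row from the source file

record.update(
    {
        "_sdc_source_container": "my-container",
        "_sdc_source_file": "exports/orders/2024-01.csv",
        "_sdc_source_lineno": 2,   # line number of the record within the file
        # "_sdc_extra" would only appear for .jsonl records whose fields
        # fall outside the discovered schema.
    }
)

print(record)
```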
Limitations and considerations
- The storage account name is supplied as a bare name, not a URL.
- Gzip-compressed files (.gz) are supported. The connector reads the original filename from the gzip header to determine the inner file format. Gzip files created with --no-name (no filename in the header) are skipped.
- Files with .csv, .txt, .tsv, .psv, or .jsonl extensions are checked for gzip magic bytes and are transparently decompressed, even if the file does not have a .gz extension (see the sketch after this list).
- Nested compression (for example, a .gz file inside another .gz) is not supported and is skipped.
- The Search pattern field uses regular expression syntax, not glob patterns (for example, use .*\.csv$ instead of *.csv).
- Files without a recognized extension are skipped, and a warning is issued.
- The connector includes built-in retry logic with exponential backoff for Azure API rate limits (HTTP 429) and transient server errors (HTTP 500, 502, 503, 504), up to five attempts.
- File encoding is expected to be UTF-8.
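The transparent decompression behavior mentioned above can be approximated with a small helper: check the first two bytes for the gzip magic number (0x1f 0x8b) and decompress when it is present. This is a hedged sketch of the general technique, not the connector's implementation; the sample payload is invented.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"

def open_possibly_gzipped(raw: bytes) -> io.TextIOWrapper:
    """Return a UTF-8 text stream, decompressing first if the payload is gzip.

    Works regardless of the file extension, which mirrors the magic-byte
    check described above for .csv/.txt/.tsv/.psv/.jsonl files.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")

# Example: a gzipped CSV payload that lacks a .gz extension.
payload = gzip.compress(b"id,amount\n42,19.99\n")
for line in open_possibly_gzipped(payload):
    print(line.rstrip())
```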