
Automatic feature engineering

With automatic feature engineering, Qlik AutoML can use existing features in your training data to create new features. These new auto-engineered features allow you to discover new patterns in your data, and can greatly improve the performance of your machine learning models.

Feature engineering is the process of creating new feature columns from current ones. AutoML can perform feature engineering automatically for improved handling of certain types of data. For general information about feature engineering, see Creating new feature columns.

Auto-engineered date features, and the parent features from which they are derived, are marked with an Auto-engineered icon.

After you select a dataset for use in your experiment, the dataset is analyzed and the columns within it are identified as containing certain data types. These data types allow AutoML to assign a feature type to each column in the dataset. Each column is given one of the following feature types:

  • Categorical

  • Numeric

  • Date

  • Free text

When possible, AutoML displays a list of auto-engineered features that can be created from eligible parent features. This list of auto-engineered features is further refined and reduced as preprocessing begins. Including auto-engineered features in your experiment is recommended but optional. You can remove individual auto-engineered features before you start training, and when configuring each new experiment version.

For more information about the processes completed before experiment training begins, see Automatic data preparation and transformation.

Date feature engineering

AutoML generates auto-engineered features from eligible columns with the date feature type, which have been identified as containing date and time information. Auto-engineered date features, and the parent features from which they are derived, are marked with an Auto-engineered icon.

When Qlik Cloud Analytics profiles the training dataset you have selected for use in AutoML, it links certain data types to the date feature type. This includes the following data types:

  • Date

  • Datetime

  • Time

  • Timestamp

Features that are assigned any of these data types during profiling are given the date feature type. For information about the available profile statistics that can be viewed for your data fields, see Profile List view.

When possible, AutoML displays a list of auto-engineered date features that can be created from eligible parent features that have the date feature type. Auto-engineered date features are included in the experiment by default. If you choose to include them, the new features are generated after v1 of the experiment.

Information note: It is recommended that models trained before August 29, 2023 be re-trained if they include features containing dates or timestamps.

Auto-engineered date features have the numeric feature type. They are included in the experiment by default, but are optional. You can remove some, or all, of them before starting experiment training, or when configuring the next experiment version. When auto-engineered date features are included, the original parent date feature is removed from the experiment.

You can instead include the parent date feature in the experiment. When you choose to do this, the feature type of the parent feature is switched from date to categorical, and the auto-engineered date features can no longer be used. Using the available auto-engineered features in your experiment is recommended, because they can improve the performance of your machine learning models.

Auto-engineered date features do not count towards the AutoML dataset size (maximum cell counts in training datasets and apply datasets) that has been specified in your Qlik Cloud subscription. Only the original date column cells are counted.

Schema view showing auto-engineered features that can be generated from a parent date feature 'Invoice Date'. Note the difference between the Data type and Feature type of each feature.

Schema view in experiment training, showing the parent feature identified as a date feature, along with the auto-engineered features that can be created from it. Each feature (column) in the dataset has a defined 'Feature type', which is distinct from, but may depend on, the 'Data type' value shown for that feature.

Using date features as the experiment target

In the rare case that you want to use a feature with date and time information as the target of your experiment, the feature type of the column is switched from date to categorical, and the auto-engineered features are removed. If you later select another target and want to use the date and time feature as a regular feature, you must manually change it back to the date feature type. When you return the feature to the date feature type, the auto-engineered date features are generated again.

For more information about how to change feature types, see Changing feature types.

Available auto-engineered date features

When generating auto-engineered date features from a column in your dataset, AutoML extracts and calculates specific components of each date and date-time value, isolating each component in its own column. The table below lists the auto-engineered features that can be generated by AutoML.

List of auto-engineered features which can be derived from a date and time feature
Auto-engineered feature | Data type | Feature type | Description
YEAR                    | Integer   | Numeric      | Year field parsed directly from the source date or timestamp.
MONTH                   | Integer   | Numeric      | Month field parsed directly from the source date or timestamp.
DAY                     | Integer   | Numeric      | Day field parsed directly from the source date or timestamp.
HOUR                    | Integer   | Numeric      | Hour field parsed directly from the source timestamp.
MINUTE                  | Integer   | Numeric      | Minute field parsed directly from the source timestamp.
SECOND                  | Integer   | Numeric      | Second field parsed directly from the source timestamp.
DAYOFWEEK               | Integer   | Numeric      | Day of the week, calculated from the source day, month and year.
WEEK                    | Integer   | Numeric      | Week of the year, calculated from the source day, month and year.

For each new feature created, the original column name is suffixed by the applicable auto-engineered feature.
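
As a rough illustration of the table above, the following Python sketch derives the same eight components from a single timestamp using only the standard library. It is not AutoML's implementation. The underscore separator in the generated column names, the ISO numbering used for DAYOFWEEK (Monday = 1) and the ISO 8601 week used for WEEK are assumptions made for the example, since the conventions AutoML applies are not stated here.

```python
from datetime import datetime


def engineer_date_features(column_name, value):
    """Split one timestamp into the eight components listed in the table.

    Assumptions (not confirmed by the documentation): names are joined with
    an underscore, DAYOFWEEK uses ISO numbering (Monday = 1), and WEEK is
    the ISO 8601 week of the year.
    """
    iso = value.isocalendar()  # (ISO year, ISO week, ISO weekday)
    components = {
        "YEAR": value.year,
        "MONTH": value.month,
        "DAY": value.day,
        "HOUR": value.hour,
        "MINUTE": value.minute,
        "SECOND": value.second,
        "DAYOFWEEK": iso[2],
        "WEEK": iso[1],
    }
    # Suffix the parent column name with each component name.
    return {f"{column_name}_{suffix}": v for suffix, v in components.items()}


features = engineer_date_features("Invoice Date", datetime(2023, 8, 29, 14, 30, 5))
print(features)
```

Every derived value is an integer, which matches the numeric feature type given to auto-engineered date features.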

Auto-engineered date features in the experiment configuration pane

Features section in experiment configuration pane, showing Auto-engineered features.

Auto-engineered date features in predictions

Auto-engineered date features are generated from the training dataset when a model is trained. The model can then be deployed as an ML deployment and used to make predictions on new data (the apply dataset).

When a model trained with auto-engineered date features is deployed for making predictions, the apply dataset on which you are generating predictions does not need to include the auto-engineered date features. AutoML generates the auto-engineered features for the apply dataset before predicting. However, the apply dataset must include the parent date feature, and the column must have been profiled as having the Date, Datetime, Timestamp, or Time data type.

The prediction datasets created by an ML deployment, including SHAP and apply datasets, will include the auto-engineered date features.

Auto-engineered date features in real-time predictions

For the real-time predictions API to be able to process your date and timestamp fields, the JSON payload you send to it must meet the following requirements:

  • Date and datetime values must be strings formatted in accordance with ISO 8601 standards

  • All values within a column must use the same time zone

Information note: The data you use to train your model do not have to follow these requirements.
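
One way to meet these requirements on the client side is to convert every value in a column to a single time zone and serialize it with `isoformat()` before building the JSON payload. The sketch below assumes UTC as the common time zone and uses a hypothetical column name, `invoice_date`. The overall payload structure expected by the real-time predictions API is not shown here and should be taken from the API documentation.

```python
import json
from datetime import datetime, timedelta, timezone


def to_iso8601_utc(value):
    """Return an ISO 8601 string expressed in UTC.

    Naive datetimes are rejected because their time zone is unknown.
    """
    if value.tzinfo is None:
        raise ValueError("naive datetime: attach a time zone before serializing")
    return value.astimezone(timezone.utc).isoformat()


# Two values recorded in different zones, normalized to the same one.
plus_two = timezone(timedelta(hours=2))
values = [
    datetime(2023, 8, 29, 14, 30, tzinfo=plus_two),
    datetime(2023, 8, 29, 9, 0, tzinfo=timezone.utc),
]

# "invoice_date" is a hypothetical column name used only for illustration.
rows = [{"invoice_date": to_iso8601_utc(v)} for v in values]
payload_fragment = json.dumps(rows)
print(payload_fragment)
```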

Handling of free text data

Free text (for example, textual string data entered into forms) requires special processing by machine learning algorithms to be useful in a model. In Qlik AutoML, processing of free text is a form of automatic feature engineering. Technically speaking, this processing uses the TF-IDF (term frequency - inverse document frequency) method.
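
To give a sense of what TF-IDF does, here is a minimal standard-library sketch. It uses the textbook weighting, tf(t, d) * log(N / df(t)), with simple whitespace tokenization. The tokenization, weighting variant and normalization that AutoML actually applies are not described here, so treat this as an illustration of the general technique only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute textbook TF-IDF weights for each document.

    tf  = term count / number of terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["late delivery of the order", "the order arrived damaged", "great service"]
scores = tf_idf(docs)
print(scores[0])
```

Terms that occur in many rows (such as "the" in this example) receive a lower weight than terms that are specific to a single row, which is what lets a model pick up on distinctive wording.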

AutoML supports separate processing for features with free text data in English.

If a column in your training data contains free text, it is assigned the free text feature type. It can also be used as a categorical feature, although this is strongly discouraged if it has high cardinality (too many unique values).

You can select a maximum of three columns to be used as free text features in an experiment.

Information note: It is recommended that models trained before January 23, 2024 be re-trained if they use fields that consist of free text data.

Requirements for free text encoding

For a column containing free text to be successfully encoded as free text, it must fulfill two requirements. These requirements are checked at different stages of experiment creation.

The requirements are:

  • The column must have an average length of 50 or more characters.

  • The column must have an average length of five or more words.
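
The two requirements can be approximated with a short check such as the one below, which can be useful for screening a column before you upload a dataset. It is only a sketch: how AutoML treats null values, whether spaces count as characters, and how it splits words are not documented here, so skipping empty values and splitting on whitespace are assumptions.

```python
def meets_free_text_requirements(values, min_avg_chars=50, min_avg_words=5):
    """Check both averages over the non-empty string values of a column.

    Assumptions: empty and non-string values are ignored, spaces count as
    characters, and words are separated by whitespace.
    """
    texts = [v for v in values if isinstance(v, str) and v.strip()]
    if not texts:
        return False
    avg_chars = sum(len(t) for t in texts) / len(texts)
    avg_words = sum(len(t.split()) for t in texts) / len(texts)
    return avg_chars >= min_avg_chars and avg_words >= min_avg_words


comments = [
    "The package arrived two days late and the box was badly damaged.",
    "Support resolved my billing issue quickly and was very polite about it.",
]
status_codes = ["ok", "fine", "n/a"]

print(meets_free_text_requirements(comments))
print(meets_free_text_requirements(status_codes))
```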

Treating a feature as free text

The process of treating a feature as free text is as follows:

  1. When you select your training data, Qlik AutoML identifies features that can possibly be processed as free text. They are marked with the Possible free text insight in schema view, and will have the free text feature type.

  2. After you run v1 of the experiment, additional analysis is completed. At this point, features initially marked as possible free text might be found to be unusable as free text features.

    If the features which are unusable as free text have high cardinality, it is recommended that you deselect them from the experiment. These features, when treated as categorical, contribute no value to model performance.

    If the features which are unusable as free text do not have high cardinality, you can include them in your experiment by clicking Treat as categorical, or by switching their Feature type from free text to categorical. If you leave the feature type as free text, it will also internally be treated as categorical, and will be impact encoded.

For full details about preprocessing, see Automatic data preparation and transformation.

For more information about each of the insights shown in schema view, see Common insights found in training data.

Using a free text feature as the experiment target

In rare cases, a free text feature can be selected as the target. If the feature meets all requirements for free text encoding, and contains between two and ten unique values, it can be used as the target. In these scenarios, the experiment is defined as a standard binary classification or multiclass classification problem.
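
The rule above can be expressed as a small decision function. This is a sketch, not AutoML's logic: it assumes that "between two and ten" is inclusive at both ends, that null values are not counted as a class, and that the result of the free text encoding check is passed in by the caller.

```python
def free_text_target_problem_type(values, meets_encoding_requirements):
    """Return the problem type for a free text target, or None if ineligible.

    Assumptions: the 2 to 10 range is inclusive and None values are ignored.
    """
    unique_values = {v for v in values if v is not None}
    if not meets_encoding_requirements:
        return None
    if not 2 <= len(unique_values) <= 10:
        return None
    return "binary" if len(unique_values) == 2 else "multiclass"


two_classes = ["approved after review"] * 3 + ["rejected after review"] * 2
print(free_text_target_problem_type(two_classes, True))
```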

Free text features in predictions

When you deploy a model trained with a free text feature, the resulting ML deployment can generate predictions as long as the following requirements are met for the apply dataset:

  • The column names of the feature match between the training dataset and the apply dataset

  • The column in the apply dataset, which corresponds to the free text feature in the training data, contains string data
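
Both requirements can be checked before a prediction is run. The sketch below represents the apply dataset as a list of dictionaries and uses a hypothetical column name, `customer_comment`; allowing null values is an assumption. Note that a check like this only confirms that the column exists and holds strings. It cannot tell whether those strings are genuinely free text.

```python
def check_free_text_column(apply_rows, feature_name):
    """Return a list of problems; an empty list means both requirements hold.

    Assumption: None values are allowed and are not treated as non-string.
    """
    if not all(feature_name in row for row in apply_rows):
        return [f"column '{feature_name}' is missing from the apply dataset"]

    non_strings = [
        row[feature_name]
        for row in apply_rows
        if row[feature_name] is not None and not isinstance(row[feature_name], str)
    ]
    if non_strings:
        return [f"column '{feature_name}' contains {len(non_strings)} non-string value(s)"]
    return []


# "customer_comment" is a hypothetical column name used only for illustration.
good_rows = [{"customer_comment": "Delivery was late but support helped."}]
bad_rows = [{"customer_comment": 42}]

print(check_free_text_column(good_rows, "customer_comment"))
print(check_free_text_column(bad_rows, "customer_comment"))
```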

Warning note: As long as the requirements above are met, the prediction will run successfully. In other words, the prediction will run successfully even if the corresponding column in the apply dataset does not actually contain free text. A prediction generated in this situation is not considered reliable. Always ensure that the equivalent column in your apply dataset, which corresponds to a free text feature in your training data, contains free text.

Considerations

Including free text features in your experiment increases the complexity of the experiment and the processes required to run it. It is possible that Permutation importance charts will be unavailable for the resultant models if your free text data is complex enough.

Troubleshooting

Using free text data to train a model can be a resource-intensive process. You might encounter an error when you include free text columns containing large numbers of unique words as features.

Here are some guidelines for resolving these errors:

  • Reduce the data subset in your training dataset to include fewer rows of free text.

  • Remove free text features you do not need to include in model training.

  • Treat one or more free text columns as categorical features rather than free text features. Note that this is not recommended if these features have high cardinality.
