Automatic feature engineering
With automatic feature engineering, Qlik AutoML can use existing features in your training data to create new features. These new auto-engineered features allow you to discover new patterns in your data, and can greatly improve the performance of your machine learning models.
Feature engineering is the process of creating new feature columns from current ones. AutoML can perform feature engineering automatically for improved handling of certain types of data. For general information about feature engineering, see Creating new feature columns.
Auto-engineered date features, and the parent features from which they are derived, are marked with an icon.
After you select a dataset for use in your experiment, the dataset is analyzed and the columns within it are identified as containing certain data types. These data types allow AutoML to assign a feature type to each column in the dataset. Each column is given one of the following feature types:
- Categorical
- Numeric
- Date
- Free text
When possible, AutoML displays a list of auto-engineered features that can be created from eligible parent features. This list of auto-engineered features is further refined and reduced as preprocessing begins. Including auto-engineered features in your experiment is recommended but optional. You can remove individual auto-engineered features before you start training, and when configuring each new experiment version.
For more information about the processes completed before experiment training begins, see Automatic data preparation and transformation.
Date feature engineering
AutoML generates auto-engineered features from eligible columns with the date feature type, which have been identified as containing date and time information. Auto-engineered date features, and the parent features from which they are derived, are marked with an icon.
When Qlik Cloud Analytics profiles the training dataset you have selected for use in AutoML, it links certain data types to the date feature type. This includes the following data types:
- Date
- Datetime
- Time
- Timestamp
Features that are assigned any of these data types during profiling are given the date feature type. For information about the available profile statistics that can be viewed for your data fields, see Profile List view.
When possible, AutoML displays a list of auto-engineered date features that can be created from eligible parent features that have the date feature type. Auto-engineered date features are included in the experiment by default. If you choose to include them, the new features are generated after v1 of the experiment.
Auto-engineered date features have the numeric feature type. They are included in the experiment by default, but are optional. You can remove some, or all, of them before starting experiment training, or when configuring the next experiment version. When auto-engineered date features are included, the original parent date feature is removed from the experiment.
You can instead include the parent date feature as a categorical or numeric feature. When you do this, the auto-engineered date features are no longer usable. In most cases, it is recommended to use available auto-engineered features in your experiment, because they bring improved performance to your machine learning models. However, there may be scenarios where a column is identified as a date feature but you need it to be treated as categorical or numeric. In these cases, you can manually change the feature type.
Auto-engineered date features do not count towards the AutoML dataset size (maximum cell counts in training datasets and apply datasets) that has been specified in your Qlik Cloud subscription. Only the original date column cells are counted.
Using date features as the experiment target
In the rare case in which you want to use a feature with date and time information as the target of your experiment, the feature type of the column is switched from date to categorical, and the auto-engineered features are removed. If you select another target and later want to add the date and time feature as a regular feature, you need to manually change it back to the date feature type. When you return the feature to the date feature type, the auto-engineered date features are generated again.
For more information about how to change feature types, see Changing feature types.
Available auto-engineered date features
When generating auto-engineered date features from a column in your dataset, AutoML extracts and calculates specific components of each date and date-time value, isolating each component in its own column. The table below lists the auto-engineered features that can be generated by AutoML.
Auto-engineered feature | Data type | Feature type | Description |
---|---|---|---|
YEAR | Integer | Numeric | Year field parsed directly from the source date or timestamp. |
MONTH | Integer | Numeric | Month field parsed directly from the source date or timestamp. |
DAY | Integer | Numeric | Day field parsed directly from the source date or timestamp. |
HOUR | Integer | Numeric | Hour field parsed directly from the source timestamp. |
MINUTE | Integer | Numeric | Minute field parsed directly from the source timestamp. |
SECOND | Integer | Numeric | Second field parsed directly from the source timestamp. |
DAYOFWEEK | Integer | Numeric | Day of the week, calculated from the source day, month and year. |
WEEK | Integer | Numeric | Week of the year, calculated from the source day, month and year. |
For each new feature created, the original column name is suffixed by the applicable auto-engineered feature.
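The extraction described in the table can be illustrated with a minimal Python sketch. This is not the AutoML implementation. The function name `engineer_date_features`, the underscore used to join the column name and the suffix, the ISO numbering for DAYOFWEEK (Monday = 1), and the ISO 8601 week for WEEK are all assumptions made for illustration.

```python
from datetime import datetime

def engineer_date_features(column_name, value):
    """Split one datetime value into the components listed in the table."""
    _, iso_week, iso_weekday = value.isocalendar()
    parts = {
        "YEAR": value.year,
        "MONTH": value.month,
        "DAY": value.day,
        "HOUR": value.hour,
        "MINUTE": value.minute,
        "SECOND": value.second,
        "DAYOFWEEK": iso_weekday,  # assumption: ISO numbering, Monday = 1
        "WEEK": iso_week,          # assumption: ISO 8601 week of the year
    }
    # Assumption: an underscore joins the parent column name and the suffix.
    return {f"{column_name}_{suffix}": number for suffix, number in parts.items()}

features = engineer_date_features("OrderDate", datetime(2024, 3, 15, 14, 30, 5))
```

Here a single hypothetical `OrderDate` value becomes eight integer columns, such as `OrderDate_YEAR` and `OrderDate_WEEK`.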
Auto-engineered date features in predictions
Auto-engineered date features are generated from the training dataset when a model is trained. That model can then be deployed as an ML deployment to make predictions on new data (the apply dataset).
When a model trained with auto-engineered date features is deployed for making predictions, the apply dataset on which you are generating predictions does not need to include the auto-engineered date features. AutoML generates the auto-engineered features for the apply dataset before predicting. However, the apply dataset must include the parent date feature, and the column must have been profiled as having the Date, Datetime, Timestamp, or Time data type.
The prediction datasets created by an ML deployment, including SHAP and apply datasets, will include the auto-engineered date features.
Auto-engineered date features in real-time predictions
For the real-time predictions API to be able to process your date and timestamp fields, the JSON payload you send must meet the following requirements:
- Date and datetime values must be strings formatted in accordance with ISO 8601 standards
- Data within each column must be in the same time zone
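One way to satisfy both requirements is to convert every value to UTC before serializing it. The following sketch uses only the Python standard library. The helper `to_iso_utc`, the column name `OrderDate`, and the row layout are hypothetical, and the sketch does not reproduce the actual request schema of the real-time predictions API.

```python
import json
from datetime import datetime, timedelta, timezone

def to_iso_utc(value):
    """Return an ISO 8601 string in UTC for a timezone-aware datetime."""
    if value.tzinfo is None:
        raise ValueError("naive datetime: attach a time zone before converting")
    return value.astimezone(timezone.utc).isoformat()

# Two timestamps recorded with different UTC offsets.
raw_values = [
    datetime(2024, 3, 15, 14, 30, 0, tzinfo=timezone(timedelta(hours=1))),
    datetime(2024, 3, 15, 9, 0, 0, tzinfo=timezone(timedelta(hours=-5))),
]

# Hypothetical row layout: one dict per record, keyed by column name.
rows = [{"OrderDate": to_iso_utc(value)} for value in raw_values]
payload = json.dumps(rows)
```

After conversion, both values in the column carry the same `+00:00` offset.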
Handling of free text data
Free text (for example, textual string data entered into forms) requires special processing by machine learning algorithms to be useful in a model. In Qlik AutoML, processing of free text is a form of automatic feature engineering. Technically speaking, this processing uses the TF-IDF (term frequency - inverse document frequency) method.
AutoML supports separate processing for features with free text data in English.
If a column in your training data contains free text, it is assigned the free text feature type. It can also be used as a categorical feature, although this is strongly discouraged if it has high cardinality (too many unique values).
You can select a maximum of three columns to be used as free text features in an experiment.
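To give a sense of what TF-IDF computes, here is a minimal pure-Python sketch of the textbook formula. The tokenization (lowercasing and splitting on whitespace) and the unsmoothed logarithm are assumptions. The exact variant and preprocessing used by AutoML are not documented here and may differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the parcel arrived late",
    "the parcel was damaged",
    "the refund was fast",
]
weights = tf_idf(docs)
```

A term that appears in every document (here, "the") gets a weight of zero, while a term unique to one document (such as "late") gets the highest weight in that document.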
Requirements for free text encoding
For a column containing free text to be successfully encoded as free text, it must fulfill two requirements. These requirements are checked at different stages of experiment creation.
The requirements are:
- The column must have an average character length of 50 or more characters.
- The column must have an average length of five or more words.
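The two thresholds above can be checked on a sample of your own data before you upload it. This is a rough sketch only: the function name is hypothetical, and counting words by splitting on whitespace and skipping non-string values are assumptions about how the averages are calculated.

```python
def meets_free_text_requirements(values):
    """Check the two average-length thresholds on a list of column values."""
    # Assumption: null or non-string values are ignored when averaging.
    texts = [value for value in values if isinstance(value, str)]
    if not texts:
        return False
    avg_chars = sum(len(text) for text in texts) / len(texts)
    # Assumption: words are separated by whitespace.
    avg_words = sum(len(text.split()) for text in texts) / len(texts)
    return avg_chars >= 50 and avg_words >= 5

comments = [
    "The package arrived two days late and the box was badly damaged.",
    "Customer support resolved my billing issue quickly and politely today.",
]
colors = ["red", "blue", "green"]
```

With these samples, `meets_free_text_requirements(comments)` returns `True` and `meets_free_text_requirements(colors)` returns `False`.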
Treating a feature as free text
The process of treating a feature as free text is as follows:
- When you select your training data, Qlik AutoML identifies features that can possibly be processed as free text. They are marked with the Possible free text insight in schema view, and will have the free text feature type.
- After you run v1 of the experiment, additional analysis is completed. At this point, features initially marked as possible free text might be found to be unusable as free text features.
If the features which are unusable as free text have high cardinality, it is recommended that you deselect them from the experiment. These features, when treated as categorical, contribute no value to model performance.
If the features which are unusable as free text do not have high cardinality, you can include them in your experiment by clicking Treat as categorical, or by switching their Feature type from free text to categorical. If you leave the feature type as free text, it will also internally be treated as categorical, and will be impact encoded.
For full details about preprocessing, see Automatic data preparation and transformation.
For more information about each of the insights shown in schema view, see Viewing insights about the training data.
Using a free text feature as the experiment target
In rare cases, a free text feature can be selected as the target. If the feature meets all requirements for free text encoding, and contains between two and ten unique values, it can be used as the target. In these scenarios, the experiment is defined as a standard binary classification or multiclass classification problem.
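The unique-value rule can be expressed as a small sketch. The function name and return strings are hypothetical, null values are assumed not to count as a class, and the sketch does not check the separate free text encoding requirements.

```python
def target_problem_type(values):
    """Map the number of unique target values to a problem type, or None."""
    # Assumption: None is not counted as a distinct class.
    unique_values = {value for value in values if value is not None}
    if len(unique_values) == 2:
        return "binary classification"
    if 3 <= len(unique_values) <= 10:
        return "multiclass classification"
    return None  # outside the two-to-ten range
```

For example, a column holding only two distinct response texts maps to binary classification, while a column with eleven or more distinct texts falls outside the range.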
Free text features in predictions
When you deploy a model trained with a free text feature, the resulting ML deployment can generate predictions as long as the following requirements are met for the apply dataset:
- The column names of the feature match between the training dataset and the apply dataset
- The column in the apply dataset, which corresponds to the free text feature in the training data, contains string data
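Both requirements can be verified before you send an apply dataset for prediction. The sketch below assumes the apply dataset is available as a list of dicts, one per row. The function name is hypothetical, and treating null values as problems is an assumption, since null handling is not described here.

```python
def check_apply_dataset(apply_rows, free_text_columns):
    """Return a list of problems found for the given free text columns."""
    problems = []
    for column in free_text_columns:
        for index, row in enumerate(apply_rows):
            if column not in row:
                problems.append(f"row {index}: missing column '{column}'")
            elif not isinstance(row[column], str):
                # Assumption: any non-string value, including None, is flagged.
                problems.append(f"row {index}: '{column}' is not a string")
    return problems

apply_rows = [
    {"Comment": "Delivery was quick and the item matched the description."},
    {"Comment": 42},
    {"Feedback": "Wrong column name for this row."},
]
problems = check_apply_dataset(apply_rows, ["Comment"])
```

An empty list means that every row has the expected column name and holds string data.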
Considerations
Including free text features in your experiment increases the complexity of the experiment and the processes required to run it. It is possible that Permutation importance charts will be unavailable for the resultant models if your free text data is complex enough.
Troubleshooting
Using free text data to train a model can be a resource-intensive process. You might encounter an error when you include free text columns containing large numbers of unique words as features.
Here are some guidelines for resolving these errors:
- Reduce the data subset in your training dataset to include fewer rows of free text.
- Remove free text features you do not need to include in model training.
- Treat one or more free text columns as categorical, rather than free text, features. Note that this is not recommended if these free text features have high cardinality.