To learn a classification model from a text, you must first divide this text into
tokens with tNLPPreprocessing and then convert it to the CoNLL format using
tNormalize. A minimal standalone sketch of this processing is shown after the
procedure below.
Procedure
-
Double-click the tNLPPreprocessing component to open its
Basic settings view and define its properties.
-
Click Sync columns to retrieve the
schema from the previous component connected in the Job.
-
From the NLP Library list, select the library to
be used for tokenization. In this example,
ScalaNLP is used.
-
From the Column to preprocess list, select the column
that holds the text to be divided into tokens, which is
message in this example.
-
Double-click the tFilterColumns component to open its
Basic settings view and define its properties.
-
Click Edit schema to add the
tokens column to the output schema, as this is
the column to be normalized, and click OK to
validate.
-
Double-click the tNormalize component to open its Basic settings
view and define its properties.
-
Click Sync columns to retrieve the
schema from the previous component connected in the Job.
-
From the Column to normalize list, select
tokens.
-
In the Item separator field, enter
"\t" so that the tokens column is split on tabs,
producing one token per row.
-
Double-click the tFileOutputDelimited component to open
its Basic settings view and define its properties.
-
Click Sync columns to retrieve the
schema from the previous component connected in the Job.
-
In the Folder field, specify the path to the
folder where the CoNLL files will be stored.
-
In the Row Separator field, enter
"\n".
-
In the Field Separator field, enter
"\t" to separate fields with a tab.
-
Press F6 to save and execute the
Job.
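The same processing can be pictured outside of Talend as a small Java sketch. This is an illustration only: it uses naive whitespace splitting instead of the ScalaNLP tokenizer selected in tNLPPreprocessing, and the sample message, folder, and file name are hypothetical placeholders.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ConllSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical input; in the Job this is the "message" column.
        String message = "We met Peter in Paris yesterday";

        // Simplified tokenization: split on whitespace. The Job delegates
        // this step to the tokenizer chosen in tNLPPreprocessing.
        String[] tokens = message.split("\\s+");

        // tNormalize turns the delimited tokens value into one row per token;
        // writing each token on its own line reproduces that layout, which is
        // what tFileOutputDelimited stores with "\n" as the row separator.
        Path out = Path.of("/tmp/conll/sample.conll");  // placeholder path
        Files.createDirectories(out.getParent());
        Files.write(out, List.of(tokens));
    }
}

Running this sketch produces a file with a single column and one token per row, which is the layout described under Results below.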
Results
The output files are created in the specified folder. The files contain a single
column with one token per row.
You can then manually label person names with PER and all
other tokens with O before learning a classification
model from this text data.
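For example, after manual labeling, a fragment of such a file could look like the following (the sentence is an invented sample, with a tab between each token and its label):

We	O
met	O
Peter	PER
in	O
Paris	O
yesterday	O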