Rule types
- Basic rule types: Enumeration, Format and Combination. Rules of these types are composed with a given set of ANTLR symbols.
- Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized data and standardize it when needed.
Advanced rules are always executed after the ANTLR-specific rules, regardless of rule order. For further information about basic and advanced rules, see Different rule types for different parsing levels and Using two parsing levels to extract information from unstructured data.
The pre-defined elements (ANTLR tokens) available to rules are:
- INT: integer;
- WORD: word;
- WORD+: literals of several words;
- CAPWORD: capitalized word;
- DECIMAL: decimal float;
- FRACTION: fraction float;
- CURRENCY: currencies;
- ROMAN_NUMERAL: Roman numerals;
- ALPHANUM: combination of alphabetic and numeric characters;
- WHITESPACE: whitespace;
- UNDEFINED: unexpected strings, such as ASCII codes, that no other token can recognize.
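To make the element list more concrete, here is a minimal Python sketch that classifies a single token into one of the element names above. The regular expressions are assumptions chosen for illustration; the component's actual ANTLR token definitions are not given here and may differ. CURRENCY and WORD+ are left out because the list above does not say precisely what they accept.

```python
import re

# Hypothetical approximations of the pre-defined elements.
# Order matters: the first pattern that matches the whole token wins.
TOKEN_PATTERNS = [
    ("DECIMAL", r"\d+\.\d+"),
    ("FRACTION", r"\d+/\d+"),
    ("INT", r"\d+"),
    ("ROMAN_NUMERAL", r"[IVXLCDM]+"),
    ("CAPWORD", r"[A-Z][a-z]+"),
    ("WORD", r"[a-z]+"),
    ("ALPHANUM", r"[A-Za-z0-9]+"),
    ("WHITESPACE", r"\s+"),
]


def classify(token: str) -> str:
    """Return the first element name whose pattern matches the whole token."""
    for name, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    # Nothing matched: this plays the role of the UNDEFINED element.
    return "UNDEFINED"


if __name__ == "__main__":
    for sample in ["12", "1.4", "3/4", "main", "Main", "XIV", "A1", "#"]:
        print(repr(sample), classify(sample))
```

Note that a real tokenizer has to resolve ambiguities that this sketch settles by pattern order alone, for example a single capital letter such as I, which could be a Roman numeral or a word.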
The following three tables present, in order, the basic rule types, the advanced rule types and the ANTLR symbols used by the basic rule types. Use them to complete the Conversion rules table in the Basic settings of this component.
For basic rule types:
Basic Rule Type | Usage | Example | Conditions of rule composition |
---|---|---|---|
Enumeration | A rule of this type provides a list of possible matches. | RuleName: LengthUnit RuleValue: " 'inch' \| 'cm' " | Each option must be enclosed in a pair of single quotation marks unless the option is a pre-defined element. Options must be separated by the \| symbol. |
Format (Rule name starts with upper case) | A rule of this type uses the pre-defined elements along with any user-defined Enumeration, Format or Combination rules to define the composition of a string. | RuleName: Length RuleValue: "DECIMAL WHITESPACE LengthUnit" This rule requires a whitespace between the decimal and the length unit, so it matches strings like 1.4 cm but not strings like 1.4cm. To match both cases, define the rule as, for example, "DECIMAL WHITESPACE* LengthUnit". LengthUnit is an Enumeration rule defining " 'inch' \| 'cm' ". | When the name of a Format rule starts with upper case, the rule requires an exact match: you must define every single element of the string, including whitespace. |
Format (Rule name starts with lower case) | A rule of this type is almost the same as a Format rule whose name starts with upper case. The difference is that a Format rule with a lower-case initial does not require an exact match. | RuleName: length RuleValue: "DECIMAL LengthUnit" This rule matches strings like 1.4 cm or 1.4cm, where DECIMAL is one of the pre-defined elements and LengthUnit is an Enumeration rule defining " 'inch' \| 'cm' ". | N/A |
Combination | A rule of this type is used when you need to create several rules of the same name. | RuleName: Size (or size) RuleValue: "length BY length" This rule matches strings like 1.4 cm by 1.4 cm, where length is a Format rule (starting with lower case) and BY is an Enumeration rule defining " 'By' \| 'by' \| 'x' \| 'X' ". | Literal texts or characters are not accepted as part of the rule value. When literal texts or characters are needed, create an Enumeration rule that defines them and use that Enumeration rule instead. When several Combination rules share the same rule name, they are executed in top-down order as listed in the Conversion rules table of the Basic settings of tStandardizeRow, so arrange them carefully to obtain the best result. For an example, see the following scenario. |
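The difference between the two Length definitions, and the way a Combination rule builds on other rules, can be sketched with ordinary regular expressions. This is only an analogy written in Python, not how the component evaluates rules: the patterns chosen for DECIMAL and WHITESPACE are assumptions, and modelling the lower-case length rule as "whitespace optional" is a simplification of "does not require an exact match".

```python
import re

# Assumed shapes of two pre-defined elements (illustrative only).
DECIMAL = r"\d+\.\d+"
WHITESPACE = r"\s"

# Enumeration rule LengthUnit: " 'inch' | 'cm' "
LENGTH_UNIT = r"(?:inch|cm)"

# Format rule Length: "DECIMAL WHITESPACE LengthUnit" (one whitespace required).
LENGTH_STRICT = re.compile(DECIMAL + WHITESPACE + LENGTH_UNIT)

# Format rule Length: "DECIMAL WHITESPACE* LengthUnit" (whitespace optional).
LENGTH_RELAXED = re.compile(DECIMAL + WHITESPACE + "*" + LENGTH_UNIT)

# Enumeration rule BY: " 'By' | 'by' | 'x' | 'X' "
BY = r"(?:By|by|x|X)"

# Combination rule Size: "length BY length". The literal separator comes from
# the BY Enumeration rule, never from a literal written in the rule value.
SIZE = re.compile(
    LENGTH_RELAXED.pattern + r"\s*" + BY + r"\s*" + LENGTH_RELAXED.pattern
)


def matches(rule: re.Pattern, text: str) -> bool:
    """True when the whole text is matched by the rule."""
    return rule.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["1.4 cm", "1.4cm"]:
        print(text, matches(LENGTH_STRICT, text), matches(LENGTH_RELAXED, text))
    print(matches(SIZE, "1.4 cm by 1.4 cm"))
```

The strict pattern rejects 1.4cm because it demands exactly one whitespace character, which mirrors the exact-match requirement of an upper-case Format rule.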
For advanced rule types:
Advanced Rule Type | Usage | Example | Conditions |
---|---|---|---|
Regex | A rule of this type uses regular expressions to match the incoming data tokenized by ANTLR. | RuleName: ZipCode RuleValue: "\\d{5}" This rule matches strings like "92150". | Regular expressions must be Java compliant. |
Index | A rule of this type uses a synonym index as a reference to search for matching incoming data. For further information about available synonym indexes, see the appendix about data synonym dictionaries in the Talend Studio User Guide. | A scenario is available in Standardizing addresses from unstructured data. | On Windows, backslashes \ must be doubled or replaced by slashes / if the path is copied from the file system. If you run the Job in Spark Local mode or locally, the path to the index folder must start with file:///. If the index is stored in HDFS, the path to the index folder must start with hdfs://. When processing a record, a given Index rule matches only the first string identified as matchable. In a Talend Map/Reduce Job, each synonym index to be used must be compressed as a zip file. |
Shape | A rule of this type uses pre-defined elements along with established Regex rules, Index rules or both to match the incoming data. | RuleName: Address RuleValue: "<INT><WORD><StreetType>" This rule matches addresses like 12 main street, where INT and WORD are pre-defined tokens (rule elements) and StreetType is an Index rule that you define along with this example rule in the Basic settings view of this component. For further information about the Shape rule type, see Standardizing addresses from unstructured data. | Only contents enclosed in < > are recognized. Any other content is treated as an error or is omitted. |
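As a rough illustration of how a Shape rule combines pre-defined elements with Regex and Index rules, the Python sketch below checks a whitespace-split string against a rule value such as "<INT><WORD><StreetType>". Everything here is assumed for the example: the component tokenizes with ANTLR rather than by splitting on whitespace, the STREET_TYPE set is a hypothetical stand-in for a synonym index, and Python's re module stands in for Java regular expressions (the pattern \d{5} behaves the same in both; the doubled backslash in the table is Java string escaping).

```python
import re

# Regex rule ZipCode. A raw Python string needs no doubled backslash.
ZIP_CODE = re.compile(r"\d{5}")

# Hypothetical stand-in for an Index rule backed by a synonym index.
STREET_TYPE = {"street", "st", "avenue", "ave", "road", "rd"}


def parse_shape(rule_value: str) -> list:
    """Keep only the names written inside < >; anything else is dropped."""
    return re.findall(r"<([^<>]+)>", rule_value)


def element_matches(element: str, token: str) -> bool:
    """Check one token against one element or rule name."""
    if element == "INT":
        return token.isdigit()
    if element == "WORD":
        return token.isalpha()
    if element == "ZipCode":
        return ZIP_CODE.fullmatch(token) is not None
    if element == "StreetType":
        return token.lower() in STREET_TYPE
    return False  # unknown element names never match in this sketch


def shape_matches(rule_value: str, text: str) -> bool:
    """True when every token matches the element at the same position."""
    elements = parse_shape(rule_value)
    tokens = text.split()
    return len(elements) == len(tokens) and all(
        element_matches(e, t) for e, t in zip(elements, tokens)
    )


if __name__ == "__main__":
    print(shape_matches("<INT><WORD><StreetType>", "12 main street"))
    print(parse_shape("<INT> junk <WORD>"))
```

The parse_shape helper mirrors the condition in the table: text outside the angle brackets does not contribute to the rule.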
For the given ANTLR symbols:
Symbols | Description |
---|---|
\| | alternative |
's' | char or string literal |
+ | 1 or more |
* | 0 or more |
? | optional or semantic predicate |
~ | match not |
For more information about ANTLR symbols, see: https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.
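For readers more familiar with regular expressions than with ANTLR, the sketch below pairs each symbol with a rough regex analogue in Python. The pairings are approximations for intuition only: ANTLR operates on tokens and grammar rules rather than raw characters, DIGIT is a hypothetical token name used for the example, and the semantic-predicate meaning of ? has no regex counterpart and is not shown.

```python
import re

# ANTLR-style fragment -> rough regular-expression analogue (illustrative).
ANALOGUES = {
    "'inch' | 'cm'": r"inch|cm",  # |   alternative
    "'cm'": r"cm",                # 's' char or string literal
    "DIGIT+": r"\d+",             # +   1 or more (DIGIT is hypothetical)
    "WHITESPACE*": r"\s*",        # *   0 or more
    "'-'?": r"-?",                # ?   optional
    "~'a'": r"[^a]",              # ~   match not
}


def full(pattern: str, text: str) -> bool:
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for fragment, pattern in ANALOGUES.items():
        print(fragment, "~", pattern)
```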