Word-based patterns
Talend Data Preparation allows you to analyze the distribution of word-based patterns in your data.
The word-based pattern indicators are case sensitive. The following table describes what each pattern shown in the profiling area corresponds to:
| Pattern | Description |
|---|---|
| [Word] | Word starting with an uppercase character followed by lowercase characters |
| [WORD] | Word consisting only of uppercase characters |
| [word] | Word consisting only of lowercase characters |
| [Char] | Single uppercase character |
| [char] | Single lowercase character |
| [Ideogram] | One of the CJK Unified Ideographs |
| [IdeogramSeq] | Sequence of ideograms |
| [hiraSeq] | Sequence of Japanese Hiragana characters |
| [kataSeq] | Sequence of Japanese Katakana characters |
| [hangulSeq] | Sequence of Korean Hangul characters |
| [digit] | One of the Arabic numerals: 0,1,2,3,4,5,6,7,8,9 |
| [number] | Sequence of digits |
The following examples illustrate how certain records would be interpreted in Talend Data Preparation.
| String | Pattern |
|---|---|
| A character is NOT a Word | [Char] [word] [word] [WORD] [char] [Word] |
| someWordsINwORDS | [word][Word][WORD][char][WORD] |
| Example123@domain.com | [Word][number]@[word].[word] |
| anotherExample8@domain.com | [word][Word][digit]@[word].[word] |
| 袁 花木蘭88 | [Ideogram] [IdeogramSeq][number] |
| Latin2中文 | [Word][digit][IdeogramSeq] |
| Latin3フランス | [Word][digit][kataSeq] |
| Latin4とうきょう | [Word][digit][hiraSeq] |
| Latin5나는 한국 사람입니다 | [Word][digit][hangulSeq] [hangulSeq] [hangulSeq] |
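
The mapping in the tables above can be approximated with a short regular-expression tokenizer. The following Python code is a minimal sketch for illustration only, not the implementation used by Talend Data Preparation. The function name `word_pattern` is hypothetical, and the Unicode ranges are assumptions: U+4E00 to U+9FFF for CJK Unified Ideographs (basic block only), U+3040 to U+309F for Hiragana, U+30A0 to U+30FF for Katakana, and U+AC00 to U+D7A3 for Hangul syllables. It also assumes that a single Hiragana, Katakana, or Hangul character counts as a sequence, and that any other character, such as a space, `@`, or `.`, is copied to the pattern unchanged.

```python
import re

# Alternatives are tried left to right, so multi-character patterns
# are listed before their single-character counterparts.
_TOKEN = re.compile(
    r"(?P<Word>[A-Z][a-z]+)"                   # uppercase then lowercase letters
    r"|(?P<WORD>[A-Z]{2,})"                    # two or more uppercase letters
    r"|(?P<word>[a-z]{2,})"                    # two or more lowercase letters
    r"|(?P<Char>[A-Z])"                        # single uppercase letter
    r"|(?P<char>[a-z])"                        # single lowercase letter
    r"|(?P<number>[0-9]{2,})"                  # two or more digits
    r"|(?P<digit>[0-9])"                       # single digit
    r"|(?P<IdeogramSeq>[\u4e00-\u9fff]{2,})"   # assumed CJK range
    r"|(?P<Ideogram>[\u4e00-\u9fff])"
    r"|(?P<hiraSeq>[\u3040-\u309f]+)"          # assumed Hiragana range
    r"|(?P<kataSeq>[\u30a0-\u30ff]+)"          # assumed Katakana range
    r"|(?P<hangulSeq>[\uac00-\ud7a3]+)"        # assumed Hangul syllable range
)


def word_pattern(value: str) -> str:
    """Replace each recognized token with its bracketed pattern name.

    Characters that match no alternative are left as they are.
    """
    return _TOKEN.sub(lambda m: f"[{m.lastgroup}]", value)


if __name__ == "__main__":
    print(word_pattern("A character is NOT a Word"))
    # [Char] [word] [word] [WORD] [char] [Word]
    print(word_pattern("someWordsINwORDS"))
    # [word][Word][WORD][char][WORD]
    print(word_pattern("Example123@domain.com"))
    # [Word][number]@[word].[word]
```

Because the matching is case sensitive, `Example` and `example` yield different patterns (`[Word]` and `[word]`), which is the behavior described for the indicators above.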