public class DefaultICUTokenizerConfig extends ICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
Modifier and Type | Field and Description |
---|---|
static java.lang.String | WORD_EMOJI: Token type for words that appear to be emoji sequences |
static java.lang.String | WORD_HANGUL: Token type for words containing Korean hangul |
static java.lang.String | WORD_HIRAGANA: Token type for words containing Japanese hiragana |
static java.lang.String | WORD_IDEO: Token type for words containing ideographic characters |
static java.lang.String | WORD_KATAKANA: Token type for words containing Japanese katakana |
static java.lang.String | WORD_LETTER: Token type for words that contain letters |
static java.lang.String | WORD_NUMBER: Token type for words that appear to be numbers |
Fields inherited from class ICUTokenizerConfig: EMOJI_SEQUENCE_STATUS
Constructor and Description |
---|
DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords): Creates a new config. |
Modifier and Type | Method and Description |
---|---|
boolean | combineCJ(): true if Han, Hiragana, and Katakana scripts should all be returned as Japanese |
com.ibm.icu.text.RuleBasedBreakIterator | getBreakIterator(int script): Return a BreakIterator capable of processing a given script. |
java.lang.String | getType(int script, int ruleStatus): Return a token type value for a given script and BreakIterator rule status. |
public static final java.lang.String WORD_IDEO
public static final java.lang.String WORD_HIRAGANA
public static final java.lang.String WORD_KATAKANA
public static final java.lang.String WORD_HANGUL
public static final java.lang.String WORD_LETTER
public static final java.lang.String WORD_NUMBER
public static final java.lang.String WORD_EMOJI
public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
Parameters:
cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
public boolean combineCJ()
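Description copied from class: ICUTokenizerConfig
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
Specified by: combineCJ in class ICUTokenizerConfig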
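The effect of combineCJ() can be illustrated with a small, self-contained sketch. It is not the library's code: it uses the JDK's Character.UnicodeScript instead of ICU script codes, and the helper name scriptLabel and the label "JAPANESE" are hypothetical, chosen only to show Han, Hiragana, and Katakana collapsing into a single script when combining is enabled.

```java
import java.lang.Character.UnicodeScript;

class CombineCJSketch {
    /**
     * Hypothetical helper: returns the script label a tokenizer might
     * assign to a code point, optionally merging the CJ scripts.
     */
    static String scriptLabel(int codePoint, boolean combineCJ) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        boolean isCJ = script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA;
        // With combining on, the three scripts share one label, so a run
        // mixing them is not split at script boundaries.
        if (combineCJ && isCJ) {
            return "JAPANESE"; // assumed label, for illustration only
        }
        return script.name();
    }

    public static void main(String[] args) {
        int han = 0x6F22;      // a Han ideograph
        int hiragana = 0x3072; // a Hiragana letter
        System.out.println(scriptLabel(han, true));
        System.out.println(scriptLabel(hiragana, true));
        System.out.println(scriptLabel(han, false));
    }
}
```

With combining off, each code point keeps its own script name, so a script-based segmenter would see a boundary wherever the script changes.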
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
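Description copied from class: ICUTokenizerConfig
Return a BreakIterator capable of processing a given script.
Specified by: getBreakIterator in class ICUTokenizerConfig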
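One plausible shape for a per-script lookup like getBreakIterator is sketched below. This is an assumption-laden illustration, not the library's implementation: it substitutes the JDK's java.text.BreakIterator and Character.UnicodeScript for the ICU types, the TAILORED map and the words helper are hypothetical, and whether the JDK supplies a tailored Thai iterator depends on the runtime.

```java
import java.lang.Character.UnicodeScript;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class BreakIteratorLookupSketch {
    // Prototype used for any script without a tailored entry
    // (JDK stand-in for a root-locale word iterator).
    private static final BreakIterator DEFAULT =
            BreakIterator.getWordInstance(Locale.ROOT);

    // Hypothetical per-script overrides, with one illustrative entry.
    private static final Map<UnicodeScript, BreakIterator> TAILORED =
            new EnumMap<>(UnicodeScript.class);
    static {
        TAILORED.put(UnicodeScript.THAI,
                BreakIterator.getWordInstance(Locale.forLanguageTag("th")));
    }

    /** Returns a fresh clone so callers never share iterator state. */
    static BreakIterator getBreakIterator(UnicodeScript script) {
        BreakIterator prototype = TAILORED.getOrDefault(script, DEFAULT);
        return (BreakIterator) prototype.clone();
    }

    /** Splits text with the iterator for the given script, keeping word-like pieces. */
    static List<String> words(String text, UnicodeScript script) {
        BreakIterator it = getBreakIterator(script);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Drop segments made only of spaces or punctuation.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42", UnicodeScript.LATIN));
    }
}
```

Cloning a shared prototype keeps the lookup cheap while avoiding shared mutable state, since a break iterator holds its current text and position.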
public java.lang.String getType(int script, int ruleStatus)
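Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by: getType in class ICUTokenizerConfig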
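The relationship between script, rule status, and the WORD_* token types can be sketched as a simple mapping. The following is a hedged illustration rather than the library's logic: the RuleStatus enum stands in for the integer rule status a BreakIterator reports, Character.UnicodeScript stands in for ICU script codes, and the string values assigned to the constants are placeholders, not the real field values.

```java
import java.lang.Character.UnicodeScript;

class TokenTypeSketch {
    // Stand-in for integer rule-status codes (an assumption of this sketch).
    enum RuleStatus { NUMBER, LETTER, KANA, IDEO, EMOJI, NONE }

    // Constants named after the documented fields; the values are placeholders.
    static final String WORD_IDEO = "IDEOGRAPHIC";
    static final String WORD_HIRAGANA = "HIRAGANA";
    static final String WORD_KATAKANA = "KATAKANA";
    static final String WORD_HANGUL = "HANGUL";
    static final String WORD_LETTER = "LETTER";
    static final String WORD_NUMBER = "NUMBER";
    static final String WORD_EMOJI = "EMOJI";

    /** Picks a token type from the rule status, refining by script where needed. */
    static String getType(UnicodeScript script, RuleStatus status) {
        switch (status) {
            case IDEO:
                return WORD_IDEO;
            case KANA:
                // Kana status alone does not say which syllabary was matched.
                return script == UnicodeScript.HIRAGANA ? WORD_HIRAGANA : WORD_KATAKANA;
            case LETTER:
                // Hangul is reported as letters, so the script tells it apart.
                return script == UnicodeScript.HANGUL ? WORD_HANGUL : WORD_LETTER;
            case NUMBER:
                return WORD_NUMBER;
            case EMOJI:
                return WORD_EMOJI;
            default:
                return "OTHER"; // placeholder for anything unclassified
        }
    }

    public static void main(String[] args) {
        System.out.println(getType(UnicodeScript.HIRAGANA, RuleStatus.KANA));
        System.out.println(getType(UnicodeScript.HANGUL, RuleStatus.LETTER));
        System.out.println(getType(UnicodeScript.LATIN, RuleStatus.NUMBER));
    }
}
```

The sketch shows why both arguments are needed: the rule status gives the broad category, and the script disambiguates categories that cover more than one token type.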
Copyright © 2000–2019 The Apache Software Foundation. All rights reserved.