Class RuleBasedBreakIterator
- All Implemented Interfaces:
Cloneable
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description)
- (package private) class
- (package private) class: DictionaryCache stores the boundaries obtained from a run of dictionary characters.
Nested classes/interfaces inherited from class com.ibm.icu.text.BreakIterator
BreakIterator.BreakIteratorServiceShim
-
Field Summary
Fields
- fBreakCache: Cache of previously determined boundary positions.
- private List<LanguageBreakEngine> fBreakEngines: List of all known break engines.
- static final String fDebugEnv: Deprecated. This API is ICU internal only.
- fDictionaryCache
- private int fDictionaryCharCount: Counter for the number of characters encountered with the "dictionary" flag set.
- private boolean fDone: True when iteration has run off the end, and iterator functions should return UBRK_DONE.
- private int[] fLookAheadMatches: Array of look-ahead tentative results.
- private boolean fPhraseBreaking: Flag used to indicate if phrase breaking is required.
- private int fPosition: The iteration state: current position, rule status for the current position, and whether the iterator ran off the end, yielding UBRK_DONE.
- fRData: Deprecated. This API is ICU internal only.
- private int fRuleStatusIndex: Index of the Rule {tag} values for the most recent match.
- private CharacterIterator fText: The character iterator through which this BreakIterator accesses the text.
- private static final List<LanguageBreakEngine> gAllBreakEngines: List of all known break engines, common for all break iterators.
- private static final UnhandledBreakEngine gUnhandledBreakEngine: The "default" break engine. It just skips over ranges of dictionary words, producing no breaks.
- private static final String RBBI_DEBUG_ARG: ICU debug argument name for RBBI.
- private static final int RBBI_END
- private static final int RBBI_RUN
- private static final int RBBI_START
- private static final int START_STATE
- private static final int STOP_STATE
- private static final boolean TRACE: Debugging flag.
Fields inherited from class com.ibm.icu.text.BreakIterator
DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD, WORD_IDEO, WORD_IDEO_LIMIT, WORD_KANA, WORD_KANA_LIMIT, WORD_LETTER, WORD_LETTER_LIMIT, WORD_NONE, WORD_NONE_LIMIT, WORD_NUMBER, WORD_NUMBER_LIMIT
-
Constructor Summary
Constructors
- private RuleBasedBreakIterator(): private constructor
- RuleBasedBreakIterator(String rules): Construct a RuleBasedBreakIterator from a set of rules supplied as a string.
-
Method Summary
Methods
- protected static final void checkOffset(int offset, CharacterIterator text): Throw IllegalArgumentException unless begin <= offset < end.
- private static int CISetIndex32(CharacterIterator ci, int index): Set the index of a CharacterIterator.
- clone(): Clones this iterator.
- static void compileRules(String rules, OutputStream ruleBinary): Compile a set of source break rules into the binary state tables used by the break iterator engine.
- int current(): Returns the current iteration position.
- void dump(PrintStream out): Deprecated. This API is ICU internal only.
- boolean equals(Object that): Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
- int first(): Sets the current iteration position to the beginning of the text.
- int following(int startPos): Sets the iterator to refer to the first boundary position following the specified position.
- static RuleBasedBreakIterator getInstanceFromCompiledRules(InputStream is): Create a break iterator from a precompiled set of break rules.
- static RuleBasedBreakIterator getInstanceFromCompiledRules(ByteBuffer bytes): Deprecated. This API is ICU internal only.
- (package private) static RuleBasedBreakIterator getInstanceFromCompiledRules(ByteBuffer bytes, boolean phraseBreaking): This factory method doesn't have an access modifier; it is only accessible in the same package.
- private LanguageBreakEngine getLanguageBreakEngine(int c)
- int getRuleStatus(): Return the status tag from the break rule that determined the boundary at the current iteration position.
- int getRuleStatusVec(int[] fillInArray): Get the status (tag) values from the break rule(s) that determined the boundary at the current iteration position.
- getText(): Returns a CharacterIterator over the text being analyzed.
- private int handleNext(): The State Machine Engine for moving forward is here.
- private int handleSafePrevious(int fromPosition): Iterate backwards from an arbitrary position in the input text using the Safe Reverse rules.
- int hashCode(): Compute a hash code for this BreakIterator.
- boolean isBoundary(int offset): Returns true if the specified position is a boundary position.
- int last(): Sets the current iteration position to the end of the text.
- int next(): Advances the iterator to the next boundary position.
- int next(int n): Advances the iterator either forward or backward the specified number of steps.
- int preceding(int offset): Sets the iterator to refer to the last boundary position before the specified position.
- int previous(): Moves the iterator backwards, to the boundary preceding the current one.
- void setText(CharacterIterator newText): Set the iterator to analyze a new piece of text.
- toString(): Returns the description (rules) used to create this iterator.
Methods inherited from class com.ibm.icu.text.BreakIterator
getAvailableLocales, getAvailableULocales, getBreakInstance, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, registerInstance, registerInstance, setLocale, setText, setText, unregister
-
Field Details
-
START_STATE
private static final int START_STATE
-
STOP_STATE
private static final int STOP_STATE
-
RBBI_START
private static final int RBBI_START
-
RBBI_RUN
private static final int RBBI_RUN
-
RBBI_END
private static final int RBBI_END
-
fText
The character iterator through which this BreakIterator accesses the text.
-
fRData
Deprecated. This API is ICU internal only.
The rule data for this BreakIterator instance. Not intended for public use. Declared public for testing purposes only.
-
fPosition
private int fPosition
The iteration state: current position, rule status for the current position, and whether the iterator ran off the end, yielding UBRK_DONE. The current position is pinned so that 0 <= fPosition <= text.length, is always set to a boundary, and never has the value UBRK_DONE (-1).
-
fRuleStatusIndex
private int fRuleStatusIndex
Index of the Rule {tag} values for the most recent match.
-
fDone
private boolean fDone
True when iteration has run off the end, and iterator functions should return UBRK_DONE.
-
fLookAheadMatches
private int[] fLookAheadMatches
Array of look-ahead tentative results.
-
fBreakCache
Cache of previously determined boundary positions.
-
fPhraseBreaking
private boolean fPhraseBreaking
Flag used to indicate if phrase breaking is required.
-
fDictionaryCharCount
private int fDictionaryCharCount
Counter for the number of characters encountered with the "dictionary" flag set. Normal RBBI iterators don't use it, although the code for updating it is live. Dictionary-based break iterators (subclasses of this class) access this field directly.
-
fDictionaryCache
-
RBBI_DEBUG_ARG
ICU debug argument name for RBBI.
-
TRACE
private static final boolean TRACE
Debugging flag. Trace operation of state machine when true.
-
gUnhandledBreakEngine
The "default" break engine. It just skips over ranges of dictionary words, producing no breaks. Should only be used if characters need to be handled by a dictionary but we have no dictionary implementation for them. Only one instance; shared by all break iterators.
-
gAllBreakEngines
List of all known break engines, common for all break iterators. Lazily updated as break engines are needed, because instantiation of break engines is expensive. Because gAllBreakEngines can be referenced concurrently from different BreakIterator instances, all access is synchronized.
-
fBreakEngines
List of all known break engines. Similar to gAllBreakEngines, but local to a break iterator, allowing it to be used without synchronization.
-
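The relationship between gAllBreakEngines and fBreakEngines described above can be pictured as a two-level cache. The following is only a sketch of that pattern, not the ICU implementation: the Engine class, the script-based matching rule, and the method engineFor are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class EngineCacheSketch {
    /** Hypothetical stand-in for a language break engine; not the ICU type. */
    static final class Engine {
        final Character.UnicodeScript script;

        Engine(Character.UnicodeScript script) {
            this.script = script;
        }

        boolean handles(int codePoint) {
            return Character.UnicodeScript.of(codePoint) == script;
        }
    }

    // Shared by all instances; every access happens while holding its monitor.
    private static final List<Engine> ALL_ENGINES = new ArrayList<>();

    // Private to one instance, so it can be searched without synchronization.
    private final List<Engine> localEngines = new ArrayList<>();

    Engine engineFor(int codePoint) {
        // Fast path: engines this instance has already used.
        for (Engine e : localEngines) {
            if (e.handles(codePoint)) {
                return e;
            }
        }
        synchronized (ALL_ENGINES) {
            // Slower path: engines created by any instance.
            for (Engine e : ALL_ENGINES) {
                if (e.handles(codePoint)) {
                    localEngines.add(e);
                    return e;
                }
            }
            // Created lazily, on first demand (assumed to be the costly step).
            Engine created = new Engine(Character.UnicodeScript.of(codePoint));
            ALL_ENGINES.add(created);
            localEngines.add(created);
            return created;
        }
    }

    public static void main(String[] args) {
        EngineCacheSketch a = new EngineCacheSketch();
        EngineCacheSketch b = new EngineCacheSketch();
        // Two Thai letters: the second lookup reuses the engine the first created.
        System.out.println(a.engineFor(0x0E01) == b.engineFor(0x0E02));
    }
}
```

The per-instance list trades a little memory for lock-free lookups on the common path; only a miss pays for synchronization.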
fDebugEnv
Deprecated. This API is ICU internal only.
Control debug, trace and dump options.
-
-
Constructor Details
-
RuleBasedBreakIterator
private RuleBasedBreakIterator()
private constructor
-
RuleBasedBreakIterator
Construct a RuleBasedBreakIterator from a set of rules supplied as a string.
- Parameters:
rules - The break rules to be used.
-
-
Method Details
-
getInstanceFromCompiledRules
public static RuleBasedBreakIterator getInstanceFromCompiledRules(InputStream is) throws IOException
Create a break iterator from a precompiled set of break rules. Creating a break iterator from the binary rules is much faster than creating one from source rules. The binary rules are generated by the RuleBasedBreakIterator.compileRules() function. Binary break iterator rules are not guaranteed to be compatible between different versions of ICU.
- Parameters:
is - an input stream supplying the compiled binary rules.
- Throws:
IOException - if there is an error while reading the rules from the InputStream.
-
getInstanceFromCompiledRules
static RuleBasedBreakIterator getInstanceFromCompiledRules(ByteBuffer bytes, boolean phraseBreaking) throws IOException
This factory method doesn't have an access modifier; it is only accessible in the same package. Create a break iterator from a precompiled set of break rules. Creating a break iterator from the binary rules is much faster than creating one from source rules. The binary rules are generated by the RuleBasedBreakIterator.compileRules() function. Binary break iterator rules are not guaranteed to be compatible between different versions of ICU.
- Parameters:
bytes - a buffer supplying the compiled binary rules.
phraseBreaking - a flag indicating if phrase breaking is required.
- Throws:
IOException - if there is an error while reading the rules from the buffer.
-
getInstanceFromCompiledRules
@Deprecated public static RuleBasedBreakIterator getInstanceFromCompiledRules(ByteBuffer bytes) throws IOException
Deprecated. This API is ICU internal only.
Create a break iterator from a precompiled set of break rules. Creating a break iterator from the binary rules is much faster than creating one from source rules. The binary rules are generated by the RuleBasedBreakIterator.compileRules() function. Binary break iterator rules are not guaranteed to be compatible between different versions of ICU.
- Parameters:
bytes - a buffer supplying the compiled binary rules.
- Throws:
IOException - if there is an error while reading the rules from the buffer.
-
clone
Clones this iterator.
- Overrides:
clone in class BreakIterator
- Returns:
A newly-constructed RuleBasedBreakIterator with the same behavior as this one.
-
equals
Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
-
toString
Returns the description (rules) used to create this iterator. (In ICU4C, the same function is RuleBasedBreakIterator::getRules().)
-
hashCode
public int hashCode()
Compute a hash code for this BreakIterator.
-
dump
Deprecated. This API is ICU internal only.
Dump the contents of the state table and character classes for this break iterator. For debugging only.
-
compileRules
Compile a set of source break rules into the binary state tables used by the break iterator engine. Creating a break iterator from precompiled rules is much faster than creating one from source rules. Binary break rules are not guaranteed to be compatible between different versions of ICU.
- Parameters:
rules - The source form of the break rules.
ruleBinary - An output stream to receive the compiled rules.
- Throws:
IOException - If there is an error writing the output.
-
first
public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).
- Specified by:
first in class BreakIterator
- Returns:
The offset of the beginning of the text.
-
last
public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).
- Specified by:
last in class BreakIterator
- Returns:
The text's past-the-end offset.
-
next
public int next(int n)
Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().
- Specified by:
next in class BreakIterator
- Parameters:
n - The number of steps to move. The sign indicates the direction (negative is backwards, and positive is forwards).
- Returns:
The character offset of the boundary position n boundaries away from the current one.
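The equivalence between next(n) and repeated single steps can be checked with a small sketch. It uses the JDK's java.text.BreakIterator as a dependency-free stand-in, on the assumption that the ICU class honors the same contract for these calls; the class NextNDemo and its helpers are names made up for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class NextNDemo {
    /** Moves n boundaries in one call, starting from the first (n >= 0) or last (n < 0) boundary. */
    static int viaNextN(String text, int n) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        if (n < 0) {
            it.last();
        } else {
            it.first();
        }
        return it.next(n);
    }

    /** Same movement, expressed as |n| calls to next() or previous(). */
    static int viaSingleSteps(String text, int n) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int pos = n < 0 ? it.last() : it.first();
        for (int i = 0; i < Math.abs(n) && pos != BreakIterator.DONE; i++) {
            pos = n < 0 ? it.previous() : it.next();
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = "one two three";
        System.out.println(viaNextN(text, 3) + " " + viaSingleSteps(text, 3));
        System.out.println(viaNextN(text, -1) + " " + viaSingleSteps(text, -1));
    }
}
```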
-
next
public int next()
Advances the iterator to the next boundary position.
- Specified by:
next in class BreakIterator
- Returns:
The position of the first boundary after this one.
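A typical use of first() and next() is a loop that visits every boundary until DONE is returned. The sketch below uses the JDK's java.text.BreakIterator so that it runs without ICU4J on the classpath; the assumption is that swapping the import for com.ibm.icu.text.BreakIterator leaves the loop unchanged. ForwardIterationDemo and wordBoundaries are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ForwardIterationDemo {
    /** Collects every word-boundary offset by walking first(), next(), ... until DONE. */
    static List<Integer> wordBoundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordBoundaries("Hello world"));
    }
}
```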
-
previous
public int previous()
Moves the iterator backwards, to the boundary preceding the current one.
- Specified by:
previous in class BreakIterator
- Returns:
The position of the boundary immediately preceding the starting position.
-
following
public int following(int startPos)
Sets the iterator to refer to the first boundary position following the specified position.
- Specified by:
following in class BreakIterator
- Parameters:
startPos - The position from which to begin searching for a break position.
- Returns:
The position of the first break after the specified position.
-
preceding
public int preceding(int offset)
Sets the iterator to refer to the last boundary position before the specified position.
- Overrides:
preceding in class BreakIterator
- Parameters:
offset - The position to begin searching for a break from.
- Returns:
The position of the last boundary before the starting position.
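To make the "following" and "preceding" wording concrete, the sketch below asks for the boundaries on either side of an arbitrary offset. It relies on the JDK's java.text.BreakIterator as a stand-in, assuming the ICU class treats both searches as strict (the offset itself is never returned); FollowingPrecedingDemo and around are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class FollowingPrecedingDemo {
    /** Returns {last boundary before offset, first boundary after offset}. */
    static int[] around(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int before = it.preceding(offset);
        int after = it.following(offset);
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        // Offset 5 falls inside the word "two" (offsets 4 to 7).
        int[] result = around("one two three", 5);
        System.out.println(result[0] + " " + result[1]);
    }
}
```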
-
checkOffset
Throw IllegalArgumentException unless begin <= offset < end.
-
isBoundary
public boolean isBoundary(int offset)
Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".
- Overrides:
isBoundary in class BreakIterator
- Parameters:
offset - the offset to check.
- Returns:
True if "offset" is a boundary position.
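The side effect on the iteration position is easy to overlook, so here is a small sketch that records current() right after the check. It runs against the JDK's java.text.BreakIterator, whose implementation appears to reposition the iterator the same way; treat that as an assumption rather than a documented guarantee of the JDK class. IsBoundaryDemo and its helpers are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class IsBoundaryDemo {
    static boolean isBoundary(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it.isBoundary(offset);
    }

    /** Calls isBoundary(offset) and reports where the iterator was left. */
    static int positionAfterCheck(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        it.isBoundary(offset);
        return it.current();
    }

    public static void main(String[] args) {
        String text = "one two three";
        System.out.println(isBoundary(text, 5) + " -> " + positionAfterCheck(text, 5));
    }
}
```

Code that interleaves isBoundary() with next() or previous() should therefore re-establish its position explicitly if it needs to continue from where it was.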
-
current
public int current()
Returns the current iteration position. Note that DONE is never returned from this function; if iteration has run to the end of a string, current() will return the length of the string while next() will return BreakIterator.DONE.
- Specified by:
current in class BreakIterator
- Returns:
The current iteration position.
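The difference between next() and current() at the end of the text can be seen in a short sketch. It uses the JDK's java.text.BreakIterator, which documents the same end-of-text behavior, as a stand-in for the ICU class; CurrentAtEndDemo and runOffEnd are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class CurrentAtEndDemo {
    /** Returns {value of next() at the last boundary, value of current() afterwards}. */
    static int[] runOffEnd(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        it.last();
        int fromNext = it.next();       // nothing left to advance to
        int fromCurrent = it.current(); // still a valid offset
        return new int[] { fromNext, fromCurrent };
    }

    public static void main(String[] args) {
        int[] result = runOffEnd("Hello world");
        System.out.println((result[0] == BreakIterator.DONE) + " " + result[1]);
    }
}
```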
-
getRuleStatus
public int getRuleStatus()
Return the status tag from the break rule that determined the boundary at the current iteration position. The values appear in the rule source within braces, for example {123}. For rules that do not specify a status, a default value of 0 is returned. If more than one rule applies, the numerically largest of the possible status values is returned.
Of the standard types of ICU break iterators, only the word and line break iterators provide status values. The values are available as constants on RuleBasedBreakIterator (inherited from BreakIterator, such as WORD_NONE, WORD_NUMBER, and WORD_LETTER), and allow distinguishing between words that contain alphabetic letters, "words" that appear to be numbers, punctuation and spaces, words containing ideographic characters, and more. Call getRuleStatus after obtaining a boundary position from next(), previous(), or any other break iterator function that returns a boundary position.
Note that getRuleStatus() returns the value corresponding to the current() index even after next() has returned DONE.
- Overrides:
getRuleStatus in class BreakIterator
- Returns:
the status from the break rule that determined the boundary at the current iteration position.
-
getRuleStatusVec
public int getRuleStatusVec(int[] fillInArray)
Get the status (tag) values from the break rule(s) that determined the boundary at the current iteration position. The values appear in the rule source within braces, for example {123}. The default status value for rules that do not explicitly provide one is zero.
The status values used by the standard ICU break rules are defined as public constants in class RuleBasedBreakIterator.
If the size of the output array is insufficient to hold the data, the output will be truncated to the available length. No exception will be thrown.
- Overrides:
getRuleStatusVec in class BreakIterator
- Parameters:
fillInArray - an array to be filled in with the status values.
- Returns:
The number of rule status values from the rules that determined the boundary at the current iteration position. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned.
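Because the return value reports the total available rather than the number copied, a caller can detect truncation and retry with a larger array. The sketch below imitates that contract with a hypothetical helper, fillStatuses, which stands in for getRuleStatusVec; it is not ICU code, and the status values used are arbitrary.

```java
import java.util.Arrays;

final class StatusVecSketch {
    /**
     * Hypothetical stand-in for getRuleStatusVec: copies as many values as fit
     * and returns the total number available, without throwing on truncation.
     */
    static int fillStatuses(int[] available, int[] fillInArray) {
        int copied = Math.min(available.length, fillInArray.length);
        System.arraycopy(available, 0, fillInArray, 0, copied);
        return available.length;
    }

    /** Caller-side pattern: try a small buffer first, grow it if the result was truncated. */
    static int[] allStatuses(int[] available) {
        int[] buffer = new int[1];
        int total = fillStatuses(available, buffer);
        if (total > buffer.length) {
            buffer = new int[total];
            fillStatuses(available, buffer);
        }
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allStatuses(new int[] { 100, 200, 300 })));
    }
}
```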
-
getText
Returns a CharacterIterator over the text being analyzed.
Caution: The state of the returned CharacterIterator must not be modified in any way while the BreakIterator is still in use. Doing so will lead to undefined behavior of the BreakIterator. Clone the returned CharacterIterator first and work with that.
The returned CharacterIterator is a reference to the actual iterator being used by the BreakIterator. No guarantees are made about the current position of this iterator when it is returned; it may differ from the BreakIterator's current position. If you need to move that position to examine the text, clone this function's return value first.
- Specified by:
getText in class BreakIterator
- Returns:
An iterator over the text being analyzed.
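The clone-first advice can be followed as in the sketch below, which inspects the text through a copy and then carries on iterating. It uses the JDK's java.text.BreakIterator as a stand-in for the ICU class, assuming getText() behaves alike; GetTextCloneDemo and its helpers are hypothetical names.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

final class GetTextCloneDemo {
    /** Reads the first character of the text without touching the iterator's own state. */
    static char peekFirstChar(BreakIterator it) {
        CharacterIterator copy = (CharacterIterator) it.getText().clone();
        return copy.first(); // moves only the clone
    }

    /** Advances once, peeks through a clone, then keeps iterating. */
    static String peekThenContinue(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        it.next();
        char first = peekFirstChar(it);
        return first + ":" + it.current() + ":" + it.next();
    }

    public static void main(String[] args) {
        System.out.println(peekThenContinue("Hello world"));
    }
}
```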
-
setText
Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text. (The old iterator is dropped.)
Caution: The supplied CharacterIterator is used directly by the BreakIterator, and must not be altered in any way by code outside of the BreakIterator. Doing so will lead to undefined behavior of the BreakIterator.
- Specified by:
setText in class BreakIterator
- Parameters:
newText - An iterator over the text to analyze.
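A brief sketch of the reset behavior: the iterator is moved to the end of one text, given a new text, and then asked for its position. The JDK's java.text.BreakIterator, which documents the same reset, is used here as a stand-in for the ICU class; SetTextResetDemo and positionAfterSetText are hypothetical names.

```java
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

final class SetTextResetDemo {
    /** Moves to the end of oldText, switches to newText, and reports the position. */
    static int positionAfterSetText(String oldText, String newText) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(new StringCharacterIterator(oldText));
        it.last();
        // The iterator passed here is handed over; this code does not touch it again.
        it.setText(new StringCharacterIterator(newText));
        return it.current();
    }

    public static void main(String[] args) {
        System.out.println(positionAfterSetText("one two", "three"));
    }
}
```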
-
getLanguageBreakEngine
-
handleNext
private int handleNext()
The State Machine Engine for moving forward is here. This function is the heart of the RBBI run time engine.
Input:
- fPosition: the position in the text to begin from.
Output:
- fPosition: the boundary following the starting position.
- fDictionaryCharCount: the number of dictionary characters encountered. If > 0, the segment will be further subdivided.
- fRuleStatusIndex: info from the state table indicating which rules caused the boundary.
- Returns:
the new iterator position
A note on supplementary characters and the position of the underlying Java CharacterIterator: Normally, a character iterator is positioned at the char most recently returned by next(). Within this function, when a supplementary char is being processed, the char iterator is left sitting on the trail surrogate, in the middle of the code point. This is different from everywhere else, where an iterator always points at the lead surrogate of a supplementary.
-
handleSafePrevious
private int handleSafePrevious(int fromPosition)
Iterate backwards from an arbitrary position in the input text using the Safe Reverse rules. This locates a "Safe Position" from which the forward break rules will operate correctly. A Safe Position is not necessarily a boundary itself. The logic of this function is very similar to handleNext(), above, but simpler because the safe table does not require as many options.
- Parameters:
fromPosition - the position in the input text to begin the iteration.
-
CISetIndex32
Set the index of a CharacterIterator. Pin the index to the valid range of BeginIndex <= index <= EndIndex. If the index points to a trail surrogate of a supplementary character, adjust it to the start (lead surrogate) index.
- Parameters:
ci - A CharacterIterator to set
index - the index to set
- Returns:
the resulting index, possibly pinned or adjusted.
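The pinning and surrogate adjustment described above can be sketched with the standard java.text.CharacterIterator API. This is an illustration written from the description, not a copy of the private ICU method; SetIndex32Sketch, setIndex32, and pinned are hypothetical names.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class SetIndex32Sketch {
    /** Sets ci to index, pinned to [begin, end] and moved off a trail surrogate if needed. */
    static int setIndex32(CharacterIterator ci, int index) {
        if (index <= ci.getBeginIndex()) {
            ci.first();
        } else if (index >= ci.getEndIndex()) {
            ci.setIndex(ci.getEndIndex());
        } else if (Character.isLowSurrogate(ci.setIndex(index))) {
            // Landed on a trail surrogate: step back if a lead surrogate precedes it,
            // otherwise it is unpaired and the original index stands.
            if (!Character.isHighSurrogate(ci.previous())) {
                ci.next();
            }
        }
        return ci.getIndex();
    }

    /** Convenience wrapper so the behavior can be tried on a plain String. */
    static int pinned(String text, int index) {
        return setIndex32(new StringCharacterIterator(text), index);
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', one supplementary character, 'b'
        System.out.println(pinned(text, 2));
    }
}
```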
-