Class WordDelimiterIterator
java.lang.Object
org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator
A BreakIterator-like API for iterating over subwords in text, according to
WordDelimiterGraphFilter rules.
-
Field Summary
Fields (Modifier and Type, Field, Description)
static final int ALPHA
static final int ALPHANUM
private final byte[] charTypeTable
(package private) int current
    Beginning of subword
static final byte[] DEFAULT_WORD_DELIM_TABLE
(package private) static final int DIGIT
static final int DONE
    Indicates the end of iteration
(package private) int end
    End of subword
(package private) int endBounds
    end position of text, excluding trailing delimiters
private boolean hasFinalPossessive
(package private) int length
(package private) static final int LOWER
private boolean skipPossessive
    if true, need to skip over a possessive found in the last call to next()
(package private) final boolean splitOnCaseChange
    If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
(package private) final boolean splitOnNumerics
    If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
(package private) int startBounds
    start position of text, excluding leading delimiters
(package private) final boolean stemEnglishPossessive
    If true, causes trailing "'s" to be removed for each subword.
(package private) static final int SUBWORD_DELIM
(package private) char[] text
(package private) static final int UPPER
-
Constructor Summary
Constructors (Constructor, Description)
WordDelimiterIterator(byte[] charTypeTable, boolean splitOnCaseChange, boolean splitOnNumerics, boolean stemEnglishPossessive)
    Create a new WordDelimiterIterator operating with the supplied rules.
-
Method Summary
Methods (Modifier and Type, Method, Description)
private int charType(int ch)
    Determines the type of the given character
private boolean endsWithPossessive(int pos)
    Determines if the text at the given position indicates an English possessive which should be removed
static byte getType(int ch)
    Computes the type of the given character
(package private) static boolean isAlpha(int type)
    Checks if the given word type includes ALPHA
private boolean isBreak(int lastType, int type)
    Determines whether the transition from lastType to type indicates a break
(package private) static boolean isDigit(int type)
    Checks if the given word type includes DIGIT
(package private) boolean isSingleWord()
    Determines if the current word contains only one subword.
(package private) static boolean isSubwordDelim(int type)
    Checks if the given word type includes SUBWORD_DELIM
(package private) static boolean isUpper(int type)
    Checks if the given word type includes UPPER
(package private) int next()
    Advance to the next subword in the string.
private void setBounds()
    Set the internal word bounds (remove leading and trailing delimiters).
(package private) void setText(char[] text, int length)
    Reset the text to a new value, and reset all state
String toString()
(package private) int type()
    Return the type of the current subword.
-
Field Details
-
LOWER
static final int LOWER
-
UPPER
static final int UPPER
-
DIGIT
static final int DIGIT
-
SUBWORD_DELIM
static final int SUBWORD_DELIM
-
ALPHA
public static final int ALPHA
-
ALPHANUM
public static final int ALPHANUM
-
DONE
public static final int DONE
Indicates the end of iteration
-
DEFAULT_WORD_DELIM_TABLE
public static final byte[] DEFAULT_WORD_DELIM_TABLE
-
text
char[] text
-
length
int length
-
startBounds
int startBounds
Start position of text, excluding leading delimiters
-
endBounds
int endBounds
End position of text, excluding trailing delimiters
-
current
int current
Beginning of subword
-
end
int end
End of subword
-
hasFinalPossessive
private boolean hasFinalPossessive
-
splitOnCaseChange
final boolean splitOnCaseChange
If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true)
-
splitOnNumerics
final boolean splitOnNumerics
If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true)
-
stemEnglishPossessive
final boolean stemEnglishPossessive
If true, causes trailing "'s" to be removed for each subword. (Defaults to true)
"O'Neil's" => "O", "Neil"
-
charTypeTable
private final byte[] charTypeTable
-
skipPossessive
private boolean skipPossessive
If true, need to skip over a possessive found in the last call to next()
-
-
Constructor Details
-
WordDelimiterIterator
WordDelimiterIterator(byte[] charTypeTable, boolean splitOnCaseChange, boolean splitOnNumerics, boolean stemEnglishPossessive)
Create a new WordDelimiterIterator operating with the supplied rules.
Parameters:
charTypeTable - table containing character types
splitOnCaseChange - if true, causes "PowerShot" to be two tokens ("Power-Shot" remains two parts regardless)
splitOnNumerics - if true, causes "j2se" to be three tokens: "j", "2", "se"
stemEnglishPossessive - if true, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
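The effect of the two split flags described above can be illustrated with a small stand-alone splitter. This is only a sketch under simplifying assumptions: the class SplitFlagsSketch, its helper names, and the numeric type values are hypothetical, character typing is reduced to upper/lower/digit/delimiter, and possessive handling is left out. It is not the Lucene implementation.

```java
// Illustrative sketch only: a simplified splitter, not the Lucene implementation.
final class SplitFlagsSketch {
  private static final int LOWER = 1, UPPER = 2, DIGIT = 4, DELIM = 8; // assumed values

  // Simplified character typing: cased letters and digits; everything else is a delimiter.
  private static int typeOf(char c) {
    if (Character.isUpperCase(c)) return UPPER;
    if (Character.isLowerCase(c)) return LOWER;
    if (Character.isDigit(c)) return DIGIT;
    return DELIM;
  }

  private static boolean isLetter(int t) {
    return (t & (LOWER | UPPER)) != 0;
  }

  // True if a subword boundary falls between a character of type last and one of type cur.
  private static boolean isBreak(int last, int cur, boolean splitOnCaseChange, boolean splitOnNumerics) {
    if (last == cur) return false;
    if (last == UPPER && cur == LOWER) return false;               // "Po": no split after a capital
    if (isLetter(last) && isLetter(cur)) return splitOnCaseChange; // lower -> upper
    if (last != DELIM && cur != DELIM) return splitOnNumerics;     // letter <-> digit
    return true;                                                   // a delimiter always splits
  }

  static java.util.List<String> split(String s, boolean splitOnCaseChange, boolean splitOnNumerics) {
    java.util.List<String> out = new java.util.ArrayList<>();
    int start = -1;   // start of the subword being built, or -1 when between subwords
    int last = DELIM; // type of the previous character
    for (int i = 0; i < s.length(); i++) {
      int cur = typeOf(s.charAt(i));
      if (start >= 0 && isBreak(last, cur, splitOnCaseChange, splitOnNumerics)) {
        out.add(s.substring(start, i));
        start = -1;
      }
      if (start < 0 && cur != DELIM) start = i;
      last = cur;
    }
    if (start >= 0) out.add(s.substring(start));
    return out;
  }
}
```

With both flags true, split("PowerShot", true, true) yields "Power" and "Shot", and split("j2se", true, true) yields "j", "2", "se". With splitOnCaseChange false, "PowerShot" stays whole while "Power-Shot" still splits at the hyphen.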
-
-
Method Details
-
toString
-
next
int next()
Advance to the next subword in the string.
Returns:
index of the next subword, or DONE if all subwords have been returned
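The calling pattern implied by next() and DONE resembles a BreakIterator loop: keep advancing until the sentinel comes back. The sketch below mirrors that protocol with a hypothetical class, SubwordCursorSketch, that splits on non-alphanumeric characters only. The sentinel value -1 and the choice to return the end offset of the subword are assumptions made for illustration.

```java
// Hypothetical mini-iterator mirroring the next()/DONE protocol; it splits on delimiters only.
final class SubwordCursorSketch {
  static final int DONE = -1; // assumed sentinel value

  private final char[] text;
  int current; // beginning of the current subword
  int end;     // end (exclusive) of the current subword, or DONE

  SubwordCursorSketch(String s) {
    this.text = s.toCharArray();
  }

  // Advances to the next subword and returns its end offset, or DONE when none remain.
  int next() {
    current = end;
    if (current == DONE) return DONE;
    // Skip any delimiters in front of the next subword.
    while (current < text.length && !Character.isLetterOrDigit(text[current])) current++;
    if (current >= text.length) return end = DONE;
    end = current + 1;
    while (end < text.length && Character.isLetterOrDigit(text[end])) end++;
    return end;
  }

  // Collects every subword by looping until next() reports DONE.
  static java.util.List<String> subwords(String s) {
    SubwordCursorSketch it = new SubwordCursorSketch(s);
    java.util.List<String> out = new java.util.ArrayList<>();
    while (it.next() != DONE) {
      out.add(new String(it.text, it.current, it.end - it.current));
    }
    return out;
  }
}
```

For example, subwords("Wi-Fi 6") collects "Wi", "Fi" and "6", and an input made only of delimiters produces an empty list.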
-
type
int type()
Return the type of the current subword. This currently uses the type of the first character in the subword.
Returns:
type of the current word
-
setText
void setText(char[] text, int length)
Reset the text to a new value, and reset all state.
Parameters:
text - New text
length - length of the text
-
isBreak
private boolean isBreak(int lastType, int type)
Determines whether the transition from lastType to type indicates a break
Parameters:
lastType - Last subword type
type - Current subword type
Returns:
true if the transition indicates a break, false otherwise
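One way such a transition test could be written over bit-flag types is sketched below. The class BreakRuleSketch, the constant values, and the order of the rules are assumptions chosen to match the documented behaviour of splitOnCaseChange and splitOnNumerics. The flags are passed as parameters here so the method can stay static.

```java
// Sketch of a break test over bit-flag types; constants and rule order are assumptions.
final class BreakRuleSketch {
  static final int LOWER = 0x01, UPPER = 0x02, DIGIT = 0x04, SUBWORD_DELIM = 0x08;
  static final int ALPHA = LOWER | UPPER;

  static boolean isBreak(int lastType, int type, boolean splitOnCaseChange, boolean splitOnNumerics) {
    if ((type & lastType) != 0) return false; // both characters share a type bit
    boolean lastAlpha = (lastType & ALPHA) != 0;
    boolean alpha = (type & ALPHA) != 0;
    boolean lastDigit = (lastType & DIGIT) != 0;
    boolean digit = (type & DIGIT) != 0;
    if (!splitOnCaseChange && lastAlpha && alpha) return false; // case changes ignored
    if ((lastType & UPPER) != 0 && alpha) return false;         // capital followed by a letter
    if (!splitOnNumerics && ((lastAlpha && digit) || (lastDigit && alpha))) return false;
    return true; // everything else, including any move to or from a delimiter
  }
}
```

Under these rules a LOWER to UPPER transition breaks only when splitOnCaseChange is true, while a transition into SUBWORD_DELIM breaks regardless of either flag.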
-
isSingleWord
boolean isSingleWord()
Determines if the current word contains only one subword. Note that it may still be surrounded by delimiters.
Returns:
true if the current word contains only one subword, false otherwise
-
setBounds
private void setBounds()
Set the internal word bounds (remove leading and trailing delimiters). If a possessive is found, it is not removed yet, only noted.
-
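The bounds computation described for setBounds() can be pictured as trimming delimiters from both ends of the buffer. The sketch below assumes that "delimiter" means any non-alphanumeric character and uses a hypothetical helper class, BoundsSketch. The step that notes a trailing possessive is left out.

```java
// Sketch of trimming leading and trailing delimiters; not the library's code.
final class BoundsSketch {
  // Returns {startBounds, endBounds}: the span left after dropping delimiters at both ends.
  static int[] bounds(char[] text, int length) {
    int start = 0;
    while (start < length && !Character.isLetterOrDigit(text[start])) start++;
    int end = length;
    while (end > start && !Character.isLetterOrDigit(text[end - 1])) end--;
    return new int[] {start, end};
  }
}
```

For "--foo-bar--" this gives a start of 2 and an end of 9, so the inner hyphen stays inside the bounds and is left for subword iteration to handle.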
endsWithPossessive
private boolean endsWithPossessive(int pos)
Determines if the text at the given position indicates an English possessive which should be removed
Parameters:
pos - Position in the text to check if it indicates an English possessive
Returns:
true if the text at the position indicates an English possessive, false otherwise
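A possessive check of this kind could look roughly like the following. It is a sketch with assumed semantics: pos is taken to be the offset just past a candidate "'s", the helper is static and receives the buffer and end bound explicitly, and Character.isLetter stands in for the class's own type lookup. The stemEnglishPossessive flag is not modelled.

```java
// Sketch of an English possessive test; parameter meanings are assumptions.
final class PossessiveSketch {
  // True if the two characters before pos are "'s" (or "'S"), preceded by a letter,
  // and pos is either the end of the word or a non-alphanumeric character.
  static boolean endsWithPossessive(char[] text, int pos, int endBounds) {
    return pos > 2
        && pos <= endBounds
        && text[pos - 2] == '\''
        && (text[pos - 1] == 's' || text[pos - 1] == 'S')
        && Character.isLetter(text[pos - 3])
        && (pos == endBounds || !Character.isLetterOrDigit(text[pos]));
  }
}
```

For "O'Neil's" the check succeeds at position 8 (the trailing "'s") and fails at position 3, where the apostrophe is followed by "N" rather than "s".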
-
charType
private int charType(int ch)
Determines the type of the given character
Parameters:
ch - Character whose type is to be determined
Returns:
Type of the character
-
getType
public static byte getType(int ch)
Computes the type of the given character
Parameters:
ch - Character whose type is to be determined
Returns:
Type of the character
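A character type computation like this can be built on the Unicode general category reported by java.lang.Character.getType(int). The mapping below is a simplified sketch: the class CharTypeSketch, the constant values, and the exact set of categories assigned to each type are assumptions and may differ from what the library does.

```java
// Simplified sketch: maps Unicode general categories to assumed subword type flags.
final class CharTypeSketch {
  static final byte LOWER = 0x01, UPPER = 0x02, DIGIT = 0x04, SUBWORD_DELIM = 0x08; // assumed values

  static byte getType(int ch) {
    switch (Character.getType(ch)) {
      case Character.UPPERCASE_LETTER:
        return UPPER;
      case Character.LOWERCASE_LETTER:
        return LOWER;
      case Character.TITLECASE_LETTER:
      case Character.MODIFIER_LETTER:
      case Character.OTHER_LETTER:
        return (byte) (LOWER | UPPER); // a letter that is neither plainly upper nor lower case
      case Character.DECIMAL_DIGIT_NUMBER:
      case Character.LETTER_NUMBER:
      case Character.OTHER_NUMBER:
        return DIGIT;
      default:
        return SUBWORD_DELIM; // punctuation, whitespace, symbols and the rest
    }
  }
}
```

Because the result is a single byte per character, values for a range of code points can be precomputed into a lookup table such as the byte[] passed to the constructor.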
-
isAlpha
static boolean isAlpha(int type)
Checks if the given word type includes ALPHA
Parameters:
type - Word type to check
Returns:
true if the type contains ALPHA, false otherwise
-
isDigit
static boolean isDigit(int type)
Checks if the given word type includes DIGIT
Parameters:
type - Word type to check
Returns:
true if the type contains DIGIT, false otherwise
-
isSubwordDelim
static boolean isSubwordDelim(int type)
Checks if the given word type includes SUBWORD_DELIM
Parameters:
type - Word type to check
Returns:
true if the type contains SUBWORD_DELIM, false otherwise
-
isUpper
static boolean isUpper(int type)
Checks if the given word type includes UPPER
Parameters:
type - Word type to check
Returns:
true if the type contains UPPER, false otherwise
-
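The wording "includes" in the four predicates above suggests that word types are bit flags that can be combined. The sketch below shows how such checks could work. The class SubwordTypeFlags is hypothetical and the numeric values are assumptions, since the constant values are not listed on this page.

```java
// Hypothetical stand-in for the type constants; the numeric values are assumed.
final class SubwordTypeFlags {
  static final int LOWER = 0x01;
  static final int UPPER = 0x02;
  static final int DIGIT = 0x04;
  static final int SUBWORD_DELIM = 0x08;
  static final int ALPHA = LOWER | UPPER;    // any letter
  static final int ALPHANUM = ALPHA | DIGIT; // any letter or digit

  // Each predicate tests whether the type shares at least one bit with the mask.
  static boolean isAlpha(int type) {
    return (type & ALPHA) != 0;
  }

  static boolean isDigit(int type) {
    return (type & DIGIT) != 0;
  }

  static boolean isSubwordDelim(int type) {
    return (type & SUBWORD_DELIM) != 0;
  }

  static boolean isUpper(int type) {
    return (type & UPPER) != 0;
  }
}
```

With this encoding a combined type such as ALPHANUM satisfies both isAlpha and isDigit, which is why the checks are described as "includes" rather than "equals".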