Class WordDelimiterIterator
- java.lang.Object
-
- org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator
-
public final class WordDelimiterIterator extends java.lang.Object
A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.
-
-
Field Summary
Fields
Modifier and Type | Field | Description
static int | ALPHA |
static int | ALPHANUM |
private byte[] | charTypeTable |
(package private) int | current | Beginning of subword
static byte[] | DEFAULT_WORD_DELIM_TABLE |
(package private) static int | DIGIT |
static int | DONE | Indicates the end of iteration
(package private) int | end | End of subword
(package private) int | endBounds | end position of text, excluding trailing delimiters
private boolean | hasFinalPossessive |
(package private) int | length |
(package private) static int | LOWER |
private boolean | skipPossessive | if true, need to skip over a possessive found in the last call to next()
(package private) boolean | splitOnCaseChange | If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
(package private) boolean | splitOnNumerics | If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
(package private) int | startBounds | start position of text, excluding leading delimiters
(package private) boolean | stemEnglishPossessive | If true, causes trailing "'s" to be removed for each subword.
(package private) static int | SUBWORD_DELIM |
(package private) char[] | text |
(package private) static int | UPPER |
-
Constructor Summary
Constructors
Constructor | Description
WordDelimiterIterator(byte[] charTypeTable, boolean splitOnCaseChange, boolean splitOnNumerics, boolean stemEnglishPossessive) | Create a new WordDelimiterIterator operating with the supplied rules.
-
Method Summary
Modifier and Type | Method | Description
private int | charType(int ch) | Determines the type of the given character
private boolean | endsWithPossessive(int pos) | Determines if the text at the given position indicates an English possessive which should be removed
static byte | getType(int ch) | Computes the type of the given character
(package private) static boolean | isAlpha(int type) | Checks if the given word type includes ALPHA
private boolean | isBreak(int lastType, int type) | Determines whether the transition from lastType to type indicates a break
(package private) static boolean | isDigit(int type) | Checks if the given word type includes DIGIT
(package private) boolean | isSingleWord() | Determines if the current word contains only one subword.
(package private) static boolean | isSubwordDelim(int type) | Checks if the given word type includes SUBWORD_DELIM
(package private) static boolean | isUpper(int type) | Checks if the given word type includes UPPER
(package private) int | next() | Advance to the next subword in the string.
private void | setBounds() | Set the internal word bounds (remove leading and trailing delimiters).
(package private) void | setText(char[] text, int length) | Reset the text to a new value, and reset all state
(package private) int | type() | Return the type of the current subword.
-
-
-
Field Detail
-
LOWER
static final int LOWER
- See Also:
- Constant Field Values
-
UPPER
static final int UPPER
- See Also:
- Constant Field Values
-
DIGIT
static final int DIGIT
- See Also:
- Constant Field Values
-
SUBWORD_DELIM
static final int SUBWORD_DELIM
- See Also:
- Constant Field Values
-
ALPHA
public static final int ALPHA
- See Also:
- Constant Field Values
-
ALPHANUM
public static final int ALPHANUM
- See Also:
- Constant Field Values
-
DONE
public static final int DONE
Indicates the end of iteration.
- See Also:
- Constant Field Values
-
DEFAULT_WORD_DELIM_TABLE
public static final byte[] DEFAULT_WORD_DELIM_TABLE
-
text
char[] text
-
length
int length
-
startBounds
int startBounds
start position of text, excluding leading delimiters
-
endBounds
int endBounds
end position of text, excluding trailing delimiters
-
current
int current
Beginning of subword
-
end
int end
End of subword
-
hasFinalPossessive
private boolean hasFinalPossessive
-
splitOnCaseChange
final boolean splitOnCaseChange
If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true)
-
splitOnNumerics
final boolean splitOnNumerics
If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true)
-
stemEnglishPossessive
final boolean stemEnglishPossessive
If true, causes trailing "'s" to be removed for each subword. (Defaults to true) Example: "O'Neil's" => "O", "Neil"
-
charTypeTable
private final byte[] charTypeTable
-
skipPossessive
private boolean skipPossessive
If true, the possessive found in the last call to next() still needs to be skipped.
-
-
Constructor Detail
-
WordDelimiterIterator
WordDelimiterIterator(byte[] charTypeTable, boolean splitOnCaseChange, boolean splitOnNumerics, boolean stemEnglishPossessive)
Create a new WordDelimiterIterator operating with the supplied rules.
- Parameters:
charTypeTable - table containing character types
splitOnCaseChange - if true, causes "PowerShot" to be two tokens ("Power-Shot" remains two parts regardless)
splitOnNumerics - if true, causes "j2se" to be three tokens: "j" "2" "se"
stemEnglishPossessive - if true, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
-
-
Method Detail
-
next
int next()
Advance to the next subword in the string.
- Returns:
index of the next subword, or DONE if all subwords have been returned
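The iteration contract (call setText, then call next() until it returns DONE, reading current and end for each subword) can be sketched with a small stand-in class. This is not the Lucene implementation: SubwordIteratorSketch, typeOf, subwords, and the type bit values are hypothetical names and assumptions made for illustration, possessive handling is left out, and next() here returns the end offset of the subword it found.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for illustration only; not Lucene's WordDelimiterIterator. */
final class SubwordIteratorSketch {
  static final int DONE = -1;
  // Assumed type bits, local to this sketch.
  private static final int LOWER = 1, UPPER = 2, DIGIT = 4, DELIM = 8, ALPHA = LOWER | UPPER;

  private final boolean splitOnCaseChange;
  private final boolean splitOnNumerics;
  private char[] text;
  private int length;
  int current; // beginning of the current subword
  int end;     // end of the current subword (exclusive)

  SubwordIteratorSketch(boolean splitOnCaseChange, boolean splitOnNumerics) {
    this.splitOnCaseChange = splitOnCaseChange;
    this.splitOnNumerics = splitOnNumerics;
  }

  /** Resets the text and all iteration state. */
  void setText(char[] text, int length) {
    this.text = text;
    this.length = length;
    this.current = this.end = 0;
  }

  /** Advances to the next subword; returns its end offset, or DONE. */
  int next() {
    current = end;
    if (current == DONE) return DONE;
    int lastType = 0;
    // Skip delimiters in front of the next subword.
    while (current < length && ((lastType = typeOf(text[current])) & DELIM) != 0) current++;
    if (current >= length) return end = DONE;
    // Extend the subword until a type transition counts as a break.
    for (end = current + 1; end < length; end++) {
      int type = typeOf(text[end]);
      if (isBreak(lastType, type)) break;
      lastType = type;
    }
    return end;
  }

  // Coarse classifier: anything that is not a cased letter or a digit is a delimiter.
  private static int typeOf(char c) {
    if (Character.isUpperCase(c)) return UPPER;
    if (Character.isLowerCase(c)) return LOWER;
    if (Character.isDigit(c)) return DIGIT;
    return DELIM;
  }

  private boolean isBreak(int lastType, int type) {
    if ((type & lastType) != 0) return false; // same class, no break
    boolean lastAlpha = (lastType & ALPHA) != 0;
    boolean alpha = (type & ALPHA) != 0;
    if (!splitOnCaseChange && lastAlpha && alpha) return false;
    if ((lastType & UPPER) != 0 && alpha) return false; // keep a capital with its word
    boolean alphaDigit = (lastAlpha && (type & DIGIT) != 0) || ((lastType & DIGIT) != 0 && alpha);
    if (!splitOnNumerics && alphaDigit) return false;
    return true;
  }

  /** Convenience helper: collects all subwords of s. */
  static List<String> subwords(String s, boolean splitOnCaseChange, boolean splitOnNumerics) {
    SubwordIteratorSketch it = new SubwordIteratorSketch(splitOnCaseChange, splitOnNumerics);
    it.setText(s.toCharArray(), s.length());
    List<String> out = new ArrayList<>();
    while (it.next() != DONE) out.add(s.substring(it.current, it.end));
    return out;
  }
}
```

With both flags enabled, subwords("j2se", true, true) yields "j", "2", "se", matching the constructor documentation; with splitOnNumerics disabled the same input stays a single subword.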
-
type
int type()
Return the type of the current subword. This currently uses the type of the first character in the subword.
- Returns:
type of the current word
-
setText
void setText(char[] text, int length)
Reset the text to a new value, and reset all state.
- Parameters:
text - New text
length - length of the text
-
isBreak
private boolean isBreak(int lastType, int type)
Determines whether the transition from lastType to type indicates a break.
- Parameters:
lastType - Last subword type
type - Current subword type
- Returns:
true if the transition indicates a break, false otherwise
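One way to read the break rule together with the splitOnCaseChange and splitOnNumerics flags is the sketch below. It is an approximation under stated assumptions: BreakRuleSketch is a hypothetical class, the flags are passed as parameters rather than read from fields, and the bit values of the type constants are assumed rather than taken from this page.

```java
/** Illustrative sketch of a break decision; names and bit values are assumptions. */
final class BreakRuleSketch {
  static final int LOWER = 0x01, UPPER = 0x02, DIGIT = 0x04, SUBWORD_DELIM = 0x08;
  static final int ALPHA = LOWER | UPPER;

  static boolean isBreak(int lastType, int type,
                         boolean splitOnCaseChange, boolean splitOnNumerics) {
    // Characters sharing a type bit belong to the same subword.
    if ((type & lastType) != 0) return false;

    boolean lastAlpha = (lastType & ALPHA) != 0;
    boolean alpha = (type & ALPHA) != 0;
    boolean lastDigit = (lastType & DIGIT) != 0;
    boolean digit = (type & DIGIT) != 0;

    // Letter-to-letter case changes are ignored unless splitOnCaseChange is set.
    if (!splitOnCaseChange && lastAlpha && alpha) return false;
    // An upper-case letter followed by a letter starts a word ("Shot"), so no break.
    if ((lastType & UPPER) != 0 && alpha) return false;
    // Letter/digit transitions are ignored unless splitOnNumerics is set.
    if (!splitOnNumerics && ((lastAlpha && digit) || (lastDigit && alpha))) return false;

    // Everything else, including any transition to or from a delimiter, is a break.
    return true;
  }
}
```

Under this reading a transition involving SUBWORD_DELIM always breaks, which is why "Power-Shot" is split regardless of the flags.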
-
isSingleWord
boolean isSingleWord()
Determines if the current word contains only one subword. Note that it may be surrounded by delimiters.
- Returns:
true if the current word contains only one subword, false otherwise
-
setBounds
private void setBounds()
Set the internal word bounds (remove leading and trailing delimiters). Note, if a possessive is found, don't remove it yet, simply note it.
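Trimming leading and trailing delimiters amounts to moving a start index forward and an end index backward. A minimal sketch, assuming a delimiter is any character that is neither a letter nor a digit; BoundsSketch, isDelim, and bounds are hypothetical names, and the possessive bookkeeping mentioned above is omitted:

```java
/** Illustrative sketch of bounds trimming; not the actual setBounds() implementation. */
final class BoundsSketch {
  // Assumption: a delimiter is any character that is not a letter or digit.
  static boolean isDelim(char c) {
    return !Character.isLetterOrDigit(c);
  }

  /** Returns {startBounds, endBounds}, end exclusive, with delimiters trimmed from both sides. */
  static int[] bounds(char[] text, int length) {
    int start = 0;
    while (start < length && isDelim(text[start])) start++;
    int end = length;
    while (end > start && isDelim(text[end - 1])) end--;
    return new int[] {start, end};
  }
}
```

For the text "--foo-bar--" this gives a start of 2 and an end of 9, so only "foo-bar" remains to be split into subwords.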
-
endsWithPossessive
private boolean endsWithPossessive(int pos)
Determines if the text at the given position indicates an English possessive which should be removed.
- Parameters:
pos - Position in the text to check if it indicates an English possessive
- Returns:
true if the text at the position indicates an English possessive, false otherwise
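A possessive check of this kind can be sketched as a test on the characters just before pos. The version below is an assumption-laden illustration: PossessiveSketch is a hypothetical class, text and endBounds are passed in explicitly, pos is taken to be the index just past the "'s", and Character.isLetter / Character.isLetterOrDigit stand in for the type-table lookups.

```java
/** Illustrative sketch of an English possessive check; names are hypothetical. */
final class PossessiveSketch {
  /**
   * Returns true if the two characters before pos are "'s" (or "'S"), the character
   * before them is a letter, and pos is either the end of the word or a delimiter.
   */
  static boolean endsWithPossessive(char[] text, int pos, int endBounds) {
    return pos > 2
        && pos <= endBounds
        && text[pos - 2] == '\''
        && (text[pos - 1] == 's' || text[pos - 1] == 'S')
        && Character.isLetter(text[pos - 3])
        && (pos == endBounds || !Character.isLetterOrDigit(text[pos]));
  }
}
```

For "O'Neil's" with pos and endBounds both 8, the check succeeds, so the trailing "'s" can be dropped and the remaining subwords are "O" and "Neil".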
-
charType
private int charType(int ch)
Determines the type of the given character.
- Parameters:
ch - Character whose type is to be determined
- Returns:
Type of the character
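Since the constructor accepts a charTypeTable, a natural reading is that charType consults the table for characters it covers and falls back to a computed type otherwise. The sketch below illustrates that idea only; it is an assumption about the behaviour, and TableLookupSketch and fallbackType are hypothetical names with assumed bit values.

```java
/** Illustrative sketch of a table lookup with a computed fallback; names are hypothetical. */
final class TableLookupSketch {
  static final int LOWER = 0x01, UPPER = 0x02, DIGIT = 0x04, SUBWORD_DELIM = 0x08;

  private final byte[] charTypeTable;

  TableLookupSketch(byte[] charTypeTable) {
    this.charTypeTable = charTypeTable;
  }

  // Hypothetical fallback classifier, much coarser than a real one.
  static byte fallbackType(int ch) {
    if (Character.isUpperCase(ch)) return (byte) UPPER;
    if (Character.isLowerCase(ch)) return (byte) LOWER;
    if (Character.isDigit(ch)) return (byte) DIGIT;
    return (byte) SUBWORD_DELIM;
  }

  /** Uses the table when it covers ch, otherwise computes the type. */
  int charType(int ch) {
    if (ch >= 0 && ch < charTypeTable.length) {
      return charTypeTable[ch];
    }
    return fallbackType(ch);
  }
}
```

A caller could, for example, build a 128-entry table from fallbackType and then overwrite the entry for '-' so that hyphens are treated as letters and no longer split words.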
-
getType
public static byte getType(int ch)
Computes the type of the given character.
- Parameters:
ch - Character whose type is to be determined
- Returns:
Type of the character
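A character type can be computed from the Unicode general category reported by java.lang.Character.getType. The mapping below is one plausible grouping, shown as a sketch: CharTypeSketch is a hypothetical class, the bit values are assumed, and the real method may group categories differently.

```java
/** Illustrative sketch mapping Unicode general categories to type bits; an assumption. */
final class CharTypeSketch {
  static final int LOWER = 0x01, UPPER = 0x02, DIGIT = 0x04, SUBWORD_DELIM = 0x08;
  static final int ALPHA = LOWER | UPPER;

  static byte getType(int ch) {
    switch (Character.getType(ch)) {
      case Character.UPPERCASE_LETTER:
        return (byte) UPPER;
      case Character.LOWERCASE_LETTER:
        return (byte) LOWER;
      // Letters without case, and combining marks, count as generic alphabetic.
      case Character.TITLECASE_LETTER:
      case Character.MODIFIER_LETTER:
      case Character.OTHER_LETTER:
      case Character.NON_SPACING_MARK:
      case Character.ENCLOSING_MARK:
      case Character.COMBINING_SPACING_MARK:
        return (byte) ALPHA;
      case Character.DECIMAL_DIGIT_NUMBER:
      case Character.LETTER_NUMBER:
      case Character.OTHER_NUMBER:
        return (byte) DIGIT;
      // Punctuation, symbols, whitespace, and everything else act as delimiters.
      default:
        return (byte) SUBWORD_DELIM;
    }
  }
}
```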
-
isAlpha
static boolean isAlpha(int type)
Checks if the given word type includes ALPHA.
- Parameters:
type - Word type to check
- Returns:
true if the type contains ALPHA, false otherwise
-
isDigit
static boolean isDigit(int type)
Checks if the given word type includes DIGIT.
- Parameters:
type - Word type to check
- Returns:
true if the type contains DIGIT, false otherwise
-
isSubwordDelim
static boolean isSubwordDelim(int type)
Checks if the given word type includes SUBWORD_DELIM.
- Parameters:
type - Word type to check
- Returns:
true if the type contains SUBWORD_DELIM, false otherwise
-
isUpper
static boolean isUpper(int type)
Checks if the given word type includes UPPER.
- Parameters:
type - Word type to check
- Returns:
true if the type contains UPPER, false otherwise
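The wording "includes" suggests that word types are bit flags and that each predicate is a mask test. A minimal sketch under that assumption follows; TypeFlagsSketch is a hypothetical class and the numeric values are assumed for illustration, so the Constant Field Values page remains the authority for the real ones.

```java
/** Illustrative stand-in for the type predicates; bit values are assumptions. */
final class TypeFlagsSketch {
  static final int LOWER = 0x01;
  static final int UPPER = 0x02;
  static final int DIGIT = 0x04;
  static final int SUBWORD_DELIM = 0x08;
  static final int ALPHA = LOWER | UPPER;    // any letter
  static final int ALPHANUM = ALPHA | DIGIT; // any letter or digit

  // Each predicate is true when the type shares at least one bit with the mask.
  static boolean isAlpha(int type) {
    return (type & ALPHA) != 0;
  }

  static boolean isDigit(int type) {
    return (type & DIGIT) != 0;
  }

  static boolean isSubwordDelim(int type) {
    return (type & SUBWORD_DELIM) != 0;
  }

  static boolean isUpper(int type) {
    return (type & UPPER) != 0;
  }
}
```

Because ALPHA combines LOWER and UPPER in this sketch, isAlpha is true for either case, while isUpper is true only when the UPPER bit is present.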
-
-