public class JapaneseNumberFilter extends TokenFilter
A TokenFilter that normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters.
Japanese numbers are often written using a combination of kanji and Arabic numerals with various kinds of punctuation. For example, 3.2千 means 3200. This filter performs this kind of normalization, so that a search for 3200 matches 3.2千 in text. The normalized numbers can also be used for range facets and similar features.
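As a rough illustration of the normalization described above, the following standalone sketch converts a single, already composed number string into a plain decimal string. It is not the Lucene implementation: the class `KanjiNumberSketch` and its methods `normalize` and `digitValue` are hypothetical names, it handles only the kanji digits, the units 十, 百, 千 and the large units 万, 億, 兆, 京, and it ignores tokenization entirely. Kanji are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.math.BigDecimal;

/** Standalone sketch of kanji number normalization; not the Lucene implementation. */
final class KanjiNumberSketch {
    // Kanji digits zero through nine (U+3007, then ichi .. kyu).
    private static final String KANJI_DIGITS =
        "\u3007\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d";
    // Medium units: ten, hundred, thousand (10^1, 10^2, 10^3).
    private static final String MEDIUM_UNITS = "\u5341\u767e\u5343";
    // Large units: man, oku, cho, kei (10^4, 10^8, 10^12, 10^16).
    private static final String LARGE_UNITS = "\u4e07\u5104\u5146\u4eac";

    /** Returns 0-9 for a half-width, full-width or kanji digit, otherwise -1. */
    static int digitValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= '\uff10' && c <= '\uff19') return c - '\uff10';
        return KANJI_DIGITS.indexOf(c);
    }

    private static BigDecimal add(BigDecimal sum, BigDecimal term) {
        return sum == null ? term : sum.add(term);
    }

    /** Normalizes one number string; throws IllegalArgumentException on other input. */
    static String normalize(String input) {
        BigDecimal total = BigDecimal.ZERO;           // completed large-unit groups
        BigDecimal group = null;                      // current group; null means empty
        StringBuilder mantissa = new StringBuilder(); // digits not yet scaled by a unit
        for (char c : input.toCharArray()) {
            int medium = MEDIUM_UNITS.indexOf(c);
            int large = LARGE_UNITS.indexOf(c);
            if (digitValue(c) >= 0) {
                mantissa.append(digitValue(c));
            } else if (c == '.' || c == '\uff0e') {
                mantissa.append('.');                 // half-width or full-width full stop
            } else if (c == ',' || c == '\uff0c') {
                // thousands separators carry no value and are skipped
            } else if (medium >= 0) {
                // a bare unit counts as one of that unit, e.g. "ten" alone is 10
                BigDecimal m = mantissa.length() == 0
                    ? BigDecimal.ONE : new BigDecimal(mantissa.toString());
                group = add(group, m.scaleByPowerOfTen(medium + 1));
                mantissa.setLength(0);
            } else if (large >= 0) {
                if (mantissa.length() > 0) {
                    group = add(group, new BigDecimal(mantissa.toString()));
                }
                if (group == null) group = BigDecimal.ONE;
                total = total.add(group.scaleByPowerOfTen(4 * (large + 1)));
                group = null;
                mantissa.setLength(0);
            } else {
                throw new IllegalArgumentException("Not part of a number: " + c);
            }
        }
        if (group != null) total = total.add(group);
        if (mantissa.length() > 0) total = total.add(new BigDecimal(mantissa.toString()));
        return total.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("3.2\u5343")); // 3.2 followed by the kanji for thousand
    }
}
```

With this sketch, `normalize("3.2\u5343")` (3.2千) returns `3200` and `normalize("4,647.100")` returns `4647.1`. Because separators are simply skipped, an input such as `15,7` collapses to `157`, which is worth keeping in mind for any scheme of this kind.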
Note that this filter uses a token composition scheme and relies on punctuation tokens being present in the token stream. Make sure your JapaneseTokenizer has discardPunctuation set to false. If punctuation characters such as ． (U+FF0E FULLWIDTH FULL STOP) are removed from the token stream, this filter would find input tokens 3 and 2千 and give outputs 3 and 2000 instead of 3200, which is likely not the intended result. If you want to remove punctuation characters from your index that are not part of normalized numbers, add a StopFilter with the punctuation you wish to remove after JapaneseNumberFilter in your analyzer chain.
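The arithmetic behind this pitfall can be worked through in a few lines. The sketch below is plain `BigDecimal` arithmetic, not Lucene code, and the class `PunctuationPitfall` and its `scale` helper are hypothetical names chosen for illustration.

```java
import java.math.BigDecimal;

/** Worked arithmetic for the punctuation pitfall; not Lucene code. */
final class PunctuationPitfall {
    // Value of the kanji unit "thousand" (U+5343).
    static final BigDecimal SEN = BigDecimal.valueOf(1000);

    /** Multiplies a mantissa by a unit and prints it without exponent or trailing zeros. */
    static String scale(String mantissa, BigDecimal unit) {
        return new BigDecimal(mantissa).multiply(unit).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        // Punctuation kept: the tokens compose into one number, 3.2 x 1000.
        System.out.println(scale("3.2", SEN));
        // Punctuation discarded: 3 stands alone and only 2 is scaled by 1000.
        System.out.println(scale("3", BigDecimal.ONE) + " and " + scale("2", SEN));
    }
}
```

The first line corresponds to the intended single term 3200; the second corresponds to the two separate terms 3 and 2000 described above.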
Below are some examples of normalizations this filter supports. The input is untokenized text and the result is the single term attribute emitted for the input.
Tokens preceded by a token with a PositionIncrementAttribute of zero are left untouched and emitted as-is.
This filter does not use any part-of-speech information for its normalization. The motivation for this is to also support n-grammed token streams in the future.
This filter may in some cases normalize tokens that are not numbers in their context. For example, 田中京一 is a name and means Tanaka Kyōichi, but 京一 (Kyōichi) out of context can, strictly speaking, also represent the number 10000000000000001. This filter respects the KeywordAttribute, which can be used to prevent specific normalizations from happening.
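To see where that number comes from: 京 (kei) is the large unit 10^16 and 一 (ichi) is 1, so the pair read as a number is 10^16 + 1. The following sketch only reproduces this arithmetic; `LargeUnitArithmetic`, `unit` and `keiIchiAsNumber` are hypothetical names and not part of the Lucene API.

```java
import java.math.BigDecimal;

/** Worked arithmetic for reading kei + ichi as a number; not Lucene code. */
final class LargeUnitArithmetic {
    /** Value of a large unit given its power of ten, e.g. 4 for man or 16 for kei. */
    static BigDecimal unit(int powerOfTen) {
        return BigDecimal.ONE.scaleByPowerOfTen(powerOfTen);
    }

    /** Reads the two characters kei (U+4EAC) and ichi (U+4E00) as "one kei plus one". */
    static String keiIchiAsNumber() {
        BigDecimal kei = unit(16);
        return kei.add(BigDecimal.ONE).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(keiIchiAsNumber());
    }
}
```

The result is a 17-digit number, which shows how a short personal name can turn into a very large term if it is not protected from normalization.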
Also note that token attributes such as PartOfSpeechAttribute, ReadingAttribute, InflectionAttribute and BaseFormAttribute are left unchanged: they inherit the values of the last token used to compose the normalized number and can be wrong. Hence, for 10万 (100000), ReadingAttribute will be set to マン. This is a known issue and is subject to future improvement.
Japanese formal numbers (daiji), accounting numbers and decimal fractions are currently not supported.
Modifier and Type | Class and Description |
---|---|
static class | JapaneseNumberFilter.NumberBuffer: Buffer that holds a Japanese number string and a position index used as a parsed-to marker |
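Nested classes inherited from AttributeSource: AttributeSource.State
Fields inherited from TokenFilter: input
Fields inherited from TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY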
Constructor and Description |
---|
JapaneseNumberFilter(TokenStream input) |
Modifier and Type | Method and Description |
---|---|
boolean | incrementToken(): Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. |
boolean | isArabicNumeral(char c): Arabic numeral predicate. |
boolean | isNumeral(char c): Numeral predicate. |
boolean | isNumeral(java.lang.String input): Numeral predicate. |
boolean | isNumeralPunctuation(char c): Numeral punctuation predicate. |
boolean | isNumeralPunctuation(java.lang.String input): Numeral punctuation predicate. |
java.lang.String | normalizeNumber(java.lang.String number): Normalizes a Japanese number. |
java.math.BigDecimal | parseLargeKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer): Parse large kanji numerals (ten thousands or larger). |
java.math.BigDecimal | parseMediumKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer): Parse medium kanji numerals (tens, hundreds or thousands). |
void | reset(): This method is called by a consumer before it begins consumption using TokenStream.incrementToken(). |
Methods inherited from TokenFilter: close, end
Methods inherited from AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
public JapaneseNumberFilter(TokenStream input)
public final boolean incrementToken() throws java.io.IOException
Description copied from class: TokenStream
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.
The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.
This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.
To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().
Specified by: incrementToken in class TokenStream
Throws: java.io.IOException
public void reset() throws java.io.IOException
Description copied from class: TokenFilter
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
NOTE: The default implementation chains the call to the input TokenStream, so be sure to call super.reset() when overriding this method.
Overrides: reset in class TokenFilter
Throws: java.io.IOException
public java.lang.String normalizeNumber(java.lang.String number)
Parameters: number - number to normalize
public java.math.BigDecimal parseLargeKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer)
Parameters: buffer - buffer to parse
public java.math.BigDecimal parseMediumKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer)
Parameters: buffer - buffer to parse
public boolean isNumeral(java.lang.String input)
Parameters: input - string to test
public boolean isNumeral(char c)
Parameters: c - character to test
public boolean isNumeralPunctuation(java.lang.String input)
Parameters: input - string to test
public boolean isNumeralPunctuation(char c)
Parameters: c - character to test
public boolean isArabicNumeral(char c)
Parameters: c - character to test
Copyright © 2000–2019 The Apache Software Foundation. All rights reserved.