Package org.apache.lucene.analysis.standard
Fast, general-purpose grammar-based tokenizer.

StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer from the analysis module, URLs and email addresses are not tokenized as single tokens, but are instead split into tokens according to the UAX#29 word break rules.
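For illustration only, here is a minimal sketch of driving StandardTokenizer directly. It assumes a Lucene release in which the tokenizer has a no-argument constructor and a setReader(...) method; the exact tokens produced can differ between versions.

import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardTokenizerSketch {
  public static void main(String[] args) throws Exception {
    // The URL is broken up by the UAX#29 word break rules,
    // yielding roughly: see, https, lucene.apache.org, docs (version-dependent).
    String text = "see https://lucene.apache.org/docs";
    try (StandardTokenizer tokenizer = new StandardTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        System.out.println(term.toString());
      }
      tokenizer.end();
    }
  }
}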
StandardAnalyzer includes StandardTokenizer, LowerCaseFilter and StopFilter.
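As a rough usage sketch (not part of this package documentation), the analyzer can be driven through Analyzer.tokenStream. This assumes a Lucene release with a no-argument StandardAnalyzer constructor; the default stop word set differs across versions, so the filtered output may vary.

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardAnalyzerSketch {
  public static void main(String[] args) throws Exception {
    // "body" is just an illustrative field name; the analyzer applies
    // StandardTokenizer, LowerCaseFilter and (depending on how it is
    // constructed) StopFilter to the text.
    try (StandardAnalyzer analyzer = new StandardAnalyzer();
         TokenStream stream = analyzer.tokenStream("body", "The Quick Brown Fox")) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        System.out.println(term.toString()); // lowercased terms, stop words possibly removed
      }
      stream.end();
    }
  }
}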
Class Summary

ClassicAnalyzer: Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
ClassicFilter: Normalizes tokens extracted with ClassicTokenizer.
ClassicFilterFactory: Factory for ClassicFilter.
ClassicTokenizer: A grammar-based tokenizer constructed with JFlex.
ClassicTokenizerFactory: Factory for ClassicTokenizer.
ClassicTokenizerImpl: This class implements the classic Lucene StandardTokenizer up until release 3.0.
StandardAnalyzer: Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a configurable list of stop words.
StandardTokenizer: A grammar-based tokenizer constructed with JFlex.
StandardTokenizerFactory: Factory for StandardTokenizer.
StandardTokenizerImpl: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
UAX29URLEmailAnalyzer: Filters UAX29URLEmailTokenizer with LowerCaseFilter and StopFilter, using a list of English stop words.
UAX29URLEmailTokenizer: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
UAX29URLEmailTokenizerFactory: Factory for UAX29URLEmailTokenizer.
UAX29URLEmailTokenizerImpl: This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
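By contrast with the StandardTokenizer sketch above, a minimal sketch of UAX29URLEmailTokenizer keeps URLs and email addresses intact as single tokens. The same version assumptions apply, and the import below matches the package shown in this summary; some later Lucene releases place the class in a different package.

import java.io.StringReader;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class UAX29URLEmailTokenizerSketch {
  public static void main(String[] args) throws Exception {
    String text = "mail dev@lucene.apache.org or visit https://lucene.apache.org";
    try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        // The email address and the URL each come out as one token here.
        System.out.println(term.toString());
      }
      tokenizer.end();
    }
  }
}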