Class HMMChineseTokenizer

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    public class HMMChineseTokenizer
    extends SegmentingTokenizerBase
    Tokenizer for Chinese or mixed Chinese-English text.

    The tokenizer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
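
    The first stage described above (breaking text into sentences) can be sketched with the JDK's java.text.BreakIterator, the same type as the sentenceProto field. This is a minimal illustration, not the class's actual source: the class name SentenceSplitSketch, the method splitSentences, and the choice of Locale.ROOT are assumptions made for the example. The second stage (segmenting each sentence into words) is not shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {
    // Shared prototype (assumed to use the root locale). BreakIterator is
    // stateful, so each caller works on its own clone.
    private static final BreakIterator SENTENCE_PROTO =
            BreakIterator.getSentenceInstance(Locale.ROOT);

    /** Splits text into sentences; concatenating the pieces yields the input. */
    public static List<String> splitSentences(String text) {
        BreakIterator it = (BreakIterator) SENTENCE_PROTO.clone();
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // next() returns each sentence boundary in turn, then DONE.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Each printed line would then be handed to the word segmenter.
        for (String sentence : splitSentences("Hello world. How are you?")) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

    Cloning a static prototype avoids creating a new BreakIterator per tokenizer while keeping each instance's iteration state separate.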

    • Field Detail

      • sentenceProto

        private static final java.text.BreakIterator sentenceProto
        Used for breaking the text into sentences.
      • tokens

        private java.util.Iterator<SegToken> tokens
    • Constructor Detail

      • HMMChineseTokenizer

        public HMMChineseTokenizer()
        Creates a new HMMChineseTokenizer
      • HMMChineseTokenizer

        public HMMChineseTokenizer(AttributeFactory factory)
        Creates a new HMMChineseTokenizer, supplying the AttributeFactory
    • Method Detail

      • reset

        public void reset()
                   throws java.io.IOException
        Description copied from class: TokenStream
        This method is called by a consumer before it begins consumption using TokenStream.incrementToken().

        Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.

        If you override this method, always call super.reset(). Otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).

        Overrides:
        reset in class SegmentingTokenizerBase
        Throws:
        java.io.IOException
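
        The super.reset() requirement can be illustrated with a simplified model. The classes below (BaseStream, GoodStream, BrokenStream) are hypothetical stand-ins written for this sketch and are not Lucene's TokenStream or Tokenizer; they only mimic the contract that the base class tracks whether it has been reset and refuses to produce tokens otherwise.

```java
public class ResetContractSketch {
    /** Hypothetical base class; NOT Lucene's Tokenizer. */
    public static class BaseStream {
        private boolean ready = false;   // internal state set only by reset()
        private int position = 0;

        public void reset() {
            ready = true;
            position = 0;
        }

        /** Emits three dummy tokens, then reports exhaustion. */
        public boolean incrementToken() {
            if (!ready) {
                throw new IllegalStateException("reset() not called, or super.reset() skipped");
            }
            if (position < 3) {
                position++;
                return true;
            }
            ready = false;               // consumed; must be reset before reuse
            return false;
        }
    }

    /** Correct override: clears its own state and delegates to the base class. */
    public static class GoodStream extends BaseStream {
        private int seen = 0;

        @Override
        public void reset() {
            super.reset();
            seen = 0;
        }
    }

    /** Incorrect override: forgets super.reset(), so base state stays stale. */
    public static class BrokenStream extends BaseStream {
        @Override
        public void reset() {
            // missing super.reset()
        }
    }

    /** Consumer workflow: reset first, then pull tokens until exhausted. */
    public static int countTokens(BaseStream stream) {
        stream.reset();
        int count = 0;
        while (stream.incrementToken()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        GoodStream good = new GoodStream();
        System.out.println(countTokens(good));   // first use
        System.out.println(countTokens(good));   // reuse after reset
        try {
            countTokens(new BrokenStream());
        } catch (IllegalStateException e) {
            System.out.println("broken override: " + e.getMessage());
        }
    }
}
```

        In this model the consumer calls reset() before every pass, which is what makes a single instance reusable. The broken override fails on the first incrementToken() call because the base class never learns it was reset.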