Class Dictionary


  • public class Dictionary
    extends java.lang.Object
    In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
    • Field Detail

      • NOFLAGS

        static final char[] NOFLAGS
      • COMPLEXPREFIXES_KEY

        private static final java.lang.String COMPLEXPREFIXES_KEY
        See Also:
        Constant Field Values
      • ONLYINCOMPOUND_KEY

        private static final java.lang.String ONLYINCOMPOUND_KEY
        See Also:
        Constant Field Values
      • PREFIX_CONDITION_REGEX_PATTERN

        private static final java.lang.String PREFIX_CONDITION_REGEX_PATTERN
        See Also:
        Constant Field Values
      • SUFFIX_CONDITION_REGEX_PATTERN

        private static final java.lang.String SUFFIX_CONDITION_REGEX_PATTERN
        See Also:
        Constant Field Values
      • stripData

        char[] stripData
      • stripOffsets

        int[] stripOffsets
      • affixData

        byte[] affixData
      • currentAffix

        private int currentAffix
      • aliases

        private java.lang.String[] aliases
      • aliasCount

        private int aliasCount
      • morphAliases

        private java.lang.String[] morphAliases
      • morphAliasCount

        private int morphAliasCount
      • stemExceptions

        private java.lang.String[] stemExceptions
      • stemExceptionCount

        private int stemExceptionCount
      • hasStemExceptions

        boolean hasStemExceptions
      • tempPath

        private final java.nio.file.Path tempPath
      • ignoreCase

        boolean ignoreCase
      • complexPrefixes

        boolean complexPrefixes
      • twoStageAffix

        boolean twoStageAffix
      • circumfix

        int circumfix
      • keepcase

        int keepcase
      • needaffix

        int needaffix
      • onlyincompound

        int onlyincompound
      • ignore

        private char[] ignore
      • needsInputCleaning

        boolean needsInputCleaning
      • needsOutputCleaning

        boolean needsOutputCleaning
      • fullStrip

        boolean fullStrip
      • language

        java.lang.String language
      • alternateCasing

        boolean alternateCasing
      • ENCODING_PATTERN

        static final java.util.regex.Pattern ENCODING_PATTERN
        Pattern that accepts an optional BOM, followed by SET and whitespace.
      • CHARSET_ALIASES

        static final java.util.Map<java.lang.String,​java.lang.String> CHARSET_ALIASES
      • DEFAULT_TEMP_DIR

        private static java.nio.file.Path DEFAULT_TEMP_DIR
    • Constructor Detail

      • Dictionary

        public Dictionary​(Directory tempDir,
                          java.lang.String tempFileNamePrefix,
                          java.io.InputStream affix,
                          java.io.InputStream dictionary)
                   throws java.io.IOException,
                          java.text.ParseException
        Creates a new Dictionary containing the information read from the provided InputStreams for the hunspell affix and dictionary files. The InputStreams are not closed; the caller is responsible for closing them.
        Parameters:
        tempDir - Directory to use for offline sorting
        tempFileNamePrefix - prefix to use to generate temp file names
        affix - InputStream for reading the hunspell affix file (won't be closed).
        dictionary - InputStream for reading the hunspell dictionary file (won't be closed).
        Throws:
        java.io.IOException - Can be thrown while reading from the InputStreams
        java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
      • Dictionary

        public Dictionary​(Directory tempDir,
                          java.lang.String tempFileNamePrefix,
                          java.io.InputStream affix,
                          java.util.List<java.io.InputStream> dictionaries,
                          boolean ignoreCase)
                   throws java.io.IOException,
                          java.text.ParseException
        Creates a new Dictionary containing the information read from the provided InputStreams for the hunspell affix and dictionary files. The InputStreams are not closed; the caller is responsible for closing them.
        Parameters:
        tempDir - Directory to use for offline sorting
        tempFileNamePrefix - prefix to use to generate temp file names
        affix - InputStream for reading the hunspell affix file (won't be closed).
        dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
        Throws:
        java.io.IOException - Can be thrown while reading from the InputStreams
        java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
    • Method Detail

      • lookupWord

        IntsRef lookupWord​(char[] word,
                           int offset,
                           int length)
        Looks up Hunspell word forms from the dictionary
      • lookupPrefix

        IntsRef lookupPrefix​(char[] word,
                             int offset,
                             int length)
      • lookupSuffix

        IntsRef lookupSuffix​(char[] word,
                             int offset,
                             int length)
      • lookup

        IntsRef lookup​(FST<IntsRef> fst,
                       char[] word,
                       int offset,
                       int length)
      • readAffixFile

        private void readAffixFile​(java.io.InputStream affixStream,
                                   java.nio.charset.CharsetDecoder decoder)
                            throws java.io.IOException,
                                   java.text.ParseException
        Reads the affix file through the provided InputStream, building up the prefix and suffix maps
        Parameters:
        affixStream - InputStream to read the content of the affix file from
        decoder - CharsetDecoder to decode the content of the file
        Throws:
        java.io.IOException - Can be thrown while reading from the InputStream
        java.text.ParseException
      • affixFST

        private FST<IntsRef> affixFST​(java.util.TreeMap<java.lang.String,​java.util.List<java.lang.Integer>> affixes)
                               throws java.io.IOException
        Throws:
        java.io.IOException
      • escapeDash

        static java.lang.String escapeDash​(java.lang.String re)
      • parseAffix

        private void parseAffix​(java.util.TreeMap<java.lang.String,​java.util.List<java.lang.Integer>> affixes,
                                java.lang.String header,
                                java.io.LineNumberReader reader,
                                java.lang.String conditionPattern,
                                java.util.Map<java.lang.String,​java.lang.Integer> seenPatterns,
                                java.util.Map<java.lang.String,​java.lang.Integer> seenStrips)
                         throws java.io.IOException,
                                java.text.ParseException
        Parses a specific affix rule putting the result into the provided affix map
        Parameters:
        affixes - Map where the result of the parsing will be put
        header - Header line of the affix rule
        reader - LineNumberReader to read the content of the rule from
        conditionPattern - String.format(String, Object...) pattern to be used to generate the condition regex pattern
        seenPatterns - map from condition -> index of patterns, for deduplication.
        Throws:
        java.io.IOException - Can be thrown while reading the rule
        java.text.ParseException
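        The conditionPattern and seenPatterns parameters can be illustrated with a minimal sketch. The class name, the two format strings, and the helper methods below are assumptions made for illustration (the actual values of PREFIX_CONDITION_REGEX_PATTERN and SUFFIX_CONDITION_REGEX_PATTERN are not shown on this page); the sketch only shows how a String.format pattern can expand a raw condition into a regex and how a map can deduplicate the results by index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ConditionPatternSketch {
    // Assumed format strings: a prefix condition is anchored at the start
    // of the word, a suffix condition at the end.
    static final String PREFIX_FORMAT = "%s.*";
    static final String SUFFIX_FORMAT = ".*%s";

    // condition regex -> index into 'patterns', used for deduplication
    private final Map<String, Integer> seenPatterns = new HashMap<>();
    private final List<String> patterns = new ArrayList<>();

    /** Expands a raw condition into a full regex and returns its deduplicated index. */
    int patternIndex(String conditionFormat, String condition) {
        String regex = String.format(conditionFormat, condition);
        Integer index = seenPatterns.get(regex);
        if (index == null) {
            // First time this regex is seen: assign the next free index.
            index = patterns.size();
            seenPatterns.put(regex, index);
            patterns.add(regex);
        }
        return index;
    }

    /** Returns the regex stored under the given index. */
    String pattern(int index) {
        return patterns.get(index);
    }
}
```

        With this layout, two affix rules that share the same condition resolve to the same index, so each distinct regex needs to be stored only once.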
      • parseConversions

        private FST<CharsRef> parseConversions​(java.io.LineNumberReader reader,
                                               int num)
                                        throws java.io.IOException,
                                               java.text.ParseException
        Throws:
        java.io.IOException
        java.text.ParseException
      • getDictionaryEncoding

        static java.lang.String getDictionaryEncoding​(java.io.InputStream affix)
                                               throws java.io.IOException,
                                                      java.text.ParseException
        Parses the encoding specified in the affix file readable through the provided InputStream
        Parameters:
        affix - InputStream for reading the affix file
        Returns:
        Encoding specified in the affix file
        Throws:
        java.io.IOException - Can be thrown while reading from the InputStream
        java.text.ParseException - Thrown if the first non-empty non-comment line read from the file does not adhere to the format SET <encoding>
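        The documented behavior can be sketched as follows. This is not the library's implementation: the class and method names are hypothetical, the regex is an assumed reading of ENCODING_PATTERN (optional BOM, SET, whitespace), and reading the header as ISO-8859-1 is a design choice of the sketch so that every byte maps to exactly one char before the real charset is known.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingSniffSketch {
    // Assumed pattern: optional UTF-8 BOM (seen as three Latin-1 chars),
    // then the keyword SET, then at least one whitespace character.
    private static final Pattern SET_LINE =
        Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");

    /** Returns the encoding named by the first non-empty, non-comment line. */
    static String sniffEncoding(InputStream affix) throws IOException, ParseException {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(affix, StandardCharsets.ISO_8859_1));
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            Matcher m = SET_LINE.matcher(trimmed);
            if (m.find()) {
                return trimmed.substring(m.end()).trim();
            }
            throw new ParseException("Expected SET <encoding>, got: " + line, lineNumber);
        }
        throw new ParseException("Unexpected end of affix file", lineNumber);
    }
}
```

        Because the sketch wraps the stream in a buffered reader, a caller that wants to re-read the affix file afterwards would need a stream that supports mark/reset or a fresh stream.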
      • getJavaEncoding

        private java.nio.charset.CharsetDecoder getJavaEncoding​(java.lang.String encoding)
        Retrieves the CharsetDecoder for the given encoding. Note: this is not perfect, since encoding names such as ISCII-DEVANAGARI and MICROSOFT-CP1251 are probably also allowed in affix files.
        Parameters:
        encoding - Encoding to retrieve the CharsetDecoder for
        Returns:
        CharsetDecoder for the given encoding
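        A minimal sketch of such a lookup, assuming an alias table in the spirit of CHARSET_ALIASES, might look like this. The class name, the alias entry, and the choice to replace malformed input rather than fail are all assumptions of the sketch, not statements about the actual implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.HashMap;
import java.util.Map;

final class DecoderLookupSketch {
    // Hypothetical alias table: hunspell-style name -> Java charset name.
    // The real contents of CHARSET_ALIASES are not shown on this page.
    static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("microsoft-cp1251", "windows-1251");
    }

    /** Resolves an affix-file encoding name to a decoder. */
    static CharsetDecoder decoderFor(String encoding) {
        String name = ALIASES.getOrDefault(encoding, encoding);
        // Charset.forName throws an unchecked exception for unknown or illegal names.
        Charset charset = Charset.forName(name);
        return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
}
```

        Encoding names that are neither in the alias table nor known to the JVM fail in Charset.forName, which is the gap the note above refers to.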
      • getFlagParsingStrategy

        static Dictionary.FlagParsingStrategy getFlagParsingStrategy​(java.lang.String flagLine)
        Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file
        Parameters:
        flagLine - Line containing the flag information
        Returns:
        FlagParsingStrategy that handles parsing flags in the way specified in the FLAG definition
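        In the hunspell affix format, a FLAG line may select long (two-character flags), num (comma-separated decimal numbers), or UTF-8; without a FLAG line, each flag is a single character. The sketch below shows one way to map the FLAG line to a mode and split a raw flag string accordingly. The enum, class, and method names are hypothetical and do not mirror Dictionary.FlagParsingStrategy.

```java
import java.util.ArrayList;
import java.util.List;

final class FlagModeSketch {
    enum FlagMode { SIMPLE, LONG, NUM, UTF8 }

    /** Picks a mode from a line such as "FLAG long". */
    static FlagMode modeFor(String flagLine) {
        String[] parts = flagLine.trim().split("\\s+");
        if (parts.length != 2 || !parts[0].equals("FLAG")) {
            throw new IllegalArgumentException("Illegal FLAG specification: " + flagLine);
        }
        switch (parts[1]) {
            case "long":  return FlagMode.LONG;
            case "num":   return FlagMode.NUM;
            case "UTF-8": return FlagMode.UTF8;
            default:
                throw new IllegalArgumentException("Unknown flag type: " + parts[1]);
        }
    }

    /** Splits the raw flag field of an entry into individual flags. */
    static List<String> splitFlags(FlagMode mode, String raw) {
        List<String> flags = new ArrayList<>();
        switch (mode) {
            case NUM:
                // Decimal numbers separated by commas, e.g. "1,22,333".
                for (String n : raw.split(",")) {
                    flags.add(n.trim());
                }
                break;
            case LONG:
                // Two characters per flag, e.g. "AaBb" -> "Aa", "Bb".
                if (raw.length() % 2 != 0) {
                    throw new IllegalArgumentException("Odd-length long flags: " + raw);
                }
                for (int i = 0; i < raw.length(); i += 2) {
                    flags.add(raw.substring(i, i + 2));
                }
                break;
            default:
                // SIMPLE and UTF8: one char per flag in this sketch.
                for (int i = 0; i < raw.length(); i++) {
                    flags.add(String.valueOf(raw.charAt(i)));
                }
        }
        return flags;
    }
}
```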
      • unescapeEntry

        java.lang.String unescapeEntry​(java.lang.String entry)
      • morphBoundary

        static int morphBoundary​(java.lang.String line)
      • indexOfSpaceOrTab

        static int indexOfSpaceOrTab​(java.lang.String text,
                                     int start)
      • readDictionaryFiles

        private void readDictionaryFiles​(Directory tempDir,
                                         java.lang.String tempFileNamePrefix,
                                         java.util.List<java.io.InputStream> dictionaries,
                                         java.nio.charset.CharsetDecoder decoder,
                                         Builder<IntsRef> words)
                                  throws java.io.IOException
        Reads the dictionary files through the provided InputStreams, building up the words map
        Parameters:
        dictionaries - InputStreams to read the dictionary files through
        decoder - CharsetDecoder used to decode the contents of the file
        Throws:
        java.io.IOException - Can be thrown while reading from the file
      • decodeFlags

        static char[] decodeFlags​(BytesRef b)
      • encodeFlags

        static void encodeFlags​(BytesRefBuilder b,
                                char[] flags)
      • parseAlias

        private void parseAlias​(java.lang.String line)
      • getAliasValue

        private java.lang.String getAliasValue​(int id)
      • getStemException

        java.lang.String getStemException​(int id)
      • parseMorphAlias

        private void parseMorphAlias​(java.lang.String line)
      • parseStemException

        private java.lang.String parseStemException​(java.lang.String morphData)
      • hasFlag

        static boolean hasFlag​(char[] flags,
                               char flag)
      • cleanInput

        java.lang.CharSequence cleanInput​(java.lang.CharSequence input,
                                          java.lang.StringBuilder reuse)
      • caseFold

        char caseFold​(char c)
        Folds a single character to lower case (according to LANG, if present)
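        Language-dependent folding matters mainly for Turkic languages, where I and dotted İ lower-case differently from the default rules. The sketch below assumes that the language and alternateCasing fields work roughly this way; the class name and the set of language codes are assumptions, not taken from this page.

```java
final class CaseFoldSketch {
    private final boolean alternateCasing;

    // Assumption: Turkish and Azeri LANG values select the dotted/dotless i rules.
    CaseFoldSketch(String language) {
        this.alternateCasing = "tr_TR".equals(language) || "az_AZ".equals(language);
    }

    /** Folds one char to lower case, honoring the alternate casing rules. */
    char caseFold(char c) {
        if (alternateCasing) {
            if (c == 'I') {
                return '\u0131'; // LATIN SMALL LETTER DOTLESS I
            }
            if (c == '\u0130') { // LATIN CAPITAL LETTER I WITH DOT ABOVE
                return 'i';
            }
        }
        return Character.toLowerCase(c);
    }
}
```

        Working on a single char keeps the method cheap, at the cost of not handling supplementary code points or one-to-many case mappings.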
      • applyMappings

        static void applyMappings​(FST<CharsRef> fst,
                                  java.lang.StringBuilder sb)
                           throws java.io.IOException
        Throws:
        java.io.IOException
      • getIgnoreCase

        public boolean getIgnoreCase()
        Returns true if this dictionary was constructed with the ignoreCase option
      • setDefaultTempDir

        public static void setDefaultTempDir​(java.nio.file.Path tempDir)
        Used by test framework
      • getDefaultTempDir

        static java.nio.file.Path getDefaultTempDir()
                                             throws java.io.IOException
        Returns the default temporary directory, which defaults to java.io.tmpdir. An IOException is thrown if that directory is not available or not accessible.
        Throws:
        java.io.IOException
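        The interplay of setDefaultTempDir and getDefaultTempDir can be sketched as a lazily resolved static field. The class name, the synchronization, and the writability check are assumptions of this sketch; the page only states that java.io.tmpdir is the default and that an IOException signals an unavailable or inaccessible directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TempDirSketch {
    // Resolved on first use; may be replaced up front (for example by tests).
    private static Path defaultTempDir;

    static synchronized void setDefaultTempDir(Path tempDir) {
        defaultTempDir = tempDir;
    }

    static synchronized Path getDefaultTempDir() throws IOException {
        if (defaultTempDir == null) {
            String property = System.getProperty("java.io.tmpdir");
            if (property == null) {
                throw new IOException("System property java.io.tmpdir is not set");
            }
            Path dir = Paths.get(property);
            if (!Files.isWritable(dir)) {
                throw new IOException("Temporary directory is not writable: " + dir);
            }
            defaultTempDir = dir;
        }
        return defaultTempDir;
    }
}
```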