Class Dictionary
- java.lang.Object
-
- org.apache.lucene.analysis.hunspell.Dictionary
-
public class Dictionary extends java.lang.Object
In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
-
-
Nested Class Summary
Nested Classes
Modifier and Type Class Description
private static class
Dictionary.DoubleASCIIFlagParsingStrategy
Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded as two ASCII characters whose codes must be combined into a single character.
(package private) static class
Dictionary.FlagParsingStrategy
Abstraction of the process of parsing flags taken from the affix and dic files.
private static class
Dictionary.NumFlagParsingStrategy
Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded in its numerical form.
private static class
Dictionary.SimpleFlagParsingStrategy
Simple implementation of Dictionary.FlagParsingStrategy that treats the chars in each String as individual flags.
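The three decoding styles described above can be sketched without any Lucene types. This is a minimal illustration, not the library's code: the class and method names are hypothetical, the comma separator for numeric flags follows the Hunspell file format, and packing the two ASCII codes as high byte and low byte is an assumption about how they are "combined".

```java
import java.util.Arrays;

/** Hypothetical stand-ins for the three flag-decoding styles described above. */
final class FlagParsingSketch {

    /** Simple style: every char of the raw string is one flag. */
    static char[] parseSimple(String rawFlags) {
        return rawFlags.toCharArray();
    }

    /** Numeric style: assumes comma-separated decimal numbers, one per flag. */
    static char[] parseNumeric(String rawFlags) {
        String[] parts = rawFlags.split(",");
        char[] flags = new char[parts.length];
        for (int i = 0; i < parts.length; i++) {
            flags[i] = (char) Integer.parseInt(parts[i].trim());
        }
        return flags;
    }

    /** Double-ASCII style: each pair of chars is packed into one char (assumed layout: high byte, low byte). */
    static char[] parseDoubleAscii(String rawFlags) {
        if (rawFlags.length() % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of characters: " + rawFlags);
        }
        char[] flags = new char[rawFlags.length() / 2];
        for (int i = 0; i < flags.length; i++) {
            char first = rawFlags.charAt(2 * i);
            char second = rawFlags.charAt(2 * i + 1);
            flags[i] = (char) ((first << 8) | second);
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseSimple("AB")));
        System.out.println(parseNumeric("101,202").length);
        System.out.println(parseDoubleAscii("AaBb").length);
    }
}
```

All three return a char[] so that later flag checks can treat every style the same way. The sketch does not handle an empty numeric flag string.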
-
Field Summary
Fields
Modifier and Type Field Description
(package private) byte[]
affixData
private static java.lang.String
ALIAS_KEY
private int
aliasCount
private java.lang.String[]
aliases
(package private) boolean
alternateCasing
(package private) static java.util.Map<java.lang.String,java.lang.String>
CHARSET_ALIASES
(package private) int
circumfix
private static java.lang.String
CIRCUMFIX_KEY
(package private) boolean
complexPrefixes
private static java.lang.String
COMPLEXPREFIXES_KEY
private int
currentAffix
private static java.nio.file.Path
DEFAULT_TEMP_DIR
(package private) static java.util.regex.Pattern
ENCODING_PATTERN
pattern accepts optional BOM + SET + any whitespace
private static java.lang.String
FLAG_KEY
(package private) char
FLAG_SEPARATOR
(package private) BytesRefHash
flagLookup
private Dictionary.FlagParsingStrategy
flagParsingStrategy
(package private) boolean
fullStrip
private static java.lang.String
FULLSTRIP_KEY
(package private) boolean
hasStemExceptions
(package private) FST<CharsRef>
iconv
private static java.lang.String
ICONV_KEY
private char[]
ignore
private static java.lang.String
IGNORE_KEY
(package private) boolean
ignoreCase
(package private) int
keepcase
private static java.lang.String
KEEPCASE_KEY
private static java.lang.String
LANG_KEY
(package private) java.lang.String
language
private static java.lang.String
LONG_FLAG_TYPE
private static java.lang.String
MORPH_ALIAS_KEY
(package private) char
MORPH_SEPARATOR
private int
morphAliasCount
private java.lang.String[]
morphAliases
(package private) int
needaffix
private static java.lang.String
NEEDAFFIX_KEY
(package private) boolean
needsInputCleaning
(package private) boolean
needsOutputCleaning
(package private) static char[]
NOFLAGS
private static java.lang.String
NUM_FLAG_TYPE
(package private) FST<CharsRef>
oconv
private static java.lang.String
OCONV_KEY
(package private) int
onlyincompound
private static java.lang.String
ONLYINCOMPOUND_KEY
(package private) java.util.ArrayList<CharacterRunAutomaton>
patterns
private static java.lang.String
PREFIX_CONDITION_REGEX_PATTERN
private static java.lang.String
PREFIX_KEY
(package private) FST<IntsRef>
prefixes
private static java.lang.String
PSEUDOROOT_KEY
private int
stemExceptionCount
private java.lang.String[]
stemExceptions
(package private) char[]
stripData
(package private) int[]
stripOffsets
private static java.lang.String
SUFFIX_CONDITION_REGEX_PATTERN
private static java.lang.String
SUFFIX_KEY
(package private) FST<IntsRef>
suffixes
private java.nio.file.Path
tempPath
(package private) boolean
twoStageAffix
private static java.lang.String
UTF8_FLAG_TYPE
(package private) FST<IntsRef>
words
-
Constructor Summary
Constructors
Constructor Description
Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.io.InputStream dictionary)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.util.List<java.io.InputStream> dictionaries, boolean ignoreCase)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
-
Method Summary
Modifier and Type Method Description
private FST<IntsRef>
affixFST(java.util.TreeMap<java.lang.String,java.util.List<java.lang.Integer>> affixes)
(package private) static void
applyMappings(FST<CharsRef> fst, java.lang.StringBuilder sb)
(package private) char
caseFold(char c)
Folds a single character (according to LANG if present).
(package private) java.lang.CharSequence
cleanInput(java.lang.CharSequence input, java.lang.StringBuilder reuse)
(package private) static char[]
decodeFlags(BytesRef b)
(package private) static void
encodeFlags(BytesRefBuilder b, char[] flags)
(package private) static java.lang.String
escapeDash(java.lang.String re)
private java.lang.String
getAliasValue(int id)
(package private) static java.nio.file.Path
getDefaultTempDir()
Returns the default temporary directory.
(package private) static java.lang.String
getDictionaryEncoding(java.io.InputStream affix)
Parses the encoding specified in the affix file readable through the provided InputStream.
(package private) static Dictionary.FlagParsingStrategy
getFlagParsingStrategy(java.lang.String flagLine)
Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
boolean
getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option.
private java.nio.charset.CharsetDecoder
getJavaEncoding(java.lang.String encoding)
Retrieves the CharsetDecoder for the given encoding.
(package private) java.lang.String
getStemException(int id)
(package private) static boolean
hasFlag(char[] flags, char flag)
(package private) static int
indexOfSpaceOrTab(java.lang.String text, int start)
(package private) IntsRef
lookup(FST<IntsRef> fst, char[] word, int offset, int length)
(package private) IntsRef
lookupPrefix(char[] word, int offset, int length)
(package private) IntsRef
lookupSuffix(char[] word, int offset, int length)
(package private) IntsRef
lookupWord(char[] word, int offset, int length)
Looks up Hunspell word forms from the dictionary.
(package private) static int
morphBoundary(java.lang.String line)
private void
parseAffix(java.util.TreeMap<java.lang.String,java.util.List<java.lang.Integer>> affixes, java.lang.String header, java.io.LineNumberReader reader, java.lang.String conditionPattern, java.util.Map<java.lang.String,java.lang.Integer> seenPatterns, java.util.Map<java.lang.String,java.lang.Integer> seenStrips)
Parses a specific affix rule, putting the result into the provided affix map.
private void
parseAlias(java.lang.String line)
private FST<CharsRef>
parseConversions(java.io.LineNumberReader reader, int num)
private void
parseMorphAlias(java.lang.String line)
private java.lang.String
parseStemException(java.lang.String morphData)
private void
readAffixFile(java.io.InputStream affixStream, java.nio.charset.CharsetDecoder decoder)
Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
private void
readDictionaryFiles(Directory tempDir, java.lang.String tempFileNamePrefix, java.util.List<java.io.InputStream> dictionaries, java.nio.charset.CharsetDecoder decoder, Builder<IntsRef> words)
Reads the dictionary file through the provided InputStreams, building up the words map.
static void
setDefaultTempDir(java.nio.file.Path tempDir)
Used by test framework.
(package private) java.lang.String
unescapeEntry(java.lang.String entry)
-
-
-
Field Detail
-
NOFLAGS
static final char[] NOFLAGS
-
ALIAS_KEY
private static final java.lang.String ALIAS_KEY
- See Also:
- Constant Field Values
-
MORPH_ALIAS_KEY
private static final java.lang.String MORPH_ALIAS_KEY
- See Also:
- Constant Field Values
-
PREFIX_KEY
private static final java.lang.String PREFIX_KEY
- See Also:
- Constant Field Values
-
SUFFIX_KEY
private static final java.lang.String SUFFIX_KEY
- See Also:
- Constant Field Values
-
FLAG_KEY
private static final java.lang.String FLAG_KEY
- See Also:
- Constant Field Values
-
COMPLEXPREFIXES_KEY
private static final java.lang.String COMPLEXPREFIXES_KEY
- See Also:
- Constant Field Values
-
CIRCUMFIX_KEY
private static final java.lang.String CIRCUMFIX_KEY
- See Also:
- Constant Field Values
-
IGNORE_KEY
private static final java.lang.String IGNORE_KEY
- See Also:
- Constant Field Values
-
ICONV_KEY
private static final java.lang.String ICONV_KEY
- See Also:
- Constant Field Values
-
OCONV_KEY
private static final java.lang.String OCONV_KEY
- See Also:
- Constant Field Values
-
FULLSTRIP_KEY
private static final java.lang.String FULLSTRIP_KEY
- See Also:
- Constant Field Values
-
LANG_KEY
private static final java.lang.String LANG_KEY
- See Also:
- Constant Field Values
-
KEEPCASE_KEY
private static final java.lang.String KEEPCASE_KEY
- See Also:
- Constant Field Values
-
NEEDAFFIX_KEY
private static final java.lang.String NEEDAFFIX_KEY
- See Also:
- Constant Field Values
-
PSEUDOROOT_KEY
private static final java.lang.String PSEUDOROOT_KEY
- See Also:
- Constant Field Values
-
ONLYINCOMPOUND_KEY
private static final java.lang.String ONLYINCOMPOUND_KEY
- See Also:
- Constant Field Values
-
NUM_FLAG_TYPE
private static final java.lang.String NUM_FLAG_TYPE
- See Also:
- Constant Field Values
-
UTF8_FLAG_TYPE
private static final java.lang.String UTF8_FLAG_TYPE
- See Also:
- Constant Field Values
-
LONG_FLAG_TYPE
private static final java.lang.String LONG_FLAG_TYPE
- See Also:
- Constant Field Values
-
PREFIX_CONDITION_REGEX_PATTERN
private static final java.lang.String PREFIX_CONDITION_REGEX_PATTERN
- See Also:
- Constant Field Values
-
SUFFIX_CONDITION_REGEX_PATTERN
private static final java.lang.String SUFFIX_CONDITION_REGEX_PATTERN
- See Also:
- Constant Field Values
-
patterns
java.util.ArrayList<CharacterRunAutomaton> patterns
-
flagLookup
BytesRefHash flagLookup
-
stripData
char[] stripData
-
stripOffsets
int[] stripOffsets
-
affixData
byte[] affixData
-
currentAffix
private int currentAffix
-
flagParsingStrategy
private Dictionary.FlagParsingStrategy flagParsingStrategy
-
aliases
private java.lang.String[] aliases
-
aliasCount
private int aliasCount
-
morphAliases
private java.lang.String[] morphAliases
-
morphAliasCount
private int morphAliasCount
-
stemExceptions
private java.lang.String[] stemExceptions
-
stemExceptionCount
private int stemExceptionCount
-
hasStemExceptions
boolean hasStemExceptions
-
tempPath
private final java.nio.file.Path tempPath
-
ignoreCase
boolean ignoreCase
-
complexPrefixes
boolean complexPrefixes
-
twoStageAffix
boolean twoStageAffix
-
circumfix
int circumfix
-
keepcase
int keepcase
-
needaffix
int needaffix
-
onlyincompound
int onlyincompound
-
ignore
private char[] ignore
-
needsInputCleaning
boolean needsInputCleaning
-
needsOutputCleaning
boolean needsOutputCleaning
-
fullStrip
boolean fullStrip
-
language
java.lang.String language
-
alternateCasing
boolean alternateCasing
-
ENCODING_PATTERN
static final java.util.regex.Pattern ENCODING_PATTERN
pattern accepts optional BOM + SET + any whitespace
-
CHARSET_ALIASES
static final java.util.Map<java.lang.String,java.lang.String> CHARSET_ALIASES
-
FLAG_SEPARATOR
final char FLAG_SEPARATOR
- See Also:
- Constant Field Values
-
MORPH_SEPARATOR
final char MORPH_SEPARATOR
- See Also:
- Constant Field Values
-
DEFAULT_TEMP_DIR
private static java.nio.file.Path DEFAULT_TEMP_DIR
-
-
Constructor Detail
-
Dictionary
public Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.io.InputStream dictionary) throws java.io.IOException, java.text.ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionary - InputStream for reading the hunspell dictionary file (won't be closed).
- Throws:
java.io.IOException - Can be thrown while reading from the InputStreams
java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
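The ownership rule above (the constructor reads the streams but does not close them) suggests wrapping the call in try-with-resources on the caller's side. The sketch below illustrates that contract with a hypothetical `Loader` stand-in so it runs without Lucene on the classpath. `Loader` and `TrackingStream` are illustrative names and do not exist in the library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamOwnershipSketch {

    /** Hypothetical stand-in for a loader that reads its streams fully but never closes them. */
    static final class Loader {
        final int affixBytes;
        final int dictionaryBytes;

        Loader(InputStream affix, InputStream dictionary) throws IOException {
            affixBytes = drain(affix);
            dictionaryBytes = drain(dictionary);
        }

        private static int drain(InputStream in) throws IOException {
            byte[] buffer = new byte[4096];
            int total = 0;
            for (int n = in.read(buffer); n != -1; n = in.read(buffer)) {
                total += n;
            }
            return total;
        }
    }

    /** Helper stream that records whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed;

        TrackingStream(String content) {
            super(content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** The caller owns both streams: try-with-resources closes them once loading is done. */
    static Loader load(TrackingStream affix, TrackingStream dictionary) throws IOException {
        try (InputStream a = affix; InputStream d = dictionary) {
            return new Loader(a, d);
        }
    }
}
```

With the real class, the same try-with-resources shape would surround the `new Dictionary(...)` call.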
-
Dictionary
public Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.util.List<java.io.InputStream> dictionaries, boolean ignoreCase) throws java.io.IOException, java.text.ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
- Throws:
java.io.IOException - Can be thrown while reading from the InputStreams
java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
-
-
Method Detail
-
lookupWord
IntsRef lookupWord(char[] word, int offset, int length)
Looks up Hunspell word forms from the dictionary
-
lookupPrefix
IntsRef lookupPrefix(char[] word, int offset, int length)
-
lookupSuffix
IntsRef lookupSuffix(char[] word, int offset, int length)
-
readAffixFile
private void readAffixFile(java.io.InputStream affixStream, java.nio.charset.CharsetDecoder decoder) throws java.io.IOException, java.text.ParseException
Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
- Parameters:
affixStream - InputStream to read the content of the affix file from
decoder - CharsetDecoder to decode the content of the file
- Throws:
java.io.IOException - Can be thrown while reading from the InputStream
java.text.ParseException
-
affixFST
private FST<IntsRef> affixFST(java.util.TreeMap<java.lang.String,java.util.List<java.lang.Integer>> affixes) throws java.io.IOException
- Throws:
java.io.IOException
-
escapeDash
static java.lang.String escapeDash(java.lang.String re)
-
parseAffix
private void parseAffix(java.util.TreeMap<java.lang.String,java.util.List<java.lang.Integer>> affixes, java.lang.String header, java.io.LineNumberReader reader, java.lang.String conditionPattern, java.util.Map<java.lang.String,java.lang.Integer> seenPatterns, java.util.Map<java.lang.String,java.lang.Integer> seenStrips) throws java.io.IOException, java.text.ParseException
Parses a specific affix rule, putting the result into the provided affix map.
- Parameters:
affixes - Map where the result of the parsing will be put
header - Header line of the affix rule
reader - LineNumberReader to read the content of the rule from
conditionPattern - String.format(String, Object...) pattern to be used to generate the condition regex pattern
seenPatterns - map from condition -> index of patterns, for deduplication.
- Throws:
java.io.IOException - Can be thrown while reading the rule
java.text.ParseException
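The conditionPattern and seenPatterns parameters describe two steps: expand a rule's condition into a full regex with String.format, then reuse one compiled pattern index for repeated conditions. The following is a minimal sketch of that idea. The two format strings are assumptions, since the values of PREFIX_CONDITION_REGEX_PATTERN and SUFFIX_CONDITION_REGEX_PATTERN are not shown on this page, and java.util.regex stands in for the automaton type used by the patterns field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class ConditionPatternSketch {
    // Hypothetical format patterns; the real constant values are not shown on this page.
    static final String PREFIX_FORMAT = "%s.*";
    static final String SUFFIX_FORMAT = ".*%s";

    // Expanded regex -> index into 'patterns', so each distinct regex is compiled once.
    private final Map<String, Integer> seenPatterns = new LinkedHashMap<>();
    private final List<Pattern> patterns = new ArrayList<>();

    /** Expands a condition into a regex and returns its index, reusing the index for repeats. */
    int indexFor(String conditionFormat, String condition) {
        String regex = String.format(conditionFormat, condition);
        Integer existing = seenPatterns.get(regex);
        if (existing != null) {
            return existing;
        }
        int index = patterns.size();
        patterns.add(Pattern.compile(regex));
        seenPatterns.put(regex, index);
        return index;
    }

    /** Checks whether the whole word satisfies the condition stored at the given index. */
    boolean matches(int index, String word) {
        return patterns.get(index).matcher(word).matches();
    }
}
```

Keying the map by the expanded regex rather than the raw condition is a choice made for this sketch: it keeps a prefix rule and a suffix rule with the same condition text from sharing an index.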
-
parseConversions
private FST<CharsRef> parseConversions(java.io.LineNumberReader reader, int num) throws java.io.IOException, java.text.ParseException
- Throws:
java.io.IOException
java.text.ParseException
-
getDictionaryEncoding
static java.lang.String getDictionaryEncoding(java.io.InputStream affix) throws java.io.IOException, java.text.ParseException
Parses the encoding specified in the affix file readable through the provided InputStream.
- Parameters:
affix - InputStream for reading the affix file
- Returns:
- Encoding specified in the affix file
- Throws:
java.io.IOException - Can be thrown while reading from the InputStream
java.text.ParseException - Thrown if the first non-empty non-comment line read from the file does not adhere to the format SET <encoding>
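The behaviour described above, combined with the ENCODING_PATTERN note ("optional BOM + SET + any whitespace"), can be approximated in plain Java. This is a sketch under stated assumptions rather than the library's implementation: the regex is a guess at the pattern's shape, `#` is assumed to start a comment line, and throwing on a file with no SET line at all is a choice made here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingSniffSketch {
    // Assumed pattern: optional UTF-8 BOM (as it appears when decoded as ISO-8859-1), "SET", whitespace.
    static final Pattern ENCODING = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");

    static String sniffEncoding(InputStream affix) throws IOException, ParseException {
        // ISO-8859-1 maps every byte to one char, so the header is readable before the charset is known.
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(affix, StandardCharsets.ISO_8859_1));
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            Matcher m = ENCODING.matcher(trimmed);
            if (m.find()) {
                return trimmed.substring(m.end()).trim();
            }
            throw new ParseException("Expected SET <encoding> but found: " + trimmed, lineNumber);
        }
        throw new ParseException("No SET <encoding> line found", lineNumber);
    }
}
```

The reader is deliberately left open because the stream belongs to the caller. A BufferedReader reads ahead, so a caller that wants to parse the same stream afterwards would need to reset or reopen it.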
-
getJavaEncoding
private java.nio.charset.CharsetDecoder getJavaEncoding(java.lang.String encoding)
Retrieves the CharsetDecoder for the given encoding. Note: this isn't perfect, as names such as ISCII-DEVANAGARI and MICROSOFT-CP1251 are probably allowed as well.
- Parameters:
encoding - Encoding to retrieve the CharsetDecoder for
- Returns:
- CharsetDecoder for the given encoding
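One way to read the note above is that some Hunspell encoding names are not valid Java charset names and need an alias step before the lookup. The sketch below shows that shape using only java.nio.charset. The alias entry is hypothetical (the contents of CHARSET_ALIASES are not listed on this page), and replacing malformed input instead of failing is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class CharsetLookupSketch {
    // Hypothetical alias table: Hunspell-style name on the left, Java charset name on the right.
    static final Map<String, String> ALIASES;
    static {
        Map<String, String> m = new HashMap<>();
        m.put("microsoft-cp1251", "windows-1251");
        ALIASES = Collections.unmodifiableMap(m);
    }

    /** Maps a known alias to its Java name; any other name is returned unchanged. */
    static String resolve(String encoding) {
        return ALIASES.getOrDefault(encoding, encoding);
    }

    /** Builds a lenient decoder; Charset.forName throws an unchecked exception for unknown names. */
    static CharsetDecoder decoderFor(String encoding) {
        Charset charset = Charset.forName(resolve(encoding));
        return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
}
```

Names that Java already accepts as aliases, such as ISO8859-1, pass through `resolve` untouched and are handled by `Charset.forName` itself.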
-
getFlagParsingStrategy
static Dictionary.FlagParsingStrategy getFlagParsingStrategy(java.lang.String flagLine)
Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
- Parameters:
flagLine - Line containing the flag information
- Returns:
- FlagParsingStrategy that handles parsing flags in the way specified in the FLAG definition
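As a rough illustration of the selection step, the sketch below maps a FLAG line to one of three strategy kinds named after the nested classes listed on this page. The keywords `long`, `num` and `UTF-8` come from the Hunspell affix format. Treating `UTF-8` as the simple per-character strategy and rejecting anything else are assumptions, and the enum is a hypothetical stand-in for the real strategy objects.

```java
final class FlagTypeSketch {
    /** Hypothetical labels mirroring the three nested strategy classes. */
    enum Kind { SIMPLE, NUM, DOUBLE_ASCII }

    /** Picks a strategy kind from a line such as "FLAG long". */
    static Kind fromFlagLine(String flagLine) {
        String[] parts = flagLine.trim().split("\\s+");
        if (parts.length != 2 || !parts[0].equals("FLAG")) {
            throw new IllegalArgumentException("Illegal FLAG specification: " + flagLine);
        }
        switch (parts[1]) {
            case "long":
                return Kind.DOUBLE_ASCII; // two ASCII chars per flag
            case "num":
                return Kind.NUM;          // decimal numbers
            case "UTF-8":
                return Kind.SIMPLE;       // assumed: one char per flag
            default:
                throw new IllegalArgumentException("Unknown flag type: " + parts[1]);
        }
    }
}
```

When an affix file has no FLAG line at all, a caller would presumably fall back to the simple strategy without calling this method.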
-
unescapeEntry
java.lang.String unescapeEntry(java.lang.String entry)
-
morphBoundary
static int morphBoundary(java.lang.String line)
-
indexOfSpaceOrTab
static int indexOfSpaceOrTab(java.lang.String text, int start)
-
readDictionaryFiles
private void readDictionaryFiles(Directory tempDir, java.lang.String tempFileNamePrefix, java.util.List<java.io.InputStream> dictionaries, java.nio.charset.CharsetDecoder decoder, Builder<IntsRef> words) throws java.io.IOException
Reads the dictionary file through the provided InputStreams, building up the words map.
- Parameters:
dictionaries - InputStreams to read the dictionary file through
decoder - CharsetDecoder used to decode the contents of the file
- Throws:
java.io.IOException - Can be thrown while reading from the file
-
decodeFlags
static char[] decodeFlags(BytesRef b)
-
encodeFlags
static void encodeFlags(BytesRefBuilder b, char[] flags)
-
parseAlias
private void parseAlias(java.lang.String line)
-
getAliasValue
private java.lang.String getAliasValue(int id)
-
getStemException
java.lang.String getStemException(int id)
-
parseMorphAlias
private void parseMorphAlias(java.lang.String line)
-
parseStemException
private java.lang.String parseStemException(java.lang.String morphData)
-
hasFlag
static boolean hasFlag(char[] flags, char flag)
-
cleanInput
java.lang.CharSequence cleanInput(java.lang.CharSequence input, java.lang.StringBuilder reuse)
-
caseFold
char caseFold(char c)
Folds a single character (according to LANG if present).
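A language-sensitive fold of this kind is typically needed for the Turkish and Azerbaijani dotted and dotless i. The sketch below assumes that this is what the LANG setting and the alternateCasing field control. The language codes and the helper names are illustrative and not taken from the source.

```java
final class CaseFoldSketch {
    /** Assumption: "tr" and "az" (optionally with a region such as tr_TR) use the alternate casing rules. */
    static boolean usesAlternateCasing(String language) {
        if (language == null) {
            return false;
        }
        return language.equals("tr") || language.startsWith("tr_")
            || language.equals("az") || language.startsWith("az_");
    }

    /** Lowercases one char, mapping I to dotless i and dotted capital I to plain i when required. */
    static char caseFold(char c, String language) {
        if (usesAlternateCasing(language)) {
            if (c == 'I') {
                return '\u0131'; // dotless lowercase i
            }
            if (c == '\u0130') {
                return 'i';      // capital I with dot above
            }
        }
        return Character.toLowerCase(c);
    }
}
```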
-
applyMappings
static void applyMappings(FST<CharsRef> fst, java.lang.StringBuilder sb) throws java.io.IOException
- Throws:
java.io.IOException
-
getIgnoreCase
public boolean getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option.
-
setDefaultTempDir
public static void setDefaultTempDir(java.nio.file.Path tempDir)
Used by test framework
-
getDefaultTempDir
static java.nio.file.Path getDefaultTempDir() throws java.io.IOException
Returns the default temporary directory. By default, this is java.io.tmpdir. If it is not accessible or not available, an IOException is thrown.
- Throws:
java.io.IOException
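The lookup described above, paired with the setDefaultTempDir override, might look roughly like the following. This is a sketch only: caching the result and the exact accessibility checks (an existing, writable directory) are assumptions about what "not accessible or not available" means.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TempDirSketch {
    private static Path defaultTempDir; // cached after the first successful lookup or explicit override

    /** Returns the override if one was set, otherwise resolves and validates java.io.tmpdir. */
    static synchronized Path getDefaultTempDir() throws IOException {
        if (defaultTempDir == null) {
            String property = System.getProperty("java.io.tmpdir");
            if (property == null) {
                throw new IOException("System property java.io.tmpdir is not set");
            }
            Path candidate = Paths.get(property);
            if (!Files.isDirectory(candidate) || !Files.isWritable(candidate)) {
                throw new IOException("Not a writable directory: " + candidate.toAbsolutePath());
            }
            defaultTempDir = candidate;
        }
        return defaultTempDir;
    }

    /** Lets a test harness point temporary files somewhere else. */
    static synchronized void setDefaultTempDir(Path tempDir) {
        defaultTempDir = tempDir;
    }
}
```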
-
-