Package org.apache.lucene.analysis.in
Class IndicNormalizer
java.lang.Object
org.apache.lucene.analysis.in.IndicNormalizer
Normalizes the Unicode representation of text in Indian languages.
Follows guidelines from Unicode 5.2, chapter 6, South Asian Scripts I and graphical decompositions from http://ldc.upenn.edu/myl/IndianScriptsUnicode.html
Nested Class Summary
Nested Classes
Field Summary
Fields (Modifier and Type, Field, Description):
private static final int[][] decompositions
  Decompositions according to Unicode 5.2 and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html
private static final IdentityHashMap<Character.UnicodeBlock, IndicNormalizer.ScriptData> scripts
Constructor Summary
Constructors:
IndicNormalizer()
Method Summary
Methods (Modifier and Type, Method, Description):
private int compose(int ch0, Character.UnicodeBlock block0, IndicNormalizer.ScriptData sd, char[] text, int pos, int len)
  Compose into standard form any compositions in the decompositions table.
private static int flag
int normalize(char[] text, int len)
  Normalizes input text, and returns the new length.
Field Details
scripts
private static final IdentityHashMap<Character.UnicodeBlock, IndicNormalizer.ScriptData> scripts
decompositions
private static final int[][] decompositions
Decompositions according to Unicode 5.2 and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html. Most of these are not handled by Unicode normalization anyway.
The numbers here represent offsets into the respective code pages, with -1 representing null and 0xFF representing the zero-width joiner.
The columns are ch1, ch2, ch3, res, and flags: ch1, ch2, and ch3 are the decomposition, res is the composition, and flags are the scripts to which it applies.
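The row layout described above can be illustrated with a minimal, self-contained sketch. It is not the class's actual code: the class name, the flag constants, and the sample row are hypothetical, and treating flags as a bitmask is an assumption. Only the column order, the -1 and 0xFF conventions, and the idea of offsets into a script's code page come from the description. The Devanagari base 0x0900 is the start of that Unicode block.

```java
/** Hypothetical helper that interprets one row of a decompositions-style table. */
final class DecompositionRowSketch {
    static final int NULL_SLOT = -1;   // "no character" marker, per the field description
    static final int ZWJ_SLOT = 0xFF;  // stands for the zero-width joiner
    static final int ZWJ = 0x200D;     // U+200D ZERO WIDTH JOINER

    // Assumed bitmask flags; names and values are illustrative only.
    static final int DEVANAGARI_FLAG = 1;
    static final int GUJARATI_FLAG = 2;

    // Start of the Devanagari block in Unicode.
    static final int DEVANAGARI_BASE = 0x0900;

    /** Illustrative row in column order: ch1, ch2, ch3, res, flags. */
    static final int[] SAMPLE_ROW = {0x05, 0x3E, 0x45, 0x11, DEVANAGARI_FLAG | GUJARATI_FLAG};

    /** Resolves a table slot to a code point, or -1 if the slot is empty. */
    static int resolve(int slot, int base) {
        if (slot == NULL_SLOT) {
            return -1;
        }
        if (slot == ZWJ_SLOT) {
            return ZWJ;
        }
        return base + slot;
    }

    /** Returns true if the row's flags column includes the given script flag. */
    static boolean appliesTo(int[] row, int scriptFlag) {
        return (row[4] & scriptFlag) != 0;
    }

    public static void main(String[] args) {
        int[] row = SAMPLE_ROW;
        // prints "U+0905 U+093E U+0945 -> U+0911"
        System.out.printf("U+%04X U+%04X U+%04X -> U+%04X%n",
            resolve(row[0], DEVANAGARI_BASE),
            resolve(row[1], DEVANAGARI_BASE),
            resolve(row[2], DEVANAGARI_BASE),
            resolve(row[3], DEVANAGARI_BASE));
    }
}
```

Storing offsets rather than full code points would let one row serve several scripts whose code pages share a layout, which is presumably why the flags column exists.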
Constructor Details
IndicNormalizer
public IndicNormalizer()
Method Details
flag
normalize
public int normalize(char[] text, int len)
Normalizes input text, and returns the new length. The length will always be less than or equal to the existing length.
Parameters:
text - input text
len - valid length
Returns:
normalized length
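The in-place contract of normalize (the buffer is rewritten, and only the first "new length" chars remain valid) can be sketched with a stand-in. The class below is hypothetical and does not reproduce IndicNormalizer's logic: its single mapping (U+0905 followed by U+093E rewritten as U+0906) is chosen only for illustration, and the delete helper is an assumption about how a shrinking in-place edit might be done.

```java
/** Stand-in illustrating the in-place (char[] text, int len) -> new length contract. */
final class InPlaceNormalizeSketch {

    /** Removes the char at pos by shifting the tail left; returns the new length. */
    static int delete(char[] text, int pos, int len) {
        if (pos < len - 1) {
            System.arraycopy(text, pos + 1, text, pos, len - pos - 1);
        }
        return len - 1;
    }

    /**
     * Toy normalizer: rewrites the pair U+0905 U+093E as the single char U+0906.
     * The result is never longer than the input, mirroring the documented guarantee.
     */
    static int normalize(char[] text, int len) {
        for (int i = 0; i < len - 1; i++) {
            if (text[i] == '\u0905' && text[i + 1] == '\u093E') {
                text[i] = '\u0906';               // write the composed form
                len = delete(text, i + 1, len);   // drop the now redundant second char
            }
        }
        return len;
    }

    public static void main(String[] args) {
        char[] buffer = "\u0905\u093E\u0915".toCharArray();
        int newLen = normalize(buffer, buffer.length);
        // Only the first newLen chars are valid; the rest of the buffer is stale.
        String result = new String(buffer, 0, newLen);
        System.out.println(newLen + " " + result.equals("\u0906\u0915")); // prints "2 true"
    }
}
```

With the real class on the classpath, the calling pattern would be the same: construct it with new IndicNormalizer(), call normalize(buffer, len), and read back only the returned number of chars.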
compose
private int compose(int ch0, Character.UnicodeBlock block0, IndicNormalizer.ScriptData sd, char[] text, int pos, int len)
Compose into standard form any compositions in the decompositions table.