Package com.ibm.icu.impl
Class UCharacterName
java.lang.Object
com.ibm.icu.impl.UCharacterName
Internal class to manage character names.
Because name data is stored in an array of char, indexes used in this class
refer to a 2-byte (char) count by default, unless otherwise stated. Where an
index refers to a byte count, the index is halved and, depending on whether the
index is even or odd, the MSB or LSB of the char at the halved index is
returned. Where an index refers to an array of int, the index is multiplied by
2, and the char at the multiplied index together with its following char is
returned as an int.
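The two index conventions described above can be illustrated with a minimal sketch. The class and method names are hypothetical and not part of UCharacterName. The sketch assumes that an even byte index selects the MSB and an odd one the LSB, and that the first char of an int holds the high 16 bits; the description above does not spell out either ordering.

```java
final class NameIndexSketch {
    /** Reads the byte at a byte-count index from char-packed data. */
    static int byteAt(char[] data, int byteIndex) {
        char c = data[byteIndex >> 1]; // halve the index to find the char
        // Assumption: even byte index -> MSB, odd byte index -> LSB.
        return (byteIndex & 1) == 0 ? (c >> 8) : (c & 0xFF);
    }

    /** Reads the int at an int-count index from char-packed data. */
    static int intAt(char[] data, int intIndex) {
        int i = intIndex << 1; // multiply the index by 2
        // Assumption: the first char holds the high 16 bits of the int.
        return (data[i] << 16) | data[i + 1];
    }

    public static void main(String[] args) {
        char[] data = {0x1234, 0xABCD};
        System.out.printf("%x %x %x%n",
                byteAt(data, 0), byteAt(data, 1), intAt(data, 0));
    }
}
```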
UCharacter acts as a public facade for this class.
Note: 0 - 0x1F are control characters without names in Unicode 3.0.
- Since:
- nov0700
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description)
(package private) static final class UCharacterName.AlgorithmName - Algorithmic name class
-
Field Summary
Fields (Modifier and Type, Field, Description)
(package private) static final int EXTENDED_CATEGORY_ - Extended category count
private static final String FILE_NAME_ - Default name of the name datafile
private static final int GROUP_MASK_ - Mask to retrieve the offset for a particular character within a group
private static final int GROUP_SHIFT_ - Shift count to retrieve group information
static final UCharacterName INSTANCE
private static final int LEAD_SURROGATE_ - Lead surrogate type
static final int LINES_PER_GROUP_ - Number of lines per group, 1 << GROUP_SHIFT_
private UCharacterName.AlgorithmName[] m_algorithm_
int m_groupcount_ - Maximum number of groups
private char[] m_groupinfo_
private char[] m_grouplengths_
private char[] m_groupoffsets_ - Group use.
(package private) int m_groupsize_ - Size of each group
private byte[] m_groupstring_
private int[] m_ISOCommentSet_ - Set of chars used in ISO comments.
private int m_maxISOCommentLength_ - Maximum ISO comment length
private int m_maxNameLength_ - Maximum name length
private int[] m_nameSet_ - Set of chars used in character names (regular & 1.0).
private byte[] m_tokenstring_
private char[] m_tokentable_ - Data used in unames.icu
private int[] m_utilIntBuffer_ - Utility int buffer
private StringBuffer m_utilStringBuffer_ - Utility StringBuffer
private static final int NON_CHARACTER_ - Not a character type
private static final int OFFSET_HIGH_OFFSET_ - Position of offsethigh in group information array
private static final int OFFSET_LOW_OFFSET_ - Position of offsetlow in group information array
private static final int SINGLE_NIBBLE_MAX_ - Double nibble indicator; any nibble greater than this number has to be combined with its following nibble
private static final int TRAIL_SURROGATE_ - Trail surrogate type
private static final String[] TYPE_NAMES_ - Type names used for extended names
private static final String UNKNOWN_TYPE_NAME_ - Unknown type name
-
Constructor Summary
Constructors (Modifier, Constructor, Description)
private UCharacterName() - Protected constructor for use in UCharacter.
-
Method Summary
Methods (Modifier and Type, Method, Description)
private static void add(int[] set, char ch) - Adds a codepoint into a set of ints.
private static int add(int[] set, String str) - Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
private static int add(int[] set, StringBuffer str) - Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
private int addAlgorithmName(int maxlength) - Adds all algorithmic names into the name set.
private int addExtendedName(int maxlength) - Adds all extended names into the name set.
private void addGroupName(int maxlength) - Adds names of all groups to the argument set.
private int[] addGroupName(int offset, int length, byte[] tokenlength, int[] set) - Adds names of a group to the argument set.
private static boolean contains(int[] set, char ch) - Checks if a codepoint is a part of a set of ints.
private void convert(int[] set, UnicodeSet uset) - Converts the char set into a Unicode set uset.
private String getAlgName(int ch, int choice) - Gets the algorithmic name for the argument character.
int getAlgorithmEnd(int index) - Gets the end of the range.
int getAlgorithmLength() - Gets the algorithm range length.
String getAlgorithmName(int index, int codepoint) - Gets the algorithmic name of the codepoint.
int getAlgorithmStart(int index) - Gets the start of the range.
int getCharFromName(int choice, String name) - Finds a character by its name and returns its code point value.
void getCharNameCharacters(UnicodeSet set) - Fills set with characters that are used in Unicode character names.
static int getCodepointMSB(int codepoint) - Gets the MSB of the codepoint.
private static int getExtendedChar(String name, int choice) - Gets the character with an extended name of the form <....>.
String getExtendedName(int ch) - Retrieves the extended name.
String getExtendedOr10Name(int ch) - Gets the extended and 1.0 name when the most current Unicode names fail.
int getGroup(int codepoint) - Gets the group index for the codepoint, or the group before it.
private int getGroupChar(int index, char[] length, String name, int choice) - Compares and retrieves the character if name is found within the argument group.
private int getGroupChar(String name, int choice) - Gets the character with the tokenized argument name.
int getGroupLengths(int index, char[] offsets, char[] lengths) - Reads a block of compressed lengths of 32 strings and expands them into offsets and lengths for each string.
static int getGroupLimit(int msb) - Gets the maximum codepoint + 1 of the group.
static int getGroupMin(int msb) - Gets the minimum codepoint of the group.
static int getGroupMinFromCodepoint(int codepoint) - Gets the minimum codepoint of a group.
int getGroupMSB(int gindex) - Gets the MSB from the group index.
String getGroupName(int ch, int choice) - Gets the group name of the character.
String getGroupName(int index, int length, int choice) - Gets the name of the argument group index.
static int getGroupOffset(int codepoint) - Gets the offset to a group.
void getISOCommentCharacters(UnicodeSet set) - Fills set with characters that are used in ISO comments.
int getMaxCharNameLength() - Gets the maximum length of any codepoint name.
int getMaxISOCommentLength() - Gets the maximum length of any ISO comment.
String getName(int ch, int choice) - Retrieves the name of a Unicode code point.
private static int getType(int ch) - Gets the character extended type.
private boolean initNameSetsLengths() - Sets up the name sets and the calculation of the maximum lengths.
(package private) boolean setAlgorithm(UCharacterName.AlgorithmName[] alg) - Sets the algorithm name information array.
(package private) boolean setGroup(char[] group, byte[] groupstring) - Sets the group name data.
(package private) boolean setGroupCountSize(int count, int size) - Sets the number of groups and the size of each group in number of char.
(package private) boolean setToken(char[] token, byte[] tokenstring) - Sets the token data.
-
Field Details
-
INSTANCE
public static final UCharacterName INSTANCE
-
LINES_PER_GROUP_
public static final int LINES_PER_GROUP_
Number of lines per group, 1 << GROUP_SHIFT_
-
m_groupcount_
public int m_groupcount_
Maximum number of groups
-
m_groupsize_
int m_groupsize_
Size of each group
-
m_tokentable_
private char[] m_tokentable_
Data used in unames.icu
-
m_tokenstring_
private byte[] m_tokenstring_
-
m_groupinfo_
private char[] m_groupinfo_
-
m_groupstring_
private byte[] m_groupstring_
-
m_algorithm_
private UCharacterName.AlgorithmName[] m_algorithm_
-
m_groupoffsets_
private char[] m_groupoffsets_
Group use. Note: access must be synchronized.
-
m_grouplengths_
private char[] m_grouplengths_
-
FILE_NAME_
private static final String FILE_NAME_
Default name of the name datafile
-
GROUP_SHIFT_
private static final int GROUP_SHIFT_
Shift count to retrieve group information
-
GROUP_MASK_
private static final int GROUP_MASK_
Mask to retrieve the offset for a particular character within a group
-
OFFSET_HIGH_OFFSET_
private static final int OFFSET_HIGH_OFFSET_
Position of offsethigh in group information array
-
OFFSET_LOW_OFFSET_
private static final int OFFSET_LOW_OFFSET_
Position of offsetlow in group information array
-
SINGLE_NIBBLE_MAX_
private static final int SINGLE_NIBBLE_MAX_
Double nibble indicator; any nibble greater than this number has to be combined with its following nibble
-
m_nameSet_
private int[] m_nameSet_
Set of chars used in character names (regular & 1.0). Chars are platform-dependent (can be EBCDIC).
-
m_ISOCommentSet_
private int[] m_ISOCommentSet_
Set of chars used in ISO comments. (regular & 1.0). Chars are platform-dependent (can be EBCDIC).
-
m_utilStringBuffer_
private StringBuffer m_utilStringBuffer_
Utility StringBuffer
-
m_utilIntBuffer_
private int[] m_utilIntBuffer_
Utility int buffer
-
m_maxISOCommentLength_
private int m_maxISOCommentLength_
Maximum ISO comment length
-
m_maxNameLength_
private int m_maxNameLength_
Maximum name length
-
TYPE_NAMES_
private static final String[] TYPE_NAMES_
Type names used for extended names
-
UNKNOWN_TYPE_NAME_
private static final String UNKNOWN_TYPE_NAME_
Unknown type name
-
NON_CHARACTER_
private static final int NON_CHARACTER_
Not a character type
-
LEAD_SURROGATE_
private static final int LEAD_SURROGATE_
Lead surrogate type
-
TRAIL_SURROGATE_
private static final int TRAIL_SURROGATE_
Trail surrogate type
-
EXTENDED_CATEGORY_
static final int EXTENDED_CATEGORY_
Extended category count
-
-
Constructor Details
-
UCharacterName
Protected constructor for use in UCharacter.
- Throws:
IOException
- thrown when data reading fails
-
-
Method Details
-
getName
public String getName(int ch, int choice)
Retrieve the name of a Unicode code point. Depending on choice, the character name written into the buffer is the "modern" name or the name that was defined in Unicode version 1.0. The name contains only "invariant" characters like A-Z, 0-9, space, and '-'.
- Parameters:
ch
- the code point for which to get the name.
choice
- selector for which name to get.
- Returns:
- if code point is above 0x1fff, null is returned
-
getCharFromName
public int getCharFromName(int choice, String name)
Find a character by its name and return its code point value.
- Parameters:
choice
- selector to indicate if argument name is a Unicode 1.0 or the most current version
name
- the name to search for
- Returns:
- code point
-
getGroupLengths
public int getGroupLengths(int index, char[] offsets, char[] lengths)
Reads a block of compressed lengths of 32 strings and expands them into offsets and lengths for each string. Lengths are stored with a variable-width encoding in consecutive nibbles: if a nibble < 0xc, then it is the length itself (0 = empty string). If a nibble >= 0xc, then it forms a length value with the following nibble. The offsets and lengths arrays must be at least 33 (one more) long because there is no check here at the end if the last nibble is still used.
- Parameters:
index
- of group string object in array
offsets
- array to store the value of the string offsets
lengths
- array to store the value of the string length
- Returns:
- next index of the data string immediately after the lengths in terms of byte address
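The nibble encoding described for getGroupLengths can be sketched as follows. This is an illustration, not the class's code: the class and method names are hypothetical, the number of strings is passed as a parameter instead of being fixed at 32, and the rule for combining two nibbles, ((first - 12) << 4 | second) + 12, is an assumption, since the text above only says that the two nibbles "form a length value".

```java
final class NibbleLengthSketch {
    // Nibbles above this value (that is, >= 0xc) start a two-nibble length.
    static final int SINGLE_NIBBLE_MAX = 11;

    /**
     * Expands lengths from packed nibbles starting at byte position index.
     * offsets and lengths must hold count + 1 entries.
     * Returns the index of the first byte after the packed lengths.
     */
    static int expand(byte[] data, int index, int count,
                      char[] offsets, char[] lengths) {
        int offset = 0;
        int pending = -1; // high part of a two-nibble length, or -1 if none
        int i = 0;
        while (i < count) {
            int b = data[index++] & 0xFF;
            for (int shift = 4; shift >= 0; shift -= 4) {
                int n = (b >> shift) & 0x0F;
                int length;
                if (pending >= 0) {
                    length = (pending | n) + 12; // assumed combining rule
                    pending = -1;
                } else if (n > SINGLE_NIBBLE_MAX) {
                    pending = (n - 12) << 4;     // wait for the next nibble
                    continue;
                } else {
                    length = n;                  // the nibble is the length
                }
                // No bounds check on i: the low nibble of the last byte may
                // still be written, which is why one extra slot is needed.
                offsets[i] = (char) offset;
                lengths[i] = (char) length;
                offset += length;
                i++;
            }
        }
        return index;
    }
}
```

For example, with the bytes 0x35, 0xC1, 0x20 and a count of 4, this sketch yields lengths 3, 5, 13, 2 and writes a fifth entry for the unused trailing nibble, which illustrates the "one more" sizing requirement.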
-
getGroupName
public String getGroupName(int index, int length, int choice)
Gets the name of the argument group index. UnicodeData.txt uses ';' as a field separator, so no field can contain ';' as part of its contents. In unames.icu, it is marked as token[';'] == -1 only if the semicolon is used in the data file, which is iff we have Unicode 1.0 names or ISO comments or aliases. So, it will be token[';'] == -1 if we store U1.0 names/ISO comments/aliases although we know that it will never be part of a name. Equivalent to ICU4C's expandName.
- Parameters:
index
- of the group name string in byte count
length
- of the group name string
choice
- of Unicode 1.0 name or the most current name
- Returns:
- name of the group
-
getExtendedName
public String getExtendedName(int ch)
Retrieves the extended name.
-
getGroup
public int getGroup(int codepoint)
Gets the group index for the codepoint, or the group before it.
- Parameters:
codepoint
- The codepoint index.
- Returns:
- group index containing codepoint or the group before it.
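A lookup that returns "the group containing the codepoint or the group before it" is a floor search over the sorted group keys. The sketch below is one way to do that with a binary search; it assumes a hypothetical sorted int array of group MSBs, not the packed m_groupinfo_ layout, and is not the class's actual code.

```java
final class GroupSearchSketch {
    /**
     * Returns the index of the last entry that is <= msb,
     * or 0 if msb precedes every entry.
     * groupMsbs is a hypothetical ascending array of group MSBs.
     */
    static int findGroup(int[] groupMsbs, int msb) {
        int start = 0;
        int limit = groupMsbs.length;
        // Narrow [start, limit) until a single candidate remains.
        while (start < limit - 1) {
            int mid = (start + limit) >>> 1;
            if (msb < groupMsbs[mid]) {
                limit = mid;
            } else {
                start = mid;
            }
        }
        return start;
    }
}
```

A caller would then compare the MSB of the returned group with the MSB of the codepoint to tell an exact match from a "group before" result.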
-
getExtendedOr10Name
public String getExtendedOr10Name(int ch)
Gets the extended and 1.0 name when the most current Unicode names fail.
- Parameters:
ch
- codepoint
- Returns:
- name of codepoint extended or 1.0
-
getGroupMSB
public int getGroupMSB(int gindex)
Gets the MSB from the group index.
- Parameters:
gindex
- group index
- Returns:
- the MSB of the group if gindex is valid, -1 otherwise
-
getCodepointMSB
public static int getCodepointMSB(int codepoint)
Gets the MSB of the codepoint.
- Parameters:
codepoint
- The codepoint value.
- Returns:
- the MSB of the codepoint
-
getGroupLimit
public static int getGroupLimit(int msb)
Gets the maximum codepoint + 1 of the group.
- Parameters:
msb
- most significant byte of the group
- Returns:
- limit codepoint of the group
-
getGroupMin
public static int getGroupMin(int msb)
Gets the minimum codepoint of the group.
- Parameters:
msb
- most significant byte of the group
- Returns:
- minimum codepoint of the group
-
getGroupOffset
public static int getGroupOffset(int codepoint)
Gets the offset to a group.
- Parameters:
codepoint
- The codepoint value.
- Returns:
- offset to a group
-
getGroupMinFromCodepoint
public static int getGroupMinFromCodepoint(int codepoint)
Gets the minimum codepoint of a group.
- Parameters:
codepoint
- The codepoint value.
- Returns:
- minimum codepoint in the group which codepoint belongs to
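The group helpers (getCodepointMSB, getGroupLimit, getGroupMin, getGroupOffset, getGroupMinFromCodepoint) can be read as shift-and-mask arithmetic on the code point. A minimal sketch follows, assuming a shift of 5; that value is inferred from the 32 strings per group mentioned under getGroupLengths and is not stated on this page. It also assumes the "MSB" is the code point shifted right by that amount. All names below are illustrative.

```java
final class GroupMathSketch {
    static final int GROUP_SHIFT = 5;                    // assumed value
    static final int LINES_PER_GROUP = 1 << GROUP_SHIFT; // 32 under that assumption
    static final int GROUP_MASK = LINES_PER_GROUP - 1;

    /** Group key of a code point. */
    static int codepointMSB(int codepoint) {
        return codepoint >> GROUP_SHIFT;
    }

    /** First code point of the group with the given key. */
    static int groupMin(int msb) {
        return msb << GROUP_SHIFT;
    }

    /** One past the last code point of the group with the given key. */
    static int groupLimit(int msb) {
        return (msb << GROUP_SHIFT) + LINES_PER_GROUP;
    }

    /** Position of a code point within its group. */
    static int groupOffset(int codepoint) {
        return codepoint & GROUP_MASK;
    }

    /** First code point of the group that contains the code point. */
    static int groupMinFromCodepoint(int codepoint) {
        return codepoint & ~GROUP_MASK;
    }
}
```

Under these assumptions, U+0041 falls in the group with key 2, which spans 0x40 up to but not including 0x60, at offset 1.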
-
getAlgorithmLength
public int getAlgorithmLength()
Gets the algorithm range length.
- Returns:
- Algorithm range length
-
getAlgorithmStart
public int getAlgorithmStart(int index)
Gets the start of the range.
- Parameters:
index
- algorithm index
- Returns:
- algorithm range start
-
getAlgorithmEnd
public int getAlgorithmEnd(int index)
Gets the end of the range.
- Parameters:
index
- algorithm index
- Returns:
- algorithm range end
-
getAlgorithmName
public String getAlgorithmName(int index, int codepoint)
Gets the algorithmic name of the codepoint.
- Parameters:
index
- algorithmic range index
codepoint
- The codepoint value.
- Returns:
- algorithmic name of codepoint
-
getGroupName
public String getGroupName(int ch, int choice)
Gets the group name of the character.
- Parameters:
ch
- character to get the group name
choice
- name choice selector to choose a Unicode 1.0 or newer name
-
getMaxCharNameLength
public int getMaxCharNameLength()
Gets the maximum length of any codepoint name. Equivalent to uprv_getMaxCharNameLength.
- Returns:
- the maximum length of any codepoint name
-
getMaxISOCommentLength
public int getMaxISOCommentLength()
Gets the maximum length of any ISO comment. Equivalent to uprv_getMaxISOCommentLength.
- Returns:
- the maximum length of any ISO comment
-
getCharNameCharacters
public void getCharNameCharacters(UnicodeSet set)
Fills set with characters that are used in Unicode character names. Equivalent to uprv_getCharNameCharacters.
- Parameters:
set
- USet to receive characters. Existing contents are deleted.
-
getISOCommentCharacters
public void getISOCommentCharacters(UnicodeSet set)
Fills set with characters that are used in ISO comments. Equivalent to uprv_getISOCommentCharacters.
- Parameters:
set
- USet to receive characters. Existing contents are deleted.
-
setToken
boolean setToken(char[] token, byte[] tokenstring)
Sets the token data.
- Parameters:
token
- array of tokens
tokenstring
- array of string values of the tokens
- Returns:
- false if there is a data error
-
setAlgorithm
boolean setAlgorithm(UCharacterName.AlgorithmName[] alg)
Sets the algorithm name information array.
- Parameters:
alg
- Algorithm information array
- Returns:
- true if the group string offset has been set correctly
-
setGroupCountSize
boolean setGroupCountSize(int count, int size)
Sets the number of groups and the size of each group in number of char.
- Parameters:
count
- number of groups
size
- size of group in char
- Returns:
- true if group size is set correctly
-
setGroup
boolean setGroup(char[] group, byte[] groupstring)
Sets the group name data.
- Parameters:
group
- index information array
groupstring
- name information array
- Returns:
- false if there is a data error
-
getAlgName
private String getAlgName(int ch, int choice)
Gets the algorithmic name for the argument character.
- Parameters:
ch
- character to determine name for
choice
- name choice
- Returns:
- the algorithmic name or null if not found
-
getGroupChar
private int getGroupChar(String name, int choice)
Gets the character with the tokenized argument name.
- Parameters:
name
- of the character
- Returns:
- character with the tokenized argument name or -1 if character is not found
-
getGroupChar
private int getGroupChar(int index, char[] length, String name, int choice)
Compares and retrieves the character if name is found within the argument group.
- Parameters:
index
- index where the set of names reside in the group block
length
- list of lengths of the strings
name
- character name to search for
choice
- of either 1.0 or the most current Unicode name
- Returns:
- relative character in the group which matches name; if not found, -1 is returned
-
getType
private static int getType(int ch)
Gets the character extended type.
- Parameters:
ch
- character to be tested
- Returns:
- extended type it is associated with
-
getExtendedChar
private static int getExtendedChar(String name, int choice)
Gets the character with an extended name of the form <....>.
- Parameters:
name
- of the character to be found
choice
- name choice
- Returns:
- character associated with the name, -1 if such character is not found and -2 if we should continue with the search.
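The three-way return convention (a code point, -1 for "not found", -2 for "continue with the search") can be sketched as below. The page only gives the form <....>; the sketch assumes extended names look like <type-HEX>, for example <control-0009>, and it ignores the choice argument and the type check that a full implementation would need. The class and method names are hypothetical.

```java
final class ExtendedNameSketch {
    /**
     * Parses a name assumed to have the form <type-HEX>.
     * Returns the code point, -1 if the name looks extended but is invalid,
     * or -2 if it is not an extended name and other lookups should continue.
     */
    static int parse(String name) {
        if (name.isEmpty() || name.charAt(0) != '<') {
            return -2; // not an extended name: continue with the search
        }
        int end = name.length() - 1;
        int dash = name.lastIndexOf('-');
        if (name.charAt(end) != '>' || dash < 0) {
            return -1; // malformed extended name
        }
        try {
            // The hex digits sit between the last '-' and the closing '>'.
            int cp = Integer.parseInt(name.substring(dash + 1, end), 16);
            return (cp >= 0 && cp <= 0x10FFFF) ? cp : -1;
        } catch (NumberFormatException e) {
            return -1; // no valid hex digits
        }
    }
}
```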
-
add
private static void add(int[] set, char ch)
Adds a codepoint into a set of ints. Equivalent to SET_ADD.
- Parameters:
set
- set to add to
ch
- 16 bit char to add
-
contains
private static boolean contains(int[] set, char ch)
Checks if a codepoint is a part of a set of ints. Equivalent to SET_CONTAINS.
- Parameters:
set
- set to check in
ch
- 16 bit char to check
- Returns:
- true if codepoint is part of the set, false otherwise
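A "set of ints" holding 256 bit flags (as described under convert) is conventionally an int[8] in which each char selects one bit. The sketch below shows that layout for add, contains, and the string variant of add. It assumes 32 flags per int with the low five bits of the char selecting the bit; the page does not state the exact layout, and the names are illustrative.

```java
final class CharBitSetSketch {
    /** Sets the flag for ch. ch must be below 256 for an int[8] set. */
    static void add(int[] set, char ch) {
        set[ch >>> 5] |= 1 << (ch & 0x1f);
    }

    /** Tests the flag for ch. */
    static boolean contains(int[] set, char ch) {
        return (set[ch >>> 5] & (1 << (ch & 0x1f))) != 0;
    }

    /** Adds every char of str to the set and returns the length of str. */
    static int addAll(int[] set, String str) {
        for (int i = 0; i < str.length(); i++) {
            add(set, str.charAt(i));
        }
        return str.length();
    }
}
```

Returning the string length from the bulk add lets a caller track the maximum name length while building the set in the same pass.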
-
add
private static int add(int[] set, String str)
Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
- Parameters:
set
- set to add all chars of str to
str
- string to add
-
add
private static int add(int[] set, StringBuffer str)
Adds all characters of the argument str and gets the length. Equivalent to calcStringSetLength.
- Parameters:
set
- set to add all chars of str to
str
- string to add
-
addAlgorithmName
private int addAlgorithmName(int maxlength)
Adds all algorithmic names into the name set. Equivalent to part of calcAlgNameSetsLengths.
- Parameters:
maxlength
- length to compare to
- Returns:
- the maximum length of any possible algorithmic name if it is > maxlength, otherwise maxlength is returned.
-
addExtendedName
private int addExtendedName(int maxlength)
Adds all extended names into the name set. Equivalent to part of calcExtNameSetsLengths.
- Parameters:
maxlength
- length to compare to
- Returns:
- the maxlength of any possible extended name.
-
addGroupName
private int[] addGroupName(int offset, int length, byte[] tokenlength, int[] set)
Adds names of a group to the argument set. Equivalent to calcNameSetLength.
- Parameters:
offset
- of the group name string in byte count
length
- of the group name string
tokenlength
- array to store the length of each token
set
- to add to
- Returns:
- the length of the name string and the length of the group string parsed
-
addGroupName
private void addGroupName(int maxlength)
Adds names of all groups to the argument set. Sets the data member m_max*Length_. Method called only once. Equivalent to calcGroupNameSetsLength.
- Parameters:
maxlength
- length to compare to
-
initNameSetsLengths
private boolean initNameSetsLengths()
Sets up the name sets and the calculation of the maximum lengths. Equivalent to calcNameSetsLengths.
-
convert
private void convert(int[] set, UnicodeSet uset)
Converts the char set (the set argument) into a Unicode set uset. Equivalent to charSetToUSet.
- Parameters:
set
- Set of 256 bit flags corresponding to a set of chars.
uset
- USet to receive characters. Existing contents are deleted.
-