Package com.ibm.icu.charset
Class UTF8
java.lang.Object
com.ibm.icu.charset.UTF8
Partial Java port of ICU4C unicode/utf8.h and ustr_imp.h.
-
Field Summary
Fields
Modifier and Type, Field, Description
(package private) static int MAX_LENGTH
4: The maximum number of UTF-8 code units (bytes) per Unicode code point (U+0000..U+10ffff).
private static final int[] U8_LEAD3_T1_BITS
Internal bit vector for 3-byte UTF-8 validity check, for use in isValidLead3AndT1(int, byte).
private static final int[] U8_LEAD4_T1_BITS
Internal bit vector for 4-byte UTF-8 validity check, for use in isValidLead4AndT1(int, byte).
-
Constructor Summary
Constructors
(package private) UTF8()
-
Method Summary
Modifier and Type, Method, Description
(package private) static int countBytes(byte leadByte)
Counts the bytes of any whole valid sequence for a UTF-8 lead byte.
(package private) static int countTrailBytes(byte leadByte)
Counts the trail bytes for a UTF-8 lead byte.
(package private) static boolean isLead(byte c)
Is this code unit (byte) a UTF-8 lead byte?
(package private) static boolean isSingle(byte c)
Does this code unit (byte) encode a code point by itself (US-ASCII 0..0x7f)?
(package private) static boolean isTrail(byte c)
Is this code unit (byte) a UTF-8 trail byte? (0x80..0xBF)
(package private) static boolean isValidLead3AndT1(int lead, byte t1)
Internal 3-byte UTF-8 validity check.
(package private) static boolean isValidLead4AndT1(int lead, byte t1)
Internal 4-byte UTF-8 validity check.
(package private) static boolean isValidTrail(int prev, byte t, int i, int length)
Is t a valid UTF-8 trail byte?
(package private) static int length(int c)
How many code units (bytes) are used for the UTF-8 encoding of this Unicode code point?
-
Field Details
-
U8_LEAD3_T1_BITS
private static final int[] U8_LEAD3_T1_BITS
Internal bit vector for 3-byte UTF-8 validity check, for use in isValidLead3AndT1(int, byte). Each bit indicates whether one lead byte + first trail byte pair starts a valid sequence. Bits 3..0 of the lead byte (E0..EF) are used as the index into the int array; bits 7..5 of the first trail byte are used as the bit index into that int.
-
U8_LEAD4_T1_BITS
private static final int[] U8_LEAD4_T1_BITS
Internal bit vector for 4-byte UTF-8 validity check, for use in isValidLead4AndT1(int, byte). Each bit indicates whether one lead byte + first trail byte pair starts a valid sequence. Bits 2..0 of the lead byte (F0..F4) are used as the index into the int array; bits 7..4 of the first trail byte are used as the bit index into that int.
-
MAX_LENGTH
static int MAX_LENGTH
4: The maximum number of UTF-8 code units (bytes) per Unicode code point (U+0000..U+10ffff).
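As a rough illustration of how a 4-byte upper bound is typically used, the sketch below sizes a worst-case output buffer. Because UTF8 and MAX_LENGTH are package private, the class MaxLengthSketch, its local MAX_LENGTH constant, and worstCaseUtf8Bytes are hypothetical stand-ins and not part of ICU4J.

```java
final class MaxLengthSketch {
    // Local copy of the documented value; the real field is package private.
    static final int MAX_LENGTH = 4;

    /** Upper bound on the UTF-8 byte count of s: at most 4 bytes per code point. */
    static int worstCaseUtf8Bytes(String s) {
        return s.codePointCount(0, s.length()) * MAX_LENGTH;
    }

    public static void main(String[] args) {
        // 'a', EURO SIGN, and U+1F600 (a surrogate pair in UTF-16): 3 code points.
        String s = "a\u20ac\ud83d\ude00";
        int actual = s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
        System.out.println(worstCaseUtf8Bytes(s) + " >= " + actual); // prints "12 >= 8"
    }
}
```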
-
-
Constructor Details
-
UTF8
UTF8()
-
-
Method Details
-
countTrailBytes
static int countTrailBytes(byte leadByte)
Counts the trail bytes for a UTF-8 lead byte. Returns 0 for 0..0xc1 as well as for 0xf5..0xff.
Parameters:
leadByte - The first byte of a UTF-8 sequence. Must be 0..0xff.
Returns:
0..3
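The documented return values can be reproduced with plain range checks. The following is a minimal sketch under that assumption; TrailByteCountSketch is a hypothetical class, and the actual ICU4J implementation may be written differently. Note that a Java byte is signed, so the sketch masks it to 0..0xff before comparing.

```java
final class TrailByteCountSketch {
    /** Mirrors the documented results: 0 for 0..0xc1 and 0xf5..0xff, otherwise 1..3. */
    static int countTrailBytes(byte leadByte) {
        int b = leadByte & 0xff;            // widen the signed byte to 0..0xff
        if (b < 0xc2 || b > 0xf4) return 0; // ASCII, trail bytes, and invalid leads
        if (b < 0xe0) return 1;             // C2..DF: 2-byte sequence
        if (b < 0xf0) return 2;             // E0..EF: 3-byte sequence
        return 3;                           // F0..F4: 4-byte sequence
    }

    public static void main(String[] args) {
        System.out.println(countTrailBytes((byte) 0x41)); // prints 0
        System.out.println(countTrailBytes((byte) 0xc3)); // prints 1
        System.out.println(countTrailBytes((byte) 0xe2)); // prints 2
        System.out.println(countTrailBytes((byte) 0xf0)); // prints 3
        System.out.println(countTrailBytes((byte) 0xf5)); // prints 0
    }
}
```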
-
countBytes
static int countBytes(byte leadByte)
Counts the bytes of any whole valid sequence for a UTF-8 lead byte. Returns 1 for ASCII 0..0x7f. Returns 0 for 0x80..0xc1 as well as for 0xf5..0xff.
Parameters:
leadByte - The first byte of a UTF-8 sequence. Must be 0..0xff.
Returns:
0..4
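A sketch of the documented behavior, together with one possible use: stepping through a byte array one sequence at a time. ByteCountSketch and countSequences are hypothetical names, and treating a 0 result as a single unit to skip is an assumption made only for this example.

```java
final class ByteCountSketch {
    /** Mirrors the documented results: 1 for ASCII, 0 for 0x80..0xc1 and 0xf5..0xff, otherwise 2..4. */
    static int countBytes(byte leadByte) {
        int b = leadByte & 0xff;            // widen the signed byte to 0..0xff
        if (b <= 0x7f) return 1;            // ASCII
        if (b < 0xc2 || b > 0xf4) return 0; // not a valid lead byte
        if (b < 0xe0) return 2;
        if (b < 0xf0) return 3;
        return 4;
    }

    /** Counts sequences by jumping from lead byte to lead byte (trail bytes are not validated). */
    static int countSequences(byte[] utf8) {
        int n = 0;
        for (int i = 0; i < utf8.length; n++) {
            int len = countBytes(utf8[i]);
            i += (len == 0) ? 1 : len;      // assumption: skip an invalid lead as one unit
        }
        return n;
    }

    public static void main(String[] args) {
        // 'h', U+00E9 (C3 A9), U+20AC (E2 82 AC)
        byte[] bytes = {0x68, (byte) 0xc3, (byte) 0xa9, (byte) 0xe2, (byte) 0x82, (byte) 0xac};
        System.out.println(countSequences(bytes)); // prints 3
    }
}
```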
-
isValidLead3AndT1
static boolean isValidLead3AndT1(int lead, byte t1)
Internal 3-byte UTF-8 validity check.
Parameters:
lead - E0..EF
t1 - 00..FF
Returns:
true if lead byte E0..EF and first trail byte 00..FF start a valid sequence.
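For readers who want to see which pairs pass this check, the sketch below expresses the same rule with explicit ranges taken from the Unicode definition of well-formed UTF-8, rather than with the U8_LEAD3_T1_BITS lookup. Lead3Sketch is a hypothetical class; it is an equivalent formulation under that assumption, not the ICU4J code.

```java
final class Lead3Sketch {
    /** Range-based version of the 3-byte lead + first-trail check; lead is expected in E0..EF. */
    static boolean isValidLead3AndT1(int lead, byte t1) {
        int t = t1 & 0xff;                                  // widen the signed byte to 0..0xff
        if (lead == 0xe0) return 0xa0 <= t && t <= 0xbf;    // excludes overlong forms
        if (lead == 0xed) return 0x80 <= t && t <= 0x9f;    // excludes surrogates D800..DFFF
        return 0xe1 <= lead && lead <= 0xef && 0x80 <= t && t <= 0xbf;
    }

    public static void main(String[] args) {
        System.out.println(isValidLead3AndT1(0xe2, (byte) 0x82)); // prints true  (start of U+20AC)
        System.out.println(isValidLead3AndT1(0xe0, (byte) 0x80)); // prints false (overlong)
        System.out.println(isValidLead3AndT1(0xed, (byte) 0xa0)); // prints false (surrogate)
    }
}
```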
-
isValidLead4AndT1
static boolean isValidLead4AndT1(int lead, byte t1)
Internal 4-byte UTF-8 validity check.
Parameters:
lead - F0..F4
t1 - 00..FF
Returns:
true if lead byte F0..F4 and first trail byte 00..FF start a valid sequence.
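The 4-byte rule can likewise be written with explicit ranges from the Unicode definition of well-formed UTF-8. This is a sketch of an equivalent check, not the U8_LEAD4_T1_BITS lookup itself; Lead4Sketch is a hypothetical class.

```java
final class Lead4Sketch {
    /** Range-based version of the 4-byte lead + first-trail check; lead is expected in F0..F4. */
    static boolean isValidLead4AndT1(int lead, byte t1) {
        int t = t1 & 0xff;                                  // widen the signed byte to 0..0xff
        switch (lead) {
            case 0xf0: return 0x90 <= t && t <= 0xbf;       // excludes overlong forms
            case 0xf1:
            case 0xf2:
            case 0xf3: return 0x80 <= t && t <= 0xbf;
            case 0xf4: return 0x80 <= t && t <= 0x8f;       // excludes values above U+10FFFF
            default:   return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidLead4AndT1(0xf0, (byte) 0x9f)); // prints true  (start of U+1F600)
        System.out.println(isValidLead4AndT1(0xf0, (byte) 0x80)); // prints false (overlong)
        System.out.println(isValidLead4AndT1(0xf4, (byte) 0x90)); // prints false (above U+10FFFF)
    }
}
```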
-
isSingle
static boolean isSingle(byte c)
Does this code unit (byte) encode a code point by itself (US-ASCII 0..0x7f)?
Parameters:
c - 8-bit code unit (byte)
Returns:
true if c is an ASCII byte
-
isLead
static boolean isLead(byte c)
Is this code unit (byte) a UTF-8 lead byte?
Parameters:
c - 8-bit code unit (byte)
Returns:
true if c is a lead byte
-
isTrail
static boolean isTrail(byte c)
Is this code unit (byte) a UTF-8 trail byte? (0x80..0xBF)
Parameters:
c - 8-bit code unit (byte)
Returns:
true if c is a trail byte
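The three classification predicates isSingle, isLead, and isTrail can be sketched as follows. ByteClassSketch is a hypothetical class. The lead-byte range 0xC2..0xF4 is an assumption chosen to agree with countTrailBytes returning 0 outside that range; this page does not state the range for isLead explicitly.

```java
final class ByteClassSketch {
    /** ASCII 0..0x7f: as a signed Java byte, these are exactly the non-negative values. */
    static boolean isSingle(byte c) {
        return c >= 0;
    }

    /** Assumed lead-byte range 0xC2..0xF4 (see the note above). */
    static boolean isLead(byte c) {
        int b = c & 0xff;
        return 0xc2 <= b && b <= 0xf4;
    }

    /** Trail bytes 0x80..0xBF have the top two bits set to 10. */
    static boolean isTrail(byte c) {
        return (c & 0xc0) == 0x80;
    }

    public static void main(String[] args) {
        byte[] bytes = {0x41, (byte) 0xc3, (byte) 0xa9}; // 'A', then U+00E9 as C3 A9
        for (byte b : bytes) {
            System.out.println(isSingle(b) + " " + isLead(b) + " " + isTrail(b));
        }
        // prints:
        // true false false
        // false true false
        // false false true
    }
}
```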
-
length
static int length(int c)
How many code units (bytes) are used for the UTF-8 encoding of this Unicode code point?
Parameters:
c - 32-bit code point
Returns:
1..4, or 0 if c is a surrogate or not a Unicode code point
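The documented result follows directly from the UTF-8 encoding ranges. A minimal sketch, assuming nothing beyond the return values described above; LengthSketch is a hypothetical class.

```java
final class LengthSketch {
    /** UTF-8 byte count for code point c, or 0 for surrogates and out-of-range values. */
    static int length(int c) {
        if (c < 0) return 0;                          // not a code point
        if (c <= 0x7f) return 1;                      // U+0000..U+007F
        if (c <= 0x7ff) return 2;                     // U+0080..U+07FF
        if (c <= 0xffff) {
            return (0xd800 <= c && c <= 0xdfff) ? 0   // surrogates are not encodable
                                                : 3;  // U+0800..U+FFFF otherwise
        }
        if (c <= 0x10ffff) return 4;                  // U+10000..U+10FFFF
        return 0;                                     // beyond the Unicode range
    }

    public static void main(String[] args) {
        System.out.println(length('A'));      // prints 1
        System.out.println(length(0xe9));     // prints 2
        System.out.println(length(0x20ac));   // prints 3
        System.out.println(length(0x1f600));  // prints 4
        System.out.println(length(0xd800));   // prints 0
    }
}
```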
-
isValidTrail
static boolean isValidTrail(int prev, byte t, int i, int length)
Is t a valid UTF-8 trail byte?
Parameters:
prev - Must be the preceding lead byte if i==1 and length>=3; otherwise ignored.
t - The i-th byte following the lead byte.
i - The index (1..3) of byte t in the byte sequence. 0<i<length
length - The length (2..4) of the byte sequence according to the lead byte.
Returns:
true if t is a valid trail byte in this context.
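The parameter description implies that only the first trail byte of a 3- or 4-byte sequence depends on its lead byte, while every other trail byte just has to be in 0x80..0xBF. The sketch below follows that reading with explicit ranges; ValidTrailSketch and its helpers are hypothetical, and it assumes prev is already a valid lead byte for the given length (E0..EF for 3, F0..F4 for 4).

```java
final class ValidTrailSketch {
    static boolean isTrail(byte t) {
        return (t & 0xc0) == 0x80;                    // 0x80..0xBF
    }

    static boolean inRange(byte t, int lo, int hi) {
        int v = t & 0xff;                             // widen the signed byte to 0..0xff
        return lo <= v && v <= hi;
    }

    /** Context-sensitive trail check; prev is only consulted when i == 1 and length >= 3. */
    static boolean isValidTrail(int prev, byte t, int i, int length) {
        if (length <= 2 || i > 1) return isTrail(t);
        if (length == 3) {
            if (prev == 0xe0) return inRange(t, 0xa0, 0xbf); // no overlong forms
            if (prev == 0xed) return inRange(t, 0x80, 0x9f); // no surrogates
            return isTrail(t);
        }
        // length == 4
        if (prev == 0xf0) return inRange(t, 0x90, 0xbf);     // no overlong forms
        if (prev == 0xf4) return inRange(t, 0x80, 0x8f);     // nothing above U+10FFFF
        return isTrail(t);
    }

    public static void main(String[] args) {
        // U+20AC is E2 82 AC: check both trail bytes of the 3-byte sequence.
        System.out.println(isValidTrail(0xe2, (byte) 0x82, 1, 3)); // prints true
        System.out.println(isValidTrail(0, (byte) 0xac, 2, 3));    // prints true (prev ignored)
        // ED A0 would begin an encoded surrogate, which is not valid UTF-8.
        System.out.println(isValidTrail(0xed, (byte) 0xa0, 1, 3)); // prints false
    }
}
```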
-