Class UTF8

java.lang.Object
com.ibm.icu.charset.UTF8

class UTF8 extends Object
Partial Java port of ICU4C unicode/utf8.h and ustr_imp.h.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    (package private) static int
    MAX_LENGTH
    4: The maximum number of UTF-8 code units (bytes) per Unicode code point (U+0000..U+10ffff).
    private static final int[]
    U8_LEAD3_T1_BITS
    Internal bit vector for 3-byte UTF-8 validity check, for use in isValidLead3AndT1(int, byte).
    private static final int[]
    U8_LEAD4_T1_BITS
    Internal bit vector for 4-byte UTF-8 validity check, for use in isValidLead4AndT1(int, byte).
  • Constructor Summary

    Constructors
    Constructor
    Description
    UTF8()
  • Method Summary

    Modifier and Type
    Method
    Description
    (package private) static int
    countBytes(byte leadByte)
    Counts the bytes of any whole valid sequence for a UTF-8 lead byte.
    (package private) static int
    countTrailBytes(byte leadByte)
    Counts the trail bytes for a UTF-8 lead byte.
    (package private) static boolean
    isLead(byte c)
    Is this code unit (byte) a UTF-8 lead byte?
    (package private) static boolean
    isSingle(byte c)
    Does this code unit (byte) encode a code point by itself (US-ASCII 0..0x7f)?
    (package private) static boolean
    isTrail(byte c)
    Is this code unit (byte) a UTF-8 trail byte? (0x80..0xBF)
    (package private) static boolean
    isValidLead3AndT1(int lead, byte t1)
    Internal 3-byte UTF-8 validity check.
    (package private) static boolean
    isValidLead4AndT1(int lead, byte t1)
    Internal 4-byte UTF-8 validity check.
    (package private) static boolean
    isValidTrail(int prev, byte t, int i, int length)
    Is t a valid UTF-8 trail byte?
    (package private) static int
    length(int c)
    How many code units (bytes) are used for the UTF-8 encoding of this Unicode code point?

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • U8_LEAD3_T1_BITS

      private static final int[] U8_LEAD3_T1_BITS
      Internal bit vector for 3-byte UTF-8 validity check, for use in isValidLead3AndT1(int, byte). Each bit indicates whether one lead byte + first trail byte pair starts a valid sequence. Lead byte E0..EF bits 3..0 are used as data int index, first trail byte bits 7..5 are used as bit index into that int.
    • U8_LEAD4_T1_BITS

      private static final int[] U8_LEAD4_T1_BITS
      Internal bit vector for 4-byte UTF-8 validity check, for use in isValidLead4AndT1(int, byte). Each bit indicates whether one lead byte + first trail byte pair starts a valid sequence. Lead byte F0..F4 bits 2..0 are used as data int index, first trail byte bits 7..4 are used as bit index into that int.
    • MAX_LENGTH

      static int MAX_LENGTH
      4: The maximum number of UTF-8 code units (bytes) per Unicode code point (U+0000..U+10ffff).
  • Constructor Details

    • UTF8

      UTF8()
  • Method Details

    • countTrailBytes

      static int countTrailBytes(byte leadByte)
      Counts the trail bytes for a UTF-8 lead byte. Returns 0 for 0..0xc1 as well as for 0xf5..0xff.
      Parameters:
      leadByte - The first byte of a UTF-8 sequence, interpreted as an unsigned value 0..0xff.
      Returns:
      0..3
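The documented return values can be reproduced with plain range checks. The following is a minimal sketch, not the class's actual implementation: the class name TrailByteCountSketch is hypothetical, and the 1/2/3 split at 0xe0 and 0xf0 is assumed from the standard UTF-8 lead byte layout.

```java
final class TrailByteCountSketch {
    /** Returns the number of trail bytes implied by a lead byte, or 0 for any other byte. */
    static int countTrailBytes(byte leadByte) {
        int b = leadByte & 0xff; // Java bytes are signed; widen to 0..0xff first
        if (b < 0xc2 || b > 0xf4) {
            return 0; // ASCII, trail bytes, overlong leads C0/C1, and F5..FF
        }
        if (b < 0xe0) {
            return 1; // C2..DF start a 2-byte sequence
        }
        if (b < 0xf0) {
            return 2; // E0..EF start a 3-byte sequence
        }
        return 3;     // F0..F4 start a 4-byte sequence
    }

    public static void main(String[] args) {
        System.out.println(countTrailBytes((byte) 0xe2)); // lead byte of U+20AC
    }
}
```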
    • countBytes

      static int countBytes(byte leadByte)
      Counts the bytes of any whole valid sequence for a UTF-8 lead byte. Returns 1 for ASCII 0..0x7f. Returns 0 for 0x80..0xc1 as well as for 0xf5..0xff.
      Parameters:
      leadByte - The first byte of a UTF-8 sequence, interpreted as an unsigned value 0..0xff.
      Returns:
      0..4
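One plausible use of a whole-sequence count is hopping from lead byte to lead byte through a buffer. The sketch below assumes well-formed input. CountBytesSketch and countCodePoints are hypothetical names introduced for illustration, and the range checks mirror the documented return values rather than the real implementation.

```java
import java.nio.charset.StandardCharsets;

final class CountBytesSketch {
    /** Returns the full sequence length for a lead byte, or 0 if the byte cannot start a sequence. */
    static int countBytes(byte leadByte) {
        int b = leadByte & 0xff; // treat the signed Java byte as 0..0xff
        if (b <= 0x7f) {
            return 1; // ASCII encodes itself
        }
        if (b < 0xc2 || b > 0xf4) {
            return 0; // 0x80..0xc1 and 0xf5..0xff never start a valid sequence
        }
        if (b < 0xe0) {
            return 2;
        }
        if (b < 0xf0) {
            return 3;
        }
        return 4;
    }

    /** Counts sequences in a buffer; a stray byte is counted as one unit and skipped. */
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; count++) {
            int n = countBytes(utf8[i]);
            i += (n == 0) ? 1 : n; // step over a stray byte rather than loop forever
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u20ac".getBytes(StandardCharsets.UTF_8); // 61 E2 82 AC
        System.out.println(countCodePoints(bytes)); // prints 2
    }
}
```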
    • isValidLead3AndT1

      static boolean isValidLead3AndT1(int lead, byte t1)
      Internal 3-byte UTF-8 validity check.
      Parameters:
      lead - E0..EF
      t1 - 00..FF
      Returns:
      true if lead byte E0..EF and first trail byte 00..FF start a valid sequence.
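The bit-vector lookup described for U8_LEAD3_T1_BITS can be illustrated as follows. This is a sketch under stated assumptions: the table is rebuilt here from the standard well-formedness rules for 3-byte sequences (E0 requires A0..BF, ED requires 80..9F, all other leads require 80..BF), not copied from the class, and Lead3BitsSketch, LEAD3_T1_BITS, buildTable, and isValidPair are hypothetical names.

```java
final class Lead3BitsSketch {
    // One int per lead byte E0..EF (index = lead & 0xf). Bit n is set when a first
    // trail byte whose top three bits equal n (t1 >> 5) may follow that lead byte.
    private static final int[] LEAD3_T1_BITS = buildTable();

    private static int[] buildTable() {
        int[] bits = new int[16];
        for (int lead = 0xe0; lead <= 0xef; lead++) {
            // The valid ranges align to 32-byte blocks, so one probe per block suffices.
            for (int t1 = 0; t1 <= 0xff; t1 += 0x20) {
                if (isValidPair(lead, t1)) {
                    bits[lead & 0xf] |= 1 << (t1 >> 5);
                }
            }
        }
        return bits;
    }

    // Well-formedness rules for the first two bytes of a 3-byte sequence.
    private static boolean isValidPair(int lead, int t1) {
        if (lead == 0xe0) {
            return t1 >= 0xa0 && t1 <= 0xbf; // excludes overlong encodings
        }
        if (lead == 0xed) {
            return t1 >= 0x80 && t1 <= 0x9f; // excludes surrogates U+D800..U+DFFF
        }
        return t1 >= 0x80 && t1 <= 0xbf;
    }

    /** Table lookup: lead bits 3..0 select the int, t1 bits 7..5 select the bit. */
    static boolean isValidLead3AndT1(int lead, byte t1) {
        return (LEAD3_T1_BITS[lead & 0xf] & (1 << ((t1 & 0xff) >> 5))) != 0;
    }
}
```

Packing the check into a table replaces several range comparisons with one index, one shift, and one mask.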
    • isValidLead4AndT1

      static boolean isValidLead4AndT1(int lead, byte t1)
      Internal 4-byte UTF-8 validity check.
      Parameters:
      lead - F0..F4
      t1 - 00..FF
      Returns:
      true if lead byte F0..F4 and first trail byte 00..FF start a valid sequence.
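The 4-byte check can be sketched the same way, using the indexing described for U8_LEAD4_T1_BITS (lead bits 2..0 select the int, first trail byte bits 7..4 select the bit). The table contents below are derived from the standard rules for 4-byte sequences (F0 requires 90..BF, F1..F3 require 80..BF, F4 requires 80..8F) and are an assumption, not the class's actual data. Lead4BitsSketch and LEAD4_T1_BITS are hypothetical names.

```java
final class Lead4BitsSketch {
    // One int per value of (lead & 7); only the entries for F0..F4 are populated.
    private static final int[] LEAD4_T1_BITS = new int[8];

    static {
        // Each row: {lead byte, lowest valid first trail byte, highest valid first trail byte}
        int[][] rules = {
            {0xf0, 0x90, 0xbf}, // excludes overlong encodings below U+10000
            {0xf1, 0x80, 0xbf},
            {0xf2, 0x80, 0xbf},
            {0xf3, 0x80, 0xbf},
            {0xf4, 0x80, 0x8f}, // excludes code points above U+10FFFF
        };
        for (int[] r : rules) {
            // The valid ranges align to 16-byte blocks, so set one bit per block.
            for (int t1 = r[1]; t1 <= r[2]; t1 += 0x10) {
                LEAD4_T1_BITS[r[0] & 7] |= 1 << (t1 >> 4);
            }
        }
    }

    static boolean isValidLead4AndT1(int lead, byte t1) {
        return (LEAD4_T1_BITS[lead & 7] & (1 << ((t1 & 0xff) >> 4))) != 0;
    }
}
```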
    • isSingle

      static boolean isSingle(byte c)
      Does this code unit (byte) encode a code point by itself (US-ASCII 0..0x7f)?
      Parameters:
      c - 8-bit code unit (byte)
      Returns:
      true if c is an ASCII byte
    • isLead

      static boolean isLead(byte c)
      Is this code unit (byte) a UTF-8 lead byte?
      Parameters:
      c - 8-bit code unit (byte)
      Returns:
      true if c is a lead byte
    • isTrail

      static boolean isTrail(byte c)
      Is this code unit (byte) a UTF-8 trail byte? (0x80..0xBF)
      Parameters:
      c - 8-bit code unit (byte)
      Returns:
      true if c is a trail byte
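Because Java bytes are signed, the single, lead, and trail classifications can each be written as one comparison. The sketch below is illustrative only. ByteClassSketch is a hypothetical name, and the lead byte range 0xC2..0xF4 is an assumption chosen to match the bytes for which countTrailBytes is documented to return a nonzero count.

```java
final class ByteClassSketch {
    /** ASCII 0..0x7f: exactly the non-negative byte values. */
    static boolean isSingle(byte c) {
        return c >= 0;
    }

    /** Assumed lead range 0xC2..0xF4: subtracting 0xc2 maps it onto 0..0x32 modulo 256. */
    static boolean isLead(byte c) {
        return ((c - 0xc2) & 0xff) <= 0x32;
    }

    /** Trail bytes 0x80..0xBF are the signed values -128..-65. */
    static boolean isTrail(byte c) {
        return c < -0x40;
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xe2, (byte) 0x82, (byte) 0xac}; // UTF-8 for U+20AC
        for (byte b : euro) {
            System.out.printf("%02X single=%b lead=%b trail=%b%n",
                    b & 0xff, isSingle(b), isLead(b), isTrail(b));
        }
    }
}
```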
    • length

      static int length(int c)
      How many code units (bytes) are used for the UTF-8 encoding of this Unicode code point?
      Parameters:
      c - 32-bit code point
      Returns:
      1..4, or 0 if c is a surrogate or not a Unicode code point
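The documented result follows directly from the UTF-8 code point ranges. A minimal sketch, with the hypothetical class name LengthSketch, that also cross-checks itself against the JDK's UTF-8 encoder in main:

```java
import java.nio.charset.StandardCharsets;

final class LengthSketch {
    /** Returns the UTF-8 length of code point c, or 0 for surrogates and out-of-range values. */
    static int length(int c) {
        if (c < 0) {
            return 0; // not a Unicode code point
        }
        if (c <= 0x7f) {
            return 1;
        }
        if (c <= 0x7ff) {
            return 2;
        }
        if (c <= 0xd7ff) {
            return 3;
        }
        if (c <= 0xdfff) {
            return 0; // surrogates U+D800..U+DFFF have no UTF-8 encoding
        }
        if (c <= 0xffff) {
            return 3;
        }
        if (c <= 0x10ffff) {
            return 4;
        }
        return 0; // above U+10FFFF
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xe9, 0x20ac, 0x1f600};
        for (int cp : samples) {
            int encoded = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
            System.out.println(Integer.toHexString(cp) + ": " + length(cp) + " (encoder: " + encoded + ")");
        }
    }
}
```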
    • isValidTrail

      static boolean isValidTrail(int prev, byte t, int i, int length)
      Is t a valid UTF-8 trail byte?
      Parameters:
      prev - Must be the preceding lead byte if i==1 and length>=3; otherwise ignored.
      t - The i-th byte following the lead byte.
      i - The index (1..3) of byte t in the byte sequence. Must satisfy 0 < i < length.
      length - The length (2..4) of the byte sequence according to the lead byte.
      Returns:
      true if t is a valid trail byte in this context.
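The parameter descriptions suggest a dispatch on position and sequence length: only the first trail byte of a 3- or 4-byte sequence needs to be checked against its lead byte. The sketch below assumes that reading and uses plain range checks in place of the bit vectors. ValidTrailSketch is a hypothetical name, and prev is assumed to carry the lead byte as an unsigned value such as 0xE0.

```java
final class ValidTrailSketch {
    private static boolean isTrail(int b) {
        return b >= 0x80 && b <= 0xbf;
    }

    static boolean isValidTrail(int prev, byte t, int i, int length) {
        int b = t & 0xff; // treat the signed Java byte as 0..0xff
        if (length <= 2 || i > 1) {
            return isTrail(b); // prev is ignored here, as documented
        }
        if (length == 3) {
            int lo = (prev == 0xe0) ? 0xa0 : 0x80; // E0: no overlong encodings
            int hi = (prev == 0xed) ? 0x9f : 0xbf; // ED: no surrogates
            return b >= lo && b <= hi;
        }
        // length == 4
        int lo = (prev == 0xf0) ? 0x90 : 0x80; // F0: no overlong encodings
        int hi = (prev == 0xf4) ? 0x8f : 0xbf; // F4: nothing above U+10FFFF
        return b >= lo && b <= hi;
    }
}
```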