Class BOCSU
Binary Ordered Compression Scheme for Unicode
Users are strongly encouraged to read the ICU paper on BOCU before attempting to use this class.
BOCU is used to compress unicode text into a stream of unsigned bytes. For many kinds of text the compression compares favorably to UTF-8, and for some kinds of text (such as CJK) it does better. The resulting bytes will compare in the same order as the original code points. The byte stream does not contain the values 0, 1, or 2.
One example of a use of BOCU is in com.ibm.icu.text.Collator#getCollationKey(String) for a RuleBasedCollator object with collation strength IDENTICAL. The result CollationKey will consist of the collation order of the source string followed by the BOCU result of the source string.
Unlike a UTF encoding, BOCU-compressed text is not suitable for random access.
Method: Slope Detection
Remember the previous code point
(initial 0). For each code point in the string, encode the
difference with the previous one. Similar to a UTF, the length of
the byte sequence is encoded in the lead bytes. Unlike a UTF, the
trail byte values may overlap with lead/single byte values. The
signedness of the difference must be encoded as the most
significant part.
We encode differences with few bytes if their absolute values are small. For correct ordering, we must treat the entire value range -10ffff..+10ffff in ascending order, which forbids encoding the sign and the absolute value separately. Instead, we split the lead byte range in the middle and encode non-negative values going up and negative values going down.
For very small absolute values, the difference is added to a middle byte value for single-byte encoded differences. For somewhat larger absolute values, the difference is divided by the number of byte values available, the modulo is used for one trail byte, and the remainder is added to a lead byte avoiding the single-byte range. For large absolute values, the difference is similarly encoded in three bytes. (Syn Wee, I need examples here.)
BOCU does not use byte values 0, 1, or 2, but uses all other byte values for lead and single bytes, so that the middle range of single bytes is as large as possible.
Note that the lead byte ranges overlap some, but that the sequences as a whole are well ordered. I.e., even if the lead byte is the same for sequences of different lengths, the trail bytes establish correct order. It would be possible to encode slightly larger ranges for each length (>1) by subtracting the lower bound of the range. However, that would also slow down the calculation. (Syn Wee, need an example).
For the actual string encoding, an optimization moves the previous code point value to the middle of its Unicode script block to minimize the differences in same-script text runs. (Syn Wee, need an example.)
- Since:
- release 2.2, May 3rd 2002
-
Field Summary
FieldsModifier and TypeFieldDescriptionprivate static final int
private static final int
private static final int
private static final int
private static final int
private static final int
Do not use byte values 0, 1, 2 because they are separators in sort keys.private static final int
private static final int
private static final int
private static final int
The difference value range for single-byters.private static final int
The difference value range for double-byters.private static final int
The difference value range for 3-byters.private static final int
Number of lead bytes: 1 middle byte for 0 2*80=160 single bytes for !=0 2*42=84 for double-byte values 2*3=6 for 3-byte values 2*1=2 for 4-byte values The sum must be invalid input: '<'=SLOPE_TAIL_COUNT.private static final int
private static final int
private static final int
The lead byte start values.private static final int
private static final int
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionprivate static void
ensureAppendCapacity
(ByteArrayWrapper sink, int minCapacity, int desiredCapacity) private static final long
getNegDivMod
(int number, int factor) Integer division and modulo with negative numerators yields negative modulo results and quotients that are one more than what we need here.private static final int
writeDiff
(int diff, byte[] buffer, int offset) Encode one difference value -0x10ffff..+0x10ffff in 1..4 bytes, preserving lexical orderstatic int
writeIdenticalLevelRun
(int prev, CharSequence s, int i, int length, ByteArrayWrapper sink) Encode the code points of a string as a sequence of byte-encoded differences (slope detection), preserving lexical order.
-
Field Details
-
SLOPE_MIN_
private static final int SLOPE_MIN_Do not use byte values 0, 1, 2 because they are separators in sort keys.- See Also:
-
SLOPE_MAX_
private static final int SLOPE_MAX_- See Also:
-
SLOPE_MIDDLE_
private static final int SLOPE_MIDDLE_- See Also:
-
SLOPE_TAIL_COUNT_
private static final int SLOPE_TAIL_COUNT_- See Also:
-
SLOPE_MAX_BYTES_
private static final int SLOPE_MAX_BYTES_- See Also:
-
SLOPE_SINGLE_
private static final int SLOPE_SINGLE_Number of lead bytes: 1 middle byte for 0 2*80=160 single bytes for !=0 2*42=84 for double-byte values 2*3=6 for 3-byte values 2*1=2 for 4-byte values The sum must be invalid input: '<'=SLOPE_TAIL_COUNT. Why these numbers? - There should be >=128 single-byte values to cover 128-blocks with small scripts. - There should be >=20902 single/double-byte values to cover Unihan. - It helps CJK Extension B some if there are 3-byte values that cover the distance between them and Unihan. This also helps to jump among distant places in the BMP. - Four-byte values are necessary to cover the rest of Unicode. Symmetrical lead byte counts are for convenience. With an equal distribution of even and odd differences there is also no advantage to asymmetrical lead byte counts.- See Also:
-
SLOPE_LEAD_2_
private static final int SLOPE_LEAD_2_- See Also:
-
SLOPE_LEAD_3_
private static final int SLOPE_LEAD_3_- See Also:
-
SLOPE_REACH_POS_1_
private static final int SLOPE_REACH_POS_1_The difference value range for single-byters.- See Also:
-
SLOPE_REACH_NEG_1_
private static final int SLOPE_REACH_NEG_1_- See Also:
-
SLOPE_REACH_POS_2_
private static final int SLOPE_REACH_POS_2_The difference value range for double-byters.- See Also:
-
SLOPE_REACH_NEG_2_
private static final int SLOPE_REACH_NEG_2_- See Also:
-
SLOPE_REACH_POS_3_
private static final int SLOPE_REACH_POS_3_The difference value range for 3-byters.- See Also:
-
SLOPE_REACH_NEG_3_
private static final int SLOPE_REACH_NEG_3_- See Also:
-
SLOPE_START_POS_2_
private static final int SLOPE_START_POS_2_The lead byte start values.- See Also:
-
SLOPE_START_POS_3_
private static final int SLOPE_START_POS_3_- See Also:
-
SLOPE_START_NEG_2_
private static final int SLOPE_START_NEG_2_- See Also:
-
SLOPE_START_NEG_3_
private static final int SLOPE_START_NEG_3_- See Also:
-
-
Constructor Details
-
BOCSU
private BOCSU()Constructor private to prevent initialization
-
-
Method Details
-
writeIdenticalLevelRun
public static int writeIdenticalLevelRun(int prev, CharSequence s, int i, int length, ByteArrayWrapper sink) Encode the code points of a string as a sequence of byte-encoded differences (slope detection), preserving lexical order.Optimize the difference-taking for runs of Unicode text within small scripts:
Most small scripts are allocated within aligned 128-blocks of Unicode code points. Lexical order is preserved if "prev" is always moved into the middle of such a block.
Additionally, "prev" is moved from anywhere in the Unihan area into the middle of that area. Note that the identical-level run in a sort key is generated from NFD text - there are never Hangul characters included.
-
ensureAppendCapacity
private static void ensureAppendCapacity(ByteArrayWrapper sink, int minCapacity, int desiredCapacity) -
getNegDivMod
private static final long getNegDivMod(int number, int factor) Integer division and modulo with negative numerators yields negative modulo results and quotients that are one more than what we need here.- Parameters:
number
- which operations are to be performed onfactor
- the factor to use for division- Returns:
- (result of division) invalid input: '<'invalid input: '<' 32 | modulo
-
writeDiff
private static final int writeDiff(int diff, byte[] buffer, int offset) Encode one difference value -0x10ffff..+0x10ffff in 1..4 bytes, preserving lexical order- Parameters:
diff
-buffer
- byte buffer to append tooffset
- to the byte buffer to start appending- Returns:
- end offset where the appending stops
-