Class Normalizer2Impl

java.lang.Object
com.ibm.icu.impl.Normalizer2Impl

public final class Normalizer2Impl extends Object
Low-level implementation of the Unicode Normalization Algorithm. For the data structure and details see the documentation at the end of C++ normalizer2impl.h and in the design doc at https://icu.unicode.org/design/normalization/custom
  • Field Details

    • IS_ACCEPTABLE

      private static final Normalizer2Impl.IsAcceptable IS_ACCEPTABLE
    • DATA_FORMAT

      private static final int DATA_FORMAT
    • segmentStarterMapper

      private static final CodePointMap.ValueFilter segmentStarterMapper
    • MIN_YES_YES_WITH_CC

      public static final int MIN_YES_YES_WITH_CC
    • JAMO_VT

      public static final int JAMO_VT
    • MIN_NORMAL_MAYBE_YES

      public static final int MIN_NORMAL_MAYBE_YES
    • JAMO_L

      public static final int JAMO_L
    • INERT

      public static final int INERT
    • HAS_COMP_BOUNDARY_AFTER

      public static final int HAS_COMP_BOUNDARY_AFTER
    • OFFSET_SHIFT

      public static final int OFFSET_SHIFT
    • DELTA_TCCC_0

      public static final int DELTA_TCCC_0
    • DELTA_TCCC_1

      public static final int DELTA_TCCC_1
    • DELTA_TCCC_GT_1

      public static final int DELTA_TCCC_GT_1
    • DELTA_TCCC_MASK

      public static final int DELTA_TCCC_MASK
    • DELTA_SHIFT

      public static final int DELTA_SHIFT
    • MAX_DELTA

      public static final int MAX_DELTA
    • IX_NORM_TRIE_OFFSET

      public static final int IX_NORM_TRIE_OFFSET
    • IX_EXTRA_DATA_OFFSET

      public static final int IX_EXTRA_DATA_OFFSET
    • IX_SMALL_FCD_OFFSET

      public static final int IX_SMALL_FCD_OFFSET
    • IX_RESERVED3_OFFSET

      public static final int IX_RESERVED3_OFFSET
    • IX_TOTAL_SIZE

      public static final int IX_TOTAL_SIZE
    • IX_MIN_DECOMP_NO_CP

      public static final int IX_MIN_DECOMP_NO_CP
    • IX_MIN_COMP_NO_MAYBE_CP

      public static final int IX_MIN_COMP_NO_MAYBE_CP
    • IX_MIN_YES_NO

      public static final int IX_MIN_YES_NO
      Mappings & compositions in [minYesNo..minYesNoMappingsOnly[.
    • IX_MIN_NO_NO

      public static final int IX_MIN_NO_NO
      Mappings are comp-normalized.
    • IX_LIMIT_NO_NO

      public static final int IX_LIMIT_NO_NO
    • IX_MIN_MAYBE_YES

      public static final int IX_MIN_MAYBE_YES
    • IX_MIN_YES_NO_MAPPINGS_ONLY

      public static final int IX_MIN_YES_NO_MAPPINGS_ONLY
      Mappings only in [minYesNoMappingsOnly..minNoNo[.
    • IX_MIN_NO_NO_COMP_BOUNDARY_BEFORE

      public static final int IX_MIN_NO_NO_COMP_BOUNDARY_BEFORE
      Mappings are not comp-normalized but have a comp boundary before.
    • IX_MIN_NO_NO_COMP_NO_MAYBE_CC

      public static final int IX_MIN_NO_NO_COMP_NO_MAYBE_CC
      Mappings do not have a comp boundary before.
    • IX_MIN_NO_NO_EMPTY

      public static final int IX_MIN_NO_NO_EMPTY
      Mappings to the empty string.
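      The range comments on the IX_MIN_* entries suggest that a norm16 value is classified by comparing it against a series of thresholds. The following is a minimal sketch of that idea, not ICU code. It assumes the thresholds are in ascending order (minYesNo, minYesNoMappingsOnly, minNoNo, minNoNoCompBoundaryBefore, minNoNoCompNoMaybeCC, minNoNoEmpty, limitNoNo) and that limitNoNo closes the last range; the class name, labels, and numeric values are hypothetical.

```java
// Illustrative sketch only; not ICU code. Threshold values are hypothetical.
final class Norm16RangeSketch {
    // Assumed ascending order: minYesNo, minYesNoMappingsOnly, minNoNo,
    // minNoNoCompBoundaryBefore, minNoNoCompNoMaybeCC, minNoNoEmpty, limitNoNo.
    private final int[] thresholds;

    // One label per half-open range [thresholds[i]..thresholds[i + 1][.
    private static final String[] LABELS = {
        "mappings & compositions",
        "mappings only",
        "mappings are comp-normalized",
        "mappings not comp-normalized, comp boundary before",
        "mappings without a comp boundary before",
        "mappings to the empty string"
    };

    Norm16RangeSketch(int... thresholds) {
        if (thresholds.length != LABELS.length + 1) {
            throw new IllegalArgumentException("expected 7 thresholds");
        }
        this.thresholds = thresholds.clone();
    }

    /** Returns the label of the range containing norm16, or null if it lies outside all ranges. */
    String describe(int norm16) {
        if (norm16 < thresholds[0] || norm16 >= thresholds[thresholds.length - 1]) {
            return null;
        }
        int i = 0;
        // Skip ranges (including empty ones) that end at or below norm16.
        while (norm16 >= thresholds[i + 1]) {
            i++;
        }
        return LABELS[i];
    }

    public static void main(String[] args) {
        Norm16RangeSketch ranges = new Norm16RangeSketch(10, 20, 30, 40, 50, 60, 70);
        System.out.println(ranges.describe(25));
    }
}
```

      In the real class the thresholds are read from the data file's index array, so the constants above only stand in for those loaded values.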
    • IX_MIN_LCCC_CP

      public static final int IX_MIN_LCCC_CP
    • IX_COUNT

      public static final int IX_COUNT
    • MAPPING_HAS_CCC_LCCC_WORD

      public static final int MAPPING_HAS_CCC_LCCC_WORD
    • MAPPING_HAS_RAW_MAPPING

      public static final int MAPPING_HAS_RAW_MAPPING
    • MAPPING_LENGTH_MASK

      public static final int MAPPING_LENGTH_MASK
    • COMP_1_LAST_TUPLE

      public static final int COMP_1_LAST_TUPLE
    • COMP_1_TRIPLE

      public static final int COMP_1_TRIPLE
    • COMP_1_TRAIL_LIMIT

      public static final int COMP_1_TRAIL_LIMIT
    • COMP_1_TRAIL_MASK

      public static final int COMP_1_TRAIL_MASK
    • COMP_1_TRAIL_SHIFT

      public static final int COMP_1_TRAIL_SHIFT
    • COMP_2_TRAIL_SHIFT

      public static final int COMP_2_TRAIL_SHIFT
    • COMP_2_TRAIL_MASK

      public static final int COMP_2_TRAIL_MASK
    • dataVersion

      private VersionInfo dataVersion
    • minDecompNoCP

      private int minDecompNoCP
    • minCompNoMaybeCP

      private int minCompNoMaybeCP
    • minLcccCP

      private int minLcccCP
    • minYesNo

      private int minYesNo
    • minYesNoMappingsOnly

      private int minYesNoMappingsOnly
    • minNoNo

      private int minNoNo
    • minNoNoCompBoundaryBefore

      private int minNoNoCompBoundaryBefore
    • minNoNoCompNoMaybeCC

      private int minNoNoCompNoMaybeCC
    • minNoNoEmpty

      private int minNoNoEmpty
    • limitNoNo

      private int limitNoNo
    • centerNoNoDelta

      private int centerNoNoDelta
    • minMaybeYes

      private int minMaybeYes
    • normTrie

      private CodePointTrie.Fast16 normTrie
    • maybeYesCompositions

      private String maybeYesCompositions
    • extraData

      private String extraData
    • smallFCD

      private byte[] smallFCD
    • canonIterData

      private CodePointTrie canonIterData
    • canonStartSets

      private ArrayList<UnicodeSet> canonStartSets
    • CANON_NOT_SEGMENT_STARTER

      private static final int CANON_NOT_SEGMENT_STARTER
    • CANON_HAS_COMPOSITIONS

      private static final int CANON_HAS_COMPOSITIONS
    • CANON_HAS_SET

      private static final int CANON_HAS_SET
    • CANON_VALUE_MASK

      private static final int CANON_VALUE_MASK
  • Constructor Details

    • Normalizer2Impl

      public Normalizer2Impl()
  • Method Details

    • load

      public Normalizer2Impl load(ByteBuffer bytes)
    • load

      public Normalizer2Impl load(String name)
    • addLcccChars

      public void addLcccChars(UnicodeSet set)
    • addPropertyStarts

      public void addPropertyStarts(UnicodeSet set)
    • addCanonIterPropertyStarts

      public void addCanonIterPropertyStarts(UnicodeSet set)
    • ensureCanonIterData

      public Normalizer2Impl ensureCanonIterData()
      Builds the canonical-iterator data for this instance. This must be done before either isCanonSegmentStarter(int) or getCanonStartSet(int, UnicodeSet) is called; otherwise those methods crash.
      Returns:
      this
    • getNorm16

      public int getNorm16(int c)
    • getRawNorm16

      public int getRawNorm16(int c)
    • getCompQuickCheck

      public int getCompQuickCheck(int norm16)
    • isAlgorithmicNoNo

      public boolean isAlgorithmicNoNo(int norm16)
    • isCompNo

      public boolean isCompNo(int norm16)
    • isDecompYes

      public boolean isDecompYes(int norm16)
    • getCC

      public int getCC(int norm16)
    • getCCFromNormalYesOrMaybe

      public static int getCCFromNormalYesOrMaybe(int norm16)
    • getCCFromYesOrMaybe

      public static int getCCFromYesOrMaybe(int norm16)
    • getCCFromYesOrMaybeCP

      public int getCCFromYesOrMaybeCP(int c)
    • getFCD16

      public int getFCD16(int c)
      Returns the FCD data for code point c.
      Parameters:
      c - A Unicode code point.
      Returns:
      The lccc(c) in bits 15..8 and tccc(c) in bits 7..0.
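      The packed return value can be split with a shift and a mask. The sketch below is illustrative and does not call ICU; the class name, method names, and sample value are assumptions made for the example.

```java
// Illustrative helper, not part of ICU: unpacks a 16-bit FCD value.
final class Fcd16Sketch {
    /** Lead canonical combining class, stored in bits 15..8. */
    static int lccc(int fcd16) {
        return (fcd16 >> 8) & 0xFF;
    }

    /** Trail canonical combining class, stored in bits 7..0. */
    static int tccc(int fcd16) {
        return fcd16 & 0xFF;
    }

    public static void main(String[] args) {
        int fcd16 = 0xE6DC; // hypothetical value: lccc=230, tccc=220
        System.out.println("lccc=" + lccc(fcd16) + " tccc=" + tccc(fcd16));
    }
}
```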
    • singleLeadMightHaveNonZeroFCD16

      public boolean singleLeadMightHaveNonZeroFCD16(int lead)
      Returns true if the single-or-lead code unit lead might have non-zero FCD data.
    • getFCD16FromNormData

      public int getFCD16FromNormData(int c)
      Gets the FCD value from the regular normalization data.
    • getDecomposition

      public String getDecomposition(int c)
      Gets the decomposition for one code point.
      Parameters:
      c - code point
      Returns:
      c's decomposition, if it has one; returns null if it does not have a decomposition
    • getRawDecomposition

      public String getRawDecomposition(int c)
      Gets the raw decomposition for one code point.
      Parameters:
      c - code point
      Returns:
      c's raw decomposition, if it has one; returns null if it does not have a decomposition
    • isCanonSegmentStarter

      public boolean isCanonSegmentStarter(int c)
      Returns true if code point c starts a canonical-iterator string segment. ensureCanonIterData() must have been called before this method, or else this method will crash.
      Parameters:
      c - A Unicode code point.
      Returns:
      true if c starts a canonical-iterator string segment.
    • getCanonStartSet

      public boolean getCanonStartSet(int c, UnicodeSet set)
      Returns true if there are characters whose decomposition starts with c. If so, then the set is cleared and then filled with those characters. ensureCanonIterData() must have been called before this method, or else this method will crash.
      Parameters:
      c - A Unicode code point.
      set - A UnicodeSet to receive the characters whose decompositions start with c, if there are any.
      Returns:
      true if there are characters whose decomposition starts with c.
    • decompose

      public Appendable decompose(CharSequence s, StringBuilder dest)
    • decompose

      public void decompose(CharSequence s, int src, int limit, StringBuilder dest, int destLengthEstimate)
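      Decomposes the half-open range s[src, limit[ and writes the result to dest. The note that limit can be NULL if src is NUL-terminated is carried over from the C++ API; in this Java signature limit is an int index. destLengthEstimate is the initial dest buffer capacity and can be -1.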
    • decompose

      public int decompose(CharSequence s, int src, int limit, Normalizer2Impl.ReorderingBuffer buffer)
    • decomposeAndAppend

      public void decomposeAndAppend(CharSequence s, boolean doDecompose, Normalizer2Impl.ReorderingBuffer buffer)
    • compose

      public boolean compose(CharSequence s, int src, int limit, boolean onlyContiguous, boolean doCompose, Normalizer2Impl.ReorderingBuffer buffer)
    • composeQuickCheck

      public int composeQuickCheck(CharSequence s, int src, int limit, boolean onlyContiguous, boolean doSpan)
      Very similar to compose(): make the same changes in both places if relevant.
      doSpan: spanQuickCheckYes (ignore bit 0 of the return value)
      !doSpan: quickCheck
      Returns:
      bits 31..1: spanQuickCheckYes (==s.length() if "yes") and bit 0: set if "maybe"; otherwise, if the span length<s.length() then the quick check result is "no"
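      One way to read the packed return value is sketched below. This is an illustrative decoder under the bit layout described above, not ICU code; the class, enum, and method names are hypothetical.

```java
// Illustrative decoder for the packed composeQuickCheck() result; not ICU code.
final class QuickCheckResultSketch {
    enum Result { YES, NO, MAYBE }

    /** Span length stored in bits 31..1. */
    static int spanLength(int packed) {
        return packed >>> 1;
    }

    /** Interprets the packed value for an input of the given length (the !doSpan case). */
    static Result quickCheck(int packed, int length) {
        if ((packed & 1) != 0) {
            return Result.MAYBE; // bit 0 set
        }
        // "yes" only if the span covers the whole input; a shorter span means "no".
        return spanLength(packed) == length ? Result.YES : Result.NO;
    }

    public static void main(String[] args) {
        int packed = (3 << 1) | 1; // hypothetical result: span of 3, "maybe" bit set
        System.out.println(spanLength(packed) + " " + quickCheck(packed, 5));
    }
}
```

      When doSpan is true, only spanLength() is meaningful and bit 0 should be ignored.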
    • composeAndAppend

      public void composeAndAppend(CharSequence s, boolean doCompose, boolean onlyContiguous, Normalizer2Impl.ReorderingBuffer buffer)
    • makeFCD

      public int makeFCD(CharSequence s, int src, int limit, Normalizer2Impl.ReorderingBuffer buffer)
    • makeFCDAndAppend

      public void makeFCDAndAppend(CharSequence s, boolean doMakeFCD, Normalizer2Impl.ReorderingBuffer buffer)
    • hasDecompBoundaryBefore

      public boolean hasDecompBoundaryBefore(int c)
    • norm16HasDecompBoundaryBefore

      public boolean norm16HasDecompBoundaryBefore(int norm16)
    • hasDecompBoundaryAfter

      public boolean hasDecompBoundaryAfter(int c)
    • norm16HasDecompBoundaryAfter

      public boolean norm16HasDecompBoundaryAfter(int norm16)
    • isDecompInert

      public boolean isDecompInert(int c)
    • hasCompBoundaryBefore

      public boolean hasCompBoundaryBefore(int c)
    • hasCompBoundaryAfter

      public boolean hasCompBoundaryAfter(int c, boolean onlyContiguous)
    • isCompInert

      public boolean isCompInert(int c, boolean onlyContiguous)
    • hasFCDBoundaryBefore

      public boolean hasFCDBoundaryBefore(int c)
    • hasFCDBoundaryAfter

      public boolean hasFCDBoundaryAfter(int c)
    • isFCDInert

      public boolean isFCDInert(int c)
    • isMaybe

      private boolean isMaybe(int norm16)
    • isMaybeOrNonZeroCC

      private boolean isMaybeOrNonZeroCC(int norm16)
    • isInert

      private static boolean isInert(int norm16)
    • isJamoL

      private static boolean isJamoL(int norm16)
    • isJamoVT

      private static boolean isJamoVT(int norm16)
    • hangulLVT

      private int hangulLVT()
    • isHangulLV

      private boolean isHangulLV(int norm16)
    • isHangulLVT

      private boolean isHangulLVT(int norm16)
    • isCompYesAndZeroCC

      private boolean isCompYesAndZeroCC(int norm16)
    • isDecompYesAndZeroCC

      private boolean isDecompYesAndZeroCC(int norm16)
    • isMostDecompYesAndZeroCC

      private boolean isMostDecompYesAndZeroCC(int norm16)
      A little faster and simpler than isDecompYesAndZeroCC() but does not include the MaybeYes which combine-forward and have ccc=0. (Standard Unicode 10 normalization does not have such characters.)
    • isDecompNoAlgorithmic

      private boolean isDecompNoAlgorithmic(int norm16)
    • getCCFromNoNo

      private int getCCFromNoNo(int norm16)
    • getTrailCCFromCompYesAndZeroCC

      int getTrailCCFromCompYesAndZeroCC(int norm16)
    • mapAlgorithmic

      private int mapAlgorithmic(int c, int norm16)
    • getCompositionsListForDecompYes

      private int getCompositionsListForDecompYes(int norm16)
      Returns:
      index into maybeYesCompositions, or -1
    • getCompositionsListForComposite

      private int getCompositionsListForComposite(int norm16)
      Returns:
      index into maybeYesCompositions
    • getCompositionsListForMaybe

      private int getCompositionsListForMaybe(int norm16)
    • getCompositionsList

      private int getCompositionsList(int norm16)
      Parameters:
      norm16 - norm16 value of a code point that must have compositions
      Returns:
      index into maybeYesCompositions
    • decomposeShort

      private int decomposeShort(CharSequence s, int src, int limit, boolean stopAtCompBoundary, boolean onlyContiguous, Normalizer2Impl.ReorderingBuffer buffer)
    • decompose

      private void decompose(int c, int norm16, Normalizer2Impl.ReorderingBuffer buffer)
    • combine

      private static int combine(String compositions, int list, int trail)
      Finds the recomposition result for a forward-combining "lead" character, specified with a pointer to its compositions list, and a backward-combining "trail" character.

      If the lead and trail characters combine, then this function returns the following "compositeAndFwd" value:

       Bits 21..1  composite character
       Bit      0  set if the composite is a forward-combining starter
       
      otherwise it returns -1.

      The compositions list has (trail, compositeAndFwd) pair entries, encoded as either pairs or triples of 16-bit units. The last entry has the high bit of its first unit set.

      The list is sorted by ascending trail characters (there are no duplicates). A linear search is used.

      See normalizer2impl.h for a more detailed description of the compositions list format.
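      The "compositeAndFwd" layout described above can be unpacked as in the following sketch. It is illustrative only and covers just the return value, not the compositions list encoding; the class and method names are hypothetical, and the sample pairing is used purely as an example.

```java
// Illustrative helpers for the "compositeAndFwd" return value; not ICU code.
final class CompositeAndFwdSketch {
    /** A negative value (-1) means the lead and trail characters do not combine. */
    static boolean combined(int compositeAndFwd) {
        return compositeAndFwd >= 0;
    }

    /** Composite code point, stored in bits 21..1. */
    static int composite(int compositeAndFwd) {
        return compositeAndFwd >> 1;
    }

    /** Bit 0: set if the composite is itself a forward-combining starter. */
    static boolean isForwardCombining(int compositeAndFwd) {
        return (compositeAndFwd & 1) != 0;
    }

    /** Packs a code point and the forward-combining flag (for demonstration only). */
    static int pack(int composite, boolean forwardCombining) {
        return (composite << 1) | (forwardCombining ? 1 : 0);
    }

    public static void main(String[] args) {
        int value = pack(0x00C5, true); // example composite U+00C5 with the flag set
        if (combined(value)) {
            System.out.println(Integer.toHexString(composite(value))
                    + " fwd=" + isForwardCombining(value));
        }
    }
}
```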

    • addComposites

      private void addComposites(int list, UnicodeSet set)
      Parameters:
      list - some character's compositions list
      set - recursively receives the composites from these compositions
    • recompose

      private void recompose(Normalizer2Impl.ReorderingBuffer buffer, int recomposeStartIndex, boolean onlyContiguous)
    • composePair

      public int composePair(int a, int b)
    • hasCompBoundaryBefore

      private boolean hasCompBoundaryBefore(int c, int norm16)
      Does c have a composition boundary before it? True if its decomposition begins with a character that has ccc=0 && NFC_QC=Yes (isCompYesAndZeroCC()). As a shortcut, this is true if c itself has ccc=0 && NFC_QC=Yes (isCompYesAndZeroCC()) so we need not decompose.
    • norm16HasCompBoundaryBefore

      private boolean norm16HasCompBoundaryBefore(int norm16)
    • hasCompBoundaryBefore

      private boolean hasCompBoundaryBefore(CharSequence s, int src, int limit)
    • norm16HasCompBoundaryAfter

      private boolean norm16HasCompBoundaryAfter(int norm16, boolean onlyContiguous)
    • hasCompBoundaryAfter

      private boolean hasCompBoundaryAfter(CharSequence s, int start, int p, boolean onlyContiguous)
    • isTrailCC01ForCompBoundaryAfter

      private boolean isTrailCC01ForCompBoundaryAfter(int norm16)
      For FCC: Given norm16 HAS_COMP_BOUNDARY_AFTER, does it have tccc<=1?
    • findPreviousCompBoundary

      private int findPreviousCompBoundary(CharSequence s, int p, boolean onlyContiguous)
    • findNextCompBoundary

      private int findNextCompBoundary(CharSequence s, int p, int limit, boolean onlyContiguous)
    • findPreviousFCDBoundary

      private int findPreviousFCDBoundary(CharSequence s, int p)
    • findNextFCDBoundary

      private int findNextFCDBoundary(CharSequence s, int p, int limit)
    • getPreviousTrailCC

      private int getPreviousTrailCC(CharSequence s, int start, int p)
    • addToStartSet

      private void addToStartSet(MutableCodePointTrie mutableTrie, int origin, int decompLead)