Class DaitchMokotoffSoundex

java.lang.Object
org.apache.commons.codec.language.DaitchMokotoffSoundex
All Implemented Interfaces:
Encoder, StringEncoder

public class DaitchMokotoffSoundex extends Object implements StringEncoder
Encodes a string into a Daitch-Mokotoff Soundex value.

The Daitch-Mokotoff Soundex algorithm is a refinement of the Russell and American Soundex algorithms. It yields greater accuracy when matching surnames, especially Slavic and Yiddish ones, that are pronounced similarly but spelled differently.

The main differences compared to the other soundex variants are:

  • coded names are 6 digits long
  • the initial character of the name is coded
  • rules to encode multi-character n-grams
  • multiple possible encodings for the same name (branching)

This implementation supports branching, depending on the method used:

  • encode(String) - branching disabled, only the first code will be returned
  • soundex(String) - branching enabled, all codes will be returned, separated by '|'

Note: this implementation has additional branching rules compared to the original description of the algorithm. The rules can be customized by overriding the default rules contained in the resource file org/apache/commons/codec/language/dmrules.txt.

This class is thread-safe.

Since:
1.10
  • Field Details

  • Constructor Details

    • DaitchMokotoffSoundex

      public DaitchMokotoffSoundex()
      Creates a new instance with ASCII-folding enabled.
    • DaitchMokotoffSoundex

      public DaitchMokotoffSoundex(boolean folding)
      Creates a new instance.

      With ASCII-folding enabled, certain accented characters will be transformed to equivalent ASCII characters, e.g. è -> e.

      Parameters:
      folding - whether ASCII-folding shall be performed before encoding
  • Method Details

    • parseRules

      private static void parseRules(Scanner scanner, String location, Map<Character,List<DaitchMokotoffSoundex.Rule>> ruleMapping, Map<Character,Character> asciiFoldings)
    • stripQuotes

      private static String stripQuotes(String str)
    • cleanup

      private String cleanup(String input)
      Performs a cleanup of the input string before the actual soundex transformation.

      Removes all whitespace characters and performs ASCII folding if enabled.

      Parameters:
      input - the input string to clean up
      Returns:
      a cleaned up string
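
The cleanup step described above can be illustrated with a minimal sketch. This is not the library's code: the class name CleanupSketch and the two-entry folding table are hypothetical, chosen only for illustration, and the real foldings are loaded from the rules resource. The sketch performs only the two operations named in the description, namely dropping whitespace and folding accented characters when enabled.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the implementation used by DaitchMokotoffSoundex.
final class CleanupSketch {

    // Hypothetical folding table (assumption): the real mappings come from the rules resource.
    private static final Map<Character, Character> FOLDINGS = new HashMap<>();
    static {
        FOLDINGS.put('\u00e8', 'e'); // e with grave accent -> e
        FOLDINGS.put('\u00e9', 'e'); // e with acute accent -> e
    }

    // Removes all whitespace and, if folding is enabled, replaces mapped characters.
    static String cleanup(String input, boolean folding) {
        StringBuilder sb = new StringBuilder(input.length());
        for (char ch : input.toCharArray()) {
            if (Character.isWhitespace(ch)) {
                continue; // whitespace never reaches the encoder
            }
            if (folding && FOLDINGS.containsKey(ch)) {
                ch = FOLDINGS.get(ch); // swap the accented character for its ASCII equivalent
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(cleanup("M\u00e8 yer", true));
    }
}
```

With folding disabled, the same call leaves the accented character in place and removes only the whitespace.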
    • encode

      public Object encode(Object obj) throws EncoderException
      Encodes an Object using the Daitch-Mokotoff soundex algorithm without branching.

      This method is provided in order to satisfy the requirements of the Encoder interface, and will throw an EncoderException if the supplied object is not of type java.lang.String.

      Specified by:
      encode in interface Encoder
      Parameters:
      obj - Object to encode
      Returns:
      An object (of type java.lang.String) containing the DM soundex code, which corresponds to the String supplied.
      Throws:
      EncoderException - if the parameter supplied is not of type java.lang.String
      IllegalArgumentException - if a character is not mapped
    • encode

      public String encode(String source)
      Encodes a String using the Daitch-Mokotoff soundex algorithm without branching.
      Specified by:
      encode in interface StringEncoder
      Parameters:
      source - A String object to encode
      Returns:
      A DM Soundex code corresponding to the String supplied
      Throws:
      IllegalArgumentException - if a character is not mapped
    • soundex

      public String soundex(String source)
      Encodes a String using the Daitch-Mokotoff soundex algorithm with branching.

      In case a string is encoded into multiple codes (see branching rules), the result will contain all codes, separated by '|'.

      Example: the name "AUERBACH" is encoded as both

      • 097400
      • 097500

      Thus the result will be "097400|097500".

      Parameters:
      source - A String object to encode
      Returns:
      A string containing a set of DM Soundex codes corresponding to the String supplied
      Throws:
      IllegalArgumentException - if a character is not mapped
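
Because the branching result is a single '|'-separated string, callers will usually want to split it before comparing names. The sketch below shows one way to do that with the standard library only. The class DmCodes and its methods are hypothetical helpers, not part of Commons Codec, and the input is the "AUERBACH" result quoted above rather than a live call to soundex(String).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for working with branched DM Soundex results.
final class DmCodes {

    // Splits a branched result such as "097400|097500" into its individual codes.
    static List<String> split(String branched) {
        // '|' is a regex metacharacter, so it must be escaped for String.split.
        return Arrays.asList(branched.split("\\|"));
    }

    // Returns true if the two branched results have at least one code in common.
    static boolean shareCode(String a, String b) {
        List<String> codesA = split(a);
        for (String code : split(b)) {
            if (codesA.contains(code)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(split("097400|097500"));
    }
}
```

Treating two names as a match when any of their codes coincide is one common design choice for branching encoders; stricter policies, such as requiring all codes to match, are equally possible.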
    • soundex

      private String[] soundex(String source, boolean branching)
      Performs the actual DM Soundex algorithm on the input string.
      Parameters:
      source - A String object to encode
      branching - If branching shall be performed
      Returns:
      A string array containing all DM Soundex codes corresponding to the String supplied, depending on the selected branching mode