Class MergingDigest

All Implemented Interfaces:
Serializable

public class MergingDigest extends AbstractTDigest
Maintains a t-digest by collecting new points in a buffer that is then sorted occasionally and merged into a sorted array that contains previously computed centroids.

This can be very fast because the cost of sorting and merging is amortized over several insertions. If we keep N centroids total and the input buffer holds k points, then the amortized cost per insertion is something like

N/k + log k

These costs even out when N/k = log k. Balancing costs is often a good place to start in optimizing an algorithm. For different values of the compression factor, the following table shows estimated asymptotic values of N and suggested values of k:

Compression    N      k
50             78     25
100            157    42
200            314    73

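As an illustrative check of this balance (using the compression = 100 row above; not a figure from the original documentation):

    N/k ≈ 157/42 ≈ 3.7 ≈ ln 42

so the per-point share of the merge cost and the per-point cost of sorting the buffer are roughly equal at the suggested k.
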
The virtues of this kind of t-digest implementation include:

  • No allocation is required after initialization
  • The data structure automatically compresses existing centroids when possible
  • No Java object overhead is incurred for centroids since data is kept in primitive arrays

The current implementation takes the liberty of using ping-pong buffers for implementing the merge, resulting in a substantial memory penalty. The complexity of an in-place merge was not considered worthwhile, however, since even with this overhead the memory cost is less than 40 bytes per centroid, which is much less than half of what AVLTreeDigest uses. Speed tests are still not complete, so it is uncertain whether the merge strategy is faster than the tree strategy.
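
A minimal usage sketch based on the methods documented below (the compression factor and sample values are illustrative):

    MergingDigest digest = new MergingDigest(100);       // compression factor of 100
    for (double x : new double[] {1.0, 2.5, 3.7, 9.2}) {
        digest.add(x, 1);                                 // add unit-weight samples
    }
    double median = digest.quantile(0.5);                 // estimated 50th percentile
    double below3 = digest.cdf(3.0);                      // fraction of samples <= 3.0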

  • Field Details

    • compression

      private final double compression
    • lastUsedCell

      private int lastUsedCell
    • totalWeight

      private double totalWeight
    • weight

      private final double[] weight
    • mean

      private final double[] mean
    • data

      private List<List<Double>> data
    • unmergedWeight

      private double unmergedWeight
    • tempUsed

      private int tempUsed
    • tempWeight

      private final double[] tempWeight
    • tempMean

      private final double[] tempMean
    • tempData

      private List<List<Double>> tempData
    • order

      private final int[] order
    • usePieceWiseApproximation

      private static boolean usePieceWiseApproximation
    • useWeightLimit

      private static boolean useWeightLimit
  • Constructor Details

    • MergingDigest

      public MergingDigest(double compression)
      Allocates a buffer merging t-digest. This is the normally used constructor that allocates default-sized internal arrays. Other versions are available but should only be used for special cases.
      Parameters:
      compression - The compression factor
    • MergingDigest

      public MergingDigest(double compression, int bufferSize)
      If you know the size of the temporary buffer for incoming points, you can use this entry point.
      Parameters:
      compression - Compression factor for t-digest. Same as 1/δ in the paper.
      bufferSize - How many samples to retain before merging.
    • MergingDigest

      public MergingDigest(double compression, int bufferSize, int size)
      Fully specified constructor. Normally only used for deserializing a buffer t-digest.
      Parameters:
      compression - Compression factor
      bufferSize - Number of temporary centroids
      size - Size of main buffer
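
      A sketch of using the non-default constructors above; the buffer size chosen here is purely illustrative, not a recommendation:

        // A larger temporary buffer trades memory for fewer merge passes.
        MergingDigest digest = new MergingDigest(200, 5000);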
  • Method Details

    • recordAllData

      public TDigest recordAllData()
      Turns on internal data recording.
      Overrides:
      recordAllData in class AbstractTDigest
      Returns:
      This TDigest so that configurations can be done in fluent style.
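
      A sketch of the fluent style mentioned above (the compression value is illustrative):

        TDigest digest = new MergingDigest(100).recordAllData();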
    • add

      void add(double x, int w, Centroid base)
      Specified by:
      add in class AbstractTDigest
    • add

      public void add(double x, int w)
      Description copied from class: TDigest
      Adds a sample to a histogram.
      Specified by:
      add in class TDigest
      Parameters:
      x - The value to add.
      w - The weight of this point.
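
      A sketch of adding pre-aggregated samples, where each value carries an integer count (the values are illustrative):

        MergingDigest digest = new MergingDigest(100);
        digest.add(12.5, 3);   // the value 12.5 was observed 3 times
        digest.add(99.0, 1);   // the value 99.0 was observed once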
    • add

      private void add(double x, int w, List<Double> history)
    • add

      private void add(double[] m, double[] w, int count, List<List<Double>> data)
    • add

      public void add(List<? extends TDigest> others)
      Specified by:
      add in class TDigest
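
      A sketch of folding several digests into one, assuming the standard java.util.Arrays and java.util.List classes (values are illustrative):

        MergingDigest d1 = new MergingDigest(100);
        MergingDigest d2 = new MergingDigest(100);
        d1.add(1.0, 1);
        d2.add(2.0, 1);

        MergingDigest combined = new MergingDigest(100);
        combined.add(Arrays.asList(d1, d2));   // merges both digests into combined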
    • mergeNewValues

      private void mergeNewValues()
    • merge

      private void merge(double[] incomingMean, double[] incomingWeight, int incomingCount, List<List<Double>> incomingData, int[] incomingOrder, double unmergedWeight)
    • checkWeights

      int checkWeights()
      Exposed for testing.
    • checkWeights

      private int checkWeights(double[] w, double total, int last)
    • integratedLocation

      private double integratedLocation(double q)
      Converts a quantile into a centroid scale value. The centroid scale is nominally the number k of the centroid that a quantile point q should belong to. Due to round-offs, however, we can't align things perfectly without splitting points and centroids. We don't want to do that, so we have to allow for offsets. In the end, the criterion is that any quantile range that spans a centroid scale range of more than one should be split across more than one centroid if possible. This won't be possible if the quantile range refers to a single point or an already existing centroid.

      This mapping is steep near q=0 or q=1, so each centroid there corresponds to a smaller range of q. Near q=0.5, the mapping is flatter, so centroids there represent a larger chunk of quantiles.

      Parameters:
      q - The quantile scale value to be mapped.
      Returns:
      The centroid scale value corresponding to q.
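
      One common form of such a mapping, shown here only to illustrate the behaviour described above (steep near the tails, flat near the median) rather than as the exact private implementation:

        // Maps q in [0, 1] to a centroid scale value k in [0, compression].
        // The derivative of asin(2q - 1) is largest near q = 0 and q = 1,
        // so centroids near the tails cover a smaller range of q.
        static double scaleSketch(double compression, double q) {
            return compression * (Math.asin(2 * q - 1) + Math.PI / 2) / Math.PI;
        }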
    • integratedQ

      private double integratedQ(double k)
    • asinApproximation

      static double asinApproximation(double x)
    • eval

      private static double eval(double[] model, double[] vars)
    • bound

      private static double bound(double v)
    • compress

      public void compress()
      Description copied from class: TDigest
      Re-examines a t-digest to determine whether some centroids are redundant. If your data are perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space. The cost is roughly the same as adding as many data points as there are centroids. This is typically < 10 * compression, but could be as high as 100 * compression. This is a destructive operation that is not thread-safe.
      Specified by:
      compress in class TDigest
    • size

      public long size()
      Description copied from class: TDigest
      Returns the number of points that have been added to this TDigest.
      Specified by:
      size in class TDigest
      Returns:
      The sum of the weights on all centroids.
    • cdf

      public double cdf(double x)
      Description copied from class: TDigest
      Returns the fraction of all points added which are <= x.
      Specified by:
      cdf in class TDigest
    • quantile

      public double quantile(double q)
      Description copied from class: TDigest
      Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
      Specified by:
      quantile in class TDigest
      Parameters:
      q - The desired fraction
      Returns:
      The value x such that cdf(x) == q
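
      A sketch of querying quantiles and checking them against cdf (values are illustrative):

        double median = digest.quantile(0.5);
        double p99 = digest.quantile(0.99);
        double check = digest.cdf(p99);   // approximately 0.99 by construction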
    • centroidCount

      public int centroidCount()
      Specified by:
      centroidCount in class TDigest
    • centroids

      public Collection<Centroid> centroids()
      Description copied from class: TDigest
      A Collection that lets you go through the centroids in ascending order by mean. Centroids returned will not be re-used, but may or may not share storage with this TDigest.
      Specified by:
      centroids in class TDigest
      Returns:
      The centroids in the form of a Collection.
    • compression

      public double compression()
      Description copied from class: TDigest
      Returns the current compression factor.
      Specified by:
      compression in class TDigest
      Returns:
      The compression factor originally used to set up the TDigest.
    • byteSize

      public int byteSize()
      Description copied from class: TDigest
      Returns the number of bytes required to encode this TDigest using #asBytes().
      Specified by:
      byteSize in class TDigest
      Returns:
      The number of bytes required.
    • smallByteSize

      public int smallByteSize()
      Description copied from class: TDigest
      Returns the number of bytes required to encode this TDigest using #asSmallBytes(). Note that this is just as expensive as actually compressing the digest. If you don't care about time but want to never over-allocate, this is fine. If you care about compression and speed, you pretty much just have to over-allocate by allocating #byteSize() bytes.
      Specified by:
      smallByteSize in class TDigest
      Returns:
      The number of bytes required.
    • asBytes

      public void asBytes(ByteBuffer buf)
      Description copied from class: TDigest
      Serialize this TDigest into a byte buffer. Note that the serialization used is very straightforward and is considerably larger than strictly necessary.
      Specified by:
      asBytes in class TDigest
      Parameters:
      buf - The byte buffer into which the TDigest should be serialized.
    • asSmallBytes

      public void asSmallBytes(ByteBuffer buf)
      Description copied from class: TDigest
      Serialize this TDigest into a byte buffer. Some simple compression is used such as using variable byte representation to store the centroid weights and using delta-encoding on the centroid means so that floats can be reasonably used to store the centroid means.
      Specified by:
      asSmallBytes in class TDigest
      Parameters:
      buf - The byte buffer into which the TDigest should be serialized.
    • fromBytes

      public static MergingDigest fromBytes(ByteBuffer buf)
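
      A sketch of a serialization round trip using the methods above, assuming java.nio.ByteBuffer and that fromBytes reads from the buffer's current position:

        MergingDigest digest = new MergingDigest(100);
        digest.add(3.14, 1);

        ByteBuffer buf = ByteBuffer.allocate(digest.byteSize());   // large enough for asBytes
        digest.asBytes(buf);
        buf.flip();                                                 // rewind before reading
        MergingDigest copy = MergingDigest.fromBytes(buf);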