Class KMeansPlusPlusClusterer<T extends Clusterable>

  • Type Parameters:
    T - type of the points to cluster

    public class KMeansPlusPlusClusterer<T extends Clusterable>
    extends Clusterer<T>
    Clustering algorithm based on David Arthur and Sergei Vassilvitski k-means++ algorithm.
    Since:
    3.2
    See Also:
    K-means++ (wikipedia)
    • Field Detail

      • k

        private final int k
        The number of clusters.
      • maxIterations

        private final int maxIterations
        The maximum number of iterations.
      • random

        private final RandomGenerator random
        Random generator for choosing initial centers.
    • Constructor Detail

      • KMeansPlusPlusClusterer

        public KMeansPlusPlusClusterer​(int k)
        Build a clusterer.

        The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

        The euclidean distance will be used as default distance measure.

        Parameters:
        k - the number of clusters to split the data into
      • KMeansPlusPlusClusterer

        public KMeansPlusPlusClusterer​(int k,
                                       int maxIterations)
        Build a clusterer.

        The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

        The euclidean distance will be used as default distance measure.

        Parameters:
        k - the number of clusters to split the data into
        maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
      • KMeansPlusPlusClusterer

        public KMeansPlusPlusClusterer​(int k,
                                       int maxIterations,
                                       DistanceMeasure measure)
        Build a clusterer.

        The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

        Parameters:
        k - the number of clusters to split the data into
        maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
        measure - the distance measure to use
      • KMeansPlusPlusClusterer

        public KMeansPlusPlusClusterer​(int k,
                                       int maxIterations,
                                       DistanceMeasure measure,
                                       RandomGenerator random)
        Build a clusterer.

        The default strategy for handling empty clusters that may appear during algorithm iterations is to split the cluster with largest distance variance.

        Parameters:
        k - the number of clusters to split the data into
        maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
        measure - the distance measure to use
        random - random generator to use for choosing initial centers
      • KMeansPlusPlusClusterer

        public KMeansPlusPlusClusterer​(int k,
                                       int maxIterations,
                                       DistanceMeasure measure,
                                       RandomGenerator random,
                                       KMeansPlusPlusClusterer.EmptyClusterStrategy emptyStrategy)
        Build a clusterer.
        Parameters:
        k - the number of clusters to split the data into
        maxIterations - the maximum number of iterations to run the algorithm for. If negative, no maximum will be used.
        measure - the distance measure to use
        random - random generator to use for choosing initial centers
        emptyStrategy - strategy to use for handling empty clusters that may appear during algorithm iterations
    • Method Detail

      • getK

        public int getK()
        Return the number of clusters this instance will use.
        Returns:
        the number of clusters
      • getMaxIterations

        public int getMaxIterations()
        Returns the maximum number of iterations this instance will use.
        Returns:
        the maximum number of iterations, or -1 if no maximum is set
      • getRandomGenerator

        public RandomGenerator getRandomGenerator()
        Returns the random generator this instance will use.
        Returns:
        the random generator
      • assignPointsToClusters

        private int assignPointsToClusters​(java.util.List<CentroidCluster<T>> clusters,
                                           java.util.Collection<T> points,
                                           int[] assignments)
        Adds the given points to the closest Cluster.
        Parameters:
        clusters - the Clusters to add the points to
        points - the points to add to the given Clusters
        assignments - points assignments to clusters
        Returns:
        the number of points assigned to different clusters as the iteration before
      • chooseInitialCenters

        private java.util.List<CentroidCluster<T>> chooseInitialCenters​(java.util.Collection<T> points)
        Use K-means++ to choose the initial centers.
        Parameters:
        points - the points to choose the initial centers from
        Returns:
        the initial centers
      • getPointFromLargestVarianceCluster

        private T getPointFromLargestVarianceCluster​(java.util.Collection<CentroidCluster<T>> clusters)
                                              throws ConvergenceException
        Get a random point from the Cluster with the largest distance variance.
        Parameters:
        clusters - the Clusters to search
        Returns:
        a random point from the selected cluster
        Throws:
        ConvergenceException - if clusters are all empty
      • getPointFromLargestNumberCluster

        private T getPointFromLargestNumberCluster​(java.util.Collection<? extends Cluster<T>> clusters)
                                            throws ConvergenceException
        Get a random point from the Cluster with the largest number of points
        Parameters:
        clusters - the Clusters to search
        Returns:
        a random point from the selected cluster
        Throws:
        ConvergenceException - if clusters are all empty
      • getFarthestPoint

        private T getFarthestPoint​(java.util.Collection<CentroidCluster<T>> clusters)
                            throws ConvergenceException
        Get the point farthest to its cluster center
        Parameters:
        clusters - the Clusters to search
        Returns:
        point farthest to its cluster center
        Throws:
        ConvergenceException - if clusters are all empty
      • getNearestCluster

        private int getNearestCluster​(java.util.Collection<CentroidCluster<T>> clusters,
                                      T point)
        Returns the nearest Cluster to the given point
        Parameters:
        clusters - the Clusters to search
        point - the point to find the nearest Cluster for
        Returns:
        the index of the nearest Cluster to the given point
      • centroidOf

        private Clusterable centroidOf​(java.util.Collection<T> points,
                                       int dimension)
        Computes the centroid for a set of points.
        Parameters:
        points - the set of points
        dimension - the point dimension
        Returns:
        the computed centroid for the set of points