Package org.apache.spark.mllib.clustering

Type Members

  1. class BisectingKMeans extends Logging

    A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
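
    The priority rule in the last sentence can be sketched in plain Scala. This is only an illustration under stated assumptions, not Spark's implementation: the object name, the function `chooseToBisect`, and the representation of clusters as a map from a hypothetical cluster id to its size are all invented here. It assumes each bisection replaces one leaf with two, so it adds exactly one leaf.

```scala
object BisectPrioritySketch {
  // Hypothetical helper: pick which divisible leaf clusters to bisect on this level.
  // divisibleSizes maps an illustrative cluster id to the number of points it holds.
  def chooseToBisect(divisibleSizes: Map[Int, Long], numLeaves: Int, k: Int): Seq[Int] = {
    // Each bisection adds one leaf, so at most (k - numLeaves) clusters may be split.
    val budget = math.max(k - numLeaves, 0)
    divisibleSizes.toSeq
      .sortBy { case (_, size) => -size } // larger clusters get higher priority
      .take(budget)
      .map { case (id, _) => id }
  }
}
```

    With three divisible leaves of sizes 100, 50, and 10 and a target of k = 5, only the two largest would be bisected.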

    Annotations
    @Since( "1.6.0" )
    See also

    Steinbach, Karypis, and Kumar, A comparison of document clustering techniques, KDD Workshop on Text Mining, 2000.

  2. class BisectingKMeansModel extends Serializable with Saveable with Logging

    Clustering model produced by BisectingKMeans. The prediction is done level-by-level from the root node to a leaf node: at each node, the child closest to the input point is selected.
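
    A minimal sketch of this level-by-level descent, assuming a simple tree of cluster centers and squared Euclidean distance. The `Node` type and the function names are hypothetical and are not Spark's internal tree representation.

```scala
object TreePredictSketch {
  // Hypothetical node type for illustration; a node with no children is a leaf cluster.
  final case class Node(center: Array[Double], children: Seq[Node] = Seq.empty)

  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Descend from the given node, moving to the closest child at each level,
  // and return the leaf that is finally reached.
  @scala.annotation.tailrec
  def predict(node: Node, point: Array[Double]): Node =
    if (node.children.isEmpty) node
    else predict(node.children.minBy(c => squaredDistance(c.center, point)), point)
}
```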

    Annotations
    @Since( "1.6.0" )
  3. class DistributedLDAModel extends LDAModel

    Distributed LDA model. This model stores the inferred topics, the full training dataset, and the topic distributions.

    Annotations
    @Since( "1.3.0" )
  4. final class EMLDAOptimizer extends LDAOptimizer

    :: DeveloperApi ::

    Optimizer for the EM algorithm, which stores the data and parameter graph plus the algorithm parameters.

    Currently, the underlying implementation uses Expectation-Maximization (EM), implemented according to the Asuncion et al. (2009) paper referenced below.

    References:

    • Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
      • This class implements their "smoothed" LDA model.
    • Paper which clearly explains several algorithms, including EM: Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.
    Annotations
    @Since( "1.4.0" ) @DeveloperApi()
  5. class GaussianMixture extends Serializable

    This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions, with associated "mixing" weights specifying the contribution of each component to the composite.

    Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
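
    The stopping rule described above can be sketched independently of the EM details. In this hypothetical helper, `step` stands in for one EM iteration that returns the updated state together with its log-likelihood; the names `run`, `step`, and the generic state type are assumptions made for illustration and are not part of the GaussianMixture API.

```scala
object EmStoppingSketch {
  // Runs `step` until the log-likelihood changes by less than convergenceTol
  // or maxIterations is reached. Returns the final state and the iteration count.
  def run[S](init: S, maxIterations: Int, convergenceTol: Double)(
      step: S => (S, Double)): (S, Int) = {
    var state = init
    var previous = Double.NegativeInfinity // no log-likelihood seen yet
    var iterations = 0
    var converged = false
    while (iterations < maxIterations && !converged) {
      val (next, logLikelihood) = step(state)
      converged = math.abs(logLikelihood - previous) < convergenceTol
      previous = logLikelihood
      state = next
      iterations += 1
    }
    (state, iterations)
  }
}
```

    Because the loop only tracks the change in log-likelihood, it stops at whatever stationary point EM reaches, which need not be a global optimum.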

    Annotations
    @Since( "1.3.0" )
    Note

    This algorithm is limited in the number of features it can handle, since it requires storing a covariance matrix whose size is quadratic in the number of features. Even when the number of features does not exceed this limit, the algorithm may perform poorly on high-dimensional data, because high-dimensional data (a) is difficult to cluster at all (based on statistical/theoretical arguments) and (b) causes numerical issues with Gaussian distributions.

  6. class GaussianMixtureModel extends Serializable with Saveable

    Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective mean and covariance for each Gaussian distribution i=1..k.
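
    To make the roles of w(i), mu(i), and sigma(i) concrete, here is a sketch of the mixture density and the soft cluster assignment. It is simplified to the univariate case (a scalar variance instead of a covariance matrix), and the `Component` type and function names are hypothetical, not the GaussianMixtureModel API.

```scala
object GmmDensitySketch {
  // One mixture component: mixing weight w(i), mean mu(i), and (scalar) variance.
  final case class Component(weight: Double, mean: Double, variance: Double)

  def gaussianPdf(x: Double, mean: Double, variance: Double): Double =
    math.exp(-(x - mean) * (x - mean) / (2.0 * variance)) /
      math.sqrt(2.0 * math.Pi * variance)

  // Mixture density: the weighted sum of the component densities.
  def density(x: Double, components: Seq[Component]): Double =
    components.map(c => c.weight * gaussianPdf(x, c.mean, c.variance)).sum

  // Soft assignment: the posterior probability that x was drawn from each component.
  def responsibilities(x: Double, components: Seq[Component]): Seq[Double] = {
    val weighted = components.map(c => c.weight * gaussianPdf(x, c.mean, c.variance))
    val total = weighted.sum
    weighted.map(_ / total)
  }
}
```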

    Annotations
    @Since( "1.3.0" )
  7. class KMeans extends Serializable with Logging

    K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al.).

    This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.

    Annotations
    @Since( "0.8.0" )
  8. class KMeansModel extends Saveable with Serializable with PMMLExportable

    A clustering model for K-means. Each point belongs to the cluster with the closest center.
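
    The assignment rule can be sketched as follows, assuming squared Euclidean distance and centers held in a local collection. The object and function names are hypothetical; the `cost` helper shows the sum of squared distances to the closest centers that this assignment implies.

```scala
object ClosestCenterSketch {
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to the point.
  def predict(centers: IndexedSeq[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), point))

  // Sum of squared distances from each point to its closest center.
  def cost(centers: IndexedSeq[Array[Double]], points: Seq[Array[Double]]): Double =
    points.map(p => squaredDistance(centers(predict(centers, p)), p)).sum
}
```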

    Annotations
    @Since( "0.8.0" )
  9. class LDA extends Logging

    Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

    Terminology:

    • "word" = "term": an element of the vocabulary
    • "token": instance of a term appearing in a document
    • "topic": multinomial distribution over words representing some concept
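
    The distinction between terms and tokens can be illustrated by turning tokenized documents into term-count vectors, the bag-of-words form in which documents are typically passed to LDA. This is a plain-Scala sketch; the object and function names are hypothetical.

```scala
object LdaTermsSketch {
  // The vocabulary is the set of distinct terms, here in sorted order.
  def vocabulary(docs: Seq[Seq[String]]): Vector[String] =
    docs.flatten.distinct.sorted.toVector

  // Count how many tokens of each vocabulary term appear in one document.
  def termCounts(tokens: Seq[String], vocab: Vector[String]): Array[Double] = {
    val index = vocab.zipWithIndex.toMap
    val counts = Array.fill(vocab.size)(0.0)
    // Tokens whose term is not in the vocabulary are ignored.
    tokens.foreach(t => index.get(t).foreach(i => counts(i) += 1.0))
    counts
  }
}
```

    In the document `Seq("a", "b", "a")` there are three tokens but only two terms; the term "a" has count 2.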

    References:

    • Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
    Annotations
    @Since( "1.3.0" )
    See also

    Latent Dirichlet allocation (Wikipedia)

  10. abstract class LDAModel extends Saveable

    Latent Dirichlet Allocation (LDA) model.

    This abstraction permits different underlying representations, including local and distributed data structures.

    Annotations
    @Since( "1.3.0" )
  11. trait LDAOptimizer extends AnyRef

    :: DeveloperApi ::

    An LDAOptimizer specifies which optimization/learning/inference algorithm to use, and it can hold optimizer-specific parameters for users to set.

    Annotations
    @Since( "1.4.0" ) @DeveloperApi()
  12. class LocalLDAModel extends LDAModel with Serializable

    Local LDA model. This model stores only the inferred topics.

    Annotations
    @Since( "1.3.0" )
  13. final class OnlineLDAOptimizer extends LDAOptimizer

    :: DeveloperApi ::

    An online optimizer for LDA. The optimizer implements the online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration and updates the term-topic distribution adaptively.

    Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010.

    Annotations
    @Since( "1.4.0" ) @DeveloperApi()
  14. class PowerIterationClustering extends Serializable

    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
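
    The embedding step can be sketched on a small dense matrix. This is an illustration only, assuming a row-normalized affinity matrix and L1 normalization of the vector after each multiplication; it is not the distributed implementation, and all names are hypothetical. Clustering the resulting one-dimensional embedding (for example with k-means) is left out.

```scala
object PowerIterationSketch {
  type Matrix = Array[Array[Double]]

  // Normalize each row of the pairwise similarity matrix to sum to 1.
  def rowNormalize(affinity: Matrix): Matrix =
    affinity.map { row => val s = row.sum; row.map(_ / s) }

  // One power-iteration step: multiply by the matrix, then rescale to unit L1 norm.
  def step(w: Matrix, v: Array[Double]): Array[Double] = {
    val next = w.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
    val norm = next.map(math.abs).sum
    next.map(_ / norm)
  }

  // Truncated power iteration: stop after a fixed, small number of steps.
  def embed(affinity: Matrix, init: Array[Double], iterations: Int): Array[Double] = {
    val w = rowNormalize(affinity)
    (0 until iterations).foldLeft(init)((v, _) => step(w, v))
  }
}
```

    Stopping early matters: run to convergence, the vector would become constant, whereas after a few steps points in the same tightly connected group already share nearly the same value while different groups remain separated.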

    Annotations
    @Since( "1.3.0" )
    See also

    Spectral clustering (Wikipedia)

  15. class PowerIterationClusteringModel extends Saveable with Serializable

    Model produced by PowerIterationClustering.

    Annotations
    @Since( "1.3.0" )
  16. class StreamingKMeans extends Logging with Serializable

    StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming data, and using the model to make predictions on streaming data. See StreamingKMeansModel for details on the algorithm and update rules.

    Use a builder pattern to construct a streaming k-means analysis in an application, like:

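    import org.apache.spark.mllib.clustering.StreamingKMeans

    // trainingData is assumed to be a DStream[Vector] created elsewhere.
    val model = new StreamingKMeans()
      .setDecayFactor(0.5)
      .setK(3)
      .setRandomCenters(5, 100.0) // 5-dimensional centers, initial weight 100.0
    // trainOn returns Unit, so call it on the configured model rather than chaining it.
    model.trainOn(trainingData)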
    Annotations
    @Since( "1.2.0" )
  17. class StreamingKMeansModel extends KMeansModel with Logging

    StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.

    The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:

    $$ \begin{align} c_{t+1} &= \frac{c_t n_t a + x_t m_t}{n_t a + m_t} \\ n_{t+1} &= n_t a + m_t \end{align} $$

    Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.

    The decay factor a scales the contribution of the clusters as estimated thus far: the weight accumulated from earlier data is discounted by a each time new data is incorporated. If a = 1, all batches are weighted equally. If a = 0, new centroids are determined entirely by recent data. Lower values correspond to more forgetting.
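
    The update for a single cluster can be sketched in plain Scala. The sketch assumes the centroid is normalized by the discounted weight n_t * a + m_t, which is the form under which a = 0 yields exactly the current batch centroid; the object and function names are hypothetical.

```scala
object StreamingUpdateSketch {
  // c: previous centroid c_t, n: accumulated weight n_t,
  // x: centroid of the current batch x_t, m: points in the current batch m_t,
  // a: decay factor. Returns (c_{t+1}, n_{t+1}).
  def update(
      c: Array[Double], n: Double,
      x: Array[Double], m: Double,
      a: Double): (Array[Double], Double) = {
    val discounted = n * a          // weight of the past after forgetting
    val newWeight = discounted + m  // n_{t+1}
    if (newWeight <= 0.0) (c.clone, newWeight) // nothing to average; keep the centroid
    else {
      val newCenter = c.zip(x).map { case (ci, xi) =>
        (ci * discounted + xi * m) / newWeight
      }
      (newCenter, newWeight)
    }
  }
}
```

    For example, with c_t = 0, n_t = 10, x_t = 4, and m_t = 5, a decay of a = 1 moves the centroid to 20 / 15, while a = 0 moves it all the way to 4.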

    Decay can optionally be specified by a half life and an associated time unit. The time unit can be either a batch of data or a single data point. For data that arrived at time t, the half life h is defined such that at time t + h the discount applied to that data is 0.5. The definition remains the same whether the time unit is given as batches or points.
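
    Under this definition, a half life h corresponds to a per-unit decay factor a with a^h = 0.5. A small sketch of that conversion, with hypothetical function names:

```scala
object HalfLifeSketch {
  // Solve a^h = 0.5 for a, i.e. a = exp(ln(0.5) / h).
  def decayFactor(halfLife: Double): Double =
    math.exp(math.log(0.5) / halfLife)

  // Discount applied to data after `elapsed` time units (batches or points).
  def discountAfter(elapsed: Double, halfLife: Double): Double =
    math.pow(decayFactor(halfLife), elapsed)
}
```

    A half life of 5 batches therefore gives a decay factor of about 0.87 per batch, and data from 10 batches ago is discounted to about 0.25.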

    Annotations
    @Since( "1.2.0" )

Value Members

  1. object BisectingKMeansModel extends Loader[BisectingKMeansModel] with Serializable

    Annotations
    @Since( "2.0.0" )
  2. object DistributedLDAModel extends Loader[DistributedLDAModel]

    Distributed model fitted by LDA. This type of model is currently only produced by Expectation-Maximization (EM).

    This model stores the inferred topics, the full training dataset, and the topic distribution for each training document.

    Annotations
    @Since( "1.5.0" )
  3. object GaussianMixtureModel extends Loader[GaussianMixtureModel] with Serializable

    Annotations
    @Since( "1.4.0" )
  4. object KMeans extends Serializable

    Top-level methods for calling K-means clustering.

    Annotations
    @Since( "0.8.0" )
  5. object KMeansModel extends Loader[KMeansModel] with Serializable

    Annotations
    @Since( "1.4.0" )
  6. object LocalLDAModel extends Loader[LocalLDAModel] with Serializable

    Local (non-distributed) model fitted by LDA.

    This model stores the inferred topics only; it does not store info about the training dataset.

    Annotations
    @Since( "1.5.0" )
  7. object PowerIterationClustering extends Logging with Serializable

    Annotations
    @Since( "1.3.0" )
  8. object PowerIterationClusteringModel extends Loader[PowerIterationClusteringModel] with Serializable

    Annotations
    @Since( "1.4.0" )
