public class KMeans extends Object implements scala.Serializable, Logging
This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.
Constructor and Description |
---|
KMeans() Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1, initializationMode: "k-means\|\|", initializationSteps: 5, epsilon: 1e-4}. |
Modifier and Type | Method and Description |
---|---|
static double | fastSquaredDistance(VectorWithNorm v1, VectorWithNorm v2) Returns the squared Euclidean distance between two vectors computed by MLUtils.fastSquaredDistance(org.apache.spark.mllib.linalg.Vector, double, org.apache.spark.mllib.linalg.Vector, double, double). |
static scala.Tuple2<Object,Object> | findClosest(scala.collection.TraversableOnce<VectorWithNorm> centers, VectorWithNorm point) Returns the index of the closest center to the given point, as well as the squared distance. |
static String | K_MEANS_PARALLEL() |
static double | pointCost(scala.collection.TraversableOnce<VectorWithNorm> centers, VectorWithNorm point) Returns the K-means cost of a given point against the given cluster centers. |
static String | RANDOM() |
KMeansModel | run(RDD<Vector> data) Train a K-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm. |
KMeans | setEpsilon(double epsilon) Set the distance threshold within which we consider centers to have converged. |
KMeans | setInitializationMode(String initializationMode) Set the initialization algorithm. |
KMeans | setInitializationSteps(int initializationSteps) Set the number of steps for the k-means\|\| initialization mode. |
KMeans | setK(int k) Set the number of clusters to create (k). |
KMeans | setMaxIterations(int maxIterations) Set the maximum number of iterations to run. |
KMeans | setRuns(int runs) :: Experimental :: Set the number of runs of the algorithm to execute in parallel. |
static KMeansModel | train(RDD<Vector> data, int k, int maxIterations) Trains a k-means model using the specified parameters and the default values for unspecified parameters. |
static KMeansModel | train(RDD<Vector> data, int k, int maxIterations, int runs) Trains a k-means model using the specified parameters and the default values for unspecified parameters. |
static KMeansModel | train(RDD<Vector> data, int k, int maxIterations, int runs, String initializationMode) Trains a k-means model using the given set of parameters. |
Methods inherited from class Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public KMeans()
public static String RANDOM()
public static String K_MEANS_PARALLEL()
public static KMeansModel train(RDD<Vector> data, int k, int maxIterations, int runs, String initializationMode)
data - training points stored as RDD[Vector]
k - number of clusters
maxIterations - max number of iterations
runs - number of parallel runs, defaults to 1. The best model is returned.
initializationMode - initialization mode, either "random" or "k-means||" (default).
public static KMeansModel train(RDD<Vector> data, int k, int maxIterations)
public static KMeansModel train(RDD<Vector> data, int k, int maxIterations, int runs)
public static scala.Tuple2<Object,Object> findClosest(scala.collection.TraversableOnce<VectorWithNorm> centers, VectorWithNorm point)
public static double pointCost(scala.collection.TraversableOnce<VectorWithNorm> centers, VectorWithNorm point)
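As a rough illustration of what findClosest and pointCost compute, the plain-Java sketch below scans a small array of centers and returns the nearest index together with the squared distance. It is not Spark's implementation. The class ClosestCenterSketch, the Closest holder, the helper squaredEuclidean, and the use of double[] in place of VectorWithNorm are all assumptions made for the example, as is the reading of the "K-means cost" of a point as its squared distance to the nearest center.

```java
// Illustrative sketch only; plain arrays stand in for Spark's VectorWithNorm.
final class ClosestCenterSketch {

    /** Hypothetical holder for the (index, squared distance) pair. */
    static final class Closest {
        final int index;
        final double squaredDistance;

        Closest(int index, double squaredDistance) {
            this.index = index;
            this.squaredDistance = squaredDistance;
        }
    }

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Linear scan over the centers, keeping the smallest squared distance seen. */
    static Closest findClosest(double[][] centers, double[] point) {
        int bestIndex = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double distance = squaredEuclidean(centers[i], point);
            if (distance < bestDistance) {
                bestDistance = distance;
                bestIndex = i;
            }
        }
        return new Closest(bestIndex, bestDistance);
    }

    /** Assumed cost of a point: squared distance to its nearest center. */
    static double pointCost(double[][] centers, double[] point) {
        return findClosest(centers, point).squaredDistance;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[] point = {9.0, 8.0};
        Closest closest = findClosest(centers, point);
        System.out.println("index=" + closest.index + " cost=" + closest.squaredDistance);
    }
}
```

Returning the index and the distance together avoids a second pass over the centers when both values are needed.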
public static double fastSquaredDistance(VectorWithNorm v1, VectorWithNorm v2)
Returns the squared Euclidean distance between two vectors computed by MLUtils.fastSquaredDistance(org.apache.spark.mllib.linalg.Vector, double, org.apache.spark.mllib.linalg.Vector, double, double).
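The signature above pairs each vector with a double, which suggests that precomputed norms are reused. One common way to exploit cached norms is the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2·(a · b), which needs only one dot product per pair. The sketch below shows that identity in plain Java. The class NormDistanceSketch and its method names are hypothetical, double[] stands in for Vector, and the trailing double parameter of the MLUtils method is not modeled here, so this should not be read as Spark's actual code.

```java
// Illustrative sketch only; not the MLUtils implementation.
final class NormDistanceSketch {

    /** Dot product of two vectors of equal dimension. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Euclidean (L2) norm, intended to be computed once per vector and cached. */
    static double norm(double[] a) {
        return Math.sqrt(dot(a, a));
    }

    /**
     * Squared distance from cached norms:
     * ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b).
     */
    static double squaredDistanceFromNorms(double[] a, double normA,
                                           double[] b, double normB) {
        double value = normA * normA + normB * normB - 2.0 * dot(a, b);
        // Rounding can push the result slightly below zero for near-identical
        // vectors, so clamp it to keep the distance non-negative.
        return Math.max(value, 0.0);
    }

    public static void main(String[] args) {
        double[] a = {3.0, 4.0};
        double[] b = {0.0, 0.0};
        System.out.println(squaredDistanceFromNorms(a, norm(a), b, norm(b)));
    }
}
```

The identity loses precision when the two vectors are very close relative to their norms, so a production implementation would typically fall back to a direct element-wise computation in that case.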
public KMeans setK(int k)
public KMeans setMaxIterations(int maxIterations)
public KMeans setInitializationMode(String initializationMode)
public KMeans setRuns(int runs)
public KMeans setInitializationSteps(int initializationSteps)
public KMeans setEpsilon(double epsilon)
public KMeansModel run(RDD<Vector> data)
Train a K-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm.
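To show why the data is read repeatedly and how maxIterations and epsilon interact, here is a minimal single-machine sketch of the standard Lloyd iteration in plain Java. It is an assumption-laden illustration rather than Spark's distributed implementation: the class LloydSketch is hypothetical, the initial centers are passed in directly instead of being chosen by "random" or "k-means||", and convergence is assumed to mean that no center moved farther than epsilon in one pass.

```java
// Illustrative sketch only; a single-machine Lloyd iteration, not Spark's code.
final class LloydSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Runs at most maxIterations passes; stops early once all centers move <= epsilon. */
    static double[][] run(double[][] points, double[][] initialCenters,
                          int maxIterations, double epsilon) {
        int k = initialCenters.length;
        int dim = initialCenters[0].length;
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = initialCenters[j].clone();
        }

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];

            // Assignment step: every pass touches every point, which is why
            // an uncached input would be recomputed on each iteration.
            for (double[] point : points) {
                int best = 0;
                double bestDistance = Double.POSITIVE_INFINITY;
                for (int j = 0; j < k; j++) {
                    double distance = squaredDistance(centers[j], point);
                    if (distance < bestDistance) {
                        bestDistance = distance;
                        best = j;
                    }
                }
                counts[best]++;
                for (int c = 0; c < dim; c++) {
                    sums[best][c] += point[c];
                }
            }

            // Update step: move each non-empty center to the mean of its points.
            boolean converged = true;
            for (int j = 0; j < k; j++) {
                if (counts[j] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int c = 0; c < dim; c++) {
                    updated[c] = sums[j][c] / counts[j];
                }
                if (squaredDistance(updated, centers[j]) > epsilon * epsilon) {
                    converged = false;
                }
                centers[j] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0.0, 0.0}, {0.0, 2.0}, {10.0, 10.0}, {10.0, 12.0}};
        double[][] initial = {{0.0, 0.0}, {10.0, 10.0}};
        // 20 and 1e-4 mirror the default maxIterations and epsilon listed for KMeans().
        double[][] centers = run(points, initial, 20, 1e-4);
        System.out.println(java.util.Arrays.deepToString(centers));
    }
}
```

With well-separated data like this, the loop stops after the second pass because no center moves, so maxIterations acts only as an upper bound.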