public class Word2Vec extends java.lang.Object implements scala.Serializable, Logging
This implementation uses the skip-gram model and trains it with hierarchical softmax. Variable names in the implementation match those in the original C implementation.
For the original C implementation, see https://code.google.com/p/word2vec/. For the research papers, see "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".
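
To make the skip-gram idea concrete, the following is a minimal sketch of how (center, context) training pairs can be generated from a tokenized sentence. It is not taken from this class: the class name `SkipGramPairs`, the method `pairs`, and the fixed `window` parameter are hypothetical, and the fixed window is a simplification chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Word2Vec.
final class SkipGramPairs {

    /**
     * Returns one (center, context) pair for every word that lies within
     * `window` positions of each center word in the sentence.
     */
    public static List<String[]> pairs(List<String> sentence, int window) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(sentence.size() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) { // a word is never its own context
                    out.add(new String[] {sentence.get(i), sentence.get(j)});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("the", "quick", "brown", "fox");
        for (String[] p : pairs(sentence, 1)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

With a window of 1, the four-word sentence above yields six pairs. In skip-gram training, each pair asks the model to predict the context word from the center word.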
Constructor and Description |
---|
Word2Vec() |
Modifier and Type | Method and Description |
---|---|
<S extends java.lang.Iterable<java.lang.String>> Word2VecModel | fit(JavaRDD<S> dataset) Computes the vector representation of each word in the vocabulary (Java version). |
<S extends scala.collection.Iterable<java.lang.String>> Word2VecModel | fit(RDD<S> dataset) Computes the vector representation of each word in the vocabulary. |
Word2Vec | setLearningRate(double learningRate) Sets the initial learning rate (default: 0.025). |
Word2Vec | setMinCount(int minCount) Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5). |
Word2Vec | setNumIterations(int numIterations) Sets the number of iterations (default: 1), which should be smaller than or equal to the number of partitions. |
Word2Vec | setNumPartitions(int numPartitions) Sets the number of partitions (default: 1). |
Word2Vec | setSeed(long seed) Sets the random seed (default: a random long integer). |
Word2Vec | setVectorSize(int vectorSize) Sets the vector size (default: 100). |
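
The effect of `minCount` on the vocabulary can be sketched in plain Java. This illustrates the filtering rule stated in the method description and is not the code used inside `fit`. The class `MinCountVocab` and the method `build` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Word2Vec.
final class MinCountVocab {

    /** Counts every token, then drops tokens seen fewer than minCount times. */
    public static Map<String, Integer> build(List<List<String>> sentences, int minCount) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        // Remove entries whose count is below the threshold.
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("a", "b", "a"),
            Arrays.asList("a", "c", "b"));
        System.out.println(build(corpus, 2));
    }
}
```

In this toy corpus, `c` appears only once, so a threshold of 2 excludes it from the vocabulary while `a` and `b` are kept.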
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public Word2Vec setVectorSize(int vectorSize)
vectorSize - (undocumented)

public Word2Vec setLearningRate(double learningRate)
learningRate - (undocumented)

public Word2Vec setNumPartitions(int numPartitions)
numPartitions - (undocumented)

public Word2Vec setNumIterations(int numIterations)
numIterations - (undocumented)

public Word2Vec setSeed(long seed)
seed - (undocumented)

public Word2Vec setMinCount(int minCount)
minCount - (undocumented)

public <S extends scala.collection.Iterable<java.lang.String>> Word2VecModel fit(RDD<S> dataset)
dataset - an RDD of words

public <S extends java.lang.Iterable<java.lang.String>> Word2VecModel fit(JavaRDD<S> dataset)
dataset - a JavaRDD of words
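
The value passed to `setLearningRate` is described as an initial rate, which suggests that the rate changes during training. The sketch below assumes a linear decay with a small floor, similar to the schedule in the original C tool. Whether this class follows exactly the same schedule is an assumption, and `LearningRateDecay` and `alpha` are hypothetical names.

```java
// Hypothetical helper for illustration only; not part of Word2Vec.
final class LearningRateDecay {

    /**
     * Linearly decays the rate as more words are processed, never dropping
     * below 0.01% of the initial rate (assumed schedule, see lead-in).
     */
    public static double alpha(double initial, long wordsProcessed, long totalWords) {
        double decayed = initial * (1.0 - (double) wordsProcessed / (totalWords + 1));
        return Math.max(decayed, initial * 0.0001);
    }

    public static void main(String[] args) {
        long total = 1_000_000L;
        for (long seen = 0; seen <= total; seen += 250_000L) {
            System.out.println(seen + " words: alpha = " + alpha(0.025, seen, total));
        }
    }
}
```

Under this assumed schedule, training starts at the configured rate (0.025 by default) and takes progressively smaller steps as it consumes the corpus.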