org.apache.spark.ml.feature (Spark 3.0.3 JavaDoc)

Interface Summary
Interface	Description
BucketedRandomProjectionLSHParams	Params for `BucketedRandomProjectionLSH`.
ChiSqSelectorParams	Params for `ChiSqSelector` and `ChiSqSelectorModel`.
CountVectorizerParams	Params for `CountVectorizer` and `CountVectorizerModel`.
IDFBase	Params for `IDF` and `IDFModel`.
ImputerParams	Params for `Imputer` and `ImputerModel`.
InteractableTerm	A term that may be part of an interaction, e.g.
LSHParams	Params for `LSH`.
MaxAbsScalerParams	Params for `MaxAbsScaler` and `MaxAbsScalerModel`.
MinMaxScalerParams	Params for `MinMaxScaler` and `MinMaxScalerModel`.
OneHotEncoderBase	Private trait for params and common methods for `OneHotEncoder` and `OneHotEncoderModel`.
PCAParams	Params for `PCA` and `PCAModel`.
QuantileDiscretizerBase	Params for `QuantileDiscretizer`.
RFormulaBase	Base trait for `RFormula` and `RFormulaModel`.
RobustScalerParams	Params for `RobustScaler` and `RobustScalerModel`.
StandardScalerParams	Params for `StandardScaler` and `StandardScalerModel`.
StringIndexerBase	Base trait for `StringIndexer` and `StringIndexerModel`.
Term	R formula terms.
VectorIndexerParams	Private trait for params for `VectorIndexer` and `VectorIndexerModel`.
Word2VecBase	Params for `Word2Vec` and `Word2VecModel`.

Class Summary
Class	Description
Binarizer	Binarize a column of continuous features given a threshold.
BucketedRandomProjectionLSH	This `BucketedRandomProjectionLSH` implements Locality Sensitive Hashing functions for Euclidean distance metrics.
BucketedRandomProjectionLSHModel	Model produced by `BucketedRandomProjectionLSH`, where multiple random vectors are stored.
Bucketizer	`Bucketizer` maps a column of continuous features to a column of feature buckets.
ChiSqSelector	Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label.
ChiSqSelectorModel	Model fitted by `ChiSqSelector`.
ColumnPruner	Utility transformer for removing temporary columns from a DataFrame.
CountVectorizer	Extracts a vocabulary from document collections and generates a `CountVectorizerModel`.
CountVectorizerModel	Converts a text document to a sparse vector of token counts.
DCT	A feature transformer that takes the 1D discrete cosine transform of a real vector.
Dot
ElementwiseProduct	Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector.
EmptyTerm	Placeholder term for the result of undefined interactions, e.g.
FeatureHasher	Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space).
HashingTF	Maps a sequence of terms to their term frequencies using the hashing trick.
IDF	Compute the Inverse Document Frequency (IDF) given a collection of documents.
IDFModel	Model fitted by `IDF`.
Imputer	Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located.
ImputerModel	Model fitted by `Imputer`.
IndexToString	A `Transformer` that maps a column of indices back to a new column of corresponding string values.
Interaction	Implements the feature interaction transform.
LabeledPoint	Class that represents the features and label of a data point.
MaxAbsScaler	Rescale each feature individually to range [-1, 1] by dividing through the largest maximum absolute value in each feature.
MaxAbsScalerModel	Model fitted by `MaxAbsScaler`.
MinHashLSH	LSH class for Jaccard distance.
MinHashLSHModel	Model produced by `MinHashLSH`, where multiple hash functions are stored.
MinMaxScaler	Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.
MinMaxScalerModel	Model fitted by `MinMaxScaler`.
NGram	A feature transformer that converts the input array of strings into an array of n-grams.
Normalizer	Normalize a vector to have unit norm using the given p-norm.
OneHotEncoder	A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.
OneHotEncoderCommon	Provides some helper methods used by `OneHotEncoder`.
OneHotEncoderModel	Model fitted by `OneHotEncoder`; its `categorySizes` parameter holds the original number of categories for each feature being encoded.
PCA	PCA trains a model to project vectors to a lower dimensional space of the top `PCA.k` principal components.
PCAModel	Model fitted by `PCA`.
PolynomialExpansion	Perform feature expansion in a polynomial space.
QuantileDiscretizer	`QuantileDiscretizer` takes a column with continuous features and outputs a column with binned categorical features.
RegexTokenizer	A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if `gaps` is false).
RFormula	Implements the transforms required for fitting a dataset against an R model formula.
RFormulaModel	Model fitted by `RFormula`.
RFormulaParser	Limited implementation of R formula parsing.
RobustScaler	Scale features using statistics that are robust to outliers.
RobustScalerModel	Model fitted by `RobustScaler`.
SQLTransformer	Implements the transformations which are defined by SQL statement.
StandardScaler	Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
StandardScalerModel	Model fitted by `StandardScaler`.
StopWordsRemover	A feature transformer that filters out stop words from input.
StringIndexer	A label indexer that maps string column(s) of labels to ML column(s) of label indices.
StringIndexerAggregator	A SQL `Aggregator` used by `StringIndexer` to count labels in string columns during fitting.
StringIndexerModel	Model fitted by `StringIndexer`.
Tokenizer	A tokenizer that converts the input string to lowercase and then splits it by white spaces.
VectorAssembler	A feature transformer that merges multiple columns into a vector column.
VectorAttributeRewriter	Utility transformer that rewrites Vector attribute names via prefix replacement.
VectorIndexer	Class for indexing categorical feature columns in a dataset of `Vector`.
VectorIndexerModel	Model fitted by `VectorIndexer`.
VectorSizeHint	A feature transformer that adds size information to the metadata of a vector column.
VectorSlicer	This class takes a feature vector and outputs a new feature vector with a subarray of the original features.
Word2Vec	Word2Vec trains a model of `Map(String, Vector)`, i.e.
Word2VecModel	Model fitted by `Word2Vec`.
Word2VecModel.Word2VecModelWriter$
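
Several of the scalers summarized above reduce to simple per-column arithmetic. The following plain-Java sketch is not Spark code: the class and method names (`ScalerSketch`, `binarize`, `maxAbsScale`, `minMaxScale`) are hypothetical, and it assumes a single feature column small enough to hold in a `double[]`. It only illustrates what `Binarizer`, `MaxAbsScaler`, and `MinMaxScaler` compute for one column. The handling of an all-zero column in `maxAbsScale` is a choice made for this sketch.

```java
import java.util.Arrays;

// Plain-Java illustration only; the Spark classes operate on Dataset columns.
final class ScalerSketch {

    // Binarizer: values strictly greater than the threshold become 1.0, all others 0.0.
    static double[] binarize(double[] column, double threshold) {
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = column[i] > threshold ? 1.0 : 0.0;
        }
        return out;
    }

    // MaxAbsScaler: divide by the largest absolute value so results lie in [-1, 1].
    static double[] maxAbsScale(double[] column) {
        double maxAbs = 0.0;
        for (double v : column) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // Sketch choice: an all-zero column stays zero instead of dividing by zero.
            out[i] = maxAbs == 0.0 ? 0.0 : column[i] / maxAbs;
        }
        return out;
    }

    // MinMaxScaler: map [colMin, colMax] linearly onto the target range [min, max].
    static double[] minMaxScale(double[] column, double min, double max) {
        double colMin = Arrays.stream(column).min().orElse(0.0);
        double colMax = Arrays.stream(column).max().orElse(0.0);
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = colMax == colMin
                ? 0.5 * (max + min) // constant column: midpoint of the target range
                : (column[i] - colMin) / (colMax - colMin) * (max - min) + min;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] column = {-4.0, 2.0, 8.0};
        System.out.println(Arrays.toString(binarize(column, 0.0)));         // [0.0, 1.0, 1.0]
        System.out.println(Arrays.toString(maxAbsScale(column)));           // [-0.5, 0.25, 1.0]
        System.out.println(Arrays.toString(minMaxScale(column, 0.0, 1.0))); // [0.0, 0.5, 1.0]
    }
}
```

Note that `binarize` needs no statistics about the column, whereas the two scalers must first scan the whole column for its extremes. That difference is why, in the tables above, `Binarizer` has no companion model class while `MaxAbsScaler` and `MinMaxScaler` each have one.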

Package org.apache.spark.ml.feature Description

Feature transformers

The `ml.feature` package provides common feature transformers that help convert raw data or features into more suitable forms for model fitting. Most feature transformers are implemented as `Transformer`s, which transform one `Dataset` into another, e.g., `HashingTF`. Some feature transformers are implemented as `Estimator`s, because the transformation requires some aggregated information of the dataset, e.g., document frequencies in `IDF`. For those feature transformers, calling `Estimator.fit(org.apache.spark.sql.Dataset<?>, org.apache.spark.ml.param.ParamPair<?>, org.apache.spark.ml.param.ParamPair<?>...)` is required to obtain the model first, e.g., `IDFModel`, in order to apply the transformation. The transformation is usually done by appending new columns to the input `Dataset`, so all input columns are carried over. We try to make each transformer minimal, so that transformers can be assembled flexibly into feature transformation pipelines. `Pipeline` can be used to chain feature transformers, and `VectorAssembler` can be used to combine multiple feature transformations, for example:

 
  import java.util.Arrays;

  import org.apache.spark.api.java.JavaRDD;
  import static org.apache.spark.sql.types.DataTypes.*;
  import org.apache.spark.sql.types.StructType;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.RowFactory;
  import org.apache.spark.sql.Row;

  import org.apache.spark.ml.feature.*;
  import org.apache.spark.ml.Pipeline;
  import org.apache.spark.ml.PipelineStage;
  import org.apache.spark.ml.PipelineModel;

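  // jsc and jsql are not defined in this snippet; they are assumed to be an existing
  // JavaSparkContext and SparkSession (or SQLContext), respectively.
  // a DataFrame with three columns: id (integer), text (string), and rating (double).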
  StructType schema = createStructType(
    Arrays.asList(
      createStructField("id", IntegerType, false),
      createStructField("text", StringType, false),
      createStructField("rating", DoubleType, false)));
  JavaRDD<Row> rowRDD = jsc.parallelize(
    Arrays.asList(
      RowFactory.create(0, "Hi I heard about Spark", 3.0),
      RowFactory.create(1, "I wish Java could use case classes", 4.0),
      RowFactory.create(2, "Logistic regression models are neat", 4.0)));
  Dataset<Row> dataset = jsql.createDataFrame(rowRDD, schema);
  // define feature transformers
  RegexTokenizer tok = new RegexTokenizer()
    .setInputCol("text")
    .setOutputCol("words");
  StopWordsRemover sw = new StopWordsRemover()
    .setInputCol("words")
    .setOutputCol("filtered_words");
  HashingTF tf = new HashingTF()
    .setInputCol("filtered_words")
    .setOutputCol("tf")
    .setNumFeatures(10000);
  IDF idf = new IDF()
    .setInputCol("tf")
    .setOutputCol("tf_idf");
  VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] {"tf_idf", "rating"})
    .setOutputCol("features");

  // assemble and fit the feature transformation pipeline
  Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[] {tok, sw, tf, idf, assembler});
  PipelineModel model = pipeline.fit(dataset);

  // save transformed features with raw data
  model.transform(dataset)
    .select("id", "text", "rating", "features")
    .write().format("parquet").save("/output/path");
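
The estimator/model split described above can also be illustrated without Spark. The following is a minimal plain-Java sketch, not Spark's implementation: `IdfSketch`, `Model`, `fit`, and `transform` are hypothetical names, documents are assumed to be small in-memory lists of tokens, and giving terms unseen during fitting a weight of 0.0 is a choice made for this sketch. It uses the smoothed formula `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the Estimator -> Model split; not Spark code.
final class IdfSketch {

    // "Model": holds the aggregated state learned by fit().
    static final class Model {
        final Map<String, Double> idf;

        Model(Map<String, Double> idf) {
            this.idf = idf;
        }

        // "transform": weight the term counts of one document by the learned IDF.
        Map<String, Double> transform(List<String> doc) {
            Map<String, Double> out = new HashMap<>();
            for (String term : doc) {
                // Sketch assumption: terms unseen during fit() get weight 0.0.
                double weight = idf.getOrDefault(term, 0.0);
                out.merge(term, weight, Double::sum); // adds idf once per occurrence
            }
            return out;
        }
    }

    // "fit": one pass over all documents to count document frequencies.
    static Model fit(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int m = docs.size();
        Map<String, Double> idf = new HashMap<>();
        docFreq.forEach((term, df) -> idf.put(term, Math.log((m + 1.0) / (df + 1.0))));
        return new Model(idf);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("spark", "fast"),
            Arrays.asList("spark", "neat"),
            Arrays.asList("spark", "java"));
        Model model = fit(docs); // aggregation must happen before any transform
        System.out.println(model.transform(Arrays.asList("java", "java", "spark")));
    }
}
```

In this sketch "spark" occurs in every document, so its weight is `log(4 / 4) = 0`, while "java" occurs in one document and gets `log(4 / 2)`. Neither value can be known from a single row, which is why such a transformer needs a fitting step over the whole dataset.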

Some feature transformers implemented in MLlib are inspired by those implemented in scikit-learn. The major difference is that most scikit-learn feature transformers operate eagerly on the entire input dataset, while MLlib's feature transformers operate lazily on individual columns, which makes them more efficient and flexible for handling large and complex datasets.

See Also: scikit-learn.preprocessing