Package org.apache.spark.ml.feature
Interface RFormulaBase
- All Superinterfaces:
HasFeaturesCol, HasHandleInvalid, HasLabelCol, Identifiable, Params, Serializable
- All Known Implementing Classes:
RFormula, RFormulaModel
Base trait for RFormula and RFormulaModel.
Method Summary
Modifier and Type  Method                          Description
BooleanParam       forceIndexLabel()               Forces the label to be indexed whether it is numeric or string type.
Param<String>      formula()                       R formula parameter.
boolean            getForceIndexLabel()
String             getFormula()
String             getStringIndexerOrderType()
Param<String>      handleInvalid()                 Param for how to handle invalid data (unseen or NULL values) in the features and label columns of string type.
boolean            hasLabelCol(StructType schema)
Param<String>      stringIndexerOrderType()        Param for how to order categories of a string FEATURE column used by StringIndexer.
Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
featuresCol, getFeaturesCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasHandleInvalid
getHandleInvalid
Methods inherited from interface org.apache.spark.ml.param.shared.HasLabelCol
getLabelCol, labelCol
Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString, uid
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copy, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
-
Method Details
-
forceIndexLabel
BooleanParam forceIndexLabel()
Forces the label to be indexed whether it is numeric or string type. Usually the label is indexed only when it is string type. If the formula is used by a classification algorithm, the label can be forced to be indexed even when it is numeric by setting this param to true. Default: false.
- Returns:
- (undocumented)
-
formula
Param<String> formula()
R formula parameter. The formula is provided in string form.
- Returns:
- (undocumented)
-
getForceIndexLabel
boolean getForceIndexLabel()
-
getFormula
String getFormula()
-
getStringIndexerOrderType
String getStringIndexerOrderType()
-
handleInvalid
Param<String> handleInvalid()
Param for how to handle invalid data (unseen or NULL values) in the features and label columns of string type. Options are 'skip' (filter out rows with invalid data), 'error' (throw an error), or 'keep' (put invalid data in a special additional bucket, at index numLabels). Default: "error".
- Specified by:
handleInvalid in interface HasHandleInvalid
- Returns:
- (undocumented)
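The three handleInvalid options can be illustrated with a small plain-Java model. This is a sketch only, not Spark's implementation: the class `HandleInvalidSketch`, its `index` method, and the sample labels are hypothetical names introduced for illustration, and the list of fitted labels stands in for whatever categories were seen during fitting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the 'skip', 'error', and 'keep' options.
// It does not use Spark and only models the behaviour described above.
class HandleInvalidSketch {

    // labels: categories seen during fitting; a category's position is its index.
    // rows: string values to index; null and unseen values count as invalid.
    // Returns the indices of the rows that are kept.
    static List<Integer> index(List<String> labels, List<String> rows, String handleInvalid) {
        List<Integer> out = new ArrayList<>();
        for (String row : rows) {
            int idx = (row == null) ? -1 : labels.indexOf(row);
            if (idx >= 0) {
                out.add(idx);
                continue;
            }
            switch (handleInvalid) {
                case "skip":
                    // Filter out the row with invalid data.
                    break;
                case "keep":
                    // Put invalid data in an additional bucket at index numLabels.
                    out.add(labels.size());
                    break;
                case "error":
                    throw new IllegalArgumentException("Invalid value: " + row);
                default:
                    throw new IllegalArgumentException("Unsupported option: " + handleInvalid);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("a", "b");
        List<String> rows = Arrays.asList("a", "z", null, "b");
        System.out.println("skip: " + index(labels, rows, "skip"));
        System.out.println("keep: " + index(labels, rows, "keep"));
    }
}
```

With two fitted labels, 'keep' assigns index 2 (numLabels) to both the unseen value and the null, while 'skip' drops those rows and 'error' throws on the first of them.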
-
hasLabelCol
boolean hasLabelCol(StructType schema)
-
stringIndexerOrderType
Param<String> stringIndexerOrderType()
Param for how to order categories of a string FEATURE column used by StringIndexer. The last category after ordering is dropped when encoding strings. Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'. The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', RFormula drops the same category as R when encoding strings.

The options are explained using an example 'b', 'a', 'b', 'a', 'c', 'b':

+-----------------+---------------------------------------+-----------------------------------+
| Option          | Category mapped to 0 by StringIndexer | Category dropped by RFormula      |
+-----------------+---------------------------------------+-----------------------------------+
| 'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')     |
| 'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')      |
| 'alphabetDesc'  | last alphabetical category ('c')      | first alphabetical category ('a') |
| 'alphabetAsc'   | first alphabetical category ('a')     | last alphabetical category ('c')  |
+-----------------+---------------------------------------+-----------------------------------+

Note that this ordering option is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in StringIndexer.
- Returns:
- (undocumented)
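The ordering rules for the example 'b', 'a', 'b', 'a', 'c', 'b' can be checked with a small plain-Java model. This is a sketch under stated assumptions, not Spark's implementation: the class `CategoryOrderingSketch` and its methods are hypothetical names, and tie-breaking between equally frequent categories is not modelled because the example has no ties.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the four ordering options.
// It does not use Spark; it only models the ordering rules described above.
class CategoryOrderingSketch {

    // Returns the distinct categories in the order implied by orderType.
    // Position 0 is the category mapped to index 0.
    static List<String> order(List<String> values, String orderType) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> categories = new ArrayList<>(counts.keySet());
        Comparator<String> byFrequency = Comparator.comparingInt(c -> counts.get(c));
        switch (orderType) {
            case "frequencyDesc":
                categories.sort(byFrequency.reversed());
                break;
            case "frequencyAsc":
                categories.sort(byFrequency);
                break;
            case "alphabetDesc":
                categories.sort(Comparator.reverseOrder());
                break;
            case "alphabetAsc":
                categories.sort(Comparator.naturalOrder());
                break;
            default:
                throw new IllegalArgumentException("Unsupported option: " + orderType);
        }
        return categories;
    }

    // The last category after ordering is the one dropped when encoding strings.
    static String dropped(List<String> ordered) {
        return ordered.get(ordered.size() - 1);
    }

    public static void main(String[] args) {
        List<String> example = Arrays.asList("b", "a", "b", "a", "c", "b");
        for (String opt : Arrays.asList("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc")) {
            List<String> ordered = order(example, opt);
            System.out.println(opt + ": index 0 -> " + ordered.get(0) + ", dropped -> " + dropped(ordered));
        }
    }
}
```

Running `main` prints one line per option, and each line should match the corresponding row of the table above.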
-