org.apache.spark.sql

Dataset

class Dataset[T] extends Queryable with Serializable

:: Experimental :: A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations.

A Dataset differs from an RDD in the following ways:

• Internally, a Dataset is represented by a Catalyst logical plan, and the data is stored in an encoded binary form. This representation allows additional logical operations and enables many operations (sorting, shuffling, etc.) to be performed without deserializing the objects.
• Creating a Dataset requires the presence of an explicit Encoder that can serialize the objects into this binary format and map their schema to the Spark SQL type system. In contrast, RDDs rely on runtime reflection-based serialization.

A Dataset can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic Row container. A DataFrame can be transformed into a specific Dataset by calling df.as[ElementType]. Similarly, you can transform a strongly-typed Dataset to a generic DataFrame by calling ds.toDF().

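For example, a round trip between the two representations might look like this (an illustrative sketch: it assumes an existing SQLContext named sqlContext and a people.json file whose fields match the case class):

    case class Person(name: String, age: Long)

    import sqlContext.implicits._ // supplies the Encoders and the $-column syntax

    val df: DataFrame = sqlContext.read.json("people.json") // generic Row container
    val people: Dataset[Person] = df.as[Person]             // strongly typed view
    val back: DataFrame = people.toDF()                     // back to generic rows
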
COMPATIBILITY NOTE: Long term we plan to make DataFrame extend Dataset[Row]. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc.). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.

Annotations
@Experimental()
Source
Dataset.scala
Since

1.6.0

Linear Supertypes
Serializable, Serializable, Queryable, AnyRef, Any

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. def as(alias: String): Dataset[T]

    Applies a logical alias to this Dataset that can be used to disambiguate columns that have the same name after two Datasets have been joined.

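    A minimal sketch (assuming the implicits of an existing SQLContext are in scope; a Dataset of a primitive type exposes a single column named value):

    val ds1 = Seq(1, 2, 3).toDS().as("a")
    val ds2 = Seq(1, 2).toDS().as("b")

    // The aliases disambiguate the two otherwise identical "value" columns.
    ds1.joinWith(ds2, $"a.value" === $"b.value")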
    Since

    1.6.0

  7. def as[U](implicit arg0: Encoder[U]): Dataset[U]

    Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U:

    • When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
    • When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
    • When U is a primitive type (e.g. String, Int), the first column of the DataFrame will be used.

    If the schema of the DataFrame does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.

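    For example, given a DataFrame df with the hypothetical columns name and age, the three mapping rules look like this:

    case class Person(name: String, age: Long)

    val people = df.as[Person]                                // fields matched by name
    val pairs = df.select($"name", $"age").as[(String, Long)] // columns matched by ordinal
    val names = df.select($"name").as[String]                 // first column used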
    Since

    1.6.0

  8. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  9. def cache(): Dataset.this.type

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Since

    1.6.0

  10. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  11. def coalesce(numPartitions: Int): Dataset[T]

    Returns a new Dataset that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.

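    For example (a sketch, assuming a Dataset ds with enough partitions), shrinking the partition count avoids a shuffle, whereas growing it requires repartition:

    val wide = ds.repartition(100)  // full shuffle across the cluster
    val narrow = wide.coalesce(10)  // no shuffle: each new partition claims 10 old ones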
    Since

    1.6.0

  12. def collect(): Array[T]

    Returns an array that contains all the elements in this Dataset.

    Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.

    For Java API, use collectAsList.

    Since

    1.6.0

  13. def collectAsList(): List[T]

    Returns a Java list that contains all the elements in this Dataset.

    Running this requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.

    This is the Java API counterpart of collect.

    Since

    1.6.0

  14. def count(): Long

    Returns the number of elements in the Dataset.

    Since

    1.6.0

  15. def distinct: Dataset[T]

    Returns a new Dataset that contains only the unique elements of this Dataset.

    Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

    Since

    1.6.0

  16. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  17. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  18. def explain(): Unit

    Prints the physical plan to the console for debugging purposes.

    Definition Classes
    Dataset → Queryable
    Since

    1.6.0

  19. def explain(extended: Boolean): Unit

    Prints the plans (logical and physical) to the console for debugging purposes.

    Definition Classes
    Dataset → Queryable
    Since

    1.6.0

  20. def filter(func: FilterFunction[T]): Dataset[T]

    (Java-specific) Returns a new Dataset that only contains elements where func returns true.

    Since

    1.6.0

  21. def filter(func: (T) ⇒ Boolean): Dataset[T]

    (Scala-specific) Returns a new Dataset that only contains elements where func returns true.

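    A minimal sketch of the Scala variant (assuming the SQLContext implicits are in scope):

    val ds = Seq(1, 2, 3, 4).toDS()
    val evens = ds.filter(_ % 2 == 0) // contains 2 and 4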
    Since

    1.6.0

  22. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  23. def first(): T

    Returns the first element in this Dataset.

    Since

    1.6.0

  24. def flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    Since

    1.6.0

  25. def flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

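    For example, splitting lines into words (a minimal sketch):

    val lines = Seq("spark sql", "typed datasets").toDS()
    val words = lines.flatMap(_.split(" ")) // "spark", "sql", "typed", "datasets"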
    Since

    1.6.0

  26. def foreach(func: ForeachFunction[T]): Unit

    (Java-specific) Runs func on each element of this Dataset.

    Since

    1.6.0

  27. def foreach(func: (T) ⇒ Unit): Unit

    (Scala-specific) Runs func on each element of this Dataset.

    Since

    1.6.0

  28. def foreachPartition(func: ForeachPartitionFunction[T]): Unit

    (Java-specific) Runs func on each partition of this Dataset.

    Since

    1.6.0

  29. def foreachPartition(func: (Iterator[T]) ⇒ Unit): Unit

    (Scala-specific) Runs func on each partition of this Dataset.

    Since

    1.6.0

  30. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  31. def groupBy[K](func: MapFunction[T, K], encoder: Encoder[K]): GroupedDataset[K, T]

    (Java-specific) Returns a GroupedDataset where the data is grouped by the given key func.

    Since

    1.6.0

  32. def groupBy(cols: Column*): GroupedDataset[Row, T]

    Returns a GroupedDataset where the data is grouped by the given Column expressions.

    Annotations
    @varargs()
    Since

    1.6.0

  33. def groupBy[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): GroupedDataset[K, T]

    (Scala-specific) Returns a GroupedDataset where the data is grouped by the given key func.

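    For example, grouping by a derived key and then aggregating with GroupedDataset.mapGroups (a sketch, assuming the SQLContext implicits are in scope):

    val fruit = Seq("apple", "avocado", "banana").toDS()
    val counts = fruit
      .groupBy(_.substring(0, 1))                     // key by first letter
      .mapGroups((key, values) => (key, values.size)) // ("a", 2), ("b", 1)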
    Since

    1.6.0

  34. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  35. def intersect(other: Dataset[T]): Dataset[T]

    Returns a new Dataset that contains only the elements of this Dataset that are also present in other.

    Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

    Since

    1.6.0

  36. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  37. def joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]

    Joins this Dataset with another Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.

    other

    Right side of the join.

    condition

    Join expression.

    Since

    1.6.0

  38. def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

    Joins this Dataset with another Dataset, returning a Tuple2 for each pair where condition evaluates to true.

    This is similar to the relational join function, with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.

    This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.

    other

    Right side of the join.

    condition

    Join expression.

    joinType

    One of: inner, outer, left_outer, right_outer, leftsemi.

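    A sketch with two hypothetical case classes, using as(alias) to disambiguate the column references in the condition:

    case class Person(name: String, city: String)
    case class City(name: String, population: Long)

    val people = Seq(Person("Ann", "Oslo")).toDS().as("p")
    val cities = Seq(City("Oslo", 700000L)).toDS().as("c")

    // Each result element is the full pair of objects, nested under _1 and _2.
    val joined: Dataset[(Person, City)] =
      people.joinWith(cities, $"p.city" === $"c.name")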
    Since

    1.6.0

  39. def map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset that contains the result of applying func to each element.

    Since

    1.6.0

  40. def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.

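    A minimal sketch of the Scala variant:

    val ds = Seq(1, 2, 3).toDS()
    val squares = ds.map(i => i * i) // 1, 4, 9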
    Since

    1.6.0

  41. def mapPartitions[U](f: MapPartitionsFunction[T, U], encoder: Encoder[U]): Dataset[U]

    (Java-specific) Returns a new Dataset that contains the result of applying func to each partition.

    Since

    1.6.0

  42. def mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]

    (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.

    Since

    1.6.0

  43. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  44. final def notify(): Unit

    Definition Classes
    AnyRef
  45. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  46. def persist(newLevel: StorageLevel): Dataset.this.type

    Persist this Dataset with the given storage level.

    newLevel

    One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

    Since

    1.6.0

  47. def persist(): Dataset.this.type

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

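    For example (a sketch; the level constants come from org.apache.spark.storage.StorageLevel):

    import org.apache.spark.storage.StorageLevel

    ds.persist(StorageLevel.MEMORY_ONLY) // or ds.cache() for MEMORY_AND_DISK
    ds.count()                           // the first action materializes the cache
    ds.unpersist()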
    Since

    1.6.0

  48. def printSchema(): Unit

    Prints the schema of the underlying Dataset to the console in a nice tree format.

    Definition Classes
    Dataset → Queryable
    Since

    1.6.0

  49. val queryExecution: QueryExecution

    Definition Classes
    Dataset → Queryable
  50. def rdd: RDD[T]

    Converts this Dataset to an RDD.

    Since

    1.6.0

  51. def reduce(func: ReduceFunction[T]): T

    (Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

    Since

    1.6.0

  52. def reduce(func: (T, T) ⇒ T): T

    (Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

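    For example, summation is commutative and associative, so it is a safe reduce function (sketch):

    val total = Seq(1, 2, 3, 4).toDS().reduce(_ + _) // 10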
    Since

    1.6.0

  53. def repartition(numPartitions: Int): Dataset[T]

    Returns a new Dataset that has exactly numPartitions partitions.

    Since

    1.6.0

  54. def sample(withReplacement: Boolean, fraction: Double): Dataset[T]

    Returns a new Dataset by sampling a fraction of records, using a random seed.

    Since

    1.6.0

  55. def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]

    Returns a new Dataset by sampling a fraction of records, using the given random seed.

    Since

    1.6.0

  56. def schema: StructType

    Returns the schema of the encoded form of the objects in this Dataset.

    Definition Classes
    Dataset → Queryable
    Since

    1.6.0

  57. def select[U1, U2, U3, U4, U5](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4], c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]

    Returns a new Dataset by computing the given Column expressions for each element.

    Since

    1.6.0

  58. def select[U1, U2, U3, U4](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]

    Returns a new Dataset by computing the given Column expressions for each element.

    Since

    1.6.0

  59. def select[U1, U2, U3](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]

    Returns a new Dataset by computing the given Column expressions for each element.

    Since

    1.6.0

  60. def select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]

    Returns a new Dataset by computing the given Column expressions for each element.

    Since

    1.6.0

  61. def select[U1](c1: TypedColumn[T, U1])(implicit arg0: Encoder[U1]): Dataset[U1]

    Returns a new Dataset by computing the given Column expression for each element.

    val ds = Seq(1, 2, 3).toDS()
    val newDS = ds.select(expr("value + 1").as[Int])
    Since

    1.6.0

  62. def select(cols: Column*): DataFrame

    Returns a new DataFrame by selecting a set of column-based expressions.

    df.select($"colA", $"colB" + 1)
    Attributes
    protected
    Annotations
    @varargs()
    Since

    1.6.0

  63. def selectUntyped(columns: TypedColumn[_, _]*): Dataset[_]

    Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.

    Attributes
    protected
  64. def show(numRows: Int, truncate: Boolean): Unit

    Displays the Dataset in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.

    Since

    1.6.0

  65. def show(truncate: Boolean): Unit

    Displays the top 20 rows of the Dataset in a tabular form.

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.

    Since

    1.6.0

  66. def show(): Unit

    Displays the top 20 rows of the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right.

    Since

    1.6.0

  67. def show(numRows: Int): Unit

    Displays the content of this Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    Since

    1.6.0

  68. val sqlContext: SQLContext

    Definition Classes
    Dataset → Queryable
  69. def subtract(other: Dataset[T]): Dataset[T]

    Returns a new Dataset where any elements present in other have been removed.

    Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

    Since

    1.6.0

  70. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  71. def take(num: Int): Array[T]

    Returns the first num elements of this Dataset as an array.

    Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.

    Since

    1.6.0

  72. def takeAsList(num: Int): List[T]

    Returns the first num elements of this Dataset as a Java list.

    Running this requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.

    Since

    1.6.0

  73. def toDF(): DataFrame

    Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.

  74. def toDS(): Dataset[T]

    Returns this Dataset.

    Since

    1.6.0

  75. def toString(): String

    Definition Classes
    Queryable → AnyRef → Any
  76. def transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U]

    Concise syntax for chaining custom transformations.

    // Illustrative custom transformations (hypothetical):
    def featurize(ds: Dataset[String]): Dataset[(String, Int)] =
      ds.map(s => (s, s.length))

    dataset
      .transform(featurize)
      .transform(ds => ds.filter(_._2 > 0))
    Since

    1.6.0

  77. def union(other: Dataset[T]): Dataset[T]

    Returns a new Dataset that contains the elements of both this and the other Dataset combined.

    Note that this function is not a typical set union operation: it does not eliminate duplicate items, and is therefore analogous to UNION ALL in SQL.

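    A sketch contrasting union with the related set-like operations intersect and subtract:

    val a = Seq(1, 2, 2, 3).toDS()
    val b = Seq(2, 3, 4).toDS()

    a.union(b)     // keeps duplicates (UNION ALL): seven elements in total
    a.intersect(b) // elements of a that also appear in b
    a.subtract(b)  // elements of a that do not appear in b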
    Since

    1.6.0

  78. def unpersist(): Dataset.this.type

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

    Since

    1.6.0

  79. def unpersist(blocking: Boolean): Dataset.this.type

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

    blocking

    Whether to block until all blocks are deleted.

    Since

    1.6.0

  80. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  81. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  82. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
