sql

Type Members

class AnalysisException extends Exception with Serializable

:: DeveloperApi :: Thrown when a query fails to analyze, usually because the query itself is invalid.
:: DeveloperApi :: Thrown when a query fails to analyze, usually because the query itself is invalid.

Annotations
@DeveloperApi()

class Column extends Logging

:: Experimental :: A column that will be computed based on the data in a DataFrame.

A new column is constructed based on the input columns present in a dataframe:

df("columnName")            // On a specific DataFrame.
col("columnName")           // A generic column no yet associated with a DataFrame.
col("columnName.field")     // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName"               // Scala short hand for a named column.
expr("a + 1")               // A column that is constructed from a parsed SQL Expression.
lit("abc")                  // A column that produces a literal (constant) value.

Column objects can be composed to form complex expressions:

$"a" + 1
$"a" === $"b"

Annotations: @Experimental()
Since: 1.3.0

class ColumnName extends Column

:: Experimental :: A convenient class used for constructing schema.
:: Experimental :: A convenient class used for constructing schema.

Annotations
@Experimental()
Since
1.3.0
trait ContinuousQuery extends AnyRef

:: Experimental :: A handle to a query that is executing continuously in the background as new data arrives.
:: Experimental :: A handle to a query that is executing continuously in the background as new data arrives. All these methods are thread-safe.

Annotations
@Experimental()
Since
2.0.0
class ContinuousQueryException extends Exception

:: Experimental :: Exception that stopped a ContinuousQuery.
:: Experimental :: Exception that stopped a ContinuousQuery.

Annotations
@Experimental()
Since
2.0.0
class ContinuousQueryManager extends AnyRef

:: Experimental :: A class to manage all the ContinuousQueries active on a SparkSession.
:: Experimental :: A class to manage all the ContinuousQueries active on a SparkSession.

Annotations
@Experimental()
Since
2.0.0
type DataFrame = Dataset[Row]
final class DataFrameNaFunctions extends AnyRef

:: Experimental :: Functionality for working with missing data in DataFrames.
:: Experimental :: Functionality for working with missing data in DataFrames.

Annotations
@Experimental()
Since
1.3.1
class DataFrameReader extends Logging

Interface used to load a Dataset from external storage systems (e.g.
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc) or data streams. Use SparkSession.read to access this.

Since
1.4.0
final class DataFrameStatFunctions extends AnyRef

:: Experimental :: Statistic functions for DataFrames.
:: Experimental :: Statistic functions for DataFrames.

Annotations
@Experimental()
Since
1.4.0
final class DataFrameWriter extends AnyRef

Interface used to write a Dataset to external storage systems (e.g.
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc) or data streams. Use Dataset.write to access this.

Since
1.4.0
class Dataset[T] extends Serializable

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations.
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions count, show, or writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the explain function.
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has much lower memory footprint as well as are optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
```
val people = spark.read.parquet("...").as[Person]  // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)  // Java
```
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
```
val names = people.map(_.name)  // in Scala; names is a Dataset[String]
Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING))  // in Java 8
```
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use apply method in Scala and col in Java.
```
val ageCol = people("age")  // in Scala
Column ageCol = people.col("age")  // in Java
```
Note that the Column type can also be manipulated through its various functions.
```
// The following creates a new column that increases everybody's age by 10.
people("age") + 10  // in Scala
people.col("age").plus(10);  // in Java
```
A more concrete example in Scala:
```
// To create Dataset[Row] using SQLContext
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))
```
and in Java:
```
// To create Dataset using SQLContext
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter("age".gt(30))
  .join(department, people.col("deptId").equalTo(department("id")))
  .groupBy(department.col("name"), "gender")
  .agg(avg(people.col("salary")), max(people.col("age")));
```
Since
1.6.0
case class DatasetHolder[T] extends Product with Serializable

A container for a Dataset, used for implicit conversions in Scala.
A container for a Dataset, used for implicit conversions in Scala.
To use this, import implicit conversions in SQL:
```
import sqlContext.implicits._
```
Since
1.6.0
trait Encoder[T] extends Serializable

:: Experimental :: Used to convert a JVM object of type T to and from the internal Spark SQL representation.
:: Experimental :: Used to convert a JVM object of type T to and from the internal Spark SQL representation.
Scala
Encoders are generally created automatically through implicits from a SparkSession, or can be explicitly created by calling static methods on Encoders.
```
import spark.implicits._

val ds = Seq(1, 2, 3).toDS() // implicitly provided (spark.implicits.newIntEncoder)
```
Java
Encoders are specified by calling static methods on Encoders.
```
List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> ds = context.createDataset(data, Encoders.STRING());
```
Encoders can be composed into tuples:
```
Encoder<Tuple2<Integer, String>> encoder2 = Encoders.tuple(Encoders.INT(), Encoders.STRING());
List<Tuple2<Integer, String>> data2 = Arrays.asList(new scala.Tuple2(1, "a");
Dataset<Tuple2<Integer, String>> ds2 = context.createDataset(data2, encoder2);
```
Or constructed from Java Beans:
```
Encoders.bean(MyClass.class);
```
Implementation
- Encoders are not required to be thread-safe and thus they do not need to use locks to guard against concurrent access if they reuse internal buffers to improve performance.
Annotations
@Experimental() @implicitNotFound( ... )
Since
1.6.0
class ExperimentalMethods extends AnyRef

:: Experimental :: Holder for experimental methods for the bravest.
:: Experimental :: Holder for experimental methods for the bravest. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
```
spark.experimental.extraStrategies += ...
```
Annotations
@Experimental()
Since
1.3.0
class KeyValueGroupedDataset[K, V] extends Serializable

:: Experimental :: A Dataset has been logically grouped by a user specified grouping key.
:: Experimental :: A Dataset has been logically grouped by a user specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupBy on an existing Dataset.

Annotations
@Experimental()
Since
2.0.0
case class ProcessingTime(intervalMs: Long) extends Trigger with Product with Serializable

:: Experimental :: A trigger that runs a query periodically based on the processing time.
:: Experimental :: A trigger that runs a query periodically based on the processing time. If interval is 0, the query will run as fast as possible.
Scala Example:
```
df.write.trigger(ProcessingTime("10 seconds"))

import scala.concurrent.duration._
df.write.trigger(ProcessingTime(10.seconds))
```
Java Example:
```
df.write.trigger(ProcessingTime.create("10 seconds"))

import java.util.concurrent.TimeUnit
df.write.trigger(ProcessingTime.create(10, TimeUnit.SECONDS))
```
Annotations
@Experimental()
class RelationalGroupedDataset extends AnyRef

A set of methods for aggregations on a DataFrame, created by Dataset.groupBy.
A set of methods for aggregations on a DataFrame, created by Dataset.groupBy.
The main method is the agg function, which has multiple variants. This class also contains convenience some first order statistics such as mean, sum for convenience.

Since
2.0.0
trait Row extends Serializable

Represents one row of output from a relational operator.
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access.
It is invalid to use the native primitive interface to retrieve a value that is null, instead a user must check isNullAt before attempting to retrieve a value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
A Row object can be constructed by providing field values. Example:
```
import org.apache.spark.sql._

// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
```
A value of a row can be accessed through both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. An example of generic access by ordinal:
```
import org.apache.spark.sql._

val row = Row(1, true, "a string", null)
// row: Row = [1,true,a string,null]
val firstValue = row(0)
// firstValue: Any = 1
val fourthValue = row(3)
// fourthValue: Any = null
```
For native primitive access, it is invalid to use the native primitive interface to retrieve a value that is null, instead a user must check isNullAt before attempting to retrieve a value that might be null. An example of native primitive access:
```
// using the row from the previous example.
val firstValue = row.getInt(0)
// firstValue: Int = 1
val isNull = row.isNullAt(3)
// isNull: Boolean = true
```
In Scala, fields in a Row object can be extracted in a pattern match. Example:
```
import org.apache.spark.sql._

val pairs = sql("SELECT key, value FROM src").rdd.map {
  case Row(key: Int, value: String) =>
    key -> value
}
```
class RowFactory extends AnyRef
class RuntimeConfig extends AnyRef

Runtime configuration interface for Spark.
Runtime configuration interface for Spark. To access this, use SparkSession.conf.
Options set here are automatically propagated to the Hadoop configuration during I/O.

Since
2.0.0
class SQLContext extends Logging with Serializable

The entry point for working with structured data (rows and columns) in Spark, in Spark 1.x.
The entry point for working with structured data (rows and columns) in Spark, in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.

Since
1.0.0
abstract class SQLImplicits extends AnyRef

A collection of implicit methods for converting common Scala objects into DataFrames.
A collection of implicit methods for converting common Scala objects into DataFrames.

Since
1.6.0
final class SaveMode extends Enum[SaveMode]
class SinkStatus extends AnyRef

:: Experimental :: Status and metrics of a streaming Sink.
:: Experimental :: Status and metrics of a streaming Sink.

Annotations
@Experimental()
Since
2.0.0
class SourceStatus extends AnyRef

:: Experimental :: Status and metrics of a streaming Source.
:: Experimental :: Status and metrics of a streaming Source.

Annotations
@Experimental()
Since
2.0.0
class SparkSession extends Serializable with Logging

The entry point to programming Spark with the Dataset and DataFrame API.
The entry point to programming Spark with the Dataset and DataFrame API.
To create a SparkSession, use the following builder pattern:
```
SparkSession.builder()
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value").
  .getOrCreate()
```
type Strategy = GenericStrategy[SparkPlan]

Converts a logical plan into zero or more SparkPlans.
Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting with the query planner and is not designed to be stable across spark releases. Developers writing libraries should instead consider using the stable APIs provided in org.apache.spark.sql.sources

Annotations
@DeveloperApi()
sealed trait Trigger extends AnyRef

:: Experimental :: Used to indicate how often results should be produced by a ContinuousQuery.
:: Experimental :: Used to indicate how often results should be produced by a ContinuousQuery.

Annotations
@Experimental()
class TypedColumn[-T, U] extends Column

A Column where an Encoder has been given for the expected input and return type.
A Column where an Encoder has been given for the expected input and return type. To create a TypedColumn, use the as function on a Column.
T
The input type expected for this expression. Can be Any if the expression is type checked by the analyzer instead of the compiler (i.e. expr("sum(...)")).
U
The output type of this column.

Since
1.6.0
class UDFRegistration extends Logging

Functions for registering user-defined functions.
Functions for registering user-defined functions. Use SQLContext.udf to access this.

Since
1.3.0

Value Members

object Encoders

:: Experimental :: Methods for creating an Encoder.
:: Experimental :: Methods for creating an Encoder.

Annotations
@Experimental()
Since
1.6.0
object ProcessingTime extends Serializable

:: Experimental :: Used to create ProcessingTime triggers for ContinuousQuerys.
:: Experimental :: Used to create ProcessingTime triggers for ContinuousQuerys.

Annotations
@Experimental()
object Row extends Serializable
object SQLContext extends Serializable

This SQLContext object contains utility functions to create a singleton SQLContext instance, or to get the created SQLContext instance.
This SQLContext object contains utility functions to create a singleton SQLContext instance, or to get the created SQLContext instance.
It also provides utility functions to support preference for threads in multiple sessions scenario, setActive could set a SQLContext for current thread, which will be returned by getOrCreate instead of the global one.
object SparkSession extends Serializable
package api

Contains API classes that are specific to a single language (i.e.
Contains API classes that are specific to a single language (i.e. Java).
package catalog
package expressions
object functions

:: Experimental :: Functions available for DataFrame.
:: Experimental :: Functions available for DataFrame.

Annotations
@Experimental()
Since
1.3.0
package hive

Support for running Spark SQL queries using functionality from Apache Hive (does not require an existing Hive installation).
Support for running Spark SQL queries using functionality from Apache Hive (does not require an existing Hive installation). Supported Hive features include:
- Using HiveQL to express queries.
- Reading metadata from the Hive Metastore using HiveSerDes.
- Hive UDFs, UDAs, UDTs
Users that would like access to this functionality should create a HiveContext instead of a SQLContext.
package internal

All classes in this package are considered an internal API to Spark and are subject to change between minor releases.
package jdbc
package sources

A set of APIs for adding data sources to Spark SQL.
package types

Contains a type system for attributes produced by relations, including complex types like structs, arrays and maps.
package util

package sql

Type Members

class AnalysisException extends Exception with Serializable

class Column extends Logging

class ColumnName extends Column

trait ContinuousQuery extends AnyRef

class ContinuousQueryException extends Exception

class ContinuousQueryManager extends AnyRef

type DataFrame = Dataset[Row]

final class DataFrameNaFunctions extends AnyRef

class DataFrameReader extends Logging

final class DataFrameStatFunctions extends AnyRef

final class DataFrameWriter extends AnyRef

class Dataset[T] extends Serializable

case class DatasetHolder[T] extends Product with Serializable

trait Encoder[T] extends Serializable

Scala

Java

Implementation

class ExperimentalMethods extends AnyRef

class KeyValueGroupedDataset[K, V] extends Serializable

case class ProcessingTime(intervalMs: Long) extends Trigger with Product with Serializable

class RelationalGroupedDataset extends AnyRef

trait Row extends Serializable

class RowFactory extends AnyRef

class RuntimeConfig extends AnyRef

class SQLContext extends Logging with Serializable

abstract class SQLImplicits extends AnyRef

final class SaveMode extends Enum[SaveMode]

class SinkStatus extends AnyRef

class SourceStatus extends AnyRef

class SparkSession extends Serializable with Logging

type Strategy = GenericStrategy[SparkPlan]

sealed trait Trigger extends AnyRef

class TypedColumn[-T, U] extends Column

class UDFRegistration extends Logging

Value Members

object Encoders

object ProcessingTime extends Serializable

object Row extends Serializable

object SQLContext extends Serializable

object SparkSession extends Serializable

package api

package catalog

package expressions

object functions

package hive

package internal

package jdbc

package sources

package types

package util

Inherited from AnyRef

Inherited from Any

Row

Ungrouped