org.apache.spark.sql.sources

HadoopFsRelation

abstract class HadoopFsRelation extends BaseRelation with FileRelation with Logging

::Experimental:: A BaseRelation that provides much of the common code required for relations that store their data in an HDFS-compatible file system.

For the read path, similar to PrunedFilteredScan, it can eliminate unneeded columns and filter using selected predicates before producing an RDD containing all matching tuples as Row objects. In addition, when reading from Hive-style partitioned tables stored in file systems, it is able to discover partitioning information from the paths of input directories and to perform partition pruning before it starts reading the data. Subclasses of HadoopFsRelation must override one of the buildScan methods to implement the read path.

For the write path, it provides the ability to write to both non-partitioned and partitioned tables. Directory layout of the partitioned tables is compatible with Hive.
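
The sketch below shows what a minimal subclass might look like for a line-oriented text format. It is illustrative only: SimpleTextRelation and SimpleOutputWriterFactory are hypothetical names, and error handling is omitted.

    import org.apache.hadoop.fs.FileStatus
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{HadoopFsRelation, OutputWriterFactory}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Illustrative subclass that reads plain text files, one row per line.
    class SimpleTextRelation(override val paths: Array[String])
                            (@transient val sqlContext: SQLContext)
      extends HadoopFsRelation {

      // Schema of the data files themselves; discovered partition columns are
      // appended to `schema` automatically.
      override def dataSchema: StructType =
        StructType(StructField("value", StringType, nullable = true) :: Nil)

      // Read path: called once per selected partition (or once if unpartitioned).
      override def buildScan(inputFiles: Array[FileStatus]): RDD[Row] = {
        val files = inputFiles.map(_.getPath.toString)
        sqlContext.sparkContext.textFile(files.mkString(",")).map(line => Row(line))
      }

      // Write path: only configure `job` here and return a factory for OutputWriters.
      override def prepareJobForWrite(job: Job): OutputWriterFactory =
        new SimpleOutputWriterFactory() // hypothetical OutputWriterFactory implementation
    }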

Annotations
@Experimental()
Source
interfaces.scala
Since

1.4.0

Linear Supertypes
Logging, FileRelation, BaseRelation, AnyRef, Any
Ordering
  1. Alphabetic
  2. By inheritance
Inherited
  1. HadoopFsRelation
  2. Logging
  3. FileRelation
  4. BaseRelation
  5. AnyRef
  6. Any
  1. Hide All
  2. Show all
Learn more about member selection
Visibility
  1. Public
  2. All

Instance Constructors

  1. new HadoopFsRelation(parameters: Map[String, String])

  2. new HadoopFsRelation()

Abstract Value Members

  1. abstract def dataSchema: StructType

    Specifies the schema of the actual data files.

    Specifies the schema of the actual data files. For partitioned relations, if one or more partition columns are contained in the data files, they should also appear in dataSchema.

    Since

    1.4.0

  2. abstract def paths: Array[String]

    Paths of this relation.

    Paths of this relation. For partitioned relations, these should be the root directories of all partition directories (see the illustrative layout below).

    Since

    1.4.0
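
    For illustration (the paths below are hypothetical), a Hive-compatible layout under a single root; passing only the root as the relation path lets partition discovery recover year and month, and predicates on them prune whole directories before any file is read:

      // /data/events/year=2015/month=01/part-00000.parquet
      // /data/events/year=2015/month=02/part-00000.parquet
      val df = sqlContext.read.parquet("/data/events") // root of all partition directories
      df.filter("year = 2015 AND month = 1").count()   // only the month=01 directory is scanned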

  3. abstract def prepareJobForWrite(job: Job): OutputWriterFactory

    Prepares a write job and returns an OutputWriterFactory.

    Prepares a write job and returns an OutputWriterFactory. Client-side job preparation can be put here. For example, a user-defined output committer can be configured by setting the committer class under the spark.sql.sources.outputCommitterClass configuration key (see the sketch below).

    Note that the only side effect expected here is mutating job via its setters. In particular, because Spark SQL caches BaseRelation instances for performance, mutating relation-internal state may cause unexpected behavior.

    Since

    1.4.0
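
    A hedged sketch of such an override; MyOutputCommitter and MyOutputWriterFactory are hypothetical, not Spark classes:

      override def prepareJobForWrite(job: Job): OutputWriterFactory = {
        // Only mutate `job`; the relation instance itself may be cached and reused by Spark SQL.
        job.getConfiguration.set(
          "spark.sql.sources.outputCommitterClass",
          classOf[MyOutputCommitter].getName) // hypothetical FileOutputCommitter subclass
        new MyOutputWriterFactory()           // hypothetical; creates one OutputWriter per task
      }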

  4. abstract def sqlContext: SQLContext

    Definition Classes
    BaseRelation

Concrete Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def buildScan(requiredColumns: Array[String], filters: Array[Filter], inputFiles: Array[FileStatus]): RDD[Row]

    For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation.

    For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.

    requiredColumns

    Required columns.

    filters

    Candidate filters to be pushed down. The actual filter should be the conjunction of all filters. The pushed down filters are currently purely an optimization as they will all be evaluated again. This means it is safe to use them with methods that produce false positives such as filtering partitions based on a bloom filter.

    inputFiles

    For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.

    Since

    1.4.0
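
    A hedged sketch of an override, written as if inside a subclass like the one in the class description; readFile is a hypothetical helper that parses one file and keeps only the required columns:

      override def buildScan(
          requiredColumns: Array[String],
          filters: Array[Filter],
          inputFiles: Array[FileStatus]): RDD[Row] = {
        // Filters are advisory: Spark re-evaluates them after the scan, so applying
        // only some of them (or none) is still correct, just less efficient.
        val files = inputFiles.map(_.getPath.toString)
        sqlContext.sparkContext
          .parallelize(files.toSeq, math.max(files.length, 1))
          .flatMap(path => readFile(path, requiredColumns, filters)) // hypothetical per-file reader
      }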

  8. def buildScan(requiredColumns: Array[String], inputFiles: Array[FileStatus]): RDD[Row]

    For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation.

    For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.

    requiredColumns

    Required columns.

    inputFiles

    For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.

    Since

    1.4.0

  9. def buildScan(inputFiles: Array[FileStatus]): RDD[Row]

    For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation.

    For a non-partitioned relation, this method builds an RDD[Row] containing all rows within this relation. For partitioned relations, this method is called for each selected partition, and builds an RDD[Row] containing all rows within that single partition.

    inputFiles

    For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition.

    Since

    1.4.0

  10. def cachedLeafStatuses(): LinkedHashSet[FileStatus]

    Attributes
    protected
  11. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  12. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  13. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  14. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  15. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  16. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  17. def inputFiles: Array[String]

    Definition Classes
    HadoopFsRelation → FileRelation
  18. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  19. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  20. def log: Logger

    Attributes
    protected
    Definition Classes
    Logging
  21. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  22. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  23. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  24. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  25. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  26. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  27. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  28. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  29. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  30. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  31. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  32. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  33. def needConversion: Boolean

    Whether it needs to convert the objects in Row to the internal representation.

    Whether it needs to convert the objects in Row to the internal representation, for example: java.lang.String -> UTF8String, java.math.BigDecimal -> Decimal.

    If needConversion is false, buildScan() should return an RDD of InternalRow.

    Note: The internal representation is not stable across releases and thus data sources outside of Spark SQL should leave this as true.

    Definition Classes
    BaseRelation
    Since

    1.4.0

  34. final def notify(): Unit

    Definition Classes
    AnyRef
  35. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  36. final def partitionColumns: StructType

    Partition columns.

    Partition columns. Can be either defined by userDefinedPartitionColumns or automatically discovered. Note that they should always be nullable.

    Since

    1.4.0

  37. lazy val schema: StructType

    Schema of this relation.

    Schema of this relation. It consists of columns appearing in dataSchema and all partition columns not appearing in dataSchema.

    Definition Classes
    HadoopFsRelation → BaseRelation
    Since

    1.4.0

  38. def sizeInBytes: Long

    Returns an estimated size of this relation in bytes.

    Returns an estimated size of this relation in bytes. This information is used by the planner to decide when it is safe to broadcast a relation and can be overridden by sources that know the size ahead of time. By default, the system will assume that tables are too large to broadcast. This method will be called multiple times during query planning and thus should not perform expensive operations for each invocation.

    Note that it is always better to overestimate size than to underestimate it, because underestimation could lead to suboptimal execution plans (e.g. broadcasting a very large table).

    Definition Classes
    HadoopFsRelation → BaseRelation
    Since

    1.3.0
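
    A plausible override for a file-based source, assuming cachedLeafStatuses() returns Scala's mutable LinkedHashSet as listed above, simply sums the leaf file lengths:

      override def sizeInBytes: Long =
        // When in doubt, overestimating is the safer choice for broadcast decisions.
        cachedLeafStatuses().iterator.map(_.getLen).sum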

  39. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  40. def toString(): String

    Definition Classes
    HadoopFsRelation → AnyRef → Any
  41. def unhandledFilters(filters: Array[Filter]): Array[Filter]

    Returns the list of Filters that this datasource may not be able to handle.

    Returns the list of Filters that this datasource may not be able to handle. These returned Filters will be evaluated by Spark SQL after data is output by a scan. By default, this function will return all filters, as it is always safe to double evaluate a Filter. However, specific implementations can override this function to avoid double filtering when they are capable of processing a filter internally.

    Definition Classes
    BaseRelation
    Since

    1.6.0
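
    A hedged sketch, assuming the source genuinely evaluates EqualTo filters exactly (EqualTo comes from org.apache.spark.sql.sources); all other filters are reported back as unhandled:

      override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
        filters.filterNot(_.isInstanceOf[EqualTo]) // EqualTo is fully handled by the scan itself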

  42. def userDefinedPartitionColumns: Option[StructType]

    Optional user defined partition columns.

    Optional user defined partition columns.

    Since

    1.4.0

  43. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  44. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  45. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
