public class ParquetRelation2 extends BaseRelation implements CatalystScan, InsertableRelation, SparkHadoopMapReduceUtil, Logging, scala.Product, scala.Serializable
An alternative to ParquetRelation that plugs in using the data sources API. This class is intended as a full replacement of the Parquet support in Spark SQL. The old implementation will be deprecated and eventually removed once this version is proven to be stable enough.
Compared with the old implementation, this class has the following notable differences:
- Partitioning discovery: Hive-style multi-level partitions are auto-discovered (see the usage sketch after this list).
- Metadata discovery: Parquet is a format that supports schema evolution. This data source can detect and merge schemas from all Parquet part-files as long as they are compatible. Also, metadata and FileStatuses are cached for better performance.
- Statistics: Statistics for the size of the table are automatically populated during schema
discovery.
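A minimal usage sketch of the partition discovery behavior, not taken from the Spark documentation: the directory layout, paths, and column names below are hypothetical, and the snippet assumes the data source API path is enabled (the default in this version).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical Hive-style partitioned layout:
//   /data/events/year=2015/month=01/part-00000.parquet
//   /data/events/year=2015/month=02/part-00000.parquet

val sc = new SparkContext("local", "parquet-partition-discovery")
val sqlContext = new SQLContext(sc)

// Pointing at the root directory is enough: the partition columns (`year`,
// `month`) are discovered from the directory names and appended to the
// schema read from the Parquet footers.
val events = sqlContext.parquetFile("/data/events")
events.printSchema()  // original Parquet columns plus year: integer, month: integer
```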
Modifier and Type | Class and Description
---|---
static class | ParquetRelation2.PartitionValues
static class | ParquetRelation2.PartitionValues$
Constructor and Description
---
ParquetRelation2(scala.collection.Seq<String> paths, scala.collection.immutable.Map<String,String> parameters, scala.Option<org.apache.spark.sql.types.StructType> maybeSchema, scala.Option<PartitionSpec> maybePartitionSpec, SQLContext sqlContext)
Modifier and Type | Method and Description
---|---
RDD<org.apache.spark.sql.Row> | buildScan(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> output, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates)
static String | DEFAULT_PARTITION_NAME()
boolean | equals(Object other)
static org.apache.spark.sql.catalyst.expressions.Literal | inferPartitionColumnValue(String raw, String defaultPartitionName) - Converts a string to a Literal with automatic type inference.
void | insert(DataFrame data, boolean overwrite)
boolean | isPartitioned()
scala.Option<PartitionSpec> | maybePartitionSpec()
scala.Option<org.apache.spark.sql.types.StructType> | maybeSchema()
static String | MERGE_SCHEMA()
static org.apache.spark.sql.types.StructType | mergeMetastoreParquetSchema(org.apache.spark.sql.types.StructType metastoreSchema, org.apache.spark.sql.types.StructType parquetSchema) - Reconciles the Hive Metastore case insensitivity issue and data type conflicts between the Metastore schema and the Parquet schema.
static String | METASTORE_SCHEMA()
scala.collection.immutable.Map<String,String> | parameters()
static ParquetRelation2.PartitionValues | parsePartition(org.apache.hadoop.fs.Path path, String defaultPartitionName) - Parses a single partition, returns column names and values of each partition column.
static PartitionSpec | parsePartitions(scala.collection.Seq<org.apache.hadoop.fs.Path> paths, String defaultPartitionName) - Given a group of qualified paths, tries to parse them and returns a partition specification.
org.apache.spark.sql.types.StructType | partitionColumns()
scala.collection.Seq<Partition> | partitions()
PartitionSpec | partitionSpec()
scala.collection.Seq<String> | paths()
static scala.Option<org.apache.spark.sql.types.StructType> | readSchema(scala.collection.Seq<parquet.hadoop.Footer> footers, SQLContext sqlContext)
static scala.collection.Seq<ParquetRelation2.PartitionValues> | resolvePartitions(scala.collection.Seq<ParquetRelation2.PartitionValues> values) - Resolves possible type conflicts between partitions by up-casting "lower" types.
org.apache.spark.sql.types.StructType | schema()
long | sizeInBytes() - Returns an estimated size of this relation in bytes.
SparkContext | sparkContext()
SQLContext | sqlContext()
Methods inherited from class Object: getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface SparkHadoopMapReduceUtil: firstAvailableClass, newJobContext, newTaskAttemptContext, newTaskAttemptID
Methods inherited from interface Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public ParquetRelation2(scala.collection.Seq<String> paths, scala.collection.immutable.Map<String,String> parameters, scala.Option<org.apache.spark.sql.types.StructType> maybeSchema, scala.Option<PartitionSpec> maybePartitionSpec, SQLContext sqlContext)
public static String MERGE_SCHEMA()
public static String DEFAULT_PARTITION_NAME()
public static String METASTORE_SCHEMA()
public static scala.Option<org.apache.spark.sql.types.StructType> readSchema(scala.collection.Seq<parquet.hadoop.Footer> footers, SQLContext sqlContext)
public static org.apache.spark.sql.types.StructType mergeMetastoreParquetSchema(org.apache.spark.sql.types.StructType metastoreSchema, org.apache.spark.sql.types.StructType parquetSchema)
Hive doesn't retain case information, while Parquet is case sensitive. On the other hand, the schema read from Parquet files may be incomplete (e.g. older versions of Parquet don't distinguish binary and string). This method generates a correct schema by merging Metastore schema data types and Parquet schema field names.
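The sketch below is an illustrative re-implementation of that reconciliation rule, not Spark's actual code: it keeps Parquet's case-preserving field names but takes the authoritative data types from the Metastore schema. The field names and schemas are made up.

```scala
import org.apache.spark.sql.types._

// Illustrative only (not Spark's implementation): keep Parquet's original,
// case-preserving field names, but take data types from the Metastore schema,
// matching fields case-insensitively.
def reconcile(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
  val metastoreTypes = metastoreSchema.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
  StructType(parquetSchema.fields.map { f =>
    f.copy(dataType = metastoreTypes.getOrElse(f.name.toLowerCase, f.dataType))
  })
}

// Hypothetical schemas:
val metastoreSchema = StructType(Seq(
  StructField("eventtime", LongType),   // Hive lower-cases field names
  StructField("payload", StringType)))  // the Metastore knows this is a string
val parquetSchema = StructType(Seq(
  StructField("eventTime", LongType),   // Parquet preserves the original case
  StructField("payload", BinaryType)))  // older writers report strings as binary

// reconcile(metastoreSchema, parquetSchema) yields:
//   StructType(StructField("eventTime", LongType), StructField("payload", StringType))
```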
public static PartitionSpec parsePartitions(scala.collection.Seq<org.apache.hadoop.fs.Path> paths, String defaultPartitionName)
Given a group of qualified paths, tries to parse them and returns a partition specification. For example, given:
hdfs://<host>:<port>/path/to/partition/a=1/b=hello/c=3.14
hdfs://<host>:<port>/path/to/partition/a=2/b=world/c=6.28
it returns:
PartitionSpec(
partitionColumns = StructType(
StructField(name = "a", dataType = IntegerType, nullable = true),
StructField(name = "b", dataType = StringType, nullable = true),
StructField(name = "c", dataType = DoubleType, nullable = true)),
partitions = Seq(
Partition(
values = Row(1, "hello", 3.14),
path = "hdfs://<host>:<port>/path/to/partition/a=1/b=hello/c=3.14"),
Partition(
values = Row(2, "world", 6.28),
path = "hdfs://<host>:<port>/path/to/partition/a=2/b=world/c=6.28")))
public static ParquetRelation2.PartitionValues parsePartition(org.apache.hadoop.fs.Path path, String defaultPartitionName)
Parses a single partition, returns column names and values of each partition column. For example, given:
path = hdfs://<host>:<port>/path/to/partition/a=42/b=hello/c=3.14
it returns:
PartitionValues(
Seq("a", "b", "c"),
Seq(
Literal(42, IntegerType),
Literal("hello", StringType),
Literal(3.14, FloatType)))
public static scala.collection.Seq<ParquetRelation2.PartitionValues> resolvePartitions(scala.collection.Seq<ParquetRelation2.PartitionValues> values)
Resolves possible type conflicts between partitions by up-casting "lower" types. The up-casting order is:
NullType -> IntegerType -> LongType -> FloatType -> DoubleType -> DecimalType.Unlimited -> StringType
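Grounded in the up-casting chain above, the following is an illustrative sketch of the resolution rule, not Spark's implementation (the real method operates on ParquetRelation2.PartitionValues rather than bare data types):

```scala
import org.apache.spark.sql.types._

// Rank of each type in the documented up-casting chain; a conflicting
// partition column is resolved to the highest-ranked type observed.
val upcastOrder: Seq[DataType] =
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, DecimalType.Unlimited, StringType)

def resolveColumnType(observed: Seq[DataType]): DataType =
  observed.maxBy(t => upcastOrder.indexOf(t))

// e.g. a=1 in one partition directory and a=3.14 in another:
resolveColumnType(Seq(IntegerType, FloatType))   // FloatType
```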
public static org.apache.spark.sql.catalyst.expressions.Literal inferPartitionColumnValue(String raw, String defaultPartitionName)
Converts a string to a Literal with automatic type inference. Currently only supports IntegerType, LongType, FloatType, DoubleType, DecimalType.Unlimited, and StringType.
public scala.collection.Seq<String> paths()
public scala.collection.immutable.Map<String,String> parameters()
public scala.Option<org.apache.spark.sql.types.StructType> maybeSchema()
public scala.Option<PartitionSpec> maybePartitionSpec()
public SQLContext sqlContext()
Specified by:
sqlContext in class BaseRelation
public boolean equals(Object other)
Specified by:
equals in interface scala.Equals
Overrides:
equals in class Object
public SparkContext sparkContext()
public PartitionSpec partitionSpec()
public org.apache.spark.sql.types.StructType partitionColumns()
public scala.collection.Seq<Partition> partitions()
public boolean isPartitioned()
public org.apache.spark.sql.types.StructType schema()
Specified by:
schema in class BaseRelation
public long sizeInBytes()
Description copied from class: BaseRelation
Returns an estimated size of this relation in bytes. Note that it is always better to overestimate size than underestimate, because underestimation could lead to execution plans that are suboptimal (i.e. broadcasting a very large table).
Overrides:
sizeInBytes in class BaseRelation
public RDD<org.apache.spark.sql.Row> buildScan(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> output, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates)
Specified by:
buildScan in interface CatalystScan
public void insert(DataFrame data, boolean overwrite)
Specified by:
insert in interface InsertableRelation
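A hedged usage sketch (the table name "events" and its columns are hypothetical): inserting a DataFrame into a table backed by this relation is typically done through DataFrame.insertInto, which the query planner turns into a call to this InsertableRelation.insert method.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "insert-example")
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical table "events", assumed to be registered and backed by this
// Parquet relation.
val newEvents = sc.parallelize(Seq((2015, 3, "click"))).toDF("year", "month", "action")

// The second argument maps to the `overwrite` flag of insert(data, overwrite):
// false appends, true replaces the existing data.
newEvents.insertInto("events", false)
```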