public interface HiveInspectors
Decimal
Array[Byte]
java.sql.Date
java.sql.Timestamp
Complex Types =>
Map: MapData
List: ArrayData
Struct: InternalRow
Union: NOT SUPPORTED YET
Complex types act as containers that can hold arbitrary data types.
In Hive, UDFs/UDAFs/UDTFs work with a variety of native data types, each associated with
an Object Inspector. In the Hive expression evaluation framework, the underlying data are:
Primitive Type
Java Boxed Primitives:
org.apache.hadoop.hive.common.type.HiveVarchar
org.apache.hadoop.hive.common.type.HiveChar
java.lang.String
java.lang.Integer
java.lang.Boolean
java.lang.Float
java.lang.Double
java.lang.Long
java.lang.Short
java.lang.Byte
org.apache.hadoop.hive.common.type.HiveDecimal
byte[]
java.sql.Date
java.sql.Timestamp
Writables:
org.apache.hadoop.hive.serde2.io.HiveVarcharWritable
org.apache.hadoop.hive.serde2.io.HiveCharWritable
org.apache.hadoop.io.Text
org.apache.hadoop.io.IntWritable
org.apache.hadoop.hive.serde2.io.DoubleWritable
org.apache.hadoop.io.BooleanWritable
org.apache.hadoop.io.LongWritable
org.apache.hadoop.io.FloatWritable
org.apache.hadoop.hive.serde2.io.ShortWritable
org.apache.hadoop.hive.serde2.io.ByteWritable
org.apache.hadoop.io.BytesWritable
org.apache.hadoop.hive.serde2.io.DateWritable
org.apache.hadoop.hive.serde2.io.TimestampWritable
org.apache.hadoop.hive.serde2.io.HiveDecimalWritable
Complex Type
List: Object[] / java.util.List
Map: java.util.Map
Struct: Object[] / java.util.List / java POJO
Union: class StandardUnion { byte tag; Object object }
NOTICE: HiveVarchar/HiveChar are not supported by Catalyst; they are simply treated as String type.
2. Hive ObjectInspector is a group of flexible APIs for inspecting values in different data representations. Developers can extend those APIs as needed, so technically an object inspector can support arbitrary Java data types.
Fortunately, only a few built-in Hive Object Inspectors are used in generic UDF/UDAF/UDTF evaluation.
1) Primitive Types (PrimitiveObjectInspector & its subclasses)
public interface PrimitiveObjectInspector {
// Java Primitives (java.lang.Integer, java.lang.String etc.)
Object getPrimitiveJavaObject(Object o);
// Writables (hadoop.io.IntWritable, hadoop.io.Text etc.)
Object getPrimitiveWritableObject(Object o);
// An ObjectInspector that only inspects `writable` data always returns true here;
// check this before invoking the methods above.
boolean preferWritable();
...
}
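The check described in the comments above can be sketched as follows. This is a minimal, self-contained illustration rather than Hive code: `PrimitiveInspectorSketch` is a hypothetical stand-in that mirrors only the three methods listed above, and `BoxedIntInspector` is an invented example implementation.

```java
// Hypothetical stand-in mirroring the three methods of PrimitiveObjectInspector
// shown above; it is NOT the real Hive interface.
interface PrimitiveInspectorSketch {
    Object getPrimitiveJavaObject(Object o);
    Object getPrimitiveWritableObject(Object o);
    boolean preferWritable();
}

// Invented example inspector over plain java.lang.Integer data.
// It has no writable form in this sketch, so preferWritable() returns false.
final class BoxedIntInspector implements PrimitiveInspectorSketch {
    @Override public Object getPrimitiveJavaObject(Object o) { return (Integer) o; }
    @Override public Object getPrimitiveWritableObject(Object o) {
        throw new UnsupportedOperationException("no writable representation in this sketch");
    }
    @Override public boolean preferWritable() { return false; }
}

final class PreferWritableSketch {
    // Consult preferWritable() first, then call the matching getter.
    static Object read(PrimitiveInspectorSketch oi, Object data) {
        if (data == null) {
            return null;
        }
        return oi.preferWritable()
            ? oi.getPrimitiveWritableObject(data)
            : oi.getPrimitiveJavaObject(data);
    }

    public static void main(String[] args) {
        System.out.println(read(new BoxedIntInspector(), 42)); // prints 42
    }
}
```

Calling the wrong getter on `BoxedIntInspector` would throw, which is why the sketch branches on `preferWritable()` before touching the data.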
2) Complex Types:
ListObjectInspector: inspects a Java array or List
MapObjectInspector: inspects a Map
StructObjectInspector: inspects a Java array, a List, or even a normal Java object (POJO)
UnionObjectInspector: (tag: Int, object data) (TODO: not supported by SparkSQL yet)
3) ConstantObjectInspector: A constant object inspector can be of either primitive or complex type. It bundles a constant value as a property; the value is usually created when the constant object inspector is constructed.
public interface ConstantObjectInspector extends ObjectInspector {
Object getWritableConstantValue();
...
}
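As a rough illustration of a constant bundled into an inspector, consider the sketch below. All names are hypothetical: `ConstantInspectorSketch` mirrors only the single method shown above, and a plain `String` stands in for the writable that a real Hive constant inspector would hold.

```java
// Hypothetical stand-in for the single ConstantObjectInspector method shown above;
// it is NOT the real Hive interface.
interface ConstantInspectorSketch {
    Object getWritableConstantValue();
}

// Invented example: the constant is captured once, at construction time.
// A String stands in for a writable such as org.apache.hadoop.io.Text.
final class ConstantTextInspector implements ConstantInspectorSketch {
    private final String value;

    ConstantTextInspector(String value) {
        this.value = value;
    }

    @Override public Object getWritableConstantValue() {
        return value;
    }
}

final class ConstantInspectorDemo {
    // Reads the constant from the inspector; the per-row data is never consulted.
    static Object evaluate(ConstantInspectorSketch oi, Object ignoredRowData) {
        return oi.getWritableConstantValue();
    }

    public static void main(String[] args) {
        ConstantInspectorSketch oi = new ConstantTextInspector("%s-%s-%s");
        System.out.println(evaluate(oi, "row 1"));
        System.out.println(evaluate(oi, "row 2"));
    }
}
```

Both calls return the same value regardless of the row passed in, which is the property that lets a constant be resolved once rather than per row.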
Hive provides 3 built-in constant object inspectors:
Primitive Object Inspectors:
WritableConstantStringObjectInspector
WritableConstantHiveVarcharObjectInspector
WritableConstantHiveCharObjectInspector
WritableConstantHiveDecimalObjectInspector
WritableConstantTimestampObjectInspector
WritableConstantIntObjectInspector
WritableConstantDoubleObjectInspector
WritableConstantBooleanObjectInspector
WritableConstantLongObjectInspector
WritableConstantFloatObjectInspector
WritableConstantShortObjectInspector
WritableConstantByteObjectInspector
WritableConstantBinaryObjectInspector
WritableConstantDateObjectInspector
Map Object Inspector:
StandardConstantMapObjectInspector
List Object Inspector:
StandardConstantListObjectInspector
Struct Object Inspector: Hive doesn't provide a built-in constant object inspector for Struct
Union Object Inspector: Hive doesn't provide a built-in constant object inspector for Union
3. This trait facilitates:
Data Unwrapping: Hive Data => Catalyst Data (unwrap)
Data Wrapping: Catalyst Data => Hive Data (wrap)
Binding the Object Inspector for Catalyst Data (toInspector)
Retrieving the Catalyst Data Type from Object Inspector (inspectorToDataType)
4. Future Improvement (TODO)
This implementation is quite ugly and inefficient:
a. Pattern matching at runtime
b. Creation of small objects when converting Catalyst data => writable
c. Unnecessary unwrap / wrap for nested UDF invocation, e.g. date_add(printf("%s-%s-%s", a,b,c), 3): we don't need to unwrap the data from printf and wrap it again before passing it to date_add
Modifier and Type | Interface and Description |
---|---|
static class | HiveInspectors.typeInfoConversions |
Modifier and Type | Method and Description |
---|---|
DecimalType | decimalTypeInfoToCatalyst(org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector inspector) |
org.apache.hadoop.io.BytesWritable | getBinaryWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getBinaryWritableConstantObjectInspector(Object value) |
org.apache.hadoop.io.BooleanWritable | getBooleanWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getBooleanWritableConstantObjectInspector(Object value) |
org.apache.hadoop.hive.serde2.io.ByteWritable | getByteWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getByteWritableConstantObjectInspector(Object value) |
org.apache.hadoop.hive.serde2.io.DateWritable | getDateWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getDateWritableConstantObjectInspector(Object value) |
org.apache.hadoop.hive.serde2.io.HiveDecimalWritable | getDecimalWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getDecimalWritableConstantObjectInspector(Object value) |
org.apache.hadoop.hive.serde2.io.DoubleWritable | getDoubleWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getDoubleWritableConstantObjectInspector(Object value) |
org.apache.hadoop.io.FloatWritable | getFloatWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getFloatWritableConstantObjectInspector(Object value) |
org.apache.hadoop.io.IntWritable | getIntWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getIntWritableConstantObjectInspector(Object value) |
org.apache.hadoop.io.LongWritable | getLongWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getLongWritableConstantObjectInspector(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getPrimitiveNullWritableConstantObjectInspector() |
org.apache.hadoop.hive.serde2.io.ShortWritable | getShortWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getShortWritableConstantObjectInspector(Object value) |
org.apache.hadoop.io.Text | getStringWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getStringWritableConstantObjectInspector(Object value) |
org.apache.hadoop.hive.serde2.io.TimestampWritable | getTimestampWritable(Object value) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | getTimestampWritableConstantObjectInspector(Object value) |
DataType | inspectorToDataType(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector inspector) |
boolean | isSubClassOf(java.lang.reflect.Type t, Class<?> parent) |
DataType | javaTypeToDataType(java.lang.reflect.Type clz) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | toInspector(DataType dataType) |
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector | toInspector(org.apache.spark.sql.catalyst.expressions.Expression expr) - Map the catalyst expression to ObjectInspector, however, if the expression is Literal or foldable, a constant writable object inspector returns; Otherwise, we always get the object inspector according to its data type (in catalyst) |
scala.Function1<Object,Object> | unwrapperFor(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector objectInspector) - Builds unwrappers ahead of time according to object inspector types to avoid pattern matching and branching costs per row. |
scala.Function3<Object,org.apache.spark.sql.catalyst.InternalRow,Object,scala.runtime.BoxedUnit> | unwrapperFor(org.apache.hadoop.hive.serde2.objectinspector.StructField field) - Builds unwrappers ahead of time according to object inspector types to avoid pattern matching and branching costs per row. |
scala.Function1<Object,Object> | withNullSafe(scala.Function1<Object,Object> f) |
Object[] | wrap(org.apache.spark.sql.catalyst.InternalRow row, scala.Function1<Object,Object>[] wrappers, Object[] cache, DataType[] dataTypes) |
Object | wrap(Object a, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector oi, DataType dataType) |
Object[] | wrap(scala.collection.Seq<Object> row, scala.Function1<Object,Object>[] wrappers, Object[] cache, DataType[] dataTypes) |
scala.Function1<Object,Object> | wrapperFor(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector oi, DataType dataType) - Wraps with Hive types based on object inspector. |
DecimalType decimalTypeInfoToCatalyst(org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector inspector)
org.apache.hadoop.io.BytesWritable getBinaryWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getBinaryWritableConstantObjectInspector(Object value)
org.apache.hadoop.io.BooleanWritable getBooleanWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getBooleanWritableConstantObjectInspector(Object value)
org.apache.hadoop.hive.serde2.io.ByteWritable getByteWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getByteWritableConstantObjectInspector(Object value)
org.apache.hadoop.hive.serde2.io.DateWritable getDateWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getDateWritableConstantObjectInspector(Object value)
org.apache.hadoop.hive.serde2.io.HiveDecimalWritable getDecimalWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getDecimalWritableConstantObjectInspector(Object value)
org.apache.hadoop.hive.serde2.io.DoubleWritable getDoubleWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getDoubleWritableConstantObjectInspector(Object value)
org.apache.hadoop.io.FloatWritable getFloatWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getFloatWritableConstantObjectInspector(Object value)
org.apache.hadoop.io.IntWritable getIntWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getIntWritableConstantObjectInspector(Object value)
org.apache.hadoop.io.LongWritable getLongWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getLongWritableConstantObjectInspector(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getPrimitiveNullWritableConstantObjectInspector()
org.apache.hadoop.hive.serde2.io.ShortWritable getShortWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getShortWritableConstantObjectInspector(Object value)
org.apache.hadoop.io.Text getStringWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getStringWritableConstantObjectInspector(Object value)
org.apache.hadoop.hive.serde2.io.TimestampWritable getTimestampWritable(Object value)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector getTimestampWritableConstantObjectInspector(Object value)
DataType inspectorToDataType(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector inspector)
boolean isSubClassOf(java.lang.reflect.Type t, Class<?> parent)
DataType javaTypeToDataType(java.lang.reflect.Type clz)
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector toInspector(DataType dataType)
dataType - Catalyst data type
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector toInspector(org.apache.spark.sql.catalyst.expressions.Expression expr)
Maps the Catalyst expression to an ObjectInspector. If the expression is a Literal or foldable, a constant writable object inspector is returned; otherwise, the object inspector is always chosen according to the expression's Catalyst data type.
expr - Catalyst expression to be mapped
scala.Function1<Object,Object> unwrapperFor(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector objectInspector)
Builds unwrappers ahead of time according to object inspector types to avoid pattern matching and branching costs per row.
Unwrapping strictly follows this order (a constant OI has the higher priority):
1) Constant Null object inspector => return null
2) Constant object inspector => extract the value from the constant object inspector
3) If the object inspector prefers writable => extract the writable from the data, then get the Catalyst type from the writable
4) Otherwise => extract the Java object directly from the object inspector
NOTICE: complex data types require recursive unwrapping.
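The priority order above can be sketched as a function that is chosen once per inspector and then applied per row. This is a simplified model under stated assumptions, not the Spark or Hive implementation: `InspectorSketch`, `SimpleInspector`, and `WritableSketch` are hypothetical stand-ins, and recursive unwrapping of complex types is omitted.

```java
import java.util.function.Function;

// Hypothetical stand-in for a writable wrapper around a value.
final class WritableSketch {
    final Object value;
    WritableSketch(Object value) { this.value = value; }
}

// Hypothetical, simplified inspector model; NOT the Hive ObjectInspector API.
interface InspectorSketch {
    boolean isConstant();
    Object constantValue();            // meaningful only when isConstant() is true
    boolean preferWritable();
    WritableSketch writableOf(Object data);
    Object javaObjectOf(Object data);
}

// Invented configurable implementation used to exercise each branch.
final class SimpleInspector implements InspectorSketch {
    private final boolean constant;
    private final Object constantValue;
    private final boolean preferWritable;

    SimpleInspector(boolean constant, Object constantValue, boolean preferWritable) {
        this.constant = constant;
        this.constantValue = constantValue;
        this.preferWritable = preferWritable;
    }

    @Override public boolean isConstant() { return constant; }
    @Override public Object constantValue() { return constantValue; }
    @Override public boolean preferWritable() { return preferWritable; }
    @Override public WritableSketch writableOf(Object data) { return (WritableSketch) data; }
    @Override public Object javaObjectOf(Object data) { return data; }
}

final class UnwrapOrderSketch {
    // Branching on the inspector happens here, once; the returned function
    // does no further inspection of the inspector's kind per row.
    static Function<Object, Object> unwrapperFor(InspectorSketch oi) {
        if (oi.isConstant()) {
            final Object c = oi.constantValue();
            if (c == null) {
                return data -> null;               // 1) constant null
            }
            return data -> c;                      // 2) constant value
        }
        if (oi.preferWritable()) {                 // 3) go through the writable
            return data -> data == null ? null : oi.writableOf(data).value;
        }
        return data -> data == null ? null : oi.javaObjectOf(data); // 4) plain Java object
    }
}
```

Note that the constant checks come before `preferWritable()`, so a constant inspector that also prefers writables still resolves to its bundled value.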
objectInspector - the ObjectInspector used to create an unwrapper.
scala.Function3<Object,org.apache.spark.sql.catalyst.InternalRow,Object,scala.runtime.BoxedUnit> unwrapperFor(org.apache.hadoop.hive.serde2.objectinspector.StructField field)
Builds unwrappers ahead of time according to object inspector types to avoid pattern matching and branching costs per row.
field - The HiveStructField to create an unwrapper for.
scala.Function1<Object,Object> withNullSafe(scala.Function1<Object,Object> f)
Object wrap(Object a, org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector oi, DataType dataType)
Object[] wrap(org.apache.spark.sql.catalyst.InternalRow row, scala.Function1<Object,Object>[] wrappers, Object[] cache, DataType[] dataTypes)
Object[] wrap(scala.collection.Seq<Object> row, scala.Function1<Object,Object>[] wrappers, Object[] cache, DataType[] dataTypes)
scala.Function1<Object,Object> wrapperFor(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector oi, DataType dataType)
Wraps with Hive types based on object inspector.
oi - (undocumented)
dataType - (undocumented)