pyspark.SparkContext.newAPIHadoopFile

SparkContext.newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)

Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for SparkContext.sequenceFile().

A Hadoop configuration can be passed in as a Python dict. It is converted into a Hadoop Configuration object in Java.
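As an illustration of the conf dict, the sketch below sets a custom record delimiter for TextInputFormat. The helper names `delimiter_conf` and `read_records` are hypothetical, and the example assumes that the Hadoop build in use honors the `textinputformat.record.delimiter` property and that conf keys and values are plain strings.

```python
def delimiter_conf(delimiter):
    """Build a Hadoop conf dict (assumed: string keys and string values)."""
    return {"textinputformat.record.delimiter": delimiter}


def read_records(sc, path, delimiter="\n\n"):
    """Hypothetical helper: read `path` as text records split on `delimiter`."""
    rdd = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",  # key: byte offset of the record
        "org.apache.hadoop.io.Text",          # value: record contents
        conf=delimiter_conf(delimiter),
    )
    # Drop the offsets and keep only the record text.
    return rdd.values()
```

Given a live SparkContext `sc`, `read_records(sc, path).collect()` would return one string per blank-line-separated block under these assumptions.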

Parameters:
path : str

path to Hadoop file

inputFormatClass : str

fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)

keyClass : str

fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)

valueClass : str

fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)

keyConverter : str, optional

fully qualified name of a function returning key WritableConverter (None by default)

valueConverter : str, optional

fully qualified name of a function returning value WritableConverter (None by default)

conf : dict, optional

Hadoop configuration, passed in as a dict (None by default)

batchSize : int, optional

The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
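The parameters above might be combined as in the following minimal sketch, which reads a plain text file through the new Hadoop API. The helper `read_lines` and the module-level constants are hypothetical names introduced for illustration. With TextInputFormat, each key is expected to be the byte offset of a line and each value the line itself.

```python
# Class names for reading plain text through the new Hadoop API.
INPUT_FORMAT = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
KEY_CLASS = "org.apache.hadoop.io.LongWritable"  # byte offset of each line
VALUE_CLASS = "org.apache.hadoop.io.Text"        # line contents


def read_lines(sc, path, batch_size=0):
    """Hypothetical helper: return an RDD of (offset, line) pairs."""
    return sc.newAPIHadoopFile(
        path,
        INPUT_FORMAT,
        KEY_CLASS,
        VALUE_CLASS,
        keyConverter=None,     # no custom key conversion
        valueConverter=None,   # no custom value conversion
        conf=None,             # no extra Hadoop settings
        batchSize=batch_size,  # 0 lets PySpark choose automatically
    )
```

Given a live SparkContext `sc`, `read_lines(sc, path).collect()` would yield the pairs as Python tuples.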