pyspark.RDD.saveAsHadoopFile#

RDD.saveAsHadoopFile(path, outputFormatClass, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, conf=None, compressionCodecClass=None)[source]#

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). Key and value types will be inferred if not specified. Keys and values are converted for output using either user specified converters or “org.apache.spark.api.python.JavaToWritableConverter”. The conf is applied on top of the base Hadoop conf associated with the SparkContext of this RDD to create a merged Hadoop MapReduce job configuration for saving the data.

New in version 1.1.0.

Parameters

pathstr: path to Hadoop file
outputFormatClassstr: fully qualified classname of Hadoop OutputFormat (e.g. “org.apache.hadoop.mapred.SequenceFileOutputFormat”)
keyClassstr, optional: fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.IntWritable”, None by default)
valueClassstr, optional: fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.Text”, None by default)
keyConverterstr, optional: fully qualified classname of key converter (None by default)
valueConverterstr, optional: fully qualified classname of value converter (None by default)
confdict, optional: (None by default)
compressionCodecClassstr: fully qualified classname of the compression codec class i.e. “org.apache.hadoop.io.compress.GzipCodec” (None by default)

See also

SparkContext.hadoopFile()
RDD.saveAsNewAPIHadoopFile()
RDD.saveAsHadoopDataset()
RDD.saveAsNewAPIHadoopDataset()
RDD.saveAsSequenceFile()

Examples

>>> import os
>>> import tempfile

Set the related classes

>>> output_format_class = "org.apache.hadoop.mapred.TextOutputFormat"
>>> input_format_class = "org.apache.hadoop.mapred.TextInputFormat"
>>> key_class = "org.apache.hadoop.io.IntWritable"
>>> value_class = "org.apache.hadoop.io.Text"

>>> with tempfile.TemporaryDirectory(prefix="saveAsHadoopFile") as d:
...     path = os.path.join(d, "old_hadoop_file")
...
...     # Write a temporary Hadoop file
...     rdd = sc.parallelize([(1, ""), (1, "a"), (3, "x")])
...     rdd.saveAsHadoopFile(path, output_format_class, key_class, value_class)
...
...     # Load this Hadoop file as an RDD
...     loaded = sc.hadoopFile(path, input_format_class, key_class, value_class)
...     sorted(loaded.collect())
[(0, '1\t'), (0, '1\ta'), (0, '3\tx')]