@@ -145,6 +145,79 @@ conf = (SparkConf()
sc = SparkContext(conf = conf)
{% endhighlight %}

# SequenceFile and Hadoop InputFormats

In addition to reading text files, PySpark supports reading Hadoop SequenceFiles and arbitrary Hadoop InputFormats.

## Writable Support

PySpark SequenceFile support loads an RDD within Java, and pickles the resulting Java objects using
[Pyrolite](https://github.com/irmen/Pyrolite/). The following Writables are automatically converted
(an illustrative example follows the table):

<table class="table">
<tr><th>Writable Type</th><th>Scala Type</th><th>Python Type</th></tr>
<tr><td>Text</td><td>String</td><td>unicode str</td></tr>
<tr><td>IntWritable</td><td>Int</td><td>int</td></tr>
<tr><td>FloatWritable</td><td>Float</td><td>float</td></tr>
<tr><td>DoubleWritable</td><td>Double</td><td>float</td></tr>
<tr><td>BooleanWritable</td><td>Boolean</td><td>bool</td></tr>
<tr><td>BytesWritable</td><td>Array[Byte]</td><td>bytearray</td></tr>
<tr><td>NullWritable</td><td>null</td><td>None</td></tr>
<tr><td>ArrayWritable</td><td>Array[T]</td><td>list of primitives, or tuple of objects</td></tr>
<tr><td>MapWritable</td><td>java.util.Map[K, V]</td><td>dict</td></tr>
<tr><td>Custom Class</td><td>Custom Class conforming to Java Bean conventions</td>
    <td>dict of public properties (via JavaBean getters and setters) + __class__ for the class type</td></tr>
</table>
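
For illustration, here is a minimal sketch of loading a hypothetical SequenceFile with Text keys and
MapWritable values (the path and field names are made up, not part of any real dataset):

{% highlight python %}
>>> rdd = sc.sequenceFile("path/to/sequencefile/of/maps")
>>> rdd.first()   # Text becomes a unicode str, MapWritable becomes a dict
(u'user-1', {u'visits': 3, u'active': True})
{% endhighlight %}
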
## Loading SequenceFiles

Similarly to text files, SequenceFiles can be loaded by specifying the path. The key and value
classes can be specified explicitly (an example follows the snippet below), but for standard Writables this is not required.

{% highlight python %}
>>> rdd = sc.sequenceFile("path/to/sequencefile/of/doubles")
>>> rdd.collect()         # this example has DoubleWritable keys and Text values
[(1.0, u'aa'),
 (2.0, u'bb'),
 (2.0, u'aa'),
 (3.0, u'cc'),
 (2.0, u'bb'),
 (1.0, u'aa')]
>>> help(sc.sequenceFile) # Show sequencefile documentation
{% endhighlight %}
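
If needed, the key and value classes can also be passed explicitly. This is a minimal sketch:
the path is hypothetical, while the class names are standard Hadoop Writables:

{% highlight python %}
>>> rdd = sc.sequenceFile("path/to/sequencefile",
...                       keyClass="org.apache.hadoop.io.IntWritable",
...                       valueClass="org.apache.hadoop.io.Text")
{% endhighlight %}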

## Loading Arbitrary Hadoop InputFormats

PySpark can also read any Hadoop InputFormat, for both the 'new' and 'old' Hadoop APIs. If required,
a Hadoop configuration can be passed in as a Python dict. Here is an example using the
Elasticsearch EsInputFormat:

{% highlight python %}
$ SPARK_CLASSPATH=/path/to/elasticsearch-hadoop.jar ./bin/pyspark
>>> conf = {"es.resource" : "index/type"}   # assume Elasticsearch is running on localhost defaults
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",
...     "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
>>> rdd.first()           # the result is a MapWritable that is converted to a Python dict
(u'Elasticsearch ID',
 {u'field1': True,
  u'field2': u'Some Text',
  u'field3': 12345})
>>> help(sc.newAPIHadoopRDD) # Show help for new API Hadoop RDD
{% endhighlight %}

Note that if the InputFormat simply depends on a Hadoop configuration and/or input path, and
the key and value classes can easily be converted according to the above table,
then this approach should work well.
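
For example, a path-based InputFormat can be read with newAPIHadoopFile. The sketch below is
illustrative only: the path is hypothetical, and it assumes a SequenceFile of IntWritable keys and
Text values read through the standard Hadoop SequenceFileInputFormat:

{% highlight python %}
>>> rdd = sc.newAPIHadoopFile("path/to/sequencefile",
...     "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
...     keyClass="org.apache.hadoop.io.IntWritable",
...     valueClass="org.apache.hadoop.io.Text")
{% endhighlight %}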

If you have custom serialized binary data (such as data loaded from Cassandra or HBase) or custom
classes that don't conform to the JavaBean requirements, you will probably have to first
transform that data on the Scala/Java side into something which can be handled by Pyrolite's pickler.

Support for 'wrapper' functions for keys/values, written in Java/Scala and called from Python, as well
as support for writing data out as SequenceFiles and other OutputFormats, is forthcoming.

# API Docs

[API documentation](api/pyspark/index.html) for PySpark is available as Epydoc.