
Commit 78978d9

Add doc for SequenceFile and InputFormat support to Python programming guide
1 parent 64eb051 commit 78978d9

1 file changed: +73 -0 lines changed

docs/python-programming-guide.md

@@ -145,6 +145,79 @@ conf = (SparkConf()
sc = SparkContext(conf = conf)
{% endhighlight %}

# SequenceFile and Hadoop InputFormats

In addition to reading text files, PySpark supports reading Hadoop SequenceFiles and arbitrary InputFormats.

## Writable Support

PySpark SequenceFile support loads an RDD within Java, and pickles the resulting Java objects using
[Pyrolite](https://github.com/irmen/Pyrolite/). The following Writables are automatically converted:

<table class="table">
<tr><th>Writable Type</th><th>Scala Type</th><th>Python Type</th></tr>
<tr><td>Text</td><td>String</td><td>unicode str</td></tr>
<tr><td>IntWritable</td><td>Int</td><td>int</td></tr>
<tr><td>FloatWritable</td><td>Float</td><td>float</td></tr>
<tr><td>DoubleWritable</td><td>Double</td><td>float</td></tr>
<tr><td>BooleanWritable</td><td>Boolean</td><td>bool</td></tr>
<tr><td>BytesWritable</td><td>Array[Byte]</td><td>bytearray</td></tr>
<tr><td>NullWritable</td><td>null</td><td>None</td></tr>
<tr><td>ArrayWritable</td><td>Array[T]</td><td>list of primitives, or tuple of objects</td></tr>
<tr><td>MapWritable</td><td>java.util.Map[K, V]</td><td>dict</td></tr>
<tr><td>Custom Class</td><td>Custom Class conforming to Java Bean conventions</td>
<td>dict of public properties (via JavaBean getters and setters) + __class__ for the class type</td></tr>
</table>
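
To make the last row concrete, the following is a hedged sketch of what a record whose value is a custom JavaBean class might look like after conversion. The path, the class name `com.example.Person`, and the field values are hypothetical, and the output shape is illustrative only:

{% highlight python %}
>>> # Hypothetical: a SequenceFile with Text keys and values that are instances
>>> # of a JavaBean class com.example.Person exposing getName() and getAge()
>>> rdd = sc.sequenceFile("path/to/sequencefile/of/beans")
>>> rdd.first()  # illustrative output: a dict of bean properties plus __class__
(u'id-1',
 {u'__class__': u'com.example.Person',
  u'name': u'Alice',
  u'age': 29})
{% endhighlight %}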

## Loading SequenceFiles

As with text files, SequenceFiles can be loaded by specifying the path. The key and value
classes can be specified explicitly, but for standard Writables this is not required.

{% highlight python %}
>>> rdd = sc.sequenceFile("path/to/sequencefile/of/doubles")
>>> rdd.collect() # this example has DoubleWritable keys and Text values
[(1.0, u'aa'),
 (2.0, u'bb'),
 (2.0, u'aa'),
 (3.0, u'cc'),
 (2.0, u'bb'),
 (1.0, u'aa')]
>>> help(sc.sequenceFile) # Show sequencefile documentation
{% endhighlight %}
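
If needed, the key and value classes can be passed explicitly. The following is a sketch only: the keyword parameter names `keyClass` and `valueClass` are an assumption about the API, and the path is the same illustrative one as above.

{% highlight python %}
>>> # Sketch: explicitly naming the Writable classes (keyword names are assumed);
>>> # for standard Writables this is equivalent to the call above
>>> rdd = sc.sequenceFile("path/to/sequencefile/of/doubles",
...                       keyClass="org.apache.hadoop.io.DoubleWritable",
...                       valueClass="org.apache.hadoop.io.Text")
{% endhighlight %}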

## Loading Arbitrary Hadoop InputFormats

PySpark can also read any Hadoop InputFormat, for both 'new' and 'old' Hadoop APIs. If required,
a Hadoop configuration can be passed in as a Python dict. Here is an example using the
Elasticsearch ESInputFormat:

{% highlight python %}
$ SPARK_CLASSPATH=/path/to/elasticsearch-hadoop.jar ./bin/pyspark
>>> conf = {"es.resource" : "index/type"} # assume Elasticsearch is running on localhost defaults
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",\
    "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
>>> rdd.first() # the result is a MapWritable that is converted to a Python dict
(u'Elasticsearch ID',
 {u'field1': True,
  u'field2': u'Some Text',
  u'field3': 12345})
>>> help(sc.newAPIHadoopRDD) # Show help for new API Hadoop RDD
{% endhighlight %}
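
The 'old' (org.apache.hadoop.mapred) API can be used in the same way. The following is a hedged sketch: it assumes a method along the lines of `sc.hadoopFile` taking the path, InputFormat class, key class, and value class, and the path is illustrative.

{% highlight python %}
>>> # Sketch only: the method name and argument order are assumptions
>>> rdd = sc.hadoopFile("path/to/textfile",
...                     "org.apache.hadoop.mapred.TextInputFormat",
...                     "org.apache.hadoop.io.LongWritable",
...                     "org.apache.hadoop.io.Text")
>>> rdd.first()  # (byte offset, line of text) pairs
{% endhighlight %}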

Note that this approach should work well for InputFormats that simply depend on a Hadoop configuration
and/or input path, and whose key and value classes can easily be converted according to the above table.

If you have custom serialized binary data (such as data loaded from Cassandra / HBase) or custom
classes that don't conform to the JavaBean requirements, then you will probably have to first
transform that data on the Scala/Java side to something which can be handled by Pyrolite's pickler.

Future support for 'wrapper' functions for keys/values that allow this transformation to be written in
Java/Scala and called from Python, as well as support for writing data out in SequenceFile format
and other OutputFormats, is forthcoming.

# API Docs

[API documentation](api/pyspark/index.html) for PySpark is available as Epydoc.
