Commit 04b6688

mateiz authored and pdeyhim committed
SPARK-1421. Make MLlib work on Python 2.6
The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well.

Author: Matei Zaharia <[email protected]>

Closes apache#335 from mateiz/mllib-python-2.6 and squashes the following commits:

f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
1 parent 1de4cd6 commit 04b6688
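
The conversion described in the commit message can be sketched as follows. This is a minimal illustration, not the PySpark code: it assumes `bytes(...)` as the conversion (the same type as `str` on Python 2.6 and 2.7, and safe on Python 3, where `str()` on a bytearray would return its repr). The helper name `to_byte_string` and the sample payload are hypothetical.

```python
import io


def to_byte_string(buf):
    """Hypothetical helper: copy a bytearray into an immutable byte string."""
    # On Python 2.6/2.7, bytes is an alias of str, so this matches str(buf) there.
    return bytes(buf)


# Sample payload with nonprintable bytes (0x00, 0x01, 0xff) and a newline.
payload = bytearray([0, 1, 255, 10, 65])
converted = to_byte_string(payload)

# Write the converted byte string to an in-memory binary stream.
stream = io.BytesIO()
stream.write(converted)
```

Every byte of the payload, including the nonprintable ones, reaches the stream unchanged, which is the property the commit message relies on.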

4 files changed, +13 -9 lines changed

docs/mllib-guide.md

Lines changed: 1 addition & 2 deletions
@@ -38,6 +38,5 @@ depends on native Fortran routines. You may need to install the
 if it is not already present on your nodes. MLlib will throw a linking error if it cannot
 detect these libraries automatically.
 
-To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.7 or newer
-and Python 2.7.
+To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.7 or newer.
 

docs/python-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -152,7 +152,7 @@ Many of the methods also contain [doctests](http://docs.python.org/2/library/doc
 # Libraries
 
 [MLlib](mllib-guide.html) is also available in PySpark. To use it, you'll need
-[NumPy](http://www.numpy.org) version 1.7 or newer, and Python 2.7. The [MLlib guide](mllib-guide.html) contains
+[NumPy](http://www.numpy.org) version 1.7 or newer. The [MLlib guide](mllib-guide.html) contains
 some example applications.
 
 # Where to Go from Here

python/pyspark/mllib/__init__.py

Lines changed: 1 addition & 5 deletions
@@ -19,11 +19,7 @@
 Python bindings for MLlib.
 """
 
-# MLlib currently needs Python 2.7+ and NumPy 1.7+, so complain if lower
-
-import sys
-if sys.version_info[0:2] < (2, 7):
-    raise Exception("MLlib requires Python 2.7+")
+# MLlib currently needs and NumPy 1.7+, so complain if lower
 
 import numpy
 if numpy.version.version < '1.7':
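
The two version checks in this hunk compare differently: `sys.version_info[0:2] < (2, 7)` compares integers element by element, while `numpy.version.version < '1.7'` compares strings character by character. A minimal sketch of the difference, assuming a hypothetical helper `parse_version` that is not part of PySpark or NumPy:

```python
import sys


def parse_version(text):
    """Hypothetical helper: turn '1.10.2' into (1, 10, 2).

    Parsing stops at the first component with no leading digits;
    trailing non-digit suffixes such as 'rc1' are dropped.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


# Tuple comparison on the interpreter version, as in the removed guard.
python_ok = sys.version_info[0:2] >= (2, 6)

# String comparison is lexicographic: '1' sorts before '7'.
lexicographic = "1.10" < "1.7"

# Integer tuples compare numerically: 10 is not less than 7.
numeric = parse_version("1.10") < parse_version("1.7")
```

Here `lexicographic` is `True` while `numeric` is `False`, so a string-based check would treat a two-digit minor version as older than 1.7.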

python/pyspark/serializers.py

Lines changed: 10 additions & 1 deletion
@@ -64,6 +64,7 @@
 from itertools import chain, izip, product
 import marshal
 import struct
+import sys
 from pyspark import cloudpickle
 
 
@@ -113,6 +114,11 @@ class FramedSerializer(Serializer):
     where C{length} is a 32-bit integer and data is C{length} bytes.
     """
 
+    def __init__(self):
+        # On Python 2.6, we can't write bytearrays to streams, so we need to convert them
+        # to strings first. Check if the version number is that old.
+        self._only_write_strings = sys.version_info[0:2] <= (2, 6)
+
     def dump_stream(self, iterator, stream):
         for obj in iterator:
             self._write_with_length(obj, stream)
@@ -127,7 +133,10 @@ def load_stream(self, stream):
     def _write_with_length(self, obj, stream):
         serialized = self.dumps(obj)
         write_int(len(serialized), stream)
-        stream.write(serialized)
+        if self._only_write_strings:
+            stream.write(str(serialized))
+        else:
+            stream.write(serialized)
 
     def _read_with_length(self, stream):
         length = read_int(stream)
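
The framing that `_write_with_length` and `_read_with_length` implement can be sketched in standalone form. This is an illustration under assumptions, not the PySpark implementation: the length prefix is assumed to be a big-endian signed 32-bit integer (`"!i"`), the function names are hypothetical, and `bytes(...)` stands in for the commit's `str(...)` so the sketch also runs on Python 3.

```python
import io
import struct
import sys

# Same version test as in the diff: true only on Python 2.6 and older.
ONLY_WRITE_STRINGS = sys.version_info[0:2] <= (2, 6)


def write_with_length(data, stream):
    """Write a 4-byte length prefix followed by the payload bytes."""
    stream.write(struct.pack("!i", len(data)))  # assumed big-endian int32
    # Old interpreters need a byte string rather than a bytearray.
    stream.write(bytes(data) if ONLY_WRITE_STRINGS else data)


def read_with_length(stream):
    """Read one length-prefixed frame and return its payload."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream ended before a full length prefix")
    (length,) = struct.unpack("!i", header)
    return stream.read(length)


# Round-trip two frames through an in-memory stream.
buf = io.BytesIO()
write_with_length(bytearray(b"\x00abc"), buf)
write_with_length(b"xyz", buf)
buf.seek(0)
first = read_with_length(buf)
second = read_with_length(buf)
```

Because each frame carries its own length, the reader needs no delimiter and payloads may contain arbitrary bytes, including zeros.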
