
Commit fdff7fc

Merge remote-tracking branch 'apache/master' into config-cleanup
Conflicts: docs/configuration.md
2 parents: 3289ea4 + b6d22af

20 files changed, +228 -183 lines changed


core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala

Lines changed: 6 additions & 0 deletions
@@ -41,6 +41,12 @@ import org.apache.spark.rdd.RDD
  * [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
  */
 class JavaSparkContext(val sc: SparkContext) extends JavaSparkContextVarargsWorkaround {
+  /**
+   * Create a JavaSparkContext that loads settings from system properties (for instance, when
+   * launching with ./bin/spark-submit).
+   */
+  def this() = this(new SparkContext())
+
   /**
    * @param conf a [[org.apache.spark.SparkConf]] object specifying Spark parameters
    */

docs/README.md

Lines changed: 30 additions & 13 deletions
@@ -1,23 +1,31 @@
 Welcome to the Spark documentation!
 
-This readme will walk you through navigating and building the Spark documentation, which is included here with the Spark source code. You can also find documentation specific to release versions of Spark at http://spark.apache.org/documentation.html.
+This readme will walk you through navigating and building the Spark documentation, which is included
+here with the Spark source code. You can also find documentation specific to release versions of
+Spark at http://spark.apache.org/documentation.html.
 
-Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the documentation yourself. Why build it yourself? So that you have the docs that corresponds to whichever version of Spark you currently have checked out of revision control.
+Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the
+documentation yourself. Why build it yourself? So that you have the docs that corresponds to
+whichever version of Spark you currently have checked out of revision control.
 
 ## Generating the Documentation HTML
 
-We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as the github wiki, as the definitive documentation) to enable the documentation to evolve along with the source code and be captured by revision control (currently git). This way the code automatically includes the version of the documentation that is relevant regardless of which version or release you have checked out or downloaded.
+We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as
+the github wiki, as the definitive documentation) to enable the documentation to evolve along with
+the source code and be captured by revision control (currently git). This way the code automatically
+includes the version of the documentation that is relevant regardless of which version or release
+you have checked out or downloaded.
 
-In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can read those text files directly if you want. Start with index.md.
+In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can
+read those text files directly if you want. Start with index.md.
 
-The markdown code can be compiled to HTML using the
-[Jekyll tool](http://jekyllrb.com).
+The markdown code can be compiled to HTML using the [Jekyll tool](http://jekyllrb.com).
 To use the `jekyll` command, you will need to have Jekyll installed.
 The easiest way to do this is via a Ruby Gem, see the
 [jekyll installation instructions](http://jekyllrb.com/docs/installation).
 If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
-Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
-`_site` containing index.html as well as the rest of the compiled files.
+Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory
+called `_site` containing index.html as well as the rest of the compiled files.
 
 You can modify the default Jekyll build as follows:
 
@@ -30,9 +38,11 @@ You can modify the default Jekyll build as follows:
 
 ## Pygments
 
-We also use pygments (http://pygments.org) for syntax highlighting in documentation markdown pages, so you will also need to install that (it requires Python) by running `sudo easy_install Pygments`.
+We also use pygments (http://pygments.org) for syntax highlighting in documentation markdown pages,
+so you will also need to install that (it requires Python) by running `sudo easy_install Pygments`.
 
-To mark a block of code in your markdown to be syntax highlighted by jekyll during the compile phase, use the following sytax:
+To mark a block of code in your markdown to be syntax highlighted by jekyll during the compile
+phase, use the following sytax:
 
     {% highlight scala %}
     // Your scala code goes here, you can replace scala with many other
@@ -43,8 +53,15 @@ To mark a block of code in your markdown to be syntax highlighted by jekyll duri
 
 You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PROJECT_ROOT directory.
 
-Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
+Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the
+SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as
+public in `__init__.py`.
 
-When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various
+Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a
+jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it
+may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
+PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
 
-NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.
+NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1
+jekyll`.

docs/configuration.md

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ title: Spark Configuration
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
+Spark provides several locations to configure the system:
+
 # Spark Properties
 
 Spark properties control most application settings and are configured separately for each
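
Note: one of the locations this page goes on to describe is per-application properties set programmatically through SparkConf. A minimal PySpark sketch, illustrative only (the app name, master, and property value are arbitrary, not taken from this commit):

    from pyspark import SparkConf, SparkContext

    # Build a per-application configuration, then hand it to the context.
    conf = (SparkConf()
            .setAppName("ConfigDemo")
            .setMaster("local[2]")
            .set("spark.executor.memory", "1g"))
    print(conf.get("spark.executor.memory"))  # "1g"

    sc = SparkContext(conf=conf)
    sc.stop()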

examples/src/main/python/als.py

Lines changed: 12 additions & 8 deletions
@@ -29,22 +29,25 @@
 LAMBDA = 0.01   # regularization
 np.random.seed(42)
 
+
 def rmse(R, ms, us):
     diff = R - ms * us.T
     return np.sqrt(np.sum(np.power(diff, 2)) / M * U)
 
+
 def update(i, vec, mat, ratings):
     uu = mat.shape[0]
     ff = mat.shape[1]
-
+
     XtX = mat.T * mat
     Xty = mat.T * ratings[i, :].T
-
+
     for j in range(ff):
-        XtX[j,j] += LAMBDA * uu
-
+        XtX[j, j] += LAMBDA * uu
+
     return np.linalg.solve(XtX, Xty)
 
+
 if __name__ == "__main__":
     """
     Usage: als [M] [U] [F] [iterations] [slices]"
@@ -57,10 +60,10 @@ def update(i, vec, mat, ratings):
     slices = int(sys.argv[5]) if len(sys.argv) > 5 else 2
 
     print "Running ALS with M=%d, U=%d, F=%d, iters=%d, slices=%d\n" % \
-            (M, U, F, ITERATIONS, slices)
+        (M, U, F, ITERATIONS, slices)
 
     R = matrix(rand(M, F)) * matrix(rand(U, F).T)
-    ms = matrix(rand(M ,F))
+    ms = matrix(rand(M, F))
     us = matrix(rand(U, F))
 
     Rb = sc.broadcast(R)
@@ -71,8 +74,9 @@ def update(i, vec, mat, ratings):
         ms = sc.parallelize(range(M), slices) \
                .map(lambda x: update(x, msb.value[x, :], usb.value, Rb.value)) \
                .collect()
-        ms = matrix(np.array(ms)[:, :, 0])      # collect() returns a list, so array ends up being
-                                                # a 3-d array, we take the first 2 dims for the matrix
+        # collect() returns a list, so array ends up being
+        # a 3-d array, we take the first 2 dims for the matrix
+        ms = matrix(np.array(ms)[:, :, 0])
        msb = sc.broadcast(ms)
 
         us = sc.parallelize(range(U), slices) \
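
Note: the update() function reformatted above solves a small ridge-regularized least-squares problem for one row of the ratings matrix. A standalone NumPy sketch of that same solve on a tiny made-up problem (sizes and variable names here are illustrative, not from the example file):

    import numpy as np

    LAMBDA = 0.01                              # same regularization constant as in the example
    np.random.seed(0)
    mat = np.matrix(np.random.rand(4, 2))      # 4 users x 2 latent features
    ratings = np.matrix(np.random.rand(1, 4))  # one item's ratings from those 4 users
    uu, ff = mat.shape

    XtX = mat.T * mat                          # ff x ff normal matrix
    Xty = mat.T * ratings.T                    # ff x 1 right-hand side
    for j in range(ff):
        XtX[j, j] += LAMBDA * uu               # ridge term on the diagonal, as in update()
    print(np.linalg.solve(XtX, Xty))           # the item's new ff-dimensional factor vector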

examples/src/main/python/kmeans.py

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ def closestPoint(p, centers):
 
     while tempDist > convergeDist:
         closest = data.map(
-            lambda p : (closestPoint(p, kPoints), (p, 1)))
+            lambda p: (closestPoint(p, kPoints), (p, 1)))
         pointStats = closest.reduceByKey(
             lambda (x1, y1), (x2, y2): (x1 + x2, y1 + y2))
         newPoints = pointStats.map(

examples/src/main/python/logistic_regression.py

Lines changed: 2 additions & 2 deletions
@@ -60,8 +60,8 @@ def readPointBatch(iterator):
 
 # Compute logistic regression gradient for a matrix of data points
 def gradient(matrix, w):
-    Y = matrix[:,0]    # point labels (first column of input file)
-    X = matrix[:,1:]   # point coordinates
+    Y = matrix[:, 0]    # point labels (first column of input file)
+    X = matrix[:, 1:]   # point coordinates
     # For each point (x, y), compute gradient function, then sum these up
     return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)

examples/src/main/python/pagerank.py

Lines changed: 6 additions & 6 deletions
@@ -15,9 +15,8 @@
 # limitations under the License.
 #
 
-#!/usr/bin/env python
-
-import re, sys
+import re
+import sys
 from operator import add
 
 from pyspark import SparkContext
@@ -26,7 +25,8 @@
 def computeContribs(urls, rank):
     """Calculates URL contributions to the rank of other URLs."""
     num_urls = len(urls)
-    for url in urls: yield (url, rank / num_urls)
+    for url in urls:
+        yield (url, rank / num_urls)
 
 
 def parseNeighbors(urls):
@@ -59,8 +59,8 @@ def parseNeighbors(urls):
     # Calculates and updates URL ranks continuously using PageRank algorithm.
     for iteration in xrange(int(sys.argv[2])):
         # Calculates URL contributions to the rank of other URLs.
-        contribs = links.join(ranks).flatMap(lambda (url, (urls, rank)):
-            computeContribs(urls, rank))
+        contribs = links.join(ranks).flatMap(
+            lambda (url, (urls, rank)): computeContribs(urls, rank))
 
         # Re-calculates URL ranks based on neighbor contributions.
         ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)
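
Note: the computeContribs() generator above simply splits a page's current rank evenly across its outgoing links, and the final line applies the usual 0.85/0.15 damping. A tiny local illustration of the contribution step (no Spark, toy URLs):

    def computeContribs(urls, rank):
        """Calculates URL contributions to the rank of other URLs."""
        num_urls = len(urls)
        for url in urls:
            yield (url, rank / num_urls)

    # A page with rank 1.0 and two outgoing links contributes 0.5 to each neighbor.
    print(list(computeContribs(["b", "c"], 1.0)))  # [('b', 0.5), ('c', 0.5)]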

examples/src/main/python/pi.py

Lines changed: 2 additions & 0 deletions
@@ -29,9 +29,11 @@
     sc = SparkContext(appName="PythonPi")
     slices = int(sys.argv[1]) if len(sys.argv) > 1 else 2
     n = 100000 * slices
+
     def f(_):
         x = random() * 2 - 1
         y = random() * 2 - 1
         return 1 if x ** 2 + y ** 2 < 1 else 0
+
     count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
     print "Pi is roughly %f" % (4.0 * count / n)

examples/src/main/python/sort.py

Lines changed: 2 additions & 2 deletions
@@ -27,8 +27,8 @@
     sc = SparkContext(appName="PythonSort")
     lines = sc.textFile(sys.argv[1], 1)
     sortedCount = lines.flatMap(lambda x: x.split(' ')) \
-                  .map(lambda x: (int(x), 1)) \
-                  .sortByKey(lambda x: x)
+        .map(lambda x: (int(x), 1)) \
+        .sortByKey(lambda x: x)
     # This is just a demo on how to bring all the sorted data back to a single node.
     # In reality, we wouldn't want to collect all the data to the driver node.
     output = sortedCount.collect()

python/pyspark/mllib/_common.py

Lines changed: 21 additions & 21 deletions
@@ -56,7 +56,8 @@
 #
 # Sparse double vector format:
 #
-# [1-byte 2] [4-byte length] [4-byte nonzeros] [nonzeros*4 bytes of indices] [nonzeros*8 bytes of values]
+# [1-byte 2] [4-byte length] [4-byte nonzeros] [nonzeros*4 bytes of indices] \
+# [nonzeros*8 bytes of values]
 #
 # Double matrix format:
 #
@@ -110,18 +111,18 @@ def _serialize_double_vector(v):
         return _serialize_sparse_vector(v)
     else:
         raise TypeError("_serialize_double_vector called on a %s; "
-                "wanted ndarray or SparseVector" % type(v))
+                        "wanted ndarray or SparseVector" % type(v))
 
 
 def _serialize_dense_vector(v):
     """Serialize a dense vector given as a NumPy array."""
     if v.ndim != 1:
         raise TypeError("_serialize_double_vector called on a %ddarray; "
-                "wanted a 1darray" % v.ndim)
+                        "wanted a 1darray" % v.ndim)
     if v.dtype != float64:
         if numpy.issubdtype(v.dtype, numpy.complex):
             raise TypeError("_serialize_double_vector called on an ndarray of %s; "
-                    "wanted ndarray of float64" % v.dtype)
+                            "wanted ndarray of float64" % v.dtype)
         v = v.astype(float64)
     length = v.shape[0]
     ba = bytearray(5 + 8 * length)
@@ -158,10 +159,10 @@ def _deserialize_double_vector(ba):
     """
     if type(ba) != bytearray:
         raise TypeError("_deserialize_double_vector called on a %s; "
-                "wanted bytearray" % type(ba))
+                        "wanted bytearray" % type(ba))
     if len(ba) < 5:
         raise TypeError("_deserialize_double_vector called on a %d-byte array, "
-                "which is too short" % len(ba))
+                        "which is too short" % len(ba))
     if ba[0] == DENSE_VECTOR_MAGIC:
         return _deserialize_dense_vector(ba)
     elif ba[0] == SPARSE_VECTOR_MAGIC:
@@ -175,7 +176,7 @@ def _deserialize_dense_vector(ba):
     """Deserialize a dense vector into a numpy array."""
     if len(ba) < 5:
         raise TypeError("_deserialize_dense_vector called on a %d-byte array, "
-                "which is too short" % len(ba))
+                        "which is too short" % len(ba))
     length = ndarray(shape=[1], buffer=ba, offset=1, dtype=int32)[0]
     if len(ba) != 8 * length + 5:
         raise TypeError("_deserialize_dense_vector called on bytearray "
@@ -187,7 +188,7 @@ def _deserialize_sparse_vector(ba):
     """Deserialize a sparse vector into a MLlib SparseVector object."""
     if len(ba) < 9:
         raise TypeError("_deserialize_sparse_vector called on a %d-byte array, "
-                "which is too short" % len(ba))
+                        "which is too short" % len(ba))
     header = ndarray(shape=[2], buffer=ba, offset=1, dtype=int32)
     size = header[0]
     nonzeros = header[1]
@@ -205,7 +206,7 @@ def _serialize_double_matrix(m):
     if m.dtype != float64:
         if numpy.issubdtype(m.dtype, numpy.complex):
             raise TypeError("_serialize_double_matrix called on an ndarray of %s; "
-                    "wanted ndarray of float64" % m.dtype)
+                            "wanted ndarray of float64" % m.dtype)
         m = m.astype(float64)
     rows = m.shape[0]
     cols = m.shape[1]
@@ -225,10 +226,10 @@ def _deserialize_double_matrix(ba):
     """Deserialize a double matrix from a mutually understood format."""
     if type(ba) != bytearray:
         raise TypeError("_deserialize_double_matrix called on a %s; "
-                "wanted bytearray" % type(ba))
+                        "wanted bytearray" % type(ba))
     if len(ba) < 9:
         raise TypeError("_deserialize_double_matrix called on a %d-byte array, "
-                "which is too short" % len(ba))
+                        "which is too short" % len(ba))
     if ba[0] != DENSE_MATRIX_MAGIC:
         raise TypeError("_deserialize_double_matrix called on bytearray "
                         "with wrong magic")
@@ -267,7 +268,7 @@ def _copyto(array, buffer, offset, shape, dtype):
 def _get_unmangled_rdd(data, serializer):
     dataBytes = data.map(serializer)
     dataBytes._bypass_serializer = True
-    dataBytes.cache() # TODO: users should unpersist() this later!
+    dataBytes.cache()  # TODO: users should unpersist() this later!
     return dataBytes
 
 
@@ -293,14 +294,14 @@ def _linear_predictor_typecheck(x, coeffs):
     if type(x) == ndarray:
         if x.ndim == 1:
             if x.shape != coeffs.shape:
-                raise RuntimeError("Got array of %d elements; wanted %d"
-                        % (numpy.shape(x)[0], coeffs.shape[0]))
+                raise RuntimeError("Got array of %d elements; wanted %d" % (
+                    numpy.shape(x)[0], coeffs.shape[0]))
         else:
             raise RuntimeError("Bulk predict not yet supported.")
     elif type(x) == SparseVector:
         if x.size != coeffs.shape[0]:
-            raise RuntimeError("Got sparse vector of size %d; wanted %d"
-                    % (x.size, coeffs.shape[0]))
+            raise RuntimeError("Got sparse vector of size %d; wanted %d" % (
+                x.size, coeffs.shape[0]))
     elif (type(x) == RDD):
         raise RuntimeError("Bulk predict not yet supported.")
     else:
@@ -315,7 +316,7 @@ def _get_initial_weights(initial_weights, data):
     if type(initial_weights) == ndarray:
         if initial_weights.ndim != 1:
             raise TypeError("At least one data element has "
-                    + initial_weights.ndim + " dimensions, which is not 1")
+                            + initial_weights.ndim + " dimensions, which is not 1")
         initial_weights = numpy.zeros([initial_weights.shape[0]])
     elif type(initial_weights) == SparseVector:
         initial_weights = numpy.zeros([initial_weights.size])
@@ -333,10 +334,10 @@ def _regression_train_wrapper(sc, train_func, klass, data, initial_weights):
         raise RuntimeError("JVM call result had unexpected length")
     elif type(ans[0]) != bytearray:
         raise RuntimeError("JVM call result had first element of type "
-                + type(ans[0]).__name__ + " which is not bytearray")
+                           + type(ans[0]).__name__ + " which is not bytearray")
     elif type(ans[1]) != float:
         raise RuntimeError("JVM call result had second element of type "
-                + type(ans[0]).__name__ + " which is not float")
+                           + type(ans[0]).__name__ + " which is not float")
     return klass(_deserialize_double_vector(ans[0]), ans[1])
 
 
@@ -450,8 +451,7 @@ def _test():
     import doctest
     globs = globals().copy()
     globs['sc'] = SparkContext('local[4]', 'PythonTest', batchSize=2)
-    (failure_count, test_count) = doctest.testmod(globs=globs,
-            optionflags=doctest.ELLIPSIS)
+    (failure_count, test_count) = doctest.testmod(globs=globs, optionflags=doctest.ELLIPSIS)
     globs['sc'].stop()
     if failure_count:
         exit(-1)
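
Note: the first hunk in this file documents the sparse double vector wire format: a 1-byte magic value of 2, a 4-byte length, a 4-byte nonzero count, then the packed indices and values. A hedged sketch of that byte layout using Python's struct module (illustrative only; the module's real serializer builds its bytearray with NumPy in native byte order, while little-endian is assumed here):

    import struct

    SPARSE_VECTOR_MAGIC = 2     # the "1-byte 2" in the format comment
    size = 5                    # logical length of the vector
    indices = [0, 3]            # positions of the nonzero entries
    values = [1.5, -2.0]        # their values

    ba = bytearray()
    ba += struct.pack("<Bii", SPARSE_VECTOR_MAGIC, size, len(indices))  # magic, length, nonzeros
    ba += struct.pack("<%di" % len(indices), *indices)                  # nonzeros * 4 bytes of indices
    ba += struct.pack("<%dd" % len(values), *values)                    # nonzeros * 8 bytes of values
    print(len(ba))  # 1 + 4 + 4 + 2*4 + 2*8 = 33 bytes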
