
Commit 897da63

Merge branch 'master' of git://git.apache.org/spark into emacs-metafiles-ignore

2 parents: 8cade06 + 983609a

70 files changed, +1575 −290 lines changed


.gitignore

Lines changed: 4 additions & 6 deletions
@@ -17,11 +17,10 @@ out/
 third_party/libmesos.so
 third_party/libmesos.dylib
 conf/java-opts
-conf/spark-env.sh
-conf/streaming-env.sh
-conf/log4j.properties
-conf/spark-defaults.conf
-conf/hive-site.xml
+conf/*.sh
+conf/*.properties
+conf/*.conf
+conf/*.xml
 docs/_site
 docs/api
 target/
@@ -52,7 +51,6 @@ unit-tests.log
 /lib/
 rat-results.txt
 scalastyle.txt
-conf/*.conf
 scalastyle-output.xml

 # For Hive

CONTRIBUTING.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+## Contributing to Spark
+
+Contributions via GitHub pull requests are gladly accepted from their original
+author. Along with any pull requests, please state that the contribution is
+your original work and that you license the work to the project under the
+project's open source license. Whether or not you state this explicitly, by
+submitting any copyrighted material via pull request, email, or other means
+you agree to license the material under the project's open source license and
+warrant that you have the legal authority to do so.
+
+Please see [Contributing to Spark wiki page](https://cwiki.apache.org/SPARK/Contributing+to+Spark)
+for more information.

README.md

Lines changed: 16 additions & 62 deletions
@@ -13,16 +13,19 @@ and Spark Streaming for stream processing.
 ## Online Documentation

 You can find the latest Spark documentation, including a programming
-guide, on the project webpage at <http://spark.apache.org/documentation.html>.
+guide, on the [project web page](http://spark.apache.org/documentation.html).
 This README file only contains basic setup instructions.

 ## Building Spark

-Spark is built on Scala 2.10. To build Spark and its example programs, run:
+Spark is built using [Apache Maven](http://maven.apache.org/).
+To build Spark and its example programs, run:

-    ./sbt/sbt assembly
+    mvn -DskipTests clean package

 (You do not need to do this if you downloaded a pre-built package.)
+More detailed documentation is available from the project site, at
+["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).

 ## Interactive Scala Shell

@@ -71,73 +74,24 @@ can be run using:

     ./dev/run-tests

+Please see the guidance on how to
+[run all automated tests](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting).
+
 ## A Note About Hadoop Versions

 Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
 storage systems. Because the protocols have changed in different versions of
 Hadoop, you must build Spark against the same version that your cluster runs.
-You can change the version by setting `-Dhadoop.version` when building Spark.
-
-For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
-versions without YARN, use:
-
-    # Apache Hadoop 1.2.1
-    $ sbt/sbt -Dhadoop.version=1.2.1 assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v1
-    $ sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.2.0 assembly
-
-For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
-with YARN, also set `-Pyarn`:
-
-    # Apache Hadoop 2.0.5-alpha
-    $ sbt/sbt -Dhadoop.version=2.0.5-alpha -Pyarn assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v2
-    $ sbt/sbt -Dhadoop.version=2.0.0-cdh4.2.0 -Pyarn assembly
-
-    # Apache Hadoop 2.2.X and newer
-    $ sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly
-
-When developing a Spark application, specify the Hadoop version by adding the
-"hadoop-client" artifact to your project's dependencies. For example, if you're
-using Hadoop 1.2.1 and build your application using SBT, add this entry to
-`libraryDependencies`:
-
-    "org.apache.hadoop" % "hadoop-client" % "1.2.1"

-If your project is built with Maven, add this to your POM file's `<dependencies>` section:
-
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-client</artifactId>
-      <version>1.2.1</version>
-    </dependency>
-
-
-## A Note About Thrift JDBC server and CLI for Spark SQL
-
-Spark SQL supports Thrift JDBC server and CLI.
-See sql-programming-guide.md for more information about using the JDBC server and CLI.
-You can use those features by setting `-Phive` when building Spark as follows.
-
-    $ sbt/sbt -Phive assembly
+Please refer to the build documentation at
+["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
+for detailed guidance on building for a particular distribution of Hadoop, including
+building for particular Hive and Hive Thriftserver distributions. See also
+["Third Party Hadoop Distributions"](http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html)
+for guidance on building a Spark application that works with a particular
+distribution.

 ## Configuration

 Please refer to the [Configuration guide](http://spark.apache.org/docs/latest/configuration.html)
 in the online documentation for an overview on how to configure Spark.
-
-
-## Contributing to Spark
-
-Contributions via GitHub pull requests are gladly accepted from their original
-author. Along with any pull requests, please state that the contribution is
-your original work and that you license the work to the project under the
-project's open source license. Whether or not you state this explicitly, by
-submitting any copyrighted material via pull request, email, or other means
-you agree to license the material under the project's open source license and
-warrant that you have the legal authority to do so.
-
-Please see [Contributing to Spark wiki page](https://cwiki.apache.org/SPARK/Contributing+to+Spark)
-for more information.

bin/spark-class

Lines changed: 1 addition & 1 deletion
@@ -105,7 +105,7 @@ else
     exit 1
   fi
 fi
-JAVA_VERSION=$("$RUNNER" -version 2>&1 | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q')
+JAVA_VERSION=$("$RUNNER" -version 2>&1 | sed 's/.* version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q')

 # Set JAVA_OPTS to be able to load native libraries and to set heap size
 if [ "$JAVA_VERSION" -ge 18 ]; then

core/src/main/scala/org/apache/spark/SparkContext.scala

Lines changed: 0 additions & 3 deletions
@@ -1072,11 +1072,8 @@ class SparkContext(config: SparkConf) extends Logging {
     val callSite = getCallSite
     val cleanedFunc = clean(func)
     logInfo("Starting job: " + callSite.shortForm)
-    val start = System.nanoTime
     dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
       resultHandler, localProperties.get)
-    logInfo(
-      "Job finished: " + callSite.shortForm + ", took " + (System.nanoTime - start) / 1e9 + " s")
     rdd.doCheckpoint()
   }


core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

Lines changed: 1 addition & 1 deletion
@@ -776,7 +776,7 @@ private[spark] object PythonRDD extends Logging {
   }

   /**
-   * Convert and RDD of Java objects to and RDD of serialized Python objects, that is usable by
+   * Convert an RDD of Java objects to an RDD of serialized Python objects, that is usable by
    * PySpark.
    */
   def javaToPython(jRDD: JavaRDD[Any]): JavaRDD[Array[Byte]] = {

core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala

Lines changed: 51 additions & 0 deletions
@@ -17,6 +17,8 @@

 package org.apache.spark.api.python

+import java.nio.ByteOrder
+
 import scala.collection.JavaConversions._
 import scala.util.Failure
 import scala.util.Try
@@ -28,6 +30,55 @@ import org.apache.spark.rdd.RDD

 /** Utilities for serialization / deserialization between Python and Java, using Pickle. */
 private[python] object SerDeUtil extends Logging {
+  // Unpickle array.array generated by Python 2.6
+  class ArrayConstructor extends net.razorvine.pickle.objects.ArrayConstructor {
+    //  /* Description of types */
+    //  static struct arraydescr descriptors[] = {
+    //   {'c', sizeof(char), c_getitem, c_setitem},
+    //   {'b', sizeof(char), b_getitem, b_setitem},
+    //   {'B', sizeof(char), BB_getitem, BB_setitem},
+    //  #ifdef Py_USING_UNICODE
+    //   {'u', sizeof(Py_UNICODE), u_getitem, u_setitem},
+    //  #endif
+    //   {'h', sizeof(short), h_getitem, h_setitem},
+    //   {'H', sizeof(short), HH_getitem, HH_setitem},
+    //   {'i', sizeof(int), i_getitem, i_setitem},
+    //   {'I', sizeof(int), II_getitem, II_setitem},
+    //   {'l', sizeof(long), l_getitem, l_setitem},
+    //   {'L', sizeof(long), LL_getitem, LL_setitem},
+    //   {'f', sizeof(float), f_getitem, f_setitem},
+    //   {'d', sizeof(double), d_getitem, d_setitem},
+    //   {'\0', 0, 0, 0} /* Sentinel */
+    //  };
+    // TODO: support Py_UNICODE with 2 bytes
+    // FIXME: unpickle array of float is wrong in Pyrolite, so we reverse the
+    // machine code for float/double here to workaround it.
+    // we should fix this after Pyrolite fix them
+    val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) {
+      Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9,
+        'L' -> 11, 'l' -> 13, 'f' -> 14, 'd' -> 16, 'u' -> 21
+      )
+    } else {
+      Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8,
+        'L' -> 10, 'l' -> 12, 'f' -> 15, 'd' -> 17, 'u' -> 20
+      )
+    }
+    override def construct(args: Array[Object]): Object = {
+      if (args.length == 1) {
+        construct(args ++ Array(""))
+      } else if (args.length == 2 && args(1).isInstanceOf[String]) {
+        val typecode = args(0).asInstanceOf[String].charAt(0)
+        val data: String = args(1).asInstanceOf[String]
+        construct(typecode, machineCodes(typecode), data.getBytes("ISO-8859-1"))
+      } else {
+        super.construct(args)
+      }
+    }
+  }
+
+  def initialize() = {
+    Unpickler.registerConstructor("array", "array", new ArrayConstructor())
+  }

   private def checkPickle(t: (Any, Any)): (Boolean, Boolean) = {
     val pickle = new Pickler
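The registered constructor lets Pyrolite's `Unpickler` rebuild `array.array` values pickled by CPython, with the machine code chosen by native byte order. A hypothetical usage sketch (not part of the commit): it sits in the same package only so it can reach the `private[python]` object, and `pickledBytes` is assumed to hold the output of Python 2's `pickle.dumps(array.array('d', [...]))`, which is not shown here.

    // Hypothetical sketch -- not part of the commit.
    package org.apache.spark.api.python

    import net.razorvine.pickle.Unpickler

    object ArrayUnpickleSketch {
      def decode(pickledBytes: Array[Byte]): AnyRef = {
        // Register the endian-aware ArrayConstructor for the ("array", "array") pickle opcode.
        SerDeUtil.initialize()
        // With the constructor registered, an array.array('d', ...) pickle is expected to
        // come back as a primitive JVM array (double[] for typecode 'd').
        new Unpickler().loads(pickledBytes)
      }
    }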

core/src/main/scala/org/apache/spark/network/ManagedBuffer.scala

Lines changed: 10 additions & 2 deletions
@@ -19,6 +19,7 @@ package org.apache.spark.network

 import java.io.{FileInputStream, RandomAccessFile, File, InputStream}
 import java.nio.ByteBuffer
+import java.nio.channels.FileChannel
 import java.nio.channels.FileChannel.MapMode

 import com.google.common.io.ByteStreams
@@ -66,8 +67,15 @@ final class FileSegmentManagedBuffer(val file: File, val offset: Long, val lengt
   override def size: Long = length

   override def nioByteBuffer(): ByteBuffer = {
-    val channel = new RandomAccessFile(file, "r").getChannel
-    channel.map(MapMode.READ_ONLY, offset, length)
+    var channel: FileChannel = null
+    try {
+      channel = new RandomAccessFile(file, "r").getChannel
+      channel.map(MapMode.READ_ONLY, offset, length)
+    } finally {
+      if (channel != null) {
+        channel.close()
+      }
+    }
   }

   override def inputStream(): InputStream = {
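The old code leaked one file descriptor per call because the `FileChannel` behind the mapping was never closed. The fix works because a `MappedByteBuffer` stays valid after its channel is closed, so the channel can be released in `finally`. A small, hypothetical demo (not part of the commit) of that JDK behavior:

    // Hypothetical demo -- not part of the commit. Shows that a memory mapping created via
    // FileChannel.map() remains readable after the channel is closed, which is why the
    // patched nioByteBuffer() can close the channel eagerly in `finally`.
    import java.io.{File, FileOutputStream, RandomAccessFile}
    import java.nio.channels.FileChannel.MapMode

    object MapOutlivesChannelDemo {
      def main(args: Array[String]): Unit = {
        val f = File.createTempFile("managed-buffer-demo", ".bin")
        f.deleteOnExit()
        val out = new FileOutputStream(f)
        out.write(Array[Byte](1, 2, 3, 4))
        out.close()

        val channel = new RandomAccessFile(f, "r").getChannel
        val buf = channel.map(MapMode.READ_ONLY, 0L, f.length())
        channel.close()          // close right away, as the patched code does

        println(buf.get(2))      // prints 3 -- the mapping is still valid
      }
    }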

core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

Lines changed: 7 additions & 2 deletions
@@ -507,11 +507,16 @@ class DAGScheduler(
       resultHandler: (Int, U) => Unit,
       properties: Properties = null)
   {
+    val start = System.nanoTime
     val waiter = submitJob(rdd, func, partitions, callSite, allowLocal, resultHandler, properties)
     waiter.awaitResult() match {
-      case JobSucceeded => {}
+      case JobSucceeded => {
+        logInfo("Job %d finished: %s, took %f s".format
+          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
+      }
       case JobFailed(exception: Exception) =>
-        logInfo("Failed to run " + callSite.shortForm)
+        logInfo("Job %d failed: %s, took %f s".format
+          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
         throw exception
     }
   }
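The elapsed-time logging removed from `SparkContext.runJob` now wraps `waiter.awaitResult()` here, so both outcomes report the job id and duration. A hypothetical, minimal sketch (not part of the commit) isolating the same measure-and-format pattern:

    // Hypothetical sketch -- not part of the commit. Captures System.nanoTime before a
    // blocking call and reports elapsed seconds afterwards, as the change above does.
    object TimingSketch {
      def timed[T](label: String)(body: => T): T = {
        val start = System.nanoTime
        try body
        finally println("%s took %f s".format(label, (System.nanoTime - start) / 1e9))
      }

      def main(args: Array[String]): Unit = {
        timed("example job") { Thread.sleep(250) }
      }
    }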

core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ import scala.collection.mutable.ArrayBuffer
 import scala.collection.mutable.HashSet
 import scala.collection.mutable.Queue

-import org.apache.spark.{TaskContext, Logging, SparkException}
+import org.apache.spark.{TaskContext, Logging}
 import org.apache.spark.network.{ManagedBuffer, BlockFetchingListener, BlockTransferService}
 import org.apache.spark.serializer.Serializer
 import org.apache.spark.util.Utils
