
Commit 4106558

aarondav authored and pwendell committed
SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging
Previously, we decided whether to include the datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" is run. This meant that a typical and previously supported pathway would start using hive jars.

This patch has the following features/bug fixes:

- Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
- Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven).
- assemble-deps fixed, since we no longer use a different ASSEMBLY_DIR.
- Avoid adding a log message in compute-classpath.sh to the classpath :)

Still TODO before mergeable:

- We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set, similar to how sbt downloads itself.
- Spark SQL documentation updates.

Author: Aaron Davidson <[email protected]>

Closes #237 from aarondav/master and squashes the following commits:

5dc4329 [Aaron Davidson] Typo fixes
dd4f298 [Aaron Davidson] Doc update
dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
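For reference, the sbt side of the switch is exercised as shown below once the patch is applied; the command is the one added to docs/sql-programming-guide.md in this commit, and SPARK_HIVE defaults to false when unset (the Maven equivalent, -Phive, is illustrated after the assembly/pom.xml diff):

    # Build an assembly jar that bundles Hive (run from the Spark source root)
    SPARK_HIVE=true sbt/sbt assembly/assembly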
1 parent 7ce52c4 commit 4106558

File tree

8 files changed (+83, -32 lines)


assembly/pom.xml

Lines changed: 10 additions & 0 deletions
@@ -163,6 +163,16 @@
       </dependency>
     </dependencies>
   </profile>
+  <profile>
+    <id>hive</id>
+    <dependencies>
+      <dependency>
+        <groupId>org.apache.spark</groupId>
+        <artifactId>spark-hive_${scala.binary.version}</artifactId>
+        <version>${project.version}</version>
+      </dependency>
+    </dependencies>
+  </profile>
   <profile>
     <id>spark-ganglia-lgpl</id>
     <dependencies>
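With this profile in place, the Maven build can include Hive by activating -Phive on the command line. A minimal sketch, run from the Spark source root; the goals other than the new -Phive flag are illustrative (the release script below combines it with -Pyarn and -Pspark-ganglia-lgpl):

    # Package Spark with the spark-hive module pulled into the assembly
    mvn -Phive -DskipTests clean package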

bin/compute-classpath.sh

Lines changed: 19 additions & 16 deletions
@@ -30,21 +30,7 @@ FWDIR="$(cd `dirname $0`/..; pwd)"
 # Build up classpath
 CLASSPATH="$SPARK_CLASSPATH:$FWDIR/conf"

-# Support for interacting with Hive. Since hive pulls in a lot of dependencies that might break
-# existing Spark applications, it is not included in the standard spark assembly. Instead, we only
-# include it in the classpath if the user has explicitly requested it by running "sbt hive/assembly"
-# Hopefully we will find a way to avoid uber-jars entirely and deploy only the needed packages in
-# the future.
-if [ -f "$FWDIR"/sql/hive/target/scala-$SCALA_VERSION/spark-hive-assembly-*.jar ]; then
-
-  # Datanucleus jars do not work if only included in the uberjar as plugin.xml metadata is lost.
-  DATANUCLEUSJARS=$(JARS=("$FWDIR/lib_managed/jars"/datanucleus-*.jar); IFS=:; echo "${JARS[*]}")
-  CLASSPATH=$CLASSPATH:$DATANUCLEUSJARS
-
-  ASSEMBLY_DIR="$FWDIR/sql/hive/target/scala-$SCALA_VERSION/"
-else
-  ASSEMBLY_DIR="$FWDIR/assembly/target/scala-$SCALA_VERSION/"
-fi
+ASSEMBLY_DIR="$FWDIR/assembly/target/scala-$SCALA_VERSION"

 # First check if we have a dependencies jar. If so, include binary classes with the deps jar
 if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then

@@ -59,7 +45,7 @@ if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
   CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
   CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"

-  DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
+  DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar`
   CLASSPATH="$CLASSPATH:$DEPS_ASSEMBLY_JAR"
 else
   # Else use spark-assembly jar from either RELEASE or assembly directory

@@ -71,6 +57,23 @@ else
   CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR"
 fi

+# When Hive support is needed, Datanucleus jars must be included on the classpath.
+# Datanucleus jars do not work if only included in the uber jar as plugin.xml metadata is lost.
+# Both sbt and maven will populate "lib_managed/jars/" with the datanucleus jars when Spark is
+# built with Hive, so first check if the datanucleus jars exist, and then ensure the current Spark
+# assembly is built for Hive, before actually populating the CLASSPATH with the jars.
+# Note that this check order is faster (by up to half a second) in the case where Hive is not used.
+num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ | grep "datanucleus-.*\\.jar" | wc -l)
+if [ $num_datanucleus_jars -gt 0 ]; then
+  AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
+  num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
+  if [ $num_hive_files -gt 0 ]; then
+    echo "Spark assembly has been built with Hive, including Datanucleus jars on classpath" 1>&2
+    DATANUCLEUSJARS=$(echo "$FWDIR/lib_managed/jars"/datanucleus-*.jar | tr " " :)
+    CLASSPATH=$CLASSPATH:$DATANUCLEUSJARS
+  fi
+fi
+
 # Add test classes if we're running from SBT or Maven with SPARK_TESTING set to 1
 if [[ $SPARK_TESTING == 1 ]]; then
   CLASSPATH="$CLASSPATH:$FWDIR/core/target/scala-$SCALA_VERSION/test-classes"
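To make the two new checks concrete, here is a rough illustration of what each one does in isolation; the jar names and paths are hypothetical stand-ins, not output taken from this commit:

    # Glob-to-classpath conversion: space-separated paths become a colon-separated list
    echo /tmp/jars/datanucleus-core-3.2.2.jar /tmp/jars/datanucleus-rdbms-3.2.1.jar | tr " " :
    # -> /tmp/jars/datanucleus-core-3.2.2.jar:/tmp/jars/datanucleus-rdbms-3.2.1.jar

    # Hive detection: count assembly entries under the Hive execution package;
    # a non-zero count means the assembly was built with Hive
    jar tvf assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar org/apache/hadoop/hive/ql/exec | wc -l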

bin/spark-class

Lines changed: 0 additions & 2 deletions
@@ -154,5 +154,3 @@ if [ "$SPARK_PRINT_LAUNCH_COMMAND" == "1" ]; then
 fi

 exec "$RUNNER" -cp "$CLASSPATH" $JAVA_OPTS "$@"
-
-

dev/create-release/create-release.sh

Lines changed: 2 additions & 2 deletions
@@ -49,14 +49,14 @@ mvn -DskipTests \
   -Darguments="-DskipTests=true -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -Dgpg.passphrase=${GPG_PASSPHRASE}" \
   -Dusername=$GIT_USERNAME -Dpassword=$GIT_PASSWORD \
   -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 \
-  -Pyarn -Pspark-ganglia-lgpl \
+  -Pyarn -Phive -Pspark-ganglia-lgpl\
   -Dtag=$GIT_TAG -DautoVersionSubmodules=true \
   --batch-mode release:prepare

 mvn -DskipTests \
   -Darguments="-DskipTests=true -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -Dgpg.passphrase=${GPG_PASSPHRASE}" \
   -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 \
-  -Pyarn -Pspark-ganglia-lgpl\
+  -Pyarn -Phive -Pspark-ganglia-lgpl\
   release:perform

 rm -rf spark

docs/sql-programming-guide.md

Lines changed: 2 additions & 2 deletions
@@ -264,8 +264,8 @@ evaluated by the SQL execution engine. A full list of the functions supported c

 Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
 However, since Hive has a large number of dependencies, it is not included in the default Spark assembly.
-In order to use Hive you must first run '`SPARK_HIVE=true sbt/sbt assembly/assembly`'. This command builds a new assembly
-jar that includes Hive. Note that this Hive assembly jar must also be present
+In order to use Hive you must first run '`SPARK_HIVE=true sbt/sbt assembly/assembly`' (or use `-Phive` for maven).
+This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present
 on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
 (SerDes) in order to acccess data stored in Hive.

pom.xml

Lines changed: 6 additions & 1 deletion
@@ -377,7 +377,6 @@
         <groupId>org.apache.derby</groupId>
         <artifactId>derby</artifactId>
         <version>10.4.2.0</version>
-        <scope>test</scope>
       </dependency>
       <dependency>
         <groupId>net.liftweb</groupId>

@@ -580,6 +579,12 @@
           </exclusion>
         </exclusions>
       </dependency>
+      <dependency>
+        <!-- Matches the version of jackson-core-asl pulled in by avro -->
+        <groupId>org.codehaus.jackson</groupId>
+        <artifactId>jackson-mapper-asl</artifactId>
+        <version>1.8.8</version>
+      </dependency>
     </dependencies>
   </dependencyManagement>

project/SparkBuild.scala

Lines changed: 16 additions & 9 deletions
@@ -43,6 +43,8 @@ object SparkBuild extends Build {

   val DEFAULT_YARN = false

+  val DEFAULT_HIVE = false
+
   // HBase version; set as appropriate.
   val HBASE_VERSION = "0.94.6"

@@ -67,15 +69,17 @@ object SparkBuild extends Build {

   lazy val sql = Project("sql", file("sql/core"), settings = sqlCoreSettings) dependsOn(core, catalyst)

-  // Since hive is its own assembly, it depends on all of the modules.
-  lazy val hive = Project("hive", file("sql/hive"), settings = hiveSettings) dependsOn(sql, graphx, bagel, mllib, streaming, repl)
+  lazy val hive = Project("hive", file("sql/hive"), settings = hiveSettings) dependsOn(sql)
+
+  lazy val maybeHive: Seq[ClasspathDependency] = if (isHiveEnabled) Seq(hive) else Seq()
+  lazy val maybeHiveRef: Seq[ProjectReference] = if (isHiveEnabled) Seq(hive) else Seq()

   lazy val streaming = Project("streaming", file("streaming"), settings = streamingSettings) dependsOn(core)

   lazy val mllib = Project("mllib", file("mllib"), settings = mllibSettings) dependsOn(core)

   lazy val assemblyProj = Project("assembly", file("assembly"), settings = assemblyProjSettings)
-    .dependsOn(core, graphx, bagel, mllib, streaming, repl, sql) dependsOn(maybeYarn: _*) dependsOn(maybeGanglia: _*)
+    .dependsOn(core, graphx, bagel, mllib, streaming, repl, sql) dependsOn(maybeYarn: _*) dependsOn(maybeHive: _*) dependsOn(maybeGanglia: _*)

   lazy val assembleDeps = TaskKey[Unit]("assemble-deps", "Build assembly of dependencies and packages Spark projects")

@@ -101,6 +105,11 @@ object SparkBuild extends Build {
   lazy val hadoopClient = if (hadoopVersion.startsWith("0.20.") || hadoopVersion == "1.0.0") "hadoop-core" else "hadoop-client"
   val maybeAvro = if (hadoopVersion.startsWith("0.23.") && isYarnEnabled) Seq("org.apache.avro" % "avro" % "1.7.4") else Seq()

+  lazy val isHiveEnabled = Properties.envOrNone("SPARK_HIVE") match {
+    case None => DEFAULT_HIVE
+    case Some(v) => v.toBoolean
+  }
+
   // Include Ganglia integration if the user has enabled Ganglia
   // This is isolated from the normal build due to LGPL-licensed code in the library
   lazy val isGangliaEnabled = Properties.envOrNone("SPARK_GANGLIA_LGPL").isDefined

@@ -141,13 +150,13 @@ object SparkBuild extends Build {
   lazy val allExternalRefs = Seq[ProjectReference](externalTwitter, externalKafka, externalFlume, externalZeromq, externalMqtt)

   lazy val examples = Project("examples", file("examples"), settings = examplesSettings)
-    .dependsOn(core, mllib, graphx, bagel, streaming, externalTwitter, hive) dependsOn(allExternal: _*)
+    .dependsOn(core, mllib, graphx, bagel, streaming, hive) dependsOn(allExternal: _*)

   // Everything except assembly, hive, tools, java8Tests and examples belong to packageProjects
-  lazy val packageProjects = Seq[ProjectReference](core, repl, bagel, streaming, mllib, graphx, catalyst, sql) ++ maybeYarnRef ++ maybeGangliaRef
+  lazy val packageProjects = Seq[ProjectReference](core, repl, bagel, streaming, mllib, graphx, catalyst, sql) ++ maybeYarnRef ++ maybeHiveRef ++ maybeGangliaRef

   lazy val allProjects = packageProjects ++ allExternalRefs ++
-    Seq[ProjectReference](examples, tools, assemblyProj, hive) ++ maybeJava8Tests
+    Seq[ProjectReference](examples, tools, assemblyProj) ++ maybeJava8Tests

   def sharedSettings = Defaults.defaultSettings ++ MimaBuild.mimaSettings(file(sparkHome)) ++ Seq(
     organization := "org.apache.spark",

@@ -417,10 +426,8 @@ object SparkBuild extends Build {

   // Since we don't include hive in the main assembly this project also acts as an alternative
   // assembly jar.
-  def hiveSettings = sharedSettings ++ assemblyProjSettings ++ Seq(
+  def hiveSettings = sharedSettings ++ Seq(
     name := "spark-hive",
-    jarName in assembly <<= version map { v => "spark-hive-assembly-" + v + "-hadoop" + hadoopVersion + ".jar" },
-    jarName in packageDependency <<= version map { v => "spark-hive-assembly-" + v + "-hadoop" + hadoopVersion + "-deps.jar" },
     javaOptions += "-XX:MaxPermSize=1g",
     libraryDependencies ++= Seq(
       "org.apache.hive" % "hive-metastore" % hiveVersion,

sql/hive/pom.xml

Lines changed: 28 additions & 0 deletions
@@ -63,6 +63,10 @@
       <artifactId>hive-exec</artifactId>
       <version>${hive.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.codehaus.jackson</groupId>
+      <artifactId>jackson-mapper-asl</artifactId>
+    </dependency>
     <dependency>
       <groupId>org.apache.hive</groupId>
       <artifactId>hive-serde</artifactId>

@@ -87,6 +91,30 @@
         <groupId>org.scalatest</groupId>
         <artifactId>scalatest-maven-plugin</artifactId>
       </plugin>
+
+      <!-- Deploy datanucleus jars to the spark/lib_managed/jars directory -->
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-dependency-plugin</artifactId>
+        <version>2.4</version>
+        <executions>
+          <execution>
+            <id>copy-dependencies</id>
+            <phase>package</phase>
+            <goals>
+              <goal>copy-dependencies</goal>
+            </goals>
+            <configuration>
+              <!-- basedir is spark/sql/hive/ -->
+              <outputDirectory>${basedir}/../../lib_managed/jars</outputDirectory>
+              <overWriteReleases>false</overWriteReleases>
+              <overWriteSnapshots>false</overWriteSnapshots>
+              <overWriteIfNewer>true</overWriteIfNewer>
+              <includeGroupIds>org.datanucleus</includeGroupIds>
+            </configuration>
+          </execution>
+        </executions>
+      </plugin>
     </plugins>
   </build>
 </project>
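After a Maven build with -Phive (or an sbt build with SPARK_HIVE=true, which populates the same directory), the copy-dependencies execution above can be sanity-checked by looking for the copied jars; this is only an illustrative check, run from the Spark source root:

    # Non-empty output means compute-classpath.sh will find the Datanucleus jars
    ls lib_managed/jars/datanucleus-*.jar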
