[SPARK-49533][CORE][TESTS] Change default `ivySettings` in the `IvyTestUtils#withRepository` function to use `.ivy2.5.2` as the Default Ivy User Dir
### What changes were proposed in this pull request?
This pull request changes the default value of the `ivySettings` parameter of the `IvyTestUtils#withRepository` function. When the default `IvySettings` object is constructed, its `DefaultIvyUserDir` and `DefaultCache` are now adjusted through an additional call to `MavenUtils.processIvyPathArg` (a sketch follows this list):
1. `DefaultIvyUserDir` is set to `${user.home}/.ivy2.5.2`.
2. `DefaultCache` is set to the `cache` directory under the modified Ivy user dir, i.e. `${user.home}/.ivy2.5.2/cache`, replacing the previous default of `${user.home}/.ivy2/cache`.
These changes address a bad case in the testing process.
Additionally, so that `IvyTestUtils` can invoke `MavenUtils.processIvyPathArg`, the visibility of `processIvyPathArg` has been relaxed from `private` to `private[util]`.
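As a rough sketch of what the new default construction might look like, based on the description above (the helper name `defaultIvySettings` and the exact `processIvyPathArg` signature are assumptions inferred from the text, not copied from the diff; the code would need to live in the `org.apache.spark.util` package to see the `private[util]` member):
```scala
import org.apache.ivy.core.settings.IvySettings

// Hypothetical helper illustrating the new default value of the
// `ivySettings` parameter in IvyTestUtils#withRepository.
private def defaultIvySettings(): IvySettings = {
  val settings = new IvySettings
  // Callable from test code now that its visibility is `private[util]`.
  // With no explicit ivy path, this points DefaultIvyUserDir at
  // ${user.home}/.ivy2.5.2 and DefaultCache at ${user.home}/.ivy2.5.2/cache.
  MavenUtils.processIvyPathArg(ivySettings = settings, ivyPath = None)
  settings
}
```
With this in place, `withRepository` and `purgeLocalIvyCache` agree on the same cache directory.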
### Why are the changes needed?
To fix a bad case in testing; the reproduction steps are as follows:
1. Clean up the files and directories related to `mylib-0.1.jar` under `~/.ivy2.5.2`.
2. Execute the following tests using Java 21:
```
java -version
openjdk version "21.0.4" 2024-07-16 LTS
OpenJDK Runtime Environment Zulu21.36+17-CA (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM Zulu21.36+17-CA (build 21.0.4+7-LTS, mixed mode, sharing)
build/sbt clean "connect-client-jvm/testOnly org.apache.spark.sql.application.ReplE2ESuite" -Phive
```
```
Deleting /Users/yangjie01/.ivy2/cache/my.great.lib, exists: false
file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-2a9107ea-4e09-4dfe-a270-921d799837fb/ added as a remote repository with the name: repo-1
:: loading settings :: url = jar:file:/Users/yangjie01/Library/Caches/Coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/org/apache/ivy/ivy/2.5.2/ivy-2.5.2.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/yangjie01/.ivy2.5.2/cache
The jars for the packages stored in: /Users/yangjie01/.ivy2.5.2/jars
my.great.lib#mylib added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5827ff8a-7a85-4598-8ced-e949457752e4;1.0
confs: [default]
found my.great.lib#mylib;0.1 in repo-1
downloading file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-2a9107ea-4e09-4dfe-a270-921d799837fb/my/great/lib/mylib/0.1/mylib-0.1.jar ...
[SUCCESSFUL ] my.great.lib#mylib;0.1!mylib.jar (1ms)
:: resolution report :: resolve 4325ms :: artifacts dl 2ms
:: modules in use:
my.great.lib#mylib;0.1 from repo-1 in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 1 | 1 | 0 || 1 | 1 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-5827ff8a-7a85-4598-8ced-e949457752e4
confs: [default]
1 artifacts copied, 0 already retrieved (0kB/6ms)
Deleting /Users/yangjie01/.ivy2/cache/my.great.lib, exists: false
[info] - External JAR (6 seconds, 288 milliseconds)
...
[info] Run completed in 40 seconds, 441 milliseconds.
[info] Total number of tests run: 26
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
3. Re-execute the above tests using Java 17:
```
java -version
openjdk version "17.0.12" 2024-07-16 LTS
OpenJDK Runtime Environment Zulu17.52+17-CA (build 17.0.12+7-LTS)
OpenJDK 64-Bit Server VM Zulu17.52+17-CA (build 17.0.12+7-LTS, mixed mode, sharing)
build/sbt clean "connect-client-jvm/testOnly org.apache.spark.sql.application.ReplE2ESuite" -Phive
```
```
[info] - External JAR *** FAILED *** (1 second, 626 milliseconds)
[info] isContain was false Ammonite output did not contain 'Array[Int] = Array(1, 2, 3, 4, 5)':
[info] scala>
[info] scala> // this import will fail
[info] scala> import my.great.lib.MyLib
[info] scala>
[info] scala> // making library available in the REPL to compile UDF
[info] scala> import coursierapi.{Credentials, MavenRepository}
import coursierapi.{Credentials, MavenRepository}
[info]
[info] scala> interp.repositories() ++= Seq(MavenRepository.of("file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-6e6bc234-758f-44f1-a8b3-fbb79ed74647/"))
[info]
[info] scala> import $ivy.`my.great.lib:mylib:0.1`
import $ivy.$
[info]
[info] scala>
[info] scala> val func = udf((a: Int) => {
[info] import my.great.lib.MyLib
[info] MyLib.myFunc(a)
[info] })
func: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction(
[info] f = ammonite.$sess.cmd28$Helper$$Lambda$3059/0x0000000801da4218721b2487,
[info] dataType = IntegerType,
[info] inputEncoders = ArraySeq(Some(value = PrimitiveIntEncoder)),
[info] outputEncoder = Some(value = BoxedIntEncoder),
[info] givenName = None,
[info] nullable = true,
[info] deterministic = true
[info] )
[info]
[info] scala>
[info] scala> // add library to the Executor
[info] scala> spark.addArtifact("ivy://my.great.lib:mylib:0.1?repos=file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-6e6bc234-758f-44f1-a8b3-fbb79ed74647/")
[info]
[info] scala>
[info] scala> spark.range(5).select(func(col("id"))).as[Int].collect()
[info] scala>
[info] scala> semaphore.release()
[info] Error Output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
[info] Compiling /Users/yangjie01/SourceCode/git/spark-sbt/connector/connect/client/jvm/(console)
[info] cmd25.sc:1: not found: value my
[info] import my.great.lib.MyLib
[info] ^
[info] Compilation Failed
[info] org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] User defined function (` (cmd28$Helper$$Lambda$3054/0x0000007002189800)`: (int) => int) failed due to: java.lang.UnsupportedClassVersionError: my/great/lib/MyLib has been compiled by a more recent version of the Java Runtime (class file version 65.0), this version of the Java Runtime only recognizes class file versions up to 61.0. SQLSTATE: 39000
[info] org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:195)
[info] org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
[info] org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:114)
[info] org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
[info] org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.hasNext(ArrowConverters.scala:100)
[info] scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
[info] scala.collection.mutable.Growable.addAll(Growable.scala:61)
[info] scala.collection.mutable.Growable.addAll$(Growable.scala:57)
[info] scala.collection.mutable.ArrayBuilder.addAll(ArrayBuilder.scala:75)
[info] scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1505)
[info] scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1498)
[info] scala.collection.AbstractIterator.toArray(Iterator.scala:1303)
[info] org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.$anonfun$processAsArrowBatches$5(SparkConnectPlanExecution.scala:183)
[info] org.apache.spark.SparkContext.$anonfun$submitJob$1(SparkContext.scala:2608)
[info] org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
[info] org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
[info] org.apache.spark.scheduler.Task.run(Task.scala:146)
[info] org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)
[info] org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
[info] org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
[info] org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
[info] org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)
[info] java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info] java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info] java.lang.Thread.run(Thread.java:840)
[info] org.apache.spark.SparkException: java.lang.UnsupportedClassVersionError: my/great/lib/MyLib has been compiled by a more recent version of the Java Runtime (class file version 65.0), this version of the Java Runtime only recognizes class file versions up to 61.0
[info] java.lang.ClassLoader.defineClass1(Native Method)
[info] java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
[info] java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
[info] java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
[info] java.net.URLClassLoader$1.run(URLClassLoader.java:427)
[info] java.net.URLClassLoader$1.run(URLClassLoader.java:421)
[info] java.security.AccessController.doPrivileged(AccessController.java:712)
[info] java.net.URLClassLoader.findClass(URLClassLoader.java:420)
[info] java.lang.ClassLoader.loadClass(ClassLoader.java:592)
[info] org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:55)
[info] java.lang.ClassLoader.loadClass(ClassLoader.java:579)
[info] org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
[info] java.lang.ClassLoader.loadClass(ClassLoader.java:525)
[info] org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:109)
[info] java.lang.ClassLoader.loadClass(ClassLoader.java:592)
[info] java.lang.ClassLoader.loadClass(ClassLoader.java:525)
[info] ammonite.$sess.cmd28$Helper.$anonfun$func$1(cmd28.sc:3)
[info] ammonite.$sess.cmd28$Helper.$anonfun$func$1$adapted(cmd28.sc:1)
[info] org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:112)
[info] org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
[info] org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.hasNext(ArrowConverters.scala:100)
[info] scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
[info] scala.collection.mutable.Growable.addAll(Growable.scala:61)
[info] scala.collection.mutable.Growable.addAll$(Growable.scala:57)
[info] scala.collection.mutable.ArrayBuilder.addAll(ArrayBuilder.scala:75)
[info] scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1505)
[info] scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1498)
[info] scala.collection.AbstractIterator.toArray(Iterator.scala:1303)
[info] org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.$anonfun$processAsArrowBatches$5(SparkConnectPlanExecution.scala:183)
[info] org.apache.spark.SparkContext.$anonfun$submitJob$1(SparkContext.scala:2608)
[info] org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
[info] org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
[info] org.apache.spark.scheduler.Task.run(Task.scala:146)
[info] org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)
[info] org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
[info] org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
[info] org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
[info] org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)
[info] java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info] java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info] java.lang.Thread.run(Thread.java:840) (ReplE2ESuite.scala:117)
```
I suspect the causes of the aforementioned bad case are as follows:
1. Following #45075, to address compatibility issues, Spark 4.0 adopted `~/.ivy2.5.2` as the default Ivy user directory. When the tests are executed with Java 21, the compiled `mylib-0.1.jar` is published to `~/.ivy2.5.2/cache/my.great.lib/mylib/jars`.
2. However, the `getDefaultCache` method of the default `IvySettings` instance still returns `~/.ivy2/cache`. Consequently, when the `purgeLocalIvyCache` function is called within `withRepository`, it attempts to clean the `artifact` and `deps` directories under `~/.ivy2/cache` and therefore fails to remove the Java 21-compiled `mylib-0.1.jar` at `~/.ivy2.5.2/cache/my.great.lib/mylib/jars`. When the tests are subsequently executed with Java 17 and attempt to load that jar, they fail with the `UnsupportedClassVersionError` shown above. (A sketch of this path mismatch follows the permalinks below.)
https://github.com/apache/spark/blob/9269a0bfed56429e999269dfdfd89aefcb1b7261/common/utils/src/test/scala/org/apache/spark/util/IvyTestUtils.scala#L361-L371
https://github.com/apache/spark/blob/9269a0bfed56429e999269dfdfd89aefcb1b7261/common/utils/src/test/scala/org/apache/spark/util/IvyTestUtils.scala#L392-L403
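To make the mismatch concrete, here is a minimal standalone sketch (illustrative only, not the test code itself; it uses only standard `org.apache.ivy.core.settings.IvySettings` methods, and the default paths assume `ivy.home` is not set):
```scala
import java.io.File
import org.apache.ivy.core.settings.IvySettings

object CacheMismatchSketch {
  def main(args: Array[String]): Unit = {
    val userHome = System.getProperty("user.home")

    // A freshly constructed IvySettings typically defaults to ~/.ivy2/cache,
    // which is the directory purgeLocalIvyCache ended up scanning ...
    val stale = new IvySettings
    println(stale.getDefaultCache) // e.g. /Users/<user>/.ivy2/cache

    // ... while the test jars were actually resolved into ~/.ivy2.5.2/cache,
    // the directory Spark 4.0 uses as its default Ivy user dir.
    val fixed = new IvySettings
    fixed.setDefaultIvyUserDir(new File(userHome, ".ivy2.5.2"))
    fixed.setDefaultCache(new File(fixed.getDefaultIvyUserDir, "cache"))
    println(fixed.getDefaultCache) // e.g. /Users/<user>/.ivy2.5.2/cache
  }
}
```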
To address this issue, the pull request modifies the default configuration of the `IvySettings` instance so that `purgeLocalIvyCache` can properly clean up the corresponding cache files under `~/.ivy2.5.2/cache`, which resolves the problem above.
### Does this PR introduce _any_ user-facing change?
No, this change is test-only.
### How was this patch tested?
1. Pass GitHub Actions.
2. Manually re-ran the tests described above: they now succeed, and the `~/.ivy2.5.2/cache/my.great.lib` directory is confirmed to be cleaned up promptly.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #48006 from LuciferYang/IvyTestUtils-withRepository.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
File tree: 2 files changed (+14, -2)
- common/utils/src/main/scala/org/apache/spark/util/MavenUtils.scala (+1, -1)
- common/utils/src/test/scala/org/apache/spark/util/IvyTestUtils.scala (+13, -1)