[SPARK-3319] [SPARK-3338] Resolve Spark submit config paths #2232
Conversation
QA tests have started for PR 2232 at commit
QA tests have finished for PR 2232 at commit
@sryza can you look at this?
QA tests have started for PR 2232 at commit
QA tests have finished for PR 2232 at commit
Could this change behavior in cases where `spark.yarn.dist.files` is configured with no scheme? Without this change, a path with no scheme is interpreted to mean the file is on HDFS; with it, such a path is interpreted to mean the file is on the local FS?
Yes, this changes the behavior if the user sets it. However, I believe these configs were actually introduced before Spark submit. Do you think we should leave the YARN ones out of this PR to keep backward compatibility?
Hmm. My feeling is that it's better to be consistent here and consider the old behavior a bug than to maintain compatibility for a cornerish case of a property that's rarely used. Just wanted to bring it up. @pwendell do you have an opinion on this? Either way, for each of these properties, we should document whether they can take HDFS paths and the default behavior when no scheme is given. Also, are there any strange corner cases we can run into if a file exists at the same path on the local FS and HDFS? Last, do we need to also resolve URIs for `spark.yarn.jar`?
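The no-scheme question discussed above can be sketched as follows. This is a hypothetical illustration of the resolution rule the PR standardizes on (no scheme means local FS), not Spark's actual implementation; the object and method names are invented for the example:

```scala
import java.net.URI
import java.nio.file.Paths

object ResolveSketch {
  // Hypothetical resolver: if the path already carries a scheme (hdfs:, file:,
  // ...), keep it; otherwise treat the path as a local file and make it an
  // absolute file: URI. Not Spark's real code.
  def resolve(path: String): URI = {
    val uri = new URI(path)
    if (uri.getScheme != null) uri                 // scheme given: trust it
    else Paths.get(path).toAbsolutePath.toUri      // no scheme: local FS
  }

  def main(args: Array[String]): Unit = {
    println(resolve("hdfs:///user/me/my.jar"))     // keeps the hdfs scheme
    println(resolve("/tmp/my.jar"))                // becomes a file: URI
  }
}
```

Under the old behavior being discussed, a schemeless `spark.yarn.dist.files` entry would instead have fallen through to the default filesystem (HDFS), which is exactly the compatibility question raised here.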
Yes, probably. This list is by no means complete. For
From the other PR: we need to keep backwards compatibility for the env variables; the configs should be fine. I believe this PR matches the behavior described below. Please correct me if I'm wrong, as I haven't looked at the code in detail. The behavior we want, and what should have been the behavior as of pr969, is:
Hi @tgravescs, I believe every point of the behavior you listed is correct and preserved in this PR, since it only affects
…paths

Conflicts:
	core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala
Alright, I just resolved the merge conflict and added
Test build #22549 has started for PR 2232 at commit
Test build #22549 has finished for PR 2232 at commit
Test PASSed.
This looks good to me.
Alright, merging this.
The bulk of this PR is comprised of tests. All changes in functionality are made in `SparkSubmit.scala` (~20 lines).

**SPARK-3319.** There is currently a divergence in behavior when the user passes in additional jars through `--jars` and through setting `spark.jars` in the default properties file. The former will happily resolve the paths (e.g. convert `my.jar` to `file:/absolute/path/to/my.jar`), while the latter does not. We should resolve paths consistently in both cases. This also applies to the following pairs of command line arguments and Spark configs:

- `--jars` ~ `spark.jars`
- `--files` ~ `spark.files` / `spark.yarn.dist.files`
- `--archives` ~ `spark.yarn.dist.archives`
- `--py-files` ~ `spark.submit.pyFiles`

**SPARK-3338.** This PR also fixes the following bug: if the user sets `spark.submit.pyFiles` in their properties file, it does not actually get picked up even if `--py-files` is not set. This is simply because the config is overridden by an empty string.
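The SPARK-3338 fix boils down to a precedence rule: a command-line value should only override the properties-file config when it was actually provided, rather than always overwriting it (possibly with an empty string). A minimal sketch of that rule, with hypothetical names that are not the actual `SparkSubmit` code:

```scala
object MergeSketch {
  // Hypothetical merge rule: let the command-line value win only when it is
  // non-empty; otherwise fall back to the properties-file config. The buggy
  // behavior was equivalent to always taking the command-line value.
  def mergePyFiles(cliPyFiles: Option[String], confPyFiles: Option[String]): Option[String] =
    cliPyFiles.filter(_.nonEmpty).orElse(confPyFiles)

  def main(args: Array[String]): Unit = {
    // The bug would have produced Some("") here; the fix keeps the config.
    println(mergePyFiles(Some(""), Some("deps.zip")))
    println(mergePyFiles(Some("cli.zip"), Some("deps.zip")))
  }
}
```

With this rule, a `spark.submit.pyFiles` entry in the properties file survives when `--py-files` is absent or empty, which is the behavior the PR description asks for.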