forked from apache/spark
[HOPSWORKS-1081] Upgrade Spark to 2.4.3 #11
Merged: tkakantousis merged 263 commits into logicalclocks:branch-2.4 from kai-chi:HOPSWORKS-1081 on Aug 6, 2019
Conversation
## What changes were proposed in this pull request?
I saw CoarseGrainedSchedulerBackendSuite fail in my PR and finally reproduced the following error on a very busy machine:
```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 400 times over 10.009828643999999 seconds. Last failure message: ArrayBuffer("2", "0", "3") had length 3 instead of expected length 4.
```
The logs in this test show that executor 1 was not up when the test failed.
```
18/10/30 11:34:03.563 dispatcher-event-loop-12 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.2:43656) with ID 2
18/10/30 11:34:03.593 dispatcher-event-loop-3 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.2:43658) with ID 3
18/10/30 11:34:03.629 dispatcher-event-loop-6 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.2:43654) with ID 0
18/10/30 11:34:03.885 pool-1-thread-1-ScalaTest-running-CoarseGrainedSchedulerBackendSuite INFO CoarseGrainedSchedulerBackendSuite:
===== FINISHED o.a.s.scheduler.CoarseGrainedSchedulerBackendSuite: 'compute max number of concurrent tasks can be launched' =====
```
And the following logs from executor 1 show that it was still initializing when the timeout happened (at 18/10/30 11:34:03.885).
```
18/10/30 11:34:03.463 netty-rpc-connection-0 INFO TransportClientFactory: Successfully created connection to 54b6b6217301/172.17.0.2:33741 after 37 ms (0 ms spent in bootstraps)
18/10/30 11:34:03.959 main INFO DiskBlockManager: Created local directory at /home/jenkins/workspace/core/target/tmp/spark-383518bc-53bd-4d9c-885b-d881f03875bf/executor-61c406e4-178f-40a6-ac2c-7314ee6fb142/blockmgr-03fb84a1-eedc-4055-8743-682eb3ac5c67
18/10/30 11:34:03.993 main INFO MemoryStore: MemoryStore started with capacity 546.3 MB
```
Hence, I think our current 10 seconds is not enough on a slow Jenkins machine. This PR just increases the timeout from 10 seconds to 60 seconds to make the test more stable.
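For reference, the kind of change involved is bumping the `eventually` timeout in the ScalaTest suite. A minimal sketch of the pattern (not the exact Spark test code; `registeredExecutorIds` is a hypothetical placeholder):
```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Seconds, Span}

// Placeholder for whatever the suite actually inspects (hypothetical here).
def registeredExecutorIds: Seq[String] = Seq("0", "1", "2", "3")

// Give a slow Jenkins machine up to 60 seconds, instead of 10, for the
// condition to become true before failing the test.
eventually(timeout(Span(60, Seconds))) {
  assert(registeredExecutorIds.length == 4)
}
```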
## How was this patch tested?
Jenkins
Closes apache#22910 from zsxwing/fix-flaky-test.
Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 6be3cce)
Signed-off-by: gatorsmile <[email protected]>
…cleaning up stages
## What changes were proposed in this pull request?
* Update the `AppStatusListener` `cleanupStages` method to remove tasks for those stages in a single pass instead of one pass per stage.
* This fixes an issue where the `cleanupStages` method would get backed up, causing a backup in the executor in `ElementTrackingStore` and resulting in stages and jobs not getting cleaned up properly.

Tasks seem most susceptible to this as there are a lot of them; however, a similar issue could arise in other locations where the `KVStore` `view` method is used. A broader fix might involve updates to `KVStoreView` and `InMemoryView`, as it appears this interface and implementation can lead to multiple and inefficient traversals of the stored data.
## How was this patch tested?
Using existing tests in AppStatusListenerSuite.

This is my original work and I license the work to the project under the project's open source license.

Closes apache#22883 from patrickbrownsync/cleanup-stages-fix.
Authored-by: Patrick Brown <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit e9d3ca0)
Signed-off-by: Marcelo Vanzin <[email protected]>
## What changes were proposed in this pull request?
Unfortunately, it seems that we missed this in 2.4.0. In Spark 2.4, if the default file system is not the local file system, `LOAD DATA LOCAL INPATH` only works with absolute paths. This PR aims to fix it to support relative paths as well. This is a regression introduced in 2.4.0.
```scala
$ ls kv1.txt
kv1.txt
scala> spark.sql("LOAD DATA LOCAL INPATH 'kv1.txt' INTO TABLE t")
org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: kv1.txt;
```
## How was this patch tested?
Pass the Jenkins
Closes apache#22927 from dongjoon-hyun/SPARK-LOAD.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e91b607)
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?
Clarify documentation about security.
## How was this patch tested?
None, just documentation.

Closes apache#22852 from tgravescs/SPARK-25023.
Authored-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
(cherry picked from commit c00186f)
Signed-off-by: Thomas Graves <[email protected]>

## What changes were proposed in this pull request?
Propose changing the documentation to state that there are 4, not 3, cluster managers available.
## How was this patch tested?
This is a docs-only patch and doesn't need any new testing beyond the normal CI process for Spark.

Closes apache#22922 from jameslamb/bugfix/cluster_docs.
Authored-by: James Lamb <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit c71db43)
Signed-off-by: Sean Owen <[email protected]>

…e buffers
## What changes were proposed in this pull request?
Avoid converting encrypted blocks to regular ByteBuffers, to ensure they can be sent over the network for replication & remote reads even when > 2GB. Also updates some TODOs with links to SPARK-25905 for improving the handling here.
## How was this patch tested?
Tested on a cluster with encrypted data > 2GB (after SPARK-25904 was applied as well).

Closes apache#22917 from squito/real_SPARK-25827.
Authored-by: Imran Rashid <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit 7ea594e)
Signed-off-by: Marcelo Vanzin <[email protected]>

…ation.md
## What changes were proposed in this pull request?
Change ptats.Stats() to pstats.Stats() for `spark.python.profile.dump` in configuration.md.
## How was this patch tested?
Doc test.

Closes apache#22933 from AlexHagerman/doc_fix.
Authored-by: Alex Hagerman <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 1a7abf3)
Signed-off-by: Sean Owen <[email protected]>

## What changes were proposed in this pull request?
- The issue is described in detail in [SPARK-25930](https://issues.apache.org/jira/browse/SPARK-25930). Since we rely on the standard output, always pick the last line, which contains the wanted value. Although minor, the current implementation breaks tests.
## How was this patch tested?
Manually: rm -rf ~/.m2 and then run the tests.

Closes apache#22931 from skonto/fix_scala_detection.
Authored-by: Stavros Kontopoulos <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 1fb3759)
Signed-off-by: Sean Owen <[email protected]>

## What changes were proposed in this pull request?
Fix typos and misspellings, per apache/spark-website#158 (comment).
## How was this patch tested?
Existing tests.

Closes apache#22950 from srowen/Typos.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit c0d1bf0)
Signed-off-by: Sean Owen <[email protected]>
…-shell
## What changes were proposed in this pull request?
This PR aims to document the `-I` option from Spark 2.4.x (previously the `-i` option, until Spark 2.3.x).
After we upgraded Scala to 2.11.12, the `-i` option (`:load`) was replaced by `-I` (SI-7898). The existing `-i` now behaves like `:paste`, which does not respect Spark's implicit imports (for instance `toDF`, symbols as columns, etc.). Therefore, the `-i` option does not work correctly from Spark 2.4.x, and it is not documented.
I checked the other Scala REPL options, but from quick tests they do not look applicable or do not work. This PR only targets documenting `-I` for now.
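To illustrate the difference, a file preloaded with `-I` is interpreted line by line and can therefore use the shell's implicit imports. The file name and contents below are made up for this example:
```scala
// init.scala (hypothetical example): spark-shell already provides `spark` and
// imports spark.implicits._, so toDF and the $"col" syntax work here when the
// file is preloaded with `./bin/spark-shell -I init.scala`.
val people = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")
people.filter($"id" > 1).show()
```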
## How was this patch tested?
Manually tested.
**Mac:**
```bash
$ ./bin/spark-shell --help
Usage: ./bin/spark-shell [options]
Scala REPL options:
-I <file> preload <file>, enforcing line-by-line interpretation
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
...
```
**Windows:**
```cmd
C:\...\spark>.\bin\spark-shell --help
Usage: .\bin\spark-shell.cmd [options]
Scala REPL options:
-I <file> preload <file>, enforcing line-by-line interpretation
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
...
```
Closes apache#22919 from HyukjinKwon/SPARK-25906.
Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit cc38abc)
Signed-off-by: hyukjinkwon <[email protected]>
…hang because of blacklisting
## What changes were proposed in this pull request?
Every time a task is unschedulable because of the condition where the number of task failures < the number of executors available, we currently abort the taskSet, failing the job. This change tries to acquire new executors so that we can complete the job successfully. We try to acquire a new executor only when we can kill an existing idle executor. We fall back to the older implementation, where we abort the job if we cannot find an idle executor.
## How was this patch tested?
I performed some manual tests to check and validate the behavior.
```scala
val rdd = sc.parallelize(Seq(1 to 10), 3)
import org.apache.spark.TaskContext
val mapped = rdd.mapPartitionsWithIndex { (index, iterator) =>
  if (index == 2) {
    Thread.sleep(30 * 1000)
    val attemptNum = TaskContext.get.attemptNumber
    if (attemptNum < 3) throw new Exception("Fail for blacklisting")
  }
  iterator.toList.map(x => x + " -> " + index).iterator
}
mapped.collect
```
Closes apache#22288 from dhruve/bug/SPARK-22148.
Lead-authored-by: Dhruve Ashar <[email protected]>
Co-authored-by: Dhruve Ashar <[email protected]>
Co-authored-by: Tom Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
(cherry picked from commit fdd3bac)
Signed-off-by: Thomas Graves <[email protected]>
## What changes were proposed in this pull request?
When we added the `distanceMeasure`, we didn't update the `formatVersion` for `KMeans`. Although this is not a big issue, as that information is used nowhere, we are returning wrong information.
## How was this patch tested?
NA

Closes apache#22873 from mgaido91/SPARK-25866.
Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6b42587)
Signed-off-by: Wenchen Fan <[email protected]>

## What changes were proposed in this pull request?
Update known_translations after running `translate-contributors.py` during the 2.4.0 release.
## How was this patch tested?
N/A

Closes apache#22949 from cloud-fan/contributors.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit a241a15)
Signed-off-by: gatorsmile <[email protected]>
This reverts commit a75571b.
JVMs can't allocate arrays of length exactly Int.MaxValue, so ensure we never try to allocate an array that big. This commit changes some defaults & configs to gracefully fall back to something that doesn't require one large array in some cases; in other cases it simply improves the error message for cases which will still fail.

Closes apache#22818 from squito/SPARK-25827.
Authored-by: Imran Rashid <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
(cherry picked from commit 8fbc183)
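A minimal sketch of the kind of guard this implies; the constant and names here are illustrative assumptions, not Spark's actual values:
```scala
// Illustrative only: JVMs reserve some header words, so the practical limit for
// array lengths is slightly below Int.MaxValue. The exact margin is assumed here.
object ArrayLimits {
  val MaxSafeArrayLength: Long = Int.MaxValue - 8L // assumption for illustration

  def allocateBytes(requested: Long): Array[Byte] = {
    require(requested <= MaxSafeArrayLength,
      s"Cannot allocate a single array of $requested bytes " +
        s"(limit ~$MaxSafeArrayLength); fall back to chunked buffers instead.")
    new Array[Byte](requested.toInt)
  }
}
```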
## What changes were proposed in this pull request?
Now that Spark 2.4.0 is released, we should test it in HiveExternalCatalogVersionsSuite.
## How was this patch tested?
N/A

Closes apache#22984 from cloud-fan/minor.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 973f7c0)
Signed-off-by: Dongjoon Hyun <[email protected]>

… names in Analyzer
## What changes were proposed in this pull request?
When queries do not use column names with the same case, users might hit various errors. Below is a typical test failure they can hit.
```
Expected only partition pruning predicates: ArrayBuffer(isnotnull(tdate#237), (cast(tdate#237 as string) >= 2017-08-15));
org.apache.spark.sql.AnalysisException: Expected only partition pruning predicates: ArrayBuffer(isnotnull(tdate#237), (cast(tdate#237 as string) >= 2017-08-15));
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:146)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:560)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925)
```
## How was this patch tested?
Added two test cases.

Closes apache#22990 from gatorsmile/fix1283.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 657fd00)
Signed-off-by: gatorsmile <[email protected]>
…eference
## What changes were proposed in this pull request?
Very minor parser bug, but possibly problematic for code-generated queries:
Consider the following two queries:
```
SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1
```
and
```
SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1
```
The former, with parens around the OVER condition, fails to parse while the latter, without parens, succeeds:
```
Error in SQL statement: ParseException:
mismatched input '(' expecting {<EOF>, ',', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19)
== SQL ==
SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1
-------------------^^^
```
This was found when running the CockroachDB tests.
I also tried PostgreSQL; the SQL with parentheses works there as well.
## How was this patch tested?
Unit test
Closes apache#22987 from gengliangwang/windowParentheses.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 1db7997)
Signed-off-by: gatorsmile <[email protected]>
…a to be 2.3.0
## What changes were proposed in this pull request?
Although it's a little late, we should still update mima for branch 2.4 to avoid future breaking changes. Note that, when merging, we should forward-port it to the master branch, so that the excluding rules are still in `v24excludes`.

TODO: update the release process document to mention the mima update.
## How was this patch tested?
N/A

Closes apache#23015 from cloud-fan/mima-2.4.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

## What changes were proposed in this pull request?
Changes in the vignette only, to disable eval.
## How was this patch tested?
Jenkins

Author: Felix Cheung <[email protected]>
Closes apache#23007 from felixcheung/rjavavervig.
(cherry picked from commit 88c8262)
Signed-off-by: Felix Cheung <[email protected]>

…t while python worker reuse
## What changes were proposed in this pull request?
Running a barrier job after a normal Spark job causes the barrier job to run without a BarrierTaskContext. This is because, with Python worker reuse, BarrierTaskContext._getOrCreate() still returns a TaskContext after a normal Spark job has been submitted first, so we get an `AttributeError: 'TaskContext' object has no attribute 'barrier'`. Fix this by adding check logic in BarrierTaskContext._getOrCreate() and making sure it returns a BarrierTaskContext in this scenario.
## How was this patch tested?
Added a new UT in pyspark-core.

Closes apache#22962 from xuanyuanking/SPARK-25921.
Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c00e72f)
Signed-off-by: Wenchen Fan <[email protected]>
…eTopicDeletionSuite
## What changes were proposed in this pull request?
As initializing lazy vals shares the same lock, a thread trying to initialize `executedPlan` while `isRDD` is running will hang forever. This PR just materializes `executedPlan` so that accessing it while `toRdd` is running doesn't need to wait for a lock.
## How was this patch tested?
Jenkins

Closes apache#23023 from zsxwing/SPARK-26042.
Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
(cherry picked from commit 4035c98)
Signed-off-by: Shixiong Zhu <[email protected]>
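As background for why this helps: Scala lazy vals in the same object are initialized under the same monitor, so one slow initializer blocks access to the others. A generic illustration (not Spark code):
```scala
// Generic illustration: both lazy vals below synchronize on the same object
// monitor while initializing, so a thread touching `quick` must wait for
// another thread that is still initializing `slow`.
class Holder {
  lazy val slow: Int = { Thread.sleep(5000); 1 }
  lazy val quick: Int = 2
}

object LazyLockDemo extends App {
  val h = new Holder
  new Thread(() => { h.slow; () }).start() // starts initializing `slow`, holds the lock
  Thread.sleep(100)                        // let the first thread take the lock
  println(h.quick)                         // blocks until `slow` finishes initializing
}
```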
…due lack of access to tmpDir from $PWD to HDFS
WriteAheadLogBackedBlockRDD's usage of java.io.tmpdir will fail if $PWD resolves to a folder in HDFS and the Spark YARN cluster job does not have the correct access to this folder in regards to the dummy folder. So this patch provides an option, spark.streaming.receiver.blockStore.tmpdir, to override java.io.tmpdir, which is set from $PWD in YARN cluster mode.
## What changes were proposed in this pull request?
This change provides an option to override java.io.tmpdir so that when $PWD is resolved in YARN cluster mode, Spark does not attempt to use that folder and instead uses the folder provided with the following option: spark.streaming.receiver.blockStore.tmpdir
## How was this patch tested?
The patch was manually tested on a Spark Streaming job with write-ahead logs in cluster mode.

Closes apache#22867 from gss2002/SPARK-25778.
Authored-by: gss2002 <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit 2b671e7)
Signed-off-by: Marcelo Vanzin <[email protected]>
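Under this change, the new key would presumably be set like any other Spark configuration entry. A sketch under that assumption (the application name and path are made up):
```scala
import org.apache.spark.SparkConf

// Sketch, assuming the new key is read as an ordinary Spark conf: point the
// receiver block store at a node-local directory instead of whatever
// java.io.tmpdir / $PWD resolves to in YARN cluster mode.
val conf = new SparkConf()
  .setAppName("streaming-wal-app")
  .set("spark.streaming.receiver.blockStore.tmpdir", "/data/tmp/spark-receiver")
```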
## What changes were proposed in this pull request?
In SPARK-24865 `AnalysisBarrier` was removed and, in order to improve resolution speed, the `analyzed` flag was (re-)introduced so that only plans which are not yet analyzed are processed. This should not be the case when performing attribute deduplication: in that case we need to transform also the plans which were already analyzed, otherwise we can miss rewriting some attributes, leading to invalid plans.
## How was this patch tested?
Added UT.

Closes apache#23035 from mgaido91/SPARK-26057.
Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b46f75a)
Signed-off-by: Wenchen Fan <[email protected]>

… resource does not honor "spark.jars.packages"
SparkSubmit determines a pyspark app by the suffix of the primary resource, but Livy uses "spark-internal" as the primary resource when calling spark-submit, therefore args.isPython is set to false in SparkSubmit.scala. In YARN mode, the SparkSubmit module is responsible for resolving maven coordinates and adding them to "spark.submit.pyFiles" so that python's system path can be set correctly.

The fix is to resolve maven coordinates not only when args.isPython is true, but also when the primary resource is spark-internal.

Tested the patch with Livy submitting a pyspark app, spark-submit, and pyspark with or without the packages config.

Signed-off-by: Shanyu Zhao <shzhaomicrosoft.com>
Closes apache#23009 from shanyu/shanyu-26011.
Authored-by: Shanyu Zhao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 9a5fda6)
Signed-off-by: Sean Owen <[email protected]>

…from_avro`/`to_avro`
Back-port of apache#22890 to branch-2.4. It is a bug fix for this issue: https://issues.apache.org/jira/browse/SPARK-26063
## What changes were proposed in this pull request?
Previously in from_avro/to_avro, we overrode the methods `simpleString` and `sql` for the string output. However, the override only affects the alias naming:
```
Project [from_avro('col, ... , (mode,PERMISSIVE)) AS from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))logicalclocks#11]
```
It only makes the alias name quite long: `from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))`. We should follow `from_csv`/`from_json` here and override only the method prettyName, which gives a clean alias name:
```
... AS from_avro(col)logicalclocks#11
```
## How was this patch tested?
Manual check.

Closes apache#23047 from gengliangwang/backport_avro_pretty_name.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
…ad of `SparkEnv.get.conf.get(SQLConf.RADIX_SORT_ENABLED)`
## What changes were proposed in this pull request?
This is a follow-up of apache#20393. We should read the conf `"spark.sql.sort.enableRadixSort"` from `SQLConf` instead of `SparkConf`, i.e., use `SQLConf.get.enableRadixSort` instead of `SparkEnv.get.conf.get(SQLConf.RADIX_SORT_ENABLED)`, otherwise the config is never read.
## How was this patch tested?
Existing tests.

Closes apache#23046 from ueshin/issues/SPARK-23207/conf.
Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit dad2d82)
Signed-off-by: Wenchen Fan <[email protected]>

## What changes were proposed in this pull request?
Highlights specific security issues to be aware of with Spark on K8S and recommends K8S mechanisms that should be used to secure clusters.
## How was this patch tested?
N/A - Documentation only

CC felixcheung tgravescs skonto

Closes apache#23013 from rvesse/SPARK-25023.
Authored-by: Rob Vesse <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 2aef79a)
Signed-off-by: Sean Owen <[email protected]>

## What changes were proposed in this pull request?
Don't propagate SPARK_CONF_DIR to the driver in mesos cluster mode.
## How was this patch tested?
I built the 2.3.2 tag with this patch added and deployed a test job to a mesos cluster to confirm that the incorrect SPARK_CONF_DIR was no longer passed from the submit command.

Closes apache#22937 from mpmolek/fix-conf-dir.
Authored-by: Matt Molek <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 696b75a)
Signed-off-by: Sean Owen <[email protected]>

…rkListenerExecutorMetricsUpdate (backport 2.4)
## What changes were proposed in this pull request?
This PR backports apache#24303 to 2.4.
## How was this patch tested?
Jenkins

Closes apache#24328 from zsxwing/SPARK-27394-2.4.
Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Shixiong Zhu <[email protected]>
… learn about the finished partitions" This reverts commit db86ccb.
…ion wit…
## What changes were proposed in this pull request?
The upper bound of the group-by output row count is the product of the distinct counts of the group-by columns. However, a column with only null values will cause the output row count to be 0, which is incorrect. For example:
col1 (distinct: 2, rowCount 2), col2 (distinct: 0, rowCount 2) => group by col1, col2
Actual: output rows: 0
Expected: output rows: 2
## How was this patch tested?
A unit test has been added accordingly, plus a manual test has been done in our tpcds benchmark environment.

Closes apache#24286 from pengbo/master.
Lead-authored-by: pengbo <[email protected]>
Co-authored-by: mingbo_pb <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit c58a4fe)
Signed-off-by: Dongjoon Hyun <[email protected]>

Pass partitionBy columns as options and feature-flag this behavior.

A new unit test.

Closes apache#24365 from liwensun/partitionby.
Authored-by: liwensun <[email protected]>
Signed-off-by: Tathagata Das <[email protected]>
(cherry picked from commit 26ed65f)
Signed-off-by: Tathagata Das <[email protected]>

## What changes were proposed in this pull request?
The API docs should not include the "org.apache.spark.util.kvstore" package because it contains internal private APIs. See the doc link: https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/kvstore/LevelDB.html
## How was this patch tested?
N/A

Closes apache#24386 from gatorsmile/rmDoc.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 61feb16)
Signed-off-by: gatorsmile <[email protected]>
…s with new stats or None
## What changes were proposed in this pull request?
The system shall update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature, where statistics are automatically computed by default if that feature is enabled. Reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev

As part of the fix, the autoSizeUpdateEnabled validation is done up front, so that the system will calculate the table size for the user automatically and record it in the metastore as the user expects.
## How was this patch tested?
A UT is written and manually verified on a cluster; tested with unit tests plus some internal tests on a real cluster. (Before/after screenshots are attached to the original PR.)

Closes apache#24315 from sujith71955/master_autoupdate.
Authored-by: s71955 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 239082d)
Signed-off-by: Dongjoon Hyun <[email protected]>
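A quick way to exercise the behavior described above from a spark-shell session; this is a sketch, and the table name and values are made up:
```scala
// Sketch: with auto-update enabled, an insert should refresh the stored table
// size in the metastore without an explicit ANALYZE TABLE. Table name is made up.
spark.sql("SET spark.sql.statistics.size.autoUpdate.enabled=true")
spark.sql("CREATE TABLE demo_stats(id INT) USING parquet")
spark.sql("INSERT INTO demo_stats VALUES (1), (2), (3)")
spark.sql("DESC EXTENDED demo_stats").show(truncate = false) // Statistics row should reflect the new size
```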
## What changes were proposed in this pull request?
This backports a tiny part of another change: apache@4bdfda9#diff-3c792ce7265b69b448a984caf629c96bR161 ... which just works around the possibility that the local python interpreter is 'python3' or 'python2' when running the spark-submit tests. I'd like to backport to 2.3 too. This otherwise prevents the test from passing on my mac, though I have a custom install with brew; it may affect others as well.
## How was this patch tested?
Existing tests.

Closes apache#24407 from srowen/Python23check.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

## What changes were proposed in this pull request?
Have jenkins test against python3.6 (instead of 3.4).
## How was this patch tested?
Extensive testing on both the centos and ubuntu jenkins workers revealed that 2.4 doesn't like python 3.6... :(

NOTE: this is just for branch-2.4 PLEASE DO NOT MERGE

Closes apache#24379 from shaneknapp/update-python-executable.
Authored-by: shane knapp <[email protected]>
Signed-off-by: shane knapp <[email protected]>

## What changes were proposed in this pull request?
This backports: apache@ab1650d apache@7857c6d which collectively update Jackson to 2.9.8.
## How was this patch tested?
Existing tests.

Closes apache#24418 from srowen/SPARK-24601.2.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?
When a fatal error (such as StackOverflowError) is thrown from "receiveAndReply", we should try our best to notify the sender. Otherwise, the sender will hang until timeout. In addition, when a MessageLoop is dying unexpectedly, it should resubmit a new one so that Dispatcher keeps working.
## How was this patch tested?
New unit tests.

Closes apache#24396 from zsxwing/SPARK-27496.
Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 009059e)
Signed-off-by: Dongjoon Hyun <[email protected]>

…Interval change to migration guide
Add a note about the spark.executor.heartbeatInterval change to the migration guide. See also apache#24329.

N/A

Closes apache#24432 from srowen/SPARK-27419.2.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d4a16f4)
Signed-off-by: Wenchen Fan <[email protected]>

…st 1.9.3
## What changes were proposed in this pull request?
Unify commons-beanutils deps to latest 1.9.3. Backport of apache#24378.
## How was this patch tested?
Existing tests.

Closes apache#24433 from srowen/SPARK-27469.2.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…h column containing null values
## What changes were proposed in this pull request?
This PR is a follow-up of apache#24286. As gatorsmile pointed out, a column with null values is inaccurate as well.
```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name        key
data_type       int
comment         NULL
min             1
max             2
num_nulls       1
distinct_count  2
```
The distinct count should be distinct_count + 1 when the column contains a null value.
## How was this patch tested?
Existing tests & new UT added.

Closes apache#24436 from pengbo/aggregation_estimation.
Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d9b2ce0)
Signed-off-by: Dongjoon Hyun <[email protected]>
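The adjustment described above amounts to counting the null as one extra grouping value. A small illustrative sketch, not the actual estimator code:
```scala
// Illustrative sketch of the estimate described above (not Spark's actual code):
// a column with nulls contributes distinctCount + 1 grouping values, so the
// example above yields 2 + 1 = 3 instead of 2.
def groupingValueCount(distinctCount: BigInt, nullCount: BigInt): BigInt =
  if (nullCount > 0) distinctCount + 1 else distinctCount

// Upper bound on group-by output rows: product over the grouping columns,
// given (distinctCount, nullCount) pairs.
def estimatedGroupByRows(columns: Seq[(BigInt, BigInt)]): BigInt =
  columns.map { case (distinct, nulls) => groupingValueCount(distinct, nulls) }.product
```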
…k on Scala-2.12 build
## What changes were proposed in this pull request?
Since [SPARK-27274](https://issues.apache.org/jira/browse/SPARK-27274) deprecated Scala-2.11 at Spark 2.4.1, we need to test Scala-2.12 more. This PR aims to fix the Python test script on the Scala-2.12 build in `branch-2.4`.

**BEFORE**
```
$ dev/change-scala-version.sh 2.12
$ build/sbt -Pscala-2.12 package
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
Traceback (most recent call last):
  File "python/run-tests.py", line 70, in <module>
    raise Exception("Cannot find assembly build directory, please build Spark first.")
Exception: Cannot find assembly build directory, please build Spark first.
```
**AFTER**
```
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-sql']
Starting test(python2.7): pyspark.sql.tests
...
```
## How was this patch tested?
Manually follow the above procedure, because Jenkins doesn't test Scala-2.12 in `branch-2.4`.

Closes apache#24439 from dongjoon-hyun/SPARK-27544.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

… `kafka-0-8` profile for Scala-2.12
## What changes were proposed in this pull request?
Since SPARK-27274 deprecated Scala-2.11 at Spark 2.4.1, we need to test Scala-2.12 more. Since Kafka 0.8 doesn't have Scala-2.12 artifacts, e.g., `org.apache.kafka:kafka_2.12:jar:0.8.2.1`, this PR aims to fix the `test-dependencies.sh` script to understand the Scala binary version.
```
$ dev/change-scala-version.sh 2.12
$ dev/test-dependencies.sh
Using `mvn` from path: /usr/local/bin/mvn
Using `mvn` from path: /usr/local/bin/mvn
Performing Maven install for hadoop-2.6
Using `mvn` from path: /usr/local/bin/mvn
[ERROR] Failed to execute goal on project spark-streaming-kafka-0-8_2.12: Could not resolve dependencies for project org.apache.spark:spark-streaming-kafka-0-8_2.12:jar:spark-335572: Could not find artifact org.apache.kafka:kafka_2.12:jar:0.8.2.1 in central (https://repo.maven.apache.org/maven2) -> [Help 1]
```
## How was this patch tested?
Manually run `dev/change-scala-version.sh 2.12` and `dev/test-dependencies.sh`. The script should show a `DO NOT MATCH` message instead of a Maven `[ERROR]`.
```
$ dev/test-dependencies.sh
Using `mvn` from path: /usr/local/bin/mvn
...
Generating dependency manifest for hadoop-3.1
Using `mvn` from path: /usr/local/bin/mvn
Spark's published dependencies DO NOT MATCH the manifest file (dev/spark-deps).
To update the manifest file, run './dev/test-dependencies.sh --replace-manifest'.
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/pr-deps/spark-deps-hadoop-2.6
...
```
Closes apache#24445 from dongjoon-hyun/SPARK-27550.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

…nsSuite
## What changes were proposed in this pull request?
Update `HiveExternalCatalogVersionsSuite` to test 2.4.2, as 2.4.1 will be removed from the Mirror Network soon.
## How was this patch tested?
N/A

Closes apache#24452 from cloud-fan/release.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b7f9830)
Signed-off-by: Wenchen Fan <[email protected]>

## What changes were proposed in this pull request?
Right now the Kafka source v2 doesn't support null values. The issue is in org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow, which doesn't handle null values.
## How was this patch tested?
Added new unit tests.

Closes apache#24441 from uncleGen/SPARK-27494.
Authored-by: uncleGen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d2656aa)
Signed-off-by: Wenchen Fan <[email protected]>
…in HiveExternalCatalogVersionsSuite
## What changes were proposed in this pull request?
We can get the latest downloadable Spark versions from https://dist.apache.org/repos/dist/release/spark/
## How was this patch tested?
Manually.

Closes apache#24454 from cloud-fan/test.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

…ackendSuite
## What changes were proposed in this pull request?
The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi-threaded access to the mock task scheduler. For details check [Mockito FAQ (occasional exceptions like: WrongTypeOfReturnValue)](https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe). So instead of mocking the task scheduler in the test, TaskSchedulerImpl is simply subclassed.

This multithreaded access of the `nodeBlacklist()` method comes from:
1) the unit test thread, via calling the method `prepareRequestExecutors()`
2) `DriverEndpoint.onStart`, which runs a periodic task that ends up calling this method
## How was this patch tested?
Existing unit test.

(cherry picked from commit e4e4e2b)
Closes apache#24474 from attilapiros/SPARK-26891-branch-2.4.
Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…mons-crypto.
The commons-crypto library does some questionable error handling internally, which can lead to JVM crashes if some call into native code fails and cleans up state it should not. While the library is not fixed, this change adds some workarounds in Spark code so that when an error is detected in the commons-crypto side, Spark avoids calling into the library further.

Tested with existing and added unit tests.

Closes apache#24476 from vanzin/SPARK-25535-2.4.
Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
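A generic sketch of the "stop calling into the library after the first failure" workaround pattern; the names below are illustrative, not Spark's actual classes:
```scala
import java.util.concurrent.atomic.AtomicBoolean

// Generic illustration of the workaround pattern; CryptoGuard and withCrypto
// are made-up names, not Spark's actual classes.
object CryptoGuard {
  private val broken = new AtomicBoolean(false)

  def withCrypto[T](op: => T): T = {
    if (broken.get()) {
      throw new IllegalStateException(
        "commons-crypto disabled in this JVM after an earlier native failure")
    }
    try op catch {
      case e: Throwable =>
        broken.set(true) // never call back into the native library from this JVM
        throw e
    }
  }
}
```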
… count
This PR consists of the `test` components of apache#23665 only, minus the associated patch from that PR. It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior.

This PR is intended to be deployed alongside apache#23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745).

Manual testing, existing `JsonSuite` unit tests.

Closes apache#23674 from sumitsu/json_emptyline_count_test.
Authored-by: Branden Smith <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 63bced9)
Signed-off-by: Dongjoon Hyun <[email protected]>

…H in Hive UDAF adapter
## What changes were proposed in this pull request?
This is a followup of apache#24144. apache#24144 missed one case: when hash aggregate falls back to sort aggregate, the life cycle of the UDAF is: INIT -> UPDATE -> MERGE -> FINISH. However, not all Hive UDAFs can support it. A Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). The buffer for UPDATE may not support MERGE.

This PR updates the Hive UDAF adapter in Spark to support INIT -> UPDATE -> MERGE -> FINISH by turning it into INIT -> UPDATE -> FINISH plus INIT -> MERGE -> FINISH.
## How was this patch tested?
A new test case.

Closes apache#24459 from cloud-fan/hive-udaf.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 7432e7d)
Signed-off-by: Wenchen Fan <[email protected]>

…2.9.8
## What changes were proposed in this pull request?
This reverts commit 6f394a2.

In general, we need to be very cautious about Jackson upgrades in patch releases, especially when the upgrade could break the existing behaviors of external packages or data sources and generate different results after the upgrade. The external packages and data sources would need to change their source code to keep the original behaviors. The upgrade requires more discussion before releasing it, I think.

In the previous PR apache#22071, we turned off `spark.master.rest.enabled` by default and added the following claim in our security doc:
> The Rest Submission Server and the MesosClusterDispatcher do not support authentication. You should ensure that all network access to the REST API & MesosClusterDispatcher (port 6066 and 7077 respectively by default) are restricted to hosts that are trusted to submit jobs.

We need to understand whether this Jackson CVE applies to Spark. Before officially releasing it, we need more inputs from all of you. Currently, I would suggest reverting this upgrade from the upcoming 2.4.3 release, which is trying to fix the accidental default Scala version changes in pre-built artifacts.
## How was this patch tested?
N/A

Closes apache#24493 from gatorsmile/revert24418.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Updating LC/spark to apache/spark v2.4.3.
How was this patch tested?