- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
Branch 1.6 #18222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
          
     Closed
      
        
      
    
                
     Closed
            
            Branch 1.6 #18222
Conversation
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
    …uration columns I have clearly prefix the two 'Duration' columns in 'Details of Batch' Streaming tab as 'Output Op Duration' and 'Job Duration' Author: Mario Briggs <[email protected]> Author: mariobriggs <[email protected]> Closes #11022 from mariobriggs/spark-12739. (cherry picked from commit e9eb248) Signed-off-by: Shixiong Zhu <[email protected]>
…ld not fail analysis of encoder nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatch nullability, we should pass analysis and add runtime null check. backport #11035 to 1.6 Author: Wenchen Fan <[email protected]> Closes #11042 from cloud-fan/branch-1.6.
minor fix for api link in ml onevsrest Author: Yuhao Yang <[email protected]> Closes #11068 from hhbyyh/onevsrestDoc. (cherry picked from commit c2c956b) Signed-off-by: Xiangrui Meng <[email protected]>
…ot set but timeoutThreshold is defined Check the state Existence before calling get. Author: Shixiong Zhu <[email protected]> Closes #11081 from zsxwing/SPARK-13195. (cherry picked from commit 8e2f296) Signed-off-by: Shixiong Zhu <[email protected]>
Author: Bill Chambers <[email protected]> Closes #11094 from anabranch/dynamic-docs. (cherry picked from commit 66e1383) Signed-off-by: Andrew Or <[email protected]>
There is a bug when we try to grow the buffer, OOM is ignore wrongly (the assert also skipped by JVM), then we try grow the array again, this one will trigger spilling free the current page, the current record we inserted will be invalid. The root cause is that JVM has less free memory than MemoryManager thought, it will OOM when allocate a page without trigger spilling. We should catch the OOM, and acquire memory again to trigger spilling. And also, we could not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing). Author: Davies Liu <[email protected]> Closes #11095 from davies/fix_expand.
…ters with Jackson 2.2.3 Patch to 1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation 2. Use maven antrun to verify the JAR has the renamed classes Being Maven-based, I don't know if the verification phase kicks in on an SBT/jenkins build. It will on a `mvn install` Author: Steve Loughran <[email protected]> Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle. (cherry picked from commit 34d0b70) Signed-off-by: Marcelo Vanzin <[email protected]>
JIRA: https://issues.apache.org/jira/browse/SPARK-10524 Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction. Author: Liang-Chi Hsieh <[email protected]> Author: Liang-Chi Hsieh <[email protected]> Author: Joseph K. Bradley <[email protected]> Closes #8734 from viirya/dt-soft-centroids. (cherry picked from commit 9267bc6) Signed-off-by: Joseph K. Bradley <[email protected]>
… SpecificParquetRecordReaderBase This is a minor followup to #10843 to fix one remaining place where we forgot to use reflective access of TaskAttemptContext methods. Author: Josh Rosen <[email protected]> Closes #11131 from JoshRosen/SPARK-12921-take-2.
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator Author: raela <[email protected]> Closes #11158 from raelawang/master. (cherry picked from commit 719973b) Signed-off-by: Reynold Xin <[email protected]>
…e system besides HDFS jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that. So, it should offer S3 besides HDFS. Can you review it when you have time? Thanks! Author: Yu ISHIKAWA <[email protected]> Closes #11151 from yu-iskw/SPARK-13265. (cherry picked from commit efb65e0) Signed-off-by: Xiangrui Meng <[email protected]>
…n error
Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
cc holdenk
Author: sethah <[email protected]>
Closes #10962 from sethah/SPARK-13047.
(cherry picked from commit b354673)
Signed-off-by: Xiangrui Meng <[email protected]>
    …alue parameter Fix this defect by check default value exist or not. yanboliang Please help to review. Author: Tommy YU <[email protected]> Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue. (cherry picked from commit d3e2e20) Signed-off-by: Xiangrui Meng <[email protected]>
… Windows Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK. Is it worth considering also including this fix in any future 1.5.x releases (if any)? I confirm this is my own original work and license it to the Spark project under its open source license. Author: markpavey <[email protected]> Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix. (cherry picked from commit 374c4b2) Signed-off-by: Sean Owen <[email protected]>
…ailed test JIRA: https://issues.apache.org/jira/browse/SPARK-12363 This issue is pointed by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests will be failed. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work. Author: Liang-Chi Hsieh <[email protected]> Author: Xiangrui Meng <[email protected]> Closes #10539 from viirya/fix-poweriter. (cherry picked from commit e3441e3) Signed-off-by: Xiangrui Meng <[email protected]>
Looks like pygments.rb gem is also required for jekyll build to work. At least on Ubuntu/RHEL I could not do build without this dependency. So added this to steps. Author: Amit Dev <[email protected]> Closes #11180 from amitdev/master. (cherry picked from commit 331293c) Signed-off-by: Sean Owen <[email protected]>
…-guide Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312. This contribution is my original work and I license the work to this project. Author: JeremyNixon <[email protected]> Closes #11199 from JeremyNixon/update_train_val_split_example. (cherry picked from commit adb5483) Signed-off-by: Sean Owen <[email protected]>
There's a small typo in the SparseVector.parse docstring (which says that it returns a DenseVector rather than a SparseVector), which seems to be incorrect. Author: Miles Yucht <[email protected]> Closes #11213 from mgyucht/fix-sparsevector-docs. (cherry picked from commit 827ed1c) Signed-off-by: Sean Owen <[email protected]>
This commit removes an unnecessary duplicate check in addPendingTask that meant that scheduling a task set took time proportional to (# tasks)^2. Author: Sital Kedia <[email protected]> Closes #11175 from sitalkedia/fix_stuck_driver. (cherry picked from commit 1e1e31e) Signed-off-by: Kay Ousterhout <[email protected]>
… default is "python2.7" Author: Christopher C. Aycock <[email protected]> Closes #11239 from chrisaycock/master. (cherry picked from commit a7c74d7) Signed-off-by: Josh Rosen <[email protected]>
…pares Option and String directly. ## What changes were proposed in this pull request? Fix some comparisons between unequal types that cause IJ warnings and in at least one case a likely bug (TaskSetManager) ## How was the this patch tested? Running Jenkins tests Author: Sean Owen <[email protected]> Closes #11253 from srowen/SPARK-13371. (cherry picked from commit 7856253) Signed-off-by: Andrew Or <[email protected]>
A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs. The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms. As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more. Author: Michael Armbrust <[email protected]> Closes #11308 from marmbrus/parquetWriteOOM. (cherry picked from commit 173aa94) Signed-off-by: Michael Armbrust <[email protected]>
…ome special character ## What changes were proposed in this pull request? When there are some special characters (e.g., `"`, `\`) in `label`, DAG will be broken. This patch just escapes `label` to avoid DAG being broken by some special characters ## How was the this patch tested? Jenkins tests Author: Shixiong Zhu <[email protected]> Closes #11309 from zsxwing/SPARK-13298. (cherry picked from commit a11b399) Signed-off-by: Andrew Or <[email protected]>
In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which will start another `SessionState`. This would lead to exception because `processCmd` need to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value would be a instance of `SessionState`. See the exception below. spark-sql> !echo "test"; Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Author: Daoyuan Wang <[email protected]> Closes #9589 from adrian-wang/clicommand. (cherry picked from commit 5d80fac) Signed-off-by: Michael Armbrust <[email protected]> Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
…false) fix for branch-1.6 https://issues.apache.org/jira/browse/SPARK-13359 Author: Earthson Lu <[email protected]> Closes #11237 from Earthson/SPARK-13359.
`GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave Author: Xiangrui Meng <[email protected]> Closes #11226 from mengxr/SPARK-13355. (cherry picked from commit 764ca18) Signed-off-by: Xiangrui Meng <[email protected]>
## What changes were proposed in this pull request?
This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.
This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below.
```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types
schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```
## How was the this patch tested?
Tested using two unit tests in sql/test.py and the DataFrameSuite.
Additional information here : https://issues.apache.org/jira/browse/SPARK-13410
rxin
Author: Franklyn D'souza <[email protected]>
Closes #11333 from damnMeddlingKid/udt-union-patch.
    SPARK-8029 (#9610) modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename this temporary file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle deletion of the temp file. This patch avoids this potential cause of disk-space leaks by adding `finally` blocks to ensure that temp files are always deleted if they haven't been renamed. Author: Josh Rosen <[email protected]> Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer. (cherry picked from commit 5b8f737) Signed-off-by: Josh Rosen <[email protected]>
…ult on double value
## What changes were proposed in this pull request?
Remainder(%) expression's `eval()` returns incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that lose precision.
This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted.
### Before change
```
scala> -5083676433652386516D % 10
res2: Double = -6.0
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+---+
|  a|
+---+
|0.0|
+---+
```
### After change
```
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+----+
|   a|
+----+
|-6.0|
+----+
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <[email protected]>
Closes #15171 from clockfly/SPARK-17617.
(cherry picked from commit 3977223)
Signed-off-by: Wenchen Fan <[email protected]>
    …shed This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen <[email protected]> Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly.
…ng entire job (branch-1.6 backport) This patch is a branch-1.6 backport of #15037: ## What changes were proposed in this pull request? In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring). In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable. Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`. ## How was this patch tested? Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method. I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master). Author: Josh Rosen <[email protected]> Closes #15186 from JoshRosen/SPARK-17485-branch-1.6-backport.
…nousListenerBus (branch 1.6) ## What changes were proposed in this pull request? Backport #15220 to 1.6. ## How was this patch tested? Jenkins Author: Shixiong Zhu <[email protected]> Closes #15226 from zsxwing/SPARK-17649-branch-1.6.
… formats ## What changes were proposed in this pull request? This patch addresses a correctness bug in Spark 1.6.x in where `coalesce()` declares that it can process `UnsafeRows` but mis-declares that it always outputs safe rows. If UnsafeRow and other Row types are compared for equality then we will get spurious `false` comparisons, leading to wrong answers in operators which perform whole-row comparison (such as `distinct()` or `except()`). An example of a query impacted by this bug is given in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-17618). The problem is that the validity of our row format conversion rules depends on operators which handle `unsafeRows` (signalled by overriding `canProcessUnsafeRows`) correctly reporting their output row format (which is done by overriding `outputsUnsafeRows`). In #9024, we overrode `canProcessUnsafeRows` but forgot to override `outputsUnsafeRows`, leading to the incorrect `equals()` comparison. Our interface design is flawed because correctness depends on operators correctly overriding multiple methods this problem could have been prevented by a design which coupled row format methods / metadata into a single method / class so that all three methods had to be overridden at the same time. This patch addresses this issue by adding missing `outputsUnsafeRows` overrides. In order to ensure that bugs in this logic are uncovered sooner, I have modified `UnsafeRow.equals()` to throw an `IllegalArgumentException` if it is called with an object that is not an `UnsafeRow`. ## How was this patch tested? I believe that the stronger misuse-checking in `UnsafeRow.equals()` is sufficient to detect and prevent this class of bug. Author: Josh Rosen <[email protected]> Closes #15185 from JoshRosen/SPARK-17618.
From the original commit message: This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0) Author: Charles Allen <charlesallen-net.com> (cherry picked from commit 2eaeafe) Author: Charles Allen <[email protected]> Closes #15270 from vanzin/SPARK-17696.
…atrix with SparseVector Backport PR of changes relevant to mllib only, but otherwise identical to #15296 jkbradley Author: Bjarne Fruergaard <[email protected]> Closes #15311 from bwahlgreen/bugfix-spark-17721-1.6.
This backports 733cbaa to Branch 1.6. It's a pretty simple patch, and would be nice to have for Spark 1.6.3. Unit tests Author: Burak Yavuz <[email protected]> Closes #15380 from brkyvz/bp-SPARK-15062. Signed-off-by: Michael Armbrust <[email protected]>
## What changes were proposed in this pull request? This is the patch for 1.6. It only adds Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default. ## How was this patch tested? The added test. Author: Shixiong Zhu <[email protected]> Closes #15454 from zsxwing/SPARK-17850-1.6.
…cala-2.11 repl ## What changes were proposed in this pull request? Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" configuration, so user cannot set a fixed port number through "spark.replClassServer.port". ## How was this patch tested? N/A Author: jerryshao <[email protected]> Closes #15253 from jerryshao/SPARK-17678.
…m empty string to interval type ## What changes were proposed in this pull request? This change adds a check in castToInterval method of Cast expression , such that if converted value is null , then isNull variable should be set to true. Earlier, the expression Cast(Literal(), CalendarIntervalType) was throwing NullPointerException because of the above mentioned reason. ## How was this patch tested? Added test case in CastSuite.scala jira entry for detail: https://issues.apache.org/jira/browse/SPARK-17884 Author: prigarg <[email protected]> Closes #15479 from priyankagargnitk/cast_empty_string_bug.
…ld not depends on local timezone ## What changes were proposed in this pull request? Back-port of #13784 to `branch-1.6` ## How was this patch tested? Existing tests. Author: Davies Liu <[email protected]> Closes #15554 from srowen/SPARK-16078.
…executor loss ## What changes were proposed in this pull request? _This is the master branch-1.6 version of #15986; the original description follows:_ This patch fixes a critical resource leak in the TaskScheduler which could cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor with running tasks is permanently lost and the associated stage fails. This problem was originally identified by analyzing the heap dump of a driver belonging to a cluster that had run out of shuffle space. This dump contained several `ShuffleDependency` instances that were retained by `TaskSetManager`s inside the scheduler but were not otherwise referenced. Each of these `TaskSetManager`s was considered a "zombie" but had no running tasks and therefore should have been cleaned up. However, these zombie task sets were still referenced by the `TaskSchedulerImpl.taskIdToTaskSetManager` map. Entries are added to the `taskIdToTaskSetManager` map when tasks are launched and are removed inside of `TaskScheduler.statusUpdate()`, which is invoked by the scheduler backend while processing `StatusUpdate` messages from executors. The problem with this design is that a completely dead executor will never send a `StatusUpdate`. There is [some code](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L338) in `statusUpdate` which handles tasks that exit with the `TaskState.LOST` state (which is supposed to correspond to a task failure triggered by total executor loss), but this state only seems to be used in Mesos fine-grained mode. There doesn't seem to be any code which performs per-task state cleanup for tasks that were running on an executor that completely disappears without sending any sort of final death message. The `executorLost` and [`removeExecutor`](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L527) methods don't appear to perform any cleanup of the `taskId -> *` mappings, causing the leaks observed here. This patch's fix is to maintain a `executorId -> running task id` mapping so that these `taskId -> *` maps can be properly cleaned up following an executor loss. There are some potential corner-case interactions that I'm concerned about here, especially some details in [the comment](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L523) in `removeExecutor`, so I'd appreciate a very careful review of these changes. ## How was this patch tested? I added a new unit test to `TaskSchedulerImplSuite`. /cc kayousterhout and markhamstra, who reviewed #15986. Author: Josh Rosen <[email protected]> Closes #16070 from JoshRosen/fix-leak-following-total-executor-loss-1.6.
…functions No tests done for JDBCRDD#compileFilter. Author: Takeshi YAMAMURO <linguin.m.sgmail.com> Closes #10409 from maropu/AddTestsInJdbcRdd. (cherry picked from commit 8c1b867) Author: Takeshi YAMAMURO <[email protected]> Closes #16124 from dongjoon-hyun/SPARK-12446-BRANCH-1.6.
## What changes were proposed in this pull request? This fix is related to be bug: https://issues.apache.org/jira/browse/SPARK-18372 . The insertIntoHiveTable would generate a .staging directory, but this directory fail to be removed in the end. This is backport from spark 2.0.x code, and is related to PR #12770 ## How was this patch tested? manual tests Author: Mingjie Tang <mtanghortonworks.com> Author: Mingjie Tang <[email protected]> Author: Mingjie Tang <[email protected]> Closes #15819 from merlintang/branch-1.6.
The Hive client library is not smart enough to notice that the current user is a proxy user; so when using a proxy user, it fails to fetch delegation tokens from the metastore because of a missing kerberos TGT for the current user. To fix it, just run the code that fetches the delegation token as the real logged in user. Tested on a kerberos cluster both submitting normally and with a proxy user; Hive and HBase tokens are retrieved correctly in both cases. Author: Marcelo Vanzin <vanzincloudera.com> Closes #11358 from vanzin/SPARK-13478. (cherry picked from commit c7fccb5) Author: Marcelo Vanzin <[email protected]> Closes #16665 from vanzin/SPARK-13478_1.6.
## What changes were proposed in this pull request? This PR backports PR #16866 to branch-1.6 ## How was this patch tested? Existing tests. Author: Cheng Lian <[email protected]> Closes #16917 from liancheng/spark-19529-1.6-backport.
… beyond 64 KB ## What changes were proposed in this pull request? This is a backport pr of #15480 into `branch-1.6`. ## How was this patch tested? Existing tests. Author: Liwei Lin <[email protected]> Closes #17158 from ueshin/issues/SPARK-16845_1.6.
…e` and port cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? This PR proposes to backports #16429 to branch-1.6 so that Python 3.6.0 works with Spark 1.6.x. ## How was this patch tested? Manually, via ``` ./run-tests --python-executables=python3.6 ``` ``` Finished test(python3.6): pyspark.conf (5s) Finished test(python3.6): pyspark.broadcast (7s) Finished test(python3.6): pyspark.accumulators (9s) Finished test(python3.6): pyspark.rdd (16s) Finished test(python3.6): pyspark.shuffle (0s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.profiler (5s) Finished test(python3.6): pyspark.context (21s) Finished test(python3.6): pyspark.ml.clustering (12s) Finished test(python3.6): pyspark.ml.feature (16s) Finished test(python3.6): pyspark.ml.classification (16s) Finished test(python3.6): pyspark.ml.recommendation (16s) Finished test(python3.6): pyspark.ml.tuning (14s) Finished test(python3.6): pyspark.ml.regression (16s) Finished test(python3.6): pyspark.ml.evaluation (12s) Finished test(python3.6): pyspark.ml.tests (17s) Finished test(python3.6): pyspark.mllib.classification (18s) Finished test(python3.6): pyspark.mllib.evaluation (12s) Finished test(python3.6): pyspark.mllib.feature (19s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.fpm (12s) Finished test(python3.6): pyspark.mllib.clustering (31s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.mllib.linalg.distributed (17s) Finished test(python3.6): pyspark.mllib.recommendation (23s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.stat._statistics (13s) Finished test(python3.6): pyspark.mllib.regression (22s) Finished test(python3.6): pyspark.mllib.util (9s) Finished test(python3.6): pyspark.mllib.tree (14s) Finished test(python3.6): pyspark.sql.types (9s) Finished test(python3.6): pyspark.sql.context (16s) Finished test(python3.6): pyspark.sql.column (14s) Finished test(python3.6): pyspark.sql.group (16s) Finished test(python3.6): pyspark.sql.dataframe (25s) Finished test(python3.6): pyspark.tests (164s) Finished test(python3.6): pyspark.sql.window (6s) Finished test(python3.6): pyspark.sql.functions (19s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.readwriter (24s) Finished test(python3.6): pyspark.sql.tests (38s) Finished test(python3.6): pyspark.mllib.tests (133s) Finished test(python3.6): pyspark.streaming.tests (189s) Tests passed in 380 seconds ``` Author: hyukjinkwon <[email protected]> Closes #17375 from HyukjinKwon/SPARK-19019-backport-1.6.
| Can one of the admins verify this patch? | 
| @chenjian2580 please close this PR. | 
  
    Sign up for free
    to join this conversation on GitHub.
    Already have an account?
    Sign in to comment
  
      
  Add this suggestion to a batch that can be applied as a single commit.
  This suggestion is invalid because no changes were made to the code.
  Suggestions cannot be applied while the pull request is closed.
  Suggestions cannot be applied while viewing a subset of changes.
  Only one suggestion per line can be applied in a batch.
  Add this suggestion to a batch that can be applied as a single commit.
  Applying suggestions on deleted lines is not supported.
  You must change the existing code in this line in order to create a valid suggestion.
  Outdated suggestions cannot be applied.
  This suggestion has been applied or marked resolved.
  Suggestions cannot be applied from pending reviews.
  Suggestions cannot be applied on multi-line comments.
  Suggestions cannot be applied while the pull request is queued to merge.
  Suggestion cannot be applied right now. Please check back later.
  
    
  
    
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.