Tnachen coarse multiple executors #5

hermansc · 2015-10-24T13:40:06Z

Fixed conflicts so it can be merged with master. Let me know if I messed up. The conflict was in core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala and the executorTerminated function.

I implemented toString for AssociationRules.Rule, format like `[x, y] => {z}: 1.0` Author: y-shimizu <[email protected]> Closes apache#8904 from y-shimizu/master.

…s not working https://issues.apache.org/jira/browse/SPARK-10741 I choose the second approach: do not change output exprIds when convert MetastoreRelation to LogicalRelation Author: Wenchen Fan <[email protected]> Closes apache#8889 from cloud-fan/hot-bug.

…rom a local list of java beans Similar to SPARK-10630 it would be nice if Java users didn't have to parallelize there data explicitly (as Scala users already can skip). Issue came up in http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work Author: Holden Karau <[email protected]> Closes apache#8879 from holdenk/SPARK-10720-add-a-java-wrapper-to-create-a-dataframe-from-a-local-list-of-java-beans.

Author: Bin Wang <[email protected]> Closes apache#8898 from wb14123/doc.

seperate -> separate sees -> see Author: David Martin <[email protected]> Closes apache#8928 from dmartinpro/patch-1.

While this is likely not a huge issue for real production systems, for test systems which may setup a Spark Context and tear it down and stand up a Spark Context with a different master (e.g. some local mode & some yarn mode) tests this cane be an issue. Discovered during work on spark-testing-base on Spark 1.4.1, but seems like the logic that triggers it is present in master (see SparkHadoopUtil object). A valid work around for users encountering this issue is to fork a different JVM, however this can be heavy weight. ``` [info] SampleMiniClusterTest: [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED *** [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163) [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257) [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561) [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) [info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186) [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103) ``` Author: Holden Karau <[email protected]> Closes apache#8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.

…nsolidate the codes This bug is introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092), `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead will meet the problem as mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790). Also consolidate and simplify some similar code snippets to keep the consistent semantics. Author: jerryshao <[email protected]> Closes apache#8910 from jerryshao/SPARK-10790.

Please refer to [SPARK-10395] [1] for details. [1]: https://issues.apache.org/jira/browse/SPARK-10395 Author: Cheng Lian <[email protected]> Closes apache#8553 from liancheng/spark-10395/simplify-parquet-read-support.

The UTF8String may come from UnsafeRow, then underline buffer of it is not copied, so we should clone it in order to hold it in Stats. cc yhuai Author: Davies Liu <[email protected]> Closes apache#8929 from davies/pushdown_string.

In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps means that permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree. The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that. Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way. The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing. Author: Sean Owen <[email protected]> Closes apache#8919 from srowen/SPARK-10833.

jira: https://issues.apache.org/jira/browse/SPARK-10670 In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md This JIRA is just for spark.ml, not spark.mllib Author: Yuhao Yang <[email protected]> Closes apache#8901 from hhbyyh/docAPI.

…AllocationSuite Fix the following issues in StandaloneDynamicAllocationSuite: 1. It should not assume master and workers start in order 2. It should not assume master and workers get ready at once 3. It should not assume the application is already registered with master after creating SparkContext 4. It should not access Master.app and idToApp which are not thread safe The changes includes: * Use `eventually` to wait until master and workers are ready to fix 1 and 2 * Use `eventually` to wait until the application is registered with master to fix 3 * Use `askWithRetry[MasterStateResponse](RequestMasterState)` to get the application info to fix 4 Author: zsxwing <[email protected]> Closes apache#8914 from zsxwing/fix-StandaloneDynamicAllocationSuite.

Author: Ryan Williams <[email protected]> Closes apache#8939 from ryan-williams/errmsg.

…PySpark API These are CSS/JavaScript changes changes to make navigation in the PySpark API a bit simpler by adding the following to the sidebar: * Classes * Functions * Tags to highlight experimental features ![screen shot 2015-09-02 at 08 50 12](https://cloud.githubusercontent.com/assets/11915197/9634781/301f853a-518b-11e5-8d5c-fda202f6202f.png) Online example here: https://dl.dropboxusercontent.com/u/20821334/pyspark-api-nav-enhance/pyspark.mllib.html (The contribution is my original work and that I license the work to the project under the project's open source license) Author: noelsmith <[email protected]> Closes apache#8571 from noel-smith/pyspark-api-nav-enhance.

Add method to easily convert a StatCounter instance into a Python dict https://issues.apache.org/jira/browse/SPARK-6919 Note: This is my original work and the existing Spark license applies. Author: Erik Shilts <[email protected]> Closes apache#5516 from eshilts/statcounter-asdict.

Documentation for dropDuplicates() and drop_duplicates() is one and the same. Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases. Author: asokadiggs <[email protected]> Closes apache#8930 from asokadiggs/jira-10782.

When reading Parquet string and binary-backed decimal values, Parquet `Binary.getBytes` always returns a copied byte array, which is unnecessary. Since the underlying implementation of `Binary` values there is guaranteed to be `ByteArraySliceBackedBinary`, and Parquet itself never reuses underlying byte arrays, we can use `Binary.toByteBuffer.array()` to steal the underlying byte arrays without copying them. This brings performance benefits when scanning Parquet string and binary-backed decimal columns. Note that, this trick doesn't cover binary-backed decimals with precision greater than 18. My micro-benchmark result is that, this brings a ~15% performance boost for scanning TPC-DS `store_sales` table (scale factor 15). Another minor optimization done in this PR is that, now we directly construct a Java `BigDecimal` in `Decimal.toJavaBigDecimal` without constructing a Scala `BigDecimal` first. This brings another ~5% performance gain. Author: Cheng Lian <[email protected]> Closes apache#8907 from liancheng/spark-10811/eliminate-array-copying.

For some implicit dataset, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string. Author: Yanbo Liang <[email protected]> Closes apache#8937 from yanboliang/spark-10736.

…rface. This PR implements a HyperLogLog based Approximate Count Distinct function using the new UDAF interface. The implementation is inspired by the ClearSpring HyperLogLog implementation and should produce the same results. There is still some documentation and testing left to do. cc yhuai Author: Herman van Hovell <[email protected]> Closes apache#8362 from hvanhovell/SPARK-9741.

…cluster mode) The YARN backend doesn't like when user code calls System.exit, since it cannot know the exit status and thus cannot set an appropriate final status for the application. This PR remove the usage of system.exit to exit the RRunner. Instead, when the R process running an SparkR script returns an exit code other than 0, throws SparkUserAppException which will be caught by ApplicationMaster and ApplicationMaster knows it failed. For other failures, throws SparkException. Author: Sun Rui <[email protected]> Closes apache#8938 from sun-rui/SPARK-10851.

…n InternalRow rather than external Row. Author: Reynold Xin <[email protected]> Closes apache#8900 from rxin/SPARK-10770-1.

This is an implementation of Hive's `json_tuple` function using Jackson Streaming. Author: Nathan Howell <[email protected]> Closes apache#7946 from NathanHowell/SPARK-9617.

Created method as.data.frame as a synonym for collect(). Author: Oscar D. Lara Yejas <[email protected]> Author: olarayej <[email protected]> Author: Oscar D. Lara Yejas <[email protected]> Closes apache#8908 from olarayej/SPARK-10807.

…Suite Fixed the test failure here: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/ This failure is because `HeartbeatReceiverSuite. heartbeatReceiver` may receive `SparkListenerExecutorAdded("driver")` sent from [LocalBackend](https://github.com/apache/spark/blob/8fb3a65cbb714120d612e58ef9d12b0521a83260/core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala#L121). There are other race conditions in `HeartbeatReceiverSuite` because `HeartbeatReceiver.onExecutorAdded` and `HeartbeatReceiver.onExecutorRemoved` are asynchronous. This PR also fixed them. Author: zsxwing <[email protected]> Closes apache#8946 from zsxwing/SPARK-10058.

… returns long instead of the Double type Floor & Ceiling function should returns Long type, rather than Double. Verified with MySQL & Hive. Author: Cheng Hao <[email protected]> Closes apache#8933 from chenghao-intel/ceiling.

…ve UDFs Takes over apache#8800 Author: Wenchen Fan <[email protected]> Closes apache#8941 from cloud-fan/hive-udf.

We introduced SQL option `spark.sql.parquet.followParquetFormatSpec` while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec to write Parquet files. This option defaults to `false` and is marked as a non-public option (`isPublic = false`) because we haven't finished refactored Parquet write path. The problem is, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to `spark.sql.parquet.writeLegacyFormat`, and invert its default value (the two option names have opposite meanings). Although this option is private in 1.5, we'll make it public in 1.6 after refactoring Parquet write path. So that users can decide whether to write Parquet files in standard format or legacy format. Author: Cheng Lian <[email protected]> Closes apache#8566 from liancheng/spark-10400/deprecate-follow-parquet-format-spec.

The utilities such as Substring#substringBinarySQL and BinaryPrefixComparator#computePrefix for binary data are put together in ByteArray for easy-to-read. Author: Takeshi YAMAMURO <[email protected]> Closes apache#8122 from maropu/CleanUpForBinaryType.

Document CrossValidatorModel members: bestModel and avgMetrics Author: Rerngvit Yanggratoke <[email protected]> Closes apache#8882 from rerngvit/Spark-9798.

JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890). I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly. Author: Xusen Yin <[email protected]> Closes apache#5779 from yinxusen/SPARK-5890.

Correct the logic to return `HDFSCacheTaskLocation` instance when the input `str` is a in memory location. Author: zhichao.li <[email protected]> Closes apache#9096 from zhichao-li/uselessBranch.

SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise even when SparkContext is reinitialized, `sparkRSQL.init` returns the stale copy of the object and complains: ```r sc <- sparkR.init("local") sqlContext <- sparkRSQL.init(sc) sparkR.stop() sc <- sparkR.init("local") sqlContext <- sparkRSQL.init(sc) sqlContext ``` producing ```r Error in callJMethod(x, "getClass") : Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed. ``` I have added the check and removal only when SparkContext itself is initialized. I have also added corresponding test for this fix. Let me know if you want me to move the test to SQL test suite instead. p.s. I tried lint-r but ended up a lots of errors on existing code. Author: Forest Fang <[email protected]> Closes apache#9205 from saurfang/sparkR.stop.

There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together. Author: Josh Rosen <[email protected]> Closes apache#8829 from JoshRosen/consolidate-sort-shuffle-implementations.

address comments in apache#9184 Author: Wenchen Fan <[email protected]> Closes apache#9212 from cloud-fan/encoder.

… send won't be interrupted The current `NettyRpcEndpointRef.send` can be interrupted because it uses `LinkedBlockingQueue.put`, which may hang the application. Image the following execution order: | thread 1: TaskRunner.kill | thread 2: TaskRunner.run ------------- | ------------- | ------------- 1 | killed = true | 2 | | if (killed) { 3 | | throw new TaskKilledException 4 | | case _: TaskKilledException _: InterruptedException if task.killed => 5 | task.kill(interruptThread): interruptThread is true | 6 | | execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled)) 7 | | localEndpoint.send(StatusUpdate(taskId, state, serializedData)): in LocalBackend Then `localEndpoint.send(StatusUpdate(taskId, state, serializedData))` will throw `InterruptedException`. This will prevent the executor from updating the task status and hang the application. An failure caused by the above issue here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44062/consoleFull Since `receivers` is an unbounded `LinkedBlockingQueue`, we can just use `LinkedBlockingQueue.offer` to resolve this issue. Author: zsxwing <[email protected]> Closes apache#9198 from zsxwing/dont-interrupt-send.

This commit removes unnecessary calls to addPendingTask in TaskSetManager.executorLost. These calls are unnecessary: for tasks that are still pending and haven't been launched, they're still in all of the correct pending lists, so calling addPendingTask has no effect. For tasks that are currently running (which may still be in the pending lists, depending on how they were scheduled), we call addPendingTask in handleFailedTask, so the calls at the beginning of executorLost are redundant. I think these calls are left over from when we re-computed the locality levels in addPendingTask; now that we call recomputeLocality separately, I don't think these are necessary. Now that those calls are removed, the readding parameter in addPendingTask is no longer necessary, so this commit also removes that parameter. markhamstra can you take a look at this? cc vanzin Author: Kay Ousterhout <[email protected]> Closes apache#9154 from kayousterhout/SPARK-11163.

…rtition schema for HadoopFsRelation To enable the unit test of `hadoopFsRelationSuite.Partition column type casting`. It previously threw exception like below, as we treat the auto infer partition schema with higher priority than the user specified one. ``` java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62) at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212) at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 in stage 3.0 (TID 206) java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62) at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212) at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ``` Author: Cheng Hao <[email protected]> Closes apache#8026 from chenghao-intel/partition_discovery.

…is documented incorrectly Minor fix on the comment Author: guoxi <[email protected]> Closes apache#9201 from xguo27/SPARK-11242.

*This PR adds a new experimental API to Spark, tentitively named Datasets.* A `Dataset` is a strongly-typed collection of objects that can be transformed in parallel using functional or relational operations. Example usage is as follows: ### Functional ```scala > val ds: Dataset[Int] = Seq(1, 2, 3).toDS() > ds.filter(_ % 1 == 0).collect() res1: Array[Int] = Array(1, 2, 3) ``` ### Relational ```scala scala> ds.toDF().show() +-----+ |value| +-----+ | 1| | 2| | 3| +-----+ > ds.select(expr("value + 1").as[Int]).collect() res11: Array[Int] = Array(2, 3, 4) ``` ## Comparison to RDDs A `Dataset` differs from an `RDD` in the following ways: - The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be used to serialize the object into a binary format. Encoders are also capable of mapping the schema of a given object to the Spark SQL type system. In contrast, RDDs rely on runtime reflection based serialization. - Internally, a `Dataset` is represented by a Catalyst logical plan and the data is stored in the encoded form. This representation allows for additional logical operations and enables many operations (sorting, shuffling, etc.) to be performed without deserializing to an object. A `Dataset` can be converted to an `RDD` by calling the `.rdd` method. ## Comparison to DataFrames A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into specific Dataset by calling `df.as[ElementType]`. Similarly you can transform a strongly-typed `Dataset` to a generic DataFrame by calling `ds.toDF()`. ## Implementation Status and TODOs This is a rough cut at the least controversial parts of the API. The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API. The following is being deferred to future PRs: - Joins and Aggregations (prototype here apache@f11f91e) - Support for Java Additionally, the responsibility for binding an encoder to a given schema is currently done in a fairly ad-hoc fashion. This is an internal detail, and what we are doing today works for the cases we care about. However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding as we do in DataFrames). ## COMPATIBILITY NOTE Long term we plan to make `DataFrame` extend `Dataset[Row]`. However, making this change to che class hierarchy would break the function signatures for the existing function operations (map, flatMap, etc). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6. Author: Michael Armbrust <[email protected]> Closes apache#9190 from marmbrus/dataset-infra.

WIP Author: Gábor Lipták <[email protected]> Closes apache#8323 from gliptak/SPARK-7021.

``` // My machine only has 8 cores $ bin/spark-shell --master local[32] scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b") scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count() Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351) ``` Author: Andrew Or <[email protected]> Closes apache#9209 from andrewor14/fix-local-page-size.

…ve the message disorder issue The current NettyRpc has a message order issue because it uses a thread pool to send messages. E.g., running the following two lines in the same thread, ``` ref.send("A") ref.send("B") ``` The remote endpoint may see "B" before "A" because sending "A" and "B" are in parallel. To resolve this issue, this PR added an outbox for each connection, and if we are connecting to the remote node when sending messages, just cache the sending messages in the outbox and send them one by one when the connection is established. Author: zsxwing <[email protected]> Closes apache#9197 from zsxwing/rpc-outbox.

This test can take a little while to finish on slow / loaded machines. Author: Marcelo Vanzin <[email protected]> Closes apache#9235 from vanzin/SPARK-11134.

Author: Jacek Laskowski <[email protected]> Closes apache#9230 from jaceklaskowski/utils-seconds-typo.

…util package Author: Reynold Xin <[email protected]> Closes apache#9239 from rxin/types-private.

Removed typo on line 8 in markdown : "Received" -> "Receiver" Author: Rohan Bhanderi <[email protected]> Closes apache#9242 from RohanBhanderi/patch-1.

For nested StructType, the underline buffer could be used for others before, we should zero out the padding bytes for those primitive types that have less than 8 bytes. cc cloud-fan Author: Davies Liu <[email protected]> Closes apache#9217 from davies/zero_out.

A POC code for making example code in user guide testable. mengxr We still need to talk about the labels in code. Author: Xusen Yin <[email protected]> Closes apache#9109 from yinxusen/SPARK-10382.

…b.regression Author: Yu ISHIKAWA <[email protected]> Closes apache#8684 from yu-iskw/SPARK-10277.

This is a PR for Parquet-based model import/export. * Added save/load for ChiSqSelectorModel * Updated the test suite ChiSqSelectorSuite Author: Jayant Shekar <[email protected]> Closes apache#6785 from jayantshekhar/SPARK-6723.

This adds API for reading and writing text files, similar to SparkContext.textFile and RDD.saveAsTextFile. ``` SQLContext.read.text("/path/to/something.txt") DataFrame.write.text("/path/to/write.txt") ``` Using the new Dataset API, this also supports ``` val ds: Dataset[String] = SQLContext.read.text("/path/to/something.txt").as[String] ``` Author: Reynold Xin <[email protected]> Closes apache#9240 from rxin/SPARK-11274.

…IsolatedClientLoader. https://issues.apache.org/jira/browse/SPARK-11194 Author: Yin Huai <[email protected]> Closes apache#9170 from yhuai/SPARK-11194.

Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes. The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395). BTW, [envrionment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate R shell on the local host. For your information, PYSPARK has two environment variables serving simliar purpose: PYSPARK_PYTHON Python binary executable to use for PySpark in both driver and workers (default is `python`). PYSPARK_DRIVER_PYTHON Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON). pySpark use the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the python executable for a python script. Author: Sun Rui <[email protected]> Closes apache#9179 from sun-rui/SPARK-10971.

Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable. Several text issues: ![image](https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png) - text collapsed into a single paragraph - text truncated at 2 places, eg. "overwrite: Existing data is expected to be overwritten by the contents of error:" shivaram Author: felixcheung <[email protected]> Closes apache#9261 from felixcheung/rdocreadwritedf.

…tho… …ut building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set This is the exception after this patch. Please help review. ``` java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:412) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.util.Utils$.classForName(Utils.scala:173) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 21 more Failed to load hive class. You need to build Spark with -Phive and -Phive-thriftserver. ``` Author: Jeff Zhang <[email protected]> Closes apache#9134 from zjffdu/SPARK-11125.

…iple_executors

## What changes were proposed in this pull request? This PR aims to optimize GroupExpressions by removing repeating expressions. `RemoveRepetitionFromGroupExpressions` is added. **Before** ```scala scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain() == Physical Plan == WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)apache#7,(A#0 + 1)apache#8,(1 + A#0)apache#9], functions=[], output=[(a + 1)#5]) : +- INPUT +- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)apache#7, (A#0 + 1)apache#8, (1 + A#0)apache#9, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)apache#7,(A#0 + 1) AS (A#0 + 1)apache#8,(1 + A#0) AS (1 + A#0)apache#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)apache#7,(A#0 + 1)apache#8,(1 + A#0)apache#9]) : +- INPUT +- LocalTableScan [a#0], [[1],[2]] ``` **After** ```scala scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain() == Physical Plan == WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5]) : +- INPUT +- Exchange hashpartitioning((a#0 + 1)#6, 200), None +- WholeStageCodegen : +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6]) : +- INPUT +- LocalTableScan [a#0], [[1],[2]] ``` ## How was this patch tested? Pass the Jenkins tests (with a new testcase) Author: Dongjoon Hyun <[email protected]> Closes apache#12590 from dongjoon-hyun/SPARK-14830.

## What changes were proposed in this pull request? Implements `eval()` method for expression `AssertNotNull` so that we can convert local projection on LocalRelation to another LocalRelation. ### Before change: ``` scala> import org.apache.spark.sql.catalyst.dsl.expressions._ scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull scala> import org.apache.spark.sql.Column scala> case class A(a: Int) scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain java.lang.UnsupportedOperationException: Only code-generated evaluation is supported. at org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850) ... ``` ### After the change: ``` scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain(true) == Parsed Logical Plan == 'Project [assertnotnull('_1) AS assertnotnull(_1)#5] +- LocalRelation [_1#2, _2#3] == Analyzed Logical Plan == assertnotnull(_1): struct<a:int> Project [assertnotnull(_1#2) AS assertnotnull(_1)#5] +- LocalRelation [_1#2, _2#3] == Optimized Logical Plan == LocalRelation [assertnotnull(_1)#5] == Physical Plan == LocalTableScan [assertnotnull(_1)#5] ``` ## How was this patch tested? Unit test. Author: Sean Zhong <[email protected]> Closes apache#14486 from clockfly/assertnotnull_eval.

y-shimizu and others added 30 commits September 27, 2015 16:36

[SPARK-10778] [MLLIB] Implement toString for AssociationRules.Rule

299b439

I implemented toString for AssociationRules.Rule, format like `[x, y] => {z}: 1.0` Author: y-shimizu <[email protected]> Closes apache#8904 from y-shimizu/master.

add doc for spark.streaming.stopGracefullyOnShutdown

fb4c7be

Author: Bin Wang <[email protected]> Closes apache#8898 from wb14123/doc.

Fix two mistakes in programming-guide page

b582499

seperate -> separate sees -> see Author: David Martin <[email protected]> Closes apache#8928 from dmartinpro/patch-1.

[SPARK-10395] [SQL] Simplifies CatalystReadSupport

14978b7

Please refer to [SPARK-10395] [1] for details. [1]: https://issues.apache.org/jira/browse/SPARK-10395 Author: Cheng Lian <[email protected]> Closes apache#8553 from liancheng/spark-10395/simplify-parquet-read-support.

[SPARK-10871] include number of executor failures in error msg

b7ad54e

Author: Ryan Williams <[email protected]> Closes apache#8939 from ryan-williams/errmsg.

[SPARK-10770] [SQL] SparkPlan.executeCollect/executeTake should retur…

03cca5d

…n InternalRow rather than external Row. Author: Reynold Xin <[email protected]> Closes apache#8900 from rxin/SPARK-10770-1.

[SPARK-9617] [SQL] Implement json_tuple

89ea004

This is an implementation of Hive's `json_tuple` function using Jackson Streaming. Author: Nathan Howell <[email protected]> Closes apache#7946 from NathanHowell/SPARK-9617.

[SPARK-10671] [SQL] Throws an analysis exception if we cannot find Hi…

02026a8

…ve UDFs Takes over apache#8800 Author: Wenchen Fan <[email protected]> Closes apache#8941 from cloud-fan/hive-udf.

[SPARK-9798] [ML] CrossValidatorModel Documentation Improvements

2a71782

Document CrossValidatorModel members: bestModel and avgMetrics Author: Rerngvit Yanggratoke <[email protected]> Closes apache#8882 from rerngvit/Spark-9798.

zhichao-li and others added 26 commits October 22, 2015 03:59

[SPARK-11121][CORE] Correct the TaskLocation type

c03b6d1

Correct the logic to return `HDFSCacheTaskLocation` instance when the input `str` is a in memory location. Author: zhichao.li <[email protected]> Closes apache#9096 from zhichao-li/uselessBranch.

[SPARK-11216][SQL][FOLLOW-UP] add encoder/decoder for external row

42d225f

address comments in apache#9184 Author: Wenchen Fan <[email protected]> Closes apache#9212 from cloud-fan/encoder.

[SPARK-11242][SQL] In conf/spark-env.sh.template SPARK_DRIVER_MEMORY …

188ea34

…is documented incorrectly Minor fix on the comment Author: guoxi <[email protected]> Closes apache#9201 from xguo27/SPARK-11242.

[SPARK-7021] Add JUnit output for Python unit tests

163d53e

WIP Author: Gábor Lipták <[email protected]> Closes apache#8323 from gliptak/SPARK-7021.

[SPARK-11134][CORE] Increase LauncherBackendSuite timeout.

fa6a4fb

This test can take a little while to finish on slow / loaded machines. Author: Marcelo Vanzin <[email protected]> Closes apache#9235 from vanzin/SPARK-11134.

Fix a (very tiny) typo

b1c1597

Author: Jacek Laskowski <[email protected]> Closes apache#9230 from jaceklaskowski/utils-seconds-typo.

[SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.…

cdea017

…util package Author: Reynold Xin <[email protected]> Closes apache#9239 from rxin/types-private.

Fix typo "Received" to "Receiver" in streaming-kafka-integration.md

16dc9f3

Removed typo on line 8 in markdown : "Received" -> "Receiver" Author: Rohan Bhanderi <[email protected]> Closes apache#9242 from RohanBhanderi/patch-1.

[SPARK-10382] Make example code in user guide testable

03ccb22

A POC code for making example code in user guide testable. mengxr We still need to talk about the labels in code. Author: Xusen Yin <[email protected]> Closes apache#9109 from yinxusen/SPARK-10382.

[SPARK-10277] [MLLIB] [PYSPARK] Add @SInCE annotation to pyspark.mlli…

282a15f

…b.regression Author: Yu ISHIKAWA <[email protected]> Closes apache#8684 from yu-iskw/SPARK-10277.

[SPARK-11194] [SQL] Use MutableURLClassLoader for the classLoader in …

4725cb9

…IsolatedClientLoader. https://issues.apache.org/jira/browse/SPARK-11194 Author: Yin Huai <[email protected]> Closes apache#9170 from yhuai/SPARK-11194.

Merge remote-tracking branch 'origin/master' into tnachen_coarse_mult…

c74c14a

…iple_executors

hermansc closed this Oct 24, 2015

hermansc deleted the tnachen_coarse_multiple_executors branch October 24, 2015 14:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tnachen coarse multiple executors #5

Tnachen coarse multiple executors #5

Uh oh!

hermansc commented Oct 24, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

78 participants

Tnachen coarse multiple executors #5

Tnachen coarse multiple executors #5

Uh oh!

Conversation

hermansc commented Oct 24, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

78 participants