[SPARK-13233][SQL][WIP] Python Dataset #11117

cloud-fan · 2016-02-08T16:32:10Z

Draft prototype; submitting this PR to test it via Jenkins.

TODO:

- The new Dataset map, mapPartitions, etc. conflict with the existing ones (which just forward to RDD). We should remove the old ones, but that would break other code, so for now this PR keeps them and gives the new ones different names such as mapPartitions2.
- mapPartitions is the fundamental function and is enough for the prototype; I'll add map, flatMap, etc. on top of it later.
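
As a rough illustration of why mapPartitions is enough as the base primitive, here is a minimal pure-Python sketch. TinyDataset and all of its method names are hypothetical stand-ins invented for this example; they are not the classes or names used in this PR or in PySpark.

```python
from itertools import chain


class TinyDataset:
    """Hypothetical stand-in for a partitioned dataset (not PySpark's API)."""

    def __init__(self, partitions):
        # Materialize each partition so the sketch can be inspected eagerly.
        self._partitions = [list(p) for p in partitions]

    def map_partitions(self, func):
        # The only primitive: func takes an iterator over one partition
        # and returns an iterable of output elements.
        return TinyDataset(func(iter(p)) for p in self._partitions)

    def map(self, func):
        # map is map_partitions applied element by element.
        return self.map_partitions(lambda it: (func(x) for x in it))

    def flat_map(self, func):
        # flat_map flattens the per-element iterables within each partition.
        return self.map_partitions(
            lambda it: chain.from_iterable(func(x) for x in it))

    def filter(self, pred):
        # filter keeps only the elements for which pred is true.
        return self.map_partitions(lambda it: (x for x in it if pred(x)))

    def collect(self):
        # Concatenate all partitions in order.
        return [x for p in self._partitions for x in p]


ds = TinyDataset([[1, 2], [3]])
doubled = ds.map(lambda x: x * 2).collect()          # [2, 4, 6]
repeated = ds.flat_map(lambda x: [x, x]).collect()   # [1, 1, 2, 2, 3, 3]
odds = ds.filter(lambda x: x % 2 == 1).collect()     # [1, 3]
```

Every derived operation only wraps the user function in a per-partition generator, so partition boundaries are preserved and no extra primitive is needed.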

SparkQA · 2016-02-08T18:22:24Z

Test build #50927 has finished for PR 11117 at commit 4c757f1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PipelinedDataFrame(DataFrame):
- case class PythonMapPartitions(
- case class PythonMapPartitions(

SparkQA · 2016-02-14T08:23:18Z

Test build #51258 has finished for PR 11117 at commit 5282d42.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PipelinedDataFrame(DataFrame):
- case class PythonMapPartitions(
- case class PythonMapPartitions(

SparkQA · 2016-02-14T11:59:20Z

Test build #51262 has finished for PR 11117 at commit 6107495.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class PipelinedDataFrame(DataFrame):
- case class PythonMapPartitions(
- case class PythonMapPartitions(

SparkQA · 2016-02-15T06:47:52Z

Test build #51297 has finished for PR 11117 at commit 15fd836.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-15T07:19:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala

+        ).compute(inputIterator, context.partitionId(), context)
+
+      if (outputIsPickled) {
+        outputIterator.map(bytes => InternalRow(bytes))


To avoid copying the bytes, I create safe rows here. However, according to #10511, operators should always produce unsafe rows. The Python UDF operator (BatchPythonEvaluation) also produces safe rows, which may cause problems as well. Should we bring back the requireUnsafeRow mechanism? In some cases, like this one, converting to unsafe rows is expensive and may not bring much benefit.

cc @davies

davies · 2016-02-15

BatchPythonEvaluation will produce UnsafeRow.

cloud-fan · 2016-02-15

Oh sorry, I missed the unsafe projection at the very end. Then we can probably add an unsafe projection here too.
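
The trade-off discussed in this thread (wrapping the pickled bytes by reference versus copying them into a packed row) can be sketched in plain Python. This is only a conceptual analogy: the function names are hypothetical, and the length-prefixed layout below is a simplification chosen for the example, not Spark's actual UnsafeRow format.

```python
import pickle
import struct


def to_safe_row(payload: bytes) -> tuple:
    # "Safe" row analogy: hold a reference to the existing bytes object.
    # No bytes are copied.
    return (payload,)


def to_packed_row(payload: bytes) -> bytes:
    # "Unsafe" row analogy: copy the payload into one contiguous buffer,
    # prefixed by an 8-byte little-endian length header (assumed layout).
    return struct.pack("<q", len(payload)) + payload


def read_packed_row(row: bytes) -> bytes:
    # Read the length header, then slice out the payload.
    (size,) = struct.unpack_from("<q", row, 0)
    return row[8:8 + size]


pickled = pickle.dumps({"id": 1, "name": "a"})
safe = to_safe_row(pickled)      # shares the original bytes object
packed = to_packed_row(pickled)  # full copy plus an 8-byte header
```

The packed form gives downstream consumers a uniform binary layout at the cost of one copy per row, which is the expense the first comment is worried about.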

SparkQA · 2016-02-15T09:18:39Z

Test build #51300 has finished for PR 11117 at commit 6c26daa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-15T10:51:17Z

Test build #51301 has finished for PR 11117 at commit d96f103.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-15T13:47:34Z

Test build #51308 has finished for PR 11117 at commit 4862fe1.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-16T09:35:34Z

retest this please

SparkQA · 2016-02-16T11:47:28Z

Test build #51354 has finished for PR 11117 at commit 4dfe604.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-17T15:56:39Z

Test build #51433 has finished for PR 11117 at commit 1c8e7b3.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PythonAppendColumns(
- case class PythonMapGroups(
- case class PythonAppendColumns(
- case class PythonMapGroups(

SparkQA · 2016-02-17T19:21:59Z

Test build #51436 has finished for PR 11117 at commit 862288b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PythonAppendColumns(
- case class PythonMapGroups(
- case class PythonAppendColumns(
- case class PythonMapGroups(

SparkQA · 2016-02-17T20:31:25Z

Test build #51438 has finished for PR 11117 at commit e0ca98f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PythonAppendColumns(
- case class PythonMapGroups(
- case class PythonAppendColumns(
- case class PythonMapGroups(

SparkQA · 2016-02-19T05:23:01Z

Test build #51518 has finished for PR 11117 at commit a772492.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-19T08:09:30Z

Test build #51526 has finished for PR 11117 at commit 590308a.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-19T09:10:36Z

Test build #51534 has finished for PR 11117 at commit c883fa6.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-19T14:22:02Z

retest this please

SparkQA · 2016-02-19T16:23:36Z

Test build #51559 has finished for PR 11117 at commit df53348.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-20T09:00:17Z

Test build #51588 has finished for PR 11117 at commit 97dcac2.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-21T07:31:23Z

Test build #51610 has finished for PR 11117 at commit e0e86c2.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-21T08:10:58Z

retest this please

SparkQA · 2016-02-21T08:34:56Z

Test build #51617 has finished for PR 11117 at commit 4783c4c.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-21T11:24:08Z

Test build #51620 has finished for PR 11117 at commit 4c3c2b5.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-21T17:20:30Z

Test build #51638 has finished for PR 11117 at commit 349b119.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-22T04:36:40Z

Test build #51652 has finished for PR 11117 at commit 8c32d31.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-22T08:49:58Z

retest this please

SparkQA · 2016-02-22T23:57:05Z

Test build #51680 has finished for PR 11117 at commit aec6fc4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-23T03:08:26Z

retest this please

SparkQA · 2016-02-23T05:21:59Z

Test build #51720 has finished for PR 11117 at commit 1095d7f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan force-pushed the python-ds branch from 4c757f1 to 5282d42 Compare February 14, 2016 07:18

python dataset

6107495

cloud-fan force-pushed the python-ds branch from 5282d42 to 6107495 Compare February 14, 2016 09:51

code cleanup

15fd836

cloud-fan added 2 commits February 15, 2016 14:57

scala side cleanup

a0a0dd6

fix style

6c26daa

cloud-fan reviewed Feb 15, 2016

produce unsafe rows

d96f103

infer schema

4dfe604

cloud-fan force-pushed the python-ds branch from 4862fe1 to 4dfe604 Compare February 16, 2016 00:43

cloud-fan force-pushed the python-ds branch from 1c8e7b3 to 862288b Compare February 17, 2016 17:25

aggregate

e0ca98f

cloud-fan force-pushed the python-ds branch from 862288b to e0ca98f Compare February 17, 2016 18:34

cloud-fan added 2 commits February 19, 2016 11:31

improve aggregate

da77adc

fix style

a772492

cloud-fan added 2 commits February 19, 2016 14:21

add pivot

590308a

some more tests

c883fa6

minor fix

df53348

add import

97dcac2

cloud-fan force-pushed the python-ds branch from e0e86c2 to 4783c4c Compare February 21, 2016 07:48

cloud-fan force-pushed the python-ds branch from 4783c4c to 4c3c2b5 Compare February 21, 2016 09:18

fix python 3

349b119

cloud-fan force-pushed the python-ds branch from 4c3c2b5 to 349b119 Compare February 21, 2016 15:13

small fix

8c32d31

update

aec6fc4

small cleanup

1095d7f

cloud-fan mentioned this pull request Feb 24, 2016

[SPARK-13233][SQL] Python Dataset (basic version) #11347

Closed

cloud-fan closed this Feb 24, 2016
