
Conversation

@koeninger
Contributor

…ed KafkaRDD methods. Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.
@srowen
Member

srowen commented Jun 4, 2015

Jenkins, add to whitelist

@srowen
Member

srowen commented Jun 4, 2015

ok to test

@srowen
Member

srowen commented Jun 4, 2015

At a glance this makes sense to me. Let's see what tests say.

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34181 has finished for PR 6632 at commit c3768c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I thought about it, and I have a little concern about this. What if someone creates a KafkaRDD with wrong offset ranges, which do not exist? In the current state, count will fail, which is the correct thing to do. However, with this patch, it will return a count which is technically incorrect, rather than fail. It may be a good idea to validate the limits of the offset ranges by actually querying Kafka, to verify that they exist before returning the count. And do it just once.
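For context, the optimization under discussion answers count() from offset-range metadata alone, which is also why an invalid range would silently produce a wrong count. A minimal sketch of the idea (class and method names here are illustrative stand-ins, not the actual Spark API):

```scala
// Minimal stand-in for Spark's offset-range metadata (illustrative, not
// the real org.apache.spark.streaming.kafka.OffsetRange class).
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

object CountFromRanges {
  // count() can be computed from metadata alone, without reading messages;
  // a nonexistent range therefore yields a number instead of a failure
  // unless the ranges are validated against Kafka first.
  def count(ranges: Seq[OffsetRange]): Long = ranges.map(_.count).sum

  def main(args: Array[String]): Unit = {
    val ranges = Seq(OffsetRange("t", 0, 0L, 100L), OffsetRange("t", 1, 50L, 75L))
    println(count(ranges)) // 125
  }
}
```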

Contributor Author

Now checking offset ranges in the createRdd method
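A hedged sketch of what such a check at creation time can look like: each requested range is compared against the earliest/latest offsets fetched once from Kafka (the fetch itself is elided, and all names here are illustrative, not the actual KafkaUtils code):

```scala
// Illustrative range check at RDD-creation time; `earliest`/`latest`
// stand in for leader offsets fetched once from Kafka.
case class Requested(from: Long, until: Long)

object RangeCheck {
  def validate(r: Requested, earliest: Long, latest: Long): Either[String, Requested] =
    if (r.from > r.until || r.from < earliest || r.until > latest)
      Left(s"offset range ${r.from}..${r.until} not within $earliest..$latest")
    else
      Right(r)

  def main(args: Array[String]): Unit = {
    println(validate(Requested(10L, 20L), 0L, 100L))  // valid: Right(...)
    println(validate(Requested(10L, 200L), 0L, 100L)) // out of bounds: Left(...)
  }
}
```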

@tdas
Contributor

tdas commented Jun 4, 2015

Can you make a JIRA for this and add it to the title of the PR (see other PRs' formatting)? The titles of the JIRA and PR could also be a little more obvious: KafkaRDD optimizations for count() and take()

@tdas
Contributor

tdas commented Jun 4, 2015

Other than that this looks very promising.

Contributor

nonEmptyPartitions

@koeninger koeninger changed the title [Streaming][Kafka] Take advantage of offset range info for size-relat… [Streaming][Kafka][SPARK-8127] KafkaRDD optimize count() take() isEmpty() Jun 5, 2015
@SparkQA

SparkQA commented Jun 5, 2015

Test build #34323 has finished for PR 6632 at commit 8974b9e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 6, 2015

Test build #34348 has finished for PR 6632 at commit 253031d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koeninger
Contributor Author

@tdas is there anything else you feel needs to be done on this?

Contributor

@koeninger I know this is probably a good Scala way, but this is kinda hard to read with the nesting. Could you take the for-yield and put it in a separate variable, and then check for errors?
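A sketch of the refactor being asked for: bind the for-yield to its own variable, then do the error check as a separate step (the names and the validation logic here are illustrative, not the actual code under review):

```scala
object ForYieldRefactor {
  // Each range either validates to a size or produces an error message.
  def check(from: Long, until: Long): Either[String, Long] =
    if (until >= from) Right(until - from) else Left(s"invalid range $from..$until")

  def totalSize(offsets: Seq[(Long, Long)]): Either[String, Long] = {
    // Step 1: the for-yield, pulled out into a named value so it reads flat.
    val checked: Seq[Either[String, Long]] =
      for ((from, until) <- offsets) yield check(from, until)
    // Step 2: the error check, separated from the comprehension.
    checked.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None      => Right(checked.collect { case Right(n) => n }.sum)
    }
  }

  def main(args: Array[String]): Unit = {
    println(totalSize(Seq((0L, 5L), (3L, 10L)))) // Right(12)
    println(totalSize(Seq((5L, 0L))))            // Left(invalid range 5..0)
  }
}
```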

@tdas
Contributor

tdas commented Jun 19, 2015

I think it's almost good to go. A few minor points.

Contributor

nit: extra space

Contributor

There is no check of whether isEmpty is successful.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35310 has finished for PR 6632 at commit f68bd32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // class ParentClass(parentField: Int)
    • // class ChildClass(childField: Int) extends ParentClass(1)
    • // If the class type corresponding to current slot has writeObject() defined,
    • // then its not obvious which fields of the class will be serialized as the writeObject()
    • abstract class GeneratedClass
    • case class Bin(child: Expression)
    • case class Md5(child: Expression)

Contributor

What does this check? Shouldn't it check that rdd.take(1) === "the" // whatever is expected

Contributor Author

It's asserting that the item taken from the rdd is a member of the set of messages sent.

On Fri, Jun 19, 2015 at 4:07 PM, Tathagata Das [email protected]
wrote:

In
external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaRDDSuite.scala
#6632 (comment):

@@ -68,6 +68,21 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll {

     val received = rdd.map(_._2).collect.toSet
     assert(received === messages)
+    // size-related method optimizations return sane results
+    assert(rdd.count === messages.size)
+    assert(rdd.countApprox(0).getFinalValue.mean === messages.size)
+    assert(! rdd.isEmpty)
+    assert(rdd.take(1).size === 1)
+    assert(messages(rdd.take(1).head._2))


Contributor

Shouldn't the test be stronger, so that it checks that take returns the expected message from the right offset and not just any of the messages? Basically, if there is a bug in the code where take(1) returns the last message in the offset range rather than the first message, it won't be caught.
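The stronger assertion being suggested can be sketched like this, with a plain collection standing in for the RDD (the messages and names are illustrative, not the actual KafkaRDDSuite data):

```scala
object TakeOrderCheck {
  def main(args: Array[String]): Unit = {
    val sent = Seq("the", "quick", "brown", "fox") // produced in offset order
    val taken = sent.take(1)                       // stand-in for rdd.take(1)
    // A membership check alone would also pass if take(1) returned "fox";
    // comparing against the first sent message catches that ordering bug.
    assert(taken.size == 1)
    assert(taken.head == sent.head)
    println(taken.head) // the
  }
}
```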

@tdas
Contributor

tdas commented Jun 19, 2015

Just a couple more comments on the tests.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35329 has finished for PR 6632 at commit 5a05d0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // class ParentClass(parentField: Int)
    • // class ChildClass(childField: Int) extends ParentClass(1)
    • // If the class type corresponding to current slot has writeObject() defined,
    • // then its not obvious which fields of the class will be serialized as the writeObject()
    • abstract class GeneratedClass
    • case class Bin(child: Expression)
    • case class Md5(child: Expression)

@SparkQA

SparkQA commented Jun 20, 2015

Test build #35331 has finished for PR 6632 at commit 321340d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // class ParentClass(parentField: Int)
    • // class ChildClass(childField: Int) extends ParentClass(1)
    • // If the class type corresponding to current slot has writeObject() defined,
    • // then its not obvious which fields of the class will be serialized as the writeObject()
    • abstract class GeneratedClass
    • case class Bin(child: Expression)
    • case class Md5(child: Expression)

@tdas
Contributor

tdas commented Jun 20, 2015

Merging this to master, thanks a lot.

@tdas
Contributor

tdas commented Jun 20, 2015

Wait, oh, the title, please fix order :/

@koeninger koeninger changed the title [Streaming][Kafka][SPARK-8127] KafkaRDD optimize count() take() isEmpty() [SPARK-8127][Streaming][Kafka] KafkaRDD optimize count() take() isEmpty() Jun 20, 2015
@koeninger
Contributor Author

fixed title

@tdas
Contributor

tdas commented Jun 20, 2015

Merging to master.

@asfgit asfgit closed this in 1b6fe9b Jun 20, 2015
@tdas
Contributor

tdas commented Jun 24, 2015

I forgot to say, thanks Cody! :)

@koeninger
Contributor Author

Cheers :)

