
Conversation

@jose-torres
Contributor

What changes were proposed in this pull request?

The stream-stream join tests add data to multiple sources, and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.

Fortunately, MemoryStream synchronizes batch generation on itself, and StreamExecution won't generate empty batches. So we can resolve this race condition by having successive AddDataMemory actions synchronize on every involved MemoryStream together; then StreamExecution can't start generating a batch until all the data is present.
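
For context, the flaky pattern in the join suites looks roughly like this. The sketch below assumes the StreamTest harness and its implicits are in scope; the query and values are illustrative, not taken from the actual suites.

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream

// Illustrative only: two memory sources feeding one streaming join.
val leftInput  = MemoryStream[Int]
val rightInput = MemoryStream[Int]
val joined = leftInput.toDF().select($"value" as "key")
  .join(rightInput.toDF().select($"value" as "key"), "key")

testStream(joined)(
  AddData(leftInput, 1, 2, 3),   // a micro-batch may be planned right here,
  AddData(rightInput, 2, 3, 4),  // before the right-hand rows become visible,
  CheckLastBatch(2, 3)           // so the last batch can be missing joined rows
)
```

With the change, consecutive AddDataMemory actions hold the monitors of both MemoryStreams while the data is appended, so StreamExecution can't observe a state where only one side has new data.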

How was this patch tested?

Existing tests.

@jose-torres
Contributor Author

@tdas

@SparkQA

SparkQA commented Feb 21, 2018

Test build #87571 has finished for PR 20646 at commit 1df90e7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jose-torres
Contributor Author

java.lang.RuntimeException: [unresolved dependency: com.sun.jersey#jersey-core;1.14: configuration not found in com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 compile]

Surely unrelated to this change.

@jose-torres
Contributor Author

retest this please

@jose-torres
Contributor Author

(https://issues.apache.org/jira/browse/SPARK-23369 was already filed for previous flake)

  addDataMemoryActions.append(actionIterator.next().asInstanceOf[AddDataMemory[_]])
}
if (addDataMemoryActions.nonEmpty) {
  val synchronizeAll = addDataMemoryActions
Contributor


This is some magic-ish code. Can you add a few more comments explaining how this compose thing works?
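
For readers without the diff open, the composition being asked about boils down to something like the following. This is an illustrative sketch, not the PR's exact code; `lockOn`, the `Wrapper` type, and the `reduce` are stand-ins for how the per-source synchronized wrappers get combined.

```scala
object ComposeLocksSketch extends App {
  // Each source contributes a wrapper that runs a body inside its monitor.
  type Wrapper = (() => Unit) => Unit
  def lockOn(source: AnyRef): Wrapper = body => source.synchronized { body() }

  val leftSource  = new AnyRef   // stand-ins for the two MemoryStreams
  val rightSource = new AnyRef

  // Reducing the wrappers nests them, so the combined function holds every
  // source's monitor while the wrapped body runs.
  val all: Wrapper = Seq(lockOn(leftSource), lockOn(rightSource))
    .reduce((outer, inner) => (body: () => Unit) => outer(() => inner(body)))

  all(() => println("all AddData bodies run while every source lock is held"))
}
```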

startedTest.foreach { action =>
  val actionIterator = startedTest.iterator.buffered
  while (actionIterator.hasNext) {
    // Synchronize sequential addDataMemory actions.
Contributor


// Synchronize --> // Collectively synchronize .... actions so that the data gets added together in a single batch.

@tdas
Contributor

tdas commented Feb 21, 2018

Actually, I am having second thoughts about this. It fundamentally changes how the tests work, especially the stress tests. The stress tests deliberately exercise these corner cases (by randomly adding successive AddData actions) where data is added while the previously added data is being picked up. With this change, we would accidentally stop testing those race-condition-prone cases.

Second, we are taking multiple locks across multiple sources, and StreamExecution is likely to take the same locks. I am really afraid that we are introducing deadlocks by doing this.

I am still thinking about what the right approach here is. I think it should involve:

  • Explicit synchronized adding of data to multiple sources.
  • Not holding locks in multiple sources.
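
The deadlock concern is the classic lock-ordering hazard. Schematically (a hypothetical interleaving, not actual Spark code):

```scala
object LockOrderingSketch extends App {
  // Hypothetical: the test thread and the stream execution thread each grab
  // the two source monitors, but in opposite orders.
  val sourceA = new AnyRef
  val sourceB = new AnyRef

  def grabBoth(first: AnyRef, second: AnyRef): Thread = new Thread(new Runnable {
    def run(): Unit = first.synchronized { Thread.sleep(10); second.synchronized {} }
  })

  val testThread      = grabBoth(sourceA, sourceB)
  val executionThread = grabBoth(sourceB, sourceA)

  testThread.start()
  executionThread.start()   // A-then-B vs B-then-A: can deadlock
}
```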

@SparkQA

SparkQA commented Feb 21, 2018

Test build #87572 has finished for PR 20646 at commit 1df90e7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Feb 21, 2018

I opened a new PR to test out an alternate approach. PTAL - https://github.com/apache/spark/pull/20650/files?w=1

(note the w=1; it hides whitespace diffs in the diff view).

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 23, 2018
…*JoinSuite

**The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1**

## What changes were proposed in this pull request?

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.

A prior attempt to solve this issue by jose-torres in apache#20646 synchronized on all the memory sources together whenever consecutive AddData actions were found. However, that carries the risk of deadlock as well as unintended modification of the stress tests (see the above PR for a detailed explanation). Instead, this PR does the following.

- A new action called `StreamProgressBlockedActions` that allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously (a usage sketch follows below).

This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

## How was this patch tested?
Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.

Author: Tathagata Das <[email protected]>

Closes apache#20650 from tdas/SPARK-23408.
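
Reusing the illustrative join from the sketch earlier in this thread, the suites change roughly as follows. `MultiAddData`'s exact signature should be checked against PR 20650; this fragment only shows the intent.

```scala
// Before (race-prone): two separate AddData actions.
testStream(joined)(
  AddData(leftInput, 1, 2, 3),
  AddData(rightInput, 2, 3, 4),
  CheckLastBatch(2, 3)           // flaky if a batch fires between the adds
)

// After: one MultiAddData action adds to both sources while the query is
// blocked from making progress, so both sides land in the same batch.
testStream(joined)(
  MultiAddData(leftInput, 1, 2, 3)(rightInput, 2, 3, 4),
  CheckLastBatch(2, 3)           // deterministic
)
```
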
@jose-torres closed this Mar 7, 2018
HeartSaVioR pushed a commit to HeartSaVioR/spark that referenced this pull request Feb 11, 2019
…*JoinSuite

**The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1**

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.

A prior attempt to solve this issue by jose-torres in apache#20646 synchronized on all the memory sources together whenever consecutive AddData actions were found. However, that carries the risk of deadlock as well as unintended modification of the stress tests (see the above PR for a detailed explanation). Instead, this PR does the following.

- A new action called `StreamProgressBlockedActions` that allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously.

This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.

Author: Tathagata Das <[email protected]>

Closes apache#20650 from tdas/SPARK-23408.

NOTE: Modified slightly by Jungtaek Lim <[email protected]> to cover a DSv2 incompatibility between Spark 2.3 and 2.4:
 * StreamingDataSourceV2Relation is a class in 2.3, whereas it is a case class in 2.4
srowen pushed a commit that referenced this pull request Feb 12, 2019
…in Streaming*JoinSuite

## What changes were proposed in this pull request?

**The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1**

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.

A prior attempt to solve this issue by jose-torres in #20646 synchronized on all the memory sources together whenever consecutive AddData actions were found. However, that carries the risk of deadlock as well as unintended modification of the stress tests (see the above PR for a detailed explanation). Instead, this PR does the following.

- A new action called `StreamProgressBlockedActions` that allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously.

This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

NOTE: This patch is modified slightly from the original PR (#20650) to cover a DSv2 incompatibility between Spark 2.3 and 2.4: StreamingDataSourceV2Relation is a class in 2.3, whereas it is a case class in 2.4.

## How was this patch tested?

Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.

Closes #23757 from HeartSaVioR/fix-streaming-join-test-flakiness-branch-2.3.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Co-authored-by: Tathagata Das <[email protected]>
Signed-off-by: Sean Owen <[email protected]>