Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

This is a follow-up PR of #15868 to merge the maxConnections option into the numPartitions option.

How was this patch tested?

Pass the existing tests.

<td><code>numPartitions</code></td>
<td>
The number of partitions that can be used, if set. It works by limiting both read and write
operations' parallelism. If the number of partitions to write exceeds this limit, the
Member

Could you also document the impact on the number of JDBC connections here?

Member Author

Sure. I'll add that, too.

<td><code>numPartitions</code></td>
<td>
The number of partitions that can be used, if set. It works by limiting both read and write
operations' parallelism. If the number of partitions to write exceeds this limit, the
@gatorsmile (Member) Nov 21, 2016

"fewer partitions" is obscure. :)

Member Author

Thank you for the review again. Yep, I'll revise it. :)

@dongjoon-hyun (Member Author) Nov 21, 2016

How about this?

The number of partitions that can be used, if set. It works by limiting both read and write operations' parallelism. If the number of partitions to write exceeds this limit, the operation will coalesce the data set with this value before writing. In other words, this determines the maximum number of concurrent JDBC connections.

@gatorsmile (Member) Nov 21, 2016

In the read path, we might not generate the exact number of partitions.

Found a bug in the read path in #15499.

Member Author

Oh really? Is the number of JDBCPartitions different?

Member

Conceptually, if the column values are exactly identical, only one partition will be generated.

The bug in #15499 is not related to the above fact.

Member Author

Ya. Actually, the above statement describes coalesce and its parameter. So, shall we keep it?

Member

Maybe we need to rephrase it and document both behaviors (read and write paths). Also emphasize that this is the maximum number.

@dongjoon-hyun (Member Author) Nov 21, 2016

Like this?

The number of partitions that can be used, if set. It works by limiting both read and write operations' parallelism. In other words, this determines the maximum number of concurrent JDBC connections. For reading, it will make partitions less than or equal to this maximum. For writing, if the number of partitions to write exceeds this limit, the operation will coalesce the data set with this maximum before writing.
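For illustration only, here is a minimal sketch of the write-side behavior described above, with placeholder connection values (not code from this PR): a DataFrame with more partitions than numPartitions is coalesced before the JDBC write, so at most numPartitions connections are opened.

import org.apache.spark.sql.SparkSession

// Assumes a local SparkSession; the JDBC URL, table, and credentials are placeholders.
val spark = SparkSession.builder().appName("jdbc-numPartitions-sketch").master("local[*]").getOrCreate()

// A DataFrame with 10 in-memory partitions.
val df = spark.range(0, 1000).toDF("value").repartition(10)

// With numPartitions = 2, the write path coalesces the data set to at most 2 partitions,
// so at most 2 concurrent JDBC connections are used for this write.
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/t")
  .option("dbtable", "t1")
  .option("user", "root")
  .option("password", "")
  .option("numPartitions", "2")
  .mode("append")
  .save()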

Member

(I am sorry I goofed in the PR.)

</tr>

<tr>
<td><code>partitionColumn, lowerBound, upperBound, numPartitions</code></td>
Member

We should revert this. These four parameters are related:

"These options must all be specified if any of them is specified."

@dongjoon-hyun (Member Author) Nov 21, 2016

That is incorrect now; numPartitions can be used alone. The other three are read-only optional parameters, while numPartitions is a general optional parameter.

Member

In the read path, they are still related.

Member Author

For that, I'll add another description mentioning the relationship.

@SparkQA

SparkQA commented Nov 21, 2016

Test build #68951 has finished for PR 15966 at commit 7df41a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-    (lowerBound != null && upperBound != null && numPartitions != null),
+    (lowerBound != null && upperBound != null && numPartitions.isDefined),
     s"If '$JDBC_PARTITION_COLUMN' is specified then '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND'," +
     s" and '$JDBC_NUM_PARTITIONS' are required.")
Member

We need to update this error message too.

Member Author

Hmm, to me this error message looks correct. JDBC_NUM_PARTITIONS is independent of the others, but JDBC_PARTITION_COLUMN requires the others (including JDBC_NUM_PARTITIONS), doesn't it?

Member

I did not try it. What happens if we pass JDBC_PARTITION_COLUMN when writing JDBC tables?

@dongjoon-hyun (Member Author) Nov 21, 2016

It will raise an exception due to the require. I think that is the correct existing behavior. Ah, let me check.

Member Author

Since it is used in the declaration of a view, users cannot reach the write path.

CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '2', partitionColumn 'value')
java.lang.IllegalArgumentException: requirement failed: If 'partitionColumn' is specified then 'lowerBound', 'upperBound', and 'numPartitions' are required.

Member

Have you tried the DataFrameWriter's JDBC/write() APIs?
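(For reference, a hedged sketch of one way the DataFrameWriter path could be exercised; the table and connection values are placeholders, and this is not code from the PR. If the same JDBCOptions require also runs on the write path, this should fail with the same message as the view example above.)

import java.util.Properties

// Assumes an existing SparkSession named `spark` (e.g. in spark-shell).
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "")
// partitionColumn is specified without lowerBound/upperBound/numPartitions,
// which is exactly what the require above rejects.
props.setProperty("partitionColumn", "value")

spark.range(10).toDF("value")
  .write
  .mode("append")
  .jdbc("jdbc:mysql://localhost:3306/t", "t1", props)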

@dongjoon-hyun
Member Author

For further review, I updated the doc. We can proceed with the updated content.

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68955 has finished for PR 15966 at commit 48b6d25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Is it possible that users read a table through JDBC and then write data to a table through JDBC, and want the read and the write to have different parallelism?

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68963 has finished for PR 15966 at commit f8c67ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 22, 2016

Thank you for the review, @cloud-fan.

With the same parameter name numPartitions for both read and write, both paths share the same maximum parallelism by default, which is easy to use.

To use different parallelism for read and write, we can define separate views. In the following example, t1 has numPartitions=1 and t2 has numPartitions=2. The example is for writing, but I think the situation is the same for the read operation. (For the read operation, partitionColumn, lowerBound, and upperBound are also required.)

sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', numPartitions '1')")
sql("CREATE OR REPLACE TEMPORARY VIEW t2 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', numPartitions '2')")
sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
sql("INSERT OVERWRITE TABLE t2 SELECT a FROM data GROUP BY a")

@gatorsmile
Member

I found a way to verify the coalesce logic of JDBC writing. See my PR #15975; it added numPartition into JDBCRelation.

With minor code changes, you can see the adjusted numPartition in the output of EXPLAIN.

sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE").explain(true)

The number of partitions that can be used, if set. It works by limiting both read and write
operations' parallelism. In other words, this determines the maximum number of concurrent
JDBC connections. For reading, it will make partitions less than or equal to this maximum.
For writing, if the number of partitions to write exceeds this limit, the operation will
Member

Note: I am not good at writing tech documentation. Below is my revision.

The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
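As a hedged sketch of the coalesce step this wording describes (an approximation for illustration, not the actual Spark source):

import org.apache.spark.sql.DataFrame

// Cap write parallelism: if the DataFrame has more partitions than numPartitions,
// coalesce it down so at most numPartitions concurrent JDBC connections are opened.
def limitWritePartitions(df: DataFrame, numPartitions: Option[Int]): DataFrame =
  numPartitions match {
    case Some(n) if n > 0 && df.rdd.getNumPartitions > n => df.coalesce(n)
    case _ => df
  }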

@dongjoon-hyun
Member Author

Great! Thank you for #15975.

@gatorsmile
Member

LGTM pending test.

Also cc @rxin

     } else {
       JDBCPartitioningInfo(
-        partitionColumn, lowerBound.toLong, upperBound.toLong, numPartitions.toInt)
+        partitionColumn, lowerBound.toLong, upperBound.toLong, numPartitions.get)
@rxin (Contributor) Nov 22, 2016

Is this safe to call get on? Calling get on an Option is always dangerous and not future-proof, especially when there is no if (x.isDefined) check surrounding this.

@dongjoon-hyun (Member Author) Nov 22, 2016

Thank you for the review, @rxin. Yes, it implicitly relies on the earlier require that partitionColumn implies all of the other options.
I'll change it here like this. Is this better?

-    val partitionInfo = if (partitionColumn == null) {
+    val partitionInfo = if (partitionColumn == null || lowerBound == null || upperBound == null ||
+        numPartitions.isEmpty) {

Contributor

If you are doing this, how about changing partitionColumn, lowerBound, upperBound to Option types as well?

Member Author

Sure! No problem. I'll update them tonight.

@SparkQA

SparkQA commented Nov 22, 2016

Test build #69019 has finished for PR 15966 at commit ba8be46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The only failure seems to be unrelated to this PR.

- SPARK-8020: set sql conf in spark conf *** FAILED *** (17 seconds, 602 milliseconds)

@SparkQA

SparkQA commented Nov 22, 2016

Test build #69024 has finished for PR 15966 at commit d45467e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @rxin.
Could you review this again?

@dongjoon-hyun
Member Author

Hi, @rxin, @cloud-fan, @gatorsmile.
Please let me know if there is anything more to do.

@SparkQA

SparkQA commented Nov 24, 2016

Test build #69104 has finished for PR 15966 at commit f9db374.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The only failure is unrelated to this PR.

[info] KafkaSourceStressForDontFailOnDataLossSuite:
[info] - stress test for failOnDataLoss=false *** FAILED *** (1 minute, 58 seconds)

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Nov 24, 2016

Test build #69112 has started for PR 15966 at commit f9db374.

@dongjoon-hyun
Member Author

Maybe it was some internal error.

Traceback (most recent call last):
  File "./dev/run-tests-jenkins.py", line 232, in <module>
    main()
  File "./dev/run-tests-jenkins.py", line 219, in main
    test_result_code, test_result_note = run_tests(tests_timeout)
  File "./dev/run-tests-jenkins.py", line 140, in run_tests
    test_result_note = ' * This patch **fails %s**.' % failure_note_by_errcode[test_result_code]
KeyError: -9

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Nov 24, 2016

Test build #69125 has finished for PR 15966 at commit f9db374.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 24, 2016

Test build #69126 has finished for PR 15966 at commit f9db374.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@dongjoon-hyun
Member Author

Thank you, @cloud-fan!

     val numPartitions = jdbcOptions.numPartitions

-    val partitionInfo = if (partitionColumn == null) {
+    val partitionInfo = if (partitionColumn.isEmpty || lowerBound.isEmpty || upperBound.isEmpty ||
Contributor

I'd change this to:

if (partitionColumn == null) {
  assert(lowerBound.isEmpty && upperBound.isEmpty && numPartitions.isEmpty)
  null
} else {
  ...
}

to be future-proof.

Member Author

Thanks. I'll update soon.

@dongjoon-hyun (Member Author) Nov 25, 2016

I assumed the following:

if (partitionColumn.isEmpty) {
  assert(lowerBound.isEmpty && upperBound.isEmpty && numPartitions.isEmpty)
  null
} else {
  ...
}

@dongjoon-hyun (Member Author) Nov 25, 2016

Hmm, @rxin, we are using numPartitions for both writing and reading, and numPartitions can be used alone.

Member Author

I'll use the following:

    val partitionInfo = if (partitionColumn.isEmpty) {
      assert(lowerBound.isEmpty && upperBound.isEmpty)
      null
    } else {
      assert(lowerBound.nonEmpty && upperBound.nonEmpty && numPartitions.nonEmpty)
      JDBCPartitioningInfo(
        partitionColumn.get, lowerBound.get, upperBound.get, numPartitions.get)
    }

@rxin
Contributor

rxin commented Nov 25, 2016

LGTM other than that one change.

@SparkQA

SparkQA commented Nov 25, 2016

Test build #69143 has finished for PR 15966 at commit 4a957e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 25, 2016

Thanks - merging into master.

@asfgit closed this in fb07bbe on Nov 25, 2016
@dongjoon-hyun
Member Author

Thank you for the review and the merge, @rxin, @gatorsmile, @cloud-fan!

@dongjoon-hyun deleted the SPARK-18413-2 branch on November 27, 2016 07:20
robert3005 pushed a commit to palantir/spark that referenced this pull request on Dec 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request on Jan 27, 2017