[SPARK-19335] Introduce UPSERT feature to SPARK #16692
Conversation
Can one of the admins verify this patch?
Hi all - thanks for this submission. Overall it's a very clean implementation and I like it a lot; there's obviously a large amount of effort that went into developing this. The main issue with this approach, however, is that the upsert statement itself is an extremely expensive operation. Depending on how your uniqueness condition is defined, validating against the uniqueness constraint proves to be the most expensive part of this whole sequence.

In #16685 I chose to implement this by reading in the existing table and doing a join operation to identify conflicts, because that join is easily distributed across the entire dataset. In contrast, the implementation as it stands in this PR depends entirely on the database to enforce the uniqueness constraint, something that in general does not parallelize well and requires a full traversal of the index created on the uniqueness constraint. Furthermore, in both MySQL and Postgres (the examples you've provided) this index cannot be implemented as a hash index. Unless the owner of the database manually computes and enforces hashes on individual rows, this approach relies on B-tree indices for the lookup. That is a marginal cost when the B-tree is on a single field, but if the uniqueness constraint spans multiple columns the index is implemented as nested B-trees, which makes the update extraordinarily costly, with non-linear performance degradation as both the size of the database and the size of the table being upserted increase.

This mirrors our initial approach to the problem, but we ultimately moved away from it in favor of the one in #16685 for performance reasons. We were able to achieve a more than 10x performance increase, even taking into account the cost of the additional joins. Our tests were not massive - roughly a 10 GB Postgres database with approximately 10 million rows, on a fairly mid-range machine. I would love to know whether you have done any performance benchmarks with this approach, and whether you could try out the approach in #16685 and let me know how it performs. Thanks!
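For reference, a minimal sketch of the join-based conflict detection described above, as one reading of the #16685 approach; the names `incoming`, `url`, `props`, `mytable2`, and `c1` are placeholders, not the actual API of either PR:

```scala
// Sketch only: split the incoming batch into rows that conflict with the existing
// table on the unique key "c1" and rows that can be appended as-is. The join runs
// as a normal distributed Spark job instead of relying on the database's index.
val existingKeys = spark.read.jdbc(url, "mytable2", props).select("c1")

val conflicting = incoming.join(existingKeys, Seq("c1"), "left_semi")  // key already present
val fresh       = incoming.join(existingKeys, Seq("c1"), "left_anti")  // key not present yet

fresh.write.mode(org.apache.spark.sql.SaveMode.Append)
  .jdbc(url, "mytable2", props)
// `conflicting` would then be applied as UPDATE statements (e.g. via a staging table),
// which is the part where this PR and #16685 take different routes.
```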
@kevinyu98 How about closing this PR for now and revisiting it later?
We are closing it due to inactivity. Please do reopen it if you want to push it forward. Thanks!
Such a feature is quite needed - please proceed. The biggest pain point is that a single duplicate entry cancels the whole batch insert when the database table has unique constraints. Performance (i.e. the cost) should not be the blocker here, as results are often written to traditional RDBMSs and are usually quite small (compared to the typical Spark scenario of lots of "big data" for analysis jobs). The uniqueness constraint check should be done entirely by the RDBMS; the solution should be very lightweight from the Spark perspective - an extra option for write.jdbc() should be enough. Therefore I like this PR/approach a lot. Maybe Oracle can be added as well, using its "MERGE INTO" syntax. An even more lightweight alternative: the following should already be enough, at least for MariaDB/MySQL databases:
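A rough illustration of such a database-side alternative (an assumption on this editor's part, using MySQL/MariaDB's INSERT ... ON DUPLICATE KEY UPDATE clause; table and column names are placeholders, and this is not the original snippet from the comment):

```scala
// Illustration only: with a statement like this, MySQL/MariaDB resolves the key
// conflict itself, so Spark merely has to emit a slightly different INSERT per batch.
val upsertSql =
  """INSERT INTO mytable2 (c1, c2) VALUES (?, ?)
    |ON DUPLICATE KEY UPDATE c2 = VALUES(c2)""".stripMargin
```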
What changes were proposed in this pull request?
This PR proposes to add UPSERT support to Spark through DataFrameWriter's JDBC data source options.
For example:
If mytable2 in the MySQL database has a unique constraint on column c1 and the user tries to save the DataFrame into it, the write fails with a unique constraint violation.
```scala
val df = Seq((1, 4)).toDF("c1", "c2")
val url = "jdbc:mysql://9.30.167.220:3306/mydb"
df.write.mode(org.apache.spark.sql.SaveMode.Append)
  .option("user", "kevin").option("password", "kevin")
  .jdbc(url, "mytable2", new java.util.Properties())
```

With this feature, the user can set the upsert options to write the DataFrame into the MySQL table:

```scala
df.write.mode(org.apache.spark.sql.SaveMode.Append)
  .option("upsert", true).option("upsertUpdateColumn", "c1")
  .option("user", "kevin").option("password", "kevin")
  .jdbc(url, "mytable2", new java.util.Properties())
```

Here is the design doc:
UPSERT DESIGN DOC
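For a concrete picture, a hedged sketch of the kind of statement an upsert option like this could generate for PostgreSQL (an illustration of standard ON CONFLICT syntax, not quoted from the design doc; table and column names are placeholders, and the unique constraint is assumed to be on c1):

```scala
// Illustration only: PostgreSQL 9.5+ upsert form. The conflict target must match
// the unique constraint on the target table.
val pgUpsertSql =
  """INSERT INTO mytable2 (c1, c2) VALUES (?, ?)
    |ON CONFLICT (c1) DO UPDATE SET c2 = EXCLUDED.c2""".stripMargin
```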
How was this patch tested?
Local test: ran the test cases from spark-shell, connecting to MySQL and PostgreSQL databases (a rough sketch of such a check is shown below).
Test cases: added test cases to the existing suites, including the docker integration suite.
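A minimal sketch of the kind of spark-shell check used locally, assuming the upsert / upsertUpdateColumn options proposed in this PR and placeholder connection details; this is not the actual test code from the suites:

```scala
// Sketch only: first insert a row, then write a conflicting key with the upsert
// options and verify that the row is updated rather than the whole batch failing.
val url = "jdbc:mysql://localhost:3306/mydb"
val props = new java.util.Properties()
props.setProperty("user", "kevin")
props.setProperty("password", "kevin")

Seq((1, 2)).toDF("c1", "c2").write
  .mode(org.apache.spark.sql.SaveMode.Append)
  .jdbc(url, "mytable2", props)

Seq((1, 4)).toDF("c1", "c2").write
  .mode(org.apache.spark.sql.SaveMode.Append)
  .option("upsert", true).option("upsertUpdateColumn", "c1")
  .jdbc(url, "mytable2", props)

val updated = spark.read.jdbc(url, "mytable2", props).where("c1 = 1").collect()
assert(updated.head.getInt(1) == 4)  // c2 should now hold the upserted value
```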
Please review http://spark.apache.org/contributing.html before opening a pull request.