Skip to content

Conversation

@jqin61
Copy link
Contributor

@jqin61 jqin61 commented Jul 22, 2024

for issue #954

@sungwy
Copy link
Collaborator

sungwy commented Jul 22, 2024

Thank you very much for putting up this PR so quickly! It's amazing how fast you've investigated this issue and put forward a working solution. Just wanted to highlight that the proposed implementation is consistent with Spark Iceberg's behavior.

For a given Iceberg table:

spark.read.table("demo.tacocat.test_null").show()

>> +----+----+
>> |  id|data|
>> +----+----+
>> |   1|   a|
>> |   2|   b|
>> |   3|   c|
>> |NULL|   d|
>> |   5|   e|
>> +----+----+

Scan API ignore Null values unless null is specified in the predicate expression:

spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE id > 2""").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> |  1|   a|
>> |  2|   b|
>> +---+----+

spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE not id > 2""").show()

>> +---+----+
>> | id|data|
>> +---+----+
>> |  3|   c|
>> |  5|   e|
>> +---+----+

spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE id > 2 OR id is NULL""").show()

>> +----+----+
>> |  id|data|
>> +----+----+
>> |   3|   c|
>> |NULL|   d|
>> |   5|   e|
>> +----+----+

While the DELETE API avoids deleting nulls unless it is specified directly in the predicate expression.

spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id == 2""")
spark.read.table("demo.tacocat.test_null").show()

>> +----+----+
>> |  id|data|
>> +----+----+
>> |   1|   a|
>> |   3|   c|
>> |NULL|   d|
>> |   5|   e|
>> +----+----+

spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id <= 2""")
spark.read.table("demo.tacocat.test_null").show()

>> +----+----+
>> |  id|data|
>> +----+----+
>> |   3|   c|
>> |NULL|   d|
>> |   5|   e|
>> +----+----+

spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id <= 2 or id IS NULL""")
spark.read.table("demo.tacocat.test_null").show()

>> +---+----+
>> | id|data|
>> +---+----+
>> |  3|   c|
>> |  5|   e|
>> +---+----+

So I agree with @jqin61 's finding that we have to walk through the predicate expression to check if Nulls/NaNs were directly mentioned explicitly in the delete filter to revert the expression correctly as proposed. Simple negation of pyarrow.compute.Expression will unfortunately yield the wrong outcome.

import pyarrow as pa
import pyarrow.compute as pc
expr = (pc.field("a") == pc.scalar(3))
tbl = pa.Table.from_pydict({"a": [1,2,3,None]})
tbl.filter(~ expr)

>> pyarrow.Table
>> a: int64
>> ----
>> a: [[1,2]]

@jqin61 jqin61 marked this pull request as draft July 23, 2024 06:02
Copy link
Contributor Author

@jqin61 jqin61 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still need to fix some tests

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jqin61 Thanks for flagging this critical issue and providing a fix quickly! It looks great! Leaving some comments below

@jqin61 jqin61 marked this pull request as ready for review July 23, 2024 20:26
@jqin61 jqin61 requested review from HonahX and sungwy July 23, 2024 20:27
@jqin61 jqin61 requested a review from sungwy July 24, 2024 14:30
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for taking the time to investigate the typing issue. Although that does not affect code logic, it is great to fix those to improve readability and avoid confusion. I have a final comment about using Any instead of generic L in concrete class.

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @jqin61 ! Thank you again for jumping into this complex issue, investigating the root cause and implementing the solution so quickly.

And thank you @HonahX for the detailed rounds of review!

@sungwy sungwy merged commit 9b6503d into apache:main Jul 25, 2024
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* fix

* naming

* handle nan as well

* naming as sung suggested

* one more test to fix; one more comment to address

* fix a test

* refactor code organization

* fix mouthful naming

* restore usage of BoundTerm

* small fixes for comments

* small fix for typing

* fix typing according to pr comment
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* fix

* naming

* handle nan as well

* naming as sung suggested

* one more test to fix; one more comment to address

* fix a test

* refactor code organization

* fix mouthful naming

* restore usage of BoundTerm

* small fixes for comments

* small fix for typing

* fix typing according to pr comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants