-
Notifications
You must be signed in to change notification settings - Fork 247
Description
Background
We encountered a new issue while trying to upgrade to DF49 where a test using the first or last aggregations when ignoring nulls returns incorrect results. DF49 did have some changes to these code paths, but I think this is just exposing a Comet issue, rather than a DF issue. After digging into it, the data coming out of the partial aggregation is correct on the native side with 3 columns showing a single row for one of the parititions that is [2, null, false] but when we import the vector on the Spark side the value changes to [2, null, true] which breaks the final aggregation and produces wrong results. When I dig into the FFI snapshot of the problematic batches, the only difference I found was that offset was non-zero (1 in this case).
More discussion on a proposed DataFusion workaround: apache/datafusion#16918.
Digging into past issues, offset with FFI between C and Rust with Java seems problematic...
apache/arrow-rs#3671
apache/arrow-rs#3675
apache/arrow-rs#5959
...and perhaps most importantly...
apache/arrow-java#88
From what I've inferred in issue discussion and documentation, a non-zero offset is not guaranteed to be supported by all FFI consumers:
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowArray.offset
Producers MAY specify that they will only produce 0-offset arrays to ease implementation of consumer code. Consumers MAY decide not to support non-0-offset arrays, but they should document this limitation.
I don't currently see anything in the Arrow Java documentation that says it doesn't support non-zero offset but the previously linked issues are concerning.
Proposed fix
I have a test branch locally that just does a take on any arrays with a non-zero offset before sending them over the JNI boundary with FFI. Running make test locally with a debug print if a non-zero offset array occurs makes this seem like a very rare code path. I will likely open a PR later today with this workaround, but wanted an issue to add a comment in the code related and to collect discussion.
Steps to reproduce
It's slightly non-deterministic due to the first/last function behavior, but this test with DF49 reproduces almost every time for me:
Expected behavior
A false should not flip to true when crossing the JNI boundary with Arrow FFI.
Additional context
A fuzzer between Arrow-rs and Arrow Java that just does JNI calls back and forth to FFI batches would be amazing, but is perhaps outside of the scope of the Comet repo.