Skip to content

FFI: data corruption on boolean Array with non-zero offset #2051

@mbutrovich

Description

@mbutrovich

Background

We encountered a new issue while trying to upgrade to DF49 where a test using the first or last aggregations when ignoring nulls returns incorrect results. DF49 did have some changes to these code paths, but I think this is just exposing a Comet issue, rather than a DF issue. After digging into it, the data coming out of the partial aggregation is correct on the native side with 3 columns showing a single row for one of the parititions that is [2, null, false] but when we import the vector on the Spark side the value changes to [2, null, true] which breaks the final aggregation and produces wrong results. When I dig into the FFI snapshot of the problematic batches, the only difference I found was that offset was non-zero (1 in this case).

More discussion on a proposed DataFusion workaround: apache/datafusion#16918.

Digging into past issues, offset with FFI between C and Rust with Java seems problematic...
apache/arrow-rs#3671
apache/arrow-rs#3675
apache/arrow-rs#5959
...and perhaps most importantly...
apache/arrow-java#88

From what I've inferred in issue discussion and documentation, a non-zero offset is not guaranteed to be supported by all FFI consumers:
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowArray.offset

Producers MAY specify that they will only produce 0-offset arrays to ease implementation of consumer code. Consumers MAY decide not to support non-0-offset arrays, but they should document this limitation.

I don't currently see anything in the Arrow Java documentation that says it doesn't support non-zero offset but the previously linked issues are concerning.

Proposed fix

I have a test branch locally that just does a take on any arrays with a non-zero offset before sending them over the JNI boundary with FFI. Running make test locally with a debug print if a non-zero offset array occurs makes this seem like a very rare code path. I will likely open a PR later today with this workaround, but wanted an issue to add a comment in the code related and to collect discussion.

Steps to reproduce

It's slightly non-deterministic due to the first/last function behavior, but this test with DF49 reproduces almost every time for me:

#2040 (comment)

Expected behavior

A false should not flip to true when crossing the JNI boundary with Arrow FFI.

Additional context

A fuzzer between Arrow-rs and Arrow Java that just does JNI calls back and forth to FFI batches would be amazing, but is perhaps outside of the scope of the Comet repo.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions