Skip to content

Conversation

@vegarsti
Copy link
Contributor

@vegarsti vegarsti commented Oct 2, 2025

Which issue does this PR close?

What changes are included in this PR?

Makes convert_array_to_scalar_vec return None instead of the empty list when list elements are null.

Are these changes tested?

Covered by existing tests and adds new test cases.

Are there any user-facing changes?

Change to return type of public function ScalarValue::convert_array_to_scalar_vec()

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate functions Changes to functions implementation labels Oct 2, 2025
@vegarsti vegarsti changed the title feat: convert_array_to_scalar_vec returns optional arrays feat: convert_array_to_scalar_vec respects null elements Oct 2, 2025
@vegarsti vegarsti force-pushed the convert-array-to-scalar-vec-optional branch from 9c6db36 to b1bb29c Compare October 2, 2025 20:03
@vegarsti vegarsti force-pushed the convert-array-to-scalar-vec-optional branch 2 times, most recently from 217287c to 35358d6 Compare October 2, 2025 20:35
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vegarsti I would love to see if there any performance degradations, you can find benches in the project.

Maybe we can have a separate test for this issues?

@vegarsti
Copy link
Contributor Author

vegarsti commented Oct 3, 2025

Thanks @vegarsti I would love to see if there any performance degradations, you can find benches in the project.

Maybe we can have a separate test for this issues?

Good call, I will do both. Thanks!

@vegarsti

This comment was marked as outdated.

@vegarsti vegarsti force-pushed the convert-array-to-scalar-vec-optional branch from 35358d6 to e232719 Compare October 4, 2025 06:27
@vegarsti
Copy link
Contributor Author

vegarsti commented Oct 4, 2025

Thanks @vegarsti I would love to see if there any performance degradations, you can find benches in the project.

Maybe we can have a separate test for this issues?

I added two test cases to show the change in this PR. On main these both return an empty vector, but with this PR the nullability is preserved, so in the first one (4) it's None and in the second one (5) it's Some (an empty vec).

https://github.com/apache/datafusion/pull/17891/files#diff-49e275af8f09685c7bbc491db8ab3b9479960878f42ac558ec0e3e39570590bd

https://github.com/vegarsti/datafusion/blob/e23271991bb2726df3ced25011cff0b08501b7b0/datafusion/common/src/scalar/mod.rs#L9128-L9177

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do plan to take a closer look at this, specifically around understanding the impact to array_agg, nth_value and array_has 👍

I'll also mark this as API change since convert_array_to_scalar_vec is technically part of the public API

@Jefffrey Jefffrey added the api change Changes the API exposed to users of the crate label Oct 8, 2025
let list = scalar_values
.into_iter()
.flatten()
.flatten()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For here, it might be more correct to do it like so:

if let Ok(scalar_values) =
    ScalarValue::convert_array_to_scalar_vec(&array)
{
    assert_eq!(scalar_values.len(), 1);
    match &scalar_values[0] {
        // Haystack was a single list element as expected
        Some(list) => {
            let list = list
                .iter()
                .map(|v| Expr::Literal(v.clone(), None))
                .collect();

            return Ok(ExprSimplifyResult::Simplified(in_list(
                std::mem::take(needle),
                list,
                false,
            )));
        }
        // Haystack was a singular null, should be handled elsewhere
        None => {
            return Ok(ExprSimplifyResult::Original(args));
        }
    };
}
  • Instead of flattening None into an empty vec, we handle it separately.

Example of an accompanying test:

    #[test]
    fn test123() {
        let haystack = ListArray::new_null(
            Arc::new(Field::new_list_field(DataType::Int32, true)),
            1,
        );
        let haystack = lit(ScalarValue::List(Arc::new(haystack)));
        let needle = col("c");

        let props = ExecutionProps::new();
        let context = datafusion_expr::simplify::SimplifyContext::new(&props);

        let Ok(ExprSimplifyResult::Original(args)) =
            ArrayHas::new().simplify(vec![haystack.clone(), needle.clone()], &context)
        else {
            panic!("Expected non-simplified expression");
        };

        assert_eq!(args, vec![haystack, col("c")]);
    }

We can see it's essentially like doing array_has(null, 1) and handling this case in the simplifier; we don't try to simplify it as I don't think there's a way to pass the null as the haystack to in_list().

However, it's likely some other code path handles this null propagation already so that's why tests didn't fail. I prefer the robust (albeit more verbose) approach here (to me to reads easier), but we can stick with what we have here (preferably calling it out via comments?) either way.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, very good. I agree that the robust approach is better! Thanks for this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Included both - thanks a lot. I didn't change anything because this felt very clear.

Copy link
Contributor Author

@vegarsti vegarsti Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like performance took a hit after introducing this 😭 #17891 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get the performance back, I reverted back to adding a second flatten() call but with some comments (and the test is there and still passes), in d9aa726. Does that sound good?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too bothered by this honestly, mainly because its not really on a hot path (I think) and not sure how often people might look at this part of code anyway; but feel free to take a stab at refactoring if you feel like it 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, are you saying you don't think it's worth adding your suggested fix because it's code people don't look at much, or that it's fine to add it even if it is slightly less performant because it's not in the hot path? 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to add this ^ with the test in a patch 👍🏻

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry I wasn't clear; I meant the former, since I don't think the benchmarks actually tested that bit of code (coupled with your remark about benchmarks being a bit noisy)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it!

@vegarsti vegarsti force-pushed the convert-array-to-scalar-vec-optional branch from e232719 to b51c66f Compare October 8, 2025 20:25
@vegarsti

This comment was marked as outdated.

@vegarsti
Copy link
Contributor Author

vegarsti commented Oct 8, 2025

Reverted back to using flatten in d9aa726, now back to having ~same performance as on main (slightly better than on main):

# ./bench.sh compare main convert-array-to-scalar-vec-optional
Comparing main and convert-array-to-scalar-vec-optional
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ convert-array-to-scalar-vec-optional ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  75.02 ms │                             77.72 ms │     no change │
│ QQuery 2     │  21.47 ms │                             21.94 ms │     no change │
│ QQuery 3     │  33.10 ms │                             34.15 ms │     no change │
│ QQuery 4     │  25.09 ms │                             24.66 ms │     no change │
│ QQuery 5     │  53.54 ms │                             53.55 ms │     no change │
│ QQuery 6     │  22.10 ms │                             22.11 ms │     no change │
│ QQuery 7     │  73.68 ms │                             73.33 ms │     no change │
│ QQuery 8     │  48.50 ms │                             45.75 ms │ +1.06x faster │
│ QQuery 9     │  65.69 ms │                             65.30 ms │     no change │
│ QQuery 10    │  48.25 ms │                             47.06 ms │     no change │
│ QQuery 11    │  16.03 ms │                             16.26 ms │     no change │
│ QQuery 12    │  32.05 ms │                             31.61 ms │     no change │
│ QQuery 13    │  37.23 ms │                             36.88 ms │     no change │
│ QQuery 14    │  26.97 ms │                             28.53 ms │  1.06x slower │
│ QQuery 15    │  38.11 ms │                             37.94 ms │     no change │
│ QQuery 16    │  14.98 ms │                             14.06 ms │ +1.07x faster │
│ QQuery 17    │  95.52 ms │                             89.33 ms │ +1.07x faster │
│ QQuery 18    │ 104.07 ms │                            102.60 ms │     no change │
│ QQuery 19    │  44.68 ms │                             43.64 ms │     no change │
│ QQuery 20    │  33.32 ms │                             33.89 ms │     no change │
│ QQuery 21    │  77.52 ms │                             78.33 ms │     no change │
│ QQuery 22    │  11.18 ms │                             12.61 ms │  1.13x slower │
└──────────────┴───────────┴──────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Benchmark Summary                                   ┃          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total Time (main)                                   │ 998.12ms │
│ Total Time (convert-array-to-scalar-vec-optional)   │ 991.26ms │
│ Average Time (main)                                 │  45.37ms │
│ Average Time (convert-array-to-scalar-vec-optional) │  45.06ms │
│ Queries Faster                                      │        3 │
│ Queries Slower                                      │        2 │
│ Queries with No Change                              │       17 │
│ Queries with Failure                                │        0 │
└─────────────────────────────────────────────────────┴──────────┘

@vegarsti vegarsti force-pushed the convert-array-to-scalar-vec-optional branch 2 times, most recently from b022f1e to bf60995 Compare October 9, 2025 19:05
@vegarsti vegarsti force-pushed the convert-array-to-scalar-vec-optional branch from bf60995 to 633ab64 Compare October 10, 2025 05:08
@Jefffrey Jefffrey added this pull request to the merge queue Oct 13, 2025
@Jefffrey
Copy link
Contributor

Thanks @vegarsti

Merged via the queue into apache:main with commit 556eb9b Oct 13, 2025
28 checks passed
@vegarsti vegarsti deleted the convert-array-to-scalar-vec-optional branch October 15, 2025 06:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate common Related to common crate core Core DataFusion crate functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ScalarValue::convert_array_to_scalar_vec doesn't respect null list elements

3 participants