Conversation

@JFinis (Contributor) commented Nov 22, 2023

This commit adds a new column order IEEE754TotalOrder, which can be used for floating point types (FLOAT, DOUBLE, FLOAT16).

The advantage of the new order is a well-defined ordering between -0, +0, and the various possible bit patterns of NaNs. Thus, every single possible bit pattern of a floating point value now has a well-defined order, so there is no possibility of two implementations applying different orders when the new column order is used.
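
For illustration, here is a minimal, non-normative sketch in Rust (function names invented for this example) of how such a total order can be realized: the raw bits of a float are mapped to an unsigned key whose plain integer order equals IEEE 754 total order, so every bit pattern, including every NaN, compares the same way in every implementation.

// Map an f64 to a u64 key whose unsigned order matches IEEE 754
// totalOrder: flip all bits for negative values, only the sign bit
// for positive ones.
fn total_order_key(x: f64) -> u64 {
    let bits = x.to_bits();
    if bits >> 63 == 1 { !bits } else { bits ^ (1 << 63) }
}

// Resulting order: -NaN < -Inf < ... < -0.0 < +0.0 < ... < +Inf < +NaN
fn total_order_lt(a: f64, b: f64) -> bool {
    total_order_key(a) < total_order_key(b)
}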

With the default column order, there were many problems w.r.t. NaN values which led to reading engines not being able to use statistics of floating point columns for scan pruning, even in the case where no NaNs were in the data set. The problems are discussed in detail in the next section.

This solution to the problem is the result of the extended discussion in #196, which ended with the consensus that IEEE 754 total ordering is the best approach to solve the problem in a simple manner without introducing special fields for floating point columns (such as nan_counts, which was proposed in that PR). Please refer to the discussion in that PR for all the details on why this solution was chosen over various design alternatives.

Note that this solution is fully backward compatible and should break neither old readers nor old writers, as a new column order is added. Legacy writers can continue not writing this new order and instead write the default type-defined order. Legacy readers should avoid using any statistics on columns that have a column order they do not understand and therefore should simply not use the statistics for columns ordered using the new order.

The remainder of this message explains in detail what the problems are and how the proposed solution fixes them.

Problem Description

Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely, the following problems exist:

Statistics don't tell whether NaNs are present

As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might not seem too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview of what different systems do can be found here. The gist of it is that different systems with different semantics exist w.r.t. NaNs, and most of these systems do order NaNs, either less than or greater than all other values.

For example, if the semantics of the reading query engine mandate that NaN is to be treated greater than all other values, the predicate x > 1.0 should include NaN values. If a page has max = 0.0 now, the engine would not be able to skip the page, as the page might contain NaNs which would need to be included in the query result.

Likewise, the predicate x < 1.0 should include NaN if NaN is treated to be less than all other values by the reading engine. Again, a page with min = 2.0 couldn't be skipped in this case by the reader.
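
To make the reader's dilemma concrete, here is a sketch (in Rust, helper names invented for illustration) of the page-skipping check such an engine would have to use under the default order; because the statistics never reveal whether NaNs are present, the check degenerates to never skipping:

// Pruning check for `x > v` in an engine that orders NaN above all
// other values, given only default-order statistics. NaN presence is
// unknowable, so the engine must assume the worst and never skip.
fn can_skip_page_gt(page_max: f64, v: f64) -> bool {
    let may_contain_nan = true; // default-order stats cannot rule this out
    !may_contain_nan && page_max <= v // hence always false
}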

Thus, even if a user doesn't query for NaN explicitly, they might use other predicates that need to filter or retain NaNs under the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem at first sight.

Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. And as stated, that applies to many seemingly innocuous greater-than or less-than predicates in most database systems. Conversely, it would be nice if Parquet enabled scan pruning in these cases, regardless of whether the reader and writer agree upon whether NaN is smaller than, greater than, or incomparable to all other values.

Note that the problem exists especially if the Parquet file doesn't include any NaNs, so this is not only a problem in the edge case where NaNs are present; it is a problem in the far more common case of NaNs not being present.

Handling NaNs in a ColumnIndex

There is currently no well-defined way to write a spec-conforming ColumnIndex once a page has only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, then there is no other value to put into the min/max bounds. However, bounds in a ColumnIndex are non-optional, so we have to put something in there. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization, as a single page containing NaNs will prevent writing a column index for the column chunk containing that page, so even pages in that chunk that don't contain NaNs will not be indexed.

It would be nice if there was a defined way of writing the ColumnIndex when NaNs (and especially only-NaN pages) are present.

Handling only-NaN pages & column chunks

Note: Hereinafter, whenever the term only-NaN is used, it refers to a page or column chunk whose only non-null values are NaNs. E.g., an only-NaN page may contain a mixture of null values and NaNs, or only NaNs, but no non-NaN non-null values.

The Statistics objects stored in page headers and in the file footer have a similar, albeit smaller problem: min_value and max_value are optional here, so it is easier to not include NaNs in the min/max in case of an only-NaN page or column chunk: simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max value wasn't written because there were only NaNs, or simply because the writing engine decided to omit them for whatever other reason, which is allowed by the spec, as the fields are optional.

Consequently, a reader cannot know whether missing min_value and max_value means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page as it is undefined what values it contains".

It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages.

Solution

IEEE 754 total order solves all the mentioned problems. As NaNs now have a defined place in the ordering, they can be incorporated into min and max bounds. In fact, in contrast to the default ordering, they do not need any special casing anymore, so all the remarks on how readers and writers should special-handle NaNs and -0/+0 no longer apply to the new ordering.

As NaNs are incorporated into min and max, a reader can now see through the statistics whether NaNs are contained. Thus, a reading engine just has to map its NaN semantics to the NaN semantics of total ordering. For example, if the semantics of the reading engine treat all NaNs (also -NaNs) as greater than all other values, a reading engine having a predicate x > 5.0 (which should include NaNs) may not filter any pages / row groups if either min or max is (+/-)NaN.
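
As a sketch of that rule (in Rust, names invented for illustration):

// Pruning check for `x > v` in an engine that treats all NaNs as
// greater than all other values, given total-order min/max statistics.
// A (+/-)NaN bound means the page may contain NaNs, which the predicate
// must retain; otherwise a real-valued max at or below v allows skipping.
fn can_skip_page_gt(min: f64, max: f64, v: f64) -> bool {
    if min.is_nan() || max.is_nan() {
        return false;
    }
    max <= v
}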

Only-NaN pages can now also be included in the column index, as they are no longer a special case.

In conclusion, all mentioned problems are solved by using IEEE 754 total ordering.

Make sure you have checked all steps below.

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@etseidl (Contributor) left a comment

Thanks @JFinis! First read looks pretty good to me. I just have a few small suggestions.

@tustvold (Contributor) left a comment

Looks great to me, thank you for writing this up

@Mytherin commented Apr 2, 2025

Are there any plans to get this or #196 merged? Either of these solutions would still be useful today.

@JFinis (Contributor, Author) commented Apr 3, 2025

Yes, I would like to merge this, as the problem still persists. I guess what's missing now is mostly two open source implementations. Sadly, I currently do not have the time allocated for that much open source work.

@Mytherin commented Apr 3, 2025

I can implement support for this in a branch in DuckDB if that helps get this in.

@wgtmac (Member) commented Apr 8, 2025

I can implement support for this in a branch in DuckDB if that helps get this in.

@Mytherin I can help with the PoC implementation with parquet-java and/or parquet-cpp but I cannot commit the time.

@etseidl (Contributor) commented Apr 8, 2025

I can take a stab at a rust implementation, but it would help if this PR could merge in recent changes from master.

@JFinis (Contributor, Author) commented Apr 8, 2025

I can take a stab at a rust implementation, but it would help if this PR could merge in recent changes from master.

I'll update in the next few days.

@etseidl (Contributor) commented Apr 11, 2025

parquet-rs PoC is largely ready to go, at least as far as producing files that use the new ColumnOrder. What's missing is modifications to datafusion to use the new stats. @Mytherin, would you be able to modify DuckDB instead for the proof of utility (https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format point 2)?

Then all that would be required is implementation in parquet-java and demonstration of interoperability.

@mapleFU (Member) commented Apr 12, 2025

I'm glad to provide the PoC for the C++ version in the coming days.

@wgtmac (Member) commented Apr 12, 2025

@mapleFU Thanks! Then I will focus on parquet-java.

Mytherin added a commit to Mytherin/duckdb that referenced this pull request Apr 12, 2025
@Mytherin commented

I've implemented this for DuckDB in my branch here. Attached is a Parquet file that has the orders defined.

fp_order.parquet.zip

@wgtmac (Member) commented Apr 14, 2025

PoC from parquet-java is now alive: apache/parquet-java#3191

@JFinis JFinis force-pushed the totalorder branch 3 times, most recently from a7efb32 to cb7c38e on April 14, 2025 at 13:52
@JFinis (Contributor, Author) commented Apr 14, 2025

I have rebased my changes onto latest master and applied all suggestions. I have also split this PR into a styling change and the actual change, as suggested by @wgtmac. I don't know how to make this PR depend on the other PR (I can't put branches from my fork as base branches to this PR), so as long as the other PR isn't shipped, this PR still contains both commits.

So, how do we continue from here? Do we have enough implementations now to start a vote? Or what is still missing?

@etseidl (Contributor) commented Apr 14, 2025

So, how do we continue from here? Do we have enough implementations now to start a vote? Or what is still missing?

Thanks @JFinis! I think what's needed now is to have the PoC engines each write a file with the new ordering (perhaps alltypes_tiny_pages.parquet from parquet-testing), and then have a query engine (one of Spark, Datafusion, DuckDB) do a query like 'select * ... where float_col is NaN' and show that the physical plan uses the new stats and results in no pages being read. (Perhaps a NaN could be introduced in one page). This would demonstrate interoperability and utility.

After that, I'd probably kick off a thread on the dev list to see if there are any last concerns before holding a vote.

@JFinis (Contributor, Author) commented Jun 3, 2025

@orlp I actually had nan_counts in my initial proposal. I'll try to summarize the gist of the discussion in that post (feel free to read it yourself) that prompted us to move to this approach instead.

  • It was argued that adding such special handling just for floating point is actually not that clean. What if the next type needs special handling for some values? Do we add special fields for them as well? If yes, statistics might over time become a mess where type-specific information spills into the generic statistics definition. If not, then why is float so special that we are willing to give it special fields while other types do not get them?
  • NaNs are rare. The current main problem of Parquet is not that NaNs, if present, make filtering inefficient. The problem is that NaNs make filtering impossible, even if they are not present. So the main focus of this proposal is to have sane semantics that work especially well if there are no NaNs, while still working in the presence of NaNs, albeit with possibly reduced filter efficiency.
  • The "NaN poisoning" you mention, which I also mentioned in my initial proposal, is somewhat arbitrary: Yes, NaN is considered an extreme value in this proposal and therefore overwrites other extreme values, making filtering less efficient, but is it really worth making a case distinction here? E.g., what about infinity? You could argue that having infinity in statistics is of similarly little use, as it's too wide a bound, so we could also introduce infinity_counts and then have the statistics contain only non-infinite, non-NaN values. We could go further and also "peel off" further large outlier values and handle them in special ways. All of this would increase the filtering capabilities of the approach, at the cost of implementation complexity and additional branches.

So in the end, this boils down to: Are we willing to special-case float to make filtering in the presence of NaNs more efficient, or do we go with a more streamlined implementation without special fields that only make sense for a single data type, at the cost of worse filtering performance when NaNs are present? The consensus ended up being the latter, and over time I have come to agree with it. It's just a far more minimalistic and clean design.

On your second comment about making the sign bit significant: I am working on an engine myself that does not care about sign bits in NaNs, but I still consider having them in Parquet okay. Our engine, when reading, will simply normalize the NaN sign bit out (or just keep it as is). When writing, we will simply never write negative NaNs (or just write whatever bit pattern we have). However we do it, we of course have to make sure that the statistics we write are in line with the bit patterns we write, and when reading, we have to make sure the same is true.

Therefore, IMHO, I don't see the danger you mention. Of course, each writer that wants to conform to the Parquet spec would have to handle this correctly: either normalize the sign bit out, or consider it when computing min/max; otherwise it's an incorrect writer. The writing engine cannot just do whatever it does internally; it has to do whatever the Parquet spec mandates. Again, the engine I am working on handles NaNs internally differently as well, but of course our Parquet read/write code does whatever Parquet mandates, and not what we otherwise do internally. So basically your danger boils down to: "There is a chance that a sloppily programmed writer gets this wrong." That is indeed a risk. But IMHO, there are way more complex things in Parquet to get wrong, so this shouldn't be our bar.
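
As a sketch of what a conforming writer that canonicalizes NaNs could do (the quiet-NaN bit pattern below is one common choice, not something the spec mandates):

// Collapse every NaN to one canonical positive quiet NaN before writing
// values and statistics, keeping the written data and the total-order
// min/max consistent with each other.
fn normalize_nan(x: f64) -> f64 {
    if x.is_nan() { f64::from_bits(0x7ff8_0000_0000_0000) } else { x }
}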

On the upside, what we gain by considering the sign bit for NaNs is:

  • a very efficient, branch-free and straightforward software implementation for the comparison predicate
  • this gives a total ordering of every possible NaN bit pattern and thereby leaves no room for interpretation that could lead to different results between engines.
  • a comparison predicate that is actually standardized by IEEE, so one could argue that sticking to standards is preferable over rolling out our own thing.

Your own proposal suggests an ordering that would be different for min and max. While this might help with squeezing the most filtering possibilities out of NaNs, it adds even more special handling logic, and I guess a sloppily written writer implementation could get this wrong as well.

I guess, in the end, whatever we do will be light years better than the current state. We just have to agree on doing something. So far, reaching that agreement has been hard, but it looks like this version of the proposal has gotten the most agreement so far. I do agree, though, that the solution you propose is absolutely also a possibility with its own merits.

@orlp commented Jun 4, 2025

Are we willing to special-case float to make filtering in the presence of NaNs more efficient, or do we go with a more streamlined implementation without special fields that only make sense for a single data type, at the cost of worse filtering performance when NaNs are present?

I consider IEEE 754 total order to be as arbitrary and special-cased as the NaN-ignoring order. Either way we have to do something we're not doing right now for floats.

On the upside, what we gain by considering the sign bit for NaNs is:

  • a very efficient, branch-free and straightforward software implementation for the comparison predicate

Computing the min/max while ignoring NaNs is also efficient and branch-free if done correctly. And I hope we agree the computation of the statistic is far more important than the singular comparison to decide whether or not to skip an entire page.

  • this gives a total ordering of every possible NaN bit pattern and thereby leaves no room for interpretation that could lead to different results between engines.

My proposal is also fully defined without room for interpretation (apart from the exact NaN pattern used for the full-NaN page, we can figure that out).

The "NaN poisoning" you mention, which I also mentioned in my initial proposal, is somewhat arbitrary: Yes, NaN is considered an extreme value in this proposal and therefore overwrites other extreme values, making filtering less efficient, but is it really worth to make a case distinction here?

Yes, absolutely. NaNs aren't as rare as you might think, depending on the use-case, and they need special handling regardless, because they aren't ordered naturally. The 'large value' conundrum you mention is a red herring; those are naturally handled the same by every data system.

But not every data system makes the same choices when it comes to handling NaNs. By taking them out of the ordering and presenting them separately through a nan_count field, every data system can process the information in a way which is compatible with their system.

Your own proposal suggests an ordering that would be different for min and max. While this might help with squeezing the most filtering possibilities out of NaNs, it adds even more special handling logic, and I guess a sloppily written writer implementation could get this wrong as well.

Let's not forget that my proposal very closely matches what Parquet already requires today.

@JFinis (Contributor, Author) commented Jun 4, 2025

I agree @orlp, all your arguments are valid and it's good to have the discussion. However, they all basically already came up in the initial discussion, so they are merely another educated opinion rather than new points we didn't consider before, and therefore aren't likely to sway existing opinions on the topic; or maybe they will.

As mentioned, I'm happy with both approaches and willing to continue with any of them; I just want the main problem to be fixed, somehow. I mostly challenged your proposal to give you and others an overview of where the discussion went so far. While I like the conceptual simplicity of total ordering, I'm not a firm believer that it is the only possible solution to the problem. And I do recognize that your proposal of sorting NaN differently for min and max wasn't in my initial proposal and is probably a good idea, so if we went with NaN-counts, there are things that could be improved upon in the initial proposal before continuing.

So from my side, I'm happy to continue either way. But I guess we now should at least get an informal vote going to see where the majority opinion resides on this.

@Mytherin commented Jun 4, 2025

From my perspective, both solutions are significantly better than the status quo, and both solve the core problem, which is that floating point statistics cannot be used fully even when there are no NaN values present, because the reader always has to assume NaN values might be present in the data.

Which of these solutions is adopted is not important to me - but given that changes to the standard take a long time to propagate, it would be better if one is adopted sooner rather than later. At this point it is over 2 years since the original proposal, and pages of discussion have already been had over this. It would be great to make some progress here.

@etseidl (Contributor) commented Jun 4, 2025

Personally I can see pros and cons for each approach and agree either is better than the status quo. I'll just point out that if speed is of the essence, then the current proposal already has 3 open source implementations and AFAICT is in a state where it can be voted on and adopted quickly. Kicking off another round of discussions will likely delay roll-out by at least a few months.

But I guess we now should at least get an informal vote going to see where the majority opinion resides on this.

I like this idea. If there are PMC members who would veto this PR, it's better to know now rather than after a few more months of discussion. If vetoed we can regroup and try again with nan_count.

@orlp commented Jun 5, 2025

I'll just point out that if speed is of the essence then the current proposal already has 3 open source implementations

I don't believe either proposal requires a lot of implementation effort.

If there are PMC members who would veto this PR, it's better to know now rather than after a few more months of discussion. If vetoed we can regroup and try again with nan_count.

I don't think this is a fair voting procedure if the options are stated as "approve this or veto". Since everyone agrees that the status quo is bad, almost everyone will prefer this over vetoing. At the very least the voting options must include

  • this proposal,
  • my proposal,
  • veto (current proposed solutions not sufficient/defective).

@orlp commented Jun 5, 2025

There is actually a problem with the singular NaN count for data systems which use IEEE 754 total ordering (such as Datafusion): they would need two counts for efficient page filtering in the face of NaNs, one for positive NaNs and one for negative NaNs.

@tustvold (Contributor) commented Jun 5, 2025

To echo what has been said before, and reiterated by @JFinis: the total order mechanism is a relatively straightforward fix that is not only easy to implement but also easy to explain to users, if it even needs explaining. A NaN just becomes a very big value, and a negative NaN a very small value. Yes, this means a NaN may limit the ability to prune statistics, but this is no different from an abnormally large or small value. IMO this is pretty intuitive.

That's not to say the nan counts proposal is without merit; I could definitely see it being useful for engines that order NaNs differently. However, it is more complex for both users and parquet readers to reason about.

IMO we have a proposal with relatively broad consensus and multiple implementations; I think it is worthwhile bringing it to a broader audience in the form of a vote. I don't see adopting total ordering as a one-way door; we can always add a nan count mechanism later.

@orlp commented Jun 5, 2025

I don't see adopting total ordering as a one-way door; we can always add a nan count mechanism later.

I do see it as a one-way door because no vendor is going to adopt a new ordering later if it means all readers which do not (yet) support the new ordering will downgrade from "suboptimal pruning" (the new status quo if this is adopted) to "no pruning" (the current status quo). We realistically speaking only have one chance to go from the status quo to a good solution, adopting a suboptimal solution now will likely leave us in a local maximum for the remainder of Parquet's lifetime.

Comment on lines +1151 to +1152
* x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
* y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
A Member commented:

Let's add a comment explaining those lines?

Suggested change
* x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
* y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
* // Turn sign+mantissa into 2's complement representation
* x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
* y_int ^= (((y_int >> 63) as u64) >> 1) as i64;

A Member commented:

That's assuming the comment is right. Please double-check :)

@JFinis (Contributor, Author) commented Jun 5, 2025

There is actually a problem with the singular NaN count for data systems which use IEEE 754 total ordering (such as Datafusion): they would need two counts for efficient page filtering in the face of NaNs, one for positive NaNs and one for negative NaNs.

I don't think that's a big problem. It just means that if the system needs to include either -NaN or +NaN in a query, any page that has a non-zero nan_count has to be scanned. Yes, that might mean that you scan a page in vain if you're only looking for, say, +NaN, but the page happens to only include -NaN; still, this seems to be a rather small problem.

@JFinis (Contributor, Author) commented Jun 5, 2025

I am only a guest here and no PMC member, so I don't know how to continue the process. Could someone pick this up and do a vote?

I agree with @orlp that since there are two possible solutions for the problem on the table, we could also do a vote taking both approaches into account. Or two separate votes for each of them? Or just a single vote? Or something completely different? I am happy with any of those. As mentioned, I don't know how things are done here. Please, someone more official than I, pick this up and do whatever is prudent here.

I am happy to work in the latest suggestions and rebase; or switch back and improve the other PR, but before doing so, I would like to get a better understanding of whether the PR at hand has a chance to make it in the first place.

@etseidl (Contributor) commented Jun 5, 2025

I don't think this is a fair voting procedure if the options are stated as "approve this or veto".

I wasn't suggesting a formal vote on total order. Rather, I was thinking an informal poll as @JFinis had suggested to fail fast on this PR if it lacked PMC support. But I see your point. What I want to avoid is a protracted discussion that after several months ends with the majority opinion of those involved to proceed with total order, only to then be shot down in an actual vote. The most vocal opinions recently have been from non-PMC.

Also, any vote involving nan_count would have to be informal since there are no PoC implementations (see https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format). The only formal vote that can be held at this point is "approve this or veto".

There is actually a problem with the singular NaN count for data systems which use IEEE 754 total ordering (such as Datafusion): they would need two counts for efficient page filtering in the face of NaNs, one for positive NaNs and one for negative NaNs.

I don't think that's a big problem. It just means that if the system needs to include either -NaN or +NaN in a query, any page that has a non-zero nan_count has to be scanned. Yes, that might mean that you scan a page in vain if you're only looking for, say, +NaN, but the page happens to only include -NaN; still, this seems to be a rather small problem.

I believe it's worse than that. Consider a page [-NaN, -2.0, 0.0]. With nan_counts, the stats are (-2.0, 0.0, nan_count=1). Not knowing what type of NaN was seen, an engine like Datafusion will have to treat the stats as (-NaN, NaN) rather than (-NaN, 0.0). With a query predicate like x > 0.0, Datafusion could prune the page with total order stats, but would have to scan the page with nan_count. It's the opposite of the problem @orlp raised. An engine that treats all NaNs as equal and greater than all real values would turn the total order stats (-NaN, 0.0) into (-NaN, NaN), and thus would be unable to prune with a predicate of x < -2.0. Adding a separate count for -NaN, I think, would satisfy both types of engines, but adds even more complexity (though if we're counting anyway it's not much of an added burden).

@emkornfield (Contributor) commented

I am only a guest here and no PMC member, so I don't know how to continue the process. Could someone pick this up and do a vote?

I can try to scan through the two proposals again, but I think maybe we should send a note about progress made with the current proposal and the alternative that has been brought up. If the current proposal. We generally haven't had too many complaints about NaNs in general in Parquet, so I'm inclined toward the simplest thing possible here from a specification perspective (but if there are open datasets on which we can evaluate the impact on prunability, settling the argument with data would be useful).

@wgtmac (Member) commented Jul 26, 2025

I've posted a brief update on the Parquet dev ML: https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s

@alamb (Contributor) commented Jul 31, 2025

I don't see adopting total ordering as a one-way door; we can always add a nan count mechanism later.

I do see it as a one-way door because no vendor is going to adopt a new ordering later if it means all readers which do not (yet) support the new ordering will downgrade from "suboptimal pruning" (the new status quo if this is adopted) to "no pruning" (the current status quo). We realistically speaking only have one chance to go from the status quo to a good solution, adopting a suboptimal solution now will likely leave us in a local maximum for the remainder of Parquet's lifetime.

I don't think the two proposals are mutually exclusive. Here is an existence proof of a way they could both co-exist:

  1. Introduce IEEE 754 total order (this PR/proposal)
  2. In a follow-on change, adopt the nan_count statistic proposed by @orlp, with the semantics: if nan_count is specified, then NaNs should not be included in the statistics; if nan_count is not specified, then NaNs are included in the statistics, per this proposal

Of course, the spec would be less complicated if we included nan_count to begin with, but I don't see any reason we can't introduce it later on.

So my suggestion is: let's get this proposal up for a vote, and then write up a follow-on proposal for being more efficient in the presence of NaNs.

@orlp commented Jul 31, 2025

@alamb

but I don't see any reason we can't introduce it later on

It would be incompatible with all existing query engines with Parquet readers which implemented this PR and use it to prune.

Suppose I write a query engine with a Parquet reader, and want to do pruning. My query is "find all rows where column x is NaN". Naturally, I read this proposal and implemented it by cleverly pruning anything where the min and max statistics aren't NaN (since if there was a NaN, it would've shown up in the statistics).
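
In code, such a reader's pruning rule might look like this (a sketch, names invented):

// Under total-order statistics, any NaN in a page must surface in the
// bounds, so non-NaN min and max prove the page holds no NaNs, and it
// can be skipped for the predicate `x is NaN`.
fn can_skip_is_nan(min: f64, max: f64) -> bool {
    !min.is_nan() && !max.is_nan()
}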

In a follow-on change, adopt the nan_count statistic proposed by @orlp, with the semantics: if nan_count is specified, then NaNs should not be included in the statistics; if nan_count is not specified, then NaNs are included in the statistics, per this proposal

Now if we later change the semantics like you propose, this query engine would incorrectly miss rows if it doesn't understand nan_count. That's clearly unacceptable - we can't just change semantics post facto.

So the only option would be to completely replace the ordering introduced in this PR with another ordering which includes this nan_count behavior. But then we go back to my original complaint, as the older engines don't support this new ordering:

I do see it as a one-way door because no vendor is going to adopt a new ordering later if it means all readers which do not (yet) support the new ordering will downgrade from "suboptimal pruning" (the new status quo if this is adopted) to "no pruning" (the current status quo).

@alamb (Contributor) commented Jul 31, 2025

So the only option would be to completely replace the ordering introduced in this PR with another ordering which includes this nan_count behavior. But then we go back to my original complaint, as the older engines don't support this new ordering:

In my opinion this situation would be better than the current state of affairs.

I still don't see the proposals as mutually exclusive (though the current proposal could be improved)

Is there a formal proposal (aka a PR with proposed wording) for the alternative NaN handling? @orlp are you willing to make such a proposal if it doesn't exist?

@orlp commented Jul 31, 2025

@orlp are you willing to make such a proposal if it doesn't exist?

I don't have the time or energy to make a fully formal proposal, no.

That said, if I may informally propose something, I think the current order we have in TypeDefinedOrder is honestly fine:

   * (*) Because the sorting order is not specified properly for floating
   *     point values (relations vs. total ordering) the following
   *     compatibility rules should be applied when reading statistics:
   *     - If the min is a NaN, it should be ignored.
   *     - If the max is a NaN, it should be ignored.
   *     - If the min is +0, the row group may contain -0 values as well.
   *     - If the max is -0, the row group may contain +0 values as well.
   *     - When looking for NaN values, min and max should be ignored.
   * 
   *     When writing statistics the following rules should be followed:
   *     - NaNs should not be written to min or max statistics fields.
   *     - If the computed max value is zero (whether negative or positive),
   *       `+0.0` should be written into the max statistics field.
   *     - If the computed min value is zero (whether negative or positive),
   *       `-0.0` should be written into the min statistics field.

I would honestly only clarify what should be done in the event a full-NaN page is encountered. I think in that case it makes the most sense to write -inf to the min and +inf to the max. That's honestly the only defect in the current spec: we specify "NaNs should not be written to min or max statistics fields", but not what should be done in that case.


Then if this ordering (which we already have! It can't get easier than that) is simply combined with the addition of two new statistics (pos_nan_count and neg_nan_count), all execution engines should be able to prune as they please, regardless of what semantics they have for comparison (both total ordering like in DataFusion and partial ordering like in Polars).

I assume the addition of a completely new statistic is backwards compatible, if we don't change any semantics of the existing ordering (only clarifying what should be done for the full-NaN page)?

Then the statistics can be implemented in an incredibly simple branchless (with cmovs) way:

fn float_stats(data: &[f64]) -> (f64, f64, u64, u64) {
    let mut min = f64::INFINITY;
    let mut max = f64::NEG_INFINITY;
    let mut pos_nan_count: u64 = 0;
    let mut neg_nan_count: u64 = 0;
    for &x in data {
        // NaN comparisons are always false, so NaNs never enter min/max.
        min = if x < min { x } else { min };
        max = if x > max { x } else { max };
        pos_nan_count += (x.is_nan() && x.is_sign_positive()) as u64;
        neg_nan_count += (x.is_nan() && x.is_sign_negative()) as u64;
    }

    // Post-processing normalization.
    if min == 0.0 { min = -0.0; }
    if max == 0.0 { max = 0.0; }
    if pos_nan_count + neg_nan_count == data.len() as u64 {
        // Full-NaN page: write the widest real bounds.
        min = f64::NEG_INFINITY;
        max = f64::INFINITY;
    }
    (min, max, pos_nan_count, neg_nan_count)
}

@etseidl (Contributor) commented Aug 1, 2025

combined with the addition of two new statistics (pos_nan_count and neg_nan_count)

I'm wondering if counts are actually necessary. I see a need for null counts, but with NaN I think all we really care about is the lack or presence of them, rather than the number. Perhaps two boolean fields has_pos_nan and has_neg_nan. Or maybe rather than adding two more fields we could instead add an i32 or i64 bitfield and an enum to define the bit positions. This would allow for other special cases in the future (say we want to exclude +/- infinity as well).
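
For illustration, such a bitfield could look like the following sketch (names and bit positions are purely hypothetical; nothing like this exists in the format today):

// Hypothetical special-value flags packed into an i32 statistics field.
const HAS_POS_NAN: i32 = 1 << 0;
const HAS_NEG_NAN: i32 = 1 << 1;
const HAS_POS_INF: i32 = 1 << 2; // room for future special cases
const HAS_NEG_INF: i32 = 1 << 3;

fn may_contain_nan(flags: i32) -> bool {
    flags & (HAS_POS_NAN | HAS_NEG_NAN) != 0
}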

are you willing to make such a proposal if it doesn't exist?

I think #196 could be revived if need be.

@JFinis (Contributor, Author) commented Aug 1, 2025

My personal opinion would be not to revive nan counts. Can we maybe first establish whether we have a majority for anything? I feel like we're being startled by a singular opinion here, while we already had broad consensus on switching from nan_counts to total order; that's why I created this second PR in the first place. It feels like going in circles without adding new insights.

One technical comment:

I think the current order we have in TypeDefinedOrder is honestly fine:

I don't think it's fully fine. It's not a total order, so it makes sorting_columns ambiguous (are NaNs first? Last? It's underdefined). Therefore, one cannot leverage sortedness on floating point columns.

@wgtmac (Member) commented Aug 8, 2025

@emkornfield @Fokko @gszadovszky @julienledem @rdblue
Do you have any opinion on this? I'd like to move this forward with a formal vote.

@gszadovszky (Contributor) commented

+1 for having a formal vote between the two options and a third option for veto, as @orlp suggested.

JFinis added a commit to JFinis/parquet-format that referenced this pull request Aug 9, 2025
This commit is a combination of the following PRs:

* Introduce IEEE 754 total order
  apache#221
* Add nan_count to handle NaNs in statistics
  apache#196

Both these PRs try to solve the same problems; read
the description of the respective PRs for explanation.

This PR is the result of an extended discussion in
which it was repeatedly brought up that another possible
solution to the problem could be the combination of
the two approaches. Please refer to this discussion
on the mailing list and in the two PRs for details.
The mailing list discussion can be found here:
https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s

The contents of this PR are basically a straightforward
combination of the two approaches:
* IEEE total order is introduced as a new order for floating
  point types
* nan_count and nan_counts fields are added

Legacy writers may not write nan_count(s) fields,
so readers have to handle them being absent. Also, legacy
writers may have included NaNs into min/max bounds, so readers
also have to handle that.

As there are no legacy writers writing IEEE total order,
nan_count(s) are defined to be mandatory if this order is used,
so readers can assume their presence when this order is used.

This commit removes `nan_pages` from the ColumnIndex,
which the nan_counts PR mandated. We don't need it anymore,
as IEEE total order solves this: as this is a new order with
no legacy writers and readers, we have the freedom to define
that for only-NaN pages using this order, NaNs are actually
written into the min & max bounds, and readers can assume
that isNaN(min) signals an only-NaN page.
@alamb (Contributor) commented Aug 11, 2025

@JFinis has created a combined, updated proposal that adds total order and nan_count here. Please take a look if you are interested.
