Skip to content

Inconsistency with count distinct on NaN values #16254

@andygrove

Description

@andygrove

Describe the bug

I have this csv file:

a,b
x,NaN
x,NaN
x,NaN

With a simple select query, DF says there is only 1 distinct value for column b (which, I think is correct).

> select count(distinct b) from 'nan.csv';
+---------------------------+
| count(DISTINCT nan.csv.b) |
+---------------------------+
| 1                         |
+---------------------------+

However, in an aggregate query, DF says there are 3 distinct values:

> select a, count(distinct b) from 'nan.csv' group by 1 order by 1;
+---+---------------------------+
| a | count(DISTINCT nan.csv.b) |
+---+---------------------------+
| x | 3                         |
+---+---------------------------+

This behavior seems inconsistent. I would expect the aggregate query to also report that there is one distinct value (in Spark, the behavior is consistent between the two queries).

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions