Skip to content

Predicates on to_timestamp do not work as expected with "naive" timestamp strings #765

@alamb

Description

@alamb

Describe the bug
Given a TimestampNanosecondArray which pretty-prints as follows:

    // +---------------------+
    // | time                |
    // +---------------------+
    // | 2021-07-20 23:28:50 |
    // | 2021-07-20 23:30:30 |
    // +---------------------+

Queries involving a predicate such as time < to_timestamp('2021-07-20 23:29:30') do not filter any rows (even though they should filter the row with 2021-07-20 23:30:30)

To Reproduce

async fn datafusion_reproducer() {
    let array = TimestampNanosecondArray::from(vec![1626823730000000000, 1626823830000000000]);
    let array: ArrayRef = Arc::new(array);
    println!("array[0]: {:?}", array_value_to_string(&array, 0).unwrap());
    println!("array[1]: {:?}", array_value_to_string(&array, 1).unwrap());

    let batch = RecordBatch::try_from_iter(vec![("time", array)]).unwrap();
    let table = MemTable::try_new(batch.schema(), vec![vec![batch]]).unwrap();
    let table = Arc::new(table);

    // select * from t
    // +---------------------+
    // | time                |
    // +---------------------+
    // | 2021-07-20 23:28:50 |
    // | 2021-07-20 23:30:30 |
    // +---------------------+
    run_query(table.clone(), "select * from t").await;

    // Using the following predicate should result in a single row,
    // but instead results in both
    //
    // select * from t where time < to_timestamp('2021-07-20 23:29:30')
    // +---------------------+
    // | time                |
    // +---------------------+
    // | 2021-07-20 23:28:50 |
    // | 2021-07-20 23:30:30 |
    // +---------------------+
    run_query(table.clone(), "select * from t where time < to_timestamp('2021-07-20 23:29:30')").await;



    // explain select * from t where time < to_timestamp('2021-07-20 23:29:30')
    // +---------------+---------------------------------------------------------------+
    // | plan_type     | plan                                                          |
    // +---------------+---------------------------------------------------------------+
    // | logical_plan  | Projection: #t.time                                           |
    // |               |   Filter: #t.time Lt TimestampNanosecond(1626838170000000000) |
    // |               |     TableScan: t projection=Some([0])                         |
    // | physical_plan | ProjectionExec: expr=[time@0 as time]                         |
    // |               |   CoalesceBatchesExec: target_batch_size=4096                 |
    // |               |     FilterExec: time@0 < 1626838170000000000                  |
    // |               |       RepartitionExec: partitioning=RoundRobinBatch(16)       |
    // |               |         MemoryExec: partitions=1, partition_sizes=[1]         |
    // +---------------+---------------------------------------------------------------+

    run_query(table.clone(), "explain select * from t where time < to_timestamp('2021-07-20 23:29:30')").await;
}

#[allow(dead_code)]
async fn run_query(csvdata: Arc<dyn TableProvider>, query: &str)  {


    let mut ctx = ExecutionContext::new();
    ctx.register_table("t", csvdata).unwrap();

    let results = ctx.sql(query)
        .unwrap()
        .collect()
        .await
        .unwrap();

    let pretty = pretty_format_batches(&results).unwrap();
    println!("{}\n{}", query, pretty);
}

Expected behavior
The query should produce a single row with timestamp 2021-07-20 23:28:50

However the actual query returns both rows

Additional context
We saw this in IOx: https://github.com/influxdata/influxdb_iox/issues/2071

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions