Not able to read nano-second timestamp columns in 1.0 parquet files written by pyarrow #455

@houqp

Describe the bug

Here is a pandas dataframe with a nanosecond-precision timestamp index named Date:

>>> hist.index
DatetimeIndex(['1986-03-13', '1986-03-14', '1986-03-17', '1986-03-18',
               '1986-03-19', '1986-03-20', '1986-03-21', '1986-03-24',
               '1986-03-25', '1986-03-26',
               ...
               '2021-05-28', '2021-06-01', '2021-06-02', '2021-06-03',
               '2021-06-04', '2021-06-07', '2021-06-08', '2021-06-09',
               '2021-06-10', '2021-06-11'],
              dtype='datetime64[ns]', name='Date', length=8885, freq=None)

When this dataframe is written in Parquet 1.0 format, pyarrow stores the Date column in microsecond units. pyarrow also reads the Date column back with microsecond precision:

>>> from pyarrow.parquet import ParquetFile
>>> pp = ParquetFile("test_data/msft.parquet")
>>> pp.metadata.schema
<pyarrow._parquet.ParquetSchema object at 0x7f720d1bbac0>
required group field_id=0 schema {
  optional double field_id=1 Open;
  optional double field_id=2 High;
  optional double field_id=3 Low;
  optional double field_id=4 Close;
  optional int64 field_id=5 Volume;
  optional double field_id=6 Dividends;
  optional double field_id=7 StockSplits;
  optional int64 field_id=8 Date (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}

But when the file is loaded with the Arrow parquet crate, the Date column is incorrectly loaded as a nanosecond timestamp type.
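
To illustrate why the unit matters, here is a small arithmetic sketch using only the Python standard library. The issue does not show the values the crate returns, so this rests on an assumption: the stored int64 is kept unchanged and only its unit label changes from microseconds to nanoseconds. The helper names `to_micros` and `from_nanos` are made up for this example.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_micros(ts):
    """Microseconds since the Unix epoch, as Parquet stores the column."""
    return (ts - EPOCH) // timedelta(microseconds=1)


def from_nanos(value):
    """Interpret an integer as nanoseconds since the Unix epoch.

    datetime only has microsecond resolution, so sub-microsecond
    digits are dropped.
    """
    return EPOCH + timedelta(microseconds=value // 1000)


# First date in the index from this report.
stored = to_micros(datetime(1986, 3, 13))

# Assumption: the raw value is relabelled as nanoseconds without rescaling.
misread = from_nanos(stored)

print(stored)   # microsecond count written to the file
print(misread)  # the same integer read as nanoseconds
```

Under that assumption every timestamp shrinks by a factor of 1000 relative to the epoch, so dates from 1986 onward would all land in January 1970.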

To Reproduce

Here is a sample file to reproduce the issue: https://github.com/roapi/roapi/files/6599704/msft.parquet.zip.

The file can be regenerated with the following Python code:

import yfinance as yf
hist = yf.Ticker('MSFT').history(period="max")
hist.to_parquet('msft.parquet')

Expected behavior

The Date column should be loaded with microsecond precision.

Additional context

The Arrow parquet crate handles Parquet 2.0 files without any issue.

Initially reported in roapi/roapi#42.

Here is the decoded IPC field from the 'ARROW:schema' metadata for the Date column, as seen by the arrow crate:

Field {
    name: Some(
        "Date",
    ),
    nullable: true,
    type_type: Timestamp,
    type_: Timestamp {
        unit: NANOSECOND,
        timezone: None,
    },
    dictionary: None,
    children: Some(
        [],
    ),
    custom_metadata: None,
}
