Description
Describe the bug
Here is a pandas dataframe with a nanosecond-precision timestamp index named Date:
>>> hist.index
DatetimeIndex(['1986-03-13', '1986-03-14', '1986-03-17', '1986-03-18',
'1986-03-19', '1986-03-20', '1986-03-21', '1986-03-24',
'1986-03-25', '1986-03-26',
...
'2021-05-28', '2021-06-01', '2021-06-02', '2021-06-03',
'2021-06-04', '2021-06-07', '2021-06-08', '2021-06-09',
'2021-06-10', '2021-06-11'],
dtype='datetime64[ns]', name='Date', length=8885, freq=None)
When this dataframe is stored in Parquet 1.0 format, pyarrow writes the Date column with microsecond units. pyarrow also reports microsecond precision for the Date column when it loads the file:
>>> from pyarrow.parquet import ParquetFile
>>> pp = ParquetFile("test_data/msft.parquet")
>>> pp.metadata.schema
<pyarrow._parquet.ParquetSchema object at 0x7f720d1bbac0>
required group field_id=0 schema {
optional double field_id=1 Open;
optional double field_id=2 High;
optional double field_id=3 Low;
optional double field_id=4 Close;
optional int64 field_id=5 Volume;
optional double field_id=6 Dividends;
optional double field_id=7 StockSplits;
optional int64 field_id=8 Date (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}
But when the file is loaded with the Arrow parquet crate, the Date column is incorrectly loaded as a nanosecond timestamp type.
To Reproduce
Here is a sample file to reproduce the issue: https://github.com/roapi/roapi/files/6599704/msft.parquet.zip.
The file can be reproduced with the following python code:
import yfinance as yf
hist = yf.Ticker('MSFT').history(period="max")
hist.to_parquet('msft.parquet')
Expected behavior
The Date column should be loaded with microsecond precision.
Additional context
The Arrow parquet crate handles Parquet 2.0 files without any issue.
Initially reported in roapi/roapi#42.
Here is the decoded IPC field from the 'ARROW:schema' metadata for the Date column, as seen by the arrow crate:
Field {
    name: Some(
        "Date",
    ),
    nullable: true,
    type_type: Timestamp,
    type_: Timestamp {
        unit: NANOSECOND,
        timezone: None,
    },
    dictionary: None,
    children: Some(
        [],
    ),
    custom_metadata: None,
}