-
Notifications
You must be signed in to change notification settings - Fork 382
Compare Schema and StructType fields irrespective of ordering
#700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| ) | ||
|
|
||
| expected = StructWriter(((1, DoubleWriter()), (0, StringWriter()))) | ||
| expected = StructWriter(((0, DoubleWriter()), (1, StringWriter()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this change is semantically correct. This test is affected because resolve_writer compares the two given schemas (record_schema and file_schema)
iceberg-python/pyiceberg/avro/resolver.py
Lines 200 to 214 in 7bd5d9e
| def resolve_writer( | |
| record_schema: Union[Schema, IcebergType], | |
| file_schema: Union[Schema, IcebergType], | |
| ) -> Writer: | |
| """Resolve the file and read schema to produce a reader. | |
| Args: | |
| record_schema (Schema | IcebergType): The schema of the record in memory. | |
| file_schema (Schema | IcebergType): The schema of the file that will be written | |
| Raises: | |
| NotImplementedError: If attempting to resolve an unrecognized object type. | |
| """ | |
| if record_schema == file_schema: | |
| return construct_writer(file_schema) |
Previously, comparison returned False due to different ordering
| ) | ||
|
|
||
| with tbl.update_schema() as schema_update: | ||
| schema_update.move_before("struct.data", "struct.count") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes me think that the Field ordering does matter...
@Fokko wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thanks for digging into this 🎉
Technically the ordering does not matter when you write the data, because when reading we're correcting the order using this one:
iceberg-python/pyiceberg/io/pyarrow.py
Line 1143 in d02d7a1
| def to_requested_schema(requested_schema: Schema, file_schema: Schema, table: pa.Table) -> pa.Table: |
Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced in #523
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced
We're relying on pyarrow cast to translate some pyarrow data types into corresponding Iceberg-supported data types. Such as large_string -> string. Since LargeString is not an Iceberg-supported data type, we cannot use to_requested_schema. Maybe it's possible to cast pyarrow LargeString into Iceberg String.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to do that. The String and LargeString are an Arrow encoding detail (similar to the categorial type). Maybe we should have a different version of to_requested_schema where we don't cast, and just keep the original types? If the types are incompatible (for example, the field-id points to a string in the schema, and you try to write a boolean, it should fail).
|
Closing in favor of #829 |
Fixes #674
SchemaandStructTypefieldsvariable is represented byTuple, which means that ordering matters when performing comparison.Two
Schemas with the samefieldsin different order should be consider the sameiceberg-python/pyiceberg/schema.py
Line 88 in 7bd5d9e
iceberg-python/pyiceberg/types.py
Line 346 in 7bd5d9e