
[ray[default]==2.22.0] .write_parquet(s3_path, mode="overwrite") doesn't work properly #47799

@nthanapaisal

Description


What happened + What you expected to happen

Hello,

I am currently experiencing an issue trying to overwrite an existing Parquet table in S3. I have tested the basic code in snippet (1) below and it works. However, when I implement more complicated logic, it doesn't overwrite the table but appends to it instead (snippet 2).

I wonder why the basic snippet works, but not the pipeline that chains read_parquet, map_batches, and write_parquet?

Versions / Dependencies

ray[default]==2.22.0
ray cluster on top of k8s spark cluster using docker image
and also tried on Databricks ML g5.xlarge, 14LTS, spark 3.5.0 cluster

Reproduction script

Snippet 1:

import pandas as pd
import ray

ray.init()

data = {
    'column1': [1, 2, 5],
    'column2': ['a', 'b', 'h']
}

pandas_df = pd.DataFrame(data)

ray_df = ray.data.from_pandas(pandas_df)

s3_path = "s3://path/"

ray_df.write_parquet(s3_path, mode="overwrite")
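For reference, one way to check whether the overwrite actually replaced the old files is to list the objects under the destination prefix before and after the write. This is a minimal diagnostic sketch, assuming default S3 credentials and that "path/" is the bucket/key form of s3_path (both are placeholders, not the real paths):

import pyarrow.fs as pafs

fs = pafs.S3FileSystem()
# List every object under the prefix; files left over from a previous
# write would show up here alongside the newly written ones.
selector = pafs.FileSelector("path/", recursive=True)
print([info.path for info in fs.get_file_info(selector)])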

Snippet 2:

from typing import Dict

import numpy as np
import ray


class DummyClass:
    def __init__(self):
        # model setup here
        pass

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # llm inferencing here
        return {
            "key1": batch["key1"],
            "key2": batch["key2"]
        }


ray.init()
input_df = ray.data.read_parquet("s3://path/input_data")
ray_df = input_df.map_batches(DummyClass, concurrency=3, batch_size=64, num_gpus=1)
ray_df.write_parquet("s3://path/output", mode="overwrite")
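Until mode="overwrite" behaves as expected here, a possible workaround is to clear the output prefix yourself before writing. A minimal sketch, assuming pyarrow is available and that "path/output" is the bucket/key form of the destination (the prefix and the FileNotFoundError guard are illustrative assumptions, not part of the report above):

import pyarrow.fs as pafs

fs = pafs.S3FileSystem()
try:
    # Delete everything under the destination prefix so the subsequent
    # write_parquet call cannot end up appending to leftover files.
    fs.delete_dir_contents("path/output")
except FileNotFoundError:
    pass  # nothing to clean up on the first run

ray_df.write_parquet("s3://path/output")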

Issue Severity

Medium: It is a significant difficulty but I can work around it.


Labels

bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
