Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
Description
What happened + What you expected to happen
Hello,
I am currently experiencing an issue when trying to overwrite an existing Parquet table in S3. I have tested the basic code in snippet 1 below and it works. However, when I implement more complicated logic, it does not overwrite the table but appends to it instead (snippet 2).
Why does the basic snippet work, but not the pipeline that chains read_parquet, map_batches, and write_parquet?
Versions / Dependencies
ray[default]==2.22.0
ray cluster on top of k8s spark cluster using docker image
and also tried on Databricks ML g5.xlarge, 14LTS, spark 3.5.0 cluster
Reproduction script
Snippet 1:
import ray
import pandas as pd

ray.init()

data = {
    'column1': [1, 2, 5],
    'column2': ['a', 'b', 'h']
}
pandas_df = pd.DataFrame(data)
ray_df = ray.data.from_pandas(pandas_df)
s3_path = "s3://path/"
ray_df.write_parquet(s3_path, mode="overwrite")
Snippet 2:
import ray
from typing import Dict

import numpy as np

ray.init()

class DummyClass:
    def __init__(self):
        pass  # model setup would go here

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # llm inferencing here
        return {
            "key1": batch["key1"],
            "key2": batch["key2"]
        }

input_df = ray.data.read_parquet("s3://path/input_data")
ray_df = input_df.map_batches(DummyClass, concurrency=3, batch_size=64, num_gpus=1)
ray_df.write_parquet("s3://path/output", mode="overwrite")
Issue Severity
Medium: It is a significant difficulty but I can work around it.