[core][datasets] Worker raylets OOM when using smaller machines and default object store memory #24176

@stephanie-wang

What happened + What you expected to happen

Currently, the datasets_shuffle_* nightly tests use large 32-vCPU machines and limit the object store memory to only a fraction of the available RAM. This means we are probably underutilizing the machines, and the setup is not very representative of a real deployment either. Previous attempts to use smaller instance types or a larger object store memory caused worker raylet OOMs, such as in datasets_shuffle_random_shuffle_1tb.
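To make the memory trade-off concrete, here is a rough sketch of how an object store cap chosen as a fraction of RAM splits a node's memory. The helper name `object_store_bytes`, the 128 GiB node size, and the 0.25 fraction are all hypothetical, picked only for illustration; they are not the values used by the nightly tests.

```python
GIB = 1024 ** 3


def object_store_bytes(total_ram_bytes: int, fraction: float) -> int:
    """Return an object store size in bytes as a fraction of total RAM.

    Hypothetical helper for illustration; not part of Ray.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return int(total_ram_bytes * fraction)


# Assumed node size and fraction (not figures from this issue).
total_ram = 128 * GIB
store = object_store_bytes(total_ram, 0.25)

# Whatever is not reserved for the object store is what remains for
# worker heap memory, the raylet, and the OS.
heap_headroom = total_ram - store

# The resulting byte count is the kind of value that can be passed to
# `ray.init(object_store_memory=...)` or `ray start --object-store-memory=...`.
print(f"object store: {store // GIB} GiB, headroom: {heap_headroom // GIB} GiB")
```

Raising the fraction or shrinking the instance both reduce the headroom left outside the object store, which is the regime where the OOMs described here show up.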

Here's some example output from dmesg on a worker raylet.

Versions / Dependencies

2.0dev

Reproduction script

Run the datasets_shuffle_* nightly tests after changing the instance type in the cluster config.

Issue Severity

No response

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't)
