[Fix] kubectl ray create cluster config file CPU overwrites the whole resource requests and limits #3811


Open

CheyuWu wants to merge 6 commits into master

Conversation

@CheyuWu (Contributor) commented Jun 19, 2025

Why are these changes needed?

Root Cause

This issue occurs because unmarshaling the config file into &config overrides the defaults that were pre-populated there: any default the user does not repeat in the file (for example a worker group's memory) is lost.

config := newRayClusterConfigWithDefaults()
if err := yaml.UnmarshalStrict(data, &config); err != nil {
    return nil, fmt.Errorf("failed to parse YAML: %w", err)
}
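
For illustration, here is a minimal sketch of the override with stand-in types rather than the plugin's real RayClusterConfig, assuming gopkg.in/yaml.v2 (which provides UnmarshalStrict): decoding a YAML sequence rebuilds the slice, so defaults baked into the pre-populated worker-group entries are discarded.

package main

import (
    "fmt"

    yaml "gopkg.in/yaml.v2" // assumed library; the plugin's actual YAML dependency may differ
)

// Illustrative stand-ins for the real config types.
type workerGroup struct {
    CPU    string `yaml:"cpu,omitempty"`
    Memory string `yaml:"memory,omitempty"`
}

type clusterConfig struct {
    WorkerGroups []workerGroup `yaml:"worker-groups,omitempty"`
}

func main() {
    // Pre-populated defaults, standing in for newRayClusterConfigWithDefaults().
    cfg := clusterConfig{WorkerGroups: []workerGroup{{CPU: "2", Memory: "4Gi"}}}

    // The user's file only sets cpu; decoding the sequence rebuilds the slice,
    // so the worker group's default memory is lost.
    data := []byte("worker-groups:\n  - cpu: \"3\"\n")
    if err := yaml.UnmarshalStrict(data, &cfg); err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", cfg.WorkerGroups) // [{CPU:3 Memory:}]
}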

How I Solved It

I created a MergeWithDefaults function to merge the RayClusterConfig field by field.
I know this approach isn't particularly elegant, but I couldn’t find a better way to merge deeply nested structs.
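
Roughly, the idea looks like this (a sketch with illustrative pointer fields, not the PR's actual MergeWithDefaults or WorkerGroup definitions): every field the user left unset falls back to the default.

// Sketch only; the real field names and types differ.
type workerGroupConfig struct {
    CPU    *string
    Memory *string
    GPU    *string
}

func mergeWorkerGroup(defaults, override workerGroupConfig) workerGroupConfig {
    merged := defaults
    if override.CPU != nil {
        merged.CPU = override.CPU
    }
    if override.Memory != nil {
        merged.Memory = override.Memory
    }
    if override.GPU != nil {
        merged.GPU = override.GPU
    }
    return merged
}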

If you have a better suggestion, I'd be happy to hear it.

Manual testing

config.yaml

worker-groups:
  - cpu: 3

Dry run the command

$ kubectl ray create cluster test --file ./config.yaml --dry-run | tail
containers:
- image: rayproject/ray:2.46.0
  name: ray-worker
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: "3"
      memory: 4Gi

Run the command

$ kubectl ray create cluster test --file ./config.yaml

(screenshot of the command output)

config2.yaml

namespace: my-namespace
labels:
  kubernetes: 123
  kubectl: 456
ray-version: 2.0.0
head:
  cpu: 100
  memory: 200Gi

worker-groups:
  - gpu: 1
  - cpu: 2
    name: worker-2

Dry run the command

$ kubectl ray create cluster test --file ./config2.yaml --dry-run | tail

Result:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    kubectl: "456"
    kubernetes: "123"
  name: test
  namespace: my-namespace
spec:
  headGroupSpec:
    template:
      spec:
        containers:
        - image: rayproject/ray:2.46.0
          name: ray-head
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              memory: 200Gi
            requests:
              cpu: "100"
              memory: 200Gi
  rayVersion: 2.0.0
  workerGroupSpecs:
  - groupName: default-group
    replicas: 1
    template:
      spec:
        containers:
        - image: rayproject/ray:2.46.0
          name: ray-worker
          resources:
            limits:
              memory: 4Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "2"
              memory: 4Gi
              nvidia.com/gpu: "1"
  - groupName: worker-2
    replicas: 1
    template:
      spec:
        containers:
        - image: rayproject/ray:2.46.0
          name: ray-worker
          resources:
            limits:
              memory: 4Gi
            requests:
              cpu: "2"
              memory: 4Gi

Related issue number

Closes #3803

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@MortalHappiness (Member) left a comment

LGTM. You can find if there are any helper libraries or tools to handle object merging more elegantly as follow-up.

@kevin85421 (Member) left a comment

I don’t have enough time to review this PR carefully, but we should be very cautious when implementing this kind of complex merge logic. It’s hard to maintain and prone to errors.

@kevin85421 (Member) left a comment

We should find a way to better support the merge logic. In addition, this is not a release blocker.

@CheyuWu (Contributor, Author) commented Jul 5, 2025

Hi @kevin85421, I have refactored the merge logic. PTAL

Comment on lines 464 to 468
config.Namespace = overrideConfig.Namespace
config.Name = overrideConfig.Name
config.ServiceAccount = overrideConfig.ServiceAccount
config.GKE = overrideConfig.GKE
config.Autoscaler = overrideConfig.Autoscaler

@CheyuWu (Contributor, Author)

These fields are not in the default config, so I assign the values directly.

@@ -1108,3 +1111,275 @@ func TestGetGCSFuseVolumeAttributes(t *testing.T) {
result := getGCSFuseVolumeAttributes(config)
assert.Equal(t, expected, result)
}

func TestMergeWithDefaults(t *testing.T) {

A Member left a comment

We should avoid adding tests for private methods. Instead, test public methods if possible. In this case, write tests for ParseConfigFile.

@CheyuWu (Contributor, Author) commented Jul 13, 2025

@kevin85421, I have removed the private-function tests and moved them to TestParseConfigFile.

return config, nil
}

func mergeWithDefaultConfig(overrideConfig *RayClusterConfig) (*RayClusterConfig, error) {

A Contributor left a comment

Could this function be simplified a bit? It seems unnecessary to handle some of the merges manually.

https://go.dev/play/p/KqVRVqXQVJO

@CheyuWu (Contributor, Author) commented Jul 22, 2025

Hi @fscnick, thanks for reviewing.
I initially considered using mergo.Merge(config, &overrideConfig, mergo.WithOverride), but it introduces an unintended side effect when merging slice fields like worker-groups.

For example, consider the following minimal config.yaml:

worker-groups:
  - cpu: 3

If we directly call:

config := newRayClusterConfigWithDefaults()
err := mergo.Merge(config, &overrideConfig, mergo.WithOverride)

Then overrideConfig.WorkerGroups will completely overwrite the default WorkerGroups defined in config, instead of preserving and partially overriding fields (e.g., memory limits or container settings). This leads to missing default values in the final rendered output. As a result, the merged config will produce:

resources:
  limits: {}
  requests:
    cpu: "3"

But the expected output should retain default memory settings like:

resources:
  limits:
    memory: 4Gi
  requests:
    cpu: "3"
    memory: 4Gi

So, I used a more sophisticated approach to handle this.
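
To make the side effect concrete, here is a minimal sketch with illustrative stand-in types (mergo is assumed to come from dario.cat/mergo, formerly github.com/imdario/mergo): with WithOverride, the user's one-element slice replaces the defaulted slice wholesale.

package main

import (
    "fmt"

    "dario.cat/mergo" // assumed import path; formerly github.com/imdario/mergo
)

// Illustrative stand-ins for the real config types.
type workerGroup struct {
    CPU    string
    Memory string
}

type clusterConfig struct {
    WorkerGroups []workerGroup
}

func main() {
    defaults := clusterConfig{WorkerGroups: []workerGroup{{CPU: "2", Memory: "4Gi"}}}
    override := clusterConfig{WorkerGroups: []workerGroup{{CPU: "3"}}} // the user only set cpu

    // WithOverride replaces the whole WorkerGroups slice, so the default memory is dropped.
    if err := mergo.Merge(&defaults, override, mergo.WithOverride); err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", defaults.WorkerGroups) // [{CPU:3 Memory:}]
}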

@CheyuWu (Contributor, Author) commented Jul 22, 2025

If we omit mergo.WithOverride and simply call:

config := newRayClusterConfigWithDefaults()
err := mergo.Merge(config, &overrideConfig)

Then the fields in overrideConfig will only be merged if they are not already set in config. As a result, the WorkerGroups defined in config.yaml will be ignored entirely if the default already includes a value. This leads to the user-provided override being silently ignored.

For example, the output might look like this:

containers:
- image: rayproject/ray:2.46.0
  name: ray-worker
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: "2"
      memory: 4Gi

Here, the cpu: 3 specified by the user in config.yaml is not reflected in the final output — the default cpu: 2 remains.
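
Reusing the illustrative types from the sketch in the previous comment, dropping WithOverride flips the problem: values already set in the defaults win, so the user's override is silently ignored.

// Same defaults and override as in the previous sketch; only the options change.
if err := mergo.Merge(&defaults, override); err != nil {
    panic(err)
}
fmt.Printf("%+v\n", defaults.WorkerGroups) // [{CPU:2 Memory:4Gi}] (the user's cpu: 3 is ignored)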

A Contributor left a comment

There are three thoughts about handling this slice.

  • detach the WorkerGroups, merge it manually and attach back to config.
// detach WorkerGroups
overrideConfigWG := overrideConfig.WorkerGroups
overrideConfig.WorkerGroups = nil

// merge config and overrideConfig
...

// merge WorkerGroups
...

// put back
config.WorkerGroups = mergedWorkerGroups
  • use Transformer in mergo to skip WorkerGroups and merge it manually.
  • use Transformer in mergo to merge it directly (see the sketch after this list).
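
A rough sketch of the Transformer idea, again with illustrative stand-in types (defaultWorkerGroup is a hypothetical helper, and mergo is assumed to come from dario.cat/mergo): the transformer takes over merging of the []workerGroup field so each user entry is merged onto a defaulted entry field by field instead of the slice being replaced wholesale.

package main

import (
    "fmt"
    "reflect"

    "dario.cat/mergo" // assumed import path; formerly github.com/imdario/mergo
)

// Illustrative stand-ins; the real RayClusterConfig/WorkerGroup types differ.
type workerGroup struct {
    Name   string
    CPU    string
    Memory string
}

type clusterConfig struct {
    WorkerGroups []workerGroup
}

// defaultWorkerGroup is a hypothetical helper standing in for the plugin's defaults.
func defaultWorkerGroup() workerGroup {
    return workerGroup{Name: "default-group", CPU: "2", Memory: "4Gi"}
}

// workerGroupTransformer merges each override entry onto a defaulted entry
// instead of letting mergo replace the whole slice.
type workerGroupTransformer struct{}

func (workerGroupTransformer) Transformer(typ reflect.Type) func(dst, src reflect.Value) error {
    if typ != reflect.TypeOf([]workerGroup{}) {
        return nil // let mergo handle every other type as usual
    }
    return func(dst, src reflect.Value) error {
        if !dst.CanSet() || src.Len() == 0 {
            return nil
        }
        merged := make([]workerGroup, src.Len())
        for i := 0; i < src.Len(); i++ {
            base := defaultWorkerGroup()
            if i < dst.Len() {
                base = dst.Index(i).Interface().(workerGroup)
            }
            override := src.Index(i).Interface().(workerGroup)
            if err := mergo.Merge(&base, override, mergo.WithOverride); err != nil {
                return err
            }
            merged[i] = base
        }
        dst.Set(reflect.ValueOf(merged))
        return nil
    }
}

func main() {
    config := clusterConfig{WorkerGroups: []workerGroup{defaultWorkerGroup()}}
    override := clusterConfig{WorkerGroups: []workerGroup{{CPU: "3"}, {Name: "worker-2"}}}

    err := mergo.Merge(&config, override, mergo.WithOverride, mergo.WithTransformers(workerGroupTransformer{}))
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", config.WorkerGroups)
    // [{Name:default-group CPU:3 Memory:4Gi} {Name:worker-2 CPU:2 Memory:4Gi}]
}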

@CheyuWu (Contributor, Author)

Great idea!! Thanks.

@CheyuWu (Contributor, Author) commented Jul 24, 2025

Hi @fscnick, I have updated the code and tested manually. PTAL

Comment on lines 453 to 486
overrideConfigWG := overrideConfig.WorkerGroups
overrideConfig.WorkerGroups = nil

config := newRayClusterConfigWithDefaults()
err = mergo.Merge(config, &overrideConfig, mergo.WithOverride)
if err != nil {
    return nil, fmt.Errorf("failed to merge config with defaults: %w", err)
}
// merge WorkerGroups and keep the default values for missing fields
// if overrideConfigWG is not nil, we will merge the worker groups from the config file
// and keep the default values for missing fields
if overrideConfigWG != nil {
    for len(config.WorkerGroups) < len(overrideConfigWG) {
        config.WorkerGroups = append(config.WorkerGroups, WorkerGroup{
            Replicas: util.DefaultWorkerReplicas,
            CPU:      ptr.To(util.DefaultWorkerCPU),
            Memory:   ptr.To(util.DefaultWorkerMemory),
        })
    }
    for i, workerGroup := range overrideConfigWG {
        err := mergo.Merge(&config.WorkerGroups[i], workerGroup, mergo.WithOverride)
        if err != nil {
            return nil, fmt.Errorf("failed to merge worker group %d: %w", i, err)
        }
    }
}

A Contributor left a comment

nit: wrap the merging code for readability, as you did previously.

@CheyuWu (Contributor, Author)

No problem

@CheyuWu (Contributor, Author) commented Jul 25, 2025

Hi @kevin85421, PTAL.

@@ -78,6 +78,7 @@ helm uninstall raycluster
| nameOverride | string | `"kuberay"` | String to partially override release name. |
| fullnameOverride | string | `""` | String to fully override release name. |
| imagePullSecrets | list | `[]` | Secrets with credentials to pull images from a private registry |
| gcsFaultTolerance.enabled | bool | `false` | |

A Contributor left a comment

Please rebase, as this has already been fixed in #3890.
