-
Notifications
You must be signed in to change notification settings - Fork 579
[Fix] kubectl ray create cluster config file CPU overwrites the whole resource requests and limits #3811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. You can find if there are any helper libraries or tools to handle object merging more elegantly as follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t have enough time to review this PR carefully, but we should be very cautious when implementing this kind of complex merge logic. It’s hard to maintain and prone to errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should find a way to better support the merge logic. In addition, this is not a release blocker.
Hi @kevin85421 , I have refactored the merge logic. PTAL |
config.Namespace = overrideConfig.Namespace | ||
config.Name = overrideConfig.Name | ||
config.ServiceAccount = overrideConfig.ServiceAccount | ||
config.GKE = overrideConfig.GKE | ||
config.Autoscaler = overrideConfig.Autoscaler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not in default config, so I assign the value directly
@@ -1108,3 +1111,275 @@ func TestGetGCSFuseVolumeAttributes(t *testing.T) { | |||
result := getGCSFuseVolumeAttributes(config) | |||
assert.Equal(t, expected, result) | |||
} | |||
|
|||
func TestMergeWithDefaults(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid adding tests for private methods. Instead, test public methods if possible. In this case, write tests for ParseConfigFile
.
@kevin85421, I have removed the private function testing and move them to |
return config, nil | ||
} | ||
|
||
func mergeWithDefaultConfig(overrideConfig *RayClusterConfig) (*RayClusterConfig, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this function be simplified a bit ? It seems not need to handle some merges manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @fscnick thanks for reviewing
I initially considered using mergo.Merge(config, &overrideConfig, mergo.WithOverride)
, but it introduces an unintended side effect when merging slice fields like worker-groups
.
For example, consider the following minimal config.yaml
:
worker-groups:
- cpu: 3
If we directly call:
config := newRayClusterConfigWithDefaults()
err := mergo.Merge(config, &overrideConfig, mergo.WithOverride)
Then overrideConfig.WorkerGroups
will completely overwrite the default WorkerGroups
defined in config, instead of preserving and partially overriding fields (e.g., memory limits or container settings). This leads to missing default values in the final rendered output. As a result, the merged config will produce:
resources:
limits: {}
requests:
cpu: "3"
But the expected output should retain default memory settings like:
resources:
limits:
memory: 4Gi
requests:
cpu: "3"
memory: 4Gi
So, I used a more sophisticated approach to handle this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we omit mergo.WithOverride
and simply call:
config := newRayClusterConfigWithDefaults()
err := mergo.Merge(config, &overrideConfig)
Then the fields in overrideConfig will only be merged if they are not already set in config
. As a result, the WorkerGroups defined in config.yaml
will be ignored entirely if the default already includes a value. This leads to the user-provided override being silently ignored.
For example, the output might look like this:
containers:
- image: rayproject/ray:2.46.0
name: ray-worker
resources:
limits:
memory: 4Gi
requests:
cpu: "2"
memory: 4Gi
Here, the cpu: 3
specified by the user in config.yaml
is not reflected in the final output — the default cpu: 2
remains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three thoughts about handling this slice.
- detach the
WorkerGroups
, merge it manually and attach back to config.
// detach WorkerGroups
overrideConfigWG := overrideConfig.WorkerGroups
overrideConfig.WorkerGroups = nil
// merge config and overrideConfig
...
// merge WorkerGroups
...
// put back
config.WorkerGroups = mergedWorkerGroups
- use
Transformer
in mergo to skipWorkerGroups
and merge it manually. - use
Transformer
in mergo to merge it directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea !! thx
Hi @fscnick ,I have updated the code and test manually, PTAL |
overrideConfigWG := overrideConfig.WorkerGroups | ||
overrideConfig.WorkerGroups = nil | ||
|
||
config := newRayClusterConfigWithDefaults() | ||
err = mergo.Merge(config, &overrideConfig, mergo.WithOverride) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to merge config with defaults: %w", err) | ||
} | ||
// merge WorkerGroups and keep the default values for missing fields | ||
// if overrideConfigWG is not nil, we will merge the worker groups from the config file | ||
// and keep the default values for missing fields | ||
if overrideConfigWG != nil { | ||
for len(config.WorkerGroups) < len(overrideConfigWG) { | ||
config.WorkerGroups = append(config.WorkerGroups, WorkerGroup{ | ||
Replicas: util.DefaultWorkerReplicas, | ||
CPU: ptr.To(util.DefaultWorkerCPU), | ||
Memory: ptr.To(util.DefaultWorkerMemory), | ||
}) | ||
} | ||
for i, workerGroup := range overrideConfigWG { | ||
err := mergo.Merge(&config.WorkerGroups[i], workerGroup, mergo.WithOverride) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to merge worker group %d: %w", i, err) | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wrap the merging code for readability as previously did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem
Hi @kevin85421 PTAL |
helm-chart/ray-cluster/README.md
Outdated
@@ -78,6 +78,7 @@ helm uninstall raycluster | |||
| nameOverride | string | `"kuberay"` | String to partially override release name. | | |||
| fullnameOverride | string | `""` | String to fully override release name. | | |||
| imagePullSecrets | list | `[]` | Secrets with credentials to pull images from a private registry | | |||
| gcsFaultTolerance.enabled | bool | `false` | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase as this had been fixed in #3890
Signed-off-by: Cheyu Wu <[email protected]>
Why are these changes needed?
Root Cause
This issue occurs because the
data
overrides the default values in&config
, causing the defaults to be lost.How I Solved It
I created a
MergeWithDefaults
function to merge theRayClusterConfig
field by field.I know this approach isn't particularly elegant, but I couldn’t find a better way to merge deeply nested structs.
If you have a better suggestion, I'd be happy to hear it.
Manual testing
config.yaml
Dry Run the command
Run the command
$ kubectl ray create cluster test --file ./config.yaml
config2.yaml
Dry run the command
Result:
Related issue number
Closes #3803
Checks