Description
When I run RLlib on Ray 1.5.2:
- The resource demands persist after the application finishes. For example, the scheduler still reports the resource demands listed below for a few minutes after the job prints this line:
(pid=191) 2021-08-22 10:45:21,492 INFO tune.py:550 -- Total run time: 1095.71 seconds (1094.69 seconds for the tuning loop).
Demands:
{'CPU_group_8eb7d5e8a4ed413432db93d0b79b3e67': 1.0}: 96+ pending tasks/actors
{'GPU_group_16cd93bbf7607454e10fb4e3334f5da6': 0.001, 'GPU_group_0_16cd93bbf7607454e10fb4e3334f5da6': 0.001}: 1+ pending tasks/actors
{'GPU_group_1431a0326b37900afe3595513b2e1818': 0.001, 'GPU_group_0_1431a0326b37900afe3595513b2e1818': 0.001}: 1+ pending tasks/actors
{'CPU': 1.0, 'GPU': 1.0} * 1, {'CPU': 1.0} * 128 (PACK): 1+ pending placement groups
- RLlib prints an extremely verbose resource list in the status output, with one entry per placement group bundle:
(pid=191) == Status ==
(pid=191) Memory usage on this node: 6.1/31.4 GiB
(pid=191) Using FIFO scheduling algorithm.
(pid=191) Resources requested: 0/296 CPUs, 0/8 GPUs, 0.0/787.44 GiB heap, 0.0/338.81 GiB objects (0.0/1.0 CPU_group_15_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_2_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_0_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_4_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_6_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 GPU_group_0_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 GPU_group_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_12_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_13_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_10_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_1_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_7_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_9_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_3_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_11_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_8_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_14_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_5_8c84f56bef40324a35f6e63418c2a54d, 0.0/129.0 CPU_group_8c84f56bef40324a35f6e63418c2a54d, 0.0/8.0 accelerator_type:T4, 0.0/1.0 CPU_group_116_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_127_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_117_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_119_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_121_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_113_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_124_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_123_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_118_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_115_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_126_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_120_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_114_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_125_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_122_8c84f56bef40324a35f6e63418c2a54d, 
0.0/1.0 CPU_group_112_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_128_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_83_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_85_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_94_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_87_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_90_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_84_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_88_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_82_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_89_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_91_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_92_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_86_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_80_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_93_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_81_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_95_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_100_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_97_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_103_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_108_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_98_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_104_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_111_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_102_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_96_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_99_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_110_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_101_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_106_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_109_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_105_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_107_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_43_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_33_8c84f56bef40324a35f6e63418c2a54d, 
0.0/1.0 CPU_group_36_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_32_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_34_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_35_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_37_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_40_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_39_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_42_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_45_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_44_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_41_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_46_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_47_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_38_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_21_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_18_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_28_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_16_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_19_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_25_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_20_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_27_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_17_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_24_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_22_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_26_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_23_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_30_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_31_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_29_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_71_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_72_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_76_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_68_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_79_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_78_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 
CPU_group_70_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_69_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_67_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_65_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_64_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_75_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_66_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_74_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_73_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_77_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_52_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_48_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_63_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_56_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_54_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_62_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_55_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_59_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_51_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_53_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_57_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_58_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_50_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_49_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_61_8c84f56bef40324a35f6e63418c2a54d, 0.0/1.0 CPU_group_60_8c84f56bef40324a35f6e63418c2a54d)
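To illustrate how the status line above could be condensed, here is a minimal sketch that collapses the per-bundle entries into one line per resource type and placement group. It is not Tune's or RLlib's actual reporting code; `summarize_pg_resources` is a hypothetical helper, and it assumes that the un-indexed entry (for example `CPU_group_<id>`) is the aggregate of the indexed ones, as the 129.0 total in the log suggests.

```python
import re
from collections import defaultdict

# Matches entries such as "0.0/1.0 CPU_group_15_<32 hex chars>":
# used/total, base resource, optional bundle index, placement group ID.
_PG_ENTRY = re.compile(
    r"(?P<used>\d+(?:\.\d+)?)/(?P<total>\d+(?:\.\d+)?) "
    r"(?P<kind>[A-Za-z]+)_group_(?:(?P<index>\d+)_)?(?P<pg_id>[0-9a-f]{32})$"
)


def summarize_pg_resources(entries):
    """Collapse per-bundle entries into one line per (resource, group)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # used, total, bundle count
    passthrough = []
    for entry in entries:
        entry = entry.strip()
        match = _PG_ENTRY.match(entry)
        if match is None:
            # Not a placement group resource, e.g. "0.0/8.0 accelerator_type:T4".
            passthrough.append(entry)
            continue
        if match["index"] is None:
            # Un-indexed entry: assumed to be the aggregate of the indexed
            # bundles, so skip it to avoid double counting.
            continue
        key = (match["kind"], match["pg_id"])
        totals[key][0] += float(match["used"])
        totals[key][1] += float(match["total"])
        totals[key][2] += 1
    lines = [
        f"{used}/{total} {kind} across {count} bundle(s) of group {pg_id[:8]}"
        for (kind, pg_id), (used, total, count) in sorted(totals.items())
    ]
    return lines + passthrough


PG_ID = "8c84f56bef40324a35f6e63418c2a54d"  # ID taken from the log above
sample = [
    f"0.0/1.0 CPU_group_0_{PG_ID}",
    f"0.0/1.0 CPU_group_1_{PG_ID}",
    f"0.0/1.0 GPU_group_0_{PG_ID}",
    f"0.0/1.0 GPU_group_{PG_ID}",
    f"0.0/129.0 CPU_group_{PG_ID}",
    "0.0/8.0 accelerator_type:T4",
]
for line in summarize_pg_resources(sample):
    print(line)
```

With a summary of this kind, the 130+ entries in the log would shrink to a few lines per placement group.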
- RLlib sometimes requests more resources than the cluster can ever provide. When the cluster cannot scale up far enough to accommodate the request, the autoscaler keeps adding nodes and then removing them for being idle, and the job hangs forever. (E.g., it requests resources that would need 200 nodes, but the cluster can scale only to 10 nodes, so it keeps adding 10 nodes and removing them while the trials stay “pending”.)
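The add-and-remove loop could be avoided by rejecting a request up front when it can never fit. The sketch below is only an illustration of such a check, not the autoscaler's actual logic: `infeasible_resources` is a hypothetical helper, the node shape and the 10-node cap are made-up values, and it assumes identical nodes. The bundles mirror the placement group request shown in the Demands output of the first item.

```python
def total_demand(bundles):
    """Sum the resources requested across all placement group bundles."""
    demand = {}
    for bundle in bundles:
        for resource, amount in bundle.items():
            demand[resource] = demand.get(resource, 0.0) + amount
    return demand


def infeasible_resources(bundles, node_resources, max_nodes):
    """Return {resource: (requested, max_available)} for every shortfall.

    This compares totals only, so it is a necessary but not sufficient
    condition: it ignores how bundles pack onto individual nodes.
    """
    shortfalls = {}
    for resource, requested in total_demand(bundles).items():
        available = node_resources.get(resource, 0.0) * max_nodes
        if requested > available:
            shortfalls[resource] = (requested, available)
    return shortfalls


# One trainer bundle plus 128 worker bundles, as in the Demands output.
bundles = [{"CPU": 1.0, "GPU": 1.0}] + [{"CPU": 1.0}] * 128
# Hypothetical node shape and autoscaler cap (assumptions for illustration).
node = {"CPU": 8.0, "GPU": 1.0}
shortfalls = infeasible_resources(bundles, node, max_nodes=10)
if shortfalls:
    print("Request can never be satisfied:", shortfalls)
```

A non-empty result means the trial should fail fast with an error instead of staying pending while nodes are added and removed.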
- I think we should have e2e tests of RLlib with GPUs. These might already exist, but for some reason I am not able to run, for example, either of the following (the cluster keeps adding and removing nodes, as in the third item above):
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml
or
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/impala/atari-impala-large.yaml
- When I run
ANYSCALE_DEBUG=1 RAY_ADDRESS=anyscale://timeout_fix_cluster_final2_aws?cluster_env=riot:5 rllib train -f ../ray/rllib/tuned_examples/compact-regression-test.yaml
I get a lot of:
A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: ffffffffffffffffa70b3f9b10676c460808312e01000000 Worker ID: 1d806191d3304d0dbcc5fabedf3eefd9e6f12694227b34ae602c0203 Node ID: 3d02b42b39be8dbcd291b2611f9c36841f00f38e98c599c55ecfe827 Worker IP address: 192.168.75.4 Worker port: 10059 Worker PID: 446844
(pid=237) 2021-08-22 13:08:42,288 ERROR trial_runner.py:773 -- Trial APEX_BreakoutNoFrameskip-v4_95b82_00015: Error processing event.
(pid=237) Traceback (most recent call last):
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 739, in _process_trial
(pid=237) results = self.trial_executor.fetch_result(trial)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 729, in fetch_result
(pid=237) result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
(pid=237) return func(*args, **kwargs)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1564, in get
(pid=237) raise value.as_instanceof_cause()
(pid=237) ray.exceptions.RayTaskError: ray::APEX.train_buffered() (pid=220341, ip=192.168.75.4)
(pid=237) File "python/ray/_raylet.pyx", line 534, in ray._raylet.execute_task
(pid=237) File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task.function_executor
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
(pid=237) return method(__ray_actor, *args, **kwargs)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 178, in train_buffered
(pid=237) result = self.train()
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 640, in train
(pid=237) raise e
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 629, in train
(pid=237) result = Trainable.train(self)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 237, in train
(pid=237) result = self.step()
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 170, in step
(pid=237) res = next(self.train_exec_impl)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
(pid=237) return next(self.built_iterator)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 1075, in build_union
(pid=237) item = next(it)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
(pid=237) return next(self.built_iterator)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
(pid=237) for item in it:
(pid=237) [Previous line repeated 1 more time]
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
(pid=237) for item in it:
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/iter.py", line 551, in base_iterator
(pid=237) batch = ray.get(obj_ref)
(pid=237) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
(pid=237) return func(*args, **kwargs)
(pid=237) ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
CC @wuisawesome