Description
What happened + What you expected to happen
For the past few days, all training runs have been failing between 6 and 10 hours into training. I get this output:
```
(raylet) [2023-09-18 05:59:34,944 C 19680 8204] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
(raylet) *** StackTrace Information ***
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) recalloc
(raylet) BaseThreadInitThunk
(raylet) RtlUserThreadStart
(raylet)
(RolloutWorker pid=21400) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:
(RolloutWorker pid=6816) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
2023-09-18 05:59:38,368 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00001
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Trial PPO_CustomEnv-v0_ff24b_00001 errored after 53 iterations at 2023-09-18 05:59:38. Total running time: 6hr 29min 28s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00001_1_clip_param=0.1429,gamma=0.9634,kl_coeff=0.0016,kl_target=0.0028,lambda=0.9993,lr=0.0019,lr=0.0012_2023-09-17_23-30-10\error.txt
╭───────────────────────────────────────────────────────────╮
│ Trial PPO_CustomEnv-v0_ff24b_00001 result │
├───────────────────────────────────────────────────────────┤
│ episodes_total 15 │
│ evaluation/sampler_results/episode_reward_mean nan │
│ num_env_steps_sampled 15900 │
│ num_env_steps_trained 15900 │
│ sampler_results/episode_len_mean 1031 │
│ sampler_results/episode_reward_mean -13584.7 │
╰───────────────────────────────────────────────────────────╯
2023-09-18 05:59:38,462 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00000
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Trial PPO_CustomEnv-v0_ff24b_00000 errored after 52 iterations at 2023-09-18 05:59:38. Total running time: 6hr 29min 28s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00000_0_clip_param=0.0628,gamma=0.8613,kl_coeff=0.0089,kl_target=0.0021,lambda=0.9291,lr=0.0013,lr=0.0030_2023-09-17_23-30-10\error.txt
╭───────────────────────────────────────────────────────────╮
│ Trial PPO_CustomEnv-v0_ff24b_00000 result │
├───────────────────────────────────────────────────────────┤
│ episodes_total 15 │
│ evaluation/sampler_results/episode_reward_mean nan │
│ num_env_steps_sampled 15600 │
│ num_env_steps_trained 15600 │
│ sampler_results/episode_len_mean 1031 │
│ sampler_results/episode_reward_mean -17543.1 │
╰───────────────────────────────────────────────────────────╯
2023-09-18 05:59:49,841 WARNING worker.py:2071 -- The node with node id: 2628e1894464566f5f0e56ebf56cee56db24835db98d85153a0d0172 and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
2023-09-18 05:59:49,857 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00000
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: 38a34301b858e9124e3ad4f501000000
namespace: 821b5c5c-a045-4b82-b2c8-e052ba786a9c
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 127.0.0.1 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
Trial PPO_CustomEnv-v0_ff24b_00000 errored after 52 iterations at 2023-09-18 05:59:49. Total running time: 6hr 29min 39s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00000_0_clip_param=0.0628,gamma=0.8613,kl_coeff=0.0089,kl_target=0.0021,lambda=0.9291,lr=0.0013,lr=0.0030_2023-09-17_23-30-10\error.txt
2023-09-18 05:59:49,888 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00001
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: aeca46f4aa7611558d9fd36a01000000
namespace: 821b5c5c-a045-4b82-b2c8-e052ba786a9c
The actor is dead because its node has died. Node Id: 2628e1894464566f5f0e56ebf56cee56db24835db98d85153a0d0172
The actor never ran - it was cancelled before it started running.
Trial PPO_CustomEnv-v0_ff24b_00001 errored after 53 iterations at 2023-09-18 05:59:49. Total running time: 6hr 29min 39s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00001_1_clip_param=0.1429,gamma=0.9634,kl_coeff=0.0016,kl_target=0.0028,lambda=0.9993,lr=0.0019,lr=0.0012_2023-09-17_23-30-10\error.txt
2023-09-18 05:59:50,249 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #2...
```
I took note of the resource usage for all the experiments I've run, and the resources are not stretched.
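For reference, this is roughly how I could log memory alongside training to double-check that claim. It is just a sketch; `psutil` and the `MemoryLoggingCallbacks` name are illustrative and not part of my actual setup:

```python
# Illustrative only: log process/system memory once per training iteration.
# psutil and this callback class are assumptions, not part of my original setup.
import psutil
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class MemoryLoggingCallbacks(DefaultCallbacks):  # hypothetical helper
    def on_train_result(self, *, algorithm, result, **kwargs):
        # RSS of the Algorithm (trainer) process and available system memory, in GB.
        result["process_rss_gb"] = psutil.Process().memory_info().rss / 1e9
        result["system_available_gb"] = psutil.virtual_memory().available / 1e9
```

This could be passed via `.callbacks(MemoryLoggingCallbacks)` in place of `MyCallbacks` below.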

I'm using Ray for Python 3.11 on Windows. First I tried the stable 2.6.3 release and then the nightly release; both versions have the same outcome. I also tried training on cloud VMs and the outcome is the same. What could be the issue?
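Since the raylet crash originates in the shared-memory allocator (dlmalloc's CreateFileMapping, and GetLastError() = 1455 is Windows' "paging file is too small" error), one experiment I can run is starting Ray with an explicitly bounded object store instead of the default sizing. A minimal sketch, assuming a single local node (the 2 GB value is only an example, not what my runs actually use):

```python
# Sketch only: start Ray with an explicitly bounded object store before training.
import ray

ray.init(
    object_store_memory=2 * 10**9,  # cap the plasma / shared-memory object store
    include_dashboard=False,
)
```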
Versions / Dependencies
Windows 11 and Ubuntu 22
Ray 2.6.3 and Ray Nightly
Python 3.11
Reproduction script
```python
# Relevant imports for the snippet below (user-defined pieces such as HPRanges,
# PPOLearnerHPs, MyCallbacks, env_name and self.train_env_config are defined
# elsewhere in my project; this is the body of my setup method).
from pathlib import Path

from ray.air import CheckpointConfig, FailureConfig
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune import SyncConfig
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune.stopper import (
    CombinedStopper,
    MaximumIterationStopper,
    TrialPlateauStopper,
)

self.exp_name = "Ndovu"
args = "PPO"
self.hp_ranges = HPRanges()
self.trainerHPs = PPOLearnerHPs(params=self.hp_ranges).config
self.algo = PPOConfig()
self.trainer = args
# Training
self.framework = "torch"
self.preprocessor_pref = "rllib"
self.observation_filter = "MeanStdFilter"
self.train_batch_size = self.hp_ranges.train_batch_size
self.max_seq_len = 20
# Rollouts
self.max_iterations = 500
self.num_rollout_workers = 1
self.rollout_fragment_length = round(self.train_batch_size / 3)
self.batch_mode = "truncate_episodes"
self.create_env_on_local_worker = (
True if self.num_rollout_workers == 0 else False
)
self.num_envs_per_worker = 1
self.remote_worker_envs = False
# Remote envs only make sense if num_envs > 1 (i.e. environment vectorization is enabled) or the env takes long to step through, but they add additional overhead.
# Resources
self.num_learner_workers = 1
self.num_cpus_per_worker = 1
self.num_cpus_for_local_worker = 1
self.num_cpus_per_learner_worker = 1
self.num_gpus = 0
self.num_gpus_per_worker = 0
self.num_gpus_per_learner_worker = 0
self._fake_gpus = False
self.custom_resources_per_worker = None
self.placement_strategy = "SPREAD"
# Evaluation
self.evaluation_parallel_to_training = False
self.evaluation_num_workers = 1
self.evaluation_duration_unit = "episodes"
self.evaluation_duration = 1
self.evaluation_frequency = round(self.max_iterations / 1)
# Exploration
self.random_steps = 250
# Tuner
self.num_samples = 2
self.max_concurrent_trials = 2
self.time_budget_s = None
# Logging
self.sep = "/"  # Ray is sensitive to file-name/path separators on Windows
self.dir = f"C:{self.sep}Users{self.sep}user{self.sep}ray_results"
self.log_dir, self.log_name = self._log_dir(
Path(self.dir), self.exp_name, self.sep
)
ts = f"trial_summaries{self.sep}{self.exp_name}"
self.summaries_dir = Path(self.dir).joinpath(ts)
if not self.summaries_dir.exists():
    self.summaries_dir.mkdir(parents=True)
print(f"Log Name: {self.log_name}")
print(f"Log directory: {self.log_dir.as_posix()}")
print(f"Summaries directory {self.summaries_dir.as_posix()}")
# Metrics
self.metrics = "episode_reward_mean"
self.mode = "max"
# Checkpoints, sync & perturbs
self.score = self.metrics
self.checkpoint_frequency = (
round(self.max_iterations / 10)
if self.max_iterations <= 200
else round(self.max_iterations / 33.33)
)
self.pertub_frequency = self.checkpoint_frequency
self.pertub_burn_period = self.checkpoint_frequency * 2
self.num_to_keep = 3
# Others
self.verbose = 1
# Register Model
self.model_3 = {
"max_seq_len": 10,
"use_lstm": True,
"lstm_cell_size": 2048,
"lstm_use_prev_action": True,
"lstm_use_prev_reward": True,
"fcnet_hiddens": [1024, 2048, 2048, 4096, 4096, 2048, 2048, 1024, 1024],
"post_fcnet_hiddens": [1024, 1024, 1024, 1024],
"fcnet_activation": "swish",
"post_fcnet_activation": "swish",
"vf_share_layers": True,
"no_final_linear": True,
} # 166,043,706 params
self.exploration_config = {
"type": "StochasticSampling", # Default for PG algorithms
# StochasticSampling can be made deterministic by passing explore=False to `get_exploration_action`. It also allows scheduled distribution parameters, e.g. lowering stddev or temperature over time.
"random_timesteps": self.random_steps, # Int
"framework": self.framework,
}
self.centered_adam = False
self.optimizer_config = {
"type": "RAdam",
"lr": self.hp_ranges.lr,
"betas": (0.9, 0.999),
#'beta2': 0.999, # Only used if centered=False.
"eps": 1e-08, # Only used if centered=False.
#'weight_decay': #floatself.centered_adam,
#'amsgrad': False # Only used if centered=False.
}
# ENV
self.env = env_name
self.render_env = False
self.evaluation_config_ = self.algo.overrides( # type: ignore
explore=False, render_env=False
)
self.config = (
self.algo.update_from_dict(config_dict=self.trainerHPs.to_dict())
.environment(
env=self.env,
env_config=self.train_env_config,
# env=CartPoleEnv, #testing
render_env=self.render_env,
clip_rewards=None,
auto_wrap_old_gym_envs=False,
disable_env_checking=True,
is_atari=False,
)
.framework(
framework="torch",
torch_compile_learner=True, # For enabling torch-compile during training
torch_compile_learner_dynamo_backend="ipex",
torch_compile_learner_dynamo_mode="default",
torch_compile_worker=True, # For enabling torch-compile during sampling
torch_compile_worker_dynamo_backend="ipex",
torch_compile_worker_dynamo_mode="default",
)
.debugging(log_level="ERROR", log_sys_usage=True) # type: ignore
.rollouts(
num_rollout_workers=self.num_rollout_workers,
num_envs_per_worker=self.num_envs_per_worker,
create_env_on_local_worker=self.create_env_on_local_worker,
enable_connectors=True,
rollout_fragment_length=self.rollout_fragment_length,
batch_mode=self.batch_mode,
# remote_worker_envs=self.remote_worker_envs,
# remote_env_batch_wait_ms=0,
validate_workers_after_construction=True,
preprocessor_pref=self.preprocessor_pref,
observation_filter=self.observation_filter, # TODO: Test NoFilter
update_worker_filter_stats=True,
compress_observations=False, # TODO: Test True
)
.fault_tolerance(
recreate_failed_workers=True,
max_num_worker_restarts=10,
delay_between_worker_restarts_s=30,
restart_failed_sub_environments=True,
num_consecutive_worker_failures_tolerance=10,
worker_health_probe_timeout_s=300,
worker_restore_timeout_s=180,
)
.resources(
num_cpus_per_worker=self.num_cpus_per_worker,
# num_gpus_per_worker= self.num_gpus_per_worker,
num_cpus_for_local_worker=self.num_cpus_for_local_worker,
num_learner_workers=self.num_learner_workers,
num_cpus_per_learner_worker=self.num_cpus_per_learner_worker,
placement_strategy=self.placement_strategy,
)
.exploration(explore=True, exploration_config=self.exploration_config)
.checkpointing(
export_native_model_files=False,
checkpoint_trainable_policies_only=False,
) # Bool
.evaluation(
evaluation_interval=self.evaluation_frequency,
evaluation_duration=self.evaluation_duration,
evaluation_duration_unit=self.evaluation_duration_unit,
evaluation_sample_timeout_s=600,
evaluation_parallel_to_training=self.evaluation_parallel_to_training,
# evaluation_config = self.evaluation_config_,
# off_policy_estimation_methods = {}, # See Notes in Next Cell
# ope_split_batch_by_episode = True, # default
evaluation_num_workers=self.evaluation_num_workers,
# custom_evaluation_function = None,
always_attach_evaluation_results=True,
enable_async_evaluation=True
if self.evaluation_num_workers > 1
else False,
)
.callbacks(MyCallbacks)
.rl_module(
_enable_rl_module_api=False,
# rl_module_spec=module_to_load_spec
)
.training(
# gamma=0.98, # ,
# lr=1e-5,
gamma=self.hp_ranges.gamma, # type: ignore
lr=self.hp_ranges.lr, # type: ignore
grad_clip_by="norm", # type: ignore
grad_clip=0.3,
train_batch_size=self.hp_ranges.train_batch_size, # type: ignore
model=self.model_3, # type: ignore
optimizer=self.optimizer_config,
_enable_learner_api=False,
# learner_class=None
)
)
self.config_dict = self.config.to_dict()
self.stopper = CombinedStopper(
MaximumIterationStopper(max_iter=self.max_iterations),
TrialPlateauStopper(
metric=self.metrics,
std=0.04,
num_results=10,
grace_period=100,
metric_threshold=200,
mode="max",
),
)
self.checkpointer = CheckpointConfig(
num_to_keep=self.num_to_keep,
checkpoint_score_attribute=self.score,
checkpoint_score_order=self.mode,
checkpoint_frequency=self.checkpoint_frequency,
checkpoint_at_end=True,
)
self.failure_check = FailureConfig(max_failures=5, fail_fast=False)
self.sync_config = SyncConfig(
# syncer=None,
sync_period=7200,
sync_timeout=7200,
sync_artifacts=True,
sync_artifacts_on_checkpoint=True,
)
hyper_dict = {
# distribution for resampling
"gamma": self.hp_ranges.gamma,
"lr": self.hp_ranges.lr,
"vf_loss_coeff": self.hp_ranges.vf_loss_coeff,
"kl_coeff": self.hp_ranges.kl_coeff,
"kl_target": self.hp_ranges.kl_target,
"lambda_": self.hp_ranges.lambda_,
"clip_param": self.hp_ranges.clip_param,
"grad_clip": self.hp_ranges.grad_clip,
}
self.pbt_scheduler = PopulationBasedTraining(
time_attr="training_iteration",
perturbation_interval=self.pertub_frequency,
burn_in_period=self.pertub_burn_period,
hyperparam_mutations=hyper_dict, # type:ignore
quantile_fraction=0.50, # Paper default
resample_probability=0.20,
perturbation_factors=(1.2, 0.8), # Paper default
# custom_explore_fn = None
)
```
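The snippet above stops at building the config objects; the actual launch call isn't shown. For completeness, this is roughly how those pieces get wired together on my side. The wiring below is reconstructed as a sketch (Ray 2.6-style AIR API), not copied verbatim from my launch code:

```python
# Illustrative wiring of the objects above into a Tuner run (Ray 2.6-style API).
from ray import air, tune

tuner = tune.Tuner(
    self.trainer,  # "PPO"
    param_space=self.config_dict,
    tune_config=tune.TuneConfig(
        metric=self.metrics,
        mode=self.mode,
        scheduler=self.pbt_scheduler,
        num_samples=self.num_samples,
        max_concurrent_trials=self.max_concurrent_trials,
        time_budget_s=self.time_budget_s,
    ),
    run_config=air.RunConfig(
        name=self.log_name,
        storage_path=self.dir,
        stop=self.stopper,
        checkpoint_config=self.checkpointer,
        failure_config=self.failure_check,
        sync_config=self.sync_config,
        verbose=self.verbose,
    ),
)
results = tuner.fit()
```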
Issue Severity
High: It blocks me from completing my task.