
[RLlib; Core; Tune]: Ray keeps crashing during tune run. #39726

@grizzlybearg

Description


What happened + What you expected to happen

For the past few days, all training runs have been failing between 6 and 10 hours into training. I get this output:

```
(raylet) [2023-09-18 05:59:34,944 C 19680 8204] (raylet.exe) dlmalloc.cc:129: Check failed: *handle != nullptr CreateFileMapping() failed. GetLastError() = 1455
(raylet) *** StackTrace Information ***
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) unknown
(raylet) recalloc
(raylet) BaseThreadInitThunk
(raylet) RtlUserThreadStart
(raylet)
(RolloutWorker pid=21400) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:
(RolloutWorker pid=6816) C:\arrow\cpp\src\arrow\filesystem\s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
2023-09-18 05:59:38,368 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00001
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Trial PPO_CustomEnv-v0_ff24b_00001 errored after 53 iterations at 2023-09-18 05:59:38. Total running time: 6hr 29min 28s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00001_1_clip_param=0.1429,gamma=0.9634,kl_coeff=0.0016,kl_target=0.0028,lambda=0.9993,lr=0.0019,lr=0.0012_2023-09-17_23-30-10\error.txt
╭───────────────────────────────────────────────────────────╮
│ Trial PPO_CustomEnv-v0_ff24b_00001 result │
├───────────────────────────────────────────────────────────┤
│ episodes_total 15 │
│ evaluation/sampler_results/episode_reward_mean nan │
│ num_env_steps_sampled 15900 │
│ num_env_steps_trained 15900 │
│ sampler_results/episode_len_mean 1031 │
│ sampler_results/episode_reward_mean -13584.7 │
╰───────────────────────────────────────────────────────────╯
2023-09-18 05:59:38,462 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00000
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Trial PPO_CustomEnv-v0_ff24b_00000 errored after 52 iterations at 2023-09-18 05:59:38. Total running time: 6hr 29min 28s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00000_0_clip_param=0.0628,gamma=0.8613,kl_coeff=0.0089,kl_target=0.0021,lambda=0.9291,lr=0.0013,lr=0.0030_2023-09-17_23-30-10\error.txt
╭───────────────────────────────────────────────────────────╮
│ Trial PPO_CustomEnv-v0_ff24b_00000 result │
├───────────────────────────────────────────────────────────┤
│ episodes_total 15 │
│ evaluation/sampler_results/episode_reward_mean nan │
│ num_env_steps_sampled 15600 │
│ num_env_steps_trained 15600 │
│ sampler_results/episode_len_mean 1031 │
│ sampler_results/episode_reward_mean -17543.1 │
╰───────────────────────────────────────────────────────────╯
2023-09-18 05:59:49,841 WARNING worker.py:2071 -- The node with node id: 2628e1894464566f5f0e56ebf56cee56db24835db98d85153a0d0172 and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
2023-09-18 05:59:49,857 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00000
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: 38a34301b858e9124e3ad4f501000000
namespace: 821b5c5c-a045-4b82-b2c8-e052ba786a9c
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 127.0.0.1 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
Trial PPO_CustomEnv-v0_ff24b_00000 errored after 52 iterations at 2023-09-18 05:59:49. Total running time: 6hr 29min 39s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00000_0_clip_param=0.0628,gamma=0.8613,kl_coeff=0.0089,kl_target=0.0021,lambda=0.9291,lr=0.0013,lr=0.0030_2023-09-17_23-30-10\error.txt
2023-09-18 05:59:49,888 ERROR tune_controller.py:1502 -- Trial task failed for trial PPO_CustomEnv-v0_ff24b_00001
Traceback (most recent call last):
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray\air\execution_internal\event_manager.py", line 110, in resolve_future
result = ray.get(future)
^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\AppData\Roaming\Python\Python311\site-packages\ray_private\worker.py", line 2562, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: aeca46f4aa7611558d9fd36a01000000
namespace: 821b5c5c-a045-4b82-b2c8-e052ba786a9c
The actor is dead because its node has died. Node Id: 2628e1894464566f5f0e56ebf56cee56db24835db98d85153a0d0172
The actor never ran - it was cancelled before it started running.

Trial PPO_CustomEnv-v0_ff24b_00001 errored after 53 iterations at 2023-09-18 05:59:49. Total running time: 6hr 29min 39s
Error file: C:/Users/user/ray_results\Ndovu1\PPO_CustomEnv-v0_ff24b_00001_1_clip_param=0.1429,gamma=0.9634,kl_coeff=0.0016,kl_target=0.0028,lambda=0.9993,lr=0.0019,lr=0.0012_2023-09-17_23-30-10\error.txt
2023-09-18 05:59:50,249 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #2..
```
I took note of the resource usage for all of the experiments I've run, and the resources are not stretched:
[screenshot: resource usage during the runs]
I'm using Ray for Python 3.11 on Windows. First I tried the stable 2.6.3 release and then the nightly release; both have the same outcome. I also tried training on cloud VMs and the outcome is the same. What could be the issue?
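For what it's worth, Windows error code 1455 is ERROR_COMMITMENT_LIMIT ("the paging file is too small for this operation to complete"), so the raylet appears to die when it can no longer commit memory for its CreateFileMapping-backed shared-memory object store, even though overall RAM usage looks modest. Below is a minimal sketch of capping the object store at startup in case that turns out to be the cause; the explicit `ray.init()` call and the 2 GB figure are illustrative assumptions, not the code from the failing runs.

```python
import ray

# Sketch (assumption): start Ray with an explicit cap on the plasma object
# store so the raylet's shared-memory mapping cannot outgrow the Windows
# commit/pagefile limit. The 2 GB value is an arbitrary example, not a
# tuned recommendation.
ray.init(
    object_store_memory=2 * 1024**3,  # bytes
    include_dashboard=False,
)
```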

Versions / Dependencies

Windows 11 and Ubuntu 22
Ray 2.6.3 and Ray Nightly
Python 3.11

Reproduction script

```python
    self.exp_name = "Ndovu"
    args = "PPO"
    self.hp_ranges = HPRanges()
    self.trainerHPs = PPOLearnerHPs(params=self.hp_ranges).config
    self.algo = PPOConfig()
    self.trainer = args

    # Training
    self.framework = "torch"
    self.preprocessor_pref = "rllib"
    self.observation_filter = "MeanStdFilter"
    self.train_batch_size = self.hp_ranges.train_batch_size
    self.max_seq_len = 20

    # Rollouts
    self.max_iterations = 500
    self.num_rollout_workers = 1
    self.rollout_fragment_length = round(self.train_batch_size / 3)
    self.batch_mode = "truncate_episodes"
    self.create_env_on_local_worker = (
        True if self.num_rollout_workers == 0 else False
    )
    self.num_envs_per_worker = 1
    self.remote_worker_envs = False
    # Remote envs only make sense to use if num_envs > 1 (i.e. environment vectorization is enabled) or the env takes long to step through but has additional overhead costs.

    # Resources
    self.num_learner_workers = 1

    self.num_cpus_per_worker = 1
    self.num_cpus_for_local_worker = 1
    self.num_cpus_per_learner_worker = 1

    self.num_gpus = 0
    self.num_gpus_per_worker = 0
    self.num_gpus_per_learner_worker = 0
    self._fake_gpus = False

    self.custom_resources_per_worker = None

    self.placement_strategy = "SPREAD"

    # Evaluation
    self.evaluation_parallel_to_training = False
    self.evaluation_num_workers = 1
    self.evaluation_duration_unit = "episodes"
    self.evaluation_duration = 1
    self.evaluation_frequency = round(self.max_iterations / 1)

    # Exploration
    self.random_steps = 250

    # Tuner
    self.num_samples = 2
    self.max_concurrent_trials = 2
    self.time_budget_s = None

    # Logging
    self.sep = "/"  # ray is sensitive with file names on windows
    self.dir = f"C:{self.sep}Users{self.sep}user{self.sep}ray_results"
    self.log_dir, self.log_name = self._log_dir(
        Path(self.dir), self.exp_name, self.sep
    )
    ts = f"trial_summaries{self.sep}{self.exp_name}"
    self.summaries_dir = Path(self.dir).joinpath(ts)

    if not self.summaries_dir.exists():
        self.summaries_dir.mkdir(parents=True)

    print(f"Log Name: {self.log_name}")
    print(f"Log directory: {self.log_dir.as_posix()}")
    print(f"Summaries directory {self.summaries_dir.as_posix()}")

    # Metrics
    self.metrics = "episode_reward_mean"  
    self.mode = "max"

    # Checkpoints, sync & perturbs
    self.score = self.metrics
    self.checkpoint_frequency = (
        round(self.max_iterations / 10)
        if self.max_iterations <= 200
        else round(self.max_iterations / 33.33)
    )
    self.pertub_frequency = self.checkpoint_frequency
    self.pertub_burn_period = self.checkpoint_frequency * 2
    self.num_to_keep = 3

    # Others
    self.verbose = 1

    # Register Model
    self.model_3 = {
        "max_seq_len": 10,
        "use_lstm": True,
        "lstm_cell_size": 2048,
        "lstm_use_prev_action": True,
        "lstm_use_prev_reward": True,
        "fcnet_hiddens": [1024, 2048, 2048, 4096, 4096, 2048, 2048, 1024, 1024],
        "post_fcnet_hiddens": [1024, 1024, 1024, 1024],
        "fcnet_activation": "swish",
        "post_fcnet_activation": "swish",
        "vf_share_layers": True,
        "no_final_linear": True,
    }  # 166,043,706 params


    self.exploration_config = {
        "type": "StochasticSampling",  # Default for PG algorithms
        # StochasticSampling can be made deterministic by passing explore=False into the call to `get_exploration_action`. Also allows for scheduled parameters for the distributions, such as lowering stddev, temperature, etc.. over time.
        "random_timesteps": self.random_steps,  # Int
        "framework": self.framework,
    }

    self.centered_adam = False
    self.optimizer_config = {
        "type": "RAdam",
        "lr": self.hp_ranges.lr,
        "betas": (0.9, 0.999),
        #'beta2': 0.999, # Only used if centered=False.
        "eps": 1e-08,  # Only used if centered=False.
        #'weight_decay': #floatself.centered_adam,
        #'amsgrad': False # Only used if centered=False.
    }

    # ENV
    self.env = env_name
    self.render_env = False
    self.evaluation_config_ = self.algo.overrides(  # type: ignore
        explore=False, render_env=False
    )

    self.config = (
        self.algo.update_from_dict(config_dict=self.trainerHPs.to_dict())
        .environment(
            env=self.env,
            env_config=self.train_env_config,
            # env=CartPoleEnv, #testing
            render_env=self.render_env,
            clip_rewards=None,
            auto_wrap_old_gym_envs=False,
            disable_env_checking=True,
            is_atari=False,
        )
        .framework(
            framework="torch",
            torch_compile_learner=True,  # For enabling torch-compile during training
            torch_compile_learner_dynamo_backend="ipex",
            torch_compile_learner_dynamo_mode="default",
            torch_compile_worker=True,  # For enabling torch-compile during sampling
            torch_compile_worker_dynamo_backend="ipex",
            torch_compile_worker_dynamo_mode="default",
        )
        .debugging(log_level="ERROR", log_sys_usage=True)  # type: ignore
        .rollouts(
            num_rollout_workers=self.num_rollout_workers,
            num_envs_per_worker=self.num_envs_per_worker,
            create_env_on_local_worker=self.create_env_on_local_worker,
            enable_connectors=True,
            rollout_fragment_length=self.rollout_fragment_length,
            batch_mode=self.batch_mode,
            # remote_worker_envs=self.remote_worker_envs,
            # remote_env_batch_wait_ms=0,
            validate_workers_after_construction=True,
            preprocessor_pref=self.preprocessor_pref,
            observation_filter=self.observation_filter,  # TODO: Test NoFilter
            update_worker_filter_stats=True,
            compress_observations=False,  # TODO: Test True
        )
        .fault_tolerance(
            recreate_failed_workers=True,
            max_num_worker_restarts=10,
            delay_between_worker_restarts_s=30,
            restart_failed_sub_environments=True,
            num_consecutive_worker_failures_tolerance=10,
            worker_health_probe_timeout_s=300,
            worker_restore_timeout_s=180,
        )
        .resources(
            num_cpus_per_worker=self.num_cpus_per_worker,
            # num_gpus_per_worker= self.num_gpus_per_worker,
            num_cpus_for_local_worker=self.num_cpus_for_local_worker,
            num_learner_workers=self.num_learner_workers,
            num_cpus_per_learner_worker=self.num_cpus_per_learner_worker,
            placement_strategy=self.placement_strategy,
        )
        .exploration(explore=True, exploration_config=self.exploration_config)
        .checkpointing(
            export_native_model_files=False,
            checkpoint_trainable_policies_only=False,
        )  # Bool
        .evaluation(
            evaluation_interval=self.evaluation_frequency,
            evaluation_duration=self.evaluation_duration,
            evaluation_duration_unit=self.evaluation_duration_unit,
            evaluation_sample_timeout_s=600,
            evaluation_parallel_to_training=self.evaluation_parallel_to_training,
            # evaluation_config = self.evaluation_config_,
            # off_policy_estimation_methods = {}, # See Notes in Next Cell
            # ope_split_batch_by_episode = True, # default
            evaluation_num_workers=self.evaluation_num_workers,
            # custom_evaluation_function = None,
            always_attach_evaluation_results=True,
            enable_async_evaluation=True
            if self.evaluation_num_workers > 1
            else False,
        )
        .callbacks(MyCallbacks)
        .rl_module(
            _enable_rl_module_api=False,
            # rl_module_spec=module_to_load_spec
        )
        .training(
            # gamma=0.98,  # ,
            # lr=1e-5,
            gamma=self.hp_ranges.gamma,  # type: ignore
            lr=self.hp_ranges.lr,  # type: ignore
            grad_clip_by="norm",  # type: ignore
            grad_clip=0.3,
            train_batch_size=self.hp_ranges.train_batch_size,  # type: ignore
            model=self.model_3,  # type: ignore
            optimizer=self.optimizer_config,
            _enable_learner_api=False,
            # learner_class=None
        )
    )
    self.config_dict = self.config.to_dict()

    self.stopper = CombinedStopper(
        MaximumIterationStopper(max_iter=self.max_iterations),
        TrialPlateauStopper(
            metric=self.metrics,
            std=0.04,
            num_results=10,
            grace_period=100,
            metric_threshold=200,
            mode="max",
        ),
    )

    self.checkpointer = CheckpointConfig(
        num_to_keep=self.num_to_keep,
        checkpoint_score_attribute=self.score,
        checkpoint_score_order=self.mode,
        checkpoint_frequency=self.checkpoint_frequency,
        checkpoint_at_end=True,
    )

    self.failure_check = FailureConfig(max_failures=5, fail_fast=False)

    self.sync_config = SyncConfig(
        # syncer=None,
        sync_period=7200,
        sync_timeout=7200,
        sync_artifacts=True,
        sync_artifacts_on_checkpoint=True,
    )

    hyper_dict = {
        # distributions for resampling
        "gamma": self.hp_ranges.gamma,
        "lr": self.hp_ranges.lr,
        "vf_loss_coeff": self.hp_ranges.vf_loss_coeff,
        "kl_coeff": self.hp_ranges.kl_coeff,
        "kl_target": self.hp_ranges.kl_target,
        "lambda_": self.hp_ranges.lambda_,
        "clip_param": self.hp_ranges.clip_param,
        "grad_clip": self.hp_ranges.grad_clip,
    }

    self.pbt_scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        perturbation_interval=self.pertub_frequency,
        burn_in_period=self.pertub_burn_period,
        hyperparam_mutations=hyper_dict,  # type:ignore
        quantile_fraction=0.50,  # Paper default
        resample_probability=0.20,
        perturbation_factors=(1.2, 0.8),  # Paper default
        # custom_explore_fn = None
    )
```
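For context, the snippet stops before the actual launch call. Below is a rough sketch of how the config, stopper, checkpointing, and PBT scheduler above would typically be handed to Tune; it is an illustrative reconstruction using the attributes defined above, not the exact launch code from the failing runs.

```python
# Illustrative sketch only: how the pieces above are usually wired into a
# Tune run. Not the exact launch code from the failing experiments.
from ray import air, tune

tuner = tune.Tuner(
    self.trainer,  # "PPO"
    param_space=self.config_dict,
    tune_config=tune.TuneConfig(
        metric=self.metrics,
        mode=self.mode,
        scheduler=self.pbt_scheduler,
        num_samples=self.num_samples,
        max_concurrent_trials=self.max_concurrent_trials,
    ),
    run_config=air.RunConfig(
        name=self.exp_name,
        storage_path=self.dir,
        stop=self.stopper,
        checkpoint_config=self.checkpointer,
        failure_config=self.failure_check,
        sync_config=self.sync_config,
        verbose=self.verbose,
    ),
)
results = tuner.fit()
```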

Issue Severity

High: It blocks me from completing my task.

Labels

external-author-action-required, P1, QS, bug, windows
