From d43035971d58ff970fea012f51946142bf50e9e4 Mon Sep 17 00:00:00 2001
From: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Date: Tue, 23 Sep 2025 17:53:45 +0100
Subject: [PATCH 1/3] Update trl_grpo_reasoning_advanced_reward.ipynb

---
 notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb | 1 +
 1 file changed, 1 insertion(+)

diff --git a/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb b/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
index dd8c756c..edff7028 100644
--- a/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
+++ b/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
@@ -497,6 +497,7 @@
  "training_args = GRPOConfig(\n",
  " # Learning parameters optimized for reasoning tasks\n",
  " learning_rate=5e-6, # Conservative LR to prevent destabilizing reasoning\n",
+ " bf16=False,
  " \n",
  " # Memory-efficient batch configuration\n",
  " per_device_train_batch_size=2, # Small batch for GPU memory constraints\n",

From 1bec3c8b2ff5ce7c9b2c8867059875b7d570fb1f Mon Sep 17 00:00:00 2001
From: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Date: Tue, 23 Sep 2025 17:54:33 +0100
Subject: [PATCH 2/3] Update trl_grpo_reasoning_advanced_reward.ipynb

Set bf16=False
---
 notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb b/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
index edff7028..94e7cb41 100644
--- a/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
+++ b/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
@@ -497,7 +497,7 @@
  "training_args = GRPOConfig(\n",
  " # Learning parameters optimized for reasoning tasks\n",
  " learning_rate=5e-6, # Conservative LR to prevent destabilizing reasoning\n",
- " bf16=False,
+ " bf16=False",
  " \n",
  " # Memory-efficient batch configuration\n",
  " per_device_train_batch_size=2, # Small batch for GPU memory constraints\n",

From 984de4a0e59098abec6eefa933d3abd8d74cb119 Mon Sep 17 00:00:00 2001
From: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Date: Tue, 23 Sep 2025 17:55:41 +0100
Subject: [PATCH 3/3] Update trl_grpo_reasoning_advanced_reward.ipynb

add comma
---
 notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb b/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
index 94e7cb41..62b33827 100644
--- a/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
+++ b/notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb
@@ -497,7 +497,7 @@
  "training_args = GRPOConfig(\n",
  " # Learning parameters optimized for reasoning tasks\n",
  " learning_rate=5e-6, # Conservative LR to prevent destabilizing reasoning\n",
- " bf16=False",
+ " bf16=False,",
  " \n",
  " # Memory-efficient batch configuration\n",
  " per_device_train_batch_size=2, # Small batch for GPU memory constraints\n",