
Commit 00366e0

capemox authored and Rocketknight1 committed
Fixed 2 issues regarding tests/trainer/test_data_collator.py::TFDataCollatorIntegrationTest::test_all_mask_replacement:
1. I got the error `RuntimeError: "bernoulli_tensor_cpu_p_" not implemented for 'Long'`. This is because the `mask_replacement_prob=1` and `torch.bernoulli` doesn't accept this type (which would be a `torch.long` dtype instead. I fixed this by manually casting the probability arguments in the `__post_init__` function of `DataCollatorForLanguageModeling`. 2. I also got the error `tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute Equal as input #1(zero-based) was expected to be a int64 tensor but is a int32 tensor [Op:Equal]` due to the line `tf.reduce_all((batch["input_ids"] == inputs) | (batch["input_ids"] == tokenizer.mask_token_id))` in `test_data_collator.py`. This occurs because the type of the `inputs` variable is `tf.int32`. Solved this by manually casting it to `tf.int64` in the test, as the expected return type of `batch["input_ids"]` is `tf.int64`.
1 parent 02776d2 commit 00366e0

File tree

2 files changed (+7, -1 lines)


src/transformers/data/data_collator.py

Lines changed: 4 additions & 0 deletions
@@ -843,6 +843,10 @@ def __post_init__(self):
         if self.random_replace_prob < 0 or self.random_replace_prob > 1:
             raise ValueError("random_replace_prob should be between 0 and 1.")

+        self.mlm_probability = float(self.mlm_probability)
+        self.mask_replace_prob = float(self.mask_replace_prob)
+        self.random_replace_prob = float(self.random_replace_prob)
+
         if self.tf_experimental_compile:
             import tensorflow as tf

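A minimal standalone sketch of the PyTorch issue the casts above address (not the collator code itself, just the dtype behavior): `torch.full` infers its dtype from the fill value, so an integer probability yields a `torch.long` tensor that `torch.bernoulli` rejects.

```python
import torch

shape = (2, 5)
prob = 1  # a Python int, as when mask_replace_prob=1 is passed

# torch.full infers dtype from the fill value: an int fill gives a
# torch.long tensor, and torch.bernoulli is not implemented for it.
try:
    torch.bernoulli(torch.full(shape, prob))
except RuntimeError as e:
    print("fails:", e)

# Casting the probability to float first (as the __post_init__ fix does)
# makes torch.full produce a float tensor, which bernoulli accepts.
mask = torch.bernoulli(torch.full(shape, float(prob)))
print(mask.dtype)  # torch.float32
```

With a probability of 1.0, every sampled entry is 1, which is why the test expects every token to be replaced.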
tests/trainer/test_data_collator.py

Lines changed: 3 additions & 1 deletion
@@ -1052,7 +1052,9 @@ def test_all_mask_replacement(self):

         # confirm that every token is either the original token or [MASK]
         self.assertTrue(
-            tf.reduce_all((batch["input_ids"] == inputs) | (batch["input_ids"] == tokenizer.mask_token_id))
+            tf.reduce_all(
+                (batch["input_ids"] == tf.cast(inputs, tf.int64)) | (batch["input_ids"] == tokenizer.mask_token_id)
+            )
         )

         # numpy call
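A minimal standalone sketch of the TensorFlow dtype mismatch the cast above fixes (illustrative tensors, not the actual test data): unlike NumPy, `tf.equal` does not promote dtypes, so comparing an `int64` tensor against an `int32` tensor raises `InvalidArgumentError`.

```python
import tensorflow as tf

a = tf.constant([[1, 2]], dtype=tf.int64)  # like batch["input_ids"]
b = tf.constant([[1, 2]], dtype=tf.int32)  # like the original inputs

# Comparing tensors of different integer dtypes raises
# InvalidArgumentError, since tf.equal requires matching dtypes.
try:
    tf.reduce_all(a == b)
except tf.errors.InvalidArgumentError as e:
    print("fails:", e)

# Casting one operand first, as the test now does, makes the comparison valid.
print(bool(tf.reduce_all(a == tf.cast(b, tf.int64))))  # True
```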
