Feature request
I opened an issue (#36357) proposing a seed for DataCollatorForLanguageModeling, which helps with reproducibility when pretraining encoder-only transformers without setting a global seed; the accompanying PR (#36497) has also been approved.
There is scope to add the same seed feature to other data collator classes that involve pseudo-random operations, namely DataCollatorForWholeWordMask and DataCollatorForPermutationLanguageModeling.
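For context, a minimal sketch of how the collator-level seed works for DataCollatorForLanguageModeling after #36497, assuming the keyword is named `seed` as in that PR; the same keyword on the other two collators is the hypothetical addition requested here:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# After #36497: the masking RNG is seeded per collator, so the masked positions
# are reproducible without calling transformers.set_seed (which would also fix
# the global torch/numpy/random state).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    seed=42,
)

features = [
    tokenizer(text, return_special_tokens_mask=True)
    for text in ["a reproducible batch", "another example"]
]
batch = collator(features)  # same masking pattern on every run with the same seed

# This issue proposes the same `seed` keyword (not yet implemented) for
# DataCollatorForWholeWordMask and DataCollatorForPermutationLanguageModeling.
```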
Motivation
This feature would help reproducibility in data preparation without setting a global seed via transformers.set_seed, which has several advantages, as pointed out in #36357.
Your contribution
Since I was involved with the seed feature in DataCollatorForLanguageModeling, I'm very familiar with this idea and I'd love to make PRs adding this feature to the other two collators!