Add seed to data collator classes #36655

@capemox

Feature request

In #36357 I opened an issue about adding a seed to DataCollatorForLanguageModeling, which helps reproducibility when pretraining encoder-only transformers without setting a global seed. The accompanying PR (#36497) was approved.

The same seed feature could be added to the other data collator classes that use pseudo-random functions, namely DataCollatorForWholeWordMask and DataCollatorForPermutationLanguageModeling.

Motivation

This feature would make data preparation reproducible without setting a global seed via transformers.set_seed. I have described the advantages of this approach in #36357.
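
To illustrate the idea, here is a minimal sketch of a collator that owns its random generator instead of relying on global state. It is a toy example using only the standard library, not the transformers implementation: the class name `SeededMaskingCollator`, the `mask_token_id` value, and the simplified per-token masking rule are all assumptions made for illustration.

```python
import random
from typing import List, Optional


class SeededMaskingCollator:
    """Hypothetical toy collator; NOT the transformers API."""

    def __init__(
        self,
        mask_token_id: int = 103,  # assumed placeholder id for a mask token
        mlm_probability: float = 0.15,
        seed: Optional[int] = None,
    ) -> None:
        self.mask_token_id = mask_token_id
        self.mlm_probability = mlm_probability
        # A private generator: seeding it leaves the global `random` state untouched.
        self._rng = random.Random(seed)

    def __call__(self, batch: List[List[int]]) -> List[List[int]]:
        # Replace each token with the mask id with probability `mlm_probability`.
        return [
            [
                self.mask_token_id if self._rng.random() < self.mlm_probability else tok
                for tok in seq
            ]
            for seq in batch
        ]


batch = [[5, 6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16]]

# Two collators built with the same seed mask the batch identically.
first = SeededMaskingCollator(seed=42)(batch)
second = SeededMaskingCollator(seed=42)(batch)
same_masks = first == second
```

Because the generator is local to the collator, other code that draws from the global random state is unaffected by the collator's seed, which is the property a global `set_seed` call cannot offer.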

Your contribution

Since I was involved with the seed feature in DataCollatorForLanguageModeling, I'm very familiar with this idea and I'd love to make PRs adding this feature to the other two collators!
