
Conversation

xiaoda99
Contributor

With the original code, all parameters are decayed because the condition "parameter_name in no_decay" will never be satisfied.
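
For context, a minimal sketch of the failure mode, using made-up parameter names and a hypothetical no_decay list (illustrations only, not the exact lists from the example scripts): the exact-membership test can never match a fully qualified name, so every parameter lands in the decayed group, while a substring test behaves as intended.

# Illustration only: names in the style returned by model.named_parameters(),
# and a hypothetical no_decay list of name fragments.
names = ['bert.embeddings.LayerNorm.weight',
         'bert.embeddings.LayerNorm.bias',
         'bert.encoder.layer.0.attention.self.query.weight',
         'bert.encoder.layer.0.attention.self.query.bias']
no_decay = ['bias', 'LayerNorm.weight']

# Original condition: exact membership, which a dotted name never satisfies.
print([n for n in names if n in no_decay])
# [] -> every parameter would receive weight decay

# Condition introduced by this fix: substring match against each fragment.
print([n for n in names if any(nd in n for nd in no_decay)])
# ['bert.embeddings.LayerNorm.weight', 'bert.embeddings.LayerNorm.bias',
#  'bert.encoder.layer.0.attention.self.query.bias']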

xiaoda99 changed the title from "Fix ineffective no_decay bug when using BertAdam" to "Fix ineffective no_decay bug when using BERTAdam" on Nov 18, 2018
thomwolf merged commit 061eeca into huggingface:master on Nov 20, 2018
@thomwolf
Member

thanks!

qwang70 pushed a commit to DRL36/pytorch-pretrained-BERT that referenced this pull request Mar 2, 2019
Fix ineffective no_decay bug when using BERTAdam
@xelibrion

Question: wouldn't .named_parameters() for the model return tuples of (name, param_tensor), where the names look similar to these

['bert.embeddings.word_embeddings.weight',
 'bert.embeddings.position_embeddings.weight',
 'bert.embeddings.token_type_embeddings.weight',
 'bert.embeddings.LayerNorm.weight',
 'bert.embeddings.LayerNorm.bias',
 'bert.encoder.layer.0.attention.self.query.weight',
 'bert.encoder.layer.0.attention.self.query.bias',
 'bert.encoder.layer.0.attention.self.key.weight',
 'bert.encoder.layer.0.attention.self.key.bias',
 'bert.encoder.layer.0.attention.self.value.weight',
 'bert.encoder.layer.0.attention.self.value.bias',
 'bert.encoder.layer.0.attention.output.dense.weight',
 'bert.encoder.layer.0.attention.output.dense.bias',
 'bert.encoder.layer.0.attention.output.LayerNorm.weight',
 'bert.encoder.layer.0.attention.output.LayerNorm.bias',
 ...
 ...
 'classifier.linear.weight',
 'classifier.linear.bias']

therefore requiring slightly smarter conditions than a plain in check? Something along these lines?

[p for n, p in param_optimizer if any(n.endswith(x) for x in no_decay)]

For reference, the grouping before and after this PR:

optimizer_grouped_parameters = [
-    {'params': [p for n, p in param_optimizer if n not in no_decay], 'weight_decay_rate': 0.01},
-    {'params': [p for n, p in param_optimizer if n in no_decay], 'weight_decay_rate': 0.0}
+    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.01},
+    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0}
     ]
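
For completeness, a sketch of how the grouped parameters are then handed to the optimizer named in the title; the import path and the lr/warmup/t_total values below follow the repository's example scripts from memory and should be treated as assumptions rather than part of this PR.

from pytorch_pretrained_bert.optimization import BertAdam

num_train_steps = 1000  # placeholder: total number of optimization steps

# Each group's 'weight_decay_rate' overrides the optimizer-level default,
# so only the first group is actually decayed.
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=5e-5,
                     warmup=0.1,
                     t_total=num_train_steps)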


I think all(nd not in n for nd in no_decay) would be clearer
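
Both spellings select the same parameters (by De Morgan's law), and the endswith variant suggested above differs only when a no_decay fragment appears in the middle of a name; a quick illustration with made-up names and a hypothetical no_decay list (none of this is from the PR itself):

names = ['bert.embeddings.LayerNorm.bias',
         'bert.encoder.layer.0.output.dense.weight',
         'cls.bias_projection.weight']        # contrived: 'bias' appears mid-name
no_decay = ['bias', 'LayerNorm.weight']

for n in names:
    decay_substring = not any(nd in n for nd in no_decay)   # merged condition
    decay_negated = all(nd not in n for nd in no_decay)     # suggested rewrite
    decay_suffix = not any(n.endswith(nd) for nd in no_decay)
    assert decay_substring == decay_negated   # identical by De Morgan's law
    print(n, decay_substring, decay_suffix)

# Only 'cls.bias_projection.weight' is treated differently: decayed under the
# endswith test, exempt under the substring test.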

@xelibrion

Don't mind my comment, tested it further this morning and everything seems to work as expected!

SaulLu pushed a commit to SaulLu/transformers that referenced this pull request Jan 11, 2022
xloem pushed a commit to xloem/transformers that referenced this pull request Apr 9, 2023
* Update trainer and model flows to accommodate sparseml

Disable FP16 on QAT start (huggingface#12)

* Override LRScheduler when using LRModifiers

* Disable FP16 on QAT start

* keep wrapped scaler object for training after disabling

Using QATMatMul in DistilBERT model class (huggingface#41)

Removed double quantization of output of context layer. (huggingface#45)

Fix DataParallel validation forward signatures (huggingface#47)

* Fix: DataParallel validation forward signatures

* Update: generalize forward_fn selection

Best model after epoch (huggingface#46)

fix scaler check for non fp16 mode in trainer (huggingface#38)

Mobilebert QAT (huggingface#55)

* Remove duplicate quantization of vocabulary.

enable a QATWrapper for non-parameterized matmuls in BERT self attention (huggingface#9)

* Utils and auxiliary changes

update Zoo stub loading for SparseZoo 1.1 refactor (huggingface#54)

add flag to signal NM integration is active (huggingface#32)

Add recipe_name to file names

* Fix errors introduced in manual cherry-pick upgrade

Co-authored-by: Benjamin Fineran <[email protected]>
jameshennessytempus pushed a commit to jameshennessytempus/transformers that referenced this pull request Jun 1, 2023
ArthurZucker added a commit that referenced this pull request Apr 5, 2025