Conversation

@sshleifer (Contributor)

Previously, SelfAttention always returned attn_weights, and BartDecoder and BartEncoder then decided whether to return them to the user.
The attn_weights tensor is fairly large, with shape (bs, num_heads, tgt_len, src_len).
This meant that the memory allocated for attn_weights could not be freed until after the forward pass of BartDecoder.

Now: SelfAttention returns (output, None) if config.output_attentions=False, so the weights tensor can be freed as soon as SelfAttention returns.

Impact: peak GPU memory consumption drops by roughly 600MB for batch_size=6, tgt_len=src_len=1024, num_heads=16.

Speed impact: negligible
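For scale, a single attn_weights tensor of the shape quoted above is already substantial. A quick back-of-the-envelope calculation, assuming float32 (the dtype is an assumption, not stated in this PR):

```python
# Size of one attn_weights tensor of shape (bs, num_heads, tgt_len, src_len)
# for the configuration quoted above, assuming 4-byte float32 elements.
bs, num_heads, tgt_len, src_len = 6, 16, 1024, 1024
num_bytes = bs * num_heads * tgt_len * src_len * 4
mib_per_layer = num_bytes / 2**20
print(mib_per_layer)  # MiB held per attention layer
```

At ~384 MiB per layer, keeping even a couple of these tensors alive across the decoder forward pass accounts for the memory saving reported here.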
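The pattern described above can be sketched as follows. This is an illustrative NumPy sketch, not the actual transformers SelfAttention implementation; the function signature and the need_weights parameter name are assumptions for illustration:

```python
import numpy as np

def self_attention(query, key, value, need_weights=False):
    """Sketch of single-head attention that only keeps the weights
    tensor alive when the caller actually asked for it."""
    d = query.shape[-1]
    # scores: (tgt_len, src_len) -- this is the large intermediate
    scores = query @ key.T / np.sqrt(d)
    # softmax over the source dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    output = weights @ value
    # Return None instead of the (tgt_len, src_len) weights tensor when
    # the caller does not need it, so its memory can be reclaimed as soon
    # as this function returns rather than after the full decoder pass.
    return (output, weights) if need_weights else (output, None)

q = np.ones((3, 8))
k = v = np.ones((5, 8))
out, attn = self_attention(q, k, v, need_weights=False)
assert attn is None  # nothing retains the weights tensor
```

The caller-side contract stays simple: code that sets config.output_attentions gets the weights back; everyone else receives None and pays nothing for it.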

@sshleifer sshleifer marked this pull request as ready for review March 22, 2020 16:11
@sshleifer sshleifer requested review from julien-c and thomwolf March 22, 2020 16:26
@sshleifer sshleifer changed the title from "[Bart/Memory] SelfAttention only returns weights if needed" to "[Bart/Memory] SelfAttention only returns weights if config.output_attentions" Mar 22, 2020
@thomwolf (Member) left a comment

Nice!

@sshleifer sshleifer merged commit 63f4d8c into huggingface:master Mar 26, 2020
@sshleifer sshleifer deleted the need-weights-clean branch March 26, 2020 22:42
