Hi, while working with BioGPT I came across #39016, which avoids computing logits for the entire sequence when they aren't needed. Would you be open to a PR that applies the same improvement to BioGPT and, while I'm at it, to any other GenerationMixin models that would benefit? I understand that some of these models may be somewhat obsolete and that others might be better served by a refactor to modular, but as long as they're in the library it seems worth doing.
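
To make the idea concrete, here is a rough, library-free sketch of what I have in mind. It assumes the change amounts to slicing the hidden states before the LM head projection, controlled by a `logits_to_keep`-style argument. `lm_head` and `compute_logits` below are toy stand-ins I made up for illustration, not the actual transformers code.

```python
# Toy stand-in (hypothetical) for an LM head: projects one hidden vector
# of size hidden_dim onto vocab_size logits using a plain weight matrix.
def lm_head(hidden, weight):
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def compute_logits(hidden_states, weight, logits_to_keep=0):
    """Project only the positions that are actually needed.

    hidden_states: list of per-position hidden vectors (seq_len x hidden_dim)
    weight:        vocab_size x hidden_dim projection matrix
    logits_to_keep: 0 keeps every position; k > 0 keeps only the last k
    """
    # slice(-0, None) is slice(0, None), so 0 selects the whole sequence.
    kept = hidden_states[slice(-logits_to_keep, None)]
    return [lm_head(h, weight) for h in kept]


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # seq_len=3, hidden_dim=2
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]         # vocab_size=3

all_logits = compute_logits(hidden_states, weight)       # 3 positions projected
last_logits = compute_logits(hidden_states, weight, 1)   # only the last position
```

During generation only the last position's logits are used to pick the next token, so projecting a single position instead of the whole sequence skips most of the `seq_len x vocab_size` work and memory.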