
Issues with the EncoderDecoderModel for sequence to sequence tasks #4443

@dbaxter240


❓ Questions & Help

I have been trying to build an encoder-decoder, sequence-to-sequence transformer model, experimenting with various pretrained models. For the most part I have been using BERT (bert-base-cased), but I have encountered issues with several models.

The model is intended for an English to English sequence to sequence problem.

For reference, I have been trying to use the seq2seq example in this pull request as a template:

#3402

I have needed to make some modifications to it, though, to account for other recent changes in the EncoderDecoderModel class.

I have hit a few main issues; three are posted here. I think at least some of them may be bugs in the EncoderDecoderModel code.

1. A recent commit made some major changes to the forward method, and I've been hitting issues with the section that defines decoder_outputs (around line 253 of modeling_encoder_decoder.py). The example in the pull request I linked does not provide decoder_input_ids when setting up the model, but that is now required by the code in your recent commit. For training, I modified the code to provide decoder_input_ids as the target tokens shifted one position to the right with a PAD token in front, as described in various papers. However, I don't understand why this is required in eval mode: shouldn't the model have no decoder input tokens at test/eval time, and only be able to see the tokens it has actually generated so far? I don't understand what I'm supposed to provide as decoder_input_ids in evaluation mode, and I haven't been able to find documentation on it.

The code I'm currently using for training looks something like this:

    for step, batch in enumerate(epoch_iterator):

        # Skip past any already trained steps if resuming training
        if steps_trained_in_current_epoch > 0:
            steps_trained_in_current_epoch -= 1
            continue

        model.train()
        batch = tuple(t.to(args.device) for t in batch)
        input_ids, output_ids, input_mask, output_mask, _, decoder_ids = batch

        # Add other inputs here, including kwargs
        inputs = {"input_ids": input_ids, "attention_mask": input_mask, "decoder_input_ids": decoder_ids}

        # The output tuple structure depends on the model used and the arguments invoked.
        # For BERT-type models, this is
        #   decoder_predictions, encoded_embeddings, encoded_attention_mask = model(**inputs)
        # For GPT2-type models, it at least starts with the decoder predictions.
        # See the EncoderDecoderModel class for more details.
        output = model(**inputs)
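
For completeness, the decoder_ids above come from my own preprocessing, where I shift the target ids one position to the right and put a PAD token in front, roughly like this (just a sketch; the helper name is mine and pad_token_id comes from the tokenizer):

    import torch

    def shift_right(output_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
        """Shift the target ids one position to the right and prepend a PAD
        token, producing the decoder_input_ids used for teacher forcing."""
        shifted = output_ids.new_full(output_ids.shape, pad_token_id)
        shifted[:, 1:] = output_ids[:, :-1].clone()
        return shifted

    # e.g. decoder_ids = shift_right(output_ids, tokenizer.pad_token_id)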

More context is given in the linked pull request, since again this code is copied from there. The initial pull request does not provide the decoder_input_ids parameter, but it seems that is now required. My code in eval mode is similar, but without decoder_input_ids, and it fails:

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        batch = tuple(t.to(args.device) for t in batch)
        input_ids, output_ids, input_mask, output_mask, _, decoder_ids = batch
        with torch.no_grad():
            inputs = {"input_ids": input_ids, "attention_mask": input_mask}

            # The output tuple structure depends on the model used and the arguments invoked.
            # For BERT-type models, this is
            #   decoder_predictions, encoded_embeddings, encoded_attention_mask = model(**inputs)
            # For GPT2-type models, it at least starts with the decoder predictions.
            # See the EncoderDecoderModel class for more details.
            output = model(**inputs)

This code fails in modeling_encoder_decoder.py, line 283, with:

ValueError: You have to specify either input_ids or inputs_embeds
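
To make the eval-mode question concrete: I expected the decoder to only ever see its own previous outputs at test time, so I assumed I would need something like a manual greedy loop inside the torch.no_grad() block above. A rough sketch of what I had in mind (pad_token_id, eos_token_id and max_length are placeholders from my setup, and I'm assuming the first element of the output tuple is the decoder logits):

    # Start the decoder with a single PAD token and feed back its own predictions.
    decoder_ids = torch.full(
        (input_ids.size(0), 1), pad_token_id, dtype=torch.long, device=args.device
    )
    for _ in range(max_length):
        outputs = model(
            input_ids=input_ids,
            attention_mask=input_mask,
            decoder_input_ids=decoder_ids,
        )
        next_token_logits = outputs[0][:, -1, :]  # assuming outputs[0] holds the decoder logits
        next_token = next_token_logits.argmax(dim=-1, keepdim=True)
        decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
        if (next_token == eos_token_id).all():
            break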

2. The pull request uses a GPT2 model as an example, but that no longer works, because the code mentioned in point 1 passes parameters, like encoder_hidden_states, that GPT2 does not accept. When I try to create a GPT2-based model I get exceptions about this extra parameter. In other words, when I switch from a bert-bert model to a gpt2-gpt2 model, the code posted above fails in the forward method of EncoderDecoderModel (line 283 of modeling_encoder_decoder.py) because encoder_hidden_states is an unexpected argument for GPT2. Is this intended / is GPT2 no longer supported for an encoder-decoder architecture using this code?
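
For concreteness, the only change on my side when this breaks is how the two halves of the model are constructed, roughly as follows (a sketch; I'm assuming from_encoder_decoder_pretrained here, which is how I build the BERT version):

    from transformers import EncoderDecoderModel

    # This combination trains for me:
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "bert-base-cased", "bert-base-cased"
    )

    # Switching to GPT2 is where things break: the forward call then fails
    # because encoder_hidden_states is an unexpected argument for GPT2.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained("gpt2", "gpt2")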

3. This one is more of a general question, but since I'm posting the above two as issues anyway, I figured I'd add it here in case anybody can clarify, and save a separate issue from being created.

I believe I'm doing this part correctly, but it was not handled in the example code, so I want to verify if possible: for the decoder attention mask, during training all non-PAD tokens are expected to be unmasked, and during evaluation no mask should be provided so that a default causal mask is used, right?
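
Concretely, what I'm doing for the decoder attention mask during training is just unmasking every non-PAD position, something like this (a sketch; pad_token_id comes from the tokenizer):

    # Training: 1 for every non-PAD decoder token, 0 for PAD positions.
    inputs["decoder_attention_mask"] = (decoder_ids != pad_token_id).long()

    # Evaluation: I pass no decoder_attention_mask at all and rely on the
    # model's default (causal) masking.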

@patrickvonplaten, tagging you in this issue as requested.

Thank you for your time!! Let me know if you need more code. Again, my code is about 95% identical to the run_seq2seq.py example in the linked PR, with just a few changes to account for recent modifications in modeling_encoder_decoder.py.
