System Info
- transformers version: 4.27.0.dev0
- Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: deepspeed
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am attempting to train a roberta-base model using the defaults on a custom corpus, with the following command:
deepspeed --num_gpus 8 run_mlm.py \
    --model_type roberta \
    --max_seq_length 128 \
    --do_train \
    --per_device_train_batch_size 512 \
    --fp16 \
    --save_total_limit 3 \
    --num_train_epochs 30 \
    --deepspeed ds_config.json \
    --learning_rate 1e-4 \
    --eval_steps 50 \
    --max_eval_samples 4000 \
    --evaluation_strategy steps \
    --tokenizer_name "roberta-large" \
    --warmup_steps 30000 \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --lr_scheduler_type linear \
    --preprocessing_num_workers 8 \
    --train_file my_text.txt \
    --line_by_line \
    --output_dir my_roberta_base
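I have not reproduced my ds_config.json above. As a rough stand-in, a minimal config of the kind this command expects could look like the snippet below; the ZeRO stage and the "auto" fields (which the HF Trainer fills in from its own arguments) are illustrative, not my exact file:

# Hypothetical stand-in for ds_config.json (not the actual file used in this
# run). "auto" values are resolved by the Trainer's DeepSpeed integration;
# ZeRO stage 2 is only an assumption here.
import json

ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)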
The training works: the loss goes down and the accuracy goes up. However, when I compare the outputs to the original roberta-base, I see behavior that appears to be a glitch or a problem with the training.
Expected behavior
Using roberta-base from the Hugging Face Hub, the first and last tokens of the output are the <bos> and <eos> tokens, respectively, while my newly trained roberta-base model predicts token #8 (" and") at both positions. I think these positions were learned during training instead of being automatically set to <bos>/<eos>, which is what the expected behavior for this script should be.
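For reference, this claim can be checked directly against the tokenizer (a quick sketch; these are the standard roberta-base special tokens and vocabulary):

# Sanity check (sketch): what the boundary tokens should be, and what
# token #8 actually is in the roberta-base vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.bos_token, tokenizer.bos_token_id)   # <s> 0
print(tokenizer.eos_token, tokenizer.eos_token_id)   # </s> 2
print(repr(tokenizer.decode([8])))                   # should show ' and'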
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Reference model from the Hub vs. my newly trained checkpoint
model1 = AutoModelForMaskedLM.from_pretrained("roberta-base", torch_dtype=torch.float16).cuda(0)
model2 = AutoModelForMaskedLM.from_pretrained("rob_wiki_base", torch_dtype=torch.float16).cuda(0)

text = "The main causes of death for <mask> are human-related issues, such as habitat destruction and human objects. Their slow-moving, curious <mask> has led to violent collisions with propeller-driven boats and ships. Some manatees have been found with over 50 scars on them from propeller <mask>. Natural causes of death include adverse temperatures, predation by <mask> on young, and disease."
inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")

output1 = model1(inputs["input_ids"].cuda(0))
output2 = model2(inputs["input_ids"].cuda(0))

# Greedy prediction at every position (not just the masked ones)
predicted_token_id1 = output1[0][0].argmax(dim=-1)
predicted_token_id2 = output2[0][0].argmax(dim=-1)

print("Original roberta-base output:")
print(predicted_token_id1)
print(tokenizer.decode(predicted_token_id1))
print("-" * 20)
print("My new roberta-base output:")
print(predicted_token_id2)
print(tokenizer.decode(predicted_token_id2))
print("-" * 20)
Original roberta-base output:
tensor([ 0, 133, 1049, 4685, 9, 744, 13, 18018, 32, 1050,
12, 3368, 743, 6, 215, 25, 14294, 8181, 8, 1050,
8720, 4, 2667, 2635, 12, 19838, 6, 10691, 3650, 34,
669, 7, 4153, 25062, 19, 39238, 12853, 12, 9756, 8934,
8, 7446, 4, 993, 313, 877, 293, 33, 57, 303,
19, 81, 654, 26172, 15, 106, 31, 39238, 12853, 5315,
4, 7278, 4685, 9, 744, 680, 12661, 3971, 6, 12574,
1258, 30, 22139, 15, 664, 6, 8, 2199, 4, 2],
device='cuda:0')
The main causes of death for whales are human-related issues, such as habitat destruction and human objects. Their slow-moving, curious behavior has led to violent collisions with propeller-driven boats and ships. Some manatees have been found with over 50 scars on them from propeller strikes. Natural causes of death include adverse temperatures, predation by predators on young, and disease.
My new roberta-base output:
tensor([ 8, 133, 1049, 4685, 9, 744, 13, 5868, 32, 1050,
12, 3368, 743, 6, 215, 25, 14294, 8181, 8, 1050,
8720, 4, 2667, 2635, 12, 19838, 6, 10691, 2574, 34,
669, 7, 4153, 25062, 19, 39238, 12853, 12, 9756, 8934,
8, 7446, 4, 993, 313, 877, 293, 33, 57, 303,
19, 81, 654, 26172, 15, 106, 31, 39238, 12853, 5315,
4, 7278, 4685, 9, 744, 680, 12661, 3971, 6, 12574,
1258, 30, 5868, 15, 664, 6, 8, 2199, 4, 8],
device='cuda:0')
andThe main causes of death for humans are human-related issues, such as habitat destruction and human objects. Their slow-moving, curious nature has led to violent collisions with propeller-driven boats and ships. Some manatees have been found with over 50 scars on them from propeller strikes. Natural causes of death include adverse temperatures, predation by humans on young, and disease. and
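My hunch is that the boundary positions are simply never supervised during MLM training, so what the new model predicts there is arbitrary. A quick way to check (my own sketch, not taken from run_mlm.py) is to look at the labels the default collator produces:

# Sketch: inspect the labels produced by the standard MLM collator.
# Positions left at -100 contribute no loss; if <s> and </s> are always
# -100, the model is never trained to predict them.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

example = tokenizer("a simple test sentence")
batch = collator([example])
print(batch["labels"][0])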