System Info
- transformers version: 4.27.0.dev0
- Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: deepspeed
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am attempting to train a roberta-base model using the defaults on a custom corpus, with the following command:
deepspeed --num_gpus 8 run_mlm.py \
    --model_type roberta \
    --max_seq_length 128 \
    --do_train \
    --per_device_train_batch_size 512 \
    --fp16 \
    --save_total_limit 3 \
    --num_train_epochs 30 \
    --deepspeed ds_config.json \
    --learning_rate 1e-4 \
    --eval_steps 50 \
    --max_eval_samples 4000 \
    --evaluation_strategy steps \
    --tokenizer_name "roberta-large" \
    --warmup_steps 30000 \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --lr_scheduler_type linear \
    --preprocessing_num_workers 8 \
    --train_file my_text.txt \
    --line_by_line \
    --output_dir my_roberta_base
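I have not reproduced my ds_config.json above. As a rough stand-in, a minimal config of the kind this command expects could look like the snippet below; the ZeRO stage and the "auto" fields (which the HF Trainer fills in from its own arguments) are illustrative, not my exact file:

# Hypothetical stand-in for ds_config.json (not the actual file used in this
# run). "auto" values are resolved by the Trainer's DeepSpeed integration;
# ZeRO stage 2 is only an assumption here.
import json

ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)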
The training works: the loss goes down and the accuracy goes up. However, when I compare the outputs to the original roberta-base, I see behavior that appears to be a glitch or a problem with the training.
Expected behavior
Using roberta-base from the Hugging Face Hub, the first and last tokens of the output are the <bos> and <eos> tokens, respectively, while my newly trained roberta-base model predicts token #8 (" and") at both positions. I think these positions were learned during training instead of being automatically set to <bos>/<eos>, which is what the expected behavior for this script should be.
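For reference, this claim can be checked directly against the tokenizer (a quick sketch; these are the standard roberta-base special tokens and vocabulary):

# Sanity check (sketch): what the boundary tokens should be, and what
# token #8 actually is in the roberta-base vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.bos_token, tokenizer.bos_token_id)   # <s> 0
print(tokenizer.eos_token, tokenizer.eos_token_id)   # </s> 2
print(repr(tokenizer.decode([8])))                   # should show ' and'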
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Reference model from the Hub vs. my newly trained checkpoint
model1 = AutoModelForMaskedLM.from_pretrained("roberta-base", torch_dtype=torch.float16).cuda(0)
model2 = AutoModelForMaskedLM.from_pretrained("rob_wiki_base", torch_dtype=torch.float16).cuda(0)

text = "The main causes of death for <mask> are human-related issues, such as habitat destruction and human objects. Their slow-moving, curious <mask> has led to violent collisions with propeller-driven boats and ships. Some manatees have been found with over 50 scars on them from propeller <mask>. Natural causes of death include adverse temperatures, predation by <mask> on young, and disease."
inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")

output1 = model1(inputs["input_ids"].cuda(0))
output2 = model2(inputs["input_ids"].cuda(0))

# Greedy prediction at every position (not just the masked ones)
predicted_token_id1 = output1[0][0].argmax(dim=-1)
predicted_token_id2 = output2[0][0].argmax(dim=-1)

print("Original roberta-base output:")
print(predicted_token_id1)
print(tokenizer.decode(predicted_token_id1))
print("-" * 20)
print("My new roberta-base output:")
print(predicted_token_id2)
print(tokenizer.decode(predicted_token_id2))
print("-" * 20)
Original roberta-base output:
tensor([ 0, 133, 1049, 4685, 9, 744, 13, 18018, 32, 1050,
12, 3368, 743, 6, 215, 25, 14294, 8181, 8, 1050,
8720, 4, 2667, 2635, 12, 19838, 6, 10691, 3650, 34,
669, 7, 4153, 25062, 19, 39238, 12853, 12, 9756, 8934,
8, 7446, 4, 993, 313, 877, 293, 33, 57, 303,
19, 81, 654, 26172, 15, 106, 31, 39238, 12853, 5315,
4, 7278, 4685, 9, 744, 680, 12661, 3971, 6, 12574,
1258, 30, 22139, 15, 664, 6, 8, 2199, 4, 2],
device='cuda:0')
The main causes of death for whales are human-related issues, such as habitat destruction and human objects. Their slow-moving, curious behavior has led to violent collisions with propeller-driven boats and ships. Some manatees have been found with over 50 scars on them from propeller strikes. Natural causes of death include adverse temperatures, predation by predators on young, and disease.
My new roberta-base output:
tensor([ 8, 133, 1049, 4685, 9, 744, 13, 5868, 32, 1050,
12, 3368, 743, 6, 215, 25, 14294, 8181, 8, 1050,
8720, 4, 2667, 2635, 12, 19838, 6, 10691, 2574, 34,
669, 7, 4153, 25062, 19, 39238, 12853, 12, 9756, 8934,
8, 7446, 4, 993, 313, 877, 293, 33, 57, 303,
19, 81, 654, 26172, 15, 106, 31, 39238, 12853, 5315,
4, 7278, 4685, 9, 744, 680, 12661, 3971, 6, 12574,
1258, 30, 5868, 15, 664, 6, 8, 2199, 4, 8],
device='cuda:0')
andThe main causes of death for humans are human-related issues, such as habitat destruction and human objects. Their slow-moving, curious nature has led to violent collisions with propeller-driven boats and ships. Some manatees have been found with over 50 scars on them from propeller strikes. Natural causes of death include adverse temperatures, predation by humans on young, and disease. and
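My hunch is that the boundary positions are simply never supervised during MLM training, so what the new model predicts there is arbitrary. A quick way to check (my own sketch, not taken from run_mlm.py) is to look at the labels the default collator produces:

# Sketch: inspect the labels produced by the standard MLM collator.
# Positions left at -100 contribute no loss; if <s> and </s> are always
# -100, the model is never trained to predict them.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

example = tokenizer("a simple test sentence")
batch = collator([example])
print(batch["labels"][0])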