2 changes: 1 addition & 1 deletion ISSUES.md
@@ -152,7 +152,7 @@ You are not required to read the following guidelines before opening an issue. H

```bash
cd examples/seq2seq
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
torchrun --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
--output_dir output_dir --overwrite_output_dir \
--do_train --n_train 500 --num_train_epochs 1 \
2 changes: 1 addition & 1 deletion docs/source/de/run_scripts.md
@@ -130,7 +130,7 @@ Der [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) unt
- Set the number of GPUs to use with the `nproc_per_node` argument.

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path t5-small \
2 changes: 1 addition & 1 deletion docs/source/en/main_classes/deepspeed.md
@@ -287,7 +287,7 @@ The information in this section isn't specific to the DeepSpeed integration

For the duration of this section let's assume that you have 2 nodes with 8 GPUs each. You can reach the first node with `ssh hostname1` and the second node with `ssh hostname2`, and both must be able to reach each other over ssh without a password. Of course, you will need to replace these host (node) names with the actual host names you are working with.
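Before launching anything, it is worth a quick sanity check that the passwordless ssh setup actually works. This is only an illustrative sketch; `hostname1` and `hostname2` are the placeholder names from above:

```bash
# Each command should print the remote machine's hostname without prompting
# for a password; run the mirror-image checks from the other node as well.
ssh hostname1 hostname
ssh hostname2 hostname
```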

#### The torch.distributed.run launcher
#### The torch.distributed.run (torchrun) launcher


For example, to use `torch.distributed.run`, you could do:
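A minimal sketch of such a two-node launch (the script name `your_program.py`, its arguments, and the port are placeholders, not the documented example) runs one command per node:

```bash
# On node 0 (hostname1)
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=hostname1 --master_port=9901 your_program.py --arg1 --arg2

# On node 1 (hostname2): identical except for the node rank
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr=hostname1 --master_port=9901 your_program.py --arg1 --arg2
```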
8 changes: 4 additions & 4 deletions docs/source/en/main_classes/trainer.md
@@ -206,7 +206,7 @@ Let's discuss how you can tell your program which GPUs are to be used and in wha
When using [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) with only a subset of your GPUs, you simply specify the number of GPUs to use. For example, if you have 4 GPUs but you wish to use only the first 2, you can do:

```bash
python -m torch.distributed.launch --nproc_per_node=2 trainer-program.py ...
torchrun --nproc_per_node=2 trainer-program.py ...
```

If you have either [`accelerate`](https://github.com/huggingface/accelerate) or [`deepspeed`](https://github.com/microsoft/DeepSpeed) installed, you can also accomplish the same by using one of:
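For reference, the corresponding launches look roughly like the following (a sketch; exact flags depend on the versions you have installed):

```bash
# accelerate launcher
accelerate launch --num_processes 2 trainer-program.py ...

# DeepSpeed launcher
deepspeed --num_gpus 2 trainer-program.py ...
```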
@@ -233,15 +233,15 @@ If you have multiple GPUs and you'd like to use only 1 or a few of those GPUs, s
For example, let's say you have 4 GPUs: 0, 1, 2 and 3. To run only on the physical GPUs 0 and 2, you can do:

```bash
CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch trainer-program.py ...
CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
```

So now pytorch will see only 2 GPUs, where your physical GPUs 0 and 2 are mapped to `cuda:0` and `cuda:1` respectively.
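A quick way to confirm the remapping (an illustrative one-liner, not part of the documented example):

```bash
# With only physical GPUs 0 and 2 visible, PyTorch reports 2 devices,
# and their names follow the new cuda:0 / cuda:1 order.
CUDA_VISIBLE_DEVICES=0,2 python -c "import torch; print(torch.cuda.device_count()); print(torch.cuda.get_device_name(0), torch.cuda.get_device_name(1))"
```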

You can even change their order:

```bash
CUDA_VISIBLE_DEVICES=2,0 python -m torch.distributed.launch trainer-program.py ...
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```

Here your physical GPUs 0 and 2 are mapped to `cuda:1` and `cuda:0` respectively.
@@ -263,7 +263,7 @@ As with any environment variable you can, of course, export those instead of add

```bash
export CUDA_VISIBLE_DEVICES=0,2
python -m torch.distributed.launch trainer-program.py ...
torchrun trainer-program.py ...
```

but this approach can be confusing, since you may forget you set the environment variable earlier and not understand why the wrong GPUs are used. Therefore, it's common practice to set the environment variable just for a specific run on the same command line, as shown in most examples in this section.
4 changes: 2 additions & 2 deletions docs/source/en/perf_hardware.md
@@ -134,7 +134,7 @@ Here is the full benchmark code and outputs:
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@@ -143,7 +143,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
4 changes: 2 additions & 2 deletions docs/source/en/perf_train_gpu_many.md
@@ -153,7 +153,7 @@ python examples/pytorch/language-modeling/run_clm.py \

```
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

@@ -164,7 +164,7 @@ python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-

```
rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

2 changes: 1 addition & 1 deletion docs/source/en/run_scripts.md
@@ -130,7 +130,7 @@ The [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) sup
- Set the number of GPUs to use with the `nproc_per_node` argument.

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path t5-small \
2 changes: 1 addition & 1 deletion docs/source/es/run_scripts.md
@@ -130,7 +130,7 @@ python examples/tensorflow/summarization/run_summarization.py \
- Set the number of GPUs to use with the `nproc_per_node` argument.

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path t5-small \
4 changes: 2 additions & 2 deletions docs/source/it/perf_hardware.md
@@ -134,7 +134,7 @@ Ecco il codice benchmark completo e gli output:
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@@ -143,7 +143,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
2 changes: 1 addition & 1 deletion docs/source/it/run_scripts.md
@@ -130,7 +130,7 @@ Il [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) supp
- Set the number of GPUs to use with the `nproc_per_node` argument.

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path t5-small \
8 changes: 4 additions & 4 deletions docs/source/ja/main_classes/trainer.md
@@ -196,7 +196,7 @@ _python_、_numpy_、および _pytorch_ の RNG 状態は、そのチェック
When using [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) to use only a subset of your GPUs, you simply specify the number of GPUs to use. For example, if you have 4 GPUs but you wish to use only the first 2, you can do:

```bash
python -m torch.distributed.launch --nproc_per_node=2 trainer-program.py ...
torchrun --nproc_per_node=2 trainer-program.py ...
```

If you have either [`accelerate`](https://github.com/huggingface/accelerate) or [`deepspeed`](https://github.com/microsoft/DeepSpeed) installed, you can also accomplish the same by using one of:
@@ -223,15 +223,15 @@ deepspeed --num_gpus 2 trainer-program.py ...
For example, let's say you have 4 GPUs: 0, 1, 2 and 3. To run only on the physical GPUs 0 and 2, you can do:

```bash
CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch trainer-program.py ...
CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
```

So now pytorch will see only 2 GPUs, where your physical GPUs 0 and 2 are mapped to `cuda:0` and `cuda:1` respectively.

You can even change their order:

```bash
CUDA_VISIBLE_DEVICES=2,0 python -m torch.distributed.launch trainer-program.py ...
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```

Here your physical GPUs 0 and 2 are mapped to `cuda:1` and `cuda:0` respectively.
@@ -253,7 +253,7 @@ CUDA_VISIBLE_DEVICES= python trainer-program.py ...

```bash
export CUDA_VISIBLE_DEVICES=0,2
python -m torch.distributed.launch trainer-program.py ...
torchrun trainer-program.py ...
```

However, this approach can be confusing, since you may forget you set the environment variable earlier and not understand why the wrong GPUs are used. Therefore, it's common practice to set the environment variable just for a specific run on the same command line, as shown in most examples in this section.
4 changes: 2 additions & 2 deletions docs/source/ja/perf_hardware.md
@@ -139,7 +139,7 @@ NVLinkを使用すると、トレーニングが約23%速く完了すること
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@@ -148,7 +148,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
4 changes: 2 additions & 2 deletions docs/source/ja/perf_train_gpu_many.md
@@ -143,15 +143,15 @@ python examples/pytorch/language-modeling/run_clm.py \

# DDP w/ NVlink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVlink
rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

2 changes: 1 addition & 1 deletion docs/source/ja/run_scripts.md
@@ -140,7 +140,7 @@ python examples/tensorflow/summarization/run_summarization.py \
Below is the provided Bash code.

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path t5-small \
4 changes: 2 additions & 2 deletions docs/source/ko/perf_hardware.md
@@ -135,7 +135,7 @@ NVLink 사용 시 훈련이 약 23% 더 빠르게 완료됨을 확인할 수 있
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@@ -144,7 +144,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
4 changes: 2 additions & 2 deletions docs/source/ko/perf_train_gpu_many.md
@@ -145,15 +145,15 @@ python examples/pytorch/language-modeling/run_clm.py \

# DDP w/ NVlink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVlink
rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

2 changes: 1 addition & 1 deletion docs/source/ko/run_scripts.md
@@ -141,7 +141,7 @@ python examples/tensorflow/summarization/run_summarization.py \
- Set the number of GPUs to use by adding the `nproc_per_node` argument.

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path t5-small \
2 changes: 1 addition & 1 deletion docs/source/pt/run_scripts.md
@@ -131,7 +131,7 @@ O [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) ofere
- Set the number of GPUs to use with the `nproc_per_node` argument.

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path t5-small \
4 changes: 2 additions & 2 deletions docs/source/zh/perf_hardware.md
@@ -135,7 +135,7 @@ GPU1 PHB X 0-11 N/A
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
@@ -144,7 +144,7 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
2 changes: 1 addition & 1 deletion docs/source/zh/run_scripts.md
@@ -133,7 +133,7 @@ python examples/tensorflow/summarization/run_summarization.py \


```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path t5-small \
6 changes: 3 additions & 3 deletions examples/legacy/question-answering/README.md
@@ -18,7 +18,7 @@ in Huang et al. [Improve Transformer Models with Better Relative Position Embedd

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \
--dataset_name squad \
--do_train \
@@ -46,7 +46,7 @@ gpu training leads to the f1 score of 90.71.

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \
--dataset_name squad \
--do_train \
@@ -68,7 +68,7 @@ Training with the above command leads to the f1 score of 93.52, which is slightl
Here is an example using distributed training on 8 V100 GPUs and the Bert Whole Word Masking uncased model to reach an F1 > 93 on SQuAD1.1:

```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--dataset_name squad \
--do_train \
4 changes: 2 additions & 2 deletions examples/legacy/seq2seq/README.md
@@ -140,7 +140,7 @@ python finetune_trainer.py --help

For multi-gpu training use `torchrun`, e.g. with 2 GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node=2 finetune_trainer.py ...
torchrun --nproc_per_node=2 finetune_trainer.py ...
```

**At the moment, `Seq2SeqTrainer` does not support *with teacher* distillation.**
@@ -214,7 +214,7 @@ because it uses SortishSampler to minimize padding. You can also use it on 1 GPU
`{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all clargs.

```bash
python -m torch.distributed.launch --nproc_per_node=8 run_distributed_eval.py \
torchrun --nproc_per_node=8 run_distributed_eval.py \
--model_name sshleifer/distilbart-large-xsum-12-3 \
--save_dir xsum_generations \
--data_dir xsum \
4 changes: 2 additions & 2 deletions examples/pytorch/README.md
@@ -98,7 +98,7 @@ the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)
use the following command:

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node number_of_gpu_you_have path_to_script.py \
--all_arguments_of_the_script
```
@@ -107,7 +107,7 @@ As an example, here is how you would fine-tune the BERT large model (with whole
classification MNLI task using the `run_glue` script, with 8 GPUs:

```bash
python -m torch.distributed.launch \
torchrun \
--nproc_per_node 8 pytorch/text-classification/run_glue.py \
--model_name_or_path bert-large-uncased-whole-word-masking \
--task_name mnli \