Skip to content

[BUG]: colossalai run failed with unknown reason #2441

@FrankLeeeee

Description

@FrankLeeeee

🐛 Describe the bug

Some users have reported that they encounter launch failure when using colossalai run, but there is no error message to tell exactly what went wrong. A sample output is given below.

Error: failed to run torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py --config config.py -s on 127.0.0.1

Environment

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions