change: [smdataparallel] better messages to establish the SSH connection between workers #103

karan6181 · 2021-04-08T21:21:40Z

Description of changes:

This change provides clearer error/info messages while the workers establish their socket connections. A failed first or subsequent attempt no longer logs an error message with a traceback. The error is reported only after repeated connection attempts time out (default: 1 hour).

Before:

You can see Cannot connect to host algo-2 at port 22 followed by paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on the first try, and then Can connect to host algo-2 at port 22 right after that. The error message leaves the user unsure whether the script has failed. The actual reason is that the worker attempts the handshake on the first try and, if that fails, retries.

2021-04-06 23:03:28,091 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2021-04-06 23:03:28,763 sagemaker-training-toolkit INFO     Starting MPI run as worker node.
2021-04-06 23:03:28,763 sagemaker-training-toolkit INFO     Creating SSH daemon.
2021-04-06 23:03:28,771 sagemaker-training-toolkit INFO     Waiting for MPI workers to establish their SSH connections
2021-04-06 23:03:28,773 sagemaker-training-toolkit INFO     Cannot connect to host algo-2 at port 22
2021-04-06 23:03:28,773 sagemaker-training-toolkit ERROR    Connection failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/smdataparallel.py", line 317, in _can_connect
    client.connect(host, port=port)
  File "/usr/local/lib/python3.7/site-packages/paramiko/client.py", line 368, in connect
    raise NoValidConnectionsError(errors)
paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on xx.xx.xxx.xxx
2021-04-06 23:03:28,774 sagemaker-training-toolkit INFO     Connection closed
2021-04-06 23:03:29,855 paramiko.transport INFO     Authentication (publickey) successful!
2021-04-06 23:03:29,856 sagemaker-training-toolkit INFO     Can connect to host algo-2 at port 22
2021-04-06 23:03:29,856 sagemaker-training-toolkit INFO     Connection closed
2021-04-06 23:03:29,856 sagemaker-training-toolkit INFO     Worker algo-2 available for communication

After: (Success)

You can see that the message now reads Cannot connect to host algo-2 at port 22. Retrying..., with the word Retrying... appended. No error message or traceback is logged for a failed connection attempt.

2021-04-08 21:15:37,862 sagemaker-training-toolkit INFO     Cannot connect to host algo-2 at port 22. Retrying...
2021-04-08 21:15:37,863 sagemaker-training-toolkit INFO     Connection closed
2021-04-08 21:15:38,864 sagemaker-training-toolkit INFO     Cannot connect to host algo-2 at port 22. Retrying...
2021-04-08 21:15:38,865 sagemaker-training-toolkit INFO     Connection closed
2021-04-08 21:15:39,873 paramiko.transport INFO     Connected (version 2.0, client OpenSSH_7.6p1)
2021-04-08 21:15:39,952 paramiko.transport INFO     Authentication (publickey) successful!
2021-04-08 21:15:39,952 sagemaker-training-toolkit INFO     Can connect to host algo-2 at port 22
2021-04-08 21:15:39,952 sagemaker-training-toolkit INFO     Connection closed
2021-04-08 21:15:39,952 sagemaker-training-toolkit INFO     Worker algo-2 available for communication
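
As a rough illustration of the behavior in the success log above, here is a minimal sketch of a probe that reports a failed attempt at INFO level without a traceback. It is not the toolkit's code: `can_connect` here is a hypothetical standalone function, it uses a plain TCP connection through `socket.create_connection` instead of the paramiko SSH client that appears in the traceback, and the logger name is made up.

```python
import logging
import socket

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("ssh-probe-sketch")


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            logger.info("Can connect to host %s at port %s", host, port)
            return True
    except OSError:
        # A refused or timed-out connection is expected while the peer is
        # still starting up, so log at INFO without a traceback instead of
        # calling logger.exception().
        logger.info("Cannot connect to host %s at port %s. Retrying...", host, port)
        return False
    finally:
        logger.info("Connection closed")
```

The caller decides whether a failed probe is fatal, so the probe itself never needs to log at ERROR level.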

After: (Failure)

2021-04-08 22:44:19,309 sagemaker-training-toolkit INFO     Cannot connect to host algo-2 at port 22. Retrying...
2021-04-08 22:44:19,309 sagemaker-training-toolkit INFO     Connection closed
2021-04-08 22:44:20,310 sagemaker-training-toolkit INFO     Cannot connect to host algo-2 at port 22. Retrying...
2021-04-08 22:44:20,311 sagemaker-training-toolkit INFO     Connection closed
2021-04-08 22:44:21,298 sagemaker-training-toolkit ERROR    Connection between the hosts couldn't established. Aborting the training.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/smdataparallel.py", line 93, in _wait_for_workers
    time.sleep(self._interval)
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/timeout.py", line 46, in handler
    raise TimeoutError("timed out after {} seconds".format(limit))
sagemaker_training.timeout.TimeoutError: timed out after 10 seconds
2021-04-08 22:44:21,301 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2021-04-08 22:44:21,301 sagemaker-training-toolkit ERROR    framework error: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/trainer.py", line 85, in train
    entrypoint()
  File "/usr/local/lib/python3.7/site-packages/sagemaker_tensorflow_container/training.py", line 235, in main
    train(env, mapping.to_cmd_args(user_hyperparameters))
  File "/usr/local/lib/python3.7/site-packages/sagemaker_tensorflow_container/training.py", line 173, in train
    runner_type=runner_type,
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/smdataparallel.py", line 261, in run
    self._setup()
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/smdataparallel.py", line 83, in _setup
    self._wait_for_workers()
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/smdataparallel.py", line 93, in _wait_for_workers
    time.sleep(self._interval)
  File "/usr/local/lib/python3.7/site-packages/sagemaker_training/timeout.py", line 46, in handler
    raise TimeoutError("timed out after {} seconds".format(limit))
sagemaker_training.timeout.TimeoutError: timed out after xx seconds
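
The failure log above shows the error being reported once, after the overall wait times out. A minimal sketch of that pattern follows. It is an assumption-laden illustration, not the toolkit's implementation: the traceback suggests the toolkit raises its own `TimeoutError` from a handler in `timeout.py`, whereas this sketch uses a simple deadline check and Python's built-in `TimeoutError`. The function name `wait_for_workers`, its `probe` callback, and its parameters are hypothetical.

```python
import logging
import time

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("wait-for-workers-sketch")


def wait_for_workers(hosts, probe, timeout=3600.0, interval=1.0):
    """Poll probe(host) until every host answers, or raise TimeoutError.

    probe is any callable that returns True once the host is reachable.
    """
    deadline = time.monotonic() + timeout
    pending = list(hosts)
    while True:
        still_pending = []
        for host in pending:
            if probe(host):
                logger.info("Worker %s available for communication", host)
            else:
                still_pending.append(host)
        pending = still_pending
        if not pending:
            return
        if time.monotonic() >= deadline:
            # Only here, after the whole wait has expired, is an error logged.
            logger.error("Could not connect to hosts %s. Aborting.", pending)
            raise TimeoutError("timed out after {} seconds".format(timeout))
        time.sleep(interval)
```

With the default `timeout=3600.0` this mirrors the one-hour default mentioned in the description; a short timeout such as the 10 seconds seen in the log is convenient for testing.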

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

I have read the CONTRIBUTING doc
I used the commit message format described in CONTRIBUTING
I have used the regional endpoint when creating S3 and/or STS clients (if appropriate)
I have updated any necessary documentation, including READMEs

Tests

I have added tests that prove my fix is effective or that my feature works (if appropriate)
I have checked that my tests are not configured for a specific region or account (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

sagemaker-bot · 2021-04-08T21:25:49Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-training-toolkit-pr
Commit ID: 0dcd738
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

ChaiBapchya

Thanks for the change and the explanation!

ChaiBapchya · 2021-04-08T21:31:39Z

Let's run the tox -e flake8,black-check,pylint --parallel all locally & fix 'em up so that the linter gods are happy!

rondogency

That's very helpful!

src/sagemaker_training/smdataparallel.py

sagemaker-bot · 2021-04-08T21:58:48Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-training-toolkit-pr
Commit ID: 1605c6a
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot · 2021-04-08T22:58:47Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-training-toolkit-pr
Commit ID: 886a345
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

roywei

LGTM!

src/sagemaker_training/smdataparallel.py

* fix: propagate log level to aws services (aws#79) * fix: propagate log level to aws services * drop py27 and add py38 support * update unit test * recover buildspeck * remove py38 build * install latest sagemaker 1.x version * fix: removing py27/py38 * fix arg name Co-authored-by: Chuyang Deng <[email protected]> * prepare release v3.6.3 * update development version to v3.6.4.dev0 * doc: fix typo in ENVIRONMENT_VARIABLES.md (aws#81) Removed typo ')'. Co-authored-by: Ajay Karpur <[email protected]> * prepare release v3.6.3.post0 * update development version to v3.6.4.dev0 * infra: use ECR-hosted image for ubuntu:16.04 (aws#87) * infra: use ECR-hosted image for ubuntu:16.04 * use public ECR repo * disable prompts in Docker build * fix: workaround to print stderr when capturing (aws#86) Co-authored-by: Ajay Karpur <[email protected]> * prepare release v3.6.4 * update development version to v3.6.5.dev0 * feature: add data parallelism support (aws#3) (aws#8) * change: use format in place of f-strings and use comment style type annotations (aws#10) * change: update tox to use sagemaker 2.18.0 for tests * prepare release v3.7.0 * update development version to v3.7.1.dev0 * fix:decode binary stderr string before dumping it out (aws#89) * fix:decode binary stderr string before dumping it out * fix failing test Co-authored-by: Rui Wang Napieralski <[email protected]> * prepare release v3.7.1 * update development version to v3.7.2.dev0 * change: set btl_vader_single_copy_mechanism to none (aws#90) * prepare release v3.7.2 * update development version to v3.7.3.dev0 * change: set btl_vader_single_copy_mechanism to none to avoid Read -1 Warning messages (aws#95) * prepare release v3.7.3 * update development version to v3.7.4.dev0 * Update Dockerfile to accomomdate Rust dependency. (aws#98) * Update Dockerfile to accomomdate Rust dependency. cryptography module has added RUST as its dependency. Upgrading PIP to solve this dependency. 
* pinning to particular version of pip pinned to pip version 21.0.1 which solves the Rust dependency * prepare release v3.7.4 * update development version to v3.7.5.dev0 * Change: smdataparallel change FI_PROVIDER to efa from sockets (aws#96) * prepare release v3.7.5 * update development version to v3.7.6.dev0 * feature: smdataparallel custom mpi options support (aws#99) * feature: smdataparallel custom mpi options support * Fixed pylint * Fixed black-check * Fixed unit test * prepare release v3.8.0 * update development version to v3.8.1.dev0 * feature: smdataparallel enable EFA RDMA flag (aws#101) * feature: smdataparallel enable EFA RDMA flag * added changes to unit test * updated the flag to use only for ml.p4d.24xlarge instance * prepare release v3.9.0 * update development version to v3.9.1.dev0 * change: [smdataparallel] better messages to establish the SSH connection between workers (aws#103) * change: [smdataparallel] better messages for to establish the SSH connection between workers * python timeout.timeout raises TimeoutError * Added detailed error message * prepare release v3.9.1 * update development version to v3.9.2.dev0 * Reverted -x FI_EFA_USE_DEVICE_RDMA=1 to fix a crash on PyTorch Dataloaders for Distributed training (aws#106) * prepare release v3.9.2 * update development version to v3.9.3.dev0 * Fix logging issues (aws#108) * Fix logging issues Use asyncio to read stdout and stderr streams in realtime Report Exit code on failures Convey user informative message if process gets OOM Killed Filter out stderr to look for error messages and report Prepend tags to the log files to enable easy filtering in CloudWatch Update Amazon Licensing Update SM doc urls Support - Added Py38, Removed py36 and py27 Added unittests for asyncio APIs Install libssl1.1 and openssl packages * prepare release v3.9.3 * update development version to v3.9.4.dev0 * breaking: Add py38, dropped py36 and py2 support. 
Bump pypi to 4.0.0 (changes from PR aws#108) (aws#109) * prepare release v4.0.0 * update development version to v4.0.1.dev0 * Fix: Enable custom failure logging (aws#118) * prepare release v4.0.1 * update development version to v4.0.2.dev0 * feature: add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22 (aws#121) fix: fixed the black lint, upgraded black to version 21.3.0 fix: remove u prefix of strings, as python3 defaults to unicode strings note: EFA is only available on p3dn or p4dn instances note: EFA version 1.15.1 and OFI 1.1.5-aws have the issue fixed note: black format reference on remove u prefix https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#strings * prepare release v4.1.0 * update development version to v4.1.1.dev0 * fix: missing args when shell script is used (aws#122) * prepare release v4.1.1 * update development version to v4.1.2.dev0 * fix: fix flaky issue with incorrect rc being given (aws#124) * fix: fix flaky issue with incorrect rc being given * Add logging around proc.wait. * prepare release v4.1.2 * update development version to v4.1.3.dev0 * Feature: Adding new parameter for TF Multi Worker Mirrored Strategy (aws#130) * feature: Adding new parameter for TF Multi Worker Mirrored Strategy * fix: changing variable name for MWMS * fix: freezing protobuf version and renaming variable for MWMS * fix: linting * prepare release v4.1.3 * update development version to v4.1.4.dev0 * Use framework provided error class and stack trace as error message (aws#123) * log smddp exceptions * update exception class * clean up error msg * address comments * Add Error Categorization for SMMP * add pytorch errors for SMMP && minor fixes * feature: allow framework libraries to supply exceptions to track and report as failure reason. Added support for SMDDP and SMMP custom exceptions. Include custom exception as error class and de-duplicated stack trace as error message. 
Added tests for wacthing single, list of exceptions and also support existing internal exceptions. Co-authored-by: haohanchen-yagao <[email protected]> Co-authored-by: Joe Evans <[email protected]> * prepare release v4.1.4 * update development version to v4.1.5.dev0 * Fix none exception class issue for mpi (aws#131) * fix: Fix none exception class issue for mpi * Add unit test for SMP exception import * reformat * fix import format * rename get exception function * prepare release v4.1.5 * update development version to v4.1.6.dev0 * update: protobuf version to overlap with TF requirements (aws#134) * update: protobuf version to overlap with TF requirements * fix: upper bound * prepare release v4.1.6 * update development version to v4.1.7.dev0 * feature: Heterogeneous cluster changes (aws#135) * prepare release v4.2.0 * update development version to v4.2.1.dev0 * fix: handle utf-8 decoding exceptions while processing stdout and stderr streams * prepare release v4.2.1 * update development version to v4.2.2.dev0 * fix: specify flake8 config explicitly (aws#138) * change: update distribution_instance_group for pytorch ddp * fix: Removed version hardcoding for sagemaker test dependency (aws#141) * prepare release v4.2.2 * update development version to v4.2.3.dev0 * change: update num_processes_per_host for smdataparallel runner * prepare release v4.2.3 * update development version to v4.2.4.dev0 * Feature: Create a new distribution mechanism for PT-XLA (aws#137) * Create a new distribution mechanism for PT-XLA * Adding new unit tests targetting PT-XLA distributed training * Reformatting according to guidelines * Linting changes * Linting changes * Linting changes * Test Mock syntax fix * Test Mock syntax fix * Fixing syntax error * Fixing syntax error * Revert "Fixing syntax error" This reverts commit 48a10c5. 
* Fixing syntax error * Fixing syntax error * + new test to target the PT-XLA Distributed runner * + new test to target the PT-XLA Distributed runner * + new test to target the PT-XLA Distributed runner * + new test to target the PT-XLA Distributed runner * + new test to target the PT-XLA Distributed runner * + new test to target the PT-XLA Distributed runner * + new test to target the PT-XLA Distributed runner * Add verbose reporting for tox tests * Fixing syntax errors * Fixing syntax errors * Fixing syntax errors * Fixing syntax errors * Adding more tests targeting PT-XLA DT mechanism * edits for flake8 * edits for black * fixing test errors * fixing test errors * fixing test errors * fixing test errors * fixing test errors * fixing container build for unit testing * fixing container build for unit testing * retry tests * fixing container build for unit testing * fixing container build for unit testing * fixing container execution for unit testing * fixing container execution for unit testing * Refactoring some tests as integration tests * Refactoring some tests as integration tests * Refactoring some tests as integration tests * Refactoring some tests as integration tests * Refactoring some tests as integration tests * Removing stale files * Removing stale test container * Fix: adding EFA specific setup to distributed training runner for PT-XLA (aws#143) * fix: adding EFA specific setup to distributed training runner for PT-XLA * test: testing new env variables for PT-XLA on EFA * prepare release v4.2.4 * update development version to v4.2.5.dev0 * relax exception type (aws#140) no qa * prepare release v4.2.5 * update development version to v4.2.6.dev0 * fix: Enable PT XLA distributed training on homogeneous clusters (aws#144) * fix: adding bypass for PT XLA distributed training on homogeneous cluster * fix: linting * prepare release v4.2.6 * update development version to v4.2.7.dev0 * fix: improve worker node wait logic and update EFA flags (aws#145) * prepare 
release v4.2.7 * update development version to v4.2.8.dev0 * Fix: Args for worker nodes in smdataparallel jobs (aws#147) * fix worker args for sm dataparallel jobs * prepare release v4.2.8 * update development version to v4.2.9.dev0 * Fix merge conflicts Co-authored-by: Chuyang <[email protected]> Co-authored-by: Chuyang Deng <[email protected]> Co-authored-by: ci <ci> Co-authored-by: Pedro Martins <[email protected]> Co-authored-by: Ajay Karpur <[email protected]> Co-authored-by: sboshin <[email protected]> Co-authored-by: ChaiBapchya <[email protected]> Co-authored-by: Dan <[email protected]> Co-authored-by: icywang86rui <[email protected]> Co-authored-by: Rui Wang Napieralski <[email protected]> Co-authored-by: Eric Johnson <[email protected]> Co-authored-by: Karan Jariwala <[email protected]> Co-authored-by: Rajan Singh <[email protected]> Co-authored-by: Piyush Ghai <[email protected]> Co-authored-by: Daiming Yang <[email protected]> Co-authored-by: matherit <[email protected]> Co-authored-by: Loki <[email protected]> Co-authored-by: Lai Wei <[email protected]> Co-authored-by: haohanchen-yagao <[email protected]> Co-authored-by: Joe Evans <[email protected]> Co-authored-by: haohanchen-yagao <[email protected]> Co-authored-by: Nishanth Hegde <[email protected]> Co-authored-by: Vishwa Karia <[email protected]> Co-authored-by: Nishanth Hegde <[email protected]> Co-authored-by: Jihyeong Lee <[email protected]> Co-authored-by: Loki <[email protected]>

change: [smdataparallel] better messages for to establish the SSH connection between workers

0dcd738

karan6181 force-pushed the smddp_socket_conn branch from a842798 to 0dcd738 April 8, 2021 21:23

karan6181 changed the title ~~change: [smdataparallel] better messages for to establish the SSH connection between workers~~ change: [smdataparallel] better messages to establish the SSH connection between workers Apr 8, 2021

ChaiBapchya previously approved these changes Apr 8, 2021

rondogency previously approved these changes Apr 8, 2021

roywei previously approved these changes Apr 8, 2021

python timeout.timeout raises TimeoutError

1605c6a

karan6181 dismissed stale reviews from roywei, rondogency, and ChaiBapchya via 1605c6a April 8, 2021 21:47

Added detailed error message

886a345

roywei approved these changes Apr 9, 2021

piyushghai reviewed Apr 9, 2021

piyushghai approved these changes Apr 9, 2021

ChaiBapchya approved these changes Apr 9, 2021

rajanksin approved these changes Apr 12, 2021

rajanksin merged commit 4d0f14b into aws:master Apr 12, 2021

karan6181 deleted the smddp_socket_conn branch April 12, 2021 16:54
