@Johnson9009
Contributor

When an RPC server runs on an NPU board, a compiled model will occasionally hang the NPU because of bugs in the operator libraries of the NPU toolchain, so we must use session_timeout to ensure that board resources held by hung jobs are released.

Currently the RPC server handles a session timeout poorly: it just kills the server loop subprocess. The destructor of class RPCEndpoint then sends the kShutdown code to the RPC client, but the client expects to receive kReturn or kException, so users see an error message like the one reported in #15151. That error report is confusing and gives no indication of what actually happened.
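The mismatch described above can be illustrated with a minimal client-side sketch. The `Code` enum and `handle_response` function are hypothetical stand-ins written for this illustration, not TVM's actual implementation; only the names kShutdown, kReturn, and kException come from the description.

```python
from enum import Enum, auto


class Code(Enum):
    # Stand-in codes; the real protocol values are not reproduced here.
    SHUTDOWN = auto()
    RETURN = auto()
    EXCEPTION = auto()


def handle_response(code, payload):
    """Hypothetical client-side dispatch on the code sent by the server."""
    if code is Code.RETURN:
        return payload
    if code is Code.EXCEPTION:
        # The server explained what went wrong, so the user sees its message.
        raise RuntimeError(payload)
    # Any other code tells the user nothing about the real cause.
    raise RuntimeError(f"unexpected response code: {code.name}")
```

In this sketch a timeout that arrives as `Code.EXCEPTION` carries a readable message, whereas one that arrives as `Code.SHUTDOWN` only produces a generic protocol error.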

When tuning to search for a good schedule for operators, we only want to ignore the RPC session timeout error, which indicates that the generated schedule is illegal. Other errors reported by the RPC server may help us find potential bugs in our toolchain built on top of TVM, so the RPC session timeout error should be split out into a standalone TVM error class.
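A rough sketch of why a dedicated error class helps during tuning. The two exception classes below are local stand-ins that only mirror the names used in this discussion (they are not imported from TVM), and `measure_candidates` and `fake_run` are hypothetical helpers.

```python
class TVMError(RuntimeError):
    """Stand-in for the base error class; not imported from TVM."""


class RPCSessionTimeoutError(TVMError):
    """Stand-in for the dedicated timeout error."""


def measure_candidates(candidates, run):
    """Skip candidates that time out; let every other error propagate."""
    results = {}
    for name in candidates:
        try:
            results[name] = run(name)
        except RPCSessionTimeoutError:
            # Treated as an illegal schedule and ignored.
            results[name] = None
    return results


def fake_run(name):
    # Hypothetical runner: one candidate hangs, one hits an unrelated bug.
    if name == "hangs":
        raise RPCSessionTimeoutError("session timed out")
    if name == "buggy":
        raise TVMError("unrelated toolchain failure")
    return 1.0
```

Because only the narrow class is caught, the unrelated failure still surfaces to the caller instead of being silently discarded with the timeouts.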

This PR implements these requirements by sending the RPC session timeout error message to the RPC client as an RPC server exception before killing the server loop subprocess.
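The ordering described here (report first, then kill) can be sketched with the standard library alone. This is not the code from the PR: `serve_with_timeout`, the string message codes, and the `send` callback are assumptions made for illustration, and a child Python interpreter stands in for the server loop subprocess.

```python
import subprocess
import sys

# Hypothetical message codes standing in for the protocol codes named above.
K_RETURN = "kReturn"
K_EXCEPTION = "kException"


def serve_with_timeout(job_source, session_timeout, send):
    """Run job_source in a child interpreter and report the outcome via send."""
    worker = subprocess.Popen([sys.executable, "-c", job_source])
    try:
        worker.wait(timeout=session_timeout)
    except subprocess.TimeoutExpired:
        # Tell the client what happened first, then release the resource.
        send((K_EXCEPTION, f"RPCSessionTimeoutError: exceeded {session_timeout} s"))
        worker.kill()
        worker.wait()
        return
    send((K_RETURN, worker.returncode))


outbox = []
serve_with_timeout("import time; time.sleep(30)", 0.5, outbox.append)  # hangs
serve_with_timeout("pass", 30, outbox.append)  # finishes normally
```

Here `outbox` plays the role of the channel to the client: the hung job yields an exception message rather than a bare shutdown, and the normal job yields a return message.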

@tvm-bot
Collaborator

tvm-bot commented Jun 30, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • No users to tag found in teams: rpc See #10317 for details

Generated by tvm-bot

@junrushao
Member

This is a nice addition! I'm curious what its implication is to the existing auto tuning system though - for example, will it affect AutoTVM's time out mechanism? CC @zxybazh

@Johnson9009
Contributor Author

> This is a nice addition! I'm curious what its implication is to the existing auto tuning system though - for example, will it affect AutoTVM's time out mechanism? CC @zxybazh

@junrushao In my opinion, this has no effect on the existing auto-tuning systems, because they all currently catch very general error types, e.g., Exception and TVMError. RPCSessionTimeoutError is a subclass of TVMError, so it will be caught too.

for future in futures:
    try:
        res = future.result()
        results.append(res)
    except Exception as ex:  # pylint: disable=broad-except
        tb = traceback.format_exc()
        results.append(

    costs.sort()
    costs = tuple(costs[1:-1])
except TVMError as exc:
    msg = str(exc)
    if "Stack trace returned" in msg:
        msg = msg[: msg.index("Stack trace returned")]
    if "CUDA Source" in msg:
        msg = msg[: msg.index("CUDA Source")]
    costs = (traceback.format_exc(), RuntimeError(msg[:1024]))
    errno = MeasureErrorNo.RUNTIME_DEVICE
tstamp = time.time()

    remote.remove("")
    dev.free_raw_stream(stream)
# pylint: disable=broad-except
except Exception:
    dev.free_raw_stream(stream)
    costs = (MAX_FLOAT,)
    error_no = MeasureErrorNo.RUNTIME_DEVICE
    error_msg = make_traceback_info()
shutil.rmtree(os.path.dirname(build_res.filename))
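The point about broad handlers can be checked with a small self-contained sketch. The classes are local stand-ins that mirror the names in this thread, not imports from TVM, and `classify` is a hypothetical helper.

```python
class TVMError(RuntimeError):
    """Stand-in for the base error class; not imported from TVM."""


class RPCSessionTimeoutError(TVMError):
    """Stand-in for the new, more specific error."""


def classify(exc_type):
    """Report which of the two handler styles quoted above catches exc_type."""
    try:
        raise exc_type("boom")
    except TVMError:
        return "caught by except TVMError"
    except Exception:  # pylint: disable=broad-except
        return "caught by except Exception"


print(classify(RPCSessionTimeoutError))
print(classify(ValueError))
```

Since the new class sits below both `TVMError` and `Exception` in the hierarchy, handlers written against either base keep working unchanged.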

@junrushao
Member

Yep, and that's why I am curious. Thanks for the explanation!

@junrushao junrushao merged commit 683dfb0 into apache:main Jul 2, 2023
@Johnson9009 Johnson9009 deleted the rpc_timeout branch July 2, 2023 07:52
gmeeker added a commit to gmeeker/tvm that referenced this pull request Jan 6, 2024
Fix regression in (apache#15187) when multiprocessing start method is not 'fork',
which prevented tuning from working. This affects macOS and Windows.
Also in python 3.14 the default start method will be 'spawn'.
Johnson9009 pushed a commit that referenced this pull request Jan 12, 2024
* [RPC] Fix tuning on macOS and Windows (#15771)

Fix regression in (#15187) when multiprocessing start method is not 'fork',
which prevented tuning from working. This affects macOS and Windows.
Also in python 3.14 the default start method will be 'spawn'.

* [RPC] clean up _serve_loop function
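The commit messages above do not say exactly what broke under non-'fork' start methods. One common difference, shown in the sketch below, is that 'spawn' (and 'forkserver', which the Python documentation lists as the POSIX default from 3.14) must pickle the process target and its arguments, whereas 'fork' inherits them. Whether this was the cause of the regression is an assumption; `make_local_job` and `can_cross_process_boundary` are hypothetical helpers.

```python
import multiprocessing
import pickle
import time


def make_local_job():
    # A function defined inside another function cannot be pickled by reference.
    def local_job():
        return None

    return local_job


def can_cross_process_boundary(obj):
    """Return True if obj pickles, as 'spawn' needs for Process targets and args."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# 'spawn' is available on every platform, so it can be requested explicitly.
spawn_ctx = multiprocessing.get_context("spawn")
print("current default:", multiprocessing.get_start_method())
print("importable function:", can_cross_process_boundary(time.sleep))
print("nested function:", can_cross_process_boundary(make_local_job()))
```

Requesting a context with `get_context("spawn")` on a 'fork' platform is one way to exercise this code path locally before it shows up on macOS or Windows.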