Conversation

@yongwww (Contributor) commented Mar 21, 2024

The changes in apache/tvm#16750 modified the constructor signature of `Storage`. This pull request updates the caller code in mlc-llm to accommodate the new signature. Without the change, the build fails with:

mlc-llm/cpp/serve/model.cc:67:96: error: no matching function for call to ‘tvm::runtime::memory::Storage::Storage(tvm::runtime::memory::Buffer)’
   67 |         memory::Storage(allocator->Alloc(device_host, {prefill_chunk_size_}, DataType::Int(32)));
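For reference, here is a minimal sketch of the kind of call-site update involved, assuming the new `Storage` constructor also takes the owning `Allocator*` alongside the allocated buffer (an assumption inferred from the error above; the exact signature is defined in apache/tvm#16750). The helper function name is hypothetical and only illustrates the call site in `cpp/serve/model.cc`:

```cpp
// Sketch (not the actual patch) of the call-site update in cpp/serve/model.cc.
// Assumption: after apache/tvm#16750, memory::Storage is constructed from the
// Buffer plus the Allocator* that produced it, so the storage can release the
// buffer on destruction.
#include <tvm/runtime/memory/memory_manager.h>

using tvm::runtime::DataType;
using tvm::runtime::memory::Allocator;
using tvm::runtime::memory::Storage;

// Hypothetical helper wrapping the allocation shown in the error message.
Storage AllocTokenStorage(Allocator* allocator, DLDevice device_host,
                          int64_t prefill_chunk_size) {
  // Before (no longer matches any Storage constructor):
  //   Storage(allocator->Alloc(device_host, {prefill_chunk_size}, DataType::Int(32)));
  // After: pass the allocator alongside the allocated buffer.
  return Storage(allocator->Alloc(device_host, {prefill_chunk_size}, DataType::Int(32)),
                 allocator);
}
```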

cc: @MasterJH5574 @vinx13 @tqchen

MasterJH5574 and others added 2 commits March 20, 2024 19:57
This PR introduces the IPC memory and customized all-reduce kernel
dispatches for tensor parallelism. We add a new compiler flag
`--allreduce-strategy`, which supports `"ring"`, `"one-shot"` and
`"two-shot"`. The flag defaults to `"ring"`, which means this PR
makes no difference if people do not manually change the all-reduce
strategy.

As of now, the IPC-memory-backed customized all-reduce kernels are
only available on CUDA.

To enable all-reduce strategies other than "ring", here are some
example compile commands:
```bash
python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=one-shot" -o model/lib.so
python -m mlc_llm compile model/mlc-chat-config.json --device cuda --opt "allreduce-strategy=two-shot" -o model/lib.so
```

Please be aware that you may also need to specify other
compiler flags, for example `--opt "cublas_gemm=1;allreduce-strategy=one-shot"`.
@yongwww closed this Mar 21, 2024

@yongwww (Author) commented Mar 21, 2024

Failed to reopen this PR; will create a new one.
