Support MPI #752
Conversation
You won't; instead you'll emit something like an mpi.send op, and then the lower-jit pass will convert it into a custom call. However, you will need to define a lowering of mpi.send into a corresponding MPI_Send call [which will use the symbol you just registered here]. Re CUDA, though: we also need to ensure we are synced with respect to the current CUDA stream, which you can get via enzymexla.get_stream.
Mmm, from our last discussion on this a couple of weeks ago, I understood that we would emit this:

```
function main() {
  ...
  mpi.send(%arg0, ...)
  ...
}
```

and it would get lowered to:

```
function send_wrap(%arg : memref<axb>) {
  llvm.call <0xffff> (%arg)
}

function main() {
  ...
  enzymexla.jit_call @send_wrap(%x : tensor<...>)
  ...
}
```

which will finally lower to the following with the enzymexla.jit pass:

```
function main() {
  ...
  stablehlo.custom_call @mpi_send_wrap(%x : tensor<...>)
  ...
}
```

Is this correct, or do we need to emit the... ahh, or do you mean that any wrapping we need to do around MPI should be done in this way?
Okay, this will probably be required for NCCL.
@wsmoses are there any other features we want to add before merging? If not, this might be ready for review.
```julia
using Libdl

# https://github.com/jax-ml/jax/blob/b0117366686ab084d38ad2657d9a2ae3a581ca7e/jax/_src/clusters/mpi4py_cluster.py
Distributed.is_env_present(::Distributed.MPIEnvDetector) = MPI.Initialized()
```
@avik-pal can you review this?
This looks good to me
```julia
using Test, MPI, Reactant

# # MPI only works on cpu currently --- is this the right way/place to enforce that?
# Reactant.set_default_backend("cpu")
```
@avik-pal re integration bits
For safety I would do `client = Reactant.default_backend(); Reactant.set_default_backend("cpu")` and at the end of the script set the client back with `Reactant.set_default_backend(client)`. Technically we are running on a separate process, so it shouldn't matter, but if we ever `include` the file during testing (local/CI), it will make debugging harder.
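A minimal sketch of that save/restore pattern (the `try`/`finally` wrapper is an embellishment to make the restore robust if a test throws):

```julia
using Reactant

# Remember the current client so we can restore it afterwards.
client = Reactant.default_backend()
Reactant.set_default_backend("cpu")

try
    # ... run the MPI tests on the CPU backend ...
finally
    # Restore the original client even if a test errors.
    Reactant.set_default_backend(client)
end
```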
```julia
# # MPI only works on cpu currently --- is this the right way/place to enforce that?
# Reactant.set_default_backend("cpu")

MPI.Init()
```
Do I understand correctly that this (and `Finalize`) can't be `@compile`d at the moment?
Yeah, that's right. Although it looks like there are overrides in Overrides.jl. Let me see if I can get these working easily; otherwise maybe we just remove them, unless it's a priority.
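For context, a hedged sketch of how a test might drive this today, with `MPI.Init()`/`MPI.Finalize()` called eagerly on the host rather than inside the `@compile`d function (the `halo_exchange!` name and the exact `Send`/`Recv!` overload behavior are assumptions, not the PR's confirmed API):

```julia
using MPI, Reactant

MPI.Init()  # eager host call, not traced or compiled

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)  # plain host Int; branches on it resolve at trace time

# Hypothetical traced function: only the communication itself is compiled.
function halo_exchange!(buf)
    if rank == 0
        MPI.Send(buf, 1, 0, comm)   # dest=1, tag=0
    else
        MPI.Recv!(buf, 0, 0, comm)  # source=0, tag=0
    end
    return buf
end

buf = Reactant.to_rarray(ones(4))
compiled = @compile halo_exchange!(buf)
compiled(buf)

MPI.Finalize()  # also eager
```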
Generally LGTM, but I want @avik-pal to look at the `Distributed.is_env_present` bits and related code to double-check.
Sounds good. There are a couple of things I'm trying to clean up in the meantime.
```diff
-    tag::Integer,
-    comm::MPI.Comm
-)
+function MPI.Recv!(buf::TracedRArray, source::Integer, tag::Integer, comm::MPI.Comm)
```
That's JuliaFormatter (I only accepted the suggestions); whether to wrap or not depends on the length of all the arguments on a single line.
Interesting, alright, fair enough.
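For illustration, a hedged sketch of that wrapping rule; the second signature is hypothetical, and the threshold is the formatter's `margin` setting (92 characters by default):

```julia
# Fits within the margin, so JuliaFormatter keeps it on one line:
function MPI.Recv!(buf::TracedRArray, source::Integer, tag::Integer, comm::MPI.Comm)
    # ...
end

# A longer (hypothetical) signature exceeds the margin, so each
# argument is wrapped onto its own line:
function MPI.Recv!(
    buf::TracedRArray,
    source::Integer,
    tag::Integer,
    comm::MPI.Comm,
    status::Union{Nothing,MPI.Status},
)
    # ...
end
```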
This PR...

Unresolved questions:
- How can we represent `MPI_Request` with `tensor` and `stablehlo` types?
- Mmm, `stablehlo.custom_call` has a `backend` attribute that could be useful during lowering; e.g. if we want to lower to NCCL instead of MPI, since both have a similar API, we could potentially add our own custom C functions that use NCCL but adapt them to an MPI-like API.
- @wsmoses can we create `@cfunction`s in Julia and pass them to the symbol table (see the sketch after this list)? Some MPI routines might need a lil bit of adaptation, and writing them in Julia would be easier and faster (and would also use the correct symbols from the MPI.jl-loaded libmpi).

Tested:
- `side_effect=true` on `enzymexla.jit_call`

To do:

cc @JBlaschke @hhkit
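A minimal sketch of the `@cfunction` idea from the list above; `register_symbol!` is a hypothetical stand-in for whatever registration hook the JIT symbol table actually exposes:

```julia
using MPI

# Hypothetical stand-in for the real symbol-table registration hook.
register_symbol!(name::Symbol, ptr::Ptr{Cvoid}) = nothing

# Julia-side wrapper with a C-compatible signature. Its body would
# adapt the arguments and forward to MPI.jl's low-level bindings, so
# it uses the same libmpi that MPI.jl loaded.
function send_wrap(buf::Ptr{Cvoid}, count::Cint, dest::Cint, tag::Cint)::Cint
    # ... adapt arguments and call the underlying MPI_Send here ...
    return Cint(0)
end

# @cfunction yields a C function pointer that can be registered under
# a known name for custom calls to resolve.
send_ptr = @cfunction(send_wrap, Cint, (Ptr{Cvoid}, Cint, Cint, Cint))
register_symbol!(:mpi_send_wrap, send_ptr)
```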