Conversation

mofeing
Collaborator

@mofeing mofeing commented Feb 15, 2025

This PR...

  • Registers MPI routine symbol addresses when MPI.jl gets loaded
  • Specializes MPI.jl methods so they can be traced by Reactant (see the sketch right after this list)
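
A rough sketch of what those two pieces could look like (Libdl/dlsym and the positional MPI.jl signatures are real; register_symbol! and the body of the specialization are hypothetical placeholders, not Reactant's actual API; MPI.API.libmpi is assumed to point at the MPI.jl-loaded library):

using Libdl, MPI, Reactant

# 1. look up the address of an MPI routine inside the libmpi that MPI.jl loaded ...
libmpi = Libdl.dlopen(MPI.API.libmpi)
send_ptr = Libdl.dlsym(libmpi, :MPI_Send)
# ... and hand it to the runtime symbol table under a fixed name
register_symbol!("MPI_Send", send_ptr)  # hypothetical registration call

# 2. specialize an MPI.jl method so Reactant's tracer intercepts traced buffers
function MPI.Send(buf::Reactant.TracedRArray, dest::Integer, tag::Integer, comm::MPI.Comm)
    # emit the corresponding MPI op into the traced module instead of calling libmpi eagerly
end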

unresolved questions

  • how can we represent MPI_Request with tensor and stablehlo types?
  • mmm stablehlo.custom_call has a backend attribute that could be useful during lowering; e.g. if we want to lower to NCCL instead of MPI, since both have a similar API, we could potentially add our own custom C functions that use NCCL but adapt them to an MPI-like API
  • @wsmoses can we create @cfunctions in Julia and pass them to the symbol table? some MPI routines might need a little bit of adaptation, and writing them in Julia would be easier and faster (and would also use the correct symbols from the MPI.jl-loaded libmpi) (sketch below this list)
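
For illustration, a minimal sketch of the @cfunction mechanism (the shim name and signature are made up, not an actual MPI routine signature):

# hypothetical Julia-side shim with a C-compatible, MPI-like signature
function send_shim(buf::Ptr{Cvoid}, count::Cint, dest::Cint, tag::Cint)::Cint
    # adapt arguments here and forward to MPI.jl (or NCCL) as needed
    return Cint(0)
end

# C-callable pointer that could be registered in the symbol table next to the real libmpi symbols
const send_shim_ptr = @cfunction(send_shim, Cint, (Ptr{Cvoid}, Cint, Cint, Cint))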

tested

to do

  • MPI communicators
  • sharding
  • more MPI routines
  • custom reduction operators

cc @JBlaschke @hhkit

@wsmoses
Member

wsmoses commented Feb 15, 2025

you won't; instead you'll emit something like


function send_wrap(%arg : memref<axb>) {
    mpi.send %arg
}

function main() {
    ...
    enzymexla.jit_call @send_wrap(%x : tensor<...>)
}

And then the lower-jit pass will convert it into a custom call. However, you will need to define a lowering of mpi.send into a corresponding MPI_Send call [which will use the symbol you just registered here].

Re CUDA, though, we also need to ensure we are synced with respect to the current CUstream, which you can get via enzymexla.get_stream.

@mofeing
Collaborator Author

mofeing commented Feb 16, 2025

mmm from our last discussion on this a couple of weeks ago, I understood that we would emit this

function main() {
    ...
    mpi.send(%arg0, ...)
    ...
}

and it would get lowered to

function send_wrap(%arg : memref<axb>) {
    llvm.call <0xffff> (%arg)
}

function main() {
    ...
    enzymexla.jit_call @send_wrap(%x : tensor<...>)
    ...
}

which will finally lower to the following with the enzymexla.jit pass

function main() {
    ...
    stablehlo.custom_call @mpi_send_wrap(%x : tensor<...>)
    ...
}

is this correct or do we need to emit the enzymexla.jit_call directly from Reactant?

ahh or do you mean that any wrapping we need to do around MPI should be done in this way?

Re CUDA, though, we also need to ensure we are synced with respect to the current CUstream, which you can get via enzymexla.get_stream.

okay, this will probably be required for NCCL

@romanlee
Collaborator

@wsmoses are there any other features we want to add before merging? If not, this might be ready for review

using Libdl

# https://github.com/jax-ml/jax/blob/b0117366686ab084d38ad2657d9a2ae3a581ca7e/jax/_src/clusters/mpi4py_cluster.py
Distributed.is_env_present(::Distributed.MPIEnvDetector) = MPI.Initialized()
Member

@avik-pal can you review this?

Collaborator

This looks good to me

using Test, MPI, Reactant

# # MPI only works on cpu currently --- is this the right way/place to enforce that?
# Reactant.set_default_backend("cpu")
Member

@avik-pal re integration bits

Collaborator

For safety I would do client = Reactant.default_backend(); Reactant.set_default_backend("cpu") and at the end of the script set the client back with Reactant.set_default_backend(client).

Technically we are running on a separate process so it shouldn't matter, but if we include the file during testing (local/CI) it will make debugging harder.
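
In other words, something like this at the top and bottom of the test script (using the function names from the comment above, assumed to be Reactant's API):

client = Reactant.default_backend()       # remember whatever backend was active
Reactant.set_default_backend("cpu")       # MPI currently only works on CPU
try
    # ... run the MPI tests ...
finally
    Reactant.set_default_backend(client)  # restore, so including this file has no side effects
end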

@wsmoses
Member

wsmoses commented Sep 19, 2025

did a quick review and it looks good, though @avik-pal and @giordano should comment.

# # MPI only works on cpu currently --- is this the right way/place to enforce that?
# Reactant.set_default_backend("cpu")

MPI.Init()
Member

Do I understand correctly that this (and Finalize) can't be @compiled at the moment?

Collaborator

Yeah, that's right. Although it looks like there are overrides in Overrides.jl. Let me see if I can get these working easily; otherwise maybe we just remove them, unless it's a priority.
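
So for now the idea is that Init/Finalize stay eager and only the communication routines go through tracing, roughly like this (the choice of Bcast! and the exact calling convention are illustrative assumptions, not taken from the diff):

using MPI, Reactant

MPI.Init()                      # runs eagerly, not traced
comm = MPI.COMM_WORLD

x = Reactant.to_rarray(ones(4))
bcast! = Reactant.@compile MPI.Bcast!(x, 0, comm)  # only the MPI routine is traced
bcast!(x, 0, comm)

MPI.Finalize()                  # also eager for now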

giordano and others added 3 commits September 19, 2025 23:28
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Member

@wsmoses wsmoses left a comment


generally lgtm, but I want @avik-pal to look at the Distributed.is_env_present and related bits to double check

@romanlee
Collaborator

Sounds good. There are a couple of things I'm trying to clean up in the meantime.

tag::Integer,
comm::MPI.Comm
)
function MPI.Recv!(buf::TracedRArray, source::Integer, tag::Integer, comm::MPI.Comm)
Collaborator

@romanlee romanlee Sep 22, 2025

@giordano I don't understand the idea behind the formatting changes in 4899d6d. E.g., here, Recv! has a different format (args all on one line) from Isend above (each arg on its own line). Could you explain?

Member

That's JuliaFormatter (I only accepted the suggestions); whether to wrap or not depends on how long all the arguments would be on a single line.
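
For illustration, with made-up signatures (not the ones from the diff): a short signature stays on one line, a long one gets wrapped.

# fits within the line-length limit, so the arguments stay on one line
recv!(buf, source::Integer, tag::Integer, comm) = nothing

# would exceed the limit, so the formatter puts each argument on its own line
function isend(
    buf::AbstractArray,
    destination_rank::Integer,
    message_tag::Integer,
    communicator,
    request_handle,
)
    return nothing
end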

Collaborator

Interesting, alright, fair enough.
