Run interactions in separate kernels for TestEm3 #203
Conversation
This adds launch bounds with a safe configuration that still allows a maximum of 256 registers per thread and 1 block per SM. VecGeom currently requires 254 registers per thread, so this is the highest number of threads per block that is safe to use without forcing VecGeom to use fewer registers at compile time, e.g. by adding --maxrregcount 128 to the CUDA compile options. If --maxrregcount 128 is used, the number of blocks can be increased to 2 for 256 threads (on the RTX 2070), or a combination of 128 threads/4 blocks can be used. There is a tradeoff between more threads/block and more blocks/SM for a constant number of threads/SM: using more threads per block reduces scheduling overhead, which helps when many tracks are in flight because there are fewer blocks overall, while using more blocks per SM makes it easier to distribute work across the GPU when very few tracks remain, since a block is always scheduled on a single SM and smaller blocks can be spread over more SMs.
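For reference, a minimal sketch of such a launch-bounds annotation (kernel name and signature are placeholders, not the actual TestEm3 kernels); capping the block at 256 threads with at least 1 resident block per SM leaves up to 256 registers per thread available on GPUs with 64K registers per SM:

```cuda
// Hypothetical example: at most 256 threads per block, minimum 1 block per SM,
// so the compiler may use up to 256 registers per thread.
__global__ void __launch_bounds__(256, 1) TransportKernel(/* ... */)
{
  // physics + geometry stepping, one track per thread
}
```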
Since interactions happen relatively rarely, splitting them off allows for shorter physics+geometry kernels that run more coherently. The interaction kernels are now a two-step process that first finds all tracks that will undergo an interaction, and then runs the one specific interaction per kernel.
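As a rough, hypothetical sketch of that two-step scheme (names and signatures are illustrative, not the exact code in this PR):

```cuda
// Step 1: during transport, record per active slot which discrete process "won"
// (or a sentinel value if no discrete interaction happens this step).
__device__ char g_nextInteraction[8'000'000];

__global__ void TagInteractions(adept::MParray *active)
{
  const int activeSize = active->size();
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
    char winner = -1; // ... sample physics/geometry here and set the winner process index
    g_nextInteraction[i] = winner;
  }
}

// Step 2: one kernel instantiation per process handles only the tracks tagged for it,
// so the divergent interaction code runs coherently.
template <int ProcessIndex>
__global__ void PerformInteraction(adept::MParray *active)
{
  const int activeSize = active->size();
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
    if (g_nextInteraction[i] != ProcessIndex) continue;
    // ... perform this one interaction for the track referenced by slot i
  }
}
```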
Can one of the admins verify this patch?
__device__ char g_nextInteractionForEl[8'000'000];
__device__ char g_nextInteractionForPos[8'000'000];
As you write in the summary, this isn't very nice. What is the plan here, move the winnerProcessIndex into the Track structure?
Another approach (that I think we had in the past? not sure) is having queues for the discrete processes. That would even save us from determining which Tracks need to run in which of the (split) kernels.
examples/TestEm3/electrons.cu (Outdated)

switch (winnerProcessIndex) {
  case 0: {
    RanluxppDouble newRNG(currentTrack.rngState.BranchNoAdvance());
This must use Branch() (with advancing), or newRNG has correlations to the primary's RNG state, see #129. In turn, the calls to newRNG.Advance(); and currentTrack.rngState.Advance(); in the Transport kernel above are not needed anymore.
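A minimal sketch of the suggested change (assuming the Branch()/BranchNoAdvance() API referenced in the comment):

```cuda
// Branch() also advances the parent state, so the secondary's RNG stream no longer
// correlates with the primary's, and the explicit Advance() calls become unnecessary.
RanluxppDouble newRNG(currentTrack.rngState.Branch());
```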
counter = 0;
noopCounter = 0;

const int activeSize = active->size();
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
  const auto winnerProcess = g_nextInteractionForGamma[i];

  if (winnerProcess == ProcessIndex) {
    const auto destination = atomicInc(&counter, (unsigned int)-1);
    candidates[destination % sharedSize] = i;
  } else
    atomicInc(&noopCounter, (unsigned int)-1);
}
This is a race condition; there needs to be a __syncthreads() in between. This is at least one of the reasons I was seeing non-reproducible results on my machine (when forcing -batch 104, the previous default).
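A compact sketch of what that would look like (counter and candidates are the shared-memory variables from the excerpt above; the consumer part is a placeholder):

```cuda
// Producer phase: every thread of the block appends its candidates.
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
  if (g_nextInteractionForGamma[i] == ProcessIndex) {
    const auto destination = atomicInc(&counter, (unsigned int)-1);
    candidates[destination % sharedSize] = i;
  }
}

// Barrier: all writes to counter/candidates must be visible to the whole block
// before any thread starts consuming them.
__syncthreads();

// Consumer phase: process the collected candidates.
```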
const auto winnerProcess = IsElectron ? g_nextInteractionForEl[i] : g_nextInteractionForPos[i];

if (winnerProcess == ProcessIndex) {
  const auto destination = atomicInc(&counter, (unsigned int)-1);
I find atomicAdd(&counter, 1) much easier to understand. For atomicInc, you need to know that the second argument is the "maximum" at which the atomic increment wraps around...
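For comparison, a minimal sketch of the suggested form (assuming the wrap-around behaviour of atomicInc is not actually needed here):

```cuda
// atomicAdd returns the previous value and simply increments; there is no wrap limit to reason about.
const auto destination = atomicAdd(&counter, 1u);
candidates[destination % sharedSize] = i;
```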
Requested by Jonas to reduce correlations between RNG states of branched particles.
Force-pushed from de1157a to eb6f73a.
0xff00ffff, 0xffff0000, 0xffffffff};
std::string _name;
nvtxRangeId_t _id;
std::array<unsigned long, 5> _lastOccups;
Suggested change:
- std::array<unsigned long, 5> _lastOccups;
+ std::array<unsigned long, 5> _lastOccups{};
eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
eventAttrib.message.ascii = name;
_id = nvtxRangeStartEx(&eventAttrib);
_lastOccups.fill(0);
Suggested change:
- _lastOccups.fill(0);
*_occupIt = occupancy;
if (++_occupIt == _lastOccups.end()) _occupIt = _lastOccups.begin();

const auto meanOccup = double(std::accumulate(_lastOccups.begin(), _lastOccups.end(), 0)) / _lastOccups.size();
Suggested change:
- const auto meanOccup = double(std::accumulate(_lastOccups.begin(), _lastOccups.end(), 0)) / _lastOccups.size();
+ const auto meanOccup = double(std::reduce(_lastOccups.begin(), _lastOccups.end())) / _lastOccups.size();
- constexpr size_t TracksSize = sizeof(Track) * Capacity;
- constexpr size_t ManagerSize = sizeof(SlotManager);
- const size_t QueueSize = adept::MParray::SizeOfInstance(Capacity);
+ size_t TracksSize = sizeof(Track) * Capacity;
+ size_t ManagerSize = sizeof(SlotManager);
+ size_t QueueSize = adept::MParray::SizeOfInstance(Capacity);
Why did you make these variables mutable? I do not see where you need to change them later.
double initialRange;
double dynamicRangeFactor;
double tlimitMin;
double fPEmxSec;
I am not experienced in this domain, but even then, this looks like a very cryptic variable name.
__device__ char g_nextInteractionForEl[8'000'000];
__device__ char g_nextInteractionForPos[8'000'000];
Could you quickly elaborate to me please, why we cannot add this information to the Track itself? Then we would not need this additional global state and the checking of whether you have allocated enough of it.
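For illustration, what is being asked about would presumably look roughly like this (hypothetical field name; the real Track layout is not shown in this excerpt):

```cuda
struct Track {
  RanluxppDouble rngState;
  // ... existing members ...
  char nextInteraction; // winner process index, replacing the global g_nextInteractionFor* arrays
};
```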
constexpr unsigned int sharedSize = 12250;
__shared__ int candidates[sharedSize];
Where is this number coming from? The size of candidates is 49KiB, which seems like an odd size to me.
assert(counter < sharedSize);
if (threadIdx.x == 0 && counter >= sharedSize) {
  printf("Error: Shared queue for %d exhausted to %d in %s:%d\n", ProcessIndex, counter, __FILE__, __LINE__);
  asm("trap;");
}
Could you not just wrap another loop around the entire kernel that fills a chunk of shared memory, processes it, and then starts with the next chunk?
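A rough sketch of that chunked pattern (illustrative only; counter, candidates, sharedSize, activeSize and ProcessIndex are taken from the excerpts above, and the processing body is a placeholder):

```cuda
// Each block takes whole chunks of at most sharedSize tracks, so the shared
// candidate buffer can never overflow and no trap is needed.
for (int chunkStart = blockIdx.x * sharedSize; chunkStart < activeSize; chunkStart += gridDim.x * sharedSize) {
  const int chunkEnd = min(chunkStart + (int)sharedSize, activeSize);

  if (threadIdx.x == 0) counter = 0;
  __syncthreads();

  // Fill: collect this chunk's candidates into shared memory.
  for (int i = chunkStart + threadIdx.x; i < chunkEnd; i += blockDim.x) {
    if (g_nextInteractionForGamma[i] == ProcessIndex)
      candidates[atomicAdd(&counter, 1u)] = i;
  }
  __syncthreads();

  // Process: handle the collected candidates, then move on to the next chunk.
  for (unsigned int c = threadIdx.x; c < counter; c += blockDim.x) {
    // ... perform the interaction for the track in slot candidates[c]
  }
  __syncthreads();
}
```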
The current branch has a build failure on my machine, which I fixed upstream: Could you please rebase onto a more recent master? Thank you!
I just benchmarked this branch with
In my opinion, we should have a new example with these changes; not a variant of TestEm3, but one starting from an example that can load arbitrary GDML files, so it's possible to study the performance in more realistic geometries. As said many times, TestEm3 is for physics validation, and because its geometry is trivial, it really stresses / over-represents the physics part.
For the reason given here I am still super interested in having these changes on TestEm3. But sure, they should also definitely land in more complex examples!
- constexpr int Capacity = 1024 * 1024;
+ // Use 2/7 of GPU memory for each of e+/e-/gammas, leaving 1/7 for the rest.
+ int Capacity = deviceProp.totalGlobalMem / sizeof(Track) * (2.0 / 7.0);
So I realized that the choice of Capacity has a gigantic impact on the runtime, because it directly determines the number of batches that will be iterated on later. We should definitely separate this change from the rest, so we can benchmark the impact of the kernel separation alone.
Demo of how one can split interactions from the rest of the workflow, i.e. splitting off highly divergent code.
There are a few things that should still be worked on, namely