
Conversation

@hageboeck (Contributor) commented Apr 28, 2022

Demo of how one can split interactions from the rest of the workflow, i.e. splitting off highly divergent code.

There are a few things that should still be worked on, namely:

  • Split off the build system/NVTX and related things in a different PR
  • Split off / rebase the launch bounds work
  • Count how much work is actually scheduled into each kernel to compute the optimal number of blocks
  • Don't use hard-coded buffers (probably in conjunction with the track buffer), but this won't change timings

amadio and others added 8 commits April 20, 2022 10:44
This adds launch bounds with a safe configuration that still allows a maximum
of 256 registers per thread and 1 block per SM. VecGeom requires 254 registers
per thread right now, so this is the highest number of threads that is safe to
use without forcing VecGeom to use fewer registers at compile time by adding,
e.g., --maxrregcount 128 to the CUDA compile options. If --maxrregcount 128 is
used, the number of blocks can be increased to 2 for 256 threads (on the
RTX 2070), or a combination of 128 threads/4 blocks can be used.

There is a tradeoff between more threads/block and more blocks/SM for a
constant number of threads/SM. Using more threads/block reduces scheduling
overhead, which is good when many tracks are in flight, as there are fewer
blocks overall. Using more blocks/SM distributes work better across the GPU
when there are very few tracks, because a block is always scheduled on a
single SM and smaller blocks can be spread more easily across SMs.
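A minimal sketch of the configuration this commit describes; the kernel name is illustrative, not the PR's actual code. On an RTX 2070 with 64k registers per SM, 256 threads/block and 1 block/SM leave 65536 / 256 = 256 registers per thread, enough for VecGeom's 254:

// Illustrative kernel only: 256 threads per block, at most 1 block per SM.
__global__ void __launch_bounds__(256, 1) TransportKernel(int numTracks)
{
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numTracks;
       i += blockDim.x * gridDim.x) {
    // ... transport one track per thread ...
  }
}

// With --maxrregcount 128, the register budget instead allows
//   __launch_bounds__(256, 2)  or  __launch_bounds__(128, 4)  per SM.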
Since interactions happen relatively rarely, splitting them off allows for
shorter physics+geometry kernels that run more coherently. The interaction
kernels are now a two-step process: first find all tracks that will undergo
an interaction, then run one specific interaction per kernel.
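A minimal sketch of this two-step structure, with illustrative names rather than the PR's exact kernels: the first kernel records each track's winning discrete process, and a per-process kernel then handles only the matching tracks, so threads in a warp take the same branch.

__global__ void SelectInteraction(int numTracks, char *nextInteraction)
{
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numTracks;
       i += blockDim.x * gridDim.x) {
    char winner = -1; // -1: no discrete interaction in this step
    // ... physics + geometry stepping; set winner to the sampled process ...
    nextInteraction[i] = winner;
  }
}

template <int ProcessIndex>
__global__ void PerformInteraction(int numTracks, const char *nextInteraction)
{
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numTracks;
       i += blockDim.x * gridDim.x) {
    if (nextInteraction[i] != ProcessIndex) continue;
    // ... run only this one interaction and produce its secondaries ...
  }
}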
@phsft-bot commented

Can one of the admins verify this patch?

Comment on lines 31 to 33
__device__ char g_nextInteractionForEl[8'000'000];
__device__ char g_nextInteractionForPos[8'000'000];

As you write in the summary, this isn't very nice. What is the plan here: move the winnerProcessIndex into the Track structure?

Another approach (that I think we had in the past? not sure) is to have queues for the discrete processes. That would even save us from determining which Tracks need to run in which of the (split) kernels.
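A rough sketch of that queue-based alternative; names and layout are illustrative, not existing AdePT code. Each track appends its slot to the queue of the process it won, and each interaction kernel then iterates only its own queue instead of scanning all tracks:

// Illustrative device-side queue; a real implementation would also handle
// overflow and reset count between iterations.
struct ProcessQueue {
  int *slots;
  unsigned int count;

  __device__ void push(int trackSlot)
  {
    const unsigned int dst = atomicAdd(&count, 1u);
    slots[dst] = trackSlot;
  }
};

__device__ ProcessQueue g_processQueues[3]; // one queue per discrete process

// Called from the transport kernel once the winning process is known:
__device__ void enqueueForProcess(int winnerProcessIndex, int trackSlot)
{
  if (winnerProcessIndex >= 0) g_processQueues[winnerProcessIndex].push(trackSlot);
}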


switch (winnerProcessIndex) {
case 0: {
RanluxppDouble newRNG(currentTrack.rngState.BranchNoAdvance());
This must use Branch() (i.e. with advancing), or newRNG has correlations to the primary's RNG state, see #129. In turn, the calls to newRNG.Advance(); and currentTrack.rngState.Advance(); in the Transport kernel above are not needed anymore.
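Spelled out, the suggested change to the line above:

// Branch() instead of BranchNoAdvance(), so the secondary's RNG stream is
// decorrelated from the primary's (see #129); the explicit Advance() calls in
// the transport kernel then become unnecessary.
RanluxppDouble newRNG(currentTrack.rngState.Branch());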

Comment on lines +128 to +139
counter = 0;
noopCounter = 0;

const int activeSize = active->size();
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
const auto winnerProcess = g_nextInteractionForGamma[i];

if (winnerProcess == ProcessIndex) {
const auto destination = atomicInc(&counter, (unsigned int)-1);
candidates[destination % sharedSize] = i;
} else atomicInc(&noopCounter, (unsigned int)-1);
}
This is a race condition; there needs to be a __syncthreads() in between. This is at least one of the reasons I was seeing non-reproducible results on my machine (when forcing -batch 104, the previous default).
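A sketch of the fix being asked for, in terms of the snippet above: reset the shared counters, then place a barrier before the loop so no thread can run its atomicInc while another thread is still writing the zero.

if (threadIdx.x == 0) {
  counter     = 0;
  noopCounter = 0;
}
__syncthreads(); // all threads see the reset before any atomicInc below

const int activeSize = active->size();
// ... grid-stride loop with atomicInc as above ...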

const auto winnerProcess = IsElectron ? g_nextInteractionForEl[i] : g_nextInteractionForPos[i];

if (winnerProcess == ProcessIndex) {
const auto destination = atomicInc(&counter, (unsigned int)-1);
I find atomicAdd(&counter, 1) much easier to understand. For atomicInc, you need to know that the second argument is the "maximum" at which the atomic increment wraps around...
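For comparison, the equivalent form the comment suggests (it behaves the same as long as counter never reaches UINT_MAX, which it cannot here):

const auto destination = atomicAdd(&counter, 1u);
candidates[destination % sharedSize] = i;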

Requested by Jonas to reduce correlations between RNG states of
branched particles.
@hageboeck force-pushed the TestEm3-performanceNumbers branch from de1157a to eb6f73a on April 29, 2022 13:41
0xff00ffff, 0xffff0000, 0xffffffff};
std::string _name;
nvtxRangeId_t _id;
std::array<unsigned long, 5> _lastOccups;
Suggested change
std::array<unsigned long, 5> _lastOccups;
std::array<unsigned long, 5> _lastOccups{};

eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
eventAttrib.message.ascii = name;
_id = nvtxRangeStartEx(&eventAttrib);
_lastOccups.fill(0);
Suggested change
_lastOccups.fill(0);

*_occupIt = occupancy;
if (++_occupIt == _lastOccups.end()) _occupIt = _lastOccups.begin();

const auto meanOccup = double(std::accumulate(_lastOccups.begin(), _lastOccups.end(), 0)) / _lastOccups.size();
Suggested change
const auto meanOccup = double(std::accumulate(_lastOccups.begin(), _lastOccups.end(), 0)) / _lastOccups.size();
const auto meanOccup = double(std::reduce(_lastOccups.begin(), _lastOccups.end())) / _lastOccups.size();

Comment on lines -218 to +229
constexpr size_t TracksSize = sizeof(Track) * Capacity;
constexpr size_t ManagerSize = sizeof(SlotManager);
const size_t QueueSize = adept::MParray::SizeOfInstance(Capacity);
size_t TracksSize = sizeof(Track) * Capacity;
size_t ManagerSize = sizeof(SlotManager);
size_t QueueSize = adept::MParray::SizeOfInstance(Capacity);
Why did you make these variables mutable? I do not see where you need to change them later.

double initialRange;
double dynamicRangeFactor;
double tlimitMin;
double fPEmxSec;
I am not experienced in this domain, but even then, this looks like a very cryptic variable name.

Comment on lines +31 to +32
__device__ char g_nextInteractionForEl[8'000'000];
__device__ char g_nextInteractionForPos[8'000'000];
Could you please quickly elaborate why we cannot add this information to the Track itself? Then we would not need this additional global state and the checking of whether we have allocated enough of it.
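What the comment proposes, sketched with an assumed field name (not the PR's code): store the chosen process in the Track itself, so no fixed-size global arrays are needed.

// Illustrative only -- the field name and type are assumptions.
struct Track {
  // ... existing members (rngState, energy, position, ...) ...
  char nextInteraction{-1}; // index of the discrete process to run, -1 for none
};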

Comment on lines +297 to +298
constexpr unsigned int sharedSize = 12250;
__shared__ int candidates[sharedSize];
Where is this number coming from? The size of candidates is 49 kB (12250 × 4 bytes), which seems like an odd size to me.

Comment on lines +321 to +325
assert(counter < sharedSize);
if (threadIdx.x == 0 && counter >= sharedSize) {
printf("Error: Shared queue for %d exhausted to %d in %s:%d\n", ProcessIndex, counter, __FILE__, __LINE__);
asm("trap;");
}
Could you not just wrap another loop around the entire kernel that fills a chunk of shared memory, processes it, and then starts with the next chunk?
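A rough sketch of the chunked variant the comment suggests, reusing the names from the snippet above; each iteration scans at most sharedSize tracks, so the shared-memory queue cannot overflow and the trap becomes unnecessary:

for (int chunkStart = 0; chunkStart < activeSize; chunkStart += sharedSize) {
  if (threadIdx.x == 0) counter = 0;
  __syncthreads();

  // 1) fill candidates[] from tracks in [chunkStart, min(chunkStart + sharedSize, activeSize))
  //    -- at most sharedSize candidates, so no overflow check is needed
  __syncthreads();

  // 2) process the candidates collected for this chunk
  __syncthreads();
}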

@bernhardmgruber (Contributor) commented

The current branch has a build failure on my machine, which I fixed upstream:

/home/bgruber/dev/AdePT/magneticfield/inc/fieldPropagatorConstBz.h(139): error: expression must be a modifiable lvalue

Could you please rebase onto a more recent master? Thank you!

@bernhardmgruber (Contributor) commented

I just benchmarked this branch with --batch 1000 on my RTX 2060 and the results are awesome:

master:
	Run time: 2.7413
	Run time: 2.65707
	Run time: 2.65526
	Run time: 2.65597
	Run time: 2.65526
	Mean: 2.67297
	Uncertainty: 0.0341704

separate_interactions:
	Run time: 1.84203
	Run time: 1.77294
	Run time: 1.77532
	Run time: 1.77132
	Run time: 1.76796
	Mean: 1.78591
	Uncertainty: 0.0281599

@hahnjo (Contributor) commented Jun 13, 2022

In my opinion, we should have a new example with these changes; not a variant of TestEm3, but one starting from an example that can load arbitrary GDML files, so it's possible to study the performance in more realistic geometries. As said many times, TestEm3 is meant for physics validation, and because its geometry is trivial, it really stresses / over-represents the physics part.

@bernhardmgruber (Contributor) commented

For the reason given here, I am still super interested in having these changes on TestEm3. But sure, they should also definitely land in more complex examples!

Comment on lines -197 to +203
constexpr int Capacity = 1024 * 1024;
// Use 2/7 of GPU memory for each of e+/e-/gammas, leaving 1/7 for the rest.
int Capacity = deviceProp.totalGlobalMem / sizeof(Track) * (2.0 / 7.0);
So I realized that the choice of Capacity has a gigantic impact on the runtime, because it directly determines the number of batches that will be iterated on later. We should definitely separate this change from the rest, so we can benchmark the impact of the kernel separation alone.
