Run interactions in separate kernels for TestEm3 #203
Conversation
This adds launch bounds with a safe configuration that still allows a maximum of 256 registers per thread and 1 block per SM. VecGeom currently requires 254 registers per thread, so this is the highest number of threads per block that is safe to use without forcing VecGeom to use fewer registers at compile time, e.g. by adding --maxrregcount 128 to the CUDA compile options. If --maxrregcount 128 is used, the number of blocks can be increased to 2 for 256 threads (on the RTX 2070), or a combination of 128 threads/4 blocks can be used. There is a tradeoff between more threads/block and more blocks/SM for a constant number of threads/SM: using more threads per block reduces scheduling overhead, which helps when many tracks are in flight because there are fewer blocks overall, while using more blocks per SM makes it easier to distribute work across the GPU when very few tracks remain, since a block is always scheduled on a single SM and smaller blocks can be spread over more SMs.
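For reference, a minimal sketch of such a launch-bounds annotation (kernel name and signature are placeholders, not the actual TestEm3 kernels); capping the block at 256 threads with at least 1 resident block per SM leaves up to 256 registers per thread available on GPUs with 64K registers per SM:

```cuda
// Hypothetical example: at most 256 threads per block, minimum 1 block per SM,
// so the compiler may use up to 256 registers per thread.
__global__ void __launch_bounds__(256, 1) TransportKernel(/* ... */)
{
  // physics + geometry stepping, one track per thread
}
```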
Since interactions happen relatively rarely, splitting them off allows for shorter physics+geometry kernels that run more coherently. The interaction kernels are now a two-step process that first finds all tracks that will undergo an interaction, and then runs the one specific interaction per kernel.
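As a rough, hypothetical sketch of that two-step scheme (names and signatures are illustrative, not the exact code in this PR):

```cuda
// Step 1: during transport, record per active slot which discrete process "won"
// (or a sentinel value if no discrete interaction happens this step).
__device__ char g_nextInteraction[8'000'000];

__global__ void TagInteractions(adept::MParray *active)
{
  const int activeSize = active->size();
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
    char winner = -1; // ... sample physics/geometry here and set the winner process index
    g_nextInteraction[i] = winner;
  }
}

// Step 2: one kernel instantiation per process handles only the tracks tagged for it,
// so the divergent interaction code runs coherently.
template <int ProcessIndex>
__global__ void PerformInteraction(adept::MParray *active)
{
  const int activeSize = active->size();
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
    if (g_nextInteraction[i] != ProcessIndex) continue;
    // ... perform this one interaction for the track referenced by slot i
  }
}
```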
Can one of the admins verify this patch?
__device__ char g_nextInteractionForEl[8'000'000];
__device__ char g_nextInteractionForPos[8'000'000];
As you write in the summary, this isn't very nice. What is the plan here, move the winnerProcessIndex into the Track structure?
Another approach (that I think we had in the past? not sure) is having queues for the discrete processes. That would even save us from determining which Tracks need to run in which of the (split) kernels.
examples/TestEm3/electrons.cu (Outdated)

switch (winnerProcessIndex) {
  case 0: {
    RanluxppDouble newRNG(currentTrack.rngState.BranchNoAdvance());
This must use Branch() (with advancing), or newRNG has correlations to the primary's RNG state, see #129. In turn, the calls to newRNG.Advance(); and currentTrack.rngState.Advance(); in the Transport kernel above are not needed anymore.
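A minimal sketch of the suggested change (assuming the Branch()/BranchNoAdvance() API referenced in the comment):

```cuda
// Branch() also advances the parent state, so the secondary's RNG stream no longer
// correlates with the primary's, and the explicit Advance() calls become unnecessary.
RanluxppDouble newRNG(currentTrack.rngState.Branch());
```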
counter = 0;
noopCounter = 0;

const int activeSize = active->size();
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
  const auto winnerProcess = g_nextInteractionForGamma[i];

  if (winnerProcess == ProcessIndex) {
    const auto destination = atomicInc(&counter, (unsigned int)-1);
    candidates[destination % sharedSize] = i;
  } else
    atomicInc(&noopCounter, (unsigned int)-1);
}
This is a race condition; there needs to be a __syncthreads() in between. This is at least one of the reasons I was seeing non-reproducible results on my machine (when forcing -batch 104, the previous default).
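A compact sketch of what that would look like (counter and candidates are the shared-memory variables from the excerpt above; the consumer part is a placeholder):

```cuda
// Producer phase: every thread of the block appends its candidates.
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < activeSize; i += blockDim.x * gridDim.x) {
  if (g_nextInteractionForGamma[i] == ProcessIndex) {
    const auto destination = atomicInc(&counter, (unsigned int)-1);
    candidates[destination % sharedSize] = i;
  }
}

// Barrier: all writes to counter/candidates must be visible to the whole block
// before any thread starts consuming them.
__syncthreads();

// Consumer phase: process the collected candidates.
```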
const auto winnerProcess = IsElectron ? g_nextInteractionForEl[i] : g_nextInteractionForPos[i];

if (winnerProcess == ProcessIndex) {
  const auto destination = atomicInc(&counter, (unsigned int)-1);
I find atomicAdd(&counter, 1) much easier to understand. For atomicInc, you need to know that the second argument is the "maximum" at which the atomic increment wraps around...
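For comparison, a minimal sketch of the suggested form (assuming the wrap-around behaviour of atomicInc is not actually needed here):

```cuda
// atomicAdd returns the previous value and simply increments; there is no wrap limit to reason about.
const auto destination = atomicAdd(&counter, 1u);
candidates[destination % sharedSize] = i;
```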
Requested by Jonas to reduce correlations between RNG states of branched particles.
Force-pushed from de1157a to eb6f73a.
0xff00ffff, 0xffff0000, 0xffffffff};
std::string _name;
nvtxRangeId_t _id;
std::array<unsigned long, 5> _lastOccups;
Suggested change:
- std::array<unsigned long, 5> _lastOccups;
+ std::array<unsigned long, 5> _lastOccups{};
eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
eventAttrib.message.ascii = name;
_id = nvtxRangeStartEx(&eventAttrib);
_lastOccups.fill(0);
Suggested change:
- _lastOccups.fill(0);
*_occupIt = occupancy;
if (++_occupIt == _lastOccups.end()) _occupIt = _lastOccups.begin();

const auto meanOccup = double(std::accumulate(_lastOccups.begin(), _lastOccups.end(), 0)) / _lastOccups.size();
Suggested change:
- const auto meanOccup = double(std::accumulate(_lastOccups.begin(), _lastOccups.end(), 0)) / _lastOccups.size();
+ const auto meanOccup = double(std::reduce(_lastOccups.begin(), _lastOccups.end())) / _lastOccups.size();
- constexpr size_t TracksSize = sizeof(Track) * Capacity;
- constexpr size_t ManagerSize = sizeof(SlotManager);
- const size_t QueueSize = adept::MParray::SizeOfInstance(Capacity);
+ size_t TracksSize = sizeof(Track) * Capacity;
+ size_t ManagerSize = sizeof(SlotManager);
+ size_t QueueSize = adept::MParray::SizeOfInstance(Capacity);
Why did you make these variables mutable? I do not see where you need to change them later.
double initialRange;
double dynamicRangeFactor;
double tlimitMin;
double fPEmxSec;
I am not experienced in this domain, but even then, this looks like a very cryptic variable name.
__device__ char g_nextInteractionForEl[8'000'000];
__device__ char g_nextInteractionForPos[8'000'000];
Could you quickly elaborate to me please, why we cannot add this information to the Track itself? Then we would not need this additional global state and the checking of whether you have allocated enough of it.
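For illustration, what is being asked about would presumably look roughly like this (hypothetical field name; the real Track layout is not shown in this excerpt):

```cuda
struct Track {
  RanluxppDouble rngState;
  // ... existing members ...
  char nextInteraction; // winner process index, replacing the global g_nextInteractionFor* arrays
};
```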
constexpr unsigned int sharedSize = 12250;
__shared__ int candidates[sharedSize];
Where is this number coming from? The size of candidates is 49KiB, which seems like an odd size to me.
assert(counter < sharedSize);
if (threadIdx.x == 0 && counter >= sharedSize) {
  printf("Error: Shared queue for %d exhausted to %d in %s:%d\n", ProcessIndex, counter, __FILE__, __LINE__);
  asm("trap;");
}
Could you not just wrap another loop around the entire kernel that fills a chunk of shared memory, processes it, and then starts with the next chunk?
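A rough sketch of that chunked pattern (illustrative only; counter, candidates, sharedSize, activeSize and ProcessIndex are taken from the excerpts above, and the processing body is a placeholder):

```cuda
// Each block takes whole chunks of at most sharedSize tracks, so the shared
// candidate buffer can never overflow and no trap is needed.
for (int chunkStart = blockIdx.x * sharedSize; chunkStart < activeSize; chunkStart += gridDim.x * sharedSize) {
  const int chunkEnd = min(chunkStart + (int)sharedSize, activeSize);

  if (threadIdx.x == 0) counter = 0;
  __syncthreads();

  // Fill: collect this chunk's candidates into shared memory.
  for (int i = chunkStart + threadIdx.x; i < chunkEnd; i += blockDim.x) {
    if (g_nextInteractionForGamma[i] == ProcessIndex)
      candidates[atomicAdd(&counter, 1u)] = i;
  }
  __syncthreads();

  // Process: handle the collected candidates, then move on to the next chunk.
  for (unsigned int c = threadIdx.x; c < counter; c += blockDim.x) {
    // ... perform the interaction for the track in slot candidates[c]
  }
  __syncthreads();
}
```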
The current branch has a build failure on my machine, which I fixed upstream: Could you please rebase onto a more recent master? Thank you!
I just benchmarked this branch with
In my opinion, we should have a new example with these changes; not a variant of TestEm3, but one starting from an example that can load arbitrary GDML files, so it's possible to study the performance in more realistic geometries. As said many times, TestEm3 is for physics validation, and because its geometry is trivial, it really stresses / over-represents the physics part.
For the reason given here I am still super interested in having these changes on TestEm3. But sure, they should also definitely land in more complex examples!
- constexpr int Capacity = 1024 * 1024;
+ // Use 2/7 of GPU memory for each of e+/e-/gammas, leaving 1/7 for the rest.
+ int Capacity = deviceProp.totalGlobalMem / sizeof(Track) * (2.0 / 7.0);
So I realized that the choice of Capacity has a gigantic impact on the runtime, because it directly determines the number of batches that will be iterated on later. We should definitely separate this change from the rest, so we can benchmark the impact of the kernel separation alone.
Demo of how one can split interactions from the rest of the workflow, i.e. splitting off highly divergent code.
There are a few things that should still be worked on, namely