Skip to content

Conversation

@wschenck
Copy link
Contributor

@wschenck wschenck commented Sep 24, 2020

UPDATE: Previous TODOs are all done!

This PR is a suggestion how to meet the requirements stated in issue #1471 ("Basic Instrumentation") regarding timers for profiling and benchmarking.

In the current version, the "basic timers" are always enabled (without any switch; this design decision is the result of a discussion between suku248, jarsi and me). The "detailed timers" can be enabled by the cmake option "-Dwith-detailed-timers=ON". To give an example of a cmake command with detailed timers enabled:
cmake ../.. -DCMAKE_INSTALL_PREFIX:PATH=~/opt/NEST/zam884_nest3-timers -Dwith-mpi=ON -Dwith-detailed-timers=ON

In contrast to the requirements in issue #1471, even more detailed timers are provided. The reason is that these timers are required for the dry-run development for 5G to compare real and dry runs of NEST simulations. But in the end, we think that these additional timers are useful for many additional benchmarking purposes. The complete list is:

  • time_construction_create (basic timer, always enabled)
  • time_construction_connect (basic timer, always enabled)
  • time_communicate_prepare (basic timer, always enabled)
    • nested in time_communicate_prepare:
      • time_gather_target_data
        • nested in time_gather_target_data:
          • time_communicate_target_data
  • time_simulate (basic timer, always enabled)
    • nested in time_simulate:
      • time_update
      • time_gather_spike_data
        • nested in time_gather_spike_data:
          • time_collocate_spike_data
          • time_communicate_spike_data
          • time_deliver_spike_data

@stinebuu stinebuu added I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: Normal Handle this with default priority T: Enhancement New functionality, model or documentation labels Sep 30, 2020
@suku248 suku248 marked this pull request as draft October 14, 2020 13:13
@suku248
Copy link
Contributor

suku248 commented Oct 14, 2020

@hakonsbm could you please check that all high-level connect functions contain timers?
@jarsi could you please check that time measurements for this PR are comparable to previous measurements?
@terhorstd do you think the issue #1471 is fully addressed by this PR?

@suku248 suku248 marked this pull request as ready for review October 14, 2020 17:08
Copy link
Contributor

@hakonsbm hakonsbm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, thanks for implementing this! There are some TODOs that should be fixed or removed, see my comments below.

@suku248 There are essentially only three connection functions in the C++ API, where all high-level connect functions converge, and these all contain timers. I have also checked that timing works with the high-level functions. However, It would be nice to have a proper unittest of the instrumentation.

// Needs to be called /after/ set_end_and_invalid_markers_.
set_complete_marker_spike_data_( assigned_ranks, send_buffer_position, send_buffer );
#pragma omp barrier
// TODO: Discuss, if the barrier above should be better placed after if-clause
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A decision should be made here and the todo should be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All threads should come to the same result for gather_completed_checker_.all_true().
Hence, the barrier can stay inside the if-clause. Now removed comment.

Comment on lines 395 to 397
// TODO_SDR: Following function call commented out because it causes real runs to hang!
// kernel().mpi_manager.synchronize(); // to get an accurate time measurement
// across ranks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we shouldn't introduce a MPI synchronization point. This would yield significant overhead, which is also not accounted for in any timer at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Now removed synch point and comment.

Comment on lines 685 to 687
// TODO_SDR: Following function call commented out because it causes real runs to hang!
// kernel().mpi_manager.synchronize(); // to get an accurate time measurement
// // across ranks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we shouldn't introduce a MPI synchronization point. This would yield significant overhead, which is also not accounted for in any timer at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Now removed synch point and comment.

@jarsi
Copy link
Contributor

jarsi commented Oct 20, 2020

Thank you very much for implementing these!

The timers and their placement are comparable with what we have used so far.

I am wondering about some new omp barrriers. I think their purpose is to make sure every thread takes the same amount of time finishing some timed interval. But won't a bunch of new barriers considerably slow down heavily multithreaded (100 + threads) measurements and consequently distort the measurement?

Would it make sense to either not introduce new barriers at all, thus assuming the work is balanced evenly across all threads, or to let every thread have its own stopwatch and taking either the mean or the max across one mpi process as measured time.

}

#ifdef TIMER_DETAILED
#pragma omp barrier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this barrier is placed here in order to guarantee that all threads take the same amount of time and tid = 0 represents all other threads. But I think every new barrier would introduce significant synchronization overhead and distort the overall measurement. This would be especially severe if we used these timers on architectures with high core counts allowing a huge number of threads.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Now removed barrier.

kernel().connection_manager.clean_source_table( tid );

#ifdef TIMER_DETAILED
#pragma omp barrier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Now removed barrier.

kernel().event_delivery_manager.gather_target_data( tid );

#ifdef TIMER_DETAILED
#pragma omp barrier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not frequently called. Keep this barrier.

@suku248
Copy link
Contributor

suku248 commented Nov 16, 2020

@wschenk please check the removed barriers and MPI synch points and ping the reviewers ( @hakonsbm @jarsi @terhorstd ) if you are happy with the changes.

@suku248
Copy link
Contributor

suku248 commented Nov 16, 2020

@suku248 There are essentially only three connection functions in the C++ API, where all high-level connect functions converge, and these all contain timers. I have also checked that timing works with the high-level functions. However, It would be nice to have a proper unittest of the instrumentation.

@hakonsbm Thanks for checking the connect functions! I don't think it's possible to create a unit test for this, but including the detailed-timers case in CI might make sense.

kernel().connection_manager.clean_source_table( tid );

#ifdef TIMER_DETAILED
#pragma omp barrier
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think meanwhile that it is more logical at this point to put sw_communicate_target_data.start()/stop() into the "omp single" region.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise I am happy with the changes, thank you! (@hakonsbm @jarsi @terhorstd @suku248)

@wschenck
Copy link
Contributor Author

wschenck commented Feb 3, 2021

@hakonsbm, @jarsi, @terhorstd, @suku248 : From my perspective, the pull request is ready to go. If you have no further objections, please pull the code into master.

Copy link
Contributor

@hakonsbm hakonsbm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now.

@heplesser heplesser requested review from jasperalbers and removed request for terhorstd March 9, 2021 10:44
@heplesser
Copy link
Contributor

@suku248 @jarsi Are you happy with this PR or do you see the need for further changes?

@suku248
Copy link
Contributor

suku248 commented Mar 9, 2021

We agreed in the last Hackathon that with the fixes to time_construction_connect by @wschenck the PR is now ready for the final steps of the review.

@heplesser
Copy link
Contributor

@jarsi @jasperalbers Could you review this soon?

@jasperalbers
Copy link
Contributor

@heplesser I will be on it soon, hopefully at the end of the week, or latest by the start of next week.

Copy link
Contributor

@jasperalbers jasperalbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test with the multi-area model which confirms that the timing data is reasonably similar to the old timers implemented by @jarsi, thus I don't see any necessary changes in the implementation.
I would suggest to add the info about

  • how to set the correct compilation flag
  • how to access the timers
  • the precise definition of the measured times

somewhere in the documentation (sorry if this already exists, I wasn't able to find it).

@suku248
Copy link
Contributor

suku248 commented Mar 26, 2021

Thanks @jasperalbers ! There is a separate issue for the documentation #1977.
If you are happy with this PR, could you please approve?

@suku248 suku248 merged commit caafd0e into nest:master Mar 26, 2021
@suku248 suku248 linked an issue Apr 12, 2021 that may be closed by this pull request
@jarsi jarsi mentioned this pull request Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: Normal Handle this with default priority T: Enhancement New functionality, model or documentation

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Basic Instrumentation

7 participants