Timers for Benchmarking (related to issue #1471 "Basic Instrumentation") #1778
Conversation
…imers Use cmake option with-detailed-timers
…on-connect Additional timer stuff from suku248
@hakonsbm could you please check that all high-level connect functions contain timers?
hakonsbm
left a comment
Looks pretty good, thanks for implementing this! There are some TODOs that should be fixed or removed, see my comments below.
@suku248 There are essentially only three connection functions in the C++ API, where all high-level connect functions converge, and these all contain timers. I have also checked that timing works with the high-level functions. However, it would be nice to have a proper unit test of the instrumentation.
```cpp
// Needs to be called /after/ set_end_and_invalid_markers_.
set_complete_marker_spike_data_( assigned_ranks, send_buffer_position, send_buffer );
#pragma omp barrier
// TODO: Discuss, if the barrier above should be better placed after if-clause
```
A decision should be made here and the todo should be removed.
All threads should come to the same result for gather_completed_checker_.all_true().
Hence, the barrier can stay inside the if-clause. The comment has now been removed.
```cpp
// TODO_SDR: Following function call commented out because it causes real runs to hang!
// kernel().mpi_manager.synchronize(); // to get an accurate time measurement
//                                     // across ranks
```
Can this be removed?
I agree, we shouldn't introduce an MPI synchronization point. This would add significant overhead, which is currently also not accounted for in any timer.
Agreed. Now removed synch point and comment.
```cpp
// TODO_SDR: Following function call commented out because it causes real runs to hang!
// kernel().mpi_manager.synchronize(); // to get an accurate time measurement
//                                     // across ranks
```
Can this be removed?
I agree, we shouldn't introduce an MPI synchronization point. This would add significant overhead, which is currently also not accounted for in any timer.
Agreed. Now removed synch point and comment.
Thank you very much for implementing these! The timers and their placement are comparable with what we have used so far. I am wondering about some of the new omp barriers, though. I think their purpose is to make sure every thread takes the same amount of time to finish a timed interval. But won't a bunch of new barriers considerably slow down heavily multithreaded (100+ threads) measurements and consequently distort them? Would it make sense either not to introduce new barriers at all, thus assuming the work is balanced evenly across all threads, or to let every thread have its own stopwatch and take either the mean or the max across one MPI process as the measured time?
```cpp
}

#ifdef TIMER_DETAILED
#pragma omp barrier
```
I guess this barrier is placed here in order to guarantee that all threads take the same amount of time, so that tid = 0 represents all other threads. But I think every new barrier introduces significant synchronization overhead and distorts the overall measurement. This would be especially severe if we used these timers on architectures with high core counts that allow a huge number of threads.
Agreed. Now removed barrier.
```cpp
kernel().connection_manager.clean_source_table( tid );

#ifdef TIMER_DETAILED
#pragma omp barrier
```
Same as above.
Agreed. Now removed barrier.
```cpp
kernel().event_delivery_manager.gather_target_data( tid );

#ifdef TIMER_DETAILED
#pragma omp barrier
```
Same as above.
Yes, it's not frequently called. Keep this barrier.
Remove barriers and synch points as suggested by reviewers
@wschenk please check the removed barriers and MPI synch points and ping the reviewers (@hakonsbm @jarsi @terhorstd) if you are happy with the changes.
@hakonsbm Thanks for checking the connect functions! I don't think it's possible to create a unit test for this, but including the detailed-timers case in CI might make sense.
```cpp
kernel().connection_manager.clean_source_table( tid );

#ifdef TIMER_DETAILED
#pragma omp barrier
```
Meanwhile, I think it is more logical at this point to put `sw_communicate_target_data.start()/stop()` into the `omp single` region.
Otherwise I am happy with the changes, thank you! (@hakonsbm @jarsi @terhorstd @suku248)
@hakonsbm, @jarsi, @terhorstd, @suku248: From my perspective, the pull request is ready to go. If you have no further objections, please pull the code into master.
hakonsbm
left a comment
Looks good to me now.
We agreed in the last Hackathon that with the fixes to …

@jarsi @jasperalbers Could you review this soon?

@heplesser I will be on it soon, hopefully at the end of the week, or at the latest by the start of next week.
jasperalbers
left a comment
I did a test with the multi-area model which confirms that the timing data is reasonably similar to the old timers implemented by @jarsi, so I don't see any necessary changes in the implementation.
I would suggest adding information about
- how to set the correct compilation flag
- how to access the timers
- the precise definition of the measured times
somewhere in the documentation (sorry if this already exists, I wasn't able to find it).
Thanks @jasperalbers! There is a separate issue for the documentation, #1977.
UPDATE: Previous TODOs are all done!
This PR is a suggestion how to meet the requirements stated in issue #1471 ("Basic Instrumentation") regarding timers for profiling and benchmarking.
In the current version, the "basic timers" are always enabled (without any switch; this design decision is the result of a discussion between suku248, jarsi and me). The "detailed timers" can be enabled by the cmake option `-Dwith-detailed-timers=ON`. To give an example of a cmake command with detailed timers enabled:

```
cmake ../.. -DCMAKE_INSTALL_PREFIX:PATH=~/opt/NEST/zam884_nest3-timers -Dwith-mpi=ON -Dwith-detailed-timers=ON
```

In contrast to the requirements in issue #1471, even more detailed timers are provided. The reason is that these timers are required for the dry-run development for 5G to compare real and dry runs of NEST simulations. But in the end, we think that these additional timers are useful for many additional benchmarking purposes. The complete list is:

- `time_construction_create` (basic timer, always enabled)
- `time_construction_connect` (basic timer, always enabled)
- `time_communicate_prepare` (basic timer, always enabled)
- `time_communicate_prepare:time_gather_target_data`
- `time_gather_target_data:time_communicate_target_data`
- `time_simulate` (basic timer, always enabled)
- `time_simulate:time_update`
- `time_gather_spike_data`
- `time_gather_spike_data:time_collocate_spike_data`
- `time_communicate_spike_data`
- `time_deliver_spike_data`