Once #379 is merged, there are still improvements that can be made to the use of streams to improve performance, both within a single simulation and when used as part of an ensemble.
Some but not all of the potential improvements / outstanding todos
Use non-default streams in the various CUDAScatter methods, which are currently just passed the default stream (0).
Better ways of passing streams around, where the stream belongs to a simulation (or an ensemble?).
Memory Pinning
Async memcpys block unless the host memory is pinned.
Cannot pin everything, as pinning too much memory can cause systems to lock up (by preventing the OS from paging anything)
_async variants of some methods (e.g. some CUDAScatter methods)
This allows them to be used without synchronisation when streams are passed. Fewer syncs are better (where possible), but this should be opt-in (and clear).
The non-async methods can just call the _async version and add a stream sync, so there is minimal overhead in maintaining this.
Some of these return values that are copied back, so they require the sync. In that case, switching to a batch operation to process N reductions concurrently may be required.
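The "blocking method calls the _async version plus a sync" pattern above can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: `FakeStream`, `scatter_async` and `scatter` are hypothetical names, not the real CUDAScatter API, and `FakeStream` merely stands in for a `cudaStream_t` by deferring queued work until `synchronize()` is called.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA stream: work is queued and only executed
// when synchronize() is called. Real code would use cudaStream_t and
// cudaStreamSynchronize() instead.
struct FakeStream {
    std::queue<std::function<void()>> pending;
    void enqueue(std::function<void()> work) { pending.push(std::move(work)); }
    void synchronize() {
        while (!pending.empty()) {
            pending.front()();
            pending.pop();
        }
    }
};

// Hypothetical async variant: only enqueues the scatter (out[positions[i]] = in[i]).
// The caller must keep the buffers alive and sync the stream before reading `out`.
void scatter_async(FakeStream& stream, const std::vector<int>& in,
                   const std::vector<std::size_t>& positions, std::vector<int>& out) {
    stream.enqueue([&in, &positions, &out]() {
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[positions[i]] = in[i];
        }
    });
}

// Blocking variant: calls the async version, then adds the stream sync.
void scatter(FakeStream& stream, const std::vector<int>& in,
             const std::vector<std::size_t>& positions, std::vector<int>& out) {
    scatter_async(stream, in, positions, out);
    stream.synchronize();
}
```

With this shape, callers that already hold a stream can opt in to `scatter_async` and sync once after several operations, while existing callers of `scatter` keep their blocking behaviour.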
Expanded Testing
Test(s) for each communication strategy
Make the tests check for more than just performance
More RTC test coverage
Performance test(s) within an ensemble
Attempt to test the concurrency of pre/post processing (e.g. scatter), although this may be difficult to time accurately
More refactoring of stepLayer - it's still a huge method.
Possibly use methods in an unnamed namespace to prevent them being called by users.
Per layer timing
Additional syncing/events might have a negative impact on performance, and memory requirements are potentially high (one element per layer per step, per simulation in an ensemble). Timings may also be inaccurate on WDDM devices?
Timing within Ensembles (Logging)
Timing of individual parts of individual simulations is less important when part of an ensemble, but might still be useful.
It should be made accessible through logging (or as part of the ensemble object?)
Use a dynamic range of per-stream elements, rather than a hard cap at 128. The cap of 128 was naively used as it is the limit on the number of concurrent streams which can execute, but models could have more than 128 individual kernels launched within a layer; they would just be serialised.
Use non-default streams in more places, in addition to the CUDAScatter methods noted above:
RandomManager::resizeDeviceArray / RandomManager::resize
CUDAScanCompaction::zero
mapNewRuntimeVariables
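One way the hard cap of 128 per-stream elements could be replaced by a dynamic range is a pool that grows on demand. The sketch below is an assumption-laden illustration, not the project's implementation: `StreamPool`, `ensure` and `at` are hypothetical names, and `StreamT` stands in for `cudaStream_t`, with the create/destroy callbacks expected to wrap `cudaStreamCreate` and `cudaStreamDestroy` in a CUDA build.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch: StreamT stands in for cudaStream_t. The callbacks would
// wrap cudaStreamCreate / cudaStreamDestroy in a real CUDA build.
template <typename StreamT>
class StreamPool {
 public:
    StreamPool(std::function<StreamT()> create, std::function<void(StreamT)> destroy)
        : create_(std::move(create)), destroy_(std::move(destroy)) {}

    // Destroy every stream the pool created.
    ~StreamPool() {
        for (StreamT s : streams_) {
            destroy_(s);
        }
    }

    StreamPool(const StreamPool&) = delete;
    StreamPool& operator=(const StreamPool&) = delete;

    // Grow to at least n streams, creating only the missing ones.
    // The pool never shrinks, so repeated layers reuse existing streams.
    void ensure(std::size_t n) {
        while (streams_.size() < n) {
            streams_.push_back(create_());
        }
    }

    // Stream for the i-th kernel launched within a layer.
    // Throws std::out_of_range unless ensure(i + 1) was called first.
    StreamT at(std::size_t i) const { return streams_.at(i); }

    std::size_t size() const { return streams_.size(); }

 private:
    std::function<StreamT()> create_;
    std::function<void(StreamT)> destroy_;
    std::vector<StreamT> streams_;
};
```

A caller would invoke `ensure()` with the number of kernels in the widest layer before stepping. Any per-stream device buffers would need to be resized alongside the pool, which this sketch does not attempt.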