fd64b96 to b8b942e
src/storage/storage.cc
Outdated
  LOG(INFO) << "Using GPUPooledRoundedStorageManager.";
} else {
  if (strategy != "Naive") {
    LOG(INFO) << "Unknown memory pool strategy specified: " << strategy << ".";
Should this be LOG(FATAL)?
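To illustrate the reviewer's point, here is a minimal sketch of strategy parsing that rejects unknown values instead of only logging them. The `PoolStrategy` enum and `ParsePoolStrategy` function are hypothetical names, and `std::invalid_argument` stands in for `LOG(FATAL)` so the sketch does not depend on dmlc logging.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical enum for illustration; the PR itself picks a storage manager class.
enum class PoolStrategy { kNaive, kRound };

// Maps the strategy string to an enum value and fails loudly on anything else.
// Assumption: throwing std::invalid_argument plays the role of LOG(FATAL) here.
inline PoolStrategy ParsePoolStrategy(const std::string& strategy) {
  if (strategy == "Round") return PoolStrategy::kRound;
  if (strategy == "Naive") return PoolStrategy::kNaive;
  throw std::invalid_argument("Unknown memory pool strategy specified: " +
                              strategy + ".");
}
```

Failing hard makes a misspelled strategy name visible immediately, rather than letting it silently fall back to the naive pool.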
bcba6e2 to de2a823
Still no clue what's going wrong with this PR. Nothing is specific to Windows; weirdly, python2-GPU-win is good.
@@ -71,7 +78,7 @@ class GPUPooledStorageManager final : public StorageManager {
 private:
  void DirectFreeNoLock(Storage::Handle handle) {
    cudaError_t err = cudaFree(handle.dptr);
    size_t size = handle.size + NDEV;
Are you sure + NDEV is not needed any more? What if NDEV=32, min_chunk=33 and handle.size=30? The original code would allocate 62; the new code would allocate 33.
Yes, that is correct.
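The arithmetic in this exchange can be checked with a small sketch. `OldAllocSize` and `NewAllocSize` are hypothetical helper names, and the `std::max` form of the new sizing rule is an assumption inferred from the reviewer's numbers, not taken from the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Old behaviour as quoted in the diff: pad every request by NDEV bytes.
inline std::size_t OldAllocSize(std::size_t requested, std::size_t ndev) {
  return requested + ndev;
}

// Assumed new behaviour: no padding, but never less than min_chunk bytes.
inline std::size_t NewAllocSize(std::size_t requested, std::size_t min_chunk) {
  assert(min_chunk > 0);
  return std::max(requested, min_chunk);
}
```

With NDEV=32, min_chunk=33 and a request of 30 bytes, `OldAllocSize(30, 32)` gives 62 and `NewAllocSize(30, 33)` gives 33, matching the figures in the comment.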
src/storage/pooled_storage_manager.h
Outdated
@@ -52,6 +54,11 @@ class GPUPooledStorageManager final : public StorageManager {
   */
  GPUPooledStorageManager() {
    reserve_ = dmlc::GetEnv("MXNET_GPU_MEM_POOL_RESERVE", 5);
    min_chunk_ = dmlc::GetEnv("MXNET_GPU_MEM_POOL_MIN_CHUNK", 4096);
Should this be called page size instead of min chunk?
src/storage/pooled_storage_manager.h
Outdated
@@ -82,19 +89,19 @@ class GPUPooledStorageManager final : public StorageManager {
 private:
  void ReleaseAll();
  // used memory
  size_t used_memory_ = 0;
  size_t used_memory_ = 0, min_chunk_;
Please declare this on a new line.
src/storage/pooled_storage_manager.h
Outdated
 private:
#if __SIZEOF_SIZE_T__ == __SIZEOF_LONG__

#if defined(__clang__) || defined(__GNUC__)
Does this need to be so complicated? You just need to take the highest bit and shift it left by 1 if the result is smaller than size.
This is called finding the MSB. See https://www.google.com/search?ei=__UNW-DMG6iF0wLqyr4g&q=how+to+find+most+significant+bit+in+c&oq=take+highest+bit&gs_l=psy-ab.1.0.0i71k1l8.0.0.0.4417.0.0.0.0.0.0.0.0..0.0....0...1c..64.psy-ab..0.0.0....0.LUbIFjlZyeU
These builtins would utilize hardware instructions when available.
Is it really faster? It looks too complicated.
Also, the default implementation with pow and log is really slow.
I will change the default implementation to use bit shifting and then do a comparison
I compared my current solution, the bit shifting, and static_cast<int>(std::ceil(std::log2(s))). With -O3 turned on, on my Mac (clang), the speed looks like the following:
Running 10000000 iters.
Addr width 64
It took me 0.00981569 seconds. result: 223222785
It took me 0.128623 seconds. result: 223222785
It took me 0.0801588 seconds. result: 223222785
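For readers following the discussion, here is a hedged sketch of the three approaches being compared, each computing the bucket index ceil(log2(s)). The function names are hypothetical and this is not the code from the PR. The builtin variant assumes GCC or Clang and falls back to the shift loop elsewhere.

```cpp
#include <cassert>
#include <climits>
#include <cmath>
#include <cstddef>

// Bit-shifting version: ceil(log2(s)) equals the bit length of (s - 1).
inline int BucketByShift(std::size_t s) {
  assert(s > 0);
  int bits = 0;
  for (std::size_t v = s - 1; v != 0; v >>= 1) ++bits;
  return bits;
}

// Floating-point version quoted in the thread. A double cannot represent
// every 64-bit integer exactly, so very large sizes may be misrounded.
inline int BucketByLog2(std::size_t s) {
  assert(s > 0);
  return static_cast<int>(std::ceil(std::log2(static_cast<double>(s))));
}

// Builtin version: count-leading-zeros usually maps to one hardware instruction.
inline int BucketByBuiltin(std::size_t s) {
  assert(s > 0);
#if defined(__clang__) || defined(__GNUC__)
  if (s == 1) return 0;  // __builtin_clzll(0) is undefined, so handle s - 1 == 0.
  const int width = static_cast<int>(sizeof(unsigned long long) * CHAR_BIT);
  return width - __builtin_clzll(static_cast<unsigned long long>(s - 1));
#else
  return BucketByShift(s);
#endif
}

// Size of the allocation backing a bucket: 2^bucket bytes.
inline std::size_t BucketSize(int bucket) {
  return static_cast<std::size_t>(1) << bucket;
}
```

For example, a request of 1025 bytes lands in bucket 11 under all three variants, so it would be served by a 2048-byte allocation.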
0319b42 to 63aac3f
I've simplified the implementation to exclude the optimizations using intrinsics and bit scans. They are backed up in https://github.com/szha/mxnet/tree/mem_strategy_backup
amalgamation/amalgamation.py
Outdated
@@ -23,7 +23,7 @@
import platform

blacklist = [
'Windows.h', 'cublas_v2.h', 'cuda/tensor_gpu-inl.cuh',
'Windows.h', 'intrin.h', 'cublas_v2.h', 'cuda/tensor_gpu-inl.cuh',
Please revert this change.
e57bae9 to 9b39b72
tests/cpp/storage/storage_test.cc
Outdated

TEST(GPUStorage, Round_GPU) {
  if (mxnet::test::unitTestsWithCuda) {
    putenv("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=20");
How long does this variable persist? It could have side effects on other tests
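A variable set with putenv stays in the process environment until it is changed again or the process exits, so it remains visible to any test that runs later in the same binary. One way to contain it is a scoped guard, sketched below. `ScopedEnv` is a hypothetical helper, not part of MXNet or gtest, and it relies on POSIX `setenv`/`unsetenv`, so Windows would need a different call.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdlib.h>  // POSIX setenv / unsetenv
#include <string>

// Hypothetical test helper: sets an environment variable for the lifetime of
// the object and restores the previous state when it goes out of scope.
class ScopedEnv {
 public:
  ScopedEnv(const std::string& name, const std::string& value) : name_(name) {
    const char* old = std::getenv(name.c_str());
    had_old_ = (old != nullptr);
    if (had_old_) old_value_ = old;
    const int rc = ::setenv(name.c_str(), value.c_str(), 1);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warnings when NDEBUG is defined
  }
  ~ScopedEnv() {
    if (had_old_) {
      ::setenv(name_.c_str(), old_value_.c_str(), 1);
    } else {
      ::unsetenv(name_.c_str());
    }
  }
  ScopedEnv(const ScopedEnv&) = delete;
  ScopedEnv& operator=(const ScopedEnv&) = delete;

 private:
  std::string name_;
  std::string old_value_;
  bool had_old_ = false;
};
```

A test would declare `ScopedEnv guard("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF", "20");` at the top of its body, and the variable would be restored when the test returns.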
tests/cpp/storage/storage_test.cc
Outdated
#include <gtest/gtest.h>
#include <dmlc/logging.h>
#include <mxnet/storage.h>
#include <cstdio>
#include "test_util.h"
#include "storage/pooled_storage_manager.h"
Duplicate include? I think it's already part of the storage namespace in mxnet/storage.h.
d0d8bf7 to 00086f1
@@ -16,7 +16,7 @@
# under the License.

from mxnet.test_utils import *
from common import setup_module, with_seed
from common import setup_module, with_seed, teardown
Is it really necessary to import this in every single test? Looks a bit ugly tbh
Applying this change allows all tests within a module to finish before moving on to the next module, which eliminates the case where side effects of tests in one module spill over into the next. In terms of testing practice, including a setup/teardown is common.
Yeah, but we're not actually using it in most files, right?
now we are
Ah in common.py :) But isn't it sufficient to import it there?
Unfortunately no. It is the same case as setup_module.
argh :/
37ecc98 to 72b386f
size_t free, total;
cudaMemGetInfo(&free, &total);
if (free <= total * reserve_ / 100 || size > free - total * reserve_ / 100)
  ReleaseAll();
What will happen to the storage handles currently pointing to some of the memory?
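The condition in the hunk above can be read more easily as a standalone predicate. This is a minimal sketch with hypothetical names: `free_bytes` and `total_bytes` stand for the values returned by cudaMemGetInfo and `reserve_percent` for `reserve_`. It only restates when ReleaseAll would be triggered and does not address what happens to live handles.

```cpp
#include <cassert>
#include <cstddef>

// Returns true when the pool should be flushed before allocating `size` bytes.
inline bool ShouldReleaseAll(std::size_t free_bytes, std::size_t total_bytes,
                             std::size_t reserve_percent, std::size_t size) {
  assert(reserve_percent <= 100);
  const std::size_t reserved = total_bytes * reserve_percent / 100;
  // Checking this first keeps the unsigned subtraction below from wrapping
  // around when free memory has already dropped under the reserve.
  if (free_bytes <= reserved) return true;
  return size > free_bytes - reserved;
}
```

With 1000 bytes total and a 5% reserve, 50 bytes are held back: a request of 60 bytes with 100 bytes free triggers a release, while a request of 50 bytes does not.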
std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU));
int bucket = get_bucket(handle->size);
size_t size = get_size(bucket);
auto&& reuse_pool = memory_pool_[bucket];
Even if it's not an error (the rvalue reference will be deduced to a normal lvalue reference), it's better to spell it explicitly as auto&.
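The deduction the reviewer describes can be verified at compile time. In this sketch the `Pool` alias and `PushToBucket` function are hypothetical stand-ins for `memory_pool_` and its caller. Both declarations end up with the same type, which is why `auto&` states the intent more directly.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the pool container indexed by bucket.
using Pool = std::unordered_map<int, std::vector<void*>>;

inline std::size_t PushToBucket(Pool& pool, int bucket, void* ptr) {
  auto&& via_forwarding = pool[bucket];  // operator[] yields an lvalue
  auto& via_lvalue = pool[bucket];
  // Reference collapsing turns auto&& into an ordinary lvalue reference here.
  static_assert(
      std::is_same<decltype(via_forwarding), std::vector<void*>&>::value,
      "auto&& bound to an lvalue deduces an lvalue reference");
  static_assert(
      std::is_same<decltype(via_lvalue), std::vector<void*>&>::value,
      "auto& is an lvalue reference");
  assert(&via_forwarding == &via_lvalue);  // both name the same vector
  via_lvalue.push_back(ptr);
  return via_forwarding.size();
}
```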
@szha should we document this new env variable or is it still experimental?
@ThomasDelteil I intended to have people experiment with this first.
* use nearest power of 2 for gpu memory pool sizes * add linear * add test
Description
adjust GPU memory pool strategy
Changes
(MXNET_GPU_MEM_POOL_TYPE="Round") for using nearest power of 2 size for better memory reuse
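To show why power-of-2 rounding can improve reuse, here is a toy host-memory pool, offered as a sketch only. `ToyRoundedPool` is a hypothetical class that uses `std::malloc` in place of `cudaMalloc`. It ignores locking, the reserve check and the linear cutoff, none of which are modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Toy pool: every request is rounded up to 2^bucket bytes, and freed blocks
// are kept per bucket so later requests of a similar size can reuse them.
class ToyRoundedPool {
 public:
  ToyRoundedPool() = default;
  ToyRoundedPool(const ToyRoundedPool&) = delete;
  ToyRoundedPool& operator=(const ToyRoundedPool&) = delete;
  ~ToyRoundedPool() {
    for (auto& kv : free_) {
      for (void* p : kv.second) std::free(p);
    }
  }

  void* Alloc(std::size_t size) {
    assert(size > 0);
    const int bucket = Bucket(size);
    auto& reuse = free_[bucket];
    if (!reuse.empty()) {
      void* p = reuse.back();
      reuse.pop_back();
      ++hits_;
      return p;
    }
    return std::malloc(static_cast<std::size_t>(1) << bucket);
  }

  // Returns a block to the pool; `size` must be the size passed to Alloc.
  void Free(void* ptr, std::size_t size) { free_[Bucket(size)].push_back(ptr); }

  std::size_t hits() const { return hits_; }

 private:
  // ceil(log2(s)), computed as the bit length of (s - 1).
  static int Bucket(std::size_t s) {
    int bits = 0;
    for (std::size_t v = s - 1; v != 0; v >>= 1) ++bits;
    return bits;
  }

  std::unordered_map<int, std::vector<void*>> free_;
  std::size_t hits_ = 0;
};
```

A 1000-byte block that has been freed can serve a later 1020-byte request, because both sizes round to the 1024-byte bucket. A pool keyed on exact sizes would have to allocate again.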