
Fix race condition in blas_server_omp.c #1536

Merged: 2 commits merged into OpenMathLib:develop on May 11, 2018
Conversation

@dzyong (Contributor) commented Apr 24, 2018

Calling cblas_sgemm from multiple threads causes a race condition in blas_server_omp.c.
Maybe #1416 is the same issue, but I'm not sure.
I only hit this issue when using OpenBLAS in Caffe with multiple threads enabled.
This commit separates the buffers for each thread that calls exec_blas in parallel.

@dzyong (Contributor Author) commented Apr 24, 2018

Hi @martin-frbg

The CI build failed (https://travis-ci.org/xianyi/OpenBLAS/jobs/370388725). It seems the gcc version is too old; it does not have stdatomic.h.

Would you please upgrade gcc to 4.9 or above, or give me some suggestions on how to fix this?

@martin-frbg (Collaborator)

Can you try with #if __STDC_VERSION__ >= 201112L like I did in #1486?

@martin-frbg (Collaborator)

Is there a (preferably simple) test case that shows the problem? The general impression so far has been that using OpenMP "solves" the locking problems seen elsewhere in the code, and that tools like helgrind used to debug OMP threads (as was tried in #1416) only generate lots of false positives.

@dzyong (Contributor Author) commented Apr 24, 2018

That fixes the missing stdatomic.h, but there are still no atomic operation functions, such as atomic_compare_exchange_weak.

Would some locking mechanism be acceptable?

This issue can be reproduced by having multiple threads call cblas_sgemm at the same time. Maybe #1416 only solved the issue between OMP-internal threads.

@martin-frbg (Collaborator)

Falling back to the old "unsafe" code is certainly better than just telling everybody to use a newer/better compiler. Locking is acceptable, but so far the impression has been that OMP already seemed to take care of that.

@dzyong (Contributor Author) commented Apr 24, 2018

OK, then I will fall back to the old code when the compiler's __STDC_VERSION__ is < 201112L.

@dzyong (Contributor Author) commented Apr 24, 2018

I uploaded the new version of the code; would you please check whether it is OK now?
This part of the code is still needed:

#ifndef _Atomic
#define _Atomic volatile
#endif

Because _Atomic is disabled by default when we enable OpenMP:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65467
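For reference, the guard and the define can fit together roughly like this (an illustrative sketch; the exact arrangement in blas_server_omp.c may differ):

    /* C11 compilers provide real atomic operations */
    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
    #include <stdatomic.h>
    #endif

    /* _Atomic can still be unavailable (pre-C11 compilers, or gcc with
       -fopenmp, see gcc bug 65467), so degrade it to plain volatile */
    #ifndef _Atomic
    #define _Atomic volatile
    #endif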

@dzyong (Contributor Author) commented Apr 24, 2018

Hi @martin-frbg
The CI job that failed (https://ci.appveyor.com/project/xianyi/openblas/build/0.2.19.1158/job/49n1pq4aoqnwvxde) did so because of:

Build execution time has reached the maximum allowed time for your plan (60 minutes).

This is not an actual build error. Would you please review whether this commit is OK?

@martin-frbg (Collaborator)

Do you happen to have a test case that fails without your changes, or did you find this by reviewing the current implementation?

@dzyong (Contributor Author) commented Apr 24, 2018

I use OpenBLAS in Caffe, which computes convolutions with cblas_sgemm (https://github.com/BVLC/caffe/blob/master/src/caffe/util/math_functions.cpp). When I start two threads that each run a Caffe forward pass, and these call cblas_sgemm in parallel, cblas_sgemm's results differ from run to run.
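A minimal reproducer along these lines might look like the sketch below (illustrative, not code from the PR; matrix size and fill data are arbitrary):

    /* Two pthreads call cblas_sgemm concurrently; with the race present,
       a result can occasionally differ from the single-threaded reference. */
    #include <cblas.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define DIM 256
    static float A[DIM*DIM], B[DIM*DIM], C1[DIM*DIM], C2[DIM*DIM], Cref[DIM*DIM];

    static void *worker(void *arg) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    DIM, DIM, DIM, 1.0f, A, DIM, B, DIM, 0.0f, (float *)arg, DIM);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < DIM*DIM; i++) { A[i] = i % 7; B[i] = i % 5; }
        worker(Cref);                       /* single-threaded reference run */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, C1);
        pthread_create(&t2, NULL, worker, C2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("C1 %s, C2 %s\n",
               memcmp(C1, Cref, sizeof C1) ? "DIFFERS" : "matches",
               memcmp(C2, Cref, sizeof C2) ? "DIFFERS" : "matches");
        return 0;
    }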

@dzyong (Contributor Author) commented Apr 24, 2018

From the current implementation we can also see that if two threads enter exec_blas at the same time, both will use the same blas_thread_buffer[pos] for their own OMP threads, and the result will be unpredictable.
As you can see, the fallback solution also has a race condition, because the check and the update are two separate steps:

 if(blas_buffer_inuse[i] == false) {
   blas_buffer_inuse[i] = true;
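For illustration, a compare-exchange turns that check and update into one atomic step, so two callers can never both claim the same slot. A sketch assuming C11 stdatomic (the helper functions and slot count are illustrative, not the exact PR code):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_PARALLEL_NUMBER 2                   /* illustrative value */

    static _Atomic bool blas_buffer_inuse[MAX_PARALLEL_NUMBER];

    /* atomic_compare_exchange_weak stores `true` only if the slot still
       holds `false`, so no other thread can claim slot i in between. */
    static int claim_slot(void) {
        int i = 0;
        bool expected = false;
        while (!atomic_compare_exchange_weak(&blas_buffer_inuse[i], &expected, true)) {
            expected = false;                       /* slot taken or spurious failure */
            i = (i + 1) % MAX_PARALLEL_NUMBER;      /* try the next slot */
        }
        return i;
    }

    static void release_slot(int i) {
        atomic_store(&blas_buffer_inuse[i], false); /* hand the slot back */
    }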

@brada4 (Contributor) commented Apr 24, 2018

Recent OpenMP versions allow an unlimited number of hierarchy levels. If we were not holding a per-thread register (much less an n^2-sized one), we could simply rely on OpenMP's own thread detection. OMP_NUM_THREADS=N,1 is the usual default assumption; on a huge or Cluster-on-Die machine one might instead want the number of NUMA nodes followed by the number of processors per node.

@dzyong (Contributor Author) commented Apr 25, 2018

Hi @brada4, I'm sorry, I'm not quite sure what you mean.
It seems that if OMP_NUM_THREADS=1, the code never enters blas_server_omp, so this problem does not arise. If someone uses this on a NUMA machine, the new implementation can handle that as well.

@dzyong (Contributor Author) commented Apr 25, 2018

Hi @martin-frbg,
Based on the analysis above, I think the new solution handles multiple threads calling exec_blas at the same time, and the fallback solution at least reduces the chance of hitting the race.

@brada4 (Contributor) commented Apr 25, 2018

Will it handle deep nesting? e.g. OMP_NUM_THREADS=2,2,10,2,1

@dzyong (Contributor Author) commented Apr 25, 2018

I have only seen OMP_NUM_THREADS set to a single integer in OpenBLAS, such as OMP_NUM_THREADS=2.

Is there any place in OpenBLAS where a deeply nested OMP_NUM_THREADS is used?

@dzyong (Contributor Author) commented Apr 26, 2018

Hi @martin-frbg
Would you please check whether this commit is OK, or whether there is anything I need to improve?

Hi @brada4
Based on the description above, I don't think we need to support deeply nested OpenMP, because it isn't used in OpenBLAS.

@dzyong (Contributor Author) commented Apr 27, 2018

Could someone please review this commit?

@martin-frbg (Collaborator)

Looks good to me at first glance, but I lack the time to really review it at the moment. Could you add your NUM_PARALLEL parameter to Makefile.rule with a short explanation as well?
(In general it worries me when such changes appear to be necessary in code that was assumed to work for a long time, but I guess it is the combination of fast multicore systems becoming more the norm, and OpenBLAS getting more exposure)
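Such an entry might look like the following (a hypothetical sketch; the wording actually merged into Makefile.rule may differ):

    # If your application itself runs multiple threads that may call OpenBLAS
    # functions concurrently, set NUM_PARALLEL to the maximum number of such
    # concurrent calls so that each caller gets its own set of buffers.
    # NUM_PARALLEL = 2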

@brada4 (Contributor) commented Apr 27, 2018

If we have, say, an 8-way 28-core machine, you create a 1.5 MB n^2-sized structure that does not fit in the fast caches.

@dzyong (Contributor Author) commented Apr 27, 2018

@martin-frbg
OK, I will add it soon.

@brada4
I think each caller only has to keep its own structure, so it is much smaller. Besides, a correct result matters more than a wrong result computed faster.

@dzyong (Contributor Author) commented Apr 27, 2018

Hi @martin-frbg
I added some explanation to Makefile.rule and uploaded it. My English is poor, though, so I'm not sure whether I have made grammatical mistakes; would you please check it? Thanks so much.

@brada4 (Contributor) commented Apr 27, 2018

OpenMP already has such functionality built in, without recompiling the library:
https://msdn.microsoft.com/en-us/library/xdeb73hc.aspx

@martin-frbg (Collaborator)

@dzyong I have made some small changes, but I am not a native speaker either. (And I am not sure I got the description of what happens in the "if more threads call" case correct.)
@brada4 I do not see how omp_get_num_threads addresses this particular problem?

@brada4 (Contributor) commented Apr 27, 2018

You want to address the overwriting of the array holding the list of threads. O(n^inf) complexity is not a solution.

@dzyong (Contributor Author) commented Apr 27, 2018

@martin-frbg Thanks, it's much better now.

@brada4 omp_get_num_threads can only report the number of threads in the current parallel region. This problem is caused by multiple threads entering the same parallel region at the same time and accessing identical memory. Do you have a suggestion with lower complexity? I think this solution is better than a locking mechanism, which would be much slower. The new code's complexity is O(N), and N is very small, so it costs practically no time.

@brada4 (Contributor) commented Apr 27, 2018

The output data itself is independent and will not clash.
The supposedly insignificant megabyte is actually very significant, since it gets dragged around between caches in what is now becoming a fat loop.
Actually, omp_get_num_threads reports the thread count at the current hierarchy level:

    #pragma omp parallel
    /* here omp_get_num_threads() == nproc */
        #pragma omp parallel
        /* here omp_get_num_threads() == 1 (unless nested parallelism is enabled) */

@martin-frbg (Collaborator)

Seems to me this fixes a use case that was previously "don't do that", without affecting anything else (as long as the new option is not set).
And I still do not understand how knowing the number of OMP threads at the current level tells us anything about how many "upper level" Caffe threads are (con)currently trying to call a multithreaded OpenBLAS function.

@dzyong (Contributor Author) commented Apr 27, 2018

The new code adds only (N-1) * MAX_CPU_NUMBER bytes compared to the old code, where N is MAX_PARALLEL_NUMBER, and N * MAX_CPU_NUMBER is a few bytes, not megabytes. Besides, the loop is very simple, so it shouldn't affect efficiency.

I don't see how omp_get_num_threads could obtain the application's thread count; I think it can only report the parallel region's count. Would you explain in more detail?

@brada4 (Contributor) commented Apr 27, 2018

@martin-frbg If the Caffe threads are OpenMP threads, then they know about each other (the default being NPROC,1, tunable via environment variables, not via a structure fixed at build time).
There are about a dozen places where the global thread number is used and another dozen with thread-accounting arrays; if those could be cleaned up, it would work very elegantly and MKL-like ever after.
In other words, the defect is not the atomic access to the thread scoreboard but the mere presence of the thread scoreboard, IMO.

@dzyong (Contributor Author) commented Apr 27, 2018

I don't think we can assume that every application using OpenBLAS calls the OpenBLAS API from an OpenMP thread (it may call from a pthread or something else), so we can't be sure omp_get_num_threads would work in every situation.

If you submit a better solution, I can close this pull request.

@dzyong (Contributor Author) commented Apr 28, 2018

Hi @martin-frbg
Would you please merge this pull request for now?
If someone has a better solution in the future, they can build on this, or revert it and implement a new one.
Thanks.

@dzyong (Contributor Author) commented May 7, 2018

Hi @martin-frbg
To make sure this pull request works well and introduces no new issues, would you have some time to test it soon?
In my testing with Caffe, the new solution does fix my problem and introduces no new issues.
Thank you in advance.

@martin-frbg (Collaborator)

I lack the time (and serious hardware) for appropriate testing at the moment, and I am also waiting for other comments, including an answer from xianyi on whether he wants this to go in before or after the hopefully upcoming release.

@xianyi (Collaborator) commented May 11, 2018

@dzyong

How do you call Caffe from two threads? If you build OpenBLAS with OpenMP, we assume your application uses OpenMP too. If your application calls the OpenMP version of sgemm from pthreads, you may hit the error.

@dzyong (Contributor Author) commented May 11, 2018

Yes, my application starts two pthreads, each of which has its own Caffe instance. Both my application and OpenBLAS use OpenMP.
When my application's two pthreads call sgemm at the same time, they hit this error; I call sgemm from each pthread, not from an OpenMP context.
This pull request is intended to solve exactly that situation.

xianyi merged commit 50acc40 into OpenMathLib:develop on May 11, 2018
@xianyi (Collaborator) commented May 11, 2018

I see. I will merge this PR.
