
Fix race condition in blas_server_omp.c #1536

Merged: 2 commits merged into OpenMathLib:develop on May 11, 2018
Conversation

@dzyong (Contributor) commented Apr 24, 2018

Calling cblas_sgemm from multiple threads causes a race condition in blas_server_omp.c.
Maybe #1416 is the same issue, but I'm not sure.
I only hit this issue when using OpenBLAS in Caffe with multiple threads enabled.
This commit separates the buffers for each thread that calls exec_blas in parallel.

@dzyong (Contributor Author) commented Apr 24, 2018

Hi @martin-frbg

The CI build failed (https://travis-ci.org/xianyi/OpenBLAS/jobs/370388725). It seems the gcc version is too old; it does not have stdatomic.h.

Would you please upgrade gcc to 4.9 or above, or give me some suggestions on how to fix this?

@martin-frbg (Collaborator)

Can you try with #if __STDC_VERSION__ >= 201112L like I did in #1486?

@martin-frbg (Collaborator)

Is there a (preferably simple) test case that shows the problem? The general impression so far has been that using OpenMP "solves" the locking problems seen elsewhere in the code, and that tools like helgrind used to debug OMP threads (as was tried in #1416) only generate lots of false positives.

@dzyong (Contributor Author) commented Apr 24, 2018

That fixes the missing stdatomic.h, but there are still no atomic operation functions, such as atomic_compare_exchange_weak.

Would some locking mechanism be acceptable?

This issue can be reproduced by having multiple threads call cblas_sgemm at the same time. Maybe #1416 only solved the issue between OMP-internal threads.

@martin-frbg (Collaborator)

Falling back to the old "unsafe" code is certainly better than just telling everybody to use a newer/better compiler. Locking is acceptable, but so far the impression has been that OMP already seemed to take care of that.

@dzyong (Contributor Author) commented Apr 24, 2018

OK, then I will fall back to the old code when the compiler's __STDC_VERSION__ is < 201112L.

@dzyong (Contributor Author) commented Apr 24, 2018

I uploaded the new version of the code; would you please check whether it is OK now?
This part of the code is still needed:

#ifndef _Atomic
#define _Atomic volatile
#endif

Because _Atomic is disabled by default when we enable OpenMP:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65467
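For reference, the guard and the define can fit together roughly like this (an illustrative sketch; the exact arrangement in blas_server_omp.c may differ):

    /* C11 compilers provide real atomic operations */
    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
    #include <stdatomic.h>
    #endif

    /* _Atomic can still be unavailable (pre-C11 compilers, or gcc with
       -fopenmp, see gcc bug 65467), so degrade it to plain volatile */
    #ifndef _Atomic
    #define _Atomic volatile
    #endif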

@dzyong (Contributor Author) commented Apr 24, 2018

Hi @martin-frbg
The CI job that failed (https://ci.appveyor.com/project/xianyi/openblas/build/0.2.19.1158/job/49n1pq4aoqnwvxde) did so because of:

Build execution time has reached the maximum allowed time for your plan (60 minutes).

This is not an actual build error. Would you please review whether this commit is OK?

@martin-frbg (Collaborator)

Do you happen to have a test case that fails without your changes, or did you find this by reviewing the current implementation?

@dzyong (Contributor Author) commented Apr 24, 2018

I use OpenBLAS in Caffe, which computes convolutions with cblas_sgemm (https://github.com/BVLC/caffe/blob/master/src/caffe/util/math_functions.cpp). When I start two threads that each run a Caffe forward pass, and these call cblas_sgemm in parallel, cblas_sgemm's results differ from run to run.
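A minimal reproducer along these lines might look like the sketch below (illustrative, not code from the PR; matrix size and fill data are arbitrary):

    /* Two pthreads call cblas_sgemm concurrently; with the race present,
       a result can occasionally differ from the single-threaded reference. */
    #include <cblas.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define DIM 256
    static float A[DIM*DIM], B[DIM*DIM], C1[DIM*DIM], C2[DIM*DIM], Cref[DIM*DIM];

    static void *worker(void *arg) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    DIM, DIM, DIM, 1.0f, A, DIM, B, DIM, 0.0f, (float *)arg, DIM);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < DIM*DIM; i++) { A[i] = i % 7; B[i] = i % 5; }
        worker(Cref);                       /* single-threaded reference run */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, C1);
        pthread_create(&t2, NULL, worker, C2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("C1 %s, C2 %s\n",
               memcmp(C1, Cref, sizeof C1) ? "DIFFERS" : "matches",
               memcmp(C2, Cref, sizeof C2) ? "DIFFERS" : "matches");
        return 0;
    }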

@dzyong (Contributor Author) commented Apr 24, 2018

From the current implementation we can also see that if two threads enter exec_blas at the same time, both will use the same blas_thread_buffer[pos] for their own OMP threads, and the result will be unpredictable.
As you can see, the fallback solution also has a race condition, because the check and the update are two separate steps:

 if(blas_buffer_inuse[i] == false) {
   blas_buffer_inuse[i] = true;
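For illustration, a compare-exchange turns that check and update into one atomic step, so two callers can never both claim the same slot. A sketch assuming C11 stdatomic (the helper functions and slot count are illustrative, not the exact PR code):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_PARALLEL_NUMBER 2                   /* illustrative value */

    static _Atomic bool blas_buffer_inuse[MAX_PARALLEL_NUMBER];

    /* atomic_compare_exchange_weak stores `true` only if the slot still
       holds `false`, so no other thread can claim slot i in between. */
    static int claim_slot(void) {
        int i = 0;
        bool expected = false;
        while (!atomic_compare_exchange_weak(&blas_buffer_inuse[i], &expected, true)) {
            expected = false;                       /* slot taken or spurious failure */
            i = (i + 1) % MAX_PARALLEL_NUMBER;      /* try the next slot */
        }
        return i;
    }

    static void release_slot(int i) {
        atomic_store(&blas_buffer_inuse[i], false); /* hand the slot back */
    }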

@brada4 (Contributor) commented Apr 24, 2018

Recent OpenMP versions allow an unlimited number of hierarchy levels. If we were not holding a per-thread register (much less an n^2-sized one), we could simply rely on OpenMP's own thread detection. OMP_NUM_THREADS=N,1 is the usual default assumption; on a huge or Cluster-on-Die machine one might instead want the number of NUMA nodes followed by the number of processors per node.

@dzyong (Contributor Author) commented Apr 25, 2018

Hi @brada4, I'm sorry, I'm not quite sure what you mean.
It seems that if OMP_NUM_THREADS=1, the code never enters blas_server_omp, so this problem does not arise. If someone uses this on a NUMA machine, the new implementation can handle that as well.

@dzyong (Contributor Author) commented Apr 25, 2018

Hi @martin-frbg,
Based on the analysis above, I think the new solution handles multiple threads calling exec_blas at the same time, and the fallback solution at least reduces the chance of hitting the race.

@brada4 (Contributor) commented Apr 25, 2018

Will it handle deep nesting? e.g. OMP_NUM_THREADS=2,2,10,2,1

@dzyong (Contributor Author) commented Apr 25, 2018

I have only seen OMP_NUM_THREADS set to a single integer in OpenBLAS, such as OMP_NUM_THREADS=2.

Is there any place in OpenBLAS where a deeply nested OMP_NUM_THREADS is used?

@dzyong (Contributor Author) commented Apr 26, 2018

Hi @martin-frbg
Would you please check whether this commit is OK, or whether there is anything I need to improve?

Hi @brada4
Based on the description above, I don't think we need to support deeply nested OpenMP, because it isn't used in OpenBLAS.

@dzyong (Contributor Author) commented Apr 27, 2018

Could someone please review this commit?

@martin-frbg (Collaborator)

Looks good to me at first glance, but I lack the time to really review it at the moment. Could you add your NUM_PARALLEL parameter to Makefile.rule with a short explanation as well?
(In general it worries me when such changes appear to be necessary in code that was assumed to work for a long time, but I guess it is the combination of fast multicore systems becoming more the norm, and OpenBLAS getting more exposure)
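Such an entry might look like the following (a hypothetical sketch; the wording actually merged into Makefile.rule may differ):

    # If your application itself runs multiple threads that may call OpenBLAS
    # functions concurrently, set NUM_PARALLEL to the maximum number of such
    # concurrent calls so that each caller gets its own set of buffers.
    # NUM_PARALLEL = 2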

@brada4 (Contributor) commented Apr 27, 2018

If we have, say, an 8-way 28-core machine, you create a 1.5 MB n^2-sized structure that does not fit in the fast caches.

@dzyong (Contributor Author) commented Apr 27, 2018

@martin-frbg
OK, I will add it soon.

@brada4
I think each caller only has to keep its own structure, so it is much smaller. Besides, a correct result matters more than a wrong result computed faster.

@dzyong (Contributor Author) commented Apr 27, 2018

Hi @martin-frbg
I added some explanation to Makefile.rule and uploaded it. My English is poor, though, so I'm not sure whether I have made grammatical mistakes; would you please check it? Thanks so much.

@brada4 (Contributor) commented Apr 27, 2018

OpenMP already has such functionality built in, without recompiling the library:
https://msdn.microsoft.com/en-us/library/xdeb73hc.aspx

@martin-frbg (Collaborator)

@dzyong I have made some small changes, but I am not a native speaker either. (And I am not sure I got the description of what happens in the "if more threads call" case correct.)
@brada4 I do not see how omp_get_num_threads addresses this particular problem?

@brada4 (Contributor) commented Apr 27, 2018

You want to address the overwriting of the array holding the list of threads. O(n^inf) complexity is not a solution.

@dzyong (Contributor Author) commented Apr 27, 2018

@martin-frbg Thanks, it's much better now.

@brada4 omp_get_num_threads can only report the number of threads in the current parallel region. This problem is caused by multiple threads entering the same parallel region at the same time and accessing identical memory. Do you have a suggestion with lower complexity? I think this solution is better than a locking mechanism, which would be much slower. The new code's complexity is O(N), and N is very small, so it costs practically no time.

@brada4 (Contributor) commented Apr 27, 2018

The output data itself is independent and will not clash.
The supposedly insignificant megabyte is actually very significant, since it gets dragged around between caches in what is now becoming a fat loop.
Actually, omp_get_num_threads reports the thread count at the current hierarchy level:

    #pragma omp parallel
    /* here omp_get_num_threads() == nproc */
        #pragma omp parallel
        /* here omp_get_num_threads() == 1 (unless nested parallelism is enabled) */

@martin-frbg (Collaborator)

Seems to me this fixes a use case that was previously "don't do that", without affecting anything else (as long as the new option is not set).
And I still do not understand how knowing the number of OMP threads at the current level tells us anything about how many "upper level" Caffe threads are (con)currently trying to call a multithreaded OpenBLAS function.

@dzyong (Contributor Author) commented Apr 27, 2018

The new code adds only (N-1) * MAX_CPU_NUMBER bytes compared to the old code, where N is MAX_PARALLEL_NUMBER, and N * MAX_CPU_NUMBER is a few bytes, not megabytes. Besides, the loop is very simple, so it shouldn't affect efficiency.

I don't see how omp_get_num_threads could obtain the application's thread count; I think it can only report the parallel region's count. Would you explain in more detail?

@brada4 (Contributor) commented Apr 27, 2018

@martin-frbg If the Caffe threads are OpenMP threads, then they know about each other (the default being NPROC,1, tunable via environment variables, not via a structure fixed at build time).
There are about a dozen places where the global thread number is used and another dozen with thread-accounting arrays; if those could be cleaned up, it would work very elegantly and MKL-like ever after.
In other words, the defect is not the atomic access to the thread scoreboard but the mere presence of the thread scoreboard, IMO.

@dzyong (Contributor Author) commented Apr 27, 2018

I don't think we can assume that every application using OpenBLAS calls the OpenBLAS API from an OpenMP thread (it may call from a pthread or something else), so we can't be sure omp_get_num_threads would work in every situation.

If you submit a better solution, I can close this pull request.

@dzyong (Contributor Author) commented Apr 28, 2018

Hi @martin-frbg
Would you please merge this pull request for now?
If someone has a better solution in the future, they can build on this, or revert it and implement a new one.
Thanks.

@dzyong (Contributor Author) commented May 7, 2018

Hi @martin-frbg
To make sure this pull request works well and introduces no new issues, would you have some time to test it soon?
In my testing with Caffe, the new solution does fix my problem and introduces no new issues.
Thank you in advance.

@martin-frbg (Collaborator)

I lack the time (and serious hardware) for appropriate testing at the moment, and I am also waiting for other comments, including an answer from xianyi on whether he wants this to go in before or after the hopefully upcoming release.

@xianyi (Collaborator) commented May 11, 2018

@dzyong

How do you call Caffe from two threads? If you build OpenBLAS with OpenMP, we assume your application uses OpenMP too. If your application calls the OpenMP version of sgemm from pthreads, you may hit the error.

@dzyong (Contributor Author) commented May 11, 2018

Yes, my application starts two pthreads, each of which has its own Caffe instance. Both my application and OpenBLAS use OpenMP.
When my application's two pthreads call sgemm at the same time, they hit this error; I call sgemm from each pthread, not from an OpenMP context.
This pull request is intended to solve exactly that situation.

xianyi merged commit 50acc40 into OpenMathLib:develop on May 11, 2018
@xianyi (Collaborator) commented May 11, 2018

I see. I will merge this PR.
