
Multiplex worker implementation limits parallelism #494

Closed
jdai8 opened this issue Feb 23, 2021 · 20 comments
Comments

@jdai8
Contributor

jdai8 commented Feb 23, 2021

The current multiplex worker implementation (at HEAD) sequences reads and writes:

blockable {
  generateSequence { WorkRequest.parseDelimitedFrom(io.input) }
}.asFlow()
  .map { request ->
    info { "received req: ${request.requestId}" }
    doTask("request ${request.requestId}") { ctx ->
      request.argumentsList.run {
        execute(ctx, toList())
      }
    }.let { result ->
      this@run.info { "task result ${result.status}" }
      WorkerProtocol.WorkResponse.newBuilder().apply {
        output = listOf(
          result.log.out.toString(),
          io.captured.toByteArray().toString(UTF_8)
        ).filter { it.isNotBlank() }.joinToString("\n")
        exitCode = result.status.exit
        requestId = request.requestId
      }.build()
    }
  }
  .collect { response ->
    blockable {
      info { response.toString() }
      response.writeDelimitedTo(io.output)
      io.output.flush()
    }
  }
.collect { response ->
blockable {
info {
response.toString()
}
response.writeDelimitedTo(io.output)
io.output.flush()
}
}

I think this artificially limits performance in certain cases. For example, if we receive a large work request first, all subsequent work requests will have to wait for it to finish before we write their work responses to stdout.
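The sequencing is easy to demonstrate with a tiny, hypothetical sketch (names and timings invented; `delay` stands in for `doTask`, the empty `collect` for writing responses): because `map` suspends on each request before the flow emits the next one, five 100 ms requests take at least 500 ms end to end.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import kotlin.system.measureTimeMillis

fun main() = runBlocking {
    val elapsed = measureTimeMillis {
        (1..5).asFlow()
            .map { id -> delay(100); id } // stands in for doTask(...)
            .collect { }                  // stands in for writeDelimitedTo(...)
    }
    // map is sequential here: total time is roughly the sum of the delays.
    println(elapsed >= 500)
}
```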

I tested this on a simple reproducer project here: https://github.com/jdai8/rules_kotlin_coroutine_repro. This project has one large source file Big.kt and several small SmallX.kt files. As you can see from this screenshot, //src:small and //src:small4 have to wait for //src:big to finish, even though the other small compilation actions have already finished (in <5s).

[screenshot: build profile showing //src:small and //src:small4 blocked until //src:big finishes]

Let me know if I'm understanding this correctly. If so, I'm happy to put up a PR.

@jongerrish
Contributor

Fascinating, great detective work! Want to put up a PR? I can test it against our repo (multiplex worker performance regressed, so I'm super keen to try your fix).

@jdai8
Contributor Author

jdai8 commented Feb 23, 2021

@jongerrish If enabling multiplex workers was the only change at play for you, I think this shouldn't regress from normal persistent workers - at worst you'd be performing the same (doing everything in serial).

That being said, I'll try to have a PR up in the next few days so you can test it out.

@jongerrish
Contributor

Well, for regular persistent workers we allow around 8 max instances, i.e. 8 processes for the KotlinCompile mnemonic, so we'd have 8 parallel kotlinc jobs running at a time without multiplex workers.

@jdai8
Contributor Author

jdai8 commented Feb 23, 2021

Yes, I meant within each persistent worker work would happen serially. There might be some other behavior at play - maybe Bazel tries to schedule more work on a worker if it's known to support multiplexing, or something like that.

@jongerrish
Contributor

Yeah, that makes sense. There should be just a single process for KotlinCompile. I'm not sure whether there's an upper bound on the number of concurrent requests Bazel will schedule at a time, but it makes sense that Bazel would schedule more work on that process.

@jeffzoch
Contributor

@jdai8 how did you generate that graph out of curiosity? Would like to have that tool in my tool belt :)

@jdai8
Contributor Author

jdai8 commented Feb 24, 2021

@jeffzoch take a look at the profiling section in the bazel docs: https://docs.bazel.build/versions/master/skylark/performance.html#performance-profiling. It's definitely a useful tool to have!

Specifically, I ran:

bazel build src:all --worker_verbose --worker_max_instances=1 --worker_quit_after_build  

in my reproducer project. Limiting the number of max instances sends every work request to one worker (forcing us to multiplex).

@restingbull
Collaborator

Welp. I feel stupid. Lemme go fix that -- coroutines are still a bit obtuse for me.

@restingbull
Collaborator

#496

@jdai8
Contributor Author

jdai8 commented Mar 2, 2021

I don't think #496 solves this.

When running in my repro project

bazel build //src:all --worker_verbose --experimental_worker_max_multiplex_instances=KotlinCompile=5

I would expect a profile to look like this:
[screenshot: expected build profile, small actions completing in parallel with the big one]

(I generated this with my fix at #495).

However, with the fix on master, I get something that looks like this, suggesting actions are still running sequentially:
[screenshot: actual build profile on master, actions still running sequentially]

@chancila
Contributor

chancila commented Mar 2, 2021

Reading docs on flows...

This operator retains a sequential nature of flow if changing the context does not call for changing the dispatcher. Otherwise, if changing dispatcher is required, it collects flow emissions in one coroutine that is run using a specified context and emits them from another coroutine with the original collector's context using a channel with a default buffer size between two coroutines similarly to buffer operator, unless buffer operator is explicitly called before or after flowOn, which requests buffering behavior and specifies channel size.

emphasis on retains a sequential nature of flow if changing the context does not call for changing the dispatcher

@jeffzoch
Contributor

jeffzoch commented Mar 3, 2021

Yeah for parallel requests in flow you need to do something like this (I use this helper all the time):

/**
 * suspendingParallelMap
 * @param scope - CoroutineScope
 * @param f - suspending function
 */
fun <A, B> Flow<A>.suspendingParallelMap(scope: CoroutineScope, f: suspend (A) -> B): Flow<B> {
    return flowOn(Dispatchers.IO)
        .map { scope.async { f(it) } }
        .buffer() // default concurrency limit of 64
        .map { it.await() }
}

You can pass a size to the buffer to limit concurrency @restingbull
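Here's a hypothetical self-contained demo of that helper (timings invented; `delay` stands in for real work): the `buffer` lets the `async`s launch eagerly, so five 100 ms tasks overlap and the whole pipeline finishes in roughly the time of one task rather than five.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import kotlin.system.measureTimeMillis

fun <A, B> Flow<A>.suspendingParallelMap(scope: CoroutineScope, f: suspend (A) -> B): Flow<B> =
    flowOn(Dispatchers.IO)
        .map { scope.async { f(it) } } // start each task eagerly
        .buffer()                      // default channel size of 64 bounds in-flight tasks
        .map { it.await() }            // await in order, preserving emission order

fun main() = runBlocking {
    val elapsed = measureTimeMillis {
        (1..5).asFlow()
            .suspendingParallelMap(this) { id -> delay(100); id * 2 }
            .collect { }
    }
    // Tasks overlap, so total time is well under the 500 ms a serial map would take.
    println(elapsed < 450)
}
```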

@restingbull
Collaborator

Second pass -- #498

Thanks, @jeffzoch. I sorted through the options and it appears coroutines aren't quite up to scaling worker-pool primitives yet -- so your helper is by far the cleanest option.

The alternatives are to either:

  • Go to streams
  • Build a channel fan-out, fan-in with a scaling worker pool.
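The second alternative could look roughly like this (a hypothetical sketch, all names and timings invented; `delay` stands in for compilation): a fixed pool of workers fans out over one request channel and fans results into one response channel, which a single consumer drains.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

fun main() = runBlocking {
    val requests = Channel<Int>()
    val responses = Channel<Int>()

    // Fan-out: a fixed pool of workers shares the request channel.
    val workers = List(4) {
        launch {
            for (req in requests) {
                delay(50)                // stands in for compilation work
                responses.send(req * 2)  // fan-in: all workers share one output channel
            }
        }
    }

    launch {
        (1..8).forEach { requests.send(it) }
        requests.close()
    }

    // Single consumer drains responses, mirroring the one stdout writer a worker needs.
    val results = mutableListOf<Int>()
    repeat(8) { results += responses.receive() }
    workers.forEach { it.join() }
    println(results.sorted())
}
```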

@jeffzoch
Contributor

jeffzoch commented Mar 8, 2021

Personally I find streams (assuming you are referring to Java 8 Streams) lackluster for this kind of thing, since you don't get much control over the executor: controlling the level of concurrency per task isn't straightforward, and streams tend to shine on CPU-intensive work (since by default they run on the ForkJoinPool). Channels are a good approach too, not too dissimilar from Flow. Some of the channel API is getting deprecated, though, as Flow gains more features, and I find working with Flow more pleasant. YMMV.

@jdai8
Contributor Author

jdai8 commented Mar 10, 2021

While #498 parallelizes the compilation, it still sequences writing the work responses to stdout, since we're serially await'ing each async result.

I'm seeing this when enabling the multiplex flag:

[screenshot: build profile with the multiplex flag, small actions blocked on writing until //src:big finishes]

The small actions compile in parallel, but they're blocked on writing to stdout until the big action finishes. Once the big action finishes, they all write very quickly. This is slightly better than the previous profile, where there was still a compilation delay before each subsequent small action finished.

Since the underlying work here is thread-blocking, I'm not sure what value coroutines and flow offer - it seems to be tripping us up more than it's helping. Wouldn't it be simpler just to use threads? This is what Bazel does for Java compilation.
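For comparison, a hypothetical plain-threads sketch of that approach (all names invented): a fixed pool runs the blocking work, and each task writes its response as soon as it finishes, with writes synchronized so they don't interleave on stdout.

```kotlin
import java.util.concurrent.Executors

fun main() {
    val pool = Executors.newFixedThreadPool(4)
    val stdoutLock = Any()
    val tasks = (1..5).map { id ->
        pool.submit {
            Thread.sleep(100)          // stands in for blocking kotlinc work
            synchronized(stdoutLock) { // one writer at a time, like writeDelimitedTo
                println("response $id")
            }
        }
    }
    tasks.forEach { it.get() } // wait for all tasks; responses appear in completion order
    pool.shutdown()
}
```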

@jeffzoch
Contributor

@jdai8 are you referring to this https://github.com/restingbull/rules_kotlin/blob/6998aba9ee01198e04a47d39302939ecbd7fda34/src/main/kotlin/io/bazel/worker/PersistentWorker.kt#L115-L122 being serial? If so, that would make sense - collection is done serially. If we want this part to be parallelized, we should apply the same parallelMap'ing strategy I outlined above to writing to stdout (and then we can just call collect() at the very end if we don't need any output).

I am also a bit confused by the usage of the private ThreadAwareDispatcher, but I'm also not intimately familiar with exactly how this compilation works. Normally you can just use Dispatchers.IO as your dispatcher for this kind of work and call it a day, but again I admit I'm not familiar with how this code is called.
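A hypothetical sketch of that strategy using channelFlow (names and timings invented; `delay` stands in for compilation): each task sends its response as soon as it completes, so responses are emitted in completion order rather than request order, and a fast task is never stuck behind a slow one at the write step.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

fun main() = runBlocking {
    val done = channelFlow {
        listOf(300L to "big", 50L to "small1", 100L to "small2").forEach { (ms, name) ->
            launch {
                delay(ms)  // stands in for compilation work
                send(name) // emit in completion order, not request order
            }
        }
    }.toList()
    // The small tasks finish (and would be written out) before the big one.
    println(done)
}
```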

@jeffzoch
Contributor

jeffzoch commented Mar 11, 2021

@jdai8 mind trying #501?

@restingbull this PR is just for sharing ideas - but I was wondering if things would still work (performantly) by tweaking the persistent worker a bit

[screenshot: build profile with #501, all actions overlapping]
I am getting the profile above with #501, which I think is what we want (ran bazel build src:all --worker_verbose --worker_max_instances=1 --worker_quit_after_build --profile=profile.gz)

@jdai8
Contributor Author

jdai8 commented Mar 11, 2021

Thanks @jeffzoch! Performance-wise, #501 looks good to me 👍

I still think using threads directly (as in #495) is a simpler implementation, since it doesn't look like we're using any coroutine/flow-specific features. I'll leave it up to @restingbull and others though.

@jeffzoch
Contributor

Given that #501 was merged can we close this?

@jdai8
Contributor Author

jdai8 commented Mar 16, 2021

Yeah. Thanks!

@jdai8 jdai8 closed this as completed Mar 16, 2021