
Job is activated but is not received by job worker (intermittent issue) #177

Closed
jetdream opened this issue Sep 15, 2020 · 6 comments · Fixed by #179

Comments

@jetdream

jetdream commented Sep 15, 2020

I have an issue that sounds exactly like the one described here: camunda/camunda#3585

Logs show that jobs are created/activated, but the Node.js batch worker does not receive the jobs in ~1% of cases. All the cases detected have happened under high load, when thousands of workflow instances were created.

I increased the log verbosity to the maximum level and can see that the worker does not receive those jobs; it simply skips them.

The last time I detected this issue, I had started around 2000 instances, and the first activity in the workflow (which is the service task) did not receive 12 jobs.
From what I discovered, I think all the skipped jobs belong to a single batch: the exported record positions (jobs activated) are very close to each other (differences of 4 to 8):

  • 2443938
  • 2443930
  • 2443886
  • 2443882
    ...

I see two possible reasons:

  • the broker does not send the batch to the client
  • the client ignores the received batch

In the general case I would also suggest that a network interruption could produce a situation where the broker thinks the jobs have been sent but the client never actually receives them; in my case, however, this is impossible since the broker and the client are on the same server.

I tried to call zbc.completeJob() for those jobs, and the broker successfully processed it and continued workflow execution. That means the broker considers the job to have already been taken by a worker.
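
For reference, a minimal, hedged example of that manual completion with zeebe-node; the job key is a placeholder, and it assumes an existing ZBClient instance named zbc:

    // Hedged sketch: assumes an existing zeebe-node ZBClient instance `zbc`;
    // the job key is a placeholder for a key taken from the exported records.
    await zbc.completeJob({
        jobKey: '2251799813685249', // the activated-but-never-received job
        variables: {},              // nothing needed just to move the flow forward
    })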


My application:

  • zeebe-node 0.23.2
  • zeebe 0.24.1
  • A single Zeebe node, 10 CPU + 10 I/O threads, 10 partitions.

I have very long-running tasks (up to months, or even years), so I cannot wait for the job timeout.
I use a batch worker; all the jobs are forwarded to an external system.
Worker config:

    maxJobsToActivate: 200,
    jobBatchMinSize: 32,
    jobBatchMaxTime: 3,
    timeout: Duration.days.of(365), // yep, this is 1 year
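
For context, a minimal sketch of how such a worker might be wired up with zeebe-node's createBatchWorker; the task type, gateway address, and forwarding helper below are hypothetical placeholders, not taken from the report:

    // Minimal sketch, assuming zeebe-node's ZBClient / createBatchWorker API.
    // taskType, the gateway address, and forwardToExternalSystem() are hypothetical.
    import { ZBClient, Duration } from 'zeebe-node'

    const zbc = new ZBClient('localhost:26500')

    zbc.createBatchWorker({
        taskType: 'forward-to-external-system', // hypothetical task type
        maxJobsToActivate: 200,
        jobBatchMinSize: 32,                    // batch at least 32 jobs...
        jobBatchMaxTime: 3,                     // ...or stop collecting after this long
        timeout: Duration.days.of(365),         // jobs may stay open for up to a year
        taskHandler: async jobs => {
            // Forward the whole batch to the external system; the jobs are
            // completed later, out of band, via zbc.completeJob().
            for (const job of jobs) {
                await forwardToExternalSystem(job) // hypothetical helper
            }
        },
    })
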
@jetdream
Author

jetdream commented Sep 16, 2020

From what I see, the Node.js Zeebe client implementation relies heavily on jobs being timed out by the server when something goes wrong.
In my case (when jobs are, by design, really long-running) this approach does not work.

I see two places where jobs can be lost:

  1. The long-polling gRPC connection times out while jobs are actually being transmitted from the server to the client.
    I think this is my case. At least it explains why it happens so rarely and leaves no log traces of the lost jobs.

  2. The client is closing:
    https://github.com/zeebe-io/zeebe-client-node-js/blob/b31600b7a3b86692c535488ef90968dd31025ae7/src/lib/ZBWorkerBase.ts#L503-L507

I'm actually surprised that Zeebe does not use an acknowledgment-based approach for jobs to guarantee that they are actually delivered to job handlers.

If this issue is not fixable in a reasonable time, I see a workaround (sketched below):
I can detect the lost jobs by comparing the exported JOB ACTIVATED records with the jobs actually received and processed by the job handlers. When an activated job is not received within a certain time, I can consider it lost and send a zbc.failJob() request to force the server to retry it.
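
A rough sketch of that workaround follows. The monitoring hooks (onExportedJobActivated, onJobReceived) and the time window are hypothetical names introduced here for illustration; only zbc.failJob() is an actual client call, and the sketch assumes an existing ZBClient instance zbc:

    // Hedged sketch of the workaround; the exporter wiring and helper names are
    // hypothetical. Assumes an existing zeebe-node ZBClient instance `zbc`.
    const activatedAt = new Map<string, number>() // jobKey -> activation timestamp
    const LOST_JOB_WINDOW_MS = 60_000             // tune to your own polling cadence

    // Call this from whatever consumes the exported JOB ACTIVATED records.
    function onExportedJobActivated(jobKey: string) {
        activatedAt.set(jobKey, Date.now())
    }

    // Call this from the batch worker's taskHandler for every job it actually receives.
    function onJobReceived(jobKey: string) {
        activatedAt.delete(jobKey)
    }

    // Periodically fail any job that was activated but never reached a handler,
    // so the broker makes it available for activation again.
    setInterval(async () => {
        for (const [jobKey, ts] of activatedAt) {
            if (Date.now() - ts > LOST_JOB_WINDOW_MS) {
                activatedAt.delete(jobKey)
                await zbc.failJob({
                    jobKey,
                    retries: 1,
                    errorMessage: 'Job activated but never received by the worker',
                })
            }
        }
    }, LOST_JOB_WINDOW_MS)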

@jetdream jetdream changed the title Job is activated but is not received by client (intermittent issue) Job is activated but is not received by job worker (intermittent issue) Sep 16, 2020
@jwulf
Member

jwulf commented Sep 17, 2020

Hi @jetdream, thanks for reporting this.

It is probably not the closing path - that should only happen when you call close on the client or the batch worker. The close handler code is designed for application shutdown and only executes when you call either ZBBatchWorker.close() or ZBClient.close() in your code.

It is used in tests to complete the test - otherwise the polling loop will keep the code running forever.

The timeout of the gRPC long poll is not managed on the client side. It could be a race condition between the batch collection timeout and the handler execution. Let me look into it further.

@jwulf
Member

jwulf commented Sep 18, 2020

I think this is due to a race condition in the batch processing. It passes a copy of the array of jobs for the batch to the handler. It looks like the original array could be updated asynchronously while this is happening. That's my hypothesis.

I've changed "passing a copy of the array of batched jobs" to passing a slice of the array. This means that any jobs added to the batch while the handler is executing will be added to the next batch.
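
For illustration only (this is not the actual ZBWorkerBase code or necessarily the exact change), one way to get that behaviour is to empty the live buffer when handing a batch to the handler, so jobs that arrive while the handler runs accumulate for the next batch:

    // Illustration of the idea, not the real zeebe-node internals.
    class BatchBuffer<Job> {
        private batchedJobs: Job[] = []

        add(job: Job) {
            this.batchedJobs.push(job)
        }

        takeBatch(): Job[] {
            // splice(0) empties the live array and returns its former contents,
            // so any job pushed while the handler is running ends up in the
            // next batch instead of in a snapshot that is never processed.
            return this.batchedJobs.splice(0)
        }
    }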

I will release the 0.24.1 version soon for you to test. It's challenging to reproduce an edge case like that at volume.

@KojanAbzakh

Hi @jwulf, are you sure this is a client issue? We have the exact same issue, but we are using the Go client.

@jwulf
Member

jwulf commented Sep 30, 2020

No, I'm not sure that it is the client. I haven't been able to reproduce it to check.

@jwulf
Member

jwulf commented Oct 23, 2020

Closing this for now. If you still see the issue with 0.25.0 of the client, please reopen.

@jwulf jwulf closed this as completed Oct 23, 2020