
Driver 1.5 and up eventually stops working when put under load #373

Closed
samus42 opened this issue May 16, 2018 · 34 comments

@samus42

samus42 commented May 16, 2018

We have a set of integration tests for the API that sits on top of our Neo4j instance. When we upgraded the driver from 1.3 to 1.6, we started hitting "Connection acquisition timed out in 60000 ms" at a certain point in the test run (stack traces at the end). Once that happened, every following call failed with the same error. The tests passed when run in isolation. I experimented with disabling tests earlier in the run; the run would then get past the original problem areas, only to hit the same issue a little later. So it seems that at a certain point, something in the driver overloads and stops working. Going back to driver 1.5.x gave the same issue (slightly different message). Driver 1.4.1 worked fine and was able to complete the run.

All test runs went against my locally running Neo4j 3.2.9 Enterprise database, which operates fine during the test run.

We are running on node 8.9.4, npm 6.0.1.

Stack trace for 1.6.x driver:
message=Connection acquisition timed out in 60000 ms., stack=Neo4jError: Connection acquisition timed out in 60000 ms.

at captureStacktrace (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/result.js:199:15)
at new Result (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/result.js:72:19)
at _runPullAll (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:331:10)
at Object.rollback (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:247:24)
at Transaction.rollback (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:135:35)
at Transaction._onError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:159:21)
at _TransactionStreamObserver.onError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:219:18)
at /Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:77:29
at :null:null

Stack trace for 1.5.x driver:
message=Operation timed out in 60000 ms., stack=Neo4jError: Operation timed out in 60000 ms.
at new Neo4jError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/error.js:76:132)
at newError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/error.js:66:10)
at Timeout._onTimeout (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/internal/util.js:153:34)
at ontimeout (timers.js:475:11)
at tryOnTimeout (timers.js:310:5)
at Timer.listOnTimeout (timers.js:270:5)

@lutovich
Contributor

Hi @samus42,

Starting from 1.5, the driver has a hard limit on the number of connections in the connection pool. This limit was introduced so that the driver does not overload the database with an excessive number of connections; see this PR for more details. The limit is configurable, with a default value of 100. The driver will not allow clients to obtain/create more than 100 connections; further acquire operations wait until previously acquired connections are returned to the pool.

Connections are acquired from the pool when queries run via Session#run() and when a transaction is started via Session#beginTransaction(). Connections are returned to the pool when the result of Session#run() is fully consumed, when a transaction is closed via Transaction#commit() / Transaction#rollback(), or when the session is closed via Session#close().

I think the problem you are seeing happens when an application does not close sessions or commit/roll back transactions. Eventually the max pool limit is reached and no more connections can be acquired. Could you please share a reproducer app or the code that interacts with the driver? It would also be valuable to know whether you use explicit transactions via Session#beginTransaction().
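The exhaustion mechanism described above can be sketched in a few lines of plain Node. This is a simplified illustration of a bounded pool with an acquisition timeout, not the driver's actual pool implementation:

```javascript
// Simplified sketch of a bounded connection pool with an acquisition
// timeout. Illustrative only -- NOT the neo4j-driver's real pool code.
class BoundedPool {
  constructor(maxSize, acquisitionTimeoutMs) {
    this.maxSize = maxSize;
    this.acquisitionTimeoutMs = acquisitionTimeoutMs;
    this.inUse = 0;      // connections currently handed out
    this.waiters = [];   // pending acquire() calls
  }

  acquire() {
    if (this.inUse < this.maxSize) {
      this.inUse++;
      return Promise.resolve({}); // hand out a "connection"
    }
    // Pool exhausted: wait for a release, or time out.
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.waiters = this.waiters.filter(w => w !== waiter);
        reject(new Error(
          `Connection acquisition timed out in ${this.acquisitionTimeoutMs} ms.`));
      }, this.acquisitionTimeoutMs);
      const waiter = () => {
        clearTimeout(timer);
        this.inUse++;
        resolve({});
      };
      this.waiters.push(waiter);
    });
  }

  release() {
    this.inUse--;
    const waiter = this.waiters.shift();
    if (waiter) waiter(); // hand the freed slot to the oldest waiter
  }
}
```

If callers acquire but never release (for example by leaving a session open), every later acquire() ends up in the timeout branch, which matches the "Connection acquisition timed out" error above.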

@samus42
Author

samus42 commented May 18, 2018

I know I went through everything and made sure the sessions were closed; I'll check whether we have transactions as well.

@samus42
Author

samus42 commented May 18, 2018

So we're doing both session.writeTransaction and session.beginTransaction. But in each case they're followed by a session.close(). I'll go through the code base again and make sure every new session is closed, but I'm pretty sure I accounted for everything. Is there anything else we need to do for transactions?

@lutovich
Contributor

@samus42 transactions used in Session#readTransaction() and Session#writeTransaction() are automatically committed or rolled back by the driver; they do not require any explicit management. Only explicit transactions created with Session#beginTransaction() have to be explicitly committed or rolled back via Transaction#commit() or Transaction#rollback().

We will investigate this further to make sure there are no problems with the connection pool.
Do your integration tests ever issue queries or transactions that are expected to fail?
Does your application already use a logging framework? Logging all acquire and release operations in the connection pool might be helpful here. The driver does not have logging capabilities right now, but it is probably worth adding some.
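The lifecycle difference described above can be illustrated with small stand-ins for the driver's classes. StubSession and StubTransaction below are illustrative stubs that count held "connections", not the driver's real Session/Transaction:

```javascript
// Illustrative stubs, not the real driver classes: `held` counts how
// many "connections" are outstanding, to show which style leaks when
// the caller forgets to finish the transaction.
let held = 0;

class StubTransaction {
  commit() { held--; return Promise.resolve(); }
  rollback() { held--; return Promise.resolve(); }
}

class StubSession {
  // Transaction-function style: the driver commits or rolls back for you.
  writeTransaction(work) {
    held++;
    const tx = new StubTransaction();
    return Promise.resolve()
      .then(() => work(tx))
      .then(
        res => tx.commit().then(() => res),
        err => tx.rollback().then(() => { throw err; })
      );
  }

  // Explicit style: the caller must call commit() or rollback() itself.
  beginTransaction() {
    held++;
    return new StubTransaction();
  }
}
```

With writeTransaction, held returns to 0 even when the work function throws; with beginTransaction, held stays at 1 until the caller explicitly commits or rolls back.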

@samus42
Author

samus42 commented May 25, 2018

We do have some integration tests that cause query failures on purpose, but not many. I'm going through and making sure we have catches on everything now. For logging, we just use a winston logger.

@samus42
Author

samus42 commented May 25, 2018

We made sure every new session is closed, even on error, and the tests are breaking in the same place in the same way.

@lutovich
Contributor

Hi @samus42,

We have added configurable logging in the driver: #380. When enabled at DEBUG level, it shows when a connection is acquired from and released to the connection pool, and every IO operation that goes through the connection. It would be very helpful if you could enable logging, run the tests again, and attach the output to this issue!

Logging is available in the pre-release version 1.7.0-alpha01 which can be installed with npm install --save neo4j-driver@next. Debug logging can be enabled in the driver configuration:

const config = {
  logger: neo4j.logging.console('debug')
};

const driver = neo4j.driver(
  'bolt://localhost',
  neo4j.auth.basic('neo4j', 'password'),
  config
);

This will make the driver log to the console. It is also possible to plug in a custom logging function:

const config = {
  logging: {
    level: 'debug',
    logger: (level, message) => console.log('+++' + level + ' ' + message) // custom function here
  }
};

const driver = neo4j.driver(
  'bolt://localhost',
  neo4j.auth.basic('neo4j', 'password'),
  config
);

which can be used to prefix each log message with a marker string for easier grepping, or to write all log messages to a file.

I suspect logging will help us see whether any connections are not returned to the pool and what kind of operations they execute. It could also show whether there are any problems with the connection pool itself.

Looking forward to your reply!

@lutovich
Contributor

Hi @samus42,

Did you have a chance to run the tests with logging turned on?

@regevbr

regevbr commented Jul 16, 2018

I seem to have a similar issue (1.6.2 and 1.5.*): after a lot of load on the driver, some queries are never run, leaving stale promises stuck. The timeout on the connection acquisition from the pool doesn't seem to help, and I don't know why. The issue only happens in production under heavy load. I scanned a lot of your code and can't find the culprit. The pool module seems fine to me. I suspected the queries got stuck on the Neo4j server, but there is no indication of queries running for more than 10 seconds, while the stuck queries hang for more than 3 minutes, possibly never being released (I installed my own timeout on the transaction to stop it). So my belief is that either the request gets stuck before the pool module is used, or an unexpected error occurs during pool operations that doesn't reject the promise properly...

@lutovich
Contributor

Hi @regevbr,

Have you seen any messages about unhandled promise rejections when this happened? Could you please share code that queries Neo4j using the driver?

@lutovich
Contributor

Hi @samus42,

Do you have any updates about this issue? It would be extremely helpful to see logging from a test run that reproduces the issue.

@regevbr

regevbr commented Jul 16, 2018

Nope no unhandled rejections.

const pTimeout = require('promise-timeout');
const timeunit = require('timeunit');
const TIMEOUT = timeunit.seconds.toMillis(150);
const neo4j = require('neo4j-driver').v1;
const driver = neo4j.driver(config.endpoint, neo4j.auth.basic(config.user, config.password), {
  maxConnectionLifetime: 20 * 60 * 1000,
  connectionTimeout: 1000 * 45,
  connectionAcquisitionTimeout: 60000,
  maxTransactionRetryTime: 10000,
  connectionPoolSize: 300,
});
function runTransaction(query, params, isWrite) {
  const session = driver.session();
  const transactionStart = isWrite ? session.writeTransaction.bind(session) : session.readTransaction.bind(session);
  const transactionAction = transaction => {
    return transaction.run(query, params);
  };
  const transactionPromise = transactionStart(transactionAction);
  return pTimeout.timeout(transactionPromise, TIMEOUT)
    .then(res => {
      session.close();
      return res;
    })
    .catch(err => {
      session.close();
      throw err;
    });
}

@lutovich
Contributor

The provided code looks correct to me; thanks for sharing. We will investigate whether this issue is related to Session#readTransaction() and Session#writeTransaction().

@regevbr

regevbr commented Jul 17, 2018

By the way, after I increased the pool size from 100 to 300 and decreased the retry timeout to 10 seconds, the number of failures dropped dramatically (but failures still occur), so my best guess is that it is related to either the pool module or the exponential back-off mechanism...

@lutovich
Contributor

Is it true that when queries get stuck you do not see them in the output of the dbms.listQueries() or dbms.listTransactions() procedures?

@regevbr

regevbr commented Jul 17, 2018

@lutovich when I examine the query logs of the Neo4j instance, there are no stuck queries, nor queries that run for more than 10 seconds...

@samus42
Author

samus42 commented Jul 17, 2018

@lutovich sorry, I've been out of the country; I'll try to get to it this week.

@samus42
Author

samus42 commented Jul 19, 2018

@lutovich I have the log file; it's about 30 MB. How do you want me to get it to you? In my scan through it, the connection numbers go up, but I see the previous number released to the pool.

@lutovich
Contributor

@samus42 great! Could you please archive it and send it to konstantin dot lutovich at neo4j dot com? Alternatively, it can be uploaded to this Google Drive. Thank you!

@lutovich
Contributor

lutovich commented Jul 20, 2018

Found an issue with transaction functions that might cause connections not to be returned to the pool; PR is here: #397. It might not be the definitive fix for this problem, so having logs would still be very useful.

@samus42
Author

samus42 commented Jul 20, 2018

@lutovich uploaded to your google drive, hope it helps!

@lutovich
Contributor

Thank you! I can see the file. We will investigate and update this issue.

@regevbr

regevbr commented Jul 20, 2018

@lutovich can you please release a new 1.6 patch so I can see if the fix helps?

@lutovich
Contributor

@samus42 I can see 100 connections in the log that were acquired but never returned to the pool. 100 is the default max pool limit starting from version 1.5; no upper limit was enforced before 1.5. All of those connections were acquired to execute very similar transactions. Each starts with the query match (g:Group {id: {groupId}})-[r:APPLIES_TO]->() delete r. This query appears in the log 101 times, and the connection is never returned to the pool after such a transaction is committed. Could you please search the codebase for that query and share the code that executes the corresponding transaction?

The problem might happen because the driver only releases a connection back to the pool when user code explicitly consumes the result of Transaction#commit(). This result can either be used as a Promise (with then/catch) or subscribed to. It would be great to see the code to validate this theory.

@regevbr we will probably release a 1.6 patch this week; we just want to be more confident that it actually contains the fix.

@regevbr

regevbr commented Jul 23, 2018 via email

lutovich added a commit to lutovich/neo4j-javascript-driver that referenced this issue Jul 23, 2018
@regevbr

regevbr commented Jul 23, 2018

@lutovich if that is indeed the case, then why doesn't the wait timeout for a connection from the pool take effect? It all just hangs there forever, which doesn't make sense.

@lutovich
Contributor

connectionAcquisitionTimeout should fire after 60 seconds for new acquisition attempts when the pool is at maximum capacity. That is the behavior reported by @samus42.

Your problem might be caused by something else; I'm not yet sure. Is it an option for you to use the alpha version of the 1.7 driver with logging enabled?

@samus42
Author

samus42 commented Jul 27, 2018

I found the code that query pertains to, and I'm digging through it now. It does a transaction = session.beginTransaction(), then later does a transaction.commit() or transaction.rollback(). What's interesting is that session.close() is only called on an error. I must have missed that on my last couple of passes (this code is more complex than most because we check for the creation of cycles). I'll throw a session.close() in there and try again.
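A generic way to guarantee close-on-all-paths is a try/finally wrapper. This is a sketch with a hypothetical withSession helper, not the project's actual code; driver here is anything whose session() returns an object with close():

```javascript
// Hypothetical helper: run some work with a session and always close it,
// whether the work resolves or rejects.
async function withSession(driver, work) {
  const session = driver.session();
  try {
    return await work(session);
  } finally {
    session.close(); // runs on success AND on error
  }
}
```

Routing every query through a wrapper like this removes the "close() only on the error path" class of leak, because the finally block runs regardless of how the work finishes.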

@samus42
Author

samus42 commented Jul 27, 2018

OK, putting that last session.close() in allowed for a clean run! Thanks for helping track that down.

@lutovich
Contributor

@samus42 I'm glad it works now!

@regevbr version 1.6.3 with the mentioned fix for transaction functions is now released. Could you please try it out?

@regevbr

regevbr commented Jul 31, 2018

@lutovich perfect, I'll update to the new version now.
Though I must say, the ugly workaround that fixed the issue for me over the past weeks was to destroy and recreate the driver once an hour to ensure the queue does not get stuck haha :-)

@regevbr

regevbr commented Jul 31, 2018

Looks good for now; let's give it 24 hours to see if it sticks...

@zhenlineo
Contributor

@regevbr Do you have any more updates on this issue? Have you hit any problems after 24 * 8 hours so far?

@regevbr

regevbr commented Aug 8, 2018

Sorry, it works perfectly! Thanks a lot to everyone.
