
Driver 1.5 and up eventually stops working when put under load #373

Closed
samus42 opened this issue May 16, 2018 · 34 comments

@samus42

samus42 commented May 16, 2018

We have a set of integration tests for the API that sits on top of our Neo4j instance. When we upgraded the driver from 1.3 to 1.6, we started hitting "Connection acquisition timed out in 60000 ms" at a certain point in the test run (stack traces at the end). Once that happened, every following call failed with the same error. The tests passed when run in isolation. I experimented with disabling tests earlier in the run; the run would then get past the original problem areas, only to hit the same issue a little later. So it seems that at a certain point, something in the driver overloads and stops working. Going back to driver 1.5.x gave the same issue (slightly different message). Driver 1.4.1 worked fine and was able to complete the run.

All test runs went against my locally running Neo4j 3.2.9 Enterprise database, which operates fine during the test run.

We are running on node 8.9.4, npm 6.0.1.

Stack trace for 1.6.x driver:
message=Connection acquisition timed out in 60000 ms., stack=Neo4jError: Connection acquisition timed out in 60000 ms.

at captureStacktrace (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/result.js:199:15)
at new Result (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/result.js:72:19)
at _runPullAll (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:331:10)
at Object.rollback (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:247:24)
at Transaction.rollback (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:135:35)
at Transaction._onError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:159:21)
at _TransactionStreamObserver.onError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:219:18)
at /Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:77:29
at :null:null

Stack trace for 1.5.x driver:
message=Operation timed out in 60000 ms., stack=Neo4jError: Operation timed out in 60000 ms.
at new Neo4jError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/error.js:76:132)
at newError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/error.js:66:10)
at Timeout._onTimeout (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/internal/util.js:153:34)
at ontimeout (timers.js:475:11)
at tryOnTimeout (timers.js:310:5)
at Timer.listOnTimeout (timers.js:270:5)

@lutovich
Contributor

Hi @samus42,

Starting from 1.5, the driver has a hard limit on the number of connections in the connection pool. This limit was introduced so that the driver does not overload the database with an excessive number of connections; see this PR for more details. The limit is configurable, with a default value of 100. The driver will not allow clients to obtain/create more than 100 connections; further acquire operations wait until previously acquired connections are returned to the pool.

Connections are acquired from the pool when queries run via Session#run() and when a transaction is started via Session#beginTransaction(). Connections are returned to the pool when the result of Session#run() is fully consumed, when a transaction is closed via Transaction#commit() / Transaction#rollback(), or when the session is closed via Session#close().

I think the problem you are seeing happens when an application does not close sessions or commit/roll back transactions. Eventually the max pool limit is reached and no more connections can be acquired. Could you please share a reproducer app or the code that interacts with the driver? It would also be valuable to know whether you use explicit transactions via Session#beginTransaction().
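The exhaustion mechanism described above can be sketched in a few lines of plain Node. This is a simplified illustration of a bounded pool with an acquisition timeout, not the driver's actual pool implementation:

```javascript
// Simplified sketch of a bounded connection pool with an acquisition
// timeout. Illustrative only -- NOT the neo4j-driver's real pool code.
class BoundedPool {
  constructor(maxSize, acquisitionTimeoutMs) {
    this.maxSize = maxSize;
    this.acquisitionTimeoutMs = acquisitionTimeoutMs;
    this.inUse = 0;      // connections currently handed out
    this.waiters = [];   // pending acquire() calls
  }

  acquire() {
    if (this.inUse < this.maxSize) {
      this.inUse++;
      return Promise.resolve({}); // hand out a "connection"
    }
    // Pool exhausted: wait for a release, or time out.
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.waiters = this.waiters.filter(w => w !== waiter);
        reject(new Error(
          `Connection acquisition timed out in ${this.acquisitionTimeoutMs} ms.`));
      }, this.acquisitionTimeoutMs);
      const waiter = () => {
        clearTimeout(timer);
        this.inUse++;
        resolve({});
      };
      this.waiters.push(waiter);
    });
  }

  release() {
    this.inUse--;
    const waiter = this.waiters.shift();
    if (waiter) waiter(); // hand the freed slot to the oldest waiter
  }
}
```

If callers acquire but never release (for example by leaving a session open), every later acquire() ends up in the timeout branch, which matches the "Connection acquisition timed out" error above.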

@samus42
Author

samus42 commented May 18, 2018

I know I went through everything and made sure the sessions were closed; I'll check whether we have transactions as well.

@samus42
Author

samus42 commented May 18, 2018

So we're doing both session.writeTransaction and session.beginTransaction. But in each case they're followed by a session.close(). I'll go through the code base again and make sure every new session is closed, but I'm pretty sure I accounted for everything. Is there anything else we need to do for transactions?

@lutovich
Contributor

@samus42 transactions used in Session#readTransaction() and Session#writeTransaction() are automatically committed or rolled back by the driver; they do not require any explicit management. Only explicit transactions created with Session#beginTransaction() have to be explicitly committed or rolled back via Transaction#commit() or Transaction#rollback().

We will investigate this further to make sure there are no problems with the connection pool.
Do your integration tests ever issue queries or transactions that are expected to fail?
Does your application already use a logging framework? Logging all acquire and release operations in the connection pool might be helpful here. The driver does not have logging capabilities right now, but it is probably worth adding some.
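The lifecycle difference described above can be illustrated with small stand-ins for the driver's classes. StubSession and StubTransaction below are illustrative stubs that count held "connections", not the driver's real Session/Transaction:

```javascript
// Illustrative stubs, not the real driver classes: `held` counts how
// many "connections" are outstanding, to show which style leaks when
// the caller forgets to finish the transaction.
let held = 0;

class StubTransaction {
  commit() { held--; return Promise.resolve(); }
  rollback() { held--; return Promise.resolve(); }
}

class StubSession {
  // Transaction-function style: the driver commits or rolls back for you.
  writeTransaction(work) {
    held++;
    const tx = new StubTransaction();
    return Promise.resolve()
      .then(() => work(tx))
      .then(
        res => tx.commit().then(() => res),
        err => tx.rollback().then(() => { throw err; })
      );
  }

  // Explicit style: the caller must call commit() or rollback() itself.
  beginTransaction() {
    held++;
    return new StubTransaction();
  }
}
```

With writeTransaction, held returns to 0 even when the work function throws; with beginTransaction, held stays at 1 until the caller explicitly commits or rolls back.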

@samus42
Author

samus42 commented May 25, 2018

We do have some integration tests that cause query failures on purpose, but not many. I'm going through and making sure we have catches on everything now. For logging, we just use a winston logger.

@samus42
Author

samus42 commented May 25, 2018

We made sure every new session is closed, even on error, and the tests are breaking in the same place in the same way.

@lutovich
Contributor

Hi @samus42,

We have added configurable logging in the driver: #380. When enabled at DEBUG level, it shows when a connection is acquired from and released to the connection pool, and every IO operation that goes through the connection. It would be very helpful if you could enable logging, run the tests again, and attach the output to this issue!

Logging is available in the pre-release version 1.7.0-alpha01 which can be installed with npm install --save neo4j-driver@next. Debug logging can be enabled in the driver configuration:

const config = {
  logger: neo4j.logging.console('debug')
};

const driver = neo4j.driver(
  'bolt://localhost',
  neo4j.auth.basic('neo4j', 'password'),
  config
);

This will make the driver log to the console. It is also possible to plug in a custom logging function:

const config = {
  logging: {
    level: 'debug',
    logger: (level, message) => console.log('+++' + level + ' ' + message) // custom function here
  }
};

const driver = neo4j.driver(
  'bolt://localhost',
  neo4j.auth.basic('neo4j', 'password'),
  config
);

which can be used to prefix each log message with a marker string for easier grepping, or to write all log messages to a file.

I suspect logging will help us see whether any connections are not returned to the pool and what kind of operations they execute. It could also show whether there are any problems with the connection pool itself.

Looking forward to your reply!

@lutovich
Contributor

Hi @samus42,

Did you have a chance to run the tests with logging turned on?

@regevbr

regevbr commented Jul 16, 2018

I seem to have a similar issue (1.6.2 and 1.5.*): after a lot of load on the driver, some queries are never run, leaving stale promises stuck. The timeout on the connection acquisition from the pool doesn't seem to help, and I don't know why. The issue only happens in production under heavy load. I scanned a lot of your code and can't find the culprit. The pool module seems fine to me. I suspected the queries got stuck on the Neo4j server, but there is no indication of queries running for more than 10 seconds, while the stuck queries hang for more than 3 minutes, possibly never being released (I installed my own timeout on the transaction to stop it). So my belief is that either the request gets stuck before the pool module is used, or an unexpected error occurs during pool operations that doesn't reject the promise properly...

@lutovich
Contributor

Hi @regevbr,

Have you seen any messages about unhandled promise rejections when this happened? Could you please share code that queries Neo4j using the driver?

@lutovich
Contributor

Hi @samus42,

Do you have any updates about this issue? It would be extremely helpful to see logging from a test run that reproduces the issue.

@regevbr

regevbr commented Jul 16, 2018

Nope no unhandled rejections.

const pTimeout = require('promise-timeout');
const timeunit = require('timeunit');
const TIMEOUT = timeunit.seconds.toMillis(150);
const neo4j = require('neo4j-driver').v1;
const driver = neo4j.driver(config.endpoint, neo4j.auth.basic(config.user, config.password), {
  maxConnectionLifetime: 20 * 60 * 1000,
  connectionTimeout: 1000 * 45,
  connectionAcquisitionTimeout: 60000,
  maxTransactionRetryTime: 10000,
  connectionPoolSize: 300,
});
function runTransaction(query, params, isWrite) {
  const session = driver.session();
  const transactionStart = isWrite ? session.writeTransaction.bind(session) : session.readTransaction.bind(session);
  const transactionAction = transaction => {
    return transaction.run(query, params);
  };
  const transactionPromise = transactionStart(transactionAction);
  return pTimeout.timeout(transactionPromise, TIMEOUT)
    .then(res => {
      session.close();
      return res;
    })
    .catch(err => {
      session.close();
      throw err;
    });
}

@lutovich
Contributor

The provided code looks correct to me; thanks for sharing. We will investigate whether this issue is related to Session#readTransaction() and Session#writeTransaction().

@regevbr

regevbr commented Jul 17, 2018

By the way, after I increased the pool size from 100 to 300 and decreased the retry timeout to 10 seconds, the number of failures dropped dramatically (but failures still occur), so my best guess is that it is related to either the pool module or the exponential back-off mechanism...

@lutovich
Contributor

Is it true that when queries get stuck you do not see them in the output of the dbms.listQueries() or dbms.listTransactions() procedures?

@regevbr

regevbr commented Jul 17, 2018

@lutovich when I examine the query logs of the Neo4j instance, there are no stuck queries, nor queries that run for more than 10 seconds...

@samus42
Author

samus42 commented Jul 17, 2018

@lutovich sorry, I've been out of the country; I'll try to get to it this week.

@samus42
Author

samus42 commented Jul 19, 2018

@lutovich I have the log file; it's about 30 MB. How do you want me to get it to you? In my scan through it, the connection numbers go up, but I see the previous number released to the pool.

@lutovich
Contributor

@samus42 great! Could you please archive it and send it to konstantin dot lutovich at neo4j dot com? Alternatively, it can be uploaded to this Google Drive. Thank you!

@lutovich
Contributor

lutovich commented Jul 20, 2018

Found an issue with transaction functions that might cause connections not to be returned to the pool; PR is here: #397. It might not be the definitive fix for this problem, so having logs would still be very useful.

@samus42
Author

samus42 commented Jul 20, 2018

@lutovich uploaded to your google drive, hope it helps!

@lutovich
Contributor

Thank you! I can see the file. We will investigate and update this issue.

@regevbr

regevbr commented Jul 20, 2018

@lutovich can you please release a new 1.6 patch so I can see if the fix helps?

@lutovich
Contributor

@samus42 I can see 100 connections in the log that were acquired but never returned to the pool. 100 is the default max pool limit starting from version 1.5; no upper limit was enforced before 1.5. All of those connections were acquired to execute very similar transactions. Each starts with the query match (g:Group {id: {groupId}})-[r:APPLIES_TO]->() delete r. This query appears in the log 101 times, and the connection is never returned to the pool after such a transaction is committed. Could you please search the codebase for that query and share the code that executes the corresponding transaction?

The problem might happen because the driver only releases a connection back to the pool when user code explicitly consumes the result of Transaction#commit(). This result can either be used as a Promise (with then/catch) or subscribed to. It would be great to see the code to validate this theory.

@regevbr we will probably release a 1.6 patch this week; we just want to be more confident that it actually contains the fix.

@regevbr

regevbr commented Jul 23, 2018 via email

lutovich added a commit to lutovich/neo4j-javascript-driver that referenced this issue Jul 23, 2018
@regevbr

regevbr commented Jul 23, 2018

@lutovich if that is indeed the case, then why doesn't the wait timeout for a connection from the pool take effect? It all just hangs there forever, which doesn't make sense.

@lutovich
Contributor

connectionAcquisitionTimeout should fire after 60 seconds for new acquisition attempts when the pool is at maximum capacity. That is the behavior reported by @samus42.

Your problem might be caused by something else; I'm not yet sure. Is it an option for you to use the alpha version of the 1.7 driver with logging enabled?

@samus42
Author

samus42 commented Jul 27, 2018

I found the code that query pertains to, and I'm digging through it now. It does a transaction = session.beginTransaction(), then later does a transaction.commit() or transaction.rollback(). What's interesting is that session.close() is only called on an error. I must have missed that on my last couple of passes (this code is more complex than most because we check for the creation of cycles). I'll throw a session.close() in there and try again.
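A generic way to guarantee close-on-all-paths is a try/finally wrapper. This is a sketch with a hypothetical withSession helper, not the project's actual code; driver here is anything whose session() returns an object with close():

```javascript
// Hypothetical helper: run some work with a session and always close it,
// whether the work resolves or rejects.
async function withSession(driver, work) {
  const session = driver.session();
  try {
    return await work(session);
  } finally {
    session.close(); // runs on success AND on error
  }
}
```

Routing every query through a wrapper like this removes the "close() only on the error path" class of leak, because the finally block runs regardless of how the work finishes.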

@samus42
Author

samus42 commented Jul 27, 2018

OK, putting that last session.close() in allowed for a clean run! Thanks for helping track that down.

@lutovich
Contributor

@samus42 I'm glad it works now!

@regevbr version 1.6.3 with the mentioned fix for transaction functions is now released. Could you please try it out?

@regevbr

regevbr commented Jul 31, 2018

@lutovich perfect, I'll update to the new version now.
Though I must say, the ugly workaround that fixed the issue for me over the past weeks was to destroy and recreate the driver once an hour to ensure the queue does not get stuck haha :-)

@regevbr

regevbr commented Jul 31, 2018

Looks good for now; let's give it 24 hours to see if it sticks...

@zhenlineo
Contributor

@regevbr Do you have any more updates on this issue? Have you hit any problems after 24 * 8 hours so far?

@regevbr

regevbr commented Aug 8, 2018

Sorry, it works perfectly! Thanks a lot to everyone.
