Driver 1.5 and up eventually stops working when put under load #373
Comments
Hi @samus42, Starting from 1.5, the driver has a hard limit on the number of connections in the connection pool. This limit was introduced so that the driver does not overload the database with an excessive number of connections. See this PR for more details. The limit is configurable, with a default value of 100. The driver will not allow clients to obtain/create more than 100 connections; such acquire operations will wait until previously acquired connections are returned to the pool. Connections are acquired from the pool to run queries. I think the problem you are seeing might happen when an application does not close sessions or commit/rollback transactions. Eventually, the max pool limit is reached and no more connections can be acquired. Could you please share a reproducer app or code that interacts with the driver? It would also be valuable to see if you use explicit transactions. |
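To illustrate the point about unclosed sessions, here is a minimal sketch of a query helper that always returns its connection to the pool; the `runQuery` helper and its shape are illustrative, not code from this thread:

```javascript
// Hypothetical sketch: close the session on success AND on failure, so its
// connection is returned to the pool. `driver` is assumed to be a
// neo4j-driver instance created elsewhere with neo4j.driver(...).
function runQuery(driver, query, params) {
  const session = driver.session();
  return session.run(query, params)
    .then(result => {
      session.close(); // returns the connection to the pool
      return result.records;
    })
    .catch(err => {
      session.close(); // close on error too, otherwise the connection leaks
      throw err;       // re-throw so callers still see the failure
    });
}
```

If the error path skips `session.close()`, each failed query permanently removes one connection from the pool, and the pool eventually hits the 100-connection limit described above.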
I know I went through everything and made sure the sessions were closed; I'll check to see if we have transactions as well. |
So we're doing both session.writeTransaction and session.newTransaction. But in each case they're followed by a session.close(). I'll go through the code base again and make sure every new session is closed, but I'm pretty sure I accounted for everything. Is there something else that we need to do for transactions? |
@samus42 Transactions used in writeTransaction are committed or rolled back by the driver automatically, so nothing extra should be needed there. We will investigate this further to make sure there are no problems with the connection pool. |
We do have some integration tests that cause query failure on purpose, but not a lot. I'm going through and making sure we have catches on everything now. For logging, we just use a winston logger. |
We made sure every new session is closed, even on error, and the tests are breaking in the same place in the same way. |
Hi @samus42, We have added configurable logging in the driver: #380. Logging is available in the pre-release version 1.7.0-alpha01, which can be installed with `npm install neo4j-driver@1.7.0-alpha01`. When enabled like this:

```javascript
const config = {
  logger: neo4j.logging.console('debug')
};
const driver = neo4j.driver(
  'bolt://localhost',
  neo4j.auth.basic('neo4j', 'password'),
  config
);
```

the driver will output logs to the console. It is also possible to plug in a custom logging function:

```javascript
const config = {
  logging: {
    level: 'debug',
    logger: (level, message) => console.log('+++' + level + ' ' + message) // custom function here
  }
};
const driver = neo4j.driver(
  'bolt://localhost',
  neo4j.auth.basic('neo4j', 'password'),
  config
);
```

which can be used to prefix log messages with a dummy string for easier grepping, or to write all log messages to a file. I suspect logging will help us see if there are any connections that are not returned to the pool and what kind of operations they execute. Logging could also show if there are any problems with the connection pool. Looking forward to your reply! |
Hi @samus42, Did you have a chance to run the tests with logging turned on? |
I seem to have a similar issue (1.6.2 and 1.5.*): after a lot of load on the driver, some queries are never run, causing stale promises to get stuck. The timeout on the connection acquire from the pool doesn't seem to help, and I don't know why that happens. The issue only happens in production under heavy load. I scanned a lot of your code and can't seem to find the culprit. The pool module seems fine to me. I suspected the queries got stuck on the Neo4j server, but there is no indication of queries running for more than 10 seconds, while said queries get stuck for more than 3 minutes, possibly never getting released (I installed my own timeout on the transaction to stop it). So my belief is that somehow the request gets stuck before the pool module is used, or an unexpected error occurs during pool operations which doesn't reject the promise properly... |
Hi @regevbr, Have you seen any messages about unhandled promise rejections when this happened? Could you please share code that queries Neo4j using the driver? |
Hi @samus42, Do you have any updates about this issue? It would be extremely helpful to see logging from a test run that reproduces the issue. |
Nope, no unhandled rejections.

```javascript
const pTimeout = require('promise-timeout');
const timeunit = require('timeunit');
const TIMEOUT = timeunit.seconds.toMillis(150);
const neo4j = require('neo4j-driver').v1;

const driver = neo4j.driver(config.endpoint, neo4j.auth.basic(config.user, config.password), {
  maxConnectionLifetime: 20 * 60 * 1000,
  connectionTimeout: 1000 * 45,
  connectionAcquisitionTimeout: 60000,
  maxTransactionRetryTime: 10000,
  connectionPoolSize: 300,
});

function runTransaction(query, params, isWrite) {
  const session = driver.session();
  const transactionStart = isWrite ? session.writeTransaction.bind(session) : session.readTransaction.bind(session);
  const transactionAction = transaction => {
    return transaction.run(query, params);
  };
  const transactionPromise = transactionStart(transactionAction);
  return pTimeout.timeout(transactionPromise, TIMEOUT)
    .then(res => {
      session.close();
      return res;
    })
    .catch(err => {
      session.close();
      throw err;
    });
}
```
|
The provided code looks correct to me. Thanks for sharing. We will investigate this further. |
By the way, after I increased the pool size from 100 to 300 and decreased the retry timeout to 10 seconds the number of failures dropped dramatically (but still exists) so my best guess is that it is related to either the pool module or the exponential back-off mechanism... |
Is it true that when queries get stuck you do not see them in the server's query logs? |
@lutovich when I examine the query logs of the neo instance there are no stuck queries nor queries that run more than 10 seconds... |
@lutovich sorry, been out of the country; I'll try to get to that this week |
@lutovich I have the log file; it's about 30 MB. How do you want me to get it to you? In my scanning through, the connection numbers go up, but I see the previous number released to the pool. |
@samus42 great! could you please archive it and send to konstantin dot lutovich at neo4j dot com? Alternatively it can be uploaded to this google drive. Thank you! |
Found an issue with transaction functions which might cause connections not to be returned to the pool. PR is here: #397. It might not be the definitive fix for this problem. Having logs would still be very useful. |
@lutovich uploaded to your google drive, hope it helps! |
Thank you! I can see the file. We will investigate and update this issue. |
@lutovich can you please release a new 1.6 patch so I can see if the fix helps? |
@samus42 I can see 100 connections in the log that were acquired but not returned to the pool. 100 is the default max pool limit starting from version 1.5; no upper limit was enforced before 1.5. All those connections are acquired to execute very similar transactions. Each of them starts with the query `match (g:Group {id: {groupId}})-[r:APPLIES_TO]->() delete r`. This query appears in the log 101 times, and the connection is never returned to the pool after such a transaction is committed. Could you please search the codebase for that query and share the code that executes the corresponding transaction? The problem might happen because the driver only releases connections back to the pool when user code explicitly consumes the result of `Transaction#commit()`. This result might either be used as a Promise with then-catch or subscribed to. It would be great to see the code to validate this theory. @regevbr we will probably release a 1.6 patch this week. Just want to be more confident that it actually contains the fix. |
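The theory above can be illustrated with a sketch of an explicit transaction whose `commit()` result is actually consumed as a Promise; the surrounding function and its name are illustrative, not the actual application code:

```javascript
// Hypothetical sketch: the result of commit() must be consumed (as a Promise
// or via subscribe) before the driver releases the connection to the pool.
function deleteGroupRelationships(session, groupId) {
  const tx = session.beginTransaction();
  return tx.run('match (g:Group {id: {groupId}})-[r:APPLIES_TO]->() delete r', { groupId })
    .then(() => tx.commit())      // returning the commit result consumes it as a Promise
    .then(() => session.close())  // connection can now go back to the pool
    .catch(err => {
      session.close();
      throw err;
    });
}
```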
Perfect thanks! |
@lutovich if that is indeed the case, then why is the wait timeout for a connection from the pool not taking effect? It all just hangs there forever, which doesn't make sense. |
Your problem might be caused by something else; I'm not yet sure. Is it an option for you to use the alpha version of the 1.7 driver with logging enabled? |
I found the code that query pertains to, and I'm digging through it now. It does a transaction = session.beginTransaction(), then later does a transaction.commit() or transaction.rollback(). What's interesting is that session.close() is only called on an error. I must have missed that on my last couple of passes (this code is more complex than most because we check for the creation of cycles). I'll throw a session.close() in there and try again. |
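A sketch of the pattern just described, with `session.close()` on both the success and error paths rather than only on error (the helper and its names are illustrative, not the actual test-suite code):

```javascript
// Hypothetical sketch: an explicit transaction where session.close() runs on
// every path, not only after a failure.
function runInTransaction(session, work) {
  const transaction = session.beginTransaction();
  return work(transaction)
    .then(result =>
      transaction.commit().then(() => {
        session.close(); // this close was missing on the success path
        return result;
      }))
    .catch(err =>
      transaction.rollback().then(() => {
        session.close();
        throw err; // re-reject so callers still see the failure
      }));
}
```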
OK, putting that last session.close() in allowed for a clean run! Thanks for helping track that down. |
@lutovich perfect, I'll update to the new version now. |
For now it looks good; let's give it 24 hours to see if it sticks... |
@regevbr Do you have any more updates on this issue? Have you had any problems in the 24 * 8 hours so far? |
Sorry, it works perfectly! Thanks a lot to everyone |
We have a set of integration tests for the API that backs our Neo4j instance. When we upgraded from 1.3 to 1.6, we started running into "Connection acquisition timed out in 60000 ms" at a certain point in the test run (stack traces at the end). Once that happened, every following call ended up with the same issue. When running the tests in isolation, they worked without issue. I performed an experiment where I would disable tests earlier in the run; the run would get past the problem areas, then hit the same issue a little later. So it seems that at a certain point, something in the driver overloads and won't work anymore. Going back to 1.5.x of the driver gave the same issue (slightly different message). 1.4.1 of the driver worked fine and was able to complete the run.
All test runs went against my locally running neo4j 3.2.9 enterprise database, which is operating fine during the test run.
We are running on node 8.9.4, npm 6.0.1.
Stack trace for the 1.6.x driver:

```
message=Connection acquisition timed out in 60000 ms., stack=Neo4jError: Connection acquisition timed out in 60000 ms.
    at captureStacktrace (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/result.js:199:15)
    at new Result (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/result.js:72:19)
    at _runPullAll (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:331:10)
    at Object.rollback (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:247:24)
    at Transaction.rollback (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:135:35)
    at Transaction._onError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:159:21)
    at _TransactionStreamObserver.onError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:219:18)
    at /Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/transaction.js:77:29
    at :null:null
```

Stack trace for the 1.5.x driver:

```
message=Operation timed out in 60000 ms., stack=Neo4jError: Operation timed out in 60000 ms.
    at new Neo4jError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/error.js:76:132)
    at newError (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/error.js:66:10)
    at Timeout._onTimeout (/Users/srmacdo/projects/profile/profile-api/node_modules/neo4j-driver/lib/v1/internal/util.js:153:34)
    at ontimeout (timers.js:475:11)
    at tryOnTimeout (timers.js:310:5)
    at Timer.listOnTimeout (timers.js:270:5)
```