Replication stalls; nodes involved stop responding to HTTP #95
I've been seeing some weirdness like this off and on too. I started a cluster with 3 nodes, tried replicating a ~1GB database, and had the process fail several times. It would always make some amount of progress, but ultimately it died on me each time. If someone can tell me what to grep for, I'll search my logs, but here are a couple of snippets that I found. Unfortunately I didn't run these tests in a very methodical way, and I've been having trouble reproducing these issues, so I'll try the scattershot approach in an effort to provide something valuable by accident :)
I tried accessing a _conflicts view, and this was returned to my browser (not from the logs; Chrome had this document returned to it):
Looking through the logs, I do know that these machines had very high CPU load (in the 5.0-10.0 range) for a 1-core, 1-CPU box, which may have caused them to become unresponsive. One of my 3 nodes did seem to actually die (`sv status bigcouch` reported that it was no longer running, and box load went to 0.0). The strange thing to me is that the node reporting the errors below (domU-12-31-*****) was the local node that I was directly querying via curl, not a remote node (and not the node that crashed!).
I was trying to replicate a database from cloudant, so you'll see that in these errors (I was also running this node on port 5985 instead of 5984 because I had haproxy on the standard port; might this have been causing my issues?):
Which came coupled with:
and
Sorry this post was so long; as I mentioned I am not exactly sure what caused the crashes or how to reproduce. If there's some way for me to be more helpful, happy to do it! |
In doing some debugging, I just tried a replication from one shard of the db on the backend port (5986) to a stand-alone, vanilla CouchDB instance. What I'm seeing is quite interesting. So this is a push replication from nodeA (the BigCouch backend) to nodeB (vanilla CouchDB). (Note: I've trimmed the Headers lines from these logs and obfuscated the IPs.) NodeA's log:
NodeB's log:
What's weird is that every POST that timed out on the source was received by the destination. Remember, the destination isn't part of a cluster; it's just a standalone, vanilla CouchDB instance. |
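For context, a push replication like the one described above can be submitted to the node's backend port with a request along these lines; this is only a sketch, and the hostnames, database name, and shard path are placeholders rather than values taken from the logs:

```sh
# Hypothetical push replication of a single shard from a BigCouch node's
# backend port (5986) to a standalone CouchDB on 5984. nodeA, nodeB, and the
# shard database name "shards/00000000-1fffffff/db" are all placeholders.
curl -X POST http://nodeA:5986/_replicate \
     -H 'Content-Type: application/json' \
     -d '{"source": "shards/00000000-1fffffff/db", "target": "http://nodeB:5984/db"}'
```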
I'm seeing the same issue when enabling continuous replication jobs between two BigCouches. Things seem to work for a period of time, then the process stops logging and stops responding over HTTP. Restarting the process starts the replication jobs again. I can't track down the root cause. Anyone got any leads on this? |
@dhdanno: BigCouch is discontinued. The clustering code has been merged into upstream CouchDB. |
Thanks for the update. I do realize this, but we are still using it in production, so I thought I'd ask. I believe the issue was the sheer volume of documents being replicated in some of the databases causing the process to crash. I ended up doing a manual seed of the data, then enabling replication... I believe this is the way it needs to be done! |
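A minimal sketch of the seed-then-replicate workflow described above, with hypothetical hostnames, database name, and document id; the manual seed itself (however it was performed) is assumed to have happened already, so only the continuous replication step is shown:

```sh
# Hypothetical continuous replication document, created only after the bulk of
# the data has been seeded by other means; the hosts, the db name, and the
# doc id "continuous-db-sync" are placeholders.
curl -X PUT http://target-cluster:5984/_replicator/continuous-db-sync \
     -H 'Content-Type: application/json' \
     -d '{"source": "http://source-cluster:5984/db",
          "target": "http://target-cluster:5984/db",
          "continuous": true}'
```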
Doing a pull replication from ClusterA to ClusterB:
`{"source":"http://node1.clusterA:5984/db", "target":"http://node2.clusterB:5984/db"}`
When using the `_replicator` db, the state is `triggered`. When using `_replicate`, I get `{http_request_failed,<<"failed to replicate http://node2.clusterB:5984/db/">>}` returned.
Logs from node2.clusterB during the problem occurrence are available here: https://raw.github.com/gist/c372d0fbd0f01a2e0fb9/58a96d00c5b58d4d457ece12813cbf013a932b5c/bigcouch.log.out
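For reference, the two submission paths mentioned above look roughly like the following, assuming the requests are made against a node in ClusterB (the report does not say which node received them); the hostnames come from the JSON body above, and the document id is made up for illustration:

```sh
# One-shot replication via _replicate (the path that returned http_request_failed above).
curl -X POST http://node2.clusterB:5984/_replicate \
     -H 'Content-Type: application/json' \
     -d '{"source": "http://node1.clusterA:5984/db", "target": "http://node2.clusterB:5984/db"}'

# Persistent replication via the _replicator database (the path whose state
# reportedly reached "triggered"); the doc id "pull-db-from-clusterA" is hypothetical.
curl -X PUT http://node2.clusterB:5984/_replicator/pull-db-from-clusterA \
     -H 'Content-Type: application/json' \
     -d '{"source": "http://node1.clusterA:5984/db", "target": "http://node2.clusterB:5984/db"}'

# Inspect the replication document to check _replication_state.
curl http://node2.clusterB:5984/_replicator/pull-db-from-clusterA
```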