Copy subgraphs between shards #2313
Conversation
I added a few commits so that connections that are used for copying, and therefore use fdw, come from a separate pool where connections are closed more aggressively. That achieves two things: (1) 'normal' connections don't have additional fdw connections hanging off them, and (2) the number of connections that are busy because of copying is limited separately, so that heavy copy activity doesn't block other database work.
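To make the intent of the pool split concrete, here is a minimal sketch of what such a dedicated, aggressively-reaped pool could look like with diesel's r2d2 integration. The pool size, timeouts, and function name are illustrative assumptions, not the values used in this PR.

```rust
// Hypothetical sketch: a dedicated pool for fdw/copy connections that is
// kept small and reaps idle connections aggressively, so copy traffic
// neither leaves fdw connections hanging off 'normal' connections nor
// starves other database work. Sizes and names are illustrative.
use std::time::Duration;

use diesel::r2d2::{ConnectionManager, Pool};
use diesel::PgConnection;

fn copy_pool(db_url: &str) -> Pool<ConnectionManager<PgConnection>> {
    let manager = ConnectionManager::<PgConnection>::new(db_url);
    Pool::builder()
        // Limit how many connections copying can tie up at once.
        .max_size(5)
        // Do not keep idle fdw connections around; close them quickly.
        .min_idle(Some(0))
        .idle_timeout(Some(Duration::from_secs(30)))
        .max_lifetime(Some(Duration::from_secs(600)))
        .build(manager)
        .expect("failed to create copy connection pool")
}
```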
We now use one dedicated fdw connection, and hang on to it for the duration of the copying.
The various assign commands and 'copy create' need to be able to find a unique deployment. Allow passing a shard to disambiguate.
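Conceptually, the disambiguation rule looks roughly like the following sketch; the types, function, and error messages are made up for illustration. A hash alone is accepted only when it matches exactly one copy; otherwise the shard has to be given as well.

```rust
// Illustrative sketch of the lookup rule, not the actual graphman code.
#[derive(Debug, Clone)]
struct Locator {
    id: i32,
    hash: String,
    shard: String,
}

fn locate(copies: &[Locator], hash: &str, shard: Option<&str>) -> Result<Locator, String> {
    // Keep only the copies that match the hash and, if given, the shard.
    let mut matches: Vec<&Locator> = copies
        .iter()
        .filter(|l| l.hash == hash && shard.map_or(true, |s| l.shard == s))
        .collect();
    match matches.len() {
        0 => Err(format!("no deployment with hash {}", hash)),
        1 => Ok(matches.remove(0).clone()),
        n => Err(format!(
            "{} copies of {} exist; pass a shard to disambiguate",
            n, hash
        )),
    }
}
```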
Also, mark the cancellation time in copy_state
The SubgraphStore gives access to all instances of a deployment, but for the WritableStore we need to be very careful that we do not accidentally query or modify a deployment instance other than the one in the site. These code changes will hopefully make it more obvious when we rely on SubgraphStore functionality that depends on the precise deployment instance.
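The idea can be sketched as follows, with simplified, assumed types rather than the actual graph-node code: the `WritableStore` holds the `Site` it was created for, and every operation addresses the database via that site's `id` rather than by deployment hash.

```rust
// Simplified sketch with assumed types; not the actual graph-node code.
// The WritableStore is constructed for exactly one Site (one copy of a
// deployment in one shard) and only ever addresses the database by that
// site's id, so it cannot accidentally touch another copy with the same
// deployment hash.
pub struct Site {
    /// Primary key in `deployment_schemas`; unique per copy.
    pub id: i32,
    /// Deployment hash; shared by all copies and therefore ambiguous.
    pub hash: String,
    /// Shard this copy lives in.
    pub shard: String,
}

pub struct WritableStore {
    site: Site,
}

impl WritableStore {
    pub fn new(site: Site) -> Self {
        WritableStore { site }
    }

    /// Example write path: keyed by `self.site.id`, never by `self.site.hash`.
    pub fn transact_block(&self, block_number: u64) {
        println!(
            "writing block {} for deployment copy {} (hash {}, shard {})",
            block_number, self.site.id, self.site.hash, self.site.shard
        );
    }
}
```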
Starting a subgraph can be very slow if we have to copy the subgraph as part of starting it. Also, call `Writable.start_subgraph_deployment` early on so that dynamic data sources are in place when we look for them.
The id makes it easier to distinguish between log messages coming from different copies of the same subgraph
With this PR, subgraphs can be copied from one shard to another shard. The system can now deal with multiple copies of the same deployment hash existing side by side. These copies need to be in different shards (there's little point in having the same data duplicated in the same database), and only one of these copies (the 'active' one) will be used to respond to queries.
Copying is initiated by running `graphman copy create ...`, and queries can be switched to the copy with `graphman copy activate ...` once the copy has finished. Behind the scenes, copying uses most of the grafting machinery from #2293. In addition, all operations now uniquely identify the precise copy of a deployment using a `DeploymentLocator` on the write path for a subgraph (encapsulated in a `WritableStore`). For queries, the `QueryStore` automatically picks the active copy of a subgraph. Within the store, deployments are now identified by the `id` of the `deployment_schemas` table; `DeploymentLocator` encapsulates that so that the rest of the code can be oblivious to it.

The system is limited to 5 active copy/graft operations per index pod. In the future, it might be better to limit this system-wide. The limit is hardcoded, but it would be easy to make it configurable in `graph-node.toml`.
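As a rough illustration of the shape of these identifiers, here is a sketch with assumed field types, not necessarily the exact definitions in this PR: `DeploymentLocator` pairs the internal `deployment_schemas` id with the deployment hash, and the query side resolves a hash to the single active copy.

```rust
// Sketch only: assumed, simplified versions of the types discussed above.
// The real definitions in graph-node may differ.

/// The external identifier shared by all copies of a deployment.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct DeploymentHash(pub String);

/// Uniquely identifies one copy of a deployment: `id` is the primary key
/// of the corresponding row in `deployment_schemas`.
#[derive(Clone, PartialEq, Eq, Debug)]
pub struct DeploymentLocator {
    pub id: i32,
    pub hash: DeploymentHash,
}

/// One row of `deployment_schemas`, reduced to the fields that matter here.
pub struct SchemaRow {
    pub id: i32,
    pub hash: DeploymentHash,
    pub shard: String,
    pub active: bool,
}

/// What the query path does conceptually: given a hash, pick the one
/// active copy. The write path never does this lookup and always starts
/// from a `DeploymentLocator` instead.
pub fn active_locator(rows: &[SchemaRow], hash: &DeploymentHash) -> Option<DeploymentLocator> {
    rows.iter()
        .find(|r| &r.hash == hash && r.active)
        .map(|r| DeploymentLocator {
            id: r.id,
            hash: r.hash.clone(),
        })
}
```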
While a copy/graft is under way, progress is printed approximately every 3-5 minutes.
A copy/graft can be cancelled by unassigning the destination deployment. That will (with some delay of up to 5 minutes) lead to the copy process stopping, and the subgraph being stopped.
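The delays come from the fact that cancellation and progress reporting happen between units of copy work rather than immediately. The following sketch is only an illustration of that pattern under assumed names, not the actual implementation.

```rust
// Illustrative sketch of why cancellation and progress reporting lag: the
// copy worker processes data in batches and only between batches does it
// log progress and check whether the destination has been unassigned
// (recorded as a cancellation time in copy_state).
use std::time::{Duration, Instant};

const PROGRESS_INTERVAL: Duration = Duration::from_secs(3 * 60);

fn run_copy(
    mut copy_next_batch: impl FnMut() -> bool,
    is_cancelled: impl Fn() -> bool,
    log_progress: impl Fn(),
) {
    let mut last_progress = Instant::now();
    loop {
        // Copy one batch; returns false when there is nothing left to do.
        if !copy_next_batch() {
            break;
        }
        if last_progress.elapsed() >= PROGRESS_INTERVAL {
            log_progress();
            last_progress = Instant::now();
        }
        // A cancellation (unassigning the destination) is only noticed
        // here, which is why stopping can take a few minutes.
        if is_cancelled() {
            break;
        }
    }
}
```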
A list of active copies can be generated with `graphman copy list`. The details of a copy operation can be printed with `graphman copy status <dst>`.

To help clarify the distinction between the internal and external identifier of a deployment, I plan on renaming `SubgraphDeploymentId` to `DeploymentHash`. Since this is a very boring, but very intrusive, change, I will do that in a separate PR; that's the reason why a lot of code now has variables `hash: SubgraphDeploymentId`.

This PR sits on top of #2293.