Copy subgraphs between shards #2313
Conversation
I added a few commits so that connections that are used for copying, and therefore use fdw, come from a separate pool where connections are closed more aggressively. That achieves two things: (1) 'normal' connections don't have additional fdw connections hanging off them, and (2) the number of connections that are busy because of copying is limited separately, so that heavy copy activity doesn't block other database work.
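To make the intent of the pool split concrete, here is a minimal sketch of what such a dedicated, aggressively-reaped pool could look like with diesel's r2d2 integration. The pool size, timeouts, and function name are illustrative assumptions, not the values used in this PR.

```rust
// Hypothetical sketch: a dedicated pool for fdw/copy connections that is
// kept small and reaps idle connections aggressively, so copy traffic
// neither leaves fdw connections hanging off 'normal' connections nor
// starves other database work. Sizes and names are illustrative.
use std::time::Duration;

use diesel::r2d2::{ConnectionManager, Pool};
use diesel::PgConnection;

fn copy_pool(db_url: &str) -> Pool<ConnectionManager<PgConnection>> {
    let manager = ConnectionManager::<PgConnection>::new(db_url);
    Pool::builder()
        // Limit how many connections copying can tie up at once.
        .max_size(5)
        // Do not keep idle fdw connections around; close them quickly.
        .min_idle(Some(0))
        .idle_timeout(Some(Duration::from_secs(30)))
        .max_lifetime(Some(Duration::from_secs(600)))
        .build(manager)
        .expect("failed to create copy connection pool")
}
```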
We now use one dedicated fdw connection, and hang on to it for the duration of the copying.
The various assign commands and 'copy create' need to be able to find a unique deployment. Allow passing a shard to disambiguate.
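Conceptually, the disambiguation rule looks roughly like the following sketch; the types, function, and error messages are made up for illustration. A hash alone is accepted only when it matches exactly one copy; otherwise the shard has to be given as well.

```rust
// Illustrative sketch of the lookup rule, not the actual graphman code.
#[derive(Debug, Clone)]
struct Locator {
    id: i32,
    hash: String,
    shard: String,
}

fn locate(copies: &[Locator], hash: &str, shard: Option<&str>) -> Result<Locator, String> {
    // Keep only the copies that match the hash and, if given, the shard.
    let mut matches: Vec<&Locator> = copies
        .iter()
        .filter(|l| l.hash == hash && shard.map_or(true, |s| l.shard == s))
        .collect();
    match matches.len() {
        0 => Err(format!("no deployment with hash {}", hash)),
        1 => Ok(matches.remove(0).clone()),
        n => Err(format!(
            "{} copies of {} exist; pass a shard to disambiguate",
            n, hash
        )),
    }
}
```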
Also, mark the cancellation time in copy_state
The SubgraphStore gives access to all instances of a deployment, but for the WritableStore we need to be very careful that we do not accidentally query or modify a deployment instance other than the one in the site. These code changes will hopefully make it more obvious when we rely on SubgraphStore functionality that depends on the precise deployment instance.
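The idea can be sketched as follows, with simplified, assumed types rather than the actual graph-node code: the `WritableStore` holds the `Site` it was created for, and every operation addresses the database via that site's `id` rather than by deployment hash.

```rust
// Simplified sketch with assumed types; not the actual graph-node code.
// The WritableStore is constructed for exactly one Site (one copy of a
// deployment in one shard) and only ever addresses the database by that
// site's id, so it cannot accidentally touch another copy with the same
// deployment hash.
pub struct Site {
    /// Primary key in `deployment_schemas`; unique per copy.
    pub id: i32,
    /// Deployment hash; shared by all copies and therefore ambiguous.
    pub hash: String,
    /// Shard this copy lives in.
    pub shard: String,
}

pub struct WritableStore {
    site: Site,
}

impl WritableStore {
    pub fn new(site: Site) -> Self {
        WritableStore { site }
    }

    /// Example write path: keyed by `self.site.id`, never by `self.site.hash`.
    pub fn transact_block(&self, block_number: u64) {
        println!(
            "writing block {} for deployment copy {} (hash {}, shard {})",
            block_number, self.site.id, self.site.hash, self.site.shard
        );
    }
}
```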
Starting a subgraph can be very slow if we have to copy the subgraph as part of starting it. Also, call `Writable.start_subgraph_deployment` early on so that dynamic data sources are in place when we look for them.
The id makes it easier to distinguish between log messages coming from different copies of the same subgraph
With this PR, subgraphs can be copied from one shard to another shard. The system can now deal with multiple copies of the same deployment hash existing side by side. These copies need to be in different shards (there's little point in having the same data duplicated in the same database), and only one of these copies (the 'active' one) will be used to respond to queries.
Copying is initiated by running `graphman copy create ...`, and queries can be switched to the copy with `graphman copy activate ...` once the copy has finished. Behind the scenes, copying uses most of the grafting machinery from #2293. In addition, all operations now uniquely identify the precise copy of a deployment using a `DeploymentLocator` on the write path for a subgraph (encapsulated in a `WritableStore`). For queries, the `QueryStore` automatically picks the active copy of a subgraph. Within the store, deployments are now identified by the `id` of the `deployment_schemas` table; `DeploymentLocator` encapsulates that so that the rest of the code can be oblivious to it.

The system is limited to 5 active copy/graft operations per index pod. In the future, it might be better to limit this system-wide. The limit is hardcoded, but it would be easy to make it configurable in `graph-node.toml`.
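As a rough illustration of the shape of these identifiers, here is a sketch with assumed field types, not necessarily the exact definitions in this PR: `DeploymentLocator` pairs the internal `deployment_schemas` id with the deployment hash, and the query side resolves a hash to the single active copy.

```rust
// Sketch only: assumed, simplified versions of the types discussed above.
// The real definitions in graph-node may differ.

/// The external identifier shared by all copies of a deployment.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct DeploymentHash(pub String);

/// Uniquely identifies one copy of a deployment: `id` is the primary key
/// of the corresponding row in `deployment_schemas`.
#[derive(Clone, PartialEq, Eq, Debug)]
pub struct DeploymentLocator {
    pub id: i32,
    pub hash: DeploymentHash,
}

/// One row of `deployment_schemas`, reduced to the fields that matter here.
pub struct SchemaRow {
    pub id: i32,
    pub hash: DeploymentHash,
    pub shard: String,
    pub active: bool,
}

/// What the query path does conceptually: given a hash, pick the one
/// active copy. The write path never does this lookup and always starts
/// from a `DeploymentLocator` instead.
pub fn active_locator(rows: &[SchemaRow], hash: &DeploymentHash) -> Option<DeploymentLocator> {
    rows.iter()
        .find(|r| &r.hash == hash && r.active)
        .map(|r| DeploymentLocator {
            id: r.id,
            hash: r.hash.clone(),
        })
}
```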
While a copy/graft is under way, progress is printed approximately every 3-5 minutes.
A copy/graft can be cancelled by unassigning the destination deployment. That will (with some delay of up to 5 minutes) lead to the copy process stopping, and the subgraph being stopped.
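The delays come from the fact that cancellation and progress reporting happen between units of copy work rather than immediately. The following sketch is only an illustration of that pattern under assumed names, not the actual implementation.

```rust
// Illustrative sketch of why cancellation and progress reporting lag: the
// copy worker processes data in batches and only between batches does it
// log progress and check whether the destination has been unassigned
// (recorded as a cancellation time in copy_state).
use std::time::{Duration, Instant};

const PROGRESS_INTERVAL: Duration = Duration::from_secs(3 * 60);

fn run_copy(
    mut copy_next_batch: impl FnMut() -> bool,
    is_cancelled: impl Fn() -> bool,
    log_progress: impl Fn(),
) {
    let mut last_progress = Instant::now();
    loop {
        // Copy one batch; returns false when there is nothing left to do.
        if !copy_next_batch() {
            break;
        }
        if last_progress.elapsed() >= PROGRESS_INTERVAL {
            log_progress();
            last_progress = Instant::now();
        }
        // A cancellation (unassigning the destination) is only noticed
        // here, which is why stopping can take a few minutes.
        if is_cancelled() {
            break;
        }
    }
}
```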
A list of active copies can be generated with `graphman copy list`. The details of a copy operation can be printed with `graphman copy status <dst>`.

To help clarify the distinction between the internal and external identifier of a deployment, I plan on renaming `SubgraphDeploymentId` to `DeploymentHash`. Since this is a very boring, but very intrusive, change, I will do that in a separate PR; that's the reason why a lot of code now has variables `hash: SubgraphDeploymentId`.

This PR sits on top of #2293.