graph: Use a map with interned keys for `Entity` #4485

lutter · 2023-03-21T20:35:43Z

This is an implementation of a string pool that I've had kicking around locally for a long time. The idea behind this is that we have a lot of places where we deal with maps whose keys are strings, and those keys come from a known, fixed set of strings. On the indexing side, those are the names of attributes from the subgraph schema, and for queries it's those names plus field aliases in the query.

Moving from maps with string keys to the Object struct, which is a map keyed on interned strings, should reduce memory consumption quite a bit.
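To make the idea concrete, here is a minimal sketch of a string pool and a map keyed on interned strings. The names `Pool`, `Atom`, and `InternedMap` and all of their methods are illustrative assumptions; they are not the `AtomPool`/`Object` API from this PR.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Index of an interned string inside a `Pool` (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Atom(u32);

/// A set of known strings; every distinct string is interned once.
#[derive(Debug, Default)]
pub struct Pool {
    atoms: Vec<Box<str>>,
    words: HashMap<Box<str>, Atom>,
}

impl Pool {
    /// Intern `s`, returning the existing atom if `s` is already known.
    pub fn intern(&mut self, s: &str) -> Atom {
        if let Some(atom) = self.words.get(s) {
            return *atom;
        }
        let atom = Atom(self.atoms.len() as u32);
        self.atoms.push(s.into());
        self.words.insert(s.into(), atom);
        atom
    }

    /// Look up the atom for `s` without adding it to the pool.
    pub fn lookup(&self, s: &str) -> Option<Atom> {
        self.words.get(s).copied()
    }
}

/// A small map whose keys are atoms from a shared pool, so each entry
/// stores an integer instead of its own copy of the key string.
pub struct InternedMap<V> {
    pool: Arc<Pool>,
    entries: Vec<(Atom, V)>,
}

impl<V> InternedMap<V> {
    pub fn new(pool: Arc<Pool>) -> Self {
        InternedMap { pool, entries: Vec::new() }
    }

    /// Insert a value. Returns the key as an error if it is not interned.
    pub fn insert(&mut self, key: &str, value: V) -> Result<Option<V>, String> {
        let atom = self.pool.lookup(key).ok_or_else(|| key.to_string())?;
        for (k, v) in self.entries.iter_mut() {
            if *k == atom {
                return Ok(Some(std::mem::replace(v, value)));
            }
        }
        self.entries.push((atom, value));
        Ok(None)
    }

    /// Find the value for `key`, or `None` if the key is not present.
    pub fn get(&self, key: &str) -> Option<&V> {
        let atom = self.pool.lookup(key)?;
        self.entries.iter().find(|(k, _)| *k == atom).map(|(_, v)| v)
    }
}
```

In this sketch, many maps can share one `Arc<Pool>`, so the key strings are stored once per pool rather than once per map.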

The main reason to open this PR is to start a discussion around this approach before I go and plumb this into the places where we deal with such maps, mostly the Entity type during indexing and the r::Value type for queries.

leoyvens

The interner implementation looks good! Perhaps the next step, before a full refactor, is to switch the Entity implementation to use the interner Object, and see if that brings up any new concerns? One thing I'm curious about is the stable hash implementation, and if we can keep that consistent when switching to this.

leoyvens · 2023-04-10T19:21:03Z

graph/src/util/intern.rs

+
+    /// Find the value for `key` in the object. Return `None` if the key is
+    /// not present.
+    pub fn get(&self, key: &str) -> Option<&V> {


No fn get_by_atom? I'd hope we can get by atom in hot code.

lutter · 2023-04-12

I'll add one once we need it. For now, I was thinking of keeping atoms internal to entities rather than plumbing them through everything, i.e., users of entities will look up by string for now.

lutter · 2023-04-12T02:42:22Z

> before a full refactor, is to switch the Entity implementation to use the interner Object, and see if that brings up any new concerns?

Yes, that's what I have been working on. It involves quite a bit of change, since we need to get rid of Entity::new and similar methods that create entities and make the schema a factory for entities (that's where the string pool lives). I'll add to this PR once I have something that's halfway understandable.

> One thing I'm curious about is the stable hash implementation, and if we can keep that consistent when switching to this.

It should follow the implementation for HashMap and not change the stable hash. To do that, I need a helper from the stable hash crate to become public, but that shouldn't be very hard since the helper operates on an iterator over (&str, Value).
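One way to see why the container type need not affect the hash: if the hash is computed from an iterator over key/value pairs and combined in an order-independent way, any map that yields the same pairs produces the same result. The sketch below uses the standard library's `DefaultHasher` and a wrapping sum purely for illustration. It is an assumption-laden stand-in, not the stable hash crate's algorithm or API, and `unordered_hash` is a hypothetical name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a collection of key/value pairs so that the result depends only
/// on the set of pairs, not on iteration order or on the container type.
pub fn unordered_hash<K: Hash, V: Hash>(pairs: impl Iterator<Item = (K, V)>) -> u64 {
    pairs
        .map(|(key, value)| {
            // Hash each pair on its own with a fresh hasher.
            let mut hasher = DefaultHasher::new();
            key.hash(&mut hasher);
            value.hash(&mut hasher);
            hasher.finish()
        })
        // Addition is commutative, so the order of the pairs is irrelevant.
        .fold(0u64, u64::wrapping_add)
}
```

A `HashMap<String, V>` and a vector of `(&str, V)` pairs with the same contents hash identically here, as long as both are iterated as `(&str, &V)`.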

That3Percent · 2023-04-12T04:46:43Z

graph/src/util/intern.rs

+pub struct AtomPool {
+    base: Option<Arc<AtomPool>>,
+    base_sym: AtomInt,
+    atoms: Vec<Box<str>>,


If you are attempting to reduce memory consumption, would it be better to use `Arc<str>` instead of `Box<str>` here? The `str` in `atoms` and `words` are identical.

lutter · 2023-04-13

Yes, that might be a win depending on the average size of those strings. `Arc<str>` introduces an overhead of 24 bytes, whereas `Box<str>` is 16 bytes, so the savings come down to how many strings fit into 8 bytes. But for now, the main win of interning will be to reduce hundreds of copies of the same string to two.
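As a rough illustration of the suggestion, the sketch below stores one `Arc<str>` allocation per string and shares it between the index-to-string vector and the string-to-index map. `SharedPool` and its methods are hypothetical names, and the sketch does not try to settle the byte counts discussed above; it only shows the sharing.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical pool where `atoms` and `words` share string allocations.
pub struct SharedPool {
    atoms: Vec<Arc<str>>,
    words: HashMap<Arc<str>, u32>,
}

impl SharedPool {
    pub fn new() -> Self {
        SharedPool { atoms: Vec::new(), words: HashMap::new() }
    }

    /// Intern `s`; the string data is allocated once and referenced twice.
    pub fn intern(&mut self, s: &str) -> u32 {
        if let Some(idx) = self.words.get(s) {
            return *idx;
        }
        let idx = self.atoms.len() as u32;
        let shared: Arc<str> = Arc::from(s);
        self.atoms.push(Arc::clone(&shared));
        self.words.insert(shared, idx);
        idx
    }

    /// True if the map key and the vector entry point at the same allocation.
    pub fn shares_allocation(&self, s: &str) -> bool {
        match self.words.get_key_value(s) {
            Some((key, idx)) => Arc::ptr_eq(key, &self.atoms[*idx as usize]),
            None => false,
        }
    }

    /// Number of handles that reference the allocation for `s`, if interned.
    pub fn handle_count(&self, s: &str) -> Option<usize> {
        self.words.get_key_value(s).map(|(key, _)| Arc::strong_count(key))
    }
}
```

With `Box<str>` in both collections, each interned string is stored twice; here the two handles refer to one allocation, at the cost of the reference counts kept alongside the string data.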

lutter · 2023-04-13T23:38:18Z

This PR now integrates the AtomPool into graph-node and uses it to back Entity, i.e., the indexing side now uses objects with interned keys to represent entities. Most of this PR is concerned with moving schema-handling code between crates/modules, with the goal of making the new InputSchema a factory for entities: creating free-standing entities is no longer allowed, and they all need to be based on a schema (except for entities in tests).

This PR is best reviewed commit-by-commit; the initial introduction of interned keys for Entity panics quite a bit, but those panics are removed in later commits.
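The factory idea can be sketched as follows: the schema owns the set of known field names, and entities can only be created through it, so an unknown key is reported as an error instead of causing a panic. `ToySchema`, `ToyEntity`, `EntityError`, and `make_entity` are hypothetical names for illustration and do not mirror the real `InputSchema` or `Entity` signatures.

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Debug, PartialEq)]
pub enum EntityError {
    /// The key is not one of the names known to the schema.
    UnknownKey(String),
}

/// Known field names mapped to small integer ids (stand-in for a pool).
#[derive(Debug)]
struct Names {
    index: HashMap<Box<str>, u16>,
}

/// Hypothetical schema that acts as the only factory for entities.
#[derive(Clone, Debug)]
pub struct ToySchema {
    names: Arc<Names>,
}

/// Hypothetical entity whose keys are ids into the schema's names.
#[derive(Debug)]
pub struct ToyEntity {
    names: Arc<Names>,
    fields: Vec<(u16, String)>,
}

impl ToySchema {
    pub fn new(field_names: &[&str]) -> Self {
        let mut index: HashMap<Box<str>, u16> = HashMap::new();
        for name in field_names {
            let next = index.len() as u16;
            index.entry((*name).into()).or_insert(next);
        }
        ToySchema { names: Arc::new(Names { index }) }
    }

    /// Create an entity; fails if any key is not known to the schema.
    pub fn make_entity(&self, data: &[(&str, &str)]) -> Result<ToyEntity, EntityError> {
        let mut fields = Vec::with_capacity(data.len());
        for (key, value) in data {
            let atom = *self
                .names
                .index
                .get(*key)
                .ok_or_else(|| EntityError::UnknownKey(key.to_string()))?;
            fields.push((atom, value.to_string()));
        }
        Ok(ToyEntity { names: Arc::clone(&self.names), fields })
    }
}

impl ToyEntity {
    pub fn get(&self, key: &str) -> Option<&str> {
        let atom = *self.names.index.get(key)?;
        self.fields.iter().find(|(k, _)| *k == atom).map(|(_, v)| v.as_str())
    }
}
```

Because `ToyEntity` has no public constructor, every entity in this sketch is tied to the names of the schema that created it.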

leoyvens

Amazing work! It's hard to review in depth since the PR is quite extensive, but the commits were meticulously organized as always; I read through them and they look good. We should give this a short run on the test cluster because refactoring Entity is sensitive and we should make sure PoIs are unaffected.

leoyvens · 2023-04-20T14:26:34Z

graph/src/schema/input_schema.rs

+/// in the document and the names of all their fields
+fn atom_pool(document: &s::Document) -> AtomPool {
+    let mut pool = AtomPool::new();
+    // These two entries are always required


three entries.

lutter · 2023-04-20

Heh, two of them shouldn't be there. I'll have a follow-on PR that removes __typename and g$parent_id; they are only for the query side.

lutter · 2023-04-20T18:03:27Z

I added one more commit to reduce the size of Atom to a u16. It's not much of a win for now, but it'll make it easier when we improve the memory layout of Object further.
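A small sketch of what a 16-bit atom implies: each key takes two bytes, and the pool can hold at most 65,536 distinct strings, so interning has to be able to fail. `SmallPool` and `PoolFull` are hypothetical names used only for this illustration.

```rust
use std::collections::HashMap;

/// A 16-bit index into the pool (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Atom(u16);

/// Returned when the pool already holds the maximum number of strings.
#[derive(Debug, PartialEq, Eq)]
pub struct PoolFull;

#[derive(Default)]
pub struct SmallPool {
    words: HashMap<Box<str>, Atom>,
}

impl SmallPool {
    /// Intern `s`, failing once all 65,536 possible atoms are in use.
    pub fn intern(&mut self, s: &str) -> Result<Atom, PoolFull> {
        if let Some(atom) = self.words.get(s) {
            return Ok(*atom);
        }
        // Valid indexes are 0..=u16::MAX.
        if self.words.len() > u16::MAX as usize {
            return Err(PoolFull);
        }
        let atom = Atom(self.words.len() as u16);
        self.words.insert(s.into(), atom);
        Ok(atom)
    }
}
```

Strings that are already interned can still be looked up after the pool is full; only new strings are rejected.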

mangas · 2023-04-21T09:13:04Z

Just for future reference, I think it would have been nicer to have the self-contained implementation and tests in one PR, with the refactor in a separate PR.

mangas · 2023-04-21T12:19:17Z

graph/src/components/store/entity_cache.rs

@@ -170,7 +177,8 @@ impl EntityCache {
            }
            None => {
                let value = self.schema.id_value(&key)?;
-                entity.set("id", value);
+                // unwrap: our AtomPool always has an id in it
+                entity.set("id", value).unwrap();


Could it be useful to have a set_id fn that doesn't return an error? I think it would make it easier to use correctly.

lutter · 2023-04-21

I'll have another PR that gets rid of single-key changes to entities; they are only needed in tests. In general, an Entity is used to shuttle data between WASM and the store, and we don't really need to update an Entity once it's been constructed (the id-setting logic then moves to store_set, before the Entity is constructed). That has the nice side effect that an Entity always has an id and Entity.id() can just return String.
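The design described here, where an entity is built once and always carries an id, can be sketched like this. `BuiltEntity`, `MissingId`, and `try_make` are hypothetical names chosen for the example, and plain `String` keys are used instead of interned keys to keep it short.

```rust
use std::collections::BTreeMap;

/// Error returned when the data for an entity has no `id` field.
#[derive(Debug, PartialEq)]
pub struct MissingId;

/// Hypothetical entity that is immutable after construction.
#[derive(Debug)]
pub struct BuiltEntity {
    fields: BTreeMap<String, String>,
}

impl BuiltEntity {
    /// Build the entity in one step. The caller must supply the id up
    /// front, so there is no later `set("id", ..)` that could fail.
    pub fn try_make<I>(pairs: I) -> Result<Self, MissingId>
    where
        I: IntoIterator<Item = (String, String)>,
    {
        let fields: BTreeMap<String, String> = pairs.into_iter().collect();
        if !fields.contains_key("id") {
            return Err(MissingId);
        }
        Ok(BuiltEntity { fields })
    }

    /// The constructor guarantees that `id` exists, so this returns a
    /// plain `String` rather than an `Option` or a `Result`.
    pub fn id(&self) -> String {
        self.fields["id"].clone()
    }
}
```

Since there are no mutating methods, the invariant checked in `try_make` holds for the lifetime of the value.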

mangas · 2023-04-21T12:22:04Z

graph/src/schema/input_schema.rs

+}
+
+#[derive(Debug, PartialEq)]
+pub struct Inner {


Does it make sense for Inner to be pub? I see the Deref, but Inner, while a common pattern, doesn't seem like a great name to export. I'd rather move the exported functionality to the outer type if possible.

lutter · 2023-04-21

I removed the impl Inner and the Deref.
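For readers unfamiliar with the pattern under discussion, here is a minimal sketch of keeping `Inner` private and exposing functionality on the outer type, which also makes the outer type cheap to clone. `SchemaHandle`, `has_type`, and `same_schema` are hypothetical names and are not taken from `input_schema.rs`.

```rust
use std::sync::Arc;

/// Private data; not part of the public API and not reachable via `Deref`.
#[derive(Debug, PartialEq)]
struct Inner {
    type_names: Vec<String>,
}

/// Public handle. Cloning copies only the `Arc`, not the schema data.
#[derive(Clone, Debug, PartialEq)]
pub struct SchemaHandle {
    inner: Arc<Inner>,
}

impl SchemaHandle {
    pub fn new(type_names: Vec<String>) -> Self {
        SchemaHandle { inner: Arc::new(Inner { type_names }) }
    }

    /// Exported functionality lives on the outer type and forwards to `Inner`.
    pub fn has_type(&self, name: &str) -> bool {
        self.inner.type_names.iter().any(|t| t == name)
    }

    /// True if both handles point at the same underlying data.
    pub fn same_schema(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.inner, &other.inner)
    }
}
```

Callers only ever see `SchemaHandle`, so `Inner` can be renamed or restructured without changing the public surface.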

mangas · 2023-04-21T12:24:17Z

graph/src/schema/input_schema.rs

+}
+
+impl Inner {
+    pub fn api_schema(&self) -> Result<ApiSchema, anyhow::Error> {


Could you also add some comments to these other exported fns?

lutter · 2023-04-21

A lot of this is just existing code moved here from other places. For better or worse, much of it doesn't have comments, and commenting it all would be a pretty big undertaking. I'll add some comments, but can't do that for everything here.

mangas · 2023-04-21T12:28:11Z

graph/src/util/intern.rs

@@ -0,0 +1,658 @@
+//! Interning of strings.
+//!
+//! This module provides an interned string pool `AtomPool` and a map-like


nitpick: I don't think this really is a util; it's more of a data structure that holds state. Also, intern doesn't really describe what the file contains, so I think either a reusable crate, if possible, or at least a different package name/path would benefit readability.

mangas · 2023-04-21T14:38:17Z

graph/src/schema/api.rs

@@ -71,6 +73,220 @@ impl TryFrom<&r::Value> for ErrorPolicy {
    }
 }

+#[derive(Debug)]
+pub struct ApiSchema {


Comment would be great

lutter · 2023-04-21

This is code that I moved here from somewhere else as-is; I'll add a comment.

mangas

See comments

lutter · 2023-04-21T16:55:41Z

Just for reference: I ran this PR in the integration cluster for ~ 24 hours without any PoI differences.

lutter · 2023-04-21T17:02:31Z

Rebased to latest master

leoyvens reviewed Apr 11, 2023

That3Percent reviewed Apr 12, 2023

lutter force-pushed the lutter/intern branch from 59e707b to b4363bf April 13, 2023 23:34

lutter changed the title ~~graph: A simple string pool~~ graph: Use a map with interned keys for Entity Apr 13, 2023

lutter marked this pull request as ready for review April 13, 2023 23:38

lutter requested a review from leoyvens April 13, 2023 23:38

lutter force-pushed the lutter/intern branch from b4363bf to 805e05c April 14, 2023 00:19

leoyvens approved these changes Apr 20, 2023

mangas reviewed Apr 21, 2023

mangas approved these changes Apr 21, 2023

lutter added 9 commits April 21, 2023 09:55

graph, graphql: Move graphql::schema to graph::schema

2988570

store: Remove graph-core depencency

d465428

all: Introduce InputSchema for subgraph schemas

52c3bda

We use Schema just as a basic utility, and differentiate in the rest of the code between input and api schemas

graph: Move some methods from Schema to ApiSchema

4f5207c

graph, store: Move fulltext types to graph::schema

3c33577

all: Move ApiSchema into graph::schema

d5c932c

all: Move remainder of graph::data::schema into graph::schema

cf0c562

graph, graphql, store: Build Object much earlier during query

cae4914

This avoids building an intermediate BTreeMap

all: Rewrite deserialize_with_layout to use an iterator

8332901

We need this to lower the requirements for implementors of FromEntityData, in particular so that we do not need an impl of Default

lutter added 18 commits April 21, 2023 09:55

graph, runtime: Remove Default implementation from Entity

624e909

graph, store: Use Entity.remove_null_fields where possible

f55947d

all: Make InputSchema a factory for Entity

f5e57d7

For now, we use a fake `AtomPool`, but eventually, all entities will need to be created in connection to an `AtomPool` that comes from the `InputSchema`. This also changes `Entity` from `HashMap<String, _>` to `HashMap<Word, _>`

graph: Make InputSchema cheap to clone

1d2f31a

store: Keep the input schema in the Layout

e1d902c

graph, store: Use the input schema to create entities from the store

cc6fbde

all: Remove Entity::new

5b7090c

graph: Remove unused TryFromValue for Entity

268efe3

graph, runtime: Make DataSourceContext distinct from Entity

1e90d42

That allows us to remove `Deserialize` from `Entity`

graph: Remove unused StableHash impl for Entity

2b8a1f0

graph: A simple string pool

10f963e

graph: Keep an AtomPool in InputSchema

468d2df

graph: Use an interned Object for Entity

abff069

graph, store: Take &str, not String, for the key in Entity.insert

2582131

graph: Use &str for the key in Entity.set

e623822

graph: Avoid Word::from in a few places in Entity

22c82ba

all: Do not panic in Entity::make when given uninterned keys

9a5c305

graph, store: Do not panic in Entity.insert when given an uninterned key

7dc748b

lutter force-pushed the lutter/intern branch from 6fa363e to e679b79 April 21, 2023 17:02

lutter added 4 commits April 21, 2023 11:09

graph, store: Do not panic in Entity::set when given an uninterned key

5a065c1

graph, store: Do not panic in Entity::try_make on uninterned keys

58811ae

graph: Reduce size of Atom to a u16

8a620be

With the current implementation, it doesn't save much memory compared to u32, but it makes sure we can fit all atoms into a u16, and enables a few more memory optimizations.

graph: Address review comments

d14c155

lutter force-pushed the lutter/intern branch from e679b79 to d14c155 April 21, 2023 18:12

lutter merged commit d14c155 into master Apr 25, 2023

lutter deleted the lutter/intern branch April 25, 2023 18:27

tsudmi mentioned this pull request Jul 5, 2023

[Bug] lfu cache error in 0.31.0 #4741
