Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

graph: Use a map with interned keys for Entity #4485

Merged
merged 31 commits into from
Apr 25, 2023
Merged

graph: Use a map with interned keys for Entity #4485

merged 31 commits into from
Apr 25, 2023

Conversation

lutter
Copy link
Collaborator

@lutter lutter commented Mar 21, 2023

This is an implementation of a string pool that I've had kicking around locally for a long time. The idea behind this is that we have a lot of places where we deal with maps whose keys are strings, and those keys come from a known, fixed set of strings. On the indexing side, those are the names of attributes from the subgraph schema, and for queries it's those names plus field aliases in the query.

Moving from maps with string keys to the Object struct, which is a map keyed on interned strings, should reduce memory consumption quite a bit.

The main reason to open this PR is to start a discussion around this approach before I go and plumb this into the places where we deal with such maps, mostly the Entity type during indexing and the r::Value type for queries.

Copy link
Collaborator

@leoyvens leoyvens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interner implementation looks good! Perhaps the next step, before a full refactor, is to switch the Entity implementation to use the interner Object, and see if that brings up any new concerns? One thing I'm curious about is the stable hash implementation, and if we can keep that consistent when switching to this.


/// Find the value for `key` in the object. Return `None` if the key is
/// not present.
pub fn get(&self, key: &str) -> Option<&V> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No fn get_by_atom? I'd hope we can get by atom in hot code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add one once we need it - for now, I was thinking of keeping atoms internal to entities, and not plumb this through everything, i.e., users of entities will look up by string for now.

graph/src/util/intern.rs Outdated Show resolved Hide resolved
graph/src/util/intern.rs Show resolved Hide resolved
@lutter
Copy link
Collaborator Author

lutter commented Apr 12, 2023

before a full refactor, is to switch the Entity implementation to use the interner Object, and see if that brings up any new concerns?

Yes, that's what I have been working on - it involves quite a bit of change since we need to get rid of Entity::new and similar methods that create entities and make the schema a factory for entities (that's where the string pool lives) I'll add to this PR once I have something that's halfway understandable.

One thing I'm curious about is the stable hash implementation, and if we can keep that consistent when switching to this.

It should follow the implementation for HashMap and not change the stable hash - I need a helper from the stable hash crate to become public to do that, but it shouldn't be very hard since the helper operates on an iterator over (&str, Value).

pub struct AtomPool {
base: Option<Arc<AtomPool>>,
base_sym: AtomInt,
atoms: Vec<Box<str>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are attempting to reduce memory consumption, would it be better to use Arc instead of Box here? The str in atoms and words are identical.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that might be a win depending on the average size of those strings. Arc introduces an overhead of 24 bytes, whereas str is 16 bytes, so the savings come down to how many strings fit into 8 bytes. But for now, the main win of interning will be to reduce hundreds of copies of the same string to two.

@lutter lutter changed the title graph: A simple string pool graph: Use a map with interned keys for Entity Apr 13, 2023
@lutter
Copy link
Collaborator Author

lutter commented Apr 13, 2023

This PR now integrates the AtomPool into graph-node and uses it to back Entity, i.e., the indexing side now uses objects with interned keys to represent entities. Most of this PR is concerned with moving schema-handling code between crates/modules with the goal of making the new InputSchema a factory for entities since it is now not allowed to create free-standing entities - they all need to be based on a schema (except for entities in tests)

This PR is best reviewed commit-by-commit; the initial introduction of interned keys for Entity panics quite a bit, but those panics are removed in later commits.

@lutter lutter marked this pull request as ready for review April 13, 2023 23:38
@lutter lutter requested a review from leoyvens April 13, 2023 23:38
Copy link
Collaborator

@leoyvens leoyvens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work! It's hard to review in depth since the PR is quite extensive, but the commits were meticulously organized as always, I read through them and they look good. We should give this a short run on the test cluster because refactoring Entity is sensitive and we should make sure PoIs are unaffected.

/// in the document and the names of all their fields
fn atom_pool(document: &s::Document) -> AtomPool {
let mut pool = AtomPool::new();
// These two entries are always required
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

three entries.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh .. two of them shouldn't be there. I'll have a follow-on PR that removes __typename and g$parent_id; they are only for the query side.

@lutter
Copy link
Collaborator Author

lutter commented Apr 20, 2023

I added one more commit to reduce the size of u16 - not much of a win for now, but it'll make it easier when we improve the memory layout of Object further.

@mangas
Copy link
Contributor

mangas commented Apr 21, 2023

Just for future reference, I think it would have been nicer to have the self-contained implementation and tests in one PR with a refactor in a separate PR

@@ -170,7 +177,8 @@ impl EntityCache {
}
None => {
let value = self.schema.id_value(&key)?;
entity.set("id", value);
// unwrap: our AtomPool always has an id in it
entity.set("id", value).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be useful to have a set_id fn that doesn't return an error? I think it would make it easier to use correctly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have another PR that gets rid of single-key changes to entities - they are only needed in tests. In general, an Entity is used to shuttle data between WASM and the store, and we don't really need to update an Entity once it's been constructed (this id setting logic then moves to store_set before the Entity is constructed) That has the nice side-effect that an Entity always has an id and Entity.id() can just return String

}

#[derive(Debug, PartialEq)]
pub struct Inner {
Copy link
Contributor

@mangas mangas Apr 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for Inner to be pub? I see the Deref but Inner while a common pattern doesn't seem like a great name to be exported. I'd rather move the exported functionality to the outer type of possible.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the impl Inner and the Deref

}

impl Inner {
pub fn api_schema(&self) -> Result<ApiSchema, anyhow::Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also add some comments to these other exported fns ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this is just existing code moved here from other places - for better or worse, a lot of that doesn't have comments, and commenting it all would be a pretty big undertaking. I'll add some comments, but can't do that for everything here.

@@ -0,0 +1,658 @@
//! Interning of strings.
//!
//! This module provides an interned string pool `AtomPool` and a map-like
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I don't think this really is a util, it's more of a data structure as state, also intern doesn't really describe what is contained in the file so I think either a re-usable crate if possible or at least a different package name/path would benefit readability

@@ -71,6 +73,220 @@ impl TryFrom<&r::Value> for ErrorPolicy {
}
}

#[derive(Debug)]
pub struct ApiSchema {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment would be great

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is code that I moved here from somewhere else as-is; I'll add a comment

Copy link
Contributor

@mangas mangas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

@lutter
Copy link
Collaborator Author

lutter commented Apr 21, 2023

Just for reference: I ran this PR in the integration cluster for ~ 24 hours without any PoI differences.

@lutter
Copy link
Collaborator Author

lutter commented Apr 21, 2023

Rebased to latest master

lutter added 4 commits April 21, 2023 11:09
With the current implementation, it doesn't save much memory compared to
u32, but it makes sure we can fit all atoms into a u16, and enables a few
more memory optimizations.
@lutter lutter merged commit d14c155 into master Apr 25, 2023
@lutter lutter deleted the lutter/intern branch April 25, 2023 18:27
@tsudmi tsudmi mentioned this pull request Jul 5, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants