graph: Use a map with interned keys for Entity
#4485
The interner implementation looks good! Perhaps the next step, before a full refactor, is to switch the `Entity` implementation to use the interner `Object`, and see if that brings up any new concerns? One thing I'm curious about is the stable hash implementation, and if we can keep that consistent when switching to this.

    /// Find the value for `key` in the object. Return `None` if the key is
    /// not present.
    pub fn get(&self, key: &str) -> Option<&V> {
No `fn get_by_atom`? I'd hope we can get by atom in hot code.
I'll add one once we need it - for now, I was thinking of keeping atoms internal to entities, and not plumb this through everything, i.e., users of entities will look up by string for now.
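To make the trade-off in this exchange concrete, here is a minimal sketch of the two lookup paths. It is not the PR's code: `Pool`, `Atom`, the `Vec`-backed `Object`, and `get_by_atom` are all simplified, hypothetical stand-ins. The point is only that a string lookup pays for one hash lookup in the pool before it can compare atoms, while an atom lookup skips that step.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical atom: a small integer standing in for an interned string.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Atom(u16);

/// Hypothetical, simplified pool mapping strings to atoms.
#[derive(Default)]
pub struct Pool {
    words: HashMap<String, Atom>,
}

impl Pool {
    /// Return the atom for `s`, adding it to the pool if needed.
    pub fn intern(&mut self, s: &str) -> Atom {
        if let Some(atom) = self.words.get(s) {
            return *atom;
        }
        let atom = Atom(self.words.len() as u16);
        self.words.insert(s.to_string(), atom);
        atom
    }

    /// Look up an existing atom without modifying the pool.
    pub fn lookup(&self, s: &str) -> Option<Atom> {
        self.words.get(s).copied()
    }
}

/// Simplified map keyed on atoms; entries live in a small vector.
pub struct Object<V> {
    pool: Arc<Pool>,
    entries: Vec<(Atom, V)>,
}

impl<V> Object<V> {
    pub fn new(pool: Arc<Pool>) -> Self {
        Object { pool, entries: Vec::new() }
    }

    /// Insert or overwrite the value stored under `atom`.
    pub fn insert(&mut self, atom: Atom, value: V) {
        let pos = self.entries.iter().position(|(a, _)| *a == atom);
        match pos {
            Some(i) => self.entries[i].1 = value,
            None => self.entries.push((atom, value)),
        }
    }

    /// String lookup: resolve the key in the pool, then search by atom.
    pub fn get(&self, key: &str) -> Option<&V> {
        self.pool.lookup(key).and_then(|atom| self.get_by_atom(atom))
    }

    /// Hypothetical hot-path lookup that skips the string hash entirely.
    pub fn get_by_atom(&self, atom: Atom) -> Option<&V> {
        self.entries.iter().find(|(a, _)| *a == atom).map(|(_, v)| v)
    }
}
```

Keeping atoms internal, as proposed above, would mean exposing only `get` for now and adding something like `get_by_atom` once callers can hold on to atoms.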
Yes, that's what I have been working on - it involves quite a bit of change since we need to get rid of
It should follow the implementation for |
pub struct AtomPool {
    base: Option<Arc<AtomPool>>,
    base_sym: AtomInt,
    atoms: Vec<Box<str>>,
If you are attempting to reduce memory consumption, would it be better to use `Arc` instead of `Box` here? The `str` in `atoms` and `words` are identical.
Yes, that might be a win depending on the average size of those strings. `Arc` introduces an overhead of 24 bytes, whereas `str` is 16 bytes, so the savings come down to how many strings fit into 8 bytes. But for now, the main win of interning will be to reduce hundreds of copies of the same string to two.
Entity
This PR now integrates the This PR is best reviewed commit-by-commit; the initial introduction of interned keys for |
Amazing work! It's hard to review in depth since the PR is quite extensive, but the commits were meticulously organized as always. I read through them and they look good. We should give this a short run on the test cluster because refactoring `Entity` is sensitive and we should make sure PoIs are unaffected.
/// in the document and the names of all their fields
fn atom_pool(document: &s::Document) -> AtomPool {
    let mut pool = AtomPool::new();
    // These two entries are always required
three entries.
Heh .. two of them shouldn't be there. I'll have a follow-on PR that removes `__typename` and `g$parent_id`; they are only for the query side.
I added one more commit to reduce the size of |
Just for future reference, I think it would have been nicer to have the self-contained implementation and tests in one PR, with the refactor in a separate PR.
@@ -170,7 +177,8 @@ impl EntityCache {
                 }
                 None => {
                     let value = self.schema.id_value(&key)?;
-                    entity.set("id", value);
+                    // unwrap: our AtomPool always has an id in it
+                    entity.set("id", value).unwrap();
Could it be useful to have a `set_id` fn that doesn't return an error? I think it would make it easier to use correctly.
I'll have another PR that gets rid of single-key changes to entities - they are only needed in tests. In general, an `Entity` is used to shuttle data between WASM and the store, and we don't really need to update an `Entity` once it's been constructed (this `id` setting logic then moves to `store_set` before the `Entity` is constructed). That has the nice side-effect that an `Entity` always has an `id` and `Entity.id()` can just return `String`.
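A rough sketch of the direction described in this reply, under stated assumptions: the `Entity` here is a toy with plain string keys and string values, and `try_make` and `EntityError` are hypothetical names, not the PR's API. The idea it illustrates is that if the constructor is the only way to build an entity and it rejects input without an `id`, no `set`/`unwrap` is needed later and `id()` can be infallible.

```rust
use std::collections::HashMap;

/// Hypothetical error for entity data that lacks an `id` field.
#[derive(Debug, PartialEq)]
pub enum EntityError {
    MissingId,
}

/// Toy entity: immutable after construction and always carrying an id.
#[derive(Debug, PartialEq)]
pub struct Entity {
    id: String,
    fields: HashMap<String, String>,
}

impl Entity {
    /// Build an entity from key/value pairs; fail if `id` is absent.
    pub fn try_make(pairs: Vec<(&str, &str)>) -> Result<Entity, EntityError> {
        let fields: HashMap<String, String> = pairs
            .into_iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        let id = fields.get("id").cloned().ok_or(EntityError::MissingId)?;
        Ok(Entity { id, fields })
    }

    /// Infallible, because construction guarantees the id exists.
    pub fn id(&self) -> String {
        self.id.clone()
    }

    /// Read-only access to any field.
    pub fn get(&self, key: &str) -> Option<&str> {
        self.fields.get(key).map(|v| v.as_str())
    }
}
```

Callers that need to fill in a missing `id` would do so on the raw key/value pairs before calling the constructor, which mirrors moving that logic ahead of entity construction.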
}

#[derive(Debug, PartialEq)]
pub struct Inner {
Does it make sense for `Inner` to be `pub`? I see the `Deref`, but `Inner`, while a common pattern, doesn't seem like a great name to be exported. I'd rather move the exported functionality to the outer type if possible.
I removed the `impl Inner` and the `Deref`.
}

impl Inner {
    pub fn api_schema(&self) -> Result<ApiSchema, anyhow::Error> {
Could you also add some comments to these other exported fns?
A lot of this is just existing code moved here from other places - for better or worse, a lot of that doesn't have comments, and commenting it all would be a pretty big undertaking. I'll add some comments, but can't do that for everything here.
@@ -0,0 +1,658 @@
+//! Interning of strings.
+//!
+//! This module provides an interned string pool `AtomPool` and a map-like
nitpick: I don't think this really is a util; it's more of a data structure as state. Also, `intern` doesn't really describe what is contained in the file, so I think either a re-usable crate, if possible, or at least a different package name/path would benefit readability.
@@ -71,6 +73,220 @@ impl TryFrom<&r::Value> for ErrorPolicy {
     }
 }

+#[derive(Debug)]
+pub struct ApiSchema {
Comment would be great
This is code that I moved here from somewhere else as-is; I'll add a comment
See comments
Just for reference: I ran this PR in the integration cluster for ~24 hours without any PoI differences.
We use Schema just as a basic utility, and differentiate in the rest of the code between input and api schemas
This avoids building an intermediate BTreeMap
We need this to lower the requirements for implementors of FromEntityData, in particular so that we do not need an impl of Default
For now, we use a fake `AtomPool`, but eventually, all entities will need to be created in connection to an `AtomPool` that comes from the `InputSchema`. This also changes `Entity` from `HashMap<String, _>` to `HashMap<Word, _>`
That allows us to remove `Deserialize` from `Entity`
Rebased to latest |
With the current implementation, it doesn't save much memory compared to u32, but it makes sure we can fit all atoms into a u16, and enables a few more memory optimizations.
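One likely reason a `u16` key saves little on its own is alignment padding, which the following sketch illustrates. `Entry16`, `Entry32`, and `next_atom` are hypothetical; the `u64` value stands in for whatever value type sits next to the key. The sketch also shows one way to enforce the "all atoms fit into a `u16`" guarantee, by refusing to hand out an atom once the pool is full.

```rust
use std::mem::size_of;

/// Atom representation discussed in this commit message.
pub type AtomInt = u16;

/// Hypothetical map entry with a 16-bit key next to a 64-bit value.
#[allow(dead_code)]
pub struct Entry16 {
    key: u16,
    value: u64,
}

/// The same entry with a 32-bit key.
#[allow(dead_code)]
pub struct Entry32 {
    key: u32,
    value: u64,
}

/// Sizes of both entry layouts; padding rounds each up to the
/// alignment of the value, so the narrower key alone gains nothing.
pub fn entry_sizes() -> (usize, usize) {
    (size_of::<Entry16>(), size_of::<Entry32>())
}

/// Hypothetical helper: the atom for the next string in a pool that
/// already holds `len` strings, or `None` if it would not fit in a u16.
pub fn next_atom(len: usize) -> Option<AtomInt> {
    AtomInt::try_from(len).ok()
}
```

The further optimizations mentioned above would then come from layouts that can actually pack the two-byte key, which this sketch does not attempt.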
This is an implementation of a string pool that I've had kicking around locally for a long time. The idea behind this is that we have a lot of places where we deal with maps whose keys are strings, and those keys come from a known, fixed set of strings. On the indexing side, those are the names of attributes from the subgraph schema, and for queries it's those names plus field aliases in the query.
Moving from maps with string keys to the `Object` struct, which is a map keyed on interned strings, should reduce memory consumption quite a bit.

The main reason to open this PR is to start a discussion around this approach before I go and plumb this into the places where we deal with such maps, mostly the `Entity` type during indexing and the `r::Value` type for queries.
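The split between schema attribute names and per-query field aliases suggests a pool that can be layered on top of a shared base pool. The sketch below is a guess at how the `base` and `base_sym` fields of the `AtomPool` struct under review might be used; `LayeredPool`, `child`, `lookup`, and `get` are hypothetical names, and the semantics (atoms below `base_sym` belong to the base) are an assumption, not something the thread confirms.

```rust
use std::collections::HashMap;
use std::sync::Arc;

pub type AtomInt = u16;

/// Hypothetical layered pool: a child extends a shared, read-only base.
pub struct LayeredPool {
    base: Option<Arc<LayeredPool>>,
    /// Assumed meaning: number of atoms owned by all base pools.
    base_sym: AtomInt,
    atoms: Vec<Box<str>>,
    words: HashMap<Box<str>, AtomInt>,
}

impl LayeredPool {
    pub fn new() -> Self {
        LayeredPool { base: None, base_sym: 0, atoms: Vec::new(), words: HashMap::new() }
    }

    /// Create a pool whose atoms continue where `base` left off.
    pub fn child(base: Arc<LayeredPool>) -> Self {
        // Sketch only: overflow of the u16 range is not handled.
        let base_sym = base.base_sym + base.atoms.len() as AtomInt;
        LayeredPool { base: Some(base), base_sym, atoms: Vec::new(), words: HashMap::new() }
    }

    /// Find the atom for `s` in this pool or any of its bases.
    pub fn lookup(&self, s: &str) -> Option<AtomInt> {
        if let Some(&atom) = self.words.get(s) {
            return Some(atom);
        }
        self.base.as_ref().and_then(|b| b.lookup(s))
    }

    /// Return the atom for `s`, adding it to this layer if it is new.
    pub fn intern(&mut self, s: &str) -> AtomInt {
        if let Some(atom) = self.lookup(s) {
            return atom;
        }
        let atom = self.base_sym + self.atoms.len() as AtomInt;
        self.atoms.push(s.into());
        self.words.insert(s.into(), atom);
        atom
    }

    /// Map an atom back to its string, delegating to the base if needed.
    pub fn get(&self, atom: AtomInt) -> Option<&str> {
        if atom < self.base_sym {
            self.base.as_ref().and_then(|b| b.get(atom))
        } else {
            self.atoms.get((atom - self.base_sym) as usize).map(|s| &**s)
        }
    }
}
```

Under these assumptions, one schema-level pool could be built once and shared through `Arc`, while each query adds only its aliases in a short-lived child.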