
Cache successfully parsed and validated documents for future requests. #2111

Merged: 15 commits into release-2.4.0 on Jan 23, 2019

Conversation

@abernix (Member) commented Dec 18, 2018

This PR builds upon the housekeeping work done in #2107, #2108 and #2109 and implements an in-memory cache store for parsed and validated documents in an effort to increase server performance.

Upon receiving a request, one of the many steps Apollo Server must perform is parsing and validating the document via graphql-js's parse and validate functions. Parsing and validating the document results in a DocumentNode (an AST), which is used for the eventual execution (via graphql-js's execute).

For relatively small documents, this takes a small fraction of time, though with more complex documents the computational sink becomes larger and less predictable.

These performance characteristics aren't always predictable, but when processing the same operation over and over again, we can avoid the expensive cost of parsing and validating it again by storing the resulting DocumentNode, rather than rebuilding what is ultimately a re-forged copy of an effectively identical DocumentNode.

This PR introduces that technique by utilizing the SHA-256 hex digest of the operation document itself as a key and storing identical DocumentNodes in an InMemoryLRUCache (powered under-the-hood by the excellent lru-cache package!).

When using GraphQL variables (rather than string/numeric literals within the GraphQL document itself), this means that repeat operations — even with dynamically changing variables — can still be ready to execute based on the parsing and validation of a similar, previous invocation, and cut a substantial amount of time off the request round-trip time.
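
To make the mechanism concrete, here is a minimal sketch of the idea, using a plain Map and a standalone helper in place of the actual InMemoryLRUCache and request-pipeline code:

import { createHash } from 'crypto';
import { parse, validate, DocumentNode, GraphQLSchema } from 'graphql';

// Simplified stand-in for the real store (the PR uses InMemoryLRUCache).
const documentStore = new Map<string, DocumentNode>();

function getDocument(schema: GraphQLSchema, query: string): DocumentNode {
  // The SHA-256 hex digest of the raw operation string is the cache key.
  const key = createHash('sha256').update(query).digest('hex');

  const cached = documentStore.get(key);
  if (cached) {
    // Same document seen before: skip parsing and validation entirely.
    return cached;
  }

  const document = parse(query);
  const errors = validate(schema, document);
  if (errors.length > 0) {
    throw errors[0];
  }

  // Only successfully parsed *and* validated documents are stored.
  documentStore.set(key, document);
  return document;
}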

Some benchmarks might help demonstrate this better. For that purpose, I used relatively flat GraphQL documents (and corresponding schemas), benchmarked locally using Apache JMeter. The resolvers involved in this test are all just so-called "default resolvers": they do nothing but return null, and should accurately demonstrate the "win" from this optimization technique:

Benchmarks

p99 = 99th percentile.

Query: query Benchmark { Field_1 }
  Without DocumentNode caching: Avg.: 19.96ms | p99: 56.00ms | Throughput: 925.45
  With DocumentNode caching:    Avg.: 18.85ms ⚡️ | p99: 58.00ms | Throughput: 979.08 ⚡️

Query: query Benchmark { Field_1 Field_2 ...until... Field_100 }
  Without DocumentNode caching: Avg.: 34.89ms | p99: 74.00ms | Throughput: 530.99
  With DocumentNode caching:    Avg.: 20.18ms ⚡️ | p99: 55.99ms ⚡️ | Throughput: 917.70 ⚡️

Query: query Benchmark { Field_1 Field_2 ...until... Field_5000 }
  Without DocumentNode caching: Avg.: 954.87ms | p99: 1007.00ms | Throughput: 19.22
  With DocumentNode caching:    Avg.: 310.07ms ⚡️ | p99: 390.00ms ⚡️ | Throughput: 59.64 ⚡️

The single-field operations are a wash, but there are substantial improvements as the query document's complexity increases! Looking at the middle-of-the-road (and more reasonable) 100-field request, the 99th percentile (as in, 99% of requests returned within that timeframe) dropped from 74.00ms to 55.99ms, roughly 32% faster!

It's worth pointing out that I had a 1000ms request/response timeout configured on the test runner, so the results of the "5000 fields, without caching" test aren't showing the true depth of the situation, since 88.10% of those operations "Read timed out"! (Of course, requesting 5000 top-level fields probably raises a number of concerns of its own, so if that metric piques your concern, I certainly encourage you to dive in more!) On the plus side, with the parsed and validated document cached, nothing timed out at all and, based on the average response time, it performed roughly 3x faster (954.87ms → 310.07ms). 🏎💨🎠

For anyone interested in more details, I've posted the more complete results at https://abernix.github.io/documentStoreBenchmarks, in the form of JMeter reports. You can click through to them from the index page.

Of course, there are still a couple of things to be figured out here, particularly in the realm of configuration surface area. Currently, this utilizes a 30MiB cache, which might be enough for many/most configurations, but I'd like to figure out exactly what the sensible defaults might be here (and what might need tuning) before opening it up for configuration. In fact, if we don't need to add configuration, that would be even better!

I'm excited to see how this performs for others!

Fixes #1566

@abernix (Member, Author) commented Dec 18, 2018

For anyone interested in trying this out and providing feedback, this has been published under the apollo-server "next" tag as:

Integrations (e.g. express, hapi, koa) will also find their respective versions accordingly (e.g. [email protected], [email protected], [email protected], respectively; see 7697623 above for the full list).

// If we still don't have a document, we'll need to parse and validate it.
// With success, we'll attempt to save it into the store for future use.
if (!document) {
  const parsingDidEnd = await dispatcher.invokeDidStartHook(
Member commented on this diff:

Could there be any downsides for not calling parsingDidStart/DidEnd in this code path (and not providing some alternate callback to take its place)?

@abernix (Member, Author) replied:

If someone has implemented critical parsingDidStart or parsingDidEnd functionality that didn't isolate its reach to that of the parsing phase, then they might find these callbacks are no longer invoked where they used to be.

Since we never documented or officially supported the graphql-extensions API, and the request pipeline plugin hooks are new (though still undocumented; see #2008), I think the best we can do is stay true to the name of the methods and stand by the fact that parsing did not start or end; I don't think we should call parsingDidStart and parsingDidEnd if those behaviors did not actually happen.

Presumably, those who wanted behavior which occurred when execution started or ended would be using the executionDidStart or executionDidEnd hooks (and more generally, requestDidStart, requestDidEnd, willSendResponse, etc.).

I don't think there are any alternate callbacks that we can provide that will make this any easier, and I don't think such hooks would provide intrinsic value.
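
For illustration, a rough sketch of the control flow being discussed, with the plugin dispatcher reduced to a single stand-in callback (parameter shapes are simplified and this is not the actual pipeline code):

import { parse, DocumentNode } from 'graphql';

// Sketch only: the parsing hooks fire solely when parsing actually happens.
async function resolveDocument(
  queryString: string,
  queryHash: string,
  documentStore: {
    get(key: string): Promise<DocumentNode | undefined>;
    set(key: string, doc: DocumentNode): Promise<void>;
  },
  // Simplified stand-in for the plugin dispatcher's parsingDidStart hook.
  parsingDidStart: () => (err?: Error) => void,
): Promise<DocumentNode> {
  const cached = await documentStore.get(queryHash);
  if (cached) {
    // Cache hit: parsing genuinely does not start, so the hook is never invoked.
    return cached;
  }

  const parsingDidEnd = parsingDidStart();
  try {
    const document = parse(queryString);
    parsingDidEnd();
    await documentStore.set(queryHash, document);
    return document;
  } catch (syntaxError) {
    parsingDidEnd(syntaxError as Error);
    throw syntaxError;
  }
}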

@mxstbr commented Jan 10, 2019

We have a bunch of really big queries for https://github.com/withspectrum/spectrum, so I'm excited about this change as it should provide a solid speedup for our use case!

Will see if we can get 2.4.0-alpha.0 into production and report back with some solid metrics 💯 EDIT: PRs open: withspectrum/spectrum#4533

mxstbr pushed a commit to withspectrum/spectrum that referenced this pull request Jan 10, 2019
This implements query parsing and validation caching, reference
apollographql/apollo-server#2111, and should
thusly provide a nice speedup for our frontend's queries!
@mxstbr commented Jan 14, 2019

Cross-posting from withspectrum/spectrum#4533: Reassessing after a couple days in prod, it seems to have helped @abernix!

p99 response time over:

  • Three days before deploying this: 250.78ms
  • Three days after deploying this: 230.65ms

👍

… use.

Without this change, the `document` property was not set on the
`requestContext` for consumption by request pipeline plugins.

To further guard against this oversight, I've removed the extra `document`
variable which was being used as scoped state for the document and switched to
directly using (and assigning to) the `requestContext.document`.

Nice catch, @glasser!

Ref: https://github.com/apollographql/apollo-server/pull/2111/files#r247617469
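
A tiny sketch of the shape of that change (types and names are illustrative, not the actual pipeline code):

import { DocumentNode } from 'graphql';

interface RequestContextLike {
  document?: DocumentNode;
}

// Assigning directly to requestContext.document keeps plugins and the
// pipeline looking at the same value, with no separate local copy to forget.
async function attachCachedDocument(
  requestContext: RequestContextLike,
  documentStore: { get(key: string): Promise<DocumentNode | undefined> },
  queryHash: string,
): Promise<void> {
  requestContext.document = await documentStore.get(queryHash);
}
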
The parsed/validated cache store is on by default.  While it could be
disabled in theory, in practice it cannot be, since it's an internal
property of the request pipeline processor class.

See confusion here:

  withspectrum/spectrum#4533 (comment)
…ore`.

While the implementation of the `documentStore` is currently simple enough
to never throw (it is in-memory), it makes some sense to guard against future
extended functionality where an exception might be raised.

Since storing this object in a distributed memory store isn't currently
feasible, I'm not sure what such an exception would be right now, but I
don't mind being proactive!

Ref: https://github.com/apollographql/apollo-server/pull/2111/files#r247618501
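
In code terms, the guard amounts to something like the following sketch (names are illustrative):

import { DocumentNode } from 'graphql';

// A failed write to the document store is only a lost optimization,
// so it should never fail the request itself.
async function trySaveDocument(
  documentStore: { set(key: string, doc: DocumentNode): Promise<void> },
  queryHash: string,
  document: DocumentNode,
): Promise<void> {
  try {
    await documentStore.set(queryHash, document);
  } catch (err) {
    console.warn('Could not store validated document.', err);
  }
}
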
Previously, this used the `JSON.stringify` length, but this is slightly more
aware.
…class.

The object-size approximation used for cache-eviction purposes in the
`InMemoryLRUCache` implementation (via `lru-cache`) was a short-term home
for extensible logic which is better located within `ApolloServerBase`.

This is particularly important since future logic may necessitate knowing or
understanding the current size (roughly, memory usage) of the in-memory
storage.  Effective immediately, this adds support for providing a `dispose`
function which is called when an object is purged from the cache to make
room for another.
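
As a rough illustration of those two pieces, here is a sketch of how lru-cache's own options can express them; the ~30MiB cap and the JSON.stringify-based size approximation are illustrative stand-ins rather than the exact logic now living in ApolloServerBase:

import LRUCache from 'lru-cache';
import { DocumentNode } from 'graphql';

let disposedThisInterval = 0;

const store = new LRUCache<string, DocumentNode>({
  // Interpret "length" as an approximate byte size and cap the total at ~30MiB.
  max: 30 * 1024 * 1024,
  length: (document) => Buffer.byteLength(JSON.stringify(document), 'utf8'),
  // Called whenever an entry is purged from the cache to make room for another.
  dispose: () => {
    disposedThisInterval++;
  },
});
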
@abernix abernix force-pushed the abernix/cache-parsed-validated branch from 2a987a9 to be71620 Compare January 16, 2019 10:47
@abernix (Member, Author) commented Jan 16, 2019

That's great news, @mxstbr!

I'm pretty close to merging this and iterating on it in subsequent follow-up work, but I think it's worth a last peek before doing so.

In particular, while those who have implemented this so far have had consistently improved response times to show for it, I've added an additional commit to this PR and cut another -alpha.1 release to gather additional data, which will hopefully show that this cache store's default size is sufficient and that cache evictions are low.

Therefore, the versions indicated in the Publish commit above (33d1f82; e.g. [email protected]) will console.debug some statistics every 60 seconds which show the document store usage. e.g.:

[documentStore] 331408 / 31457280 (1%). Disposed of 0 documents this interval.

My intention is to remove this spamminess before merging, but I thought interested parties might want to take a look. So far, in our own dogfooding on Apollo Engine, this seems to be working well as currently implemented.
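
Continuing the lru-cache sketch from the commit notes above, the temporary logging amounts to roughly the following (illustrative only; the alpha builds did this internally):

const CACHE_MAX = 30 * 1024 * 1024; // 31457280, the ~30MiB default cap

setInterval(() => {
  const used = store.length; // sum of the approximate entry sizes currently held
  const pct = Math.round((used / CACHE_MAX) * 100);
  console.debug(
    `[documentStore] ${used} / ${CACHE_MAX} (${pct}%). ` +
      `Disposed of ${disposedThisInterval} documents this interval.`,
  );
  disposedThisInterval = 0;
}, 60 * 1000);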

… seconds."

This reverts commit 7a0d0e6, as I intended
when I originally introduced it.

Ref: #2111 (comment)
@abernix (Member, Author) commented Jan 23, 2019

As I mentioned I would in #2111 (comment), I've reverted 7a0d0e6 and published 2.4.0-alpha.2 without that commit.

I'm going to merge this PR into a release-2.4.0 branch which will be a staging ground for anything else 2.4.0-related (#1971, I'm looking at you!).

@abernix abernix changed the base branch from master to release-2.4.0 January 23, 2019 13:09
@abernix abernix merged commit 7d54688 into release-2.4.0 Jan 23, 2019
@abernix abernix deleted the abernix/cache-parsed-validated branch January 23, 2019 13:10
@abernix abernix mentioned this pull request Jan 23, 2019
abernix added a commit that referenced this pull request Jan 25, 2019
Since we already write to the persisted query cache asynchronously (and
intentionally do not await the Promise), this won't have any effect on the
current behavior of when the persisted query cache is written to in the
event of an executable request; but if an operation comes in and doesn't
parse or validate, we'll avoid wasting cache space on an operation that will
never execute.

Additionally, in a similar light, if a plugin throws an error which stops
execution, we can avoid the side-effect of writing to the persisted query
cache.

In terms of the APQ behavior, while this could cause additional round-trips
for a client which repeatedly sends an invalid operation, that operation is
never going to successfully finish anyway.  While developer tooling will
help avoid this problem in the first place, this refusal to store an invalid
operation in the APQ cache could actually help diagnose a failure since the
full operation (rather than just the SHA256 of that document) will appear
in the browser's dev-tools on the retry.

If we're looking to spare parsing and validating documents which we know are
going to fail, I think that's going to be a better use of the (new)
`documentStore` cache (#2111), since it's in-memory and can accommodate a
more complex data structure which includes the validation errors from the
previous attempt (which will, of course, be the same when validating the same
operation against the same schema again).  As the PR that introduced that
feature shows, sparing those additional parses and validations (cached APQ
documents still need to be parsed and validated, currently) will provide a
bigger performance win overall.

Ref: #2111
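
A sketch of the ordering change described above, with the pipeline reduced to a single function (names and the cache-key prefix are illustrative):

interface SimpleCache {
  set(key: string, value: string): Promise<void>;
}

// Parse/validate first; only a document that can actually execute is worth
// registering in the APQ cache, and the write remains fire-and-forget.
async function registerPersistedQuery(
  persistedQueryCache: SimpleCache,
  queryHash: string,
  query: string,
  parseAndValidate: (query: string) => void, // throws on syntax/validation errors
): Promise<void> {
  parseAndValidate(query); // any error here means we never touch the APQ cache

  // Deliberately not awaited by the caller's response path.
  persistedQueryCache
    .set(`apq:${queryHash}`, query)
    .catch((err) => console.warn('Could not store persisted query.', err));
}
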
akpx added a commit to akpx/spectrum that referenced this pull request Feb 4, 2019
abernix added a commit that referenced this pull request Feb 7, 2019
@abernix (Member, Author) commented Feb 8, 2019

@stephenh commented:

Just as an FYI to others who find this: we upgraded to apollo-server-express 2.4.0 and had to roll back due to OOMs. We usually run pretty lean on RAM, 600MB/container, and use ~50% of it; with 2.4.0 (presumably because of this change) our RAM usage spiked to ~100%. Which is fine, we'll probably allocate more RAM to use this feature. Disclaimer: we haven't taken heap dumps yet, so it could have been something else, but the apollo-server bump from ~2.2 to 2.4 was the only change in the release.

@glasser (Member) commented Feb 20, 2019

Huh, the intention is for this cache to take about 30MB. If you're seeing more, that sounds like a bug!

It's also evidence that it's time to add the configurability @abernix alluded to, but it's more important to make sure there isn't just a bug in the size calculation or something...

@stephenh commented:
@glasser 30mb does sound much more reasonable. :-) I'll work on confirming the root cause but it probably won't be super quick just given current priorities.

@abernix (Member, Author) commented Feb 20, 2019

@stephenh We've been running this in production ourselves for a while now and I haven't witnessed that problem, but I certainly wouldn't rule out the possibility!

If you're able, I'm curious if you could try using 2.4.0-alpha.2, which as I alluded to in the comment above (#2111 (comment)) has hard-coded console.log-ging every 60 seconds which should expose the cache usage. The stability of that -alpha should be fine, but I'm also happy to cut you another release off the current master with that logging enabled if it's any use.

My hope in offering this suggestion is that it might be something you could quickly opt into, suffer the temporary consequences of the OOM and feed back that information to help identify if this is indeed at fault.

@stephenh commented:

@abernix sure, I can try 2.4.0-alpha.2; that sounds like a good idea. I'll have to jury-rig a dark/canary-type setup (likely manually registering a machine into the ELB for a bit, or something like that), so it will take a bit to work on it in between other things.

@shackpank commented Mar 5, 2019

I'm experiencing the same issue here - seems like you may or may not encounter it based on client behaviour. In our case, there's a mutation like this, where the input will be different for every user, and the client is constructing the document with the input inline:

mutation {
  storeSomeId(id: "sajbfjkbdkfjbfk") { id }
}

Repeatedly running this mutation locally with a randomised input quickly fills the cache up. The debugging does indicate older documents are being dropped when the limit is reached, but the memory usage difference between an empty and a full cache was more like 138MB than 30MB, and memory usage continues to rise (though more slowly, roughly 1MB per 2k requests; edit: comparing heap dumps suggests this is something internal to the lru-cache module and would likely need investigation over there) even after the cache is "full".

Assuming you control the client, the issue can be avoided (and much better use made of this cache) by rewriting the query like this to pass the input as a variable - this means the cache does not grow at all with additional requests:

mutation storeSomeId($id:String!) {
  storeSomeId(id:$id) { id }
}
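
On the wire, that means the document string is byte-for-byte identical on every request (so its cache key repeats), and only the variables change. An illustrative request body:

// Illustrative only: the same document text every time, with varying data in "variables".
const body = JSON.stringify({
  query: 'mutation storeSomeId($id: String!) { storeSomeId(id: $id) { id } }',
  variables: { id: 'sajbfjkbdkfjbfk' },
});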

@abernix (Member, Author) commented Mar 7, 2019

@shackpank Thank you for sharing these helpful details and taking the time to look into this, particularly the follow-up edit which questions lru-cache's potential involvement. Super appreciated!

It sounds like there are two things:

  • Cache limit exceedance: slower, continued growth past the cache limit. Potentially something we're doing incorrectly, but also potentially something in lru-cache.
  • Cache misrepresentation: the roughly-estimated "30MB" cache limit (an intention!), which mainly aims to facilitate eviction and prevent the cache from becoming too full, may need to be re-analyzed. Measuring actual heap object size here would sure be nice, but JavaScript being JavaScript, we may just need another creative approach. This isn't configurable right now, and this lack of certainty about what the limit actually means in terms of memory usage is a good reason why. The overarching goal here is just to make sure that someone couldn't send so many unique documents to the server that it would fill up the cache.

I'm curious, when you say that the switching of string literals to variables:

means the cache does not grow at all with additional requests

...does this indicate that the cache limit exceedance also stopped? Or was it just no longer showing up on the radar? I'd expect this to be continuing regardless, if enough unique DocumentNodes (with or without variables) were being sent.

The lru-cache package made a major bump to v5 in November via #2004. Maybe worth swapping that out for a v4 version? Is that something you could try? I somewhat doubt this is something with that package (more likely something we're doing), but it's worth digging around, I guess.

Since, overall, the cache is still bringing performance benefits even though it has exceeded its intended size allotment (in your case, 138MB), I'm actually mostly concerned about the slower growth, so hopefully we can get to the bottom of that. Still, we'll need to adjust these expectations, and possibly figure out another way to estimate AST sizes.

However, I'm glad you've discovered that string literals pose particular problems for cache effectiveness and that using GraphQL variables helps in this department. Variables offer other benefits as well (e.g. aggregation for metrics reporting), so they're highly recommended!

@sebnow (Contributor) commented Apr 15, 2019

I can confirm the cache causes a memory leak. I wrote a script to generate random distinct queries and request them with wrk2. Taking a heap snapshot shows that it is retaining well above 30MB.

[Screenshot (2019-04-15): heap snapshot showing the cache retaining well above 30MB]

In a production environment I am noticing a ~30MB/hr increase in memory. After logging distinct queries, I have found that, on average, 4 unique queries (with literals) are requested per second.

graphql-js: 14.2.1
apollo-server-core: 2.4.8
apollo-server-caching: 0.3.1
