Cache successfully parsed and validated documents for future requests. #2111
Conversation
- [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected]
For anyone interested in trying this out and providing feedback, this has been published under the versions listed above.
```js
// If we still don't have a document, we'll need to parse and validate it.
// With success, we'll attempt to save it into the store for future use.
if (!document) {
  const parsingDidEnd = await dispatcher.invokeDidStartHook(
```
Could there be any downsides to not calling `parsingDidStart`/`DidEnd` in this code path (and not providing some alternate callback to take their place)?
If someone has implemented critical `parsingDidStart` or `parsingDidEnd` functionality that didn't isolate its reach to that of the parsing phase, then they might find these callbacks are no longer invoked where they used to be.
Since we never documented or officially supported the `graphql-extensions` API, and the request pipeline plugin hooks are new (though still undocumented; see #2008), I think the best we can do is stay true to the name of the methods and stand by the fact that parsing did not start or end; I don't think we should call `parsingDidStart` and `parsingDidEnd` if those behaviors did not actually happen.
Presumably, those who wanted behavior which occurred when execution started or ended would be using the `executionDidStart` or `executionDidEnd` hooks (and more generally, `requestDidStart`, `requestDidEnd`, `willSendResponse`, etc.).
I don't think there are any alternate callbacks that we can provide that will make this any easier, and I don't think such hooks would provide intrinsic value.
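The division of labor among hooks described above can be sketched as follows. This is a minimal, illustrative sketch: the hook names come from the still-undocumented plugin API referenced in #2008, and the plugin object shape here is approximate rather than the official API.

```javascript
// Collected hook invocations, for illustration/testing.
const events = [];

// Approximate shape of a request-pipeline plugin (hook names per the
// undocumented API referenced in #2008; treat this shape as illustrative).
const observabilityPlugin = {
  requestDidStart() {
    events.push('requestDidStart'); // fires on every request
    return {
      parsingDidStart() {
        // Skipped entirely when a cached DocumentNode is used, since
        // parsing genuinely did not start.
        events.push('parsingDidStart');
      },
      executionDidStart() {
        events.push('executionDidStart'); // fires whenever execution begins
      },
    };
  },
};

// Simulate a cache-miss request followed by a cache-hit request.
for (const cacheHit of [false, true]) {
  const handlers = observabilityPlugin.requestDidStart();
  if (!cacheHit) handlers.parsingDidStart();
  handlers.executionDidStart();
}
```

The point: per-request work belongs in `requestDidStart`/`willSendResponse`, which fire regardless of the cache, while `parsingDidStart` only fires when parsing actually happens.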
We have a bunch of really big queries for https://github.com/withspectrum/spectrum, so I'm excited about this change, as it should provide a solid speedup for our use case! Will see if we can get 2.4.0-alpha.0 into production and report back with some solid metrics 💯 EDIT: PRs open: withspectrum/spectrum#4533
This implements query parsing and validation caching (see apollographql/apollo-server#2111) and should thus provide a nice speedup for our frontend's queries!
Cross-posting from withspectrum/spectrum#4533: Reassessing after a couple of days in prod, it seems to have helped, @abernix! p99 response time over: [charts omitted]
👍
… use. Without this change, the `document` property was not set on the `requestContext` for consumption by request pipeline plugins. To further guard against this oversight, I've removed the extra `document` variable which was being used as scoped state for the document and switched to directly using (and assigning to) the `requestContext.document`. Nice catch, @glasser! Ref: https://github.com/apollographql/apollo-server/pull/2111/files#r247617469
The parsed/validated cache store is on by default. While it could in theory be disabled, in practice it cannot be, since it's an internal property of the request pipeline processor class. See confusion here: withspectrum/spectrum#4533 (comment)
Of my own making!
…ore`. While the implementation of the `documentStore` is currently simple enough to never throw (it is in-memory), it makes some sense to guard against future extended functionality where an exception might be raised. Since storing this object in a distributed memory store isn't currently feasible, I'm not sure what such an exception would be right now, but I don't mind being proactive! Ref: https://github.com/apollographql/apollo-server/pull/2111/files#r247618501
Previously, this used the `JSON.stringify` length, but this is slightly more aware.
…class. The implementation of object-size approximation which is used for cache eviction purposes in the `InMemoryLRUCache` implementation (via `lru-cache`) was a short-term location for extensible logic which is better located within `ApolloServerBase`. This is particularly important since future logic may necessitate knowing or understanding the current size (roughly, memory usage) of the in-memory storage. Effective immediately, this adds support for providing a `dispose` function which is called when an object is purged from the cache to make room for another.
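A minimal sketch of the `dispose` behavior described here, using a hypothetical stand-in for `InMemoryLRUCache`/`lru-cache`: the class below is illustrative, not the actual implementation, though the option names (`max`, `length`, `dispose`) mirror those of the `lru-cache` package.

```javascript
// Illustrative stand-in for InMemoryLRUCache; a hypothetical sketch,
// not the actual implementation.
class TinyLRU {
  constructor({ max, length, dispose }) {
    this.max = max;          // size budget
    this.length = length;    // approximates an entry's size
    this.dispose = dispose;  // called when an entry is evicted
    this.map = new Map();    // Map preserves insertion (recency) order
    this.size = 0;
  }

  set(key, value) {
    if (this.map.has(key)) this.delete(key);
    this.map.set(key, value);
    this.size += this.length(value);
    // Evict least-recently-used entries until we're back under budget.
    while (this.size > this.max && this.map.size > 1) {
      const oldest = this.map.keys().next().value;
      this.delete(oldest, /* evicted */ true);
    }
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert to refresh recency
    this.map.set(key, value);
    return value;
  }

  delete(key, evicted = false) {
    if (!this.map.has(key)) return;
    const value = this.map.get(key);
    this.map.delete(key);
    this.size -= this.length(value);
    if (evicted && this.dispose) this.dispose(key, value);
  }
}

const evicted = [];
const store = new TinyLRU({
  max: 10,
  length: (value) => value.length,
  dispose: (key) => evicted.push(key),
});
store.set('a', 'xxxxxx'); // size 6, under budget
store.set('b', 'xxxxxx'); // size 12, over budget: 'a' is disposed
```

The `dispose` hook is what lets the owning class (here, `ApolloServerBase`) observe evictions and reason about the store's effective memory usage.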
In an effort to see how effective this cache is in production during this alpha phase, we'll print out the stats on the document store every 60 seconds.
- [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected] - [email protected]
That's great news, @mxstbr! I'm pretty close to merging this and iterating on it in subsequent follow-up work, but I think it's worth a last peek before doing so. In particular, while those who have implemented this so far have had consistently improved response times to show for it, I've added an additional commit to this PR and cut another alpha; the versions indicated in the list above reflect it.
My intention is to remove this spamminess before merging, but I thought interested parties might want to take a look. So far, in our own dog-fooding on Apollo Engine, this seems to be working well as currently implemented.
… seconds." This reverts commit 7a0d0e6, as I intended when I originally introduced it. Ref: #2111 (comment)
As I mentioned I would in #2111 (comment), I've reverted 7a0d0e6 and published new versions. I'm going to merge this PR into a release branch.
Since we already write to the persisted query cache asynchronously (and intentionally do not await the Promise), this won't have any effect on when the persisted query cache is written to for an executable request; but if an operation comes in and doesn't parse or validate, we'll avoid wasting cache space on an operation that will never execute. In a similar light, if a plugin throws an error which stops execution, we can avoid the side effect of writing to the persisted query cache.

In terms of the APQ behavior, while this could cause additional round-trips for a client which repeatedly sends an invalid operation, that operation is never going to successfully finish anyway. While developer tooling will help avoid this problem in the first place, refusing to store an invalid operation in the APQ cache could actually help diagnose a failure, since the full operation (rather than just the SHA-256 of that document) will appear in the browser's dev-tools on the retry.

If we're looking to spare parsing and validating documents which we know are going to fail, I think that's a better use of the (new) `documentStore` cache (#2111), since it's in-memory and can accommodate a more complex data structure which includes the validation errors from the previous attempt, which will, of course, be the same when validating the same operation against the same schema again. As the PR that introduced that feature shows, sparing those additional parses and validations (cached APQ documents still need to be parsed and validated, currently) will provide a more substantial performance win overall. Ref: #2111
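A minimal sketch of the ordering change described above, with hypothetical names; in apollo-server the cache write is a deliberately non-awaited Promise, shown synchronously here for clarity, and `validate` stands in for `graphql-js` parsing plus validation.

```javascript
// Persisted-query (APQ) cache: SHA-256 hash -> full query text.
const apqCache = new Map();

function processRequest(sha256Hash, query, validate) {
  const errors = validate(query);
  if (errors.length > 0) {
    // Invalid: never stored. The client's retry will carry the full
    // operation text, which aids debugging in browser dev-tools.
    return { errors };
  }
  // In apollo-server this write is fire-and-forget (a non-awaited
  // Promise); done synchronously here for clarity.
  apqCache.set(sha256Hash, query);
  return { data: {} };
}

// Toy validator: anything not ending in '}' is treated as a syntax error.
const validate = (q) => (q.trim().endsWith('}') ? [] : ['Syntax Error']);

processRequest('hash-bad', '{ broken', validate); // rejected, not cached
processRequest('hash-ok', '{ ok }', validate);    // succeeds, cached
```

After both calls, only the valid operation occupies APQ cache space, which is the behavior this change is after.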
This has been released in Apollo Server 2.4.0!
Just as an FYI to others that find this: we upgraded to apollo-server-express 2.4.0 and had to roll back due to OOMs. We usually run pretty lean on RAM (600MB per container, using ~50% of it); with 2.4.0 (presumably because of this change) our RAM usage spiked to ~100%. Which is fine; we'll probably allocate more RAM to use this feature. Disclaimer: we haven't taken heap dumps yet, so it could have been something else, but the apollo-server bump from ~2.2 to 2.4 was the only change in the release.
Huh, the intention is for this cache to take about 30MB. If you're seeing more, that sounds like a bug! It's also evidence that it's time to add the configurability @abernix alluded to, but it's more important to make sure there isn't just a bug in the size calculation or something...
@glasser 30MB does sound much more reasonable. :-) I'll work on confirming the root cause, but it probably won't be super quick given current priorities.
@stephenh We've been running this in production ourselves for a while now and I haven't witnessed that problem, but I certainly wouldn't rule out the possibility! If you're able, I'm curious if you could try using it. My hope in offering this suggestion is that it might be something you could quickly opt into, suffer the temporary consequences of the OOM, and feed back that information to help identify whether this is indeed at fault.
@abernix sure, I can try
I'm experiencing the same issue here - seems like you may or may not encounter it based on client behaviour. In our case, there's a mutation like this, where the input will be different for every user, and the client is constructing the document with the input inline:
Repeatedly running this mutation locally with a randomised input quickly fills the cache up; the debugging does indicate older documents are being dropped when the limit is reached, but the memory usage difference was more like 138 MB than 30 MB between an empty and full cache, and memory usage continues to rise (though more slowly; roughly 1MB per 2k requests - edit: comparing heapdumps suggests this is something internal to the lru-cache module and would likely need investigation over there) even after the cache is "full". Assuming you control the client, the issue can be avoided (and much better use made of this cache) by rewriting the query like this to pass the input as a variable - this means the cache does not grow at all with additional requests:
@shackpank Thank you for sharing these helpful details and taking the time to look into this, particularly the follow-up edit suggesting that the residual growth is internal to the `lru-cache` module. It sounds like there are two things:
I'm curious, when you say that switching the string literals to variables:
...does this indicate that the cache limit exceedance also stopped? Or was it just no longer showing up on the radar? I'd expect it to be continuing regardless, if enough unique documents were still arriving.
Since overall the cache is still bringing performance benefits even though it's exceeded its intended size allotment (read: in your case, 138MB), I'm actually mostly concerned about the slower growth, so hopefully we can get to the bottom of that. Still, we'll need to adjust these expectations, and possibly figure out another way to estimate AST sizes.
However, I'm glad you've discovered that string literals pose particular problems for cache benefits and that using GraphQL variables helps in this department. The use of variables offers other benefits as well (e.g. aggregation purposes for metric reporting), so it's highly recommended!
I can confirm the cache causes a memory leak. I wrote a script to generate random distinct queries and request them with wrk2. Taking a heap snapshot shows that it is retaining well above 30MB. In a production environment I am noticing a ~30MB/hr increase in memory. After logging distinct queries, I have found that there are on average 4 unique queries (with literals) requested per second. graphql-js: 14.2.1
This PR builds upon the housekeeping work done in #2107, #2108 and #2109 and implements an in-memory cache store for parsed and validated documents in an effort to increase server performance.
Upon receiving a request, one of the many steps Apollo Server must perform is the parsing and validation of the document via `graphql-js`'s `parse` and `validate` functions. The parsing and validation of the document results in a `DocumentNode` (an AST) which is used for the eventual execution (via `graphql-js`'s `execute`).

For relatively small documents, this takes a small fraction of time, though with more complex documents the computational sink becomes larger and less predictable.

These performance characteristics aren't always predictable, but when processing the same operation over and over again, we can avoid the expensive computation cost associated with parsing and validating again by storing the resulting `DocumentNode`, rather than building what is ultimately a re-forged copy of an effectively identical `DocumentNode`.

This PR introduces that technique by utilizing the SHA-256 hex digest of the operation document itself as a key and storing identical `DocumentNode`s in an `InMemoryLRUCache` (powered under the hood by the excellent `lru-cache` package!).

When using GraphQL variables (rather than string/numeric literals within the GraphQL document itself), this means that repeat operations, even with dynamically changing variables, can still be ready to `execute` based on the parsing and validation of a similar, previous invocation, cutting a substantial amount of time off the request round-trip.

Some benchmarks might help demonstrate this better. For the purposes of demonstration, I'll utilize relatively flat GraphQL documents (and corresponding schemas) which I benchmarked locally using Apache JMeter. The resolvers involved in this test are all so-called "default resolvers"; as in, they do nothing but return `null`, and should accurately demonstrate the "win" from this optimization technique:

**Benchmarks**
| Query | Without `DocumentNode` caching | With `DocumentNode` caching |
| --- | --- | --- |
| `query Benchmark { Field_1 }` | p99: 56.00ms<br>Throughput: 925.45 | p99: 58.00ms<br>Throughput: 979.08 ⚡️ |
| `query Benchmark { Field_1 Field_2 ...until... Field_100 }` | p99: 74.00ms<br>Throughput: 530.99 | p99: 55.99ms ⚡️<br>Throughput: 917.70 ⚡️ |
| `query Benchmark { Field_1 Field_2 ...until... Field_5000 }` | p99: 1007.00ms<br>Throughput: 19.22 | p99: 390.00ms ⚡️<br>Throughput: 59.64 ⚡️ |
The single-field operations are a wash, but there are substantial improvements as the query document's complexity increases! Looking at the middle-of-the-road (and more reasonable) 100-field request, the 99th percentile (as in, 99% of requests returned within that timeframe) is about 1.32× faster (74.00ms → 55.99ms)!
It's worth pointing out that I had a 1000ms request/response timeout configured on the test runner, so the results of the "5000 fields, before" test aren't showing the true depth of the situation, since 88.10% of the operations `Read timed out`! (Of course, requesting 5000 top-level fields probably raises a number of concerns of its own, so if that metric piques your concern, I certainly encourage you to dive in more!) On the plus side, the cached parsed-and-validated document did not time out at all and, based on the average response time, performed 308% faster. 🏎💨🎠

For anyone interested in more details, I've posted the more complete results at https://abernix.github.io/documentStoreBenchmarks, in the form of JMeter reports. You can click through to them from the index page.
Of course, there are still a couple of things to be figured out here, particularly in the realm of configuration surface area. Currently, this utilizes a 30MiB cache, which might be enough for many or most configurations, but I'd like to figure out exactly what the sensible defaults might be here, and what might need tuning, before opening it up for configuration. In fact, if we don't need to add configuration, that would be even better!
I'm excited to see how this performs for others!
Fixes #1566