Building large amount of pages (~16k) on Gatsby V2 performance issues #7373
What do the pages and their queries look like? My initial guess is you're using gatsby-image a ton and generating a bunch of blur-up effects?
I'm not using gatsby-image at the moment, but the pages consist of text and images on each page. The images are loaded with a URL string into another lazy-loading lib. Is there any good/easy way to monitor the graphQL queries, for example to understand which queries take the most time, and their frequency?
Can you paste some queries and the data they return? There are ways to trace graphql queries. We haven't got this working just yet, but we hope to extend our earlier work here: https://next.gatsbyjs.org/docs/performance-tracing/ Also, doing a normal perf analysis could turn up some problems. Follow this guide and do a performance analysis in Chrome dev tools while graphql queries are running: https://next.gatsbyjs.org/docs/debugging-the-build-process/#chrome-devtools-for-node
Do the graphql queries hit an internal memory store of the data that you generate in an earlier step? I'm curious about this issue as well; 200k pages run this step at ~30 queries per second, which makes it take about 2 hours.
@chuntley these aren't normal query speeds, which is why we're trying to debug what's going on in the site. gatsbyjs.org, e.g., does ~300 queries/second.
Query running is also single-threaded at the moment; we'll make it multi-threaded in the future.
@chuntley or are you saying you have another site w/ 200k pages?
Here's an example of what the queries look like. There are ~6 similar queries run sequentially, one after another, with a similar structure (this query is in the component that is passed to createPage()).
Some of the fragments have around 20-30 attributes; the majority of the attributes are strings. Thanks for the links, I will take a look at them and see if I can dig deeper into what is taking so long.
@chuntley not what I've seen at least.
By the time my build process gets to this step, memory usage is around 2gb. Is there a chance that node performance decreases once you go past the original 1.5gb limit?
@chuntley it can. The best thing to do is a performance analysis, as I mentioned earlier, since you can then see which functions are using the most time.
Something interesting to note is that the performance usually starts high, but as it goes along (noticeably around 10-20k documents processed), it begins to slow down. For example (using the same data set with a limit set): 10k pages: 300 per second, start to finish
i'm having a similar issue, but with a relatively small number of files. i have around ~2500 mds, and i get
edit: sorry, i was using rc.0; after upgrading to rc.15 the problem is no longer present
Just a general note for people posting/reading here — performance is complex. There are a ton of things that can affect your site's build performance: Gatsby's code itself, plugins you're using, react components you're using, js libs you're using, and your own code of course. So if you run into performance problems, it's most useful if you can reproduce the same problems with one of our benchmark sites, or by making some small changes to them that you share. The only way we can make improvements is if we can see the same problems on our own machine.
@KyleAMathews i tried to cut down all the other stuff and can still reproduce the issue. i have created a repo at https://github.com/eLod/gatsby-bench; it produces the error
@m-allanson @pieh @DSchau @rase- and I met to investigate this issue this morning. @pieh has been talking to @eLod about his site and his comment #7373 (comment). We tried removing the cache.sets in transformer-remark as he suggested and also saw that this solved the rapidly growing memory seen w/ larger markdown sites. This seems mostly due to avoiding copying objects in memory (to the cache). We also write out the cache at an extremely fast clip (every 250ms), which uses CPU/memory to stringify the data, and this gets more problematic as the cache gets larger. Removing that sped up query running quite a bit. (See gatsby/packages/gatsby/src/utils/cache.js, line 69 at 33b4c76.)
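The failure mode described above can be sketched in a toy flusher (all names here are hypothetical, not Gatsby's actual cache code): serializing the whole store on a fixed interval costs time proportional to the cache size on every tick, whereas flushing only when the store is dirty, after a debounce, pays the stringify cost once per burst of writes.

```javascript
// Hypothetical sketch, not Gatsby's implementation: debounce cache writes
// so a growing store is stringified once per burst, not every 250ms.
function makeCacheFlusher(writeFn, delayMs) {
  const store = {}
  let dirty = false
  let timer = null
  return {
    set(key, value) {
      store[key] = value
      dirty = true
      if (!timer) {
        timer = setTimeout(() => {
          timer = null
          if (dirty) {
            dirty = false
            // One JSON.stringify per flush, not one per interval tick.
            writeFn(JSON.stringify(store))
          }
        }, delayMs)
      }
    },
  }
}
```

Under this scheme many rapid `set` calls collapse into a single serialized write, which is the effect removing the 250ms writer had on query speed.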
@KyleAMathews I've continued investigating our data model a bit for speed improvements. Will get back to you if I find anything interesting. Regarding your benchmark test, I have some questions about speed. What I can see now is that the run speed is around 40-60 queries/second in
We were having similar issues with 1700 pages. I was able to increase performance by removing a graphql query from a template and passing that data through pageContext via a source plugin. This allowed it to run once instead of 1700 times. Beware though: passing too much data can cause memory issues. An obvious mistake, but I hope this helps someone.
Yes, i agree with this (i have 5 languages); it's much better to prefetch everything in gatsby-node and store the results in context. gatsby-node is a bit complicated for beginners, but it's a really nice feature of gatsby. Some thoughts:
I'm currently building a blog with articles containing lots of images and getting between 2-4 queries per second on build (1 query = 1 article). It is my understanding that
Can attest to doing the expensive stuff in one shot in
@eads fun project! :-D And woah... that's a big time diff :-(
@KyleAMathews I am running into some memory issues, but I'll open a separate issue. But definitely, one 30s query in
Oh, I should probably note that your situation is a bit new, @eads, in that you're using gatsby-source-graphql with a remote API, which means every call has network latency. Currently we hard-code things so we only run 4 queries at a time. With remote APIs, we should run far more queries concurrently to speed things up.
Moving the convo over here from the Gatsby Gazette 2018-11-28 - good shout @pieh. I've created a PR adding some CPU control in html-renderer-queue.js (multi-core builds), which includes some tweaks we've made to improve our larger site builds. Our main site has ~25k nodes, most of which have a combination of static data (that runs through Gatsby's static build) and dynamic data on React app load. We've managed to reduce build times of ~10mins down to ~6mins using these CPU controls - specifically by using
Courting thoughts from people involved with large site builds from this issue...
Many Gatsby examples select the data relevant for a single page with a graphql query like so:
What is the performance of such a query? Does it matter how many records there are? If selecting a single record by id is not O(1), this could potentially make the whole build an O(n^2) operation.
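To make the concern concrete, here is a toy illustration (not Gatsby's internals): if each of n pages does a linear scan over all n records, the query phase is O(n^2) overall; a one-time hash-map index keyed by id makes each lookup O(1) and the whole phase O(n).

```javascript
// Toy illustration of per-page lookup cost, not Gatsby's actual store.
function buildIndex(records) {
  const byId = new Map()
  for (const r of records) byId.set(r.id, r)
  return byId
}

const records = Array.from({ length: 10000 }, (_, i) => ({
  id: String(i),
  title: `Page ${i}`,
}))

// O(n) per lookup: scans the array until it finds a match.
const slow = id => records.find(r => r.id === id)

// O(1) per lookup, after a one-time O(n) index build.
const index = buildIndex(records)
const fast = id => index.get(id)

console.log(fast("9999").title) // → "Page 9999"
```

Both functions return the same record; only the asymptotics of repeating the lookup once per page differ.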
Would it be possible to change the development server to run page-exported queries on demand, rather than running all queries upfront? It's necessary for a production build, but it probably makes some people wait a long time before being able to develop, or it makes them work around the issue by passing data via
Edit: Answered in my spectrum question.
Sorry for the late response, but yeah, we have had some memory issues (not sure though if it was due to the movement of queries), but we solved it by adding a filter to contentful #12939 (to minimize the number of nodes created).
Sorry for the late response; yeah, we have had these issues as well. But we moved away from running a dev server for authors; we always build the application instead and try to use the internal cache as well as possible to lower build times. This has worked well for us.
In a CI/CD environment, would you recommend caching the
We save our
i'm evaluating Gatsby for a React website with 2.5 million pages right now, which would really benefit from the SEO/perf benefits of Gatsby... unfortunately, the build times look untenable. it'd be neat if
e: maybe it'd be possible to make the 10k most frequented pages static, and have the others stay dynamic?
@ashtonsix I'm having big trouble with ~1500 pages. I would highly recommend you write your own very simple code for building that number of pages.
From my experience, the images take the most time. If you can, turn that off in the config and just load images from absolute URLs.
What would y'all recommend I do when trying to source and transform over 160k (160,000+) nodes using gatsby-source-mysql? MySQL just times out when I do a select query of the entire database. If I put a limit on it, it works fine, but I need the entire database for this app.
You'd probably want to add support for paging to gatsby-source-mysql then, so it doesn't try to query everything at once.
@KyleAMathews I ended up paginating the queries by month, but it still times out with this error...
Here's the relevant code snippet for the custom query paginator I wrote:

```javascript
let queries = []
let currentMonth = 1
for (let i = 0; i < monthsSinceLaunch + 1; i++) {
  let month = moment().subtract(currentMonth, 'months')
  let monthStr = month.format('YYYY-MM-')
  queries.push({
    statement: `SELECT * FROM clips WHERE created_at \
      BETWEEN cast('${monthStr + '01'}' as DATE) \
      AND cast('${monthStr + '31'}' as DATE);`,
    idFieldName: 'id',
    name: `${month.format('MM') + month.format('MMM') + month.format('YYYY')}Clips`
  })
  currentMonth++
}
```

Example Output
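For comparison, a fixed-size LIMIT/OFFSET chunker is a simpler alternative to calendar-month windows, since months can have very different row counts while offset pages are uniform (a sketch; the `clips` table and `id` column are taken from the snippet above, and the page size is an arbitrary choice):

```javascript
// Sketch: uniform offset-based chunks instead of per-month date windows.
// Table/column names follow the snippet above; page size is arbitrary.
function buildPagedQueries(totalRows, pageSize) {
  const queries = []
  for (let offset = 0; offset < totalRows; offset += pageSize) {
    queries.push({
      statement: `SELECT * FROM clips ORDER BY id LIMIT ${pageSize} OFFSET ${offset};`,
      idFieldName: 'id',
      name: `clipsPage${offset / pageSize}`,
    })
  }
  return queries
}

console.log(buildPagedQueries(160000, 10000).length) // → 16
```

Note that deep OFFSET values still force MySQL to scan past skipped rows; keyset pagination (`WHERE id > lastSeenId ORDER BY id LIMIT n`) avoids that if the chunks get large.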
In case anyone is interested in optimizing build times and isn't familiar with
@nadrane Nice article, but your caveat at the bottom was the killer for me. Gatsby's change tracking is broken by this speedup, and you wind up having to delete your cache on any small change; otherwise you won't see changes.
@pauleveritt Are you sure about that? I thought that although the hot-reloading stops working, cache-busting works fine. It's my understanding that Gatsby has a filewatcher configured to look for changes against
@nadrane You're right, if the edit is to
As an example, let's say you have a site with authors and a GraphQL query in
A change to an author's title won't result in each page displaying that value getting updated.
@pauleveritt Yeah, that's a good question. I'm not even sure how Gatsby handles this without the optimization; I have to assume that Gatsby Source Filesystem is setting up file watchers for us. Regardless, I'm curious to learn whether you think this is a practical concern. The reason I say that is that I'd imagine you'd only want to use this optimization in performance-critical scenarios, and I'd suspect that any query against your filesystem is going to be fast already. In my experience, the place where this strategy is most valuable is when each query crosses the network, introducing network latency into each request.
Hiya! This issue has gone quiet. Spooky quiet. 👻 We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here. If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open! As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing! Thanks for being a part of the Gatsby community! 💪💜
+1
Hey again! It’s been 30 days since anything happened on this issue, so our friendly neighborhood robot (that’s me!) is going to close it. Please keep in mind that I’m only a robot, so if I’ve closed this issue in error, I’m
As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing! Thanks again for being part of the Gatsby community!
It seems Gatsby found a way to solve this issue
@sheerun and what's the way?
We've been having out-of-memory issues in our CI environment. The error occurs during
We did notice that that environment variable is now being completely ignored; see that
By manually editing the file and setting it to true, all our memory issues went away, and the speed of the HTML page build increased 10x: before it was 30 pages/second, now almost 300 pages/second. Hope this helps.
We had the same problem described in the previous comment (all credit to @leonfs for debugging it): our builds fail with out-of-memory errors on a CI env which reports 18 cores; forcing the number of reported cores to 1 (by overwriting node_modules/gatsby-core-utils/dist/cpu-core-count.js) fixes the problem. Gatsby really needs to provide some way to properly control this.
@leonfs @juliangoacher that issue happened to be fixed yesterday; is this still a problem with that fix? Any chance I could build your site and check for additional perf bottlenecks in your config and our (Gatsby) build pipeline?
Summary
I'm building a website that contains lots of pages (~16k) with graphQL requests on each page. I've done some benchmarks; at the moment the build of ~12k pages takes ~25 minutes.
Relevant information
The website fetches data from different sources (contentful and JSON files that are added to graphQL). That data is then used on every page, with its own graphQL query on each page.
A possible optimization could be to remove most of the queries from each page and do one bigger query in gatsby-node.js? But we would still have a build time for the static pages of around 450 seconds.
File contents (if changed)
gatsby-node.js: I'm fetching all the pages in queries and then looping through that array to create every page.