Building large amount of pages (~16k) on Gatsby V2 performance issues #7373

Closed
stoltzrobin opened this issue Aug 16, 2018 · 58 comments
Labels
stale? Issue that may be closed soon due to the original author not responding any more. type: question or discussion Issue discussing or asking a question about Gatsby

Comments

@stoltzrobin
Contributor

Summary

I'm building a website that contains lots of pages (~16k), with GraphQL queries on each page. I've done some benchmarks, and at the moment a build of ~12k pages takes ~25 minutes.

Relevant information

The website fetches data from different sources (Contentful and JSON files that are added to GraphQL). That data is then used on every page, with each page running its own GraphQL query.

success building schema — 1.363 s
success createPages — 0.818 s
success createPagesStatefully — 4.484 s
success onPreExtractQueries — 0.008 s
success update schema — 0.765 s
success extract queries from components — 0.244 s
success run graphql queries — 1021.016 s — 11582/11582 11.34 queries/second
success write out page data — 0.105 s
success write out redirect data — 0.001 s
success onPostBootstrap — 0.262 s

info bootstrap finished - 1049.157 s

success Building production JavaScript and CSS bundles — 47.377 s
success Building static HTML for pages — 443.926 s — 11582/11582 26.59 pages/second
info Done building in 1545.152 sec
✨  Done in 1552.26s. 

A possible optimization could be to remove most of the queries from each page and do one bigger query in gatsby-node.js. But we would still have a build time of around 450 seconds for the static pages.

File contents (if changed)

gatsby-node.js: I'm fetching all the pages in queries and then looping through that array to create every page.

@KyleAMathews
Contributor

What do the pages look like and their queries? My initial guess is you're using gatsby-image a ton and generating a bunch of blur-up effects?

@stoltzrobin
Contributor Author

I'm not using gatsby-image at the moment, but each page consists of text and images. The images are loaded by passing a URL string to another lazy-loading lib.

Is there any good/easy way to monitor the GraphQL queries, for example to understand which queries take the most time and how often they run?

@KyleAMathews
Contributor

Can you paste some queries and the data they return?

There are ways to trace graphql queries. We haven't got this working just yet but hope to extend our earlier work here https://next.gatsbyjs.org/docs/performance-tracing/

Also doing normal perf analysis could turn up some problems. Follow this guide and do a performance analysis in chrome dev tools while graphql queries are running https://next.gatsbyjs.org/docs/debugging-the-build-process/#chrome-devtools-for-node
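
Until query tracing is available, a rough way to see where time goes for the queries you run yourself in gatsby-node.js is to wrap them in a timer. This is only a sketch: `timed` and `report` are hypothetical helpers, not Gatsby APIs, and they will not measure the page queries that Gatsby runs on its own.

```javascript
// Hypothetical helpers (not part of Gatsby): record how long async calls take.
const timings = new Map(); // label -> { count, totalMs }

async function timed(label, fn) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    // Record the elapsed time even if fn() throws.
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    const entry = timings.get(label) || { count: 0, totalMs: 0 };
    entry.count += 1;
    entry.totalMs += elapsedMs;
    timings.set(label, entry);
  }
}

// Returns one row per label, slowest total time first.
function report() {
  return [...timings.entries()]
    .map(([label, { count, totalMs }]) => ({ label, count, totalMs }))
    .sort((a, b) => b.totalMs - a.totalMs);
}
```

Inside createPages this could be used as `await timed("allPages", () => graphql(query))`, with `console.table(report())` at the end of the function.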

@chuntley
Contributor

Do the graphql queries hit an internal memory store of the data that you generate in an earlier step? I'm curious about this issue as well, 200k pages run this step at ~30 queries per second, which makes it take about 2 hours.

@KyleAMathews
Contributor

@chuntley these aren't normal query speeds which is why we're trying to debug what's going on in the site. gatsbyjs.org e.g. does ~300 queries / second.

@KyleAMathews
Contributor

Query running is single threaded as well atm which we'll make multi-threaded in the future.

@KyleAMathews
Contributor

@chuntley or are you saying you have another site w/ 200k pages?

@stoltzrobin
Contributor Author

stoltzrobin commented Aug 16, 2018

Here's an example of what the queries look like. There are ~6 similar queries in sequence, all with a similar structure (this query is in the component that is passed to createPage()).

query_1(id: {eq: $id}) {
 ...[fragment_name_1]
}

query_2(id: {eq: $id}) {
 ...[fragment_name_2]
}

...
fragment [fragment_name_X] on Query{
 attribute
 nestedAttribute {
  id
 }
 ...
}

Some of the fragments have around 20-30 attributes. The majority of the attributes are strings. Thanks for the links, I will take a look and see if I can dig deeper into what is taking so long.

@stoltzrobin
Contributor Author

@chuntley not what I've seen at least.

@Chuloo Chuloo added the type: question or discussion Issue discussing or asking a question about Gatsby label Aug 18, 2018
@chuntley
Contributor

By the time my build process gets to this step, memory usage is around 2gb. Is there a chance that node performance decreases once you go past the original 1.5gb limit?

@KyleAMathews
Contributor

@chuntley it can. Best thing to do is do a performance analysis as I mentioned earlier as you can then see which functions are using the most time.
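
To check whether a build is actually approaching the heap limit, one option is to log heap usage while it runs. A minimal sketch using Node's built-in `v8` module; the helper name `heapUsageMb` and the 5-second interval are arbitrary choices for illustration, and the snippet is assumed to sit at the top of gatsby-node.js.

```javascript
const v8 = require("v8");

// Returns current heap usage and V8's configured heap limit, in megabytes.
function heapUsageMb() {
  const { heap_size_limit } = v8.getHeapStatistics();
  const { heapUsed } = process.memoryUsage();
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024);
  return { usedMb: toMb(heapUsed), limitMb: toMb(heap_size_limit) };
}

// Log every 5 seconds while the build runs. unref() keeps this timer from
// holding the process open once the build itself has finished.
const heapTimer = setInterval(() => {
  const { usedMb, limitMb } = heapUsageMb();
  console.log(`heap: ${usedMb} MB used of ${limitMb} MB limit`);
}, 5000);
heapTimer.unref();
```

If the used figure climbs steadily toward the limit, the limit can be raised with Node's `--max-old-space-size` flag, though that treats the symptom rather than the cause.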

@chuntley
Contributor

Something interesting to note is that the performance usually starts high, but as it goes along (noticeably around 10-20k documents processed), it begins to slow down. For example (using the same data set with a limit set):

10k pages: 300 per second, start to finish
200k pages: Starts at 70 per second, finishes at 12 per second

@eLod
Contributor

eLod commented Sep 10, 2018

i'm having a similar issue, but with relatively small amount of files. i have around ~2500 mds, and i get FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory during the run graphql queries — 691/2527 7.01 queries/second step. the queries for most of the pages are simply getting the html for the markdown. it simply kills it around 1.5G memory.

edit: sorry, i was using rc.0, after upgrading to rc.15 the problem is no longer present

@KyleAMathews
Contributor

Just a general note for people posting/reading here: performance is complex. There are a ton of things that can affect your site's build performance: Gatsby's code itself, plugins you're using, React components you're using, JS libs you're using, and of course your own code. So if you run into performance problems, it's most useful if you can reproduce them with one of our benchmark sites, or by making some small changes to them that you share.

The only way we can make improvements is if we can see the same problems on our own machine.

@eLod
Contributor

eLod commented Sep 11, 2018

@KyleAMathews i tried to cut down all the other stuff and still reproduce the issue. i have created a repo at https://github.com/eLod/gatsby-bench, it produces the error FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory while extract queries from components on my machine for both gatsby develop and build (it is killed around 350-400/782 for me). i tried to check with zipkin, but i only see the traces until the query extraction starts and also i don't see memory usage. the content is a c library documentation generated with doxygen & moxygen.

@KyleAMathews
Contributor

@m-allanson @pieh @DSchau @rase- and I met to investigate this issue this morning.

@pieh has been talking to @eLod about his site and his comment #7373 (comment)

We tried removing the cache.sets in transformer-remark as he suggested and also saw that this solved the rapidly growing memory seen w/ larger markdown sites.

This seems mostly due to avoiding copying objects in memory (to the cache).

We also write out the cache at an extremely fast clip (every 250ms) which uses CPU/memory to stringify the data which gets more problematic as the cache gets larger. Removing that sped up query running quite a bit.

fs.writeFile(`${directory}/db.json`, JSON.stringify(mapToObject(db)))
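
One generic way to avoid re-serializing a growing cache several times a second is to coalesce save requests, so the expensive stringify happens at most once per interval. The sketch below is not the change that was made in Gatsby; `createCoalescedWriter` and its parameters are hypothetical names used only to illustrate the idea.

```javascript
// Hypothetical helper: collapses many "please save" requests into one write.
function createCoalescedWriter(write, delayMs = 1000) {
  let timer = null;

  // Performs the pending write now, if there is one.
  function flush() {
    if (timer === null) return; // nothing pending
    clearTimeout(timer);
    timer = null;
    write();
  }

  // Requests a write; repeated calls before the timer fires are no-ops.
  function schedule() {
    if (timer !== null) return; // a write is already pending
    timer = setTimeout(flush, delayMs);
  }

  return { schedule, flush };
}
```

Here `write` would wrap the `fs.writeFile(...)` call above, `schedule()` would be called after each cache set, and `flush()` once before the process exits.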

@stoltzrobin
Contributor Author

@KyleAMathews I've continued a bit investigating our data model for speed improvements. Will get back to you if I find anything interesting.

Regarding your benchmark test, I have some questions about speed. What I can see now is that gatsby develop and gatsby build have a huge difference in queries/second. Have you seen anything similar on your machine?

The run speed is around 40-60 queries/second in gatsby develop and ~400 queries/second when I run gatsby build. Is this normal? From what I can remember, in the Gatsby v2 beta the numbers were more even between develop and build (but I might be dreaming).

@JordanDDisch

JordanDDisch commented Oct 3, 2018

We were having similar issues with 1700 pages. We were able to increase performance by removing a GraphQL query from a template and passing that data through pageContext via a source plugin. This allowed it to run once instead of 1700 times. Beware though: passing too much data can cause memory issues. Obvious mistake, but hope this helps someone.

@simonjoom

simonjoom commented Oct 6, 2018

Yes, I agree with this (I have 5 languages): it's much better to prefetch everything in gatsby-node and store the results in context.
I also think it's better to separate 'view and controller'.
I use a filesystem cache coupled with queries in gatsby-node to avoid refetching pages that haven't been updated.

gatsby-node is a bit complicated for beginners, but it's a really nice feature of Gatsby.
It would be nice if Gatsby gave more examples to teach it.

Some thoughts:
Unfortunately, Gatsby doesn't provide a way to manually run a GraphQL query without going through a StaticQuery.
I would like to run a GraphQL query before a component renders, as in componentWillMount.
Is there some way to do that? (Of course, I'm not talking about templates or page context here.)

@dbismut
Contributor

dbismut commented Oct 9, 2018

I'm currently building a blog with articles containing lots of images and getting between 2-4 queries per second on build (1 query = 1 article). It is my understanding that gatsby-image requires creating thumbnails of images on disk at build time. Is it possible that creating these images makes queries slower?

@eads

eads commented Oct 29, 2018

Can attest to doing the expensive stuff in one shot in gatsby-node.js. I'm using Gatsby with Hasura to build about 40k pages (currently unpublished). Getting all the data in one shot takes about 25s, whereas god only knows what the 40k queries run serially would take. I didn't bother to figure it out, as just 1000 pages took many minutes.

@KyleAMathews
Contributor

@eads fun project! :-D

And woah... that's a big time diff :-(

@eads

eads commented Oct 29, 2018

@KyleAMathews I am running into some memory issues, but I'll open a separate issue.

But definitely, one 30s query in gatsby-node.js works a lot better than 40k 0.5s queries in this case and I highly recommend it as an approach.

@KyleAMathews
Contributor

Oh should probably note that your situation is a bit new @eads in that you're using gatsby-source-graphql with a remote API which means every call has network latency. Currently we hard code things so we only run 4 queries at a time. With remote APIs, we should run way more queries concurrently to speed things up.
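
For queries you issue yourself against a remote API, the general idea of a configurable concurrency limit can be sketched as a small promise pool. This is not Gatsby's query runner; `mapWithConcurrency` is a hypothetical helper, and the right limit depends on what the remote API tolerates.

```javascript
// Runs worker(item, index) over items with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until none are left.
  async function runLane() {
    while (next < items.length) {
      const index = next++; // no race: JS runs this synchronously between awaits
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => runLane());
  await Promise.all(lanes);
  return results;
}
```

With network-bound work, raising the limit from 4 to something like 20 mostly overlaps waiting time, so throughput can improve without extra CPU cost.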

@dominicfallows
Contributor

dominicfallows commented Dec 3, 2018

Moving convo over to here from Not the Gatsby Gazette 2018-11-28 - good shout @pieh

I've created a PR adding in some CPU control in html-renderer-queue.js (multi-core builds) which includes some tweaks we've made to improve our larger site builds.

Our main site has ~25k nodes, most of which have a combination of static data (that runs through Gatsby's static build) and dynamic data on React app load. We've managed to reduce build times from ~10 mins down to ~6 mins using these CPU controls, specifically by using logical_cores instead of physical_cores. These are rather biased tests, as they use our own app, but the results are encouraging.

Courting thoughts from people involved with large site build from this issue...

@sheerun

sheerun commented Dec 9, 2018

Many Gatsby examples select the data for a single page with a GraphQL query like so:

  query($id: String!) {
    # Select the post which equals this id.
    postsJson(id: { eq: $id }) {
      ...PostDetail_details
    }
  }

What is the performance of such a query? Does it matter how many records there are? If selecting a single record by id is not O(1), this could potentially make the whole build an O(n^2) operation.
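
To make the concern concrete, here is a sketch of the two lookup strategies in plain JavaScript. It says nothing about how Gatsby's query runner is actually implemented (the thread does not establish that); the record shape is made up for illustration.

```javascript
// Linear scan: O(n) per lookup, so O(n^2) overall when done once per page.
function findByScan(records, id) {
  return records.find((record) => record.id === id);
}

// Build the index once in O(n); each later lookup is O(1) on average.
function buildIdIndex(records) {
  return new Map(records.map((record) => [record.id, record]));
}

const records = Array.from({ length: 5 }, (_, i) => ({
  id: `post-${i}`,
  title: `Post ${i}`,
}));
const index = buildIdIndex(records);

console.log(findByScan(records, "post-3").title); // "Post 3"
console.log(index.get("post-3").title); // "Post 3"
```

Both calls return the same record; the difference only shows up as the record count grows, which would match builds that slow down as more pages are processed.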

@seidtgeist

seidtgeist commented Feb 26, 2019

Would it be possible to change the development server to run page-exported queries on demand, rather than running all queries upfront?

It’s necessary for a production build, but it probably makes some people wait a long time before being able to develop, or it makes them work around the issue by passing data via pageContext.

Edit: Answered in my spectrum question.

@stoltzrobin
Contributor Author

@seamofreality https://gist.github.com/JordanDDisch/5fa7f3972b9a4ff91cb7469c01eea1a6/fe91cce8bd34019b1d6e26bd3d2e7af93c34de5e thats how we pass stuff through page context. Not sure if that helps.

@stoltzrobin have you run into memory issues from moving most of your graphql queries to gatsby-node.js ?

Sorry for the late response, but yes, we have had some memory issues (not sure, though, whether they were due to moving the queries). We solved them by adding a filter to Contentful #12939 (to minimize the number of nodes created).

@stoltzrobin
Contributor Author

@stoltzrobin I've also moved stuff to sourceNodes but found the authoring experience suffers. You add a new Markdown doc that uses mapping to get linked to another and it requires stopping gatsby develop and rm -rf .cache, many times a day.

I need to find an API similar to sourceNodes but which runs on each filesystem change.

Sorry for the late response. Yes, we have had this issue as well. But we moved away from running a dev server for authors; we always build the application instead, and try to use the internal cache as much as possible to lower build times. This has worked well for us.

@PolGuixe
Contributor

In CI/CD environment, would you recommend to cache the .cache and public folders?

@stoltzrobin
Contributor Author

In CI/CD environment, would you recommend to cache the .cache and public folders?

We save our .cache and public folders on S3 and retrieve them when building the site again. This lowered our build time considerably.

@ashtonsix

ashtonsix commented May 27, 2019

i'm evaluating Gatsby for a React website with 2.5 million pages right now which would really benefit from the SEO/perf benefits of Gatsby... unfortunately, the build times look untenable. it'd be neat if gatsby build could run concurrently across multiple servers or like, something with 96 CPUs

e: maybe it'd be possible to make the 10k most frequented pages static, and the others stay dynamic?

@seidtgeist

@ashtonsix I’m having big trouble with ~1500 pages. I would highly recommend you write your own very simple code for building that amount of pages.

@parkerproject

From my experience, the images take the most time. If you can, turn that off in the config and just load images from absolute URLs.

@crock
Contributor

crock commented Sep 2, 2019

What would y'all recommend I do when trying to source and transform over 160k (160,000+) nodes using gatsby-source-mysql? MySQL just times out when I do a select query of the entire database. If I put a limit on it, it works fine, but I need the entire database for this app.

@KyleAMathews
Contributor

You'd probably want then to add support to gatsby-source-mysql for paging so it doesn't try to query everything at once.
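
Short of adding paging to the plugin itself, a config-level workaround is to split one large SELECT into id-range chunks. The sketch below assumes a numeric, auto-incrementing `id` column and a known maximum id; `buildPagedQueries` is a hypothetical helper, and the `{ statement, idFieldName, name }` shape follows the query objects shown later in this thread. Whether this avoids the timeout depends on the database and the plugin.

```javascript
// Hypothetical helper: splits one big SELECT into id-range chunks.
// Assumes ids are positive integers no larger than maxId.
function buildPagedQueries({ table, maxId, pageSize }) {
  const queries = [];
  for (let start = 0, page = 1; start < maxId; start += pageSize, page++) {
    const end = Math.min(start + pageSize, maxId);
    queries.push({
      // Range predicates stay cheap on large tables, unlike a growing OFFSET.
      statement: `SELECT * FROM ${table} WHERE id > ${start} AND id <= ${end}`,
      idFieldName: "id",
      name: `${table}Page${page}`,
    });
  }
  return queries;
}

console.log(buildPagedQueries({ table: "clips", maxId: 25, pageSize: 10 }).length); // 3
```

One drawback is that each chunk gets its own `name`, so the rows end up spread across several query names rather than one.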

@crock
Contributor

crock commented Sep 2, 2019

@KyleAMathews I ended up paginating the queries by month, but it still times out with this error...

⠋ building schema
\node_modules\yoga-layout-prebuilt\yoga-layout\build\Release\nbind.js:53
        throw ex;
        ^

Error: Quit inactivity timeout
    at Quit.<anonymous> (\node_modules\mysql\lib\protocol\Protocol.js:160:17)
    at Quit.emit (events.js:198:13)
    at Quit._onTimeout (\node_modules\mysql\lib\protocol\sequences\Sequence.js:124:8)
    at Timer._onTimeout (\node_modules\mysql\lib\protocol\Timer.js:32:23)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)

Here's the relevant code snippet for the custom query paginator I wrote.

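const moment = require('moment')

// monthsSinceLaunch is assumed to be defined elsewhere (not shown in this snippet).
const queries = []
for (let i = 0; i < monthsSinceLaunch + 1; i++) {
  // Start with last month and walk backwards, one month per query.
  const month = moment().subtract(i + 1, 'months')
  // Use the month's real first and last day: a literal day of '31' is not a
  // valid date for shorter months.
  const firstDay = month.clone().startOf('month').format('YYYY-MM-DD')
  const lastDay = month.clone().endOf('month').format('YYYY-MM-DD')
  queries.push({
    statement: `SELECT * FROM clips WHERE created_at \
    BETWEEN cast('${firstDay}' as DATE) \
    AND cast('${lastDay}' as DATE);`,
    idFieldName: 'id',
    name: `${month.format('MM') + month.format('MMM') + month.format('YYYY')}Clips`
  })
}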

Example Output

{ statement:
     'SELECT * FROM clips WHERE created_at     BETWEEN cast(\'2019-03-01\' as DATE)     AND cast(\'2019-03-31\' as DATE);',
    idFieldName: 'id',
    name: '03Mar2019Clips' }

@nadrane
Contributor

nadrane commented Sep 3, 2019

In case anyone is interested in optimizing build times and isn't familiar with pageContext, I wrote an article explaining it: https://nickdrane.com/optimizing-gatsby-build-times-for-large-websites-using-pagecontext

@pauleveritt

@nadrane Nice article but your caveat at the bottom was the killer for me. Gatsby's change tracking is broken by this speedup and you wind up having to delete your cache on any small change, otherwise you won't see changes.

@nadrane
Contributor

nadrane commented Sep 3, 2019

@pauleveritt Are you sure about that? I thought that although the hot-reloading stops working, cache-busting works fine. It's my understanding that Gatsby has a file watcher configured to look for changes to gatsby-node.js, and when it changes, the dev environment rebuilds.

@pauleveritt

@nadrane You're right, if the edit is to gatsby-node.js itself, it rebuilds. If the edit is to some markdown file that affects a query that is run once in gatsby-node.js, it's an open question.

As an example, let's say you have a site with authors and a GraphQL query in gatsby-node.js that collects the collection of authors once, then passes it into the context of each page. The page then gets the current author and displays the title.

A change to an author's title won't result in each page displaying that value getting updated.

@nadrane
Contributor

nadrane commented Sep 3, 2019

@pauleveritt Yeah that's a good question. I'm not even sure how Gatsby handles this without the optimization. I have to assume that Gatsby Source Filesystem is setting up file watchers for us.

Regardless, I'm curious to learn if you think this is a practical concern. The reason I say that is because I'd imagine you'd only want to use this optimization in performance critical scenarios, and I'd suspect that any query against your filesystem is going to be fast already. In my experience, the place where this strategy is most valuable is when each query crosses over the network, introducing network latency into each request

@gatsbot

gatsbot bot commented Sep 24, 2019

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

@gatsbot gatsbot bot added the stale? Issue that may be closed soon due to the original author not responding any more. label Sep 24, 2019
@araphiel

+1

@gatsbot

gatsbot bot commented Oct 6, 2019

Hey again!

It’s been 30 days since anything happened on this issue, so our friendly neighborhood robot (that’s me!) is going to close it.

Please keep in mind that I’m only a robot, so if I’ve closed this issue in error, I’m HUMAN_EMOTION_SORRY. Please feel free to reopen this issue or create a new one if you need anything else.

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks again for being part of the Gatsby community!

@gatsbot gatsbot bot closed this as completed Oct 6, 2019
@sheerun

sheerun commented Oct 6, 2019

It seems Gatsby found a way to solve this issue

@parkerproject

@sheerun and what's the way?

@leonfs

leonfs commented Dec 4, 2019

We've been having out-of-memory issues in our CI environment. The error occurs during Building static HTML for pages stage. After some debugging, we noticed that the number of jest workers was 18, but we were already setting the number of workers to 1 by using the GATSBY_CPU_COUNT environment variable.

We noticed that the environment variable is now completely ignored because true is passed as the argument here: https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/utils/worker/pool.js#L6

After manually editing the file and changing that argument to false, all our memory issues went away and the HTML page build became about 10x faster: from 30 pages/second before to almost 300 pages/second now.

Hope this helps.

@juliangoacher

We had the same problem described in the previous comment (all credit to @leonfs for debugging it); our builds fail with out of memory errors on a CI env which reports 18 cores; forcing the number of reported cores to 1 (by overwriting node_modules/gatsby-core-utils/dist/cpu-core-count.js) fixes the problem.

Gatsby really needs to provide some way to properly control this.
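
For illustration, the behavior being asked for (an explicit override winning over the detected core count) could look roughly like the sketch below. It is not Gatsby's implementation: `resolveWorkerCount` is a hypothetical helper, it only handles numeric values of GATSBY_CPU_COUNT, and it falls back to the logical core count reported by Node.

```javascript
const os = require("os");

// Hypothetical helper, not Gatsby's code: prefer a numeric override from the
// environment, otherwise use the detected (logical) core count.
function resolveWorkerCount(env = process.env, detected = os.cpus().length) {
  const override = Number.parseInt(env.GATSBY_CPU_COUNT, 10);
  const count = Number.isInteger(override) && override > 0 ? override : detected;
  return Math.max(1, count); // never return fewer than one worker
}

console.log(resolveWorkerCount({ GATSBY_CPU_COUNT: "1" }, 18)); // 1
console.log(resolveWorkerCount({}, 18)); // 18
```

On CI machines that report many cores but have limited memory, an override like this is what keeps the number of worker processes, and therefore total memory use, under control.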

@pvdz
Contributor

pvdz commented Dec 10, 2019

@leonfs @juliangoacher that issue happened to be fixed yesterday. Is this still a problem with that fix?

Any chance I could build your site and check for additional perf bottlenecks on your config and our (Gatsby) build pipeline?

@Wkasel

Wkasel commented Dec 26, 2019

https://www.reddit.com/r/gatsbyjs/comments/dt9vea/can_gatsby_work_for_a_very_large_e_commerce_store/
