memory leak in shadow realms #47353

joyeecheung · 2023-04-01T03:51:25Z

Version

all

Platform

all

Subsystem

shadow-realm

What steps will reproduce the bug?

// Flags: --experimental-shadow-realm --max-old-space-size=20
'use strict';

for (let i = 0; i < 1000; i++) {
  const realm = new ShadowRealm();
  realm.evaluate(`new TextEncoder(); 1;`);
}

How often does it reproduce? Is there a required condition?

Always

What is the expected behavior? Why is that the expected behavior?

Shouldn't crash from OOM

What do you see instead?

Crash from OOM

Additional information

See analysis #47339 (comment). This was introduced in #46809. The current memory management of shadow realms requires that there should be no strong global references in the graph, but we currently have many of them in the code base - among them are the binding data and the aliased arrays associated with the encoding binding, which is lazily created when TextEncoder is accessed, for example.

I think we have at least two options:

Change the memory management of shadow realms so that it doesn't rely on the context being GC'ed to GC itself (otherwise any out-of-band global reference to objects in that context holds the context alive forever) - I am not sure how feasible it is, superficially that seems plausible, because the shadow realms boundary ensures that no objects can be leaked out of the shadow realms, therefore once the shadow realms wrappers are unreachable the context should be unreachable, but it might get more complex once there's asynchronicity/modules involved
Avoid making BaseObjects strong - there are strong BaseObjects everywhere in the code base, however, I am not sure if it's actually practical to simply eliminate all strong global references - we might end up having some of them just impossible to make weak, and they'll result in a leak once they end up in a shadow realm.

Or, we could use some help from V8 to get notified about the realms being unreachable, and release the realms in some callback instead of in a weak callback of the context. It's not clear to me what's enough for us at this point, though.

The text was updated successfully, but these errors were encountered:

joyeecheung · 2023-04-01T03:52:35Z

@nodejs/realm

anonrig · 2023-04-01T13:16:14Z

Could this memory leak cause performance regression when we run URL tests? I'm seeing a different result in node benchmark compare results, compared to microbenchmarks with mitata.

mcollina · 2023-04-01T14:41:21Z

I think we would eventually need some way to "know" when a ShadowRealm can be collected / closed / destroyed to dispose of any connected libuv handles or other native resources.

joyeecheung · 2023-04-01T18:28:55Z

Could this memory leak cause performance regression when we run URL tests? I'm seeing a different result in node benchmark compare results, compared to microbenchmarks with mitata.

I don't think so, this is only relevant in shadow realms, and I don't think we have URLs benchmarks in shadow realms (it's only supported starting from two weeks ago and not even released, and still it's behind --experimental-shadow-realms). Principal realms are unaffected because...you only get one principal realm per Node.js instance, and we actively destroy them when there are no more tasks to run.

legendecas · 2023-04-02T17:55:08Z

ShadowRealm is currently behind the runtime flag --experimental-shadow-realm or --harmony-shadow-realm and it should not affect performance in the principal realm.

I think we would eventually need some way to "know" when a ShadowRealm can be collected / closed / destroyed to dispose of any connected libuv handles or other native resources.

If BaseObjects are weak and no active HandleWraps or ReqWraps, the ShadowRealms' context should be unreachable and can be GC-ed. However, as pointed out by @joyeecheung as option 2, this needs to avoid making BaseObjects strong. This can be particularly a problem when we need to nest BaseObjects with c++ member fields by v8 global handles as v8's garbage collector is unable to infer the references between c++ objects. As an alternative, the references can be saved as object internal fields, or as v8's cppgc TracedReference.

joyeecheung · 2023-04-03T12:15:07Z

I also noticed that the current machinery for managing BaseObjects does not work well with the idea that "a context will become unreachable before the BaseObjects are destructed". For example I am trying to make BindingData weak, and I notice the they get deleted first via the cleanup hooks (as part of the destruction of the Realm) and then via the weak callbacks, whereas in the principal realm we simply assume that weak callbacks, if ever called, must always be called before the cleanup hooks (and we already have guard in the code for double-deletion in this case, but not the other one). There are probably more assumption about this throughout the code base. This also means, IIUC, none of the BaseObjects destructors should access the context or even modify JS states (not even setting the internal fields, that also crashes when the context is unreachable). This can be particularly problematic for complex objects like AsyncWraps (who...calls destroy hooks during destruction. Ouch).

joyeecheung · 2023-04-03T12:30:08Z

I now wonder if it's too late to make the ShadowRealms API actively dispose themselves (e.g. realm.dispose()) to avoid the leak. That would be the easiest solution, implementation-wise, to solve this problem. Probably harder TC39-process wise though (or it might not even be on the table - I am not sure if this constraint is just from our implementation).

mcollina · 2023-04-03T13:39:21Z

I now wonder if it's too late to make the ShadowRealms API actively dispose themselves (e.g. realm.dispose()) to avoid the leak. That would be the easiest solution, implementation-wise, to solve this problem. Probably harder TC39-process wise though (or it might not even be on the table - I am not sure if this constraint is just from our implementation).

From my point of view, it would be tough for us to deal with any native/asynchronous resources without a way to dispose of the realm deliberately. With the current semantics, if we open a socket inside a ShadowRealm, and the wrapper object is collected, what should we do? There are three options:

keep the realm going until there are native resources allocated to it
shut all the associated native resources down
crash badly

With a dispose() function, we could decide what behavior we want to put in place.

joyeecheung · 2023-04-03T16:20:47Z

if we open a socket inside a ShadowRealm

I think the ability of opening a socket is probably out of the scope in ShadowRealms ;). But even in those cases, those wrappers are usually weak so would not hold the context alive. The more problematic ones are e.g. caches (e.g. for module imports and serialization, which should be available in shadow realms?), we may need strong references to them to guarantee idempotence.

The AsyncWrap problem is a bit separate - they can be made weak, but also we need to make sure that they don't access the context when the weak callbacks are called, which is unsafe if the weak callbacks are called as a result of the context itself becoming reachable. This isn't possible for principal realms, because their the context is always destructed after all BaseObjects are cleaned up and async tasks are finished, but currently shadow realms have a different order. And AsyncWrap are still used in some APIs that might be available to shadow realms (e.g. anything that uses the thread pool for async work, like compression streams and crypto jobs).

legendecas · 2023-04-03T17:39:08Z

Yeah, modules are supposed to be supported in the shadow realm and can be imported with shadowRealm.importValue() or dynamic imports in evaluate().

I think modules are a bit different from AsyncWrap problem. IMO, we should keep the realm alive if there are any ongoing async tasks. This avoids the problem that a shadow realm is becoming unreachable before an async task is finished.

There is actually a leak. The test doesn't exercise the right path to create a substantial enough object graph (e.g. accessing something that results in the loading of a binding). This does something more complicated in the test and moves it to known_issues until we find a fix. PR-URL: #47355 Refs: #47353 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Chengzhong Wu <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]>

joyeecheung · 2023-04-06T00:06:15Z

I have sort of a fix here: https://github.com/joyeecheung/node/tree/weak-binding - but it's more like a band-aid for the binding data, which are special in that a) they are limited and one-per-realm b) they are alive as long as the context is still alive c) it's fine to actively and deterministically destroy them only in the cleanup hook. These characteristics aren't true for other BaseObjects, though. So that isn't really a solution to the issue here. But at least it works for binding data in the shadow realms (which are the most likely BaseObjects users can run into in shadow realms or now), and it kind of makes sense on its own already, even not for shadow realms.

There is actually a leak. The test doesn't exercise the right path to create a substantial enough object graph (e.g. accessing something that results in the loading of a binding). This does something more complicated in the test and moves it to known_issues until we find a fix. PR-URL: #47355 Refs: #47353 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Chengzhong Wu <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]>

joyeecheung · 2023-04-07T01:38:46Z

IMO, we should keep the realm alive if there are any ongoing async tasks. This avoids the problem that a shadow realm is becoming unreachable before an async task is finished.

I think that only solves part of the problem. Even if we manage to make all the references to the BaseObject wrappers weak, there really is no guarantee that the weak callbacks of the wrappers are always fired before the weak callback of the context is fired (becoming unreachable first doesn't mean their weak callback always get called first, V8 explicitly does not guarantee that). So if the destructors of the BaseObjects assume that the context is still valid (e.g. by getting anything from the wrapper, or calling any JS function, or even accessing the internal aligned pointer - this happens to way too many BaseObjects), we have to actively complete all the destructions of BaseObjects before the context becomes unreachable (which in turn means we must be the one controlling the timing of the context destruction). And that's what I was seeing when I tried to make the binding data weak -the realm weak callback got called first before the binding data weak callback got called. So the best solution is to just make them weak without a callback and simply rely on the cleanup queue to destruct them. But that's a very unique solution that only applies to binding data (they are fixed and one-per-realm so we don't get leaks by doing this).

legendecas · 2023-04-07T03:06:55Z

If BaseObjects' weak callback gets called first, their persistent handles should be reset first at

node/src/base_object.cc

Line 61 in 7efae93

obj->persistent_handle_.Reset();

and in their destructor, the handle should be empty. It is not possible to access JavaScript in the weak callback already.

However, if the context's weak callback gets called first, it is unsafe to clean up tracked base objects in the first pass weak callback. Yet it is still feasible to schedule a second pass weak callback and run the cleanup hooks to destroy all tracked base objects: https://github.com/legendecas/node/tree/shadow-realm/binding-data. In this way, the base object's weak callbacks are invoked before the realm's second-pass weak callback and their persistent handles are reset before calling their destructor. This aligns with the existing plain weak BaseObject finalization behavior.

AsyncWrap's destructor calls into JavaScript. However, AsyncWrap should keep the realm reachable. If an AsyncWrap is un-refed, it should avoid calling any V8 APIs in the destructor with the flag Realm::is_cleaning_up_ introduced in https://github.com/joyeecheung/node/tree/weak-binding.

joyeecheung · 2023-04-07T10:27:21Z

In this way, the base object's weak callbacks are invoked before the realm's second-pass weak callback and their persistent handles are reset before calling their destructor.

I think in the case, the BaseObject's weak callbacks only happen to be invoked before the Realm's second-pass callback. V8 explicitly does not guarantee the order of the weak callback invocations - it actually doesn't even guarantee that these callbacks would ever be called...and is also part of the reason why we have the cleanup queue to get rid of any BaseObjects that aren't destructed via the weak callback as the last resort (arguably it's rarer to see weak callbacks never get called, so we could get away with relying on them to reclaim resources, but relying on an assumed order is a bit too far IMHO).

legendecas · 2023-04-10T09:54:38Z

@joyeecheung thanks for the clarification. I'm afraid my comment above may be not clear. If we defer the draining of the realms' cleanup hooks to a safe point to access V8 API (i.e. outside of weak callbacks), we can avoid depending on the order of the weak callbacks of BaseObjects and v8::Context.

If the realm's cleanup hooks are drained in the weak callback, BaseObject's (whose weak callbacks are pending to be invoked) destructor tries to create an instance handle of the wrapper object and set the internal fields. This violates V8's weak callback API contract and leads to a crash.

The second-pass callback is a way that shows how we defer draining the realm cleanup hooks. It can also be replaced with Environment::SetImmediate: legendecas@4582785#diff-a0d00119410bc8c902ca44fce396732d95fd4bb4f13ad177852b2663ce746ca6R52. This doesn't depend on the order of the weak callback invocations. Whether or not a BaseObject's weak callback is called before the context's weak callback, their persistent handles are reset in BaseObject's weak callback and their destructor is not going to modify the internal fields.

joyeecheung · 2023-04-10T16:16:56Z

It can also be replaced with Environment::SetImmediate: legendecas@4582785#diff-a0d00119410bc8c902ca44fce396732d95fd4bb4f13ad177852b2663ce746ca6R52. This doesn't depend on the order of the weak callback invocations.

I think as long as we use realm() in the BaseObject destructors and we destruct the BaseObjects in the weak callback, it still assumes that the realm weak callback is called after the BaseObject wrapper weak callbacks (if the realm weak callback is called first, in the BaseObject weak callbacks realm() returns a dangling pointer). So it's still necessary to make the wrappers weak without giving it a weak callback. Which is okay for BindingData, but not necessarily for others...

joyeecheung · 2023-04-10T17:43:02Z

Actually if the realm weak callback is called first, data.GetParameter() in the BaseObject callbacks already returns a dangling pointer..I think the second pass trick only enables us to make the BaseObject destructors less strict (now you can call the V8 APIs that do not need a context), but it doesn't solve the issue that if realm weak callbacks are called before BaseObject callbacks, the BaseObject weak callbacks should not even be called (there data.GetParameter() can already be bogus).

(I also noticed that technically speaking BaseObject's weak callbacks should be made two-passed and shouldn't actually touch V8 in the first pass weak callback too...and we should make the cleanup queue for BaseObjects a two-pass process as well to avoid interleaving weak callbacks with cleanup hooks - which might happen if the BaseObject destructors allocates any memory somehow - I am not sure how probably that is, but left a TODO here)

There is actually a leak. The test doesn't exercise the right path to create a substantial enough object graph (e.g. accessing something that results in the loading of a binding). This does something more complicated in the test and moves it to known_issues until we find a fix. PR-URL: nodejs#47355 Refs: nodejs#47353 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Chengzhong Wu <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]>

There is actually a leak. The test doesn't exercise the right path to create a substantial enough object graph (e.g. accessing something that results in the loading of a binding). This does something more complicated in the test and moves it to known_issues until we find a fix. PR-URL: #47355 Refs: #47353 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Chengzhong Wu <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]>

The binding data holds references to the AliasedBuffers directly from their wrappers which already ensures that the AliasedBuffers won't be accessed when the wrappers are GC'ed. So we can just make the global references to the AliasedBuffers weak. This way we can simply deserialize the typed arrays when deserialize the binding data and avoid the extra Object::Set() calls. It also eliminates the caveat in the JS land where aliased buffers must be dynamically read from the binding. PR-URL: #47354 Refs: #47353 Reviewed-By: Chengzhong Wu <[email protected]>

There is actually a leak. The test doesn't exercise the right path to create a substantial enough object graph (e.g. accessing something that results in the loading of a binding). This does something more complicated in the test and moves it to known_issues until we find a fix. PR-URL: #47355 Refs: #47353 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Chengzhong Wu <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]>

There is actually a leak. The test doesn't exercise the right path to create a substantial enough object graph (e.g. accessing something that results in the loading of a binding). This does something more complicated in the test and moves it to known_issues until we find a fix. PR-URL: nodejs#47355 Refs: nodejs#47353 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Chengzhong Wu <[email protected]> Reviewed-By: Yagiz Nizipli <[email protected]>

This was referenced Apr 1, 2023

src: make AliasedBuffers in the binding data weak #47354

Merged

test: move test-shadow-realm-gc.js to known_issues #47355

Merged

VoltrexKeyva added memory Issues and PRs related to the memory management or memory footprint. realm Issues and PRs related to the ShadowRealm API and node::Realm labels Apr 1, 2023

legendecas mentioned this issue Apr 12, 2023

[v19.x] src: bootstrap Web [Exposed=*] APIs in the shadow realm #47530

Closed

nodejs-github-bot closed this as completed in #47354 Apr 20, 2023

legendecas reopened this Apr 20, 2023

legendecas mentioned this issue Apr 23, 2023

src: make realm binding data store weak #47688

Closed

mcollina mentioned this issue May 4, 2023

vm: add experimental NodeRealm implementation #47855

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

memory leak in shadow realms #47353

memory leak in shadow realms #47353

joyeecheung commented Apr 1, 2023 •

edited

Loading

joyeecheung commented Apr 1, 2023

anonrig commented Apr 1, 2023

mcollina commented Apr 1, 2023

joyeecheung commented Apr 1, 2023 •

edited

Loading

legendecas commented Apr 2, 2023 •

edited

Loading

joyeecheung commented Apr 3, 2023 •

edited

Loading

joyeecheung commented Apr 3, 2023 •

edited

Loading

mcollina commented Apr 3, 2023

joyeecheung commented Apr 3, 2023 •

edited

Loading

legendecas commented Apr 3, 2023

joyeecheung commented Apr 6, 2023 •

edited

Loading

joyeecheung commented Apr 7, 2023 •

edited

Loading

legendecas commented Apr 7, 2023 •

edited

Loading

joyeecheung commented Apr 7, 2023 •

edited

Loading

legendecas commented Apr 10, 2023

joyeecheung commented Apr 10, 2023 •

edited

Loading

joyeecheung commented Apr 10, 2023 •

edited

Loading

memory leak in shadow realms #47353

memory leak in shadow realms #47353

Comments

joyeecheung commented Apr 1, 2023 • edited Loading

Version

Platform

Subsystem

What steps will reproduce the bug?

How often does it reproduce? Is there a required condition?

What is the expected behavior? Why is that the expected behavior?

What do you see instead?

Additional information

joyeecheung commented Apr 1, 2023

anonrig commented Apr 1, 2023

mcollina commented Apr 1, 2023

joyeecheung commented Apr 1, 2023 • edited Loading

legendecas commented Apr 2, 2023 • edited Loading

joyeecheung commented Apr 3, 2023 • edited Loading

joyeecheung commented Apr 3, 2023 • edited Loading

mcollina commented Apr 3, 2023

joyeecheung commented Apr 3, 2023 • edited Loading

legendecas commented Apr 3, 2023

joyeecheung commented Apr 6, 2023 • edited Loading

joyeecheung commented Apr 7, 2023 • edited Loading

legendecas commented Apr 7, 2023 • edited Loading

joyeecheung commented Apr 7, 2023 • edited Loading

legendecas commented Apr 10, 2023

joyeecheung commented Apr 10, 2023 • edited Loading

joyeecheung commented Apr 10, 2023 • edited Loading

joyeecheung commented Apr 1, 2023 •

edited

Loading

joyeecheung commented Apr 1, 2023 •

edited

Loading

legendecas commented Apr 2, 2023 •

edited

Loading

joyeecheung commented Apr 3, 2023 •

edited

Loading

joyeecheung commented Apr 3, 2023 •

edited

Loading

joyeecheung commented Apr 3, 2023 •

edited

Loading

joyeecheung commented Apr 6, 2023 •

edited

Loading

joyeecheung commented Apr 7, 2023 •

edited

Loading

legendecas commented Apr 7, 2023 •

edited

Loading

joyeecheung commented Apr 7, 2023 •

edited

Loading

joyeecheung commented Apr 10, 2023 •

edited

Loading

joyeecheung commented Apr 10, 2023 •

edited

Loading