
Improve rust resource usage #1

Closed
wants to merge 1 commit into from

Conversation

@omid commented Nov 29, 2024

It lowered memory usage to about 30% for rust_tokio.

@purplesyringa

IMO this is not a correct or fair replacement.

With all other languages and runtimes, what's tested is:

  1. How much memory futures require to exist, plus
  2. How much memory futures require to be driven by the event loop.

In my opinion, this is the only thing reasonable to measure.

In contrast, what this PR measures is:

  1. How much memory futures require to exist, plus
  2. How much memory one future requires to be driven by the event loop.

In other words, calling await sequentially makes the event loop acknowledge the futures one by one, never letting its internal priority queues grow by more than one element. In practice, awaiting sequentially like this has the effect of running the tasks in sequence rather than in parallel -- hardly the intended outcome.

The reason this is invisible is that sleep(duration) is in fact equivalent to sleep_until(now + duration) (in pseudocode, anyway). Even replacing sleep(duration) with async { sleep(duration).await } would quickly demonstrate that this code is not run in parallel anymore.
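The distinction can be sketched with std only. In the following toy example (EagerTimer, LazyTimer, and the busy-polling block_on are illustrative stand-ins I made up, not tokio types), an "eager" timer fixes its deadline at creation, the way sleep does, while a "lazy" one only fixes it at first poll, the way an async wrapper would -- and only the eager kind survives sequential awaiting with its parallelism intact:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::time::{Duration, Instant};

// A waker that does nothing; good enough for a busy-polling executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Minimal busy-polling executor, standing in for a real event loop.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = fut;
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is shadowed and never moved again.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}

// Eager timer: the deadline is fixed at *creation*, i.e. sleep_until(now + d).
struct EagerTimer {
    deadline: Instant,
}
impl EagerTimer {
    fn after(d: Duration) -> Self {
        EagerTimer { deadline: Instant::now() + d }
    }
}
impl Future for EagerTimer {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if Instant::now() >= self.deadline { Poll::Ready(()) } else { Poll::Pending }
    }
}

// Lazy timer: the deadline is only fixed at the *first poll*, which is how
// a future wrapped in an async block behaves.
struct LazyTimer {
    dur: Duration,
    deadline: Option<Instant>,
}
impl Future for LazyTimer {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let dur = self.dur;
        let deadline = *self.deadline.get_or_insert_with(|| Instant::now() + dur);
        if Instant::now() >= deadline { Poll::Ready(()) } else { Poll::Pending }
    }
}

fn main() {
    let d = Duration::from_millis(30);

    // Ten eager timers: their clocks all start ticking at creation.
    let eager: Vec<_> = (0..10).map(|_| EagerTimer::after(d)).collect();
    std::thread::sleep(d); // meanwhile, every deadline passes
    let start = Instant::now();
    for f in eager {
        block_on(f);
    }
    println!("eager, awaited sequentially: {:?}", start.elapsed()); // near zero

    // Ten lazy timers: each clock only starts at first poll, so sequential
    // awaiting really does run them one after another.
    let lazy: Vec<_> = (0..10).map(|_| LazyTimer { dur: d, deadline: None }).collect();
    std::thread::sleep(d);
    let start = Instant::now();
    for f in lazy {
        block_on(f);
    }
    println!("lazy, awaited sequentially: {:?}", start.elapsed()); // roughly 10 x 30ms
}
```

Under sequential await, the eager batch finishes almost instantly while the lazy batch takes the sum of the sleeps, which is the effect being described here.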

One way to prevent this is to write

use std::env;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let args: Vec<String> = env::args().collect();
    let num_tasks = args[1].parse::<i32>().unwrap();
    let tasks = (0..num_tasks)
        .map(|_| tokio::spawn(sleep(Duration::from_secs(10))))
        .collect::<Vec<_>>();

    for task in tasks {
        task.await;
    }
}

Note the use of tokio::spawn, which lets the runtime start the future immediately. However, I have a feeling that this isn't going to reduce the memory use.

@omid (author) commented Nov 29, 2024

Nope, it doesn't evaluate them one by one.

The script takes 10 seconds, no matter how many tasks you run. So for 1M tasks it also takes 10 seconds. And it consumed more memory than 100k.

@purplesyringa

I don't think you understood what I meant to say.

Your code does evaluate futures one by one. However, this seems to work as intended because "sleep for 10s" is translated to "sleep until now + 10s" upon future creation. So the futures are created, all waiting for approximately the same moment, and then they are evaluated one by one. The first future takes a while to evaluate; the rest are awaited almost instantaneously.

This is purely a side effect of the sleep time calculation being eager. You can clearly see that if sleep is replaced with any other future, including just a wrapper around sleep, the code behaves differently: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4e5f4c973216a695e06812bd455ab174.

In my opinion, it is exceedingly misleading to say that the code in this PR demonstrates that tokio uses less memory to run futures. As demonstrated, tokio loses this edge the moment sleep is replaced with any realistic future.

@purplesyringa

To be more specific, the problem here is that futures don't "really" start execution until they are first polled, and they don't resume execution until they're polled either.

Your code starts futures sequentially, while all other examples in this repository start futures in parallel, including the tokio code you've replaced.

This effect is hidden when sleep is used, because sleep is basically a no-op and doesn't need to perform any work to start or resume. This is not the case for futures created with async blocks (or just async fn). This is not the case for any complicated futures that need to perform multiple asynchronous steps, like responding to a request over a socket.

This is only the case for some very specific futures provided by the runtime -- sleep in this case, but you might be able to find others too. I stand by the belief that relying on this barely documented, practically inapplicable fact in a benchmark is wrong.
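The "nothing runs until the first poll" rule is easy to see with std alone. In this sketch (noop_waker and block_on are a toy executor I'm supplying for illustration, not runtime API), creating an async block runs none of its body; only polling it does:

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker that does nothing; sufficient for a trivial executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Minimal executor that polls a single future to completion.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = fut;
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is shadowed and never moved again.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    let started = Cell::new(false);
    // Creating the future does not execute any of its body.
    let fut = async {
        started.set(true);
    };
    assert!(!started.get()); // nothing has run yet
    block_on(fut); // the body only runs once the future is polled
    assert!(started.get());
    println!("the async block only executed once polled");
}
```

A future spawned onto a runtime is polled promptly by a worker thread, which is why tokio::spawn hides this laziness; a bare future awaited later is not.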

@omid (author) commented Nov 29, 2024

So if you are right, this code should wait 10 seconds between "2" and "3"?

use std::env;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let args: Vec<String> = env::args().collect();
    let num_tasks = args[1].parse::<i32>().unwrap();
    let tasks = (0..num_tasks)
        .map(|_| sleep(Duration::from_secs(10)))
        .collect::<Vec<_>>();

    println!("1");
    std::thread::sleep(std::time::Duration::from_secs(10));
    println!("2");
    for task in tasks {
        task.await;
    }
    println!("3");
}

@purplesyringa commented Nov 29, 2024

Maybe we're having a language-barrier problem here. This is not what I'm saying, at all. I hate telling you to reread what I said, but I'm out of ideas for how else to explain this.

As a very high-level not-at-all-correct metaphor, maybe consider that sleep has a property that any future wrapped in tokio::spawn has ("working fast under sequential .await"), but very few futures not wrapped in tokio::spawn have, so relying on this detail is "unfair" in an async runtime benchmark, as it's not applicable in any realistic use case, like running actual async functions.

@purplesyringa

Maybe a lower-level explanation will work better? I can write something up if you let me know how familiar you are with async internals, scheduling, and event loops in general.

@omid (author) commented Nov 29, 2024

I just checked tokio::time::sleep's internals and understood what you mean. So the original Rust code has the same issue.

With spawn in there, it consumes more memory than C#. Maybe the C# code also has some issues? I can't imagine C# being more memory-efficient than a language like Rust.

I know the basics, but I'm not familiar with tokio internals.

@purplesyringa commented Nov 29, 2024

The original code does run futures in parallel, because futures::future::join_all has a more complex implementation than just for { .await }.

However, I agree that the two tokio/async_std benchmarks are somewhat incorrect, because they test the implementation of futures more than the async runtimes themselves. To be clear, using futures is valid and often done in practice, but it can't run the futures in, say, multiple threads, and IMO it should just be a separate benchmark (with either tokio or async_std; it doesn't much matter which).

The correct way to test the individual runtimes is to invoke tokio::spawn / async_std::task::spawn and then join the futures one by one (with for { .await }). Let me see if I can implement that.

@purplesyringa

With the latter suggestion implemented, I get 386 MiB for tokio and 513 MiB for async_std, both on 1M tasks. I'm not yet sure why the memory use is that high, but it's quite possible that that's just how things are.

@purplesyringa commented Nov 29, 2024

Sadly, C# is the one language in this benchmark that I'm totally unfamiliar with. I'm reading the .NET code at the moment, and I think it's quite possible that their implementation is simply better than Rust's. Maybe polling (Rust) vs continuation (.NET) has something to do with this, I'm not sure. async_std's 512 bytes per task sounds unexpectedly high for no apparent reason, though.

@hauleth commented Nov 29, 2024

I think the main difference in memory usage comes from a simple place: Vec. IIRC, by default it allocates 2x the previous allocation when it runs out of space. If the original code simply used Vec::with_capacity(num_tasks), it would result in much better memory usage. But honestly, the code provided by @omid works exactly the same, and is:

  • more memory-optimised (it will not overallocate)
  • more idiomatic

@purplesyringa

I think the main difference there about memory usage comes from simple place - Vec.

No, it's quite easy to verify that's not the case. Even if you didn't verify that in practice, it's still clear that the next power of two after 1M is very close to 1M.

But honestly, the code provided by @omid works exactly the same

It only works the same by coincidence. Reading this thread would demonstrate that a semantic difference exists. Unless you're purely talking about the (0..num_tasks)...collect part, of course.
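The overallocation claim is easy to check directly. A std-only sketch (the doubling growth strategy is the current std behavior, not a documented guarantee, so treat the exact numbers as illustrative):

```rust
fn main() {
    // Push 1M elements into a Vec that starts empty and grows by doubling.
    let mut grown: Vec<u32> = Vec::new();
    for i in 0..1_000_000u32 {
        grown.push(i);
    }
    // On current std this lands on 1_048_576 = 2^20, only ~4.9% above len,
    // so growth overhead cannot explain a large memory gap near 1M elements.
    println!("grown: len = {}, capacity = {}", grown.len(), grown.capacity());

    // Preallocating removes even that slack.
    let exact: Vec<u32> = Vec::with_capacity(1_000_000);
    println!("with_capacity: capacity = {}", exact.capacity());
}
```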

@purplesyringa

I've done some research, and IMO, Rust is actually very inefficient re: coroutine memory use, apparently more so than C#. Hopefully this is fixed in rustc at some point, but it seems to be a known issue. I'll try to push some PRs to async_std/tokio in the meantime to work around individual manifestations of this issue, but I won't make any promises.

purplesyringa added a commit to purplesyringa/async-executor that referenced this pull request Nov 30, 2024
By creating the future manually instead of relying on `async { .. }`, we
work around rustc's inefficient future layout. On
[a simple benchmark](https://github.com/hez2010/async-runtimes-benchmarks-2024)
spawning 1M tasks, this reduces memory use from about 512 bytes per
future to about 340 bytes per future.

More context: hez2010/async-runtimes-benchmarks-2024#1
@purplesyringa

I've sent a PR to solve part of the problem for async_std. Any further practical improvements are likely blocked on rust-lang/rust#69826.

As for tokio, it seems to require a lot of per-task memory by design. On x86-64, it aligns tasks to 128 bytes (the cache prefetch line), and the headers are so large that the minimal task size is effectively raised to 256 bytes. Minor optimizations are probably possible, but they aren't going to be easy in the slightest.

Perhaps the lesson here is to use futures when possible.

@neon-sunset commented Nov 30, 2024

Sadly, C# is the one language in this benchmark that I'm totally unfamiliar with. I'm reading the .NET code at the moment, and I think it's quite possible that their implementation is simply better than Rust's. Maybe polling (Rust) vs continuation (.NET) has something to do with this, I'm not sure. async_std's 512 bytes per task sounds unexpectedly high for no apparent reason, though.

This might be reasonable. Tasks in .NET serve a similar role to Futures in Rust, but they are not part of a single coarser-grained concurrency unit like in Rust, where Futures are part of a Task. Since .NET is garbage-collected and heap-allocates the state-machine boxes for tasks that yield asynchronously, there is no need to know deterministically, in advance, the memory consumed by all the call stacks within a Task. The baseline allocation cost of a Task hence starts at about 100 B; another 100 B or so are spent on the Timer that Task.Delay allocates under the hood. This correlates quite closely with the observed memory consumption of 200 MB + extra for the 1M coroutines. You are likely to see greater memory consumption as the call stack of asynchronous methods becomes deeper and the state captured by the state-machine boxes becomes larger.

At the same time, a Task in Rust, if my understanding is correct, has definitive knowledge of the memory taken by all of its constituent Futures. As a result, it has much smaller CPU overhead due to the way it is dispatched/polled, and smaller amortized memory overhead once we start throwing complex code at it, which could explain why it starts at 500 B.

This difference is unlikely to matter that often, but it shows up in a common C# concurrency pattern where multiple calls operate on data independent of each other:

using var http = new HttpClient { BaseAddress = someUrl }; 
var page1 = http.GetStringAsync("/page1");
var page2 = http.GetStringAsync("/page2");
Console.WriteLine(await page1 + await page2);

In Rust, if you want whatever post-processing an async function does to be handled on a different thread, you have to spawn a Task. In .NET, on the other hand, this is the default behavior: tasks are hot-started, and any worker thread can steal the processing of a continuation from another worker's queue if the latter doesn't get to it in time.

Let me know if I got any details wrong and hopefully this sheds some light on the async differences between the two.

@omid (author) commented Dec 1, 2024

I'm closing this in favor of the other merged PR.

@omid closed this Dec 1, 2024
purplesyringa added a commit to purplesyringa/async-executor that referenced this pull request Dec 3, 2024
notgull pushed a commit to smol-rs/async-executor that referenced this pull request Dec 3, 2024