
Improve rust resource usage #1

Closed
wants to merge 1 commit into from

Conversation

@omid commented Nov 29, 2024

It lowered memory usage to about 30% for rust_tokio.

@purplesyringa

IMO this is not a correct or fair replacement.

With all other languages and runtimes, what's tested is:

  1. How much memory futures require to exist, plus
  2. How much memory futures require to be driven by the event loop.

In my opinion, this is the only thing reasonable to measure.

In contrast, what this PR measures is:

  1. How much memory futures require to exist, plus
  2. How much memory one future requires to be driven by the event loop.

In other words, calling await sequentially makes the event loop acknowledge the futures one by one, never letting its internal priority queues grow by more than one element. In practice, awaiting sequentially like this has the effect of running the tasks in sequence rather than in parallel -- hardly the intended outcome.

The reason this is invisible is that sleep(duration) is in fact equivalent to sleep_until(now + duration) (in pseudocode, anyway). Even replacing sleep(duration) with async { sleep(duration).await } would quickly demonstrate that this code is not run in parallel anymore.
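The distinction can be sketched with std only. In the following toy example (EagerTimer, LazyTimer, and the busy-polling block_on are illustrative stand-ins I made up, not tokio types), an "eager" timer fixes its deadline at creation, the way sleep does, while a "lazy" one only fixes it at first poll, the way an async wrapper would -- and only the eager kind survives sequential awaiting with its parallelism intact:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::time::{Duration, Instant};

// A waker that does nothing; good enough for a busy-polling executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Minimal busy-polling executor, standing in for a real event loop.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = fut;
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is shadowed and never moved again.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}

// Eager timer: the deadline is fixed at *creation*, i.e. sleep_until(now + d).
struct EagerTimer {
    deadline: Instant,
}
impl EagerTimer {
    fn after(d: Duration) -> Self {
        EagerTimer { deadline: Instant::now() + d }
    }
}
impl Future for EagerTimer {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if Instant::now() >= self.deadline { Poll::Ready(()) } else { Poll::Pending }
    }
}

// Lazy timer: the deadline is only fixed at the *first poll*, which is how
// a future wrapped in an async block behaves.
struct LazyTimer {
    dur: Duration,
    deadline: Option<Instant>,
}
impl Future for LazyTimer {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let dur = self.dur;
        let deadline = *self.deadline.get_or_insert_with(|| Instant::now() + dur);
        if Instant::now() >= deadline { Poll::Ready(()) } else { Poll::Pending }
    }
}

fn main() {
    let d = Duration::from_millis(30);

    // Ten eager timers: their clocks all start ticking at creation.
    let eager: Vec<_> = (0..10).map(|_| EagerTimer::after(d)).collect();
    std::thread::sleep(d); // meanwhile, every deadline passes
    let start = Instant::now();
    for f in eager {
        block_on(f);
    }
    println!("eager, awaited sequentially: {:?}", start.elapsed()); // near zero

    // Ten lazy timers: each clock only starts at first poll, so sequential
    // awaiting really does run them one after another.
    let lazy: Vec<_> = (0..10).map(|_| LazyTimer { dur: d, deadline: None }).collect();
    std::thread::sleep(d);
    let start = Instant::now();
    for f in lazy {
        block_on(f);
    }
    println!("lazy, awaited sequentially: {:?}", start.elapsed()); // roughly 10 x 30ms
}
```

Under sequential await, the eager batch finishes almost instantly while the lazy batch takes the sum of the sleeps, which is the effect being described here.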

One way to prevent this is to write

use std::env;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let args: Vec<String> = env::args().collect();
    let num_tasks = args[1].parse::<i32>().unwrap();
    let tasks = (0..num_tasks)
        .map(|_| tokio::spawn(sleep(Duration::from_secs(10))))
        .collect::<Vec<_>>();

    for task in tasks {
        task.await;
    }
}

Note the use of tokio::spawn, which lets the runtime start the future immediately. However, I have a feeling that this isn't going to reduce the memory use.

@omid (author) commented Nov 29, 2024

Nope, it doesn't evaluate them one by one.

The script takes 10 seconds, no matter how many tasks you run. So for 1M tasks it also takes 10 seconds. And it consumed more memory than 100k.

@purplesyringa

I don't think you understood what I meant to say.

Your code does evaluate futures one by one. However, this seems to work as intended because "sleep for 10s" is translated to "sleep until now + 10s" upon future creation. So the futures are created, all waiting for approximately the same moment, and then they are evaluated one by one. The first future takes a while to evaluate; the rest are awaited almost instantaneously.

This is purely a side effect of the sleep time calculation being eager. You can clearly see that if sleep is replaced with any other future, including just a wrapper around sleep, the code behaves differently: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4e5f4c973216a695e06812bd455ab174.

In my opinion, it is exceedingly misleading to say that the code in this PR demonstrates that tokio uses less memory to run futures. As demonstrated, tokio loses this edge the moment sleep is replaced with any realistic future.

@purplesyringa

To be more specific, the problem here is that futures don't "really" start execution until they are first polled, and they don't resume execution until they're polled either.

Your code starts futures sequentially, while all other examples in this repository start futures in parallel, including the tokio code you've replaced.

This effect is hidden when sleep is used, because sleep is basically a no-op and doesn't need to perform any work to start or resume. This is not the case for futures created with async blocks (or just async fn). This is not the case for any complicated futures that need to perform multiple asynchronous steps, like responding to a request over a socket.

This is only the case for some very specific futures provided by the runtime -- sleep in this case, but you might be able to find others too. I stand by the belief that relying on this barely documented, practically inapplicable fact in a benchmark is wrong.
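The "nothing runs until the first poll" rule is easy to see with std alone. In this sketch (noop_waker and block_on are a toy executor I'm supplying for illustration, not runtime API), creating an async block runs none of its body; only polling it does:

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker that does nothing; sufficient for a trivial executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Minimal executor that polls a single future to completion.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = fut;
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is shadowed and never moved again.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    let started = Cell::new(false);
    // Creating the future does not execute any of its body.
    let fut = async {
        started.set(true);
    };
    assert!(!started.get()); // nothing has run yet
    block_on(fut); // the body only runs once the future is polled
    assert!(started.get());
    println!("the async block only executed once polled");
}
```

A future spawned onto a runtime is polled promptly by a worker thread, which is why tokio::spawn hides this laziness; a bare future awaited later is not.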

@omid (author) commented Nov 29, 2024

So if you are right, this code should wait 10 seconds between "2" and "3"?

use std::env;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let args: Vec<String> = env::args().collect();
    let num_tasks = args[1].parse::<i32>().unwrap();
    let tasks = (0..num_tasks)
        .map(|_| sleep(Duration::from_secs(10)))
        .collect::<Vec<_>>();

    println!("1");
    std::thread::sleep(std::time::Duration::from_secs(10));
    println!("2");
    for task in tasks {
        task.await;
    }
    println!("3");
}

@purplesyringa commented Nov 29, 2024

Maybe we're having a language-barrier problem here. This is not what I'm saying, at all. I hate telling you to reread what I said, but I'm out of ideas for how else to explain this.

As a very high-level not-at-all-correct metaphor, maybe consider that sleep has a property that any future wrapped in tokio::spawn has ("working fast under sequential .await"), but very few futures not wrapped in tokio::spawn have, so relying on this detail is "unfair" in an async runtime benchmark, as it's not applicable in any realistic use case, like running actual async functions.

@purplesyringa

Maybe a lower-level explanation will work better? I can write something up if you let me know how familiar you are with async internals, scheduling, and event loops in general.

@omid (author) commented Nov 29, 2024

I just checked tokio::time::sleep's internals and understood what you mean. So the original Rust code has the same issue.

With spawn in there, it consumes more memory than C#. Maybe the C# code also has some issues? I can't imagine C# being more memory-efficient than a language like Rust.

I know the basics, but I'm not familiar with tokio internals.

@purplesyringa commented Nov 29, 2024

The original code does run futures in parallel, because futures::future::join_all has a more complex implementation than just for { .await }.

However, I agree that the two tokio/async_std benchmarks are somewhat incorrect, because they test the implementation of futures more than the async runtimes themselves. To be clear, using futures is valid and often done in practice, but it can't run the futures in, say, multiple threads, and IMO it should just be a separate benchmark (with either tokio or async_std; it doesn't much matter which).

The correct way to test the individual runtimes is to invoke tokio::spawn / async_std::task::spawn and then join the futures one by one (with for { .await }). Let me see if I can implement that.

@purplesyringa

With the latter suggestion implemented, I get 386 MiB for tokio and 513 MiB for async_std, both on 1M tasks. I'm not yet sure why the memory use is that high, but it's quite possible that that's just how things are.

@purplesyringa commented Nov 29, 2024

Sadly, C# is the one language in this benchmark that I'm totally unfamiliar with. I'm reading the .NET code at the moment, and I think it's quite possible that their implementation is simply better than Rust's. Maybe polling (Rust) vs continuation (.NET) has something to do with this, I'm not sure. async_std's 512 bytes per task sounds unexpectedly high for no apparent reason, though.

@hauleth commented Nov 29, 2024

I think the main difference in memory usage comes from a simple place: Vec. IIRC, by default it allocates 2x the previous allocation when it runs out of space. If the original code simply used Vec::with_capacity(num_tasks), it would result in much better memory usage. But honestly, the code provided by @omid works exactly the same, and is:

  • more memory-optimised (it will not overallocate)
  • more idiomatic

@purplesyringa

I think the main difference there about memory usage comes from simple place - Vec.

No, it's quite easy to verify that's not the case. Even if you didn't verify that in practice, it's still clear that the next power of two after 1M is very close to 1M.

But honestly, the code provided by @omid works exactly the same

It only works the same by coincidence. Reading this thread would demonstrate that a semantic difference exists. Unless you're purely talking about the (0..num_tasks)...collect part, of course.
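The overallocation claim is easy to check directly. A std-only sketch (the doubling growth strategy is the current std behavior, not a documented guarantee, so treat the exact numbers as illustrative):

```rust
fn main() {
    // Push 1M elements into a Vec that starts empty and grows by doubling.
    let mut grown: Vec<u32> = Vec::new();
    for i in 0..1_000_000u32 {
        grown.push(i);
    }
    // On current std this lands on 1_048_576 = 2^20, only ~4.9% above len,
    // so growth overhead cannot explain a large memory gap near 1M elements.
    println!("grown: len = {}, capacity = {}", grown.len(), grown.capacity());

    // Preallocating removes even that slack.
    let exact: Vec<u32> = Vec::with_capacity(1_000_000);
    println!("with_capacity: capacity = {}", exact.capacity());
}
```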

@purplesyringa

I've done some research, and IMO, Rust is actually very inefficient re: coroutine memory use, apparently more so than C#. Hopefully this is fixed in rustc at some point, but it seems to be a known issue. I'll try to push some PRs to async_std/tokio in the meantime to work around individual manifestations of this issue, but I won't make any promises.

purplesyringa added a commit to purplesyringa/async-executor that referenced this pull request Nov 30, 2024
By creating the future manually instead of relying on `async { .. }`, we
work around rustc's inefficient future layout. On
[a simple benchmark](https://github.com/hez2010/async-runtimes-benchmarks-2024)
spawning 1M tasks, this reduces memory use from about 512 bytes per
future to about 340 bytes per future.

More context: hez2010/async-runtimes-benchmarks-2024#1
@purplesyringa

I've sent a PR to solve part of the problem for async_std. Any further practical improvements are likely blocked on rust-lang/rust#69826.

As for tokio, it seems to require a lot of per-task memory by design. On x86-64, it aligns tasks to 128 bytes (the cache prefetch line), and the headers are so large that the minimal task size is effectively raised to 256 bytes. Minor optimizations are probably possible, but they aren't going to be easy in the slightest.

Perhaps the lesson here is to use futures when possible.

@neon-sunset commented Nov 30, 2024

Sadly, C# is the one language in this benchmark that I'm totally unfamiliar with. I'm reading the .NET code at the moment, and I think it's quite possible that their implementation is simply better than Rust's. Maybe polling (Rust) vs continuation (.NET) has something to do with this, I'm not sure. async_std's 512 bytes per task sounds unexpectedly high for no apparent reason, though.

This might be reasonable. Tasks in .NET serve a similar role to Futures in Rust, but they are not part of a single coarser-grained concurrency unit like in Rust, where Futures are part of a Task. Since .NET is garbage-collected and heap-allocates the state-machine boxes for tasks that yield asynchronously, there is no need to know deterministically, in advance, the memory consumed by all the call stacks within a Task. The baseline allocation cost of a Task hence starts at about 100 B; another 100 B or so are spent on the Timer that Task.Delay allocates under the hood. This correlates quite closely with the observed memory consumption of 200 MB + extra for the 1M coroutines. You are likely to see greater memory consumption as the call stack of asynchronous methods becomes deeper and the state captured by the state-machine boxes becomes larger.

At the same time, a Task in Rust, if my understanding is correct, has definitive knowledge of the memory taken by all of its constituent Futures. As a result, it has much smaller CPU overhead due to the way it is dispatched/polled, and smaller amortized memory overhead once we start throwing complex code at it, which could explain why it starts at 500 B.

This difference is unlikely to matter that often, but it shows up in a common C# concurrency pattern where multiple calls operate on data independent of each other:

using var http = new HttpClient { BaseAddress = someUrl }; 
var page1 = http.GetStringAsync("/page1");
var page2 = http.GetStringAsync("/page2");
Console.WriteLine(await page1 + await page2);

In Rust, if you want whatever post-processing an async function does to be handled on a different thread, you have to spawn a Task. In .NET, on the other hand, this is the default behavior: tasks are hot-started, and any worker thread can steal the processing of a continuation from another worker's queue if the latter doesn't get to it in time.

Let me know if I got any details wrong and hopefully this sheds some light on the async differences between the two.

@omid (author) commented Dec 1, 2024

I'm closing this in favor of the other merged PR.

@omid closed this Dec 1, 2024
purplesyringa added a commit to purplesyringa/async-executor that referenced this pull request Dec 3, 2024
notgull pushed a commit to smol-rs/async-executor that referenced this pull request Dec 3, 2024