Consider supporting non-Copy scalar types. #282
Comments
Just thinking out loud here: in theory, one should be able to replace every copy by a move or clone (whatever lifetimes allow) and then relax the trait bounds from `Copy` to `Clone`. Of course, the implementations will end up being more verbose. Except for any kind of unsafe code that relies on stuff being `Copy`.
Yes, that's the idea. We have some code that does rely on the scalar type being `Copy` though. I believe those are limited to places where the
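To make the discussed relaxation concrete, here is a minimal sketch (not actual nalgebra code; `DummyMatrix`, `fill_copy` and `fill_clone` are made-up names for illustration) of what trading a `Copy` bound for a `Clone` bound looks like: every implicit copy becomes an explicit `clone()`.

// Hypothetical illustration of the Copy -> Clone relaxation discussed above.
// `DummyMatrix` stands in for a generic container; it is not a nalgebra type.
struct DummyMatrix<T> {
    data: Vec<T>,
}

// Before: the bound requires Copy, so elements can be copied implicitly.
fn fill_copy<T: Copy>(m: &mut DummyMatrix<T>, value: T) {
    for x in m.data.iter_mut() {
        *x = value; // implicit bitwise copy
    }
}

// After: only Clone is required; each copy becomes an explicit clone().
fn fill_clone<T: Clone>(m: &mut DummyMatrix<T>, value: T) {
    for x in m.data.iter_mut() {
        *x = value.clone(); // explicit, possibly non-trivial, clone
    }
}

fn main() {
    let mut m = DummyMatrix { data: vec![0.0f64; 4] };
    fill_copy(&mut m, 1.0);
    fill_clone(&mut m, 2.0);
    println!("{:?}", m.data);
}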
Any updates on this?
@StevenDoesStuffs No, this requires a massive amount of work, so we haven't had the time to explore this yet.
I made a minimal-ish test case that gets different debug-mode codegen for the `Clone`-bounded version (including an explicit call to the `clone` impl). The closest rustc issue I could find is rust-lang/rust#59375.

If using Clone bounds causes a debug-mode performance regression, but has no release-mode impact, will it still be considered? If not, I'll try making a rustc patch that at least inlines clone in this example before trying the Copy -> Clone refactoring.

Example:

fn memcpyish_clone<T: Clone>(x: &[T], y: &mut [T]) {
for (a, b) in x.iter().zip(y) {
*b = a.clone();
}
}
fn memcpyish_copy<T: Copy>(x: &[T], y: &mut [T]) {
for (a, b) in x.iter().zip(y) {
*b = *a;
}
}
#[no_mangle]
#[inline(never)]
fn memcpyish_clone_usize(x: &[usize], y: &mut [usize]) {
memcpyish_clone::<usize>(x, y)
}
#[no_mangle]
#[inline(never)]
fn memcpyish_copy_usize(x: &[usize], y: &mut [usize]) {
memcpyish_copy::<usize>(x, y)
}
fn main() {
let x = [1,2,3];
let mut y = [0,0,0];
let mut z = [0,0,0];
memcpyish_copy_usize(&x, &mut y);
memcpyish_clone_usize(&x, &mut z);
}
/*
$ rustc copy_debug_codegen_20191121.rs
$ objdump -d copy_debug_codegen_20191121 | grep '<_ZN[^>]*memcpyish_[^>]*>:' -A53
$ r2 copy_debug_codegen_20191121
[0x000062b0]> pd 5 @ sym.memcpyish_clone_usize
;-- memcpyish_clone_usize:
0x00006590 50 pushq %rax
0x00006591 e84afeffff callq sym.copy_debug_codegen_20191121::memcpyish_clone::hf5036453dec78982
0x00006596 58 popq %rax
0x00006597 c3 retq
0x00006598 0f1f84000000. nopl 0(%rax, %rax)
The block for _ZN27copy_debug_codegen_2019112114memcpyish_copy17h08e0b47cae3dd97dE is shorter, and does not contain a call to _ZN4core5clone5impls54_$LT$impl$u20$core..clone..Clone$u20$for$u20$usize$GT$5clone17h80760c01c3acfef1E
With -O, rustc recognizes that the optimized function bodies are identical, and emits two symbols for the same block of code:
$ rustc -O copy_debug_codegen_20191121.rs
$ r2 copy_debug_codegen_20191121
[0x000063f0]> s sym.memcpyish_copy_usize
[0x00006520]> pd 14
;-- memcpyish_clone_usize:
;-- memcpyish_copy_usize:
0x00006520 4889f0 movq %rsi, %rax
0x00006523 4839ce cmpq %rcx, %rsi
0x00006526 480f47c1 cmovaq %rcx, %rax
0x0000652a 4885c0 testq %rax, %rax
,=< 0x0000652d 7418 je 0x6547
| 0x0000652f 50 pushq %rax
| 0x00006530 4889fe movq %rdi, %rsi
| 0x00006533 48c1e003 shlq $3, %rax
| 0x00006537 4889d7 movq %rdx, %rdi
| 0x0000653a 4889c2 movq %rax, %rdx
| 0x0000653d ff15edc72200 callq 0x00006543 ; reloc.memcpy ; [0x232d30:8]=0
| 0x00006543 4883c408 addq $8, %rsp
`-> 0x00006547 c3 retq
0x00006548 0f1f84000000. nopl 0(%rax, %rax)
Interestingly, even in debug mode, `-C inline-threshold=` does change the binary (in that memcpyish_{copy,clone}_usize contain the actual implementation instead of a 4-instruction stub), but increasing it further doesn't seem to inline Clone:
$ rustc -C inline-threshold=5 copy_debug_codegen_20191121.rs
*/
It will be considered only if the regression is not too significant (up to 10%) in debug mode. However, we can accept a regression if the support of non-Copy types requires a feature to be enabled. That way the user will have to enable this feature explicitly before using non-Copy types, and thus explicitly accept that this will cause a performance regression in debug mode. Existing users of Copy types won't be affected by the regression because they won't have the feature enabled.
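For illustration, below is a rough sketch of how such an opt-in could be expressed at the type level. The feature name `non-copy-scalars` and the `ScalarBound` alias trait are assumptions made up for this sketch, not an actual nalgebra API; the real design would likely differ.

// With the (hypothetical) Cargo feature enabled, the bound is only Clone;
// without it the bound stays Copy, preserving today's debug-mode codegen.
#[cfg(feature = "non-copy-scalars")]
pub trait ScalarBound: Clone {}
#[cfg(feature = "non-copy-scalars")]
impl<T: Clone> ScalarBound for T {}

#[cfg(not(feature = "non-copy-scalars"))]
pub trait ScalarBound: Copy {}
#[cfg(not(feature = "non-copy-scalars"))]
impl<T: Copy> ScalarBound for T {}

// Library code is written once against the alias trait and uses clone()
// everywhere; for Copy types the clone is trivial.
fn sum_first_two<T: ScalarBound + std::ops::Add<Output = T>>(data: &[T]) -> T {
    data[0].clone() + data[1].clone()
}

fn main() {
    println!("{}", sum_first_two(&[1.0f64, 2.0]));
}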
Yeah, it would be great if, when the type is `Copy`, the clone call still got inlined in debug mode. I guess an alternative could be to have a trait like that, and use `clone_inlined` instead of `clone` everywhere internally:

trait InlinedClone: Clone {
fn clone_inlined(&self) -> Self;
}
impl<T: Clone> InlinedClone for T {
#[inline(always)]
fn clone_inlined(&self) -> Self {
self.clone()
}
}
impl<T: Copy> InlinedClone for T {
#[inline(always)]
fn clone_inlined(&self) -> Self {
*self
}
}

My guess is that the `#[inline(always)]` attributes would be enough to avoid the debug-mode regression for `Copy` types. Of course this is not possible without specialization (overlapping impls), which is why it may not be the best option.
Yet another option would be to add the `inlined_clone` method to the `Scalar` trait directly:

pub trait Scalar: PartialEq + Debug + Any {
/// Tests if `Self` is the same as the type `T`.
///
/// Typically used to test whether `Self` is an `f32` or an `f64` with `N::is::<f32>()`.
#[inline]
fn is<T: Scalar>() -> bool {
TypeId::of::<Self>() == TypeId::of::<T>()
}
fn inlined_clone(&self) -> Self;
}
impl<T: Copy + PartialEq + Debug + Any> Scalar for T {
#[inline(always)]
fn inlined_clone(&self) -> Self {
*self
}
}

This should still allow the user to implement `Scalar` explicitly for their own non-Copy types:

impl Scalar for MyNonCopyType {
#[inline(always)]
fn inlined_clone(&self) -> Self {
self.clone()
}
}
How would that be accomplished without risking breaking downstream code when the feature is enabled?
I tweaked the codegen demo to add the `InlinedClone` trait sketched above. If I'm understanding correctly, adding `#[inline(always)]` to the `Copy` impl is enough to get the clone inlined even in debug mode: the ending of memcpyish_inlined_clone_usize below contains no call, unlike memcpyish_clone_usize. It's worth noting that the inlined version still emits a few more instructions than the plain `Copy` version in debug mode. Codegen demo follows:

trait InlinedClone: {
fn inlined_clone(&self) -> Self;
}
impl<T: Copy> InlinedClone for T {
#[inline(always)]
fn inlined_clone(&self) -> T {
*self
}
}
fn memcpyish_inlined_clone<T: InlinedClone>(x: &[T], y: &mut [T]) {
for (a, b) in x.iter().zip(y) {
*b = a.inlined_clone();
}
}
fn memcpyish_clone<T: Clone>(x: &[T], y: &mut [T]) {
for (a, b) in x.iter().zip(y) {
*b = a.clone();
}
}
fn memcpyish_copy<T: Copy>(x: &[T], y: &mut [T]) {
for (a, b) in x.iter().zip(y) {
*b = *a;
}
}
#[no_mangle]
#[inline(never)]
fn memcpyish_inlined_clone_usize(x: &[usize], y: &mut [usize]) {
memcpyish_inlined_clone::<usize>(x, y)
}
#[no_mangle]
#[inline(never)]
fn memcpyish_clone_usize(x: &[usize], y: &mut [usize]) {
memcpyish_clone::<usize>(x, y)
}
#[no_mangle]
#[inline(never)]
fn memcpyish_copy_usize(x: &[usize], y: &mut [usize]) {
memcpyish_copy::<usize>(x, y)
}
fn main() {
let x = [1,2,3];
let mut y1 = [0,0,0];
let mut y2 = [0,0,0];
let mut y3 = [0,0,0];
memcpyish_copy_usize(&x, &mut y1);
memcpyish_clone_usize(&x, &mut y2);
memcpyish_inlined_clone_usize(&x, &mut y3);
}
/*
ending part of memcpyish_inlined_clone_usize:
| | : ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_inlined_clone::h0c8f347077976026 (0x6f6e)
| `----> 0x00006f7a 488b8424c800. movq local_c8h, %rax ; [0xc8:8]=0
| : 0x00006f82 488b8c24d000. movq local_d0h, %rcx ; [0xd0:8]=0x315b4 segment_end.LOAD0
| : 0x00006f8a 488b00 movq 0(%rax), %rax
| : 0x00006f8d 48894c2410 movq %rcx, local_10h
| : 0x00006f92 4889442408 movq %rax, local_8h
| ,==< 0x00006f97 eb00 jmp 0x6f99
| |: ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_inlined_clone::h0c8f347077976026 (0x6f97)
| `--> 0x00006f99 488b442410 movq local_10h, %rax ; [0x10:8]=0x1003e0003
| : 0x00006f9e 488b4c2408 movq local_8h, %rcx ; [0x8:8]=0
| : 0x00006fa3 488908 movq %rcx, 0(%rax)
\ `=< 0x00006fa6 eb92 jmp 0x6f3a
ending part of memcpyish_copy_usize:
| | : ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_copy::h32096c80aa649f39 (0x712e)
| `----> 0x0000713a 488b8424b800. movq local_b8h, %rax ; [0xb8:8]=0
| : 0x00007142 488b8c24c000. movq local_c0h, %rcx ; [0xc0:8]=0
| : 0x0000714a 488b00 movq 0(%rax), %rax
| : 0x0000714d 488901 movq %rax, 0(%rcx)
\ `=< 0x00007150 eba8 jmp 0x70fa
ending part of memcpyish_clone_usize:
| | : ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_clone::h36d27fecff3c0f2c (0x704e)
| `----> 0x0000705a 488bbc24c800. movq local_c8h, %rdi ; [0xc8:8]=0
| : 0x00007062 488b8424d000. movq local_d0h, %rax ; [0xd0:8]=0x315b4 segment_end.LOAD0
| : 0x0000706a 4889442410 movq %rax, local_10h
| : 0x0000706f e8acf6ffff callq sym.core::clone::impls::__impl_core::clone::Clone_for_usize_::clone::h3d9a1e411c34f5b2
| : 0x00007074 4889442408 movq %rax, local_8h
| ,==< 0x00007079 eb00 jmp 0x707b
| |: ; CODE XREF from sym.copy_debug_codegen_20191205::memcpyish_clone::h36d27fecff3c0f2c (0x7079)
| `--> 0x0000707b 488b442410 movq local_10h, %rax ; [0x10:8]=0x1003e0003
| : 0x00007080 488b4c2408 movq local_8h, %rcx ; [0x8:8]=0
| : 0x00007085 488908 movq %rcx, 0(%rax)
\ `=< 0x00007088 eb90 jmp 0x701a
*/
Is there a specific need for external code to call it? If not,
External code won't need to call it, but it will need to impl it (for custom non-Copy scalar types).
Ah, good point. In that case I'd just plan for a deprecation cycle; a blanket impl sounds like a lot of complexity at best.
I've written #684. What is the recommended way to benchmark debug performance? IIRC `cargo bench` always compiles with optimizations enabled.
@aweinstock314 You can benchmark in "debug mode" by disabling optimization in bench mode. Simply add this to the `Cargo.toml` of your project:

[profile.bench]
opt-level = 0
lto = false
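For completeness, here is a minimal sketch of a benchmark that this profile override would affect. It uses the nightly-only built-in test harness and is meant to live in a bench target; the function and benchmark names are made up for illustration, not taken from the nalgebra benches.

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Same shape as the codegen demo above: a clone-based element-wise copy.
fn memcpyish_clone<T: Clone>(x: &[T], y: &mut [T]) {
    for (a, b) in x.iter().zip(y) {
        *b = a.clone();
    }
}

#[bench]
fn bench_memcpyish_clone_f64(bencher: &mut Bencher) {
    let src = vec![1.0f64; 1024];
    let mut dst = vec![0.0f64; 1024];
    // With opt-level = 0 in [profile.bench], this measures the debug-mode cost
    // of the explicit clone() calls.
    bencher.iter(|| {
        memcpyish_clone(black_box(&src), black_box(&mut dst));
    });
}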
See the comments on reddit.