
Validity of references: Memory-related properties #77

Closed

RalfJung opened this issue Jan 10, 2019 · 114 comments
Labels
A-validity (Topic: Related to validity invariants)

Comments


RalfJung commented Jan 10, 2019

Discussing the memory-related properties of references: does &T have to point to allocated memory (with at least size_of::<T>() bytes being allocated)? If yes, does the memory have to contain data that satisfies the validity invariant of T?

If the answer to both of these questions is "yes", one consequence is that &! is uninhabited: There is no valid reference of type &!.

Currently, during LLVM lowering, we add a "dereferenceable" attribute to references, indicating that the answer to the first question should be "yes". This would be rather unique: it is the only case where validity depends on the contents of memory. This opens some new, interesting questions:

  1. I mentioned above that size_of::<T>() many bytes need to be dereferenceable. How do we handle unsized types? We could determine the size from the metadata and the type of the unsized tail. For slices, that's really easy, but for trait objects this involves the vtable, so it would introduce yet another kind of dependency of validity on memory (see the sketch after this list). However, vtables must not be modified, and they are never deallocated (right?), so this is a fairly weak form of dependency: if a pointer was a valid vtable pointer once, then it always will be.

    With more exotic forms of unsized types, this becomes less easy. extern type we can mostly ignore: we cannot even dynamically determine its size, so we basically have to assume it is 0 and check dereferenceability for that. But what about custom DSTs? I don't think we want to make validity depend on executing arbitrary user-defined code. We could just check validity for the sized prefix of such an unsized type, but that would introduce an inconsistency between primitive DSTs and user-defined custom DSTs. Is that a problem?

    For unsized types, even the requirement that the pointer be well-aligned becomes subtle, because determining the alignment has similar issues to determining the size.

  2. What about validity of ManuallyDrop<&T>? ManuallyDrop<T> certainly shares all the bit-level properties of T, because we perform layout optimizations on it. But does ManuallyDrop<&T> have to be dereferenceable?
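To make the slice vs. trait-object contrast from point 1 concrete, here is a minimal sketch (an editor's illustration, not part of the original issue): for slices, the referent size follows from the pointer metadata alone, while for trait objects it requires reading the vtable that the metadata points to.

use std::fmt::Debug;
use std::mem::{size_of, size_of_val};

fn main() {
    // Slice: the referent size is length * size_of::<u16>(), computable from
    // the fat-pointer metadata alone, without touching memory.
    let xs: &[u16] = &[1, 2, 3];
    assert_eq!(size_of_val(xs), 3 * size_of::<u16>());

    // Trait object: the size is stored in the vtable, so computing it means
    // dereferencing the vtable pointer carried in the metadata.
    let d: &dyn Debug = &42u32;
    assert_eq!(size_of_val(d), size_of::<u32>());
}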

Note that this is not about aliasing or provenance; those should be discussed separately -- a bunch of open issues already exist for provenance in general and stacked borrows specifically.

CURRENT STATE: The thread is long and many positions were stated without a good summary. My own latest position can be found here.

RalfJung added the "active discussion topic" and "A-validity" labels on Jan 10, 2019.
@RalfJung (Member, Author)

Notice that there is an alternative that would make validity not depend on memory while maintaining the dereferenceable attribute: the Stacked Borrows aliasing model includes an operation on references called "retagging" that, among other things, raises UB if the reference is not dereferenceable. So, if we answer the two questions from the OP with "yes" and "no", respectively, we could equivalently say that validity does not make any requirements about references being dereferenceable, but the aliasing model does. That would make validity a property that only depends on raw bits, not on memory, which would simplify the discussion elsewhere (and resolve #50 (comment)).

With this approach, the properties of ManuallyDrop<&T> would be determined by whether retagging descends into the fields of ManuallyDrop or not.

@RalfJung (Member, Author)

Concerning the second question, my personal thinking is that we should not require the pointed-to memory to be valid itself.

One good argument for making things UB via a strict validity invariant is that it helps bug-checking tools, but in this case actually doing recursive checking of references all the time is really costly, and probably makes it nearly impossible to develop useful tools.

On the other hand, it is very useful when writing unsafe code to be able to pass around a &mut T to some uninitialized data, and have another function write through that reference to initialize it. If we say that valid references must point to valid data, this pattern becomes UB. We would then have to offer lots of new APIs in libstd that take raw pointers instead of references, so that they can be used for initialization.

Some examples:
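A minimal sketch (an editor's illustration, not one of the examples originally linked) of what such an initialization API can look like when it avoids a plain &mut T to uninitialized data, here using MaybeUninit:

use std::mem::MaybeUninit;

// Hypothetical initializer: it only writes through the reference, never reads.
fn init_in_place(out: &mut MaybeUninit<u32>) {
    out.write(42);
}

fn main() {
    let mut slot = MaybeUninit::<u32>::uninit();
    // Passing &mut MaybeUninit<u32> sidesteps the question of whether a
    // &mut u32 pointing to uninitialized memory is itself a valid reference.
    init_in_place(&mut slot);
    // SAFETY: init_in_place wrote a value into slot.
    let value = unsafe { slot.assume_init() };
    assert_eq!(value, 42);
}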

@nikomatsakis (Contributor)

So @arielb1, for example, has traditionally maintained that having &T require that its referent is valid would invalidate far too much code. I'm inclined to agree. I think that our idea for ! patterns kind of assuaged my concerns about how to handle &!, so I feel comfortable with making the validity invariant shallow. (The argument about &mut T to an uninitialized T is also strong.)

I am also intrigued by this comment from @RalfJung :

Notice that there is an alternative that would make validity not depend on memory,

That seems like a very good property to have. I am inclined to pursue this approach, personally.


nagisa commented Feb 4, 2019

I (personally) think that not considering & to be "pointers" is the only sensible solution here (similar to how C++ does it, where references behave more like plain values than like pointers). My motivating example is that given a function like this:

fn generic<T>(foo: &T) {
    // body
}

it is way too easy to end up with something that will most likely indefinitely stay UB for T = ! where code would otherwise be valid for all other Ts. Making &! uninhabited avoids this problem altogether, and we may be able to relax this later on if:

  1. There are convincing use cases for &! not being uninhabited;
  2. We figure out how to make all the safe constructs for &T well-defined for &!.


arielb1 commented Feb 4, 2019

it is way too easy to end up with something that will most likely indefinitely stay UB for T = ! where code would otherwise be valid for all other Ts.

Could you come up with such an example, i.e. one that is UB for T = ! but not UB for, say, T = bool with foo containing a pointer to an invalid bool?


nagisa commented Feb 4, 2019

@arielb1 I do not think I’m able to come up with an example (is there one?) where it would not be UB if the bool had an invalid bit pattern, but at least for &bool it is possible to produce a reference to something valid at all.

Since I last wrote that comment, I realized that, in order to obtain a &!, unsafe code is necessary one way or the other (even though @RalfJung says this is a property of the safety system, not the validity system, and those are independent). With that in mind, I’m fine with whatever ends up being decided here.


RalfJung commented Feb 5, 2019

We talked about this at the all-hands. @cramertj expressed interest in &! being uninhabited, to be able to optimize functions away as dead code. @Centril noted that, in particular with regard to matches, if we just make validity of &T recursive, there is no question about automatically going below reference types in a match, such as in

fn foo<T>(x: &!) -> T { match x { } }

Even in unsafe code, the match can never cause issues on its own: the reference would already be invalid, and hence you'd have UB earlier.

I believe we should handle all types consistently, meaning that if &! is uninhabited (from a validity perspective, not just from a safety perspective), then we should also say that &bool is UB if it does not point to a valid bool, and so on.
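For illustration, here is the kind of value that rule is about (an editor's sketch, not from the original comment): under the recursive proposal, merely producing r below would already be UB, whereas under a shallow invariant only reading a bool through it would be.

fn main() {
    let byte: u8 = 3;
    // Reinterpret the address of a byte holding 3 as a &bool, without reading it.
    let r: &bool = unsafe { &*(&byte as *const u8 as *const bool) };
    // The reference is only used as an address here; no bool is ever read.
    println!("{:p}", r);
}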

One issue with this is that it makes validity very hard to check in a UB checker like Miri, or in a valgrind-style tool: you'd have to do a recursive walk following all the pointers. Also, it is unclear how much optimizations benefit from this (beyond removing dead code for &!), because a value that used to be valid at some point might become invalid later when the contents of memory change.
Also, new hard questions then pop up about the interaction with Stacked Borrows, where I think it might be hard to make sure that transitively through the pointer chain, all the aliasing works out the right way. Retagging is currently a key ingredient for this, but if we do this transitively we'd have to Retag references that are stored in memory, which I don't think we want to do -- magically modifying memory seems like a bad idea.


RalfJung commented Feb 5, 2019

it is way too easy to end up with something that will most likely indefinitely stay UB for T = ! where code would otherwise be valid for all other Ts. Making &! uninhabited avoids this problem altogether.

I don't understand what you are saying here. Making &! uninhabited makes strictly more programs UB? How is that supposed to solve problems with programs being UB?


RalfJung commented Feb 5, 2019

Also one thing @Centril brought up at the all-hands: we need more data. In particular, we should figure out if there are interesting patterns of unsafe code that rely on having references to invalid data, and that would be too disruptive to convert to raw pointers or too widely used to break.


RalfJung commented Feb 8, 2019

One issue with requiring references to be transitively valid: we have a whole bunch of existing reference-based APIs, such as those for slices, that we then could not use. I expect this to cause a lot of trouble with existing code, but I am not sure.


Another proposal for references that enables @cramertj's optimizations could be: if a reference's validity is to depend on memory in this way, we will need a notion of "bitstring validity", i.e. validity of a value considered purely as a sequence of bits, without following any pointers it contains. (Avoiding that notion is one argument for shallow validity, IMO.) We could define validity of a reference to require that the pointee is bitstring-valid. This makes checking validity feasible and enables some optimizations. However, it would mean that &! is uninhabited (no bitstring is a valid !) while &&! is not (as a bitstring, the &! pointee merely has to be a non-null, well-aligned pointer value).

CAD97 commented Mar 13, 2019

Another data point is rust-lang/rfcs#2645 (in FCP to merge), which theoretically will allow transmuting between &mut T and &mut MaybeUninit<T>; that removes some of the pressure of wanting to point a &mut T at uninitialized memory.

I'm in favor of the validity invariant of &_ being shallow, but with the borrowing model requiring some amount of deep validity (though it could potentially be just one dereference deep) at most usages of the type, including reborrows for function calls.
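As a concrete illustration of the &mut T to &mut MaybeUninit<T> direction of that conversion (an editor's sketch, relying on the point @RalfJung makes below that the two types share size and alignment):

use std::mem::MaybeUninit;

// View an initialized &mut T as &mut MaybeUninit<T>, e.g. to hand it to an
// initialization-style API. The reverse direction is only sound once the
// value is known to be initialized.
fn as_uninit_mut<T>(r: &mut T) -> &mut MaybeUninit<T> {
    // SAFETY: MaybeUninit<T> has the same size and alignment as T, and any
    // valid T is also a valid MaybeUninit<T>.
    unsafe { &mut *(r as *mut T as *mut MaybeUninit<T>) }
}

fn main() {
    let mut x = 5u32;
    as_uninit_mut(&mut x).write(7);
    assert_eq!(x, 7);
}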

@RalfJung (Member, Author)

@CAD97 that RFC is not necessary for the transmute you mentioned -- it is only needed if we want to pass things by value. In memory, T and MaybeUninit<T> are already the same (as in, they have the same size and alignment; there might be a difference for layout optimizations).

@petertodd
Data point for a use-case where a reference to invalid data is useful: https://users.rust-lang.org/t/am-i-triggering-undefined-behavior-here/28364/10

tl;dr:

use std::mem::ManuallyDrop;

/// A byte array with the same size as type `T`
///
/// If `T` is not `Sized`, this size may be dynamic.
#[repr(packed)]
pub struct Bytes<T: ?Sized> {
    buf: ManuallyDrop<T>,
}

where Bytes<T> is never created directly, but rather is always used via a reference such as &Bytes<T> or &mut Bytes<T>.

Counterpoint: if MaybeUninit<T> accepted T: ?Sized the reference would be valid, and thus this wouldn't be an issue.


gnzlbg commented May 21, 2019

@petertodd can you walk me through why you think that &Bytes<T> is invalid in that case? I don't see any UB in the code you show in the post.

@RalfJung (Member, Author)

"A byte array with the same size as type T" where T: ?Sized... that's weird.^^

We don't have a story currently for uninitialized unsized data -- and given some of the plans for custom DSTs, it will not be possible to support that in general (namely, for "thin" DSTs).


gnzlbg commented May 21, 2019

I don't follow. I don't see anything wrong with Bytes<T>. IIUC, @petertodd always uses it when T is properly initialized. If anything, there would be something wrong with the conversions from &[u8] to &Bytes<T> and vice versa, but I don't see why @petertodd cannot use &[MaybeUninit<u8>] there instead, although depending on #71 that might not be necessary and @petertodd's code might be correct.

@RalfJung (Member, Author)

@gnzlbg the issue is in this function:

impl<T> Bytes<T> {
    pub fn from_slice(buf: &[u8]) -> &Bytes<T> {
        assert_eq!(buf.len(), mem::size_of::<T>());
        unsafe {
            &*Self::from_ptr(buf.as_ptr() as *const T)
        }   
    }
}

I can use that to turn a &[u8; 1] into a &Bytes<bool> even if the array contains a 3.


gnzlbg commented May 21, 2019

That function should be unsafe, and then the caller needs to assert the validity of buf. Or is there an expectation that we will ever be able to do better than that?

@RalfJung (Member, Author)

If we don't make validity of references recursive (and my opinion still is that we should not :D ), then that code is just fine.

@petertodd

@gnzlbg So the full use-case where I came up with that beast is in-place "deserialization", e.g. for memory-mapped files (which, as @RalfJung correctly noted elsewhere, have issues with other processes changing them, but for the sake of argument assume that has been solved).

As we know, for many types not all bit-patterns are valid. Thus we can have an API along the lines of:

unsafe trait Validate {
    type Error;

    /// Validate that `buf` represents a valid instance of `Self`.
    fn validate(buf: &Bytes<Self>) -> Result<(), Self::Error>;
}

The validate() function is safe to call, because Bytes can only be (safely) created from byte slices for which all bytes in the slice are initialized; Validate is unsafe to implement, because other code depends on it actually working.

This is why Bytes runs into the reference-validity question: the whole point is to verify whether the bytes are valid for the type in question, so a &Bytes<T> must be allowed to point at bytes that are not (yet known to be) a valid T.
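As a rough sketch (an editor's illustration building on the Bytes and Validate definitions above, not code from the thread), an impl for bool could inspect the byte through a raw pointer, so no &bool to a possibly-invalid value is ever created; whether &Bytes<bool> itself may point at an invalid bool is exactly the question under discussion.

unsafe impl Validate for bool {
    type Error = u8;

    fn validate(buf: &Bytes<Self>) -> Result<(), Self::Error> {
        // Read the single underlying byte; Bytes guarantees it is initialized.
        let byte = unsafe { *(buf as *const Bytes<bool> as *const u8) };
        match byte {
            0 | 1 => Ok(()),
            other => Err(other),
        }
    }
}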

Unsized Types

So how do unsized types fit into all this? See https://users.rust-lang.org/t/am-i-triggering-undefined-behavior-here/28364/9

As @RalfJung correctly notes, it's not clear that you can always (or will always be able to) determine the size of an unsized value from pointer metadata. However, it's certainly possible to do this for a subset of types, such as slices. So I simply made a Pointee trait for the subset of types that can do this - essentially implementing part of what a custom DST API would do.

But MaybeUninit doesn't (yet) support unsized types, which leads me to this problem.

Alternative Solution

Wrap a pointer instead:

struct Bytes<'a, T: ?Sized> {
    marker: PhantomData<&'a ()>,
    ptr: *const T,
}

Which is fine for me - not quite as nice an API for what I'm doing, as I'll need BytesMut etc., but it'll work. I'm just bringing all this up to give an example of a potential use-case where validity of references matters.

@RalfJung (Member, Author)

Another case of libstd manifesting invalid references.

@RalfJung (Member, Author)

See rust-lang/rust-memory-model#2 for some earlier discussion of basically the same subject.

jrvanwhy added a commit to jrvanwhy/libtock-rs that referenced this issue May 20, 2021
According to the current Rust Reference [1], storing an uninitialized `u8` is undefined behavior. This may change in the future [2], but for now we should continue to assume it is undefined behavior.

Every use of `core::mem::uninitialized` in `ufmt` is to create a local `[u8; _]`, and therefore is an example of this undefined behavior. I removed the undefined behavior in the simplest way possible, which is to replace the initializers with zero-initialized arrays.

[1] https://doc.rust-lang.org/reference/behavior-considered-undefined.html
[2] rust-lang/unsafe-code-guidelines#77
bors bot added a commit to tock/libtock-rs that referenced this issue May 20, 2021
306: Remove uses of `core::mem::uninitialized` from `ufmt`. r=jrvanwhy a=jrvanwhy

According to the current Rust Reference [1], storing an uninitialized `u8` is undefined behavior. This may change in the future [2], but for now we should continue to assume it is undefined behavior.

Every use of `core::mem::uninitialized` in `ufmt` is to create a local `[u8; _]`, and therefore is an example of this undefined behavior. I removed the undefined behavior in the simplest way possible, which is to replace the initializers with zero-initialized arrays.

[1] https://doc.rust-lang.org/reference/behavior-considered-undefined.html
[2] rust-lang/unsafe-code-guidelines#77

Co-authored-by: Johnathan Van Why <[email protected]>
JohnTitor added a commit to JohnTitor/rust that referenced this issue May 17, 2022
interpret/validity: reject references to uninhabited types

According to https://doc.rust-lang.org/reference/behavior-considered-undefined.html, this is definitely UB. And we can check this without actually looking up anything in memory; we just need the reference value and its type. That makes this a great candidate for a validity invariant IMO, and it is my favorite resolution of rust-lang/unsafe-code-guidelines#77.

With this PR, Miri with `-Zmiri-check-number-validity` implements all my preferred options for what the validity invariants of our types could be. :)

CTFE has been doing recursive checking anyway, so this is backwards compatible but might change the error output. I will submit a PR with the new Miri tests soon.

r? `@oli-obk`
@JakobDegen (Contributor)

I'm going to reproduce a comment I made over in #346 here, since I believe it is a new point in favor of requiring some amount of validity on Retag operations:

So I'm going to contest the claim that there's no known optimization benefit to this. The code below gives an example. It uses & instead of &mut but presumably the intent is that this should apply there equally. The code is nearly identical to code I gave in the write-up linked above, although the motivation is different.

enum E {
    A(UnsafeCell<u8>),
    B(u8),
}

let b = E::B(0);
opaque_func(&b);
assert_eq!(b, E::B(0));

The goal is to optimize out the assert. However, it is unclear how to justify this optimization. Specifically, we pass a &E to the opaque function, but we must somehow know that if the "active variant" of b is A, then the passed reference has write permission to the second byte of the enum, whereas if the active variant is B, then it only has read permission. The obvious way to implement this is to read from memory and check every time we do a retag operation.

This necessarily means that retagging (which happens basically every time a reference is copied) must assert validity conditions in at least some cases. We have some control over when this happens though. For example, there's no need to assert validity when retagging a &u8.

There are alternatives to this model as well - for example, we could try to adjust the aliasing model to make this unnecessary, by having some rule like "retagged references get the same permissions as their parents". It's unclear how precisely to do this though - a couple of naive ideas I've thought of don't work - and so this would probably need quite some investigation.


RalfJung commented Jul 9, 2022

Indeed, that is the canonical example for having fully precise UnsafeCell tracking in Stacked Borrows. That's a different issue, #236, but it bears some relation to this one: doing fully precise UnsafeCell tracking requires checking some amount of validity behind references, on a retag. It would not require checking validity of &bool though, so your optimization is not an argument in favor of general validity-behind-references.

That said, I would find it odd to have some validity but not full validity. So if we want fully precise UnsafeCell tracking then I think we should also do proper validation. Then we could still either do that checking just one level deep, or fully recursively.


RalfJung commented Jul 12, 2022

(Relaying from #346 (comment))

I think a general "check validity behind references" is not possible. Consider:

use std::cell::RefCell;

// This function can be called with `rc` and `unique` pointing to
// overlapping regions of memory.
fn evil_alias<T>(rc: &RefCell<T>, unique: &mut T) {}

fn proof<T>(x: T) {
    let rc = RefCell::new(x);
    let mut bmut = rc.borrow_mut();
    evil_alias(&rc, &mut *bmut);
}

Now when evil_alias is called, to validate that rc points to a valid RefCell<T>, we have to read the T in there. But that means we are reading from memory that unique points to, with a different pointer, which violates the uniqueness assumptions of &mut!

RalfJung closed this as completed on Jun 6, 2023.