-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alignment (probably) shouldn't be required for empty vectors #287
Comments
@bkietz thanks for making a note of this. I am not able to dig deeply into this at the moment, I am thrilled that flatcc is being used with Arrow, and I did how some feedback to one of the Arrow designers back then. I have worked a bit with the Arrow format in Julia and Go. That said, I disagree about alignment. You could get away with it, sort of, for empty vectors, but the (empty) data still needs to be aligned to at least 32-bits because of the length header. Furthermore pointer types in some languages, notably C, are required to be aligned, at the very least if dereferenced, but even if not, who knows what aggressive optimizations a compiler might do. In C vectors can be accessed via raw pointers iff the user takes responsibility for endian conversion, or the platform is known little endian. Therefore there are functions that are expected to return aligned pointers even if the length is empty. Non-alignment can also lead to unexpected errros such as a user doing pointer math to get the header field, even if the user is not supposed to that directly, since the binary format is well documented. Even a compiler might do that, though unlikely. As for the google/flatbuffers consumers: No consumers I am aware of, including FlatCC readers, enforce alignment, notably, it is perfectly valid to read from an unaligned buffer, e.g. from a network buffer, iff you are sure you platform arch can handle it, such as x86 and some, but not all ARM. Verification is different though. Googles C++ verifier did not verify alignment very aggressively early on, AFAIR, but that does not mean it should not do that, only that it did not implement it. From a practical perspective, it would also be an odd exception in the verifier to first check and a fail, then add branch logic to check if it in fact happened to be an empty vector, and doing it before would be a an unwanted performance penality, and still a complication. Therefore I suggest you try to address the issue at the source. |
On a related note: I would like the verifier to be able to accept unaligned buffers but still verify that they are properly aligned relative to buffer start. This would be useful for verify buffers in unaligned network buffers without having to make a copy. I think most of the code is already doing that, but AFAIR not all the code, and there should be a mode flag to request such behavior. This could be added as a PR, but it is not something I plan to address short term. I would also be possible to have a verifier that entirely avoids checking for alignment. This would be done with a compile time config flag, preferably such that the main logic is unchanged but all align helper functions just return true. Again, subject to PR. |
I will not be able to do that; de-facto these empty vectors are acceptable to many other readers of the arrow format (and some are even checked in as gold files). Therefore we're going to need to patch our copy of flatcc to handle this corner case. |
The alignment requirement need not be an obstacle here.
|
I agree with @mikkelfj that vectors should be aligned even if empty. In C++, even if you don't dereference (which would be UB and/or a crash on some architectures), even just converting unaligned pointers may result in "unspecified" values.. while this is unlikely to happen in practice, it would be a bad idea to allow it. Thus, the alignment for a vector of If the C++ verifier doesn't check this, then that is a bug / oversight. My apologies. Ideally this would be fixed, but we'd need to audit if there are any past buggy serializers that allowed this unalignment first. @dbaileychess @bkietz how did the project end up with these unaligned vectors? which writing code in what language produced these? Remains the question how to recover from the situation. I guess Arrow needs a patched verifier that disables this particular check. I'd highly recommend fixing any writers to ensure alignment from now on. |
@bkietz FYI ardappel is is the original FB designer and dbaileychess is the current google maintainer. I support ardappel's response. I agree that a patched version of the flatcc verifier is probably the best short term solution, but I still recommend to address the problem at the source, perhaps through deprecation. Flatcc runtime files includes a config.h file that users can use to overwrite flags. This could also be used to add a flag patched behaviour: https://github.com/dvidelabs/flatcc/blob/master/config/config.h To address your suggestions:
This is not to be dismissive. I want to support an efficient solution for arrow, but it should not happen by changing the nature of the core format. Also, I am not sure if it was clear to you before, but a verifier is very different from a reader. A reader can take all the shortcuts it want, almost, because it can assume the buffer is valid in exchange for speed and other objectives. A verifier should ideally be fast, but it should first and foremost be correct and assume a buffer might be invalid. EDIT: the verififer uses https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_rtconfig.h while the main parser/compiler uses the above mentioned config. |
I was primarily testing against the C++ library (which is also what produced the repro case above). I have not checked if any other libraries produce such vectors. I'll try to extract a minimal producer-side repro using the C++ library.
That sounds reasonable to me. FWIW flatcc is the only implementation thus far which has balked at these empty vectors. The arrow integration tests pass flatbuffer metadata between C++, Java, Rust, C#, and Go. The intent is to make sure everything is inter-valid so all arrow implementers |
Argh.. looks like this has been a bug in the C++ library all along.
Note the early out on Worse, this same function is used for alignment to So this means all buffers generated by the C++ implementation have possibly unaligned empty vectors in them. And any languages that were implemented by studying/copying the C++ implementation in any way may have it too. So while to be correct, I still think an empty vector should be aligned, I have to admit it does not make sense for any verifier in any language to require it.. including C. This bug is essentially grandfathered in. I'd still think it be good to fix this bug where possible. Even that will be challenging, as it may grow the size of some FlatBuffers and brittle tests the world over will break :) In terms of how to fix it, it may be tempting to just remove the The call |
I'll open a draft PR to flatcc adding |
aside from minor typo: The question is whether it should be enabled by default? I tend to prefer to keep things as they are by default, partly because that is the correct behavior, and partly because anything else is technically not guaranteed to work and C compilers get increasingly aggressive with optimizations - this is the number one reason for bug reports and fixes in flatcc by far, though mostly due to harmless warnings. However, given the scale of the bug, I am not entirely sure. As for other languages not complaining: AFAIK only C and C++ implements verifier logic. Other languages generally just depend on the built-in language safety features such as buffer overrun protection. So it is no surprising that other languages do not complain, even C would not do that without explicit verification. However, that does not mean that all those languages have been reading data correctly, though they likely have. |
For reference, the flatcc verifier did have an issue related to verifying buffers that contained alignment of size 8 or above, but it only happened with size prefixed buffers (a feature that was added some years of the original FB release). It was a rare occurrence and only fixed in late 2023, so there is a chance that this fix might have triggered something with arrow. |
I don't think this is an issue for Flatbuffers. @aardappel Is correct in the
Pushes the size of the vector (0) into the buffer. That function aligns the value being written:
|
I made an test example:
And made it with this json:
And the annotated buffer is:
You can see the empty vector size field is placed at 0x30 (which is properly aligned) and just contains the value 0. Other padding is added after wards for the other field. I tried this with all sorts of sizes of both a and c and never saw a misplaced alignment for the empty vector. |
Here is an example where the previously written thing has a non-4 byte alignment (0x1E is not 4 byte aligned), but the C++ code does properly add in the 2 padding bytes to force the vector to be 4 byte aligned at 0x18.
|
@dbaileychess Thanks for looking into this. I seems flatcc verifier should remain as is. I'll wait for @bkietz to suggest next steps. |
@dbaileychess alignment to 4 bytes is indeed guaranteed by the presence of the vector size. However the empty vector in question has elements which require 8 byte alignment. I've annotated the buffer from my repro case:
The vector size for |
Thanks for making the annotation, it's always easier for me to see the layout! So my take is if a vector is empty, it doesn't really matter what alignment it is supposed to have. There is no data in the buffer for it to take an alignment. I don't count the size prefix location as the start of the vector either. The data part is the part directly after the I guess a question is, are you checking the length of the vector before trying to access the "backing" data of it? |
@mikkelfj I feel you should by default allow unaligned empty vectors for alignments > 4, because apparently that is the defacto standard. Checking for alignment >4 should be an option. @dbaileychess you are correct at least the size is never misaligned, thanks to the (redundant) per element alignment. But it is still a bug that the A We could codify going forward that empty vectors are merely aligned to 4, and assume C++ compilers will never be able to find out. Or, we can fix the C++ builder (and who knows what other languages) going forward to align to the type also? |
And, while I am ranting, can we reflect for a moment how relatively complicated such a simple thing as alignment can be, and how much effort over the life of FlatBuffers has been spent on it? Like I said in google/flatbuffers#5875, I would totally just skip alignment entirely if I could do it all over again :) |
Either users who specify this option take responsibility for checking size before otherwise accessing the vector field, or we need to also modify reader codegen to ensure that misaligned pointers are never materialized (though it would be odd to react to an rtconfig option in the compiler source...). |
@aardappel I'm fluctuating on whether flatcc should default allow unaligned empty because @dbaileychess suggested it was not a native bug, but then it was anyway, and that obviously affects that decision.
It would be terrible to have extra runtime checks for that, performance wise. There is already an unvoidable null check that hopefully gets optmized out of any inner loops. A verifier can special handle a failed vector alignment check and then see of the length is zero. This only adds overhead to the verifier, and only when relevant.
I want nested flatbuffers to be dead for that reason. It is by far the most complicated part of the flatcc builder runtime, and a nutrious bug feeder. I don't even think C++ tries to do it right, last I heard it just used 8 byte aligment. FlatCC tries to create nested flatbuffers that are properly aligned both relative to the nested buffer and to the parent buffer, and it is not pretty. But, aligment is important for many reasons, including performance, atomic access, GPUs, optimizations. I can see a second format without alignment and it would be much simpler and denser, but it is not FlatBuffers as we know it. @bkietz feel free to submit PR on the verifier, then we can always fight over what default to use, but we need the code in place it seems, not only for arrow. |
As to incorrect behaviour of unaligned empty buffers: I don't think we should runtime fix this, that is costly. We should require buffers to be valid on platforms that are sensitive to it, and rely on "it probably works anyway" on other platforms. I don't think we should define empty to be always 4 bytes, simply because of pointer types. |
Perhaps I don't understand the C implementation, but you always have to check the length of a vector before starting to read from it. How do you know how long it is? You shouldn't be dereferncing the .data() field without knowing the length. |
In fact C forbids that misaligned pointers be materialized at all; 6.3.2.3
So |
@mikkelfj we have 10+ years worth of the C++ implementation (and maybe others) allowing incorrectly aligned vectors (for I do believe C++ tries to correctly align nested FlatBuffers, but also possible there are implementations that don't. They have actually proven rather useful and are here to stay, so you'll have to deal with them :) |
@dbaileychess This is from memory, the code is 10 years old by now, but internally the C reader returns a pointer to the first element. AFAIR C++ returns a pointer (internally) to the size header field. The returned pointer is an abstract pointer type via typedef because it needs to be endian translated, but the accessor method goes like:
Here the problem is get_vec_at. It doesn't want to check the length. I don't even think it wants to check for null, but I am not sure. For little endian, this becomes an unconditional direct memory access instruction. EDIT: for clarity: get_vec_len(v) will in praxis do a ((uint32_t *)v)[-1] operation, or whatever changes were made to abstract away from aggressive optimizations. I think it actual first casts to uint8_t pointer (aka) char pointer, which is a safe cast, then moves the pointer down, then casts to uint32_t. Either way, it should be fairly robust to any alignment at 4 bytes or above. So I don't think C runtime reader code will have any problems with the current situation. But the pointer still needs to be the native element type internally and it is exposed to the user by some API methods. |
@bkietz that is probably true, wrt. ptr materialization. Certainly pointer casts have given a lot of headache in flatcc and elsewhere. However, for unreferenced pointers, it is very common to cast them at least to size_t, then tag pointers by adding one to e.g. distinguish between leaf and non-leaf tree nodes. The size_t type is then cast back to a pointer, but the pointer is cleaned up via cast to size_t and back before access. This is such a common pattern that it can be assumed valid. However, I also think it is valid to cast to size_t though not necessarily back if unaligned, even if it is done in praxis. The problem is really when the pointer is accessed or manipulated in ways that are not carefully implemented for such behavior. |
@aardappel I'm not arguing either way on what flatcc should do, I am trying to understand. I just stated that my previous statements were affected by conflicting information. That said, I think you are saying now that C++ does it correctly when writing 0 elements of size 8 [EDIT you did not say that], but earlier you said it did not, and @bkietz also seems to confirm this. If this is the case, C++ writer should be fixed, but it does not remove the 10 years of precedence. As for whatever other languages might do, it does not matter if C++ has the bug. If C++ does not have the bug, I think we need to figure out what other languages do. |
I think I have been quite clear that C++ has the bug. |
Updated README in verifier section to make a note of this: |
yes indeed, I already added a correction to my comment. I was misreading another sentence that referenced something else. Sorry about that. |
This is now resolved in #289 with FlatCC by defaulting to only requiring empty vectors to be aligned to the size field. The original stricter behavior can be enabled by a compile time flag. This affects the verifier only (with the admitted unlikely but potential risk of UB other code due to misaligned pointers). |
…3715) ### Rationale for this change Nanoarrow can now read and write IPC files as of apache/arrow-nanoarrow#585 so it should no longer be skipped as a producer/consumer ### What changes are included in this PR? Nanoarrow's tester is updated to point to the new integration executable and to report nanoarrow as a consumer/producer of IPC files. Notably the `null_trivial` case is skipped even though nanoarrow nominally supports it since it represents a corner case in which nanoarrow's flatbuffers library will not accept some vectors produced by other flatbuffers libraries dvidelabs/flatcc#287 ### Are these changes tested? Yes ### Are there any user-facing changes? No * GitHub Issue: #43680 Lead-authored-by: Benjamin Kietzman <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
Flatcc's
verify_vector
requires that vectors be properly aligned even when the vector is empty. The consumers in https://github.com/google/flatbuffers do not enforce this and moreover the producers will sometimes emit an empty vector which they didn't bother to align.(Encountered in the context of arrow IPC, which uses flatbuffers to store metadata) A minimal example: https://gist.github.com/bkietz/6150b9aae6d1bea0c85c0527d7fb04d6
The text was updated successfully, but these errors were encountered: