Make SString typed on encoding #69748
`SString<TEncoding>` was renamed to `EString<TEncoding>` (means "encoded string") and `SString` is now an alias of `EString<EncodingUnicode>`.
…directly to avoid depending on SBuffer implementation details too much in EString.
…tError when the temporary is destroyed.
…untime into sstring-explicit-encoding
…set up invalid state to start.
What is the problem that this is trying to solve? The beauty of SString has been that you did not have to think about the most efficient encoding to use when passing strings around. The encoding conversions were done lazily only once needed. We are losing this convenience with this change.
I view that "beauty" as a problem of SString. This is native code where character encoding matters, so we should care about encoding and make it explicit and easy to handle, not hide it and make it hard to track.
We have all of these cases where SString would convert to UTF16 under the hood without telling anyone. As a result, it was really easy to accidentally convert encodings and introduce bugs, particularly if you wanted to get a UTF8 string out of SString. You needed to be very particular about which APIs you used on a particular SString instance to keep it in UTF8 so you could use […]
This PR also makes the costs of transcoding clear. You convert when you need to convert and no earlier. This way, if an SString was created with UTF8, we don't need to worry about a method call being added that converts the underlying value to UTF16 and causes a perf issue, with every other AppendUTF8 call having to transcode the string. Now we can make the transcoding explicit, so we don't suddenly have this perf cliff introduced.
Lifetime semantics are also difficult to remember with SString as the […]
This PR also simplifies the logic of SString, as we don't have to handle mismatched encodings in every single member function that takes an SString.
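The "perf cliff" described above can be illustrated with a small model. This is only a sketch: `LazyString`, `TranscodeCount`, and the ASCII-only `Widen` helper are hypothetical names invented for illustration and are not the runtime's SString implementation.

```cpp
#include <string>
#include <utility>

// Hypothetical model of a lazily converting string; NOT the runtime's SString.
class LazyString {
public:
    explicit LazyString(std::string utf8) : m_utf8(std::move(utf8)) {}

    // Looks like a read-only accessor, but it switches the stored
    // representation to UTF-16 the first time it is called.
    const std::u16string& GetUnicode() {
        if (!m_isUtf16) {
            m_utf16 = Widen(m_utf8);
            m_utf8.clear();
            m_isUtf16 = true;
            ++m_transcodes;
        }
        return m_utf16;
    }

    // Cheap while the string is still UTF-8; once the representation has
    // flipped, every call has to transcode its argument first.
    void AppendUTF8(const std::string& s) {
        if (m_isUtf16) {
            m_utf16 += Widen(s);
            ++m_transcodes;
        } else {
            m_utf8 += s;
        }
    }

    int TranscodeCount() const { return m_transcodes; }

private:
    // Assumption: ASCII-only input, so widening each byte is enough.
    // Real code would need a full UTF-8 decoder here.
    static std::u16string Widen(const std::string& s) {
        return std::u16string(s.begin(), s.end());
    }

    std::string m_utf8;
    std::u16string m_utf16;
    bool m_isUtf16 = false;
    int m_transcodes = 0;
};
```

In this model, a single `GetUnicode()` call added somewhere in the middle of a sequence of `AppendUTF8` calls makes every later append pay for a conversion, and nothing at the call site shows it.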
moduleName.GetUnicode(),
namespaceOrClassName.GetUnicode(),
methodName.GetUnicode());
message.Printf(W("A callback was made on a garbage collected delegate of type '%S!%s::%S'."),
I believe that Printf uses the current ANSI encoding for %S, which won't match the UTF8 string passed in.
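To make the concern concrete, here is a sketch of what goes wrong when UTF-8 bytes are interpreted with a single-byte code page. It does not call the runtime's Printf; `WidenAsLatin1` and `DecodeUtf8` are hypothetical helpers, Latin-1 stands in for "the current ANSI encoding", and the decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Widen each byte to one UTF-16 unit, the way a single-byte code page
// (Latin-1 is used here as a stand-in) would interpret the input.
std::u16string WidenAsLatin1(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}

// Minimal UTF-8 to UTF-16 decoder. Assumes well-formed input: it performs
// no validation and no bounds checks on continuation bytes.
std::u16string DecodeUtf8(const std::string& bytes) {
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else                            { cp = lead & 0x07; len = 4; }
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        i += len;
        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

For the UTF-8 bytes of "café", `DecodeUtf8` yields four UTF-16 units ending in U+00E9, while `WidenAsLatin1` yields five units ending in U+00C3 U+00A9. Pure ASCII names are unaffected, which is why a mismatch like this can go unnoticed.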
@@ -1644,7 +1645,7 @@ BOOL MethodDesc::SatisfiesMethodConstraints(TypeHandle thParent, BOOL fThrowIfNo
SString sParentName;
TypeString::AppendType(sParentName, thParent);

SString sMethodName(SString::Utf8, GetName());
MAKE_WIDEPTR_FROMUTF8(sMethodName, GetName());
We are intentionally using SString on exception throwing paths to avoid unnecessary stack consumption. MAKE_WIDEPTR_FROMUTF8 allocates a big stack buffer, so we are losing this optimization with these changes.
There may also be places where the switch to MAKE_WIDEPTR_FROMUTF8 introduces stack overflows. It is hard to find them in the large delta.
I’ll revert these changes to use SString and add comments.
It may be worth discussing the overall change before you spend more time on this. I would like to hear @davidwrighton's and @AaronRobinsonMSFT's thoughts about the split string types.
I agree that getting UTF8 out of SString has been unnecessarily complicated.
Yes, and it requires developers to think hard about the best places to do the transcoding. I am not sure whether it is a win on average. I believe that it makes it harder to write efficient code. Do you think that there are places after this change that do extra transcoding that was not done before? If not, what kind of process have you used to make sure that no extra transcodings were introduced?
This PR represents a compromise about some concerns I had with making this split. I like the idea of the compiler helping me not make a mistake in string encoding. I also appreciate the explicit nature of the transcoding rather than it happening implicitly. These are, in general, good ideas that should be encouraged. My initial counter to this work was around the eventual long-term goal of having […]
After seeing @jkotas's comments align with some of my original thoughts, coupled with offline conversations with other runtime developers, I would say removing "printf" style manipulation of […]
@davidwrighton may have more thoughts about this direction.
Unfortunately, I find myself in agreement with @AaronRobinsonMSFT here. I really like the idea of being a bit more explicit in our codebase about what is happening in the system, but this change doesn't appear to fundamentally fix anything that is broken, doesn't make the codebase less voluminous or easier to maintain, and introduces a bunch of risk that is hard to work with (the stack usage problem @jkotas calls out is extremely difficult to see).
I'd prefer to see somewhat simpler changes that can be evaluated independently and are more clearly improvements, and/or changes in behavior that we can evaluate. For instance:
Something that allows us to create/parse type names faster. Maybe that would be an improvement if we worked in utf8, or moved the logic to managed code. I don't know. I'd love to see perf numbers.
A move to a PathString type for path strings. Paths on Linux/Windows are not utf8 or utf16: legal paths on our operating systems actually allow non-Unicode sequences, since a Linux path is a null-terminated series of bytes and a Windows path is a null-terminated sequence of 16-bit numbers.
A fix to just remove code that isn't used, such as the SStringRegEx.
A change that deletes our wprintf implementation.
I'll take a different route than this PR then. Here are a few ideas I have, let me know what you think of them:
I would delete […]
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.
Introduce an EString<TEncoding> type that takes a traits type representing a string encoding. Make SString an alias of this type with the UTF-16 encoding and update all usages of SString.
This PR also removes some dead code that I discovered in the process, such as SStringRegEx.
This PR does not try to unify the different encodings for the majority of scenarios. It only tries to document and enforce our existing encoding usages throughout the runtime.
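As a rough illustration of the shape described above, the following sketch shows an encoding-typed string with SString as an alias. Only the names `EString<TEncoding>`, `EncodingUnicode`, and `SString` come from this PR; `EncodingUtf8`, `Utf8String`, `char_type`, and all member functions are assumptions made for illustration and do not reflect the PR's actual code.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Traits types: each one names the code-unit type of an encoding.
// EncodingUnicode appears in the PR; EncodingUtf8 is a hypothetical sibling.
struct EncodingUtf8    { using char_type = char; };
struct EncodingUnicode { using char_type = char16_t; };  // UTF-16

// Hypothetical sketch of a string whose encoding is part of its type.
template <typename TEncoding>
class EString {
public:
    using char_type = typename TEncoding::char_type;

    EString() = default;
    explicit EString(const char_type* s) : m_data(s) {}

    // Only a string with the same encoding can be appended, so mixing
    // encodings is a compile error instead of a hidden transcode.
    void Append(const EString& other) { m_data += other.m_data; }

    const char_type* GetRawBuffer() const { return m_data.c_str(); }
    std::size_t GetCount() const { return m_data.size(); }

private:
    std::basic_string<char_type> m_data;
};

// SString keeps its name but now always means "UTF-16 string".
using SString = EString<EncodingUnicode>;
// Hypothetical alias, for illustration only.
using Utf8String = EString<EncodingUtf8>;

static_assert(std::is_same<SString::char_type, char16_t>::value,
              "SString is the UTF-16 instantiation");
```

With this shape, `SString a(u"Type"); a.Append(SString(u"Name"));` compiles, while `a.Append(Utf8String("Name"))` does not, because there is no conversion between the two instantiations. Any transcoding would have to go through an explicit conversion function, which this sketch leaves out.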