Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make SString typed on encoding #69748

Closed

Conversation

jkoritzinsky
Copy link
Member

Introduce an EString<TEncoding> type that takes an traits type representing a string encoding. Make SString an alias of this type with the UTF-16 encoding and update all usages of SString.

This PR also removes some dead code that I discovered in the process, such as SStringRegEx.

This PR does not try to unify the different encodings for the majority of scenarios. It only tries to document and enforce our existing encoding usages throughout the runtime.

…directly to avoid depending on SBuffer implementation details too much in EString.
@jkoritzinsky jkoritzinsky force-pushed the sstring-explicit-encoding branch from 5307ae6 to 0a5056c Compare May 25, 2022 23:14
@jkoritzinsky jkoritzinsky marked this pull request as ready for review June 9, 2022 23:12
@jkotas
Copy link
Member

jkotas commented Jun 15, 2022

What is the problem that this is trying to solve?

The beauty of SString has been that you did not have to think about the most efficient encoding to use when passing strings around. The encoding conversions were done lazily only once needed. We are losing this convenience with this change.

@jkoritzinsky
Copy link
Member Author

What is the problem that this is trying to solve?

The beauty of SString has been that you did not have to think about the most efficient encoding to use when passing strings around. The encoding conversions were done lazily only once needed. We are losing this convenience with this change.

I view that "beauty" as a problem of SString. This is native code where character encoding matters, so we should care about encoding and make it explicit and easy to handle, not hide it and make it hard to track.

We have all of these cases where SString would convert to UTF16 under the hood without telling anyone. As a result, it made it really easy accidentally convert encoding and introduce bugs, particularly if you wanted to get a UTF8 string out of SString. You needed to be very particular in which APIs you used on a particular SString instance to keep it in UTF8 so you could use GetRawUtf8NoConvert. Otherwise, you needed to have a stack buffer and possibly transcode back to UTF8 from UTF16. This is particularly dangerous when you're editing code that is already using SString and you need to read all of the surrounding code and follow the logic on all of the branching paths to know exactly which encoding the SString is expected to be in when running code after your change or you might introduce a transcoding bug. Additionally, even if you're using the Get method variants that take a stack buffer, those methods might still convert the this instance to UTF16 from whatever the existing encoding is, which just adds to the difficulty to understand the state model of SString.

This PR also makes the costs of transcoding clear. You convert when you need to convert and no earlier. This way if an SString was created with UTF8, we don't need to worry about a method call being added that converts the underlying value to UTF16 and causes a perf issue with every other AppendUTF8 call having to transcode the string. Now we can make the transcoding explicit, so we don't suddenly have this perf cliff introduced.

Lifetime semantics are also difficult to remember with SString as the Get* methods that take a stack buffer have generally unintuitive lifetime rules around them (lifetime of the returned value is bound by this and the first parameter).

This PR also simplifies the logic of SString as we don't have to handle mismatched encodings in every single member function that takes an SString.

moduleName.GetUnicode(),
namespaceOrClassName.GetUnicode(),
methodName.GetUnicode());
message.Printf(W("A callback was made on a garbage collected delegate of type '%S!%s::%S'."),
Copy link
Member

@jkotas jkotas Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that Printf uses current Ansi encoding for %S that won't match UTF8 string passed in.

@@ -1644,7 +1645,7 @@ BOOL MethodDesc::SatisfiesMethodConstraints(TypeHandle thParent, BOOL fThrowIfNo
SString sParentName;
TypeString::AppendType(sParentName, thParent);

SString sMethodName(SString::Utf8, GetName());
MAKE_WIDEPTR_FROMUTF8(sMethodName, GetName());
Copy link
Member

@jkotas jkotas Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are intentionally using SString on exception throwing paths to avoid unnecessary stack consumption. MAKE_WIDEPTR_FROMUTF8 allocates a big stack buffer, so we are losing this optimization with these changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be also places where the switch to MAKE_WIDEPTR_FROMUTF8 may introduce stack overflows. It is hard to find them in the large delta.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll revert these changes to use SString and add comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth discussing the overall change before you spent more time on this. I would like to hear @davidwrighton and @AaronRobinsonMSFT thoughts about the split string types.

@jkotas
Copy link
Member

jkotas commented Jun 15, 2022

As a result, it made it really easy accidentally convert encoding and introduce bugs, particularly if you wanted to get a UTF8 string out of SString.

I agree that getting UTF8 out of SString has been unnecessarily complicated. GetRawUtf8NoConvert was introduced as a quick hack by somebody, without following the SString overall design. I wish we just introduced method on SString that converts it to UTF8 if necessary and gives you the pointer back.

This PR also makes the costs of transcoding clear.

Yes, and requires developers to think hard about the best places to do the transcoding. I am not sure whether it is a win on average. I believe that it makes it harder to write efficient code.

Do you think that there are places after this change that do extra transcoding that was not done before? If not, what kind of process you have used to make sure that there are no extra transcodings introduced?

@AaronRobinsonMSFT
Copy link
Member

It may be worth discussing the overall change before you spent more time on this. I would like to hear @davidwrighton and @AaronRobinsonMSFT thoughts about the split string types.

This PR represents a compromise about some concerns I had with making this split. I like the idea of the compiler helping me not make a mistake in string encoding. I also appreciate the explicit nature of the transcoding rather than it happening implicitly. These are, in general, good ideas that should be encouraged.

My initial counter to this work was around the eventual long-term goal of having SString be naturally UTF-8 the vast majority of the time. With that goal in mind, I see little reason to elevate the encoding into the type system since it should become less common. Additionally, there is the potential duplication of code paths with mulitple types – consider the case where a method can accept UTF-8 or UTF-16. This becomes cumbersome because we either require two functions, blech, or templatize the function which is also undesirable. The current UX of SString is annoying, but I am not convinced a new series of types with strict encoding semantics is needed as opposed to gradually paring down the SString API surface. Removing the internal versions of printf and wprintf will make transcoding rarer and, when needed, only occur at the margins of user interaction.

After seeing @jkotas's comments align with some of my original thoughts, coupled with offline conversations with other runtime developers, I would say removing "printf" style manipulation of WCHAR strings and when needed making explicit "tostring" functions would have more longer-term value and we can avoid this SString redesign.

@davidwrighton may have more thoughts about this direction.

@davidwrighton
Copy link
Member

Unfortunately, I find myself in agreement with @AaronRobinsonMSFT here. I really like the idea of being a bit more explicit in our codebase about what is happening in the system, but this change doesn't appear to fundamentally actually fix anything that is broken, doesn't make the codebase less voluminous or easier to maintain, and introduces a bunch of risk that is hard to work with (the stack usage problem @jkotas calls out is extremely difficult to see.)

I'd prefer to see somewhat simpler changes that can be evaluated independently and are more clearly improvements, and or changes in behavior that we can evaluate.

For instance, something that allows us to create/parse type names faster. Maybe that would be an improvement if we worked in utf8, or moved the logic to managed code. I don't know. I'd love to see perf numbers.

A move to a PathString type for path strings, as paths in Linux/Windows are not utf8 or utf16, legal paths on our operating systems actually allow for non-Unicode sequences, as a Linux path is a null terminated series of bytes, and on Windows a path is a null terminated sequence of 16 bit numbers.

A fix to just remove code that isn't used, such as the SStringRegEx.

A change that deletes our wprintf implementation.

@jkoritzinsky
Copy link
Member Author

I'll take a different route than this PR then. Here's a few ideas I have, let me know what you think of them:

  • Add a contract check for the representation in GetUTF8NoConvert and any other methods that currently assume a representation and don't check it.
  • Add a ConvertToUTF8 method to SString so we can force the current SString representation to UTF8 instead of needing to use the stack buffer.
  • Remove the ANSI representation from SString to simplify the codebase (it's used a very small number of times) and explore moving both the ANSI and ASCII representations to use UTF8 so SString will have 2 representations, not 4.
  • Providing a debug-only mechanism to assert the representation of an SString so we can ensure that we aren't transcoding more than expected.

@jkotas
Copy link
Member

jkotas commented Jun 20, 2022

Add a contract check for the representation in GetUTF8NoConvert and any other methods that currently assume a representation and don't check it.

I would delete GetUTF8NoConvert method and replace it with GetUTF8 method that works exactly like GetUnicode (ie converts if necessary).

@ghost
Copy link

ghost commented Jul 22, 2022

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@jkoritzinsky jkoritzinsky deleted the sstring-explicit-encoding branch July 22, 2022 23:07
@ghost ghost locked as resolved and limited conversation to collaborators Aug 22, 2022
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants