Make SString typed on encoding #69748

jkoritzinsky · 2022-05-24T19:59:05Z

Introduce an EString<TEncoding> type that takes an traits type representing a string encoding. Make SString an alias of this type with the UTF-16 encoding and update all usages of SString.

This PR also removes some dead code that I discovered in the process, such as SStringRegEx.

This PR does not try to unify the different encodings for the majority of scenarios. It only tries to document and enforce our existing encoding usages throughout the runtime.

…t-encoding

`SString<TEncoding>` was renamed to `EString<TEncoding>` (means "encoded string") and `SString` is now an alias of `EString<EncodingUnicode>`.

…directly to avoid depending on SBuffer implementation details too much in EString.

…tError when the temporary is destroyed.

…t-encoding

…untime into sstring-explicit-encoding

… as empty

…set up invalid state to start.

…t-encoding

jkotas · 2022-06-15T00:44:56Z

What is the problem that this is trying to solve?

The beauty of SString has been that you did not have to think about the most efficient encoding to use when passing strings around. The encoding conversions were done lazily only once needed. We are losing this convenience with this change.

jkoritzinsky · 2022-06-15T01:04:31Z

What is the problem that this is trying to solve?

The beauty of SString has been that you did not have to think about the most efficient encoding to use when passing strings around. The encoding conversions were done lazily only once needed. We are losing this convenience with this change.

I view that "beauty" as a problem of SString. This is native code where character encoding matters, so we should care about encoding and make it explicit and easy to handle, not hide it and make it hard to track.

We have all of these cases where SString would convert to UTF16 under the hood without telling anyone. As a result, it made it really easy accidentally convert encoding and introduce bugs, particularly if you wanted to get a UTF8 string out of SString. You needed to be very particular in which APIs you used on a particular SString instance to keep it in UTF8 so you could use GetRawUtf8NoConvert. Otherwise, you needed to have a stack buffer and possibly transcode back to UTF8 from UTF16. This is particularly dangerous when you're editing code that is already using SString and you need to read all of the surrounding code and follow the logic on all of the branching paths to know exactly which encoding the SString is expected to be in when running code after your change or you might introduce a transcoding bug. Additionally, even if you're using the Get method variants that take a stack buffer, those methods might still convert the this instance to UTF16 from whatever the existing encoding is, which just adds to the difficulty to understand the state model of SString.

This PR also makes the costs of transcoding clear. You convert when you need to convert and no earlier. This way if an SString was created with UTF8, we don't need to worry about a method call being added that converts the underlying value to UTF16 and causes a perf issue with every other AppendUTF8 call having to transcode the string. Now we can make the transcoding explicit, so we don't suddenly have this perf cliff introduced.

Lifetime semantics are also difficult to remember with SString as the Get* methods that take a stack buffer have generally unintuitive lifetime rules around them (lifetime of the returned value is bound by this and the first parameter).

This PR also simplifies the logic of SString as we don't have to handle mismatched encodings in every single member function that takes an SString.

jkotas · 2022-06-15T01:25:08Z

src/coreclr/vm/dllimportcallback.cpp

-        moduleName.GetUnicode(),
-        namespaceOrClassName.GetUnicode(),
-        methodName.GetUnicode());
+    message.Printf(W("A callback was made on a garbage collected delegate of type '%S!%s::%S'."),


I believe that Printf uses current Ansi encoding for %S that won't match UTF8 string passed in.

jkotas · 2022-06-15T01:29:18Z

src/coreclr/vm/genmeth.cpp

@@ -1644,7 +1645,7 @@ BOOL MethodDesc::SatisfiesMethodConstraints(TypeHandle thParent, BOOL fThrowIfNo
                SString sParentName;
                TypeString::AppendType(sParentName, thParent);

-                SString sMethodName(SString::Utf8, GetName());
+                MAKE_WIDEPTR_FROMUTF8(sMethodName, GetName());


We are intentionally using SString on exception throwing paths to avoid unnecessary stack consumption. MAKE_WIDEPTR_FROMUTF8 allocates a big stack buffer, so we are losing this optimization with these changes.

There may be also places where the switch to MAKE_WIDEPTR_FROMUTF8 may introduce stack overflows. It is hard to find them in the large delta.

I’ll revert these changes to use SString and add comments.

It may be worth discussing the overall change before you spent more time on this. I would like to hear @davidwrighton and @AaronRobinsonMSFT thoughts about the split string types.

jkotas · 2022-06-15T01:38:59Z

As a result, it made it really easy accidentally convert encoding and introduce bugs, particularly if you wanted to get a UTF8 string out of SString.

I agree that getting UTF8 out of SString has been unnecessarily complicated. GetRawUtf8NoConvert was introduced as a quick hack by somebody, without following the SString overall design. I wish we just introduced method on SString that converts it to UTF8 if necessary and gives you the pointer back.

This PR also makes the costs of transcoding clear.

Yes, and requires developers to think hard about the best places to do the transcoding. I am not sure whether it is a win on average. I believe that it makes it harder to write efficient code.

Do you think that there are places after this change that do extra transcoding that was not done before? If not, what kind of process you have used to make sure that there are no extra transcodings introduced?

AaronRobinsonMSFT · 2022-06-15T22:37:56Z

It may be worth discussing the overall change before you spent more time on this. I would like to hear @davidwrighton and @AaronRobinsonMSFT thoughts about the split string types.

This PR represents a compromise about some concerns I had with making this split. I like the idea of the compiler helping me not make a mistake in string encoding. I also appreciate the explicit nature of the transcoding rather than it happening implicitly. These are, in general, good ideas that should be encouraged.

My initial counter to this work was around the eventual long-term goal of having SString be naturally UTF-8 the vast majority of the time. With that goal in mind, I see little reason to elevate the encoding into the type system since it should become less common. Additionally, there is the potential duplication of code paths with mulitple types – consider the case where a method can accept UTF-8 or UTF-16. This becomes cumbersome because we either require two functions, blech, or templatize the function which is also undesirable. The current UX of SString is annoying, but I am not convinced a new series of types with strict encoding semantics is needed as opposed to gradually paring down the SString API surface. Removing the internal versions of printf and wprintf will make transcoding rarer and, when needed, only occur at the margins of user interaction.

After seeing @jkotas's comments align with some of my original thoughts, coupled with offline conversations with other runtime developers, I would say removing "printf" style manipulation of WCHAR strings and when needed making explicit "tostring" functions would have more longer-term value and we can avoid this SString redesign.

@davidwrighton may have more thoughts about this direction.

davidwrighton · 2022-06-17T20:21:25Z

Unfortunately, I find myself in agreement with @AaronRobinsonMSFT here. I really like the idea of being a bit more explicit in our codebase about what is happening in the system, but this change doesn't appear to fundamentally actually fix anything that is broken, doesn't make the codebase less voluminous or easier to maintain, and introduces a bunch of risk that is hard to work with (the stack usage problem @jkotas calls out is extremely difficult to see.)

I'd prefer to see somewhat simpler changes that can be evaluated independently and are more clearly improvements, and or changes in behavior that we can evaluate.

For instance, something that allows us to create/parse type names faster. Maybe that would be an improvement if we worked in utf8, or moved the logic to managed code. I don't know. I'd love to see perf numbers.

A move to a PathString type for path strings, as paths in Linux/Windows are not utf8 or utf16, legal paths on our operating systems actually allow for non-Unicode sequences, as a Linux path is a null terminated series of bytes, and on Windows a path is a null terminated sequence of 16 bit numbers.

A fix to just remove code that isn't used, such as the SStringRegEx.

A change that deletes our wprintf implementation.

jkoritzinsky · 2022-06-20T19:20:43Z

I'll take a different route than this PR then. Here's a few ideas I have, let me know what you think of them:

Add a contract check for the representation in GetUTF8NoConvert and any other methods that currently assume a representation and don't check it.
Add a ConvertToUTF8 method to SString so we can force the current SString representation to UTF8 instead of needing to use the stack buffer.
Remove the ANSI representation from SString to simplify the codebase (it's used a very small number of times) and explore moving both the ANSI and ASCII representations to use UTF8 so SString will have 2 representations, not 4.
Providing a debug-only mechanism to assert the representation of an SString so we can ensure that we aren't transcoding more than expected.

jkotas · 2022-06-20T20:30:21Z

Add a contract check for the representation in GetUTF8NoConvert and any other methods that currently assume a representation and don't check it.

I would delete GetUTF8NoConvert method and replace it with GetUTF8 method that works exactly like GetUnicode (ie converts if necessary).

ghost · 2022-07-22T23:02:38Z

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

jkoritzinsky added 7 commits May 23, 2022 15:13

Make SString generic on an encoding traits type.

d37d5e4

Merge branch 'main' of github.com:dotnet/runtime into sstring-explici…

42fa6ab

…t-encoding

Fix linux build

f913366

Rename a few things to reduce the overall number of changes.

32ddf53

`SString<TEncoding>` was renamed to `EString<TEncoding>` (means "encoded string") and `SString` is now an alias of `EString<EncodingUnicode>`.

Fix linux build.

b2588e6

A few more changes to reduce diffs.

bc07888

More diff reduction

8ebc5df

jkoritzinsky added the area-VM-coreclr label May 24, 2022

ghost assigned jkoritzinsky May 24, 2022

Various fixes for SString. Reimplement the move semantics on SBuffer …

0a5056c

…directly to avoid depending on SBuffer implementation details too much in EString.

jkoritzinsky force-pushed the sstring-explicit-encoding branch from 5307ae6 to 0a5056c Compare May 25, 2022 23:14

jkoritzinsky and others added 17 commits May 26, 2022 10:17

Fix base type member usage with templates

26f4944

Store the result of GetPathToLoad in a local so we don't trash GetLas…

824aa67

…tError when the temporary is destroyed.

Fix reuse bug in assembly loading.

b146aae

Merge branch 'main' of github.com:dotnet/runtime into sstring-explici…

e95cbef

…t-encoding

Merge branch 'sstring-explicit-encoding' of github.com:jkoritzinsky/r…

1a4d621

…untime into sstring-explicit-encoding

Remove outdated comment

7b87012

Fix unix build

a7dcfad

Fix immutable empty shortcut and additional cleanup.

c445d35

Add static_assert message

84876d8

Fix cast issue on 32-bit

2f01a7a

Fix default constructor to correctly mark UTF8/ASCII default EStrings…

5aa2896

… as empty

Various fixes

b4d2fbd

Fix canary and buffer sharing handling. Fix another constructor that …

bcfd0b7

…set up invalid state to start.

Merge branch 'main' into sstring-explicit-encoding

5a22849

Fix build break

7523a1f

Fix super-large array rank error case handling.

a6fc06d

Fix some typos from bad find-replace in nativeaot code

5f53cc7

jkoritzinsky requested a review from AaronRobinsonMSFT June 9, 2022 23:12

jkoritzinsky marked this pull request as ready for review June 9, 2022 23:12

jkoritzinsky requested a review from MichalStrehovsky as a code owner June 9, 2022 23:12

Merge branch 'main' of github.com:dotnet/runtime into sstring-explici…

93e98cf

…t-encoding

jkoritzinsky requested review from davidwrighton and jkotas June 15, 2022 00:11

jkotas reviewed Jun 15, 2022

View reviewed changes

This was referenced Jun 21, 2022

Change SString::GetUTF8NoConvert to GetUTF8 that converts the SString #71101

Merged

Remove SString::UIndex and SString::UIterator as they are never used. #71103

Merged

AaronRobinsonMSFT marked this pull request as draft June 22, 2022 17:45

jkoritzinsky mentioned this pull request Jun 23, 2022

Remove SString ANSI representation. #71186

Merged

ghost closed this Jul 22, 2022

jkoritzinsky deleted the sstring-explicit-encoding branch July 22, 2022 23:07

ghost locked as resolved and limited conversation to collaborators Aug 22, 2022

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make SString typed on encoding #69748

Make SString typed on encoding #69748

jkoritzinsky commented May 24, 2022

jkotas commented Jun 15, 2022

jkoritzinsky commented Jun 15, 2022

jkotas Jun 15, 2022 •

edited

Loading

jkotas Jun 15, 2022 •

edited

Loading

jkotas Jun 15, 2022

jkoritzinsky Jun 15, 2022

jkotas Jun 15, 2022

jkotas commented Jun 15, 2022

AaronRobinsonMSFT commented Jun 15, 2022

davidwrighton commented Jun 17, 2022

jkoritzinsky commented Jun 20, 2022

jkotas commented Jun 20, 2022 •

edited

Loading

ghost commented Jul 22, 2022

Make SString typed on encoding #69748

Make SString typed on encoding #69748

Conversation

jkoritzinsky commented May 24, 2022

jkotas commented Jun 15, 2022

jkoritzinsky commented Jun 15, 2022

jkotas Jun 15, 2022 • edited Loading

Choose a reason for hiding this comment

jkotas Jun 15, 2022 • edited Loading

Choose a reason for hiding this comment

jkotas Jun 15, 2022

Choose a reason for hiding this comment

jkoritzinsky Jun 15, 2022

Choose a reason for hiding this comment

jkotas Jun 15, 2022

Choose a reason for hiding this comment

jkotas commented Jun 15, 2022

AaronRobinsonMSFT commented Jun 15, 2022

davidwrighton commented Jun 17, 2022

jkoritzinsky commented Jun 20, 2022

jkotas commented Jun 20, 2022 • edited Loading

ghost commented Jul 22, 2022

jkotas Jun 15, 2022 •

edited

Loading

jkotas Jun 15, 2022 •

edited

Loading

jkotas commented Jun 20, 2022 •

edited

Loading