Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashSet<string> breaks if copied from one that uses a non-singleton default equality comparer #107222

Closed
wnayes opened this issue Aug 31, 2024 · 8 comments · Fixed by #107613
Closed
Assignees
Labels
area-System.Collections bug help wanted [up-for-grabs] Good issue for external contributors in-pr There is an active PR which will close this issue when it is merged
Milestone

Comments

@wnayes
Copy link

wnayes commented Aug 31, 2024

Description

There seems to be a hidden assumption within HashSet<string> that, if broken, leads to a set containing strings that fail to be looked up. The assumption is that the instance returned by EqualityComparer<string>.Default is the only instance of its type.

This is a simple reproduction of the issue:

[TestMethod]
public void HashSet_MinimalIssueReproduction()
{
    // Create another instance of GenericEqualityComparer<string> using reflection.
    Type comparerType = EqualityComparer<string>.Default.GetType();
    ConstructorInfo ctorInfo = comparerType.GetConstructor(Array.Empty<Type>());
    EqualityComparer<string> comparer = (EqualityComparer<string>)ctorInfo.Invoke(Array.Empty<object>());

    // Create a set using that comparer, and then run it through a copy constructor.
    HashSet<string> set = new HashSet<string>(comparer) { "a", "b", "c", "d" };
    HashSet<string> setClone = new HashSet<string>(set);

    // The copied set is broken!
    Assert.IsTrue(setClone.Contains("d"), "Fails!");
}

Of course, the question would then be - why would there ever be a second instance of the type backed by EqualityComparer<string>.Default?

The NonRandomizedStringEqualityComparer in its ISerializable implementation indicates via GetObjectData that it should serialize as a GenericEqualityComparer<string>:

void ISerializable.GetObjectData(SerializationInfo info, StreamingContext context)
{
    // We are doing this to stay compatible with .NET Framework.
    // Our own collection types will never call this (since this type is a wrapper),
    // but perhaps third-party collection types could try serializing an instance
    // of this.
    info.SetType(typeof(GenericEqualityComparer<string>));
}

So an implementation of a serializer that respects ISerializable would probably create a new instance of GenericEqualityComparer<string> when serializing NonRandomizedStringEqualityComparer (the default comparer that HashSet<string> seems to use, at least in my testing).

In other words, a realistic impact of this bug is: if you round trip a HashSet<string> through an ISerializable serializer, and then copy it via copy constructor, the copy will be broken. Here is a demonstration of this:

[TestMethod]
public void HashSet_SerializationIssueExample()
{
    HashSet<string> set = new HashSet<string>() { "a", "b", "c", "d" };

    // Mimic ISerializable serialization
    SerializationInfo info = new SerializationInfo(typeof(HashSet<string>), new FormatterConverter());
    StreamingContext context = new StreamingContext(StreamingContextStates.All);
    set.GetObjectData(info, context);
    ConstructorInfo constructor = typeof(HashSet<string>).GetConstructor(BindingFlags.NonPublic | BindingFlags.Instance, null, new[] { typeof(SerializationInfo), typeof(StreamingContext) }, null);
    HashSet<string> setRoundTripped = (HashSet<string>)constructor.Invoke(new object[] { info, context });
    setRoundTripped.OnDeserialization(null);

    Assert.IsTrue(setRoundTripped.Contains("d"), "This works.");
    HashSet<string> setClone = new HashSet<string>(setRoundTripped);
    Assert.IsTrue(setClone.Contains("d"), "Fails!");
}

Reproduction Steps

Run either of the above test cases in .NET 8.

Expected behavior

The HashSet<string> should always report that it contains strings that are within it. The tests should pass on .NET Framework and .NET 8.

Actual behavior

The tests pass on .NET Framework, and fail on .NET 8. (I did not try .NET 9, but I have been looking at source.dot.net when investigating this, and don't see any indication it would be fixed.)

Regression?

This was a regression noticed when migrating to .NET 8 from .NET Framework.

Known Workarounds

A serializer can special case these three singletons:

  • EqualityComparer<string>.Default
  • StringComparer.Ordinal
  • StringComparer.OrdinalIgnoreCase

If a serializer sees any of these, and upon deserializing ensures it returns the original singleton, assumptions can hold.

Configuration

.NET 8 / Windows 11, unlikely to be specific.

Other information

I'm not exactly sure how you would fix this, or if it would be realistic to do so at this point.

My understanding of the problem is that HashSet<string> ends up copying its backing bucket data incorrectly when the copy is made. It goes through this optimized code path when it shouldn't:

if (collection is HashSet<T> otherAsHashSet && EqualityComparersAreEqual(this, otherAsHashSet))
{
    ConstructFrom(otherAsHashSet);
}

In the problem circumstance, this is the new copy HashSet<string> that is being created, and it is using a NonRandomizedStringEqualityComparer instance that it gets from GetStringComparer. The NonRandomizedStringEqualityComparer basically "wraps" the EqualityComparer<string>.Default, which is a GenericEqualityComparer<string>.

The otherAsHashSet is in the unexpected state where its comparer is a non-singleton GenericEqualityComparer<string>, which isn't wrapped. The comparer isn't wrapped because GetStringComparer only wraps 3 specific singletons, when the given comparer is reference equal to one of them. (This is the assumption I refer to at the start of the issue.)

The problem arises because EqualityComparersAreEqual is implemented as set1.Comparer.Equals(set2.Comparer) and the Comparer property "unwraps" a wrapped comparer. this.Comparer will return an unwrapped GenericEqualityComparer<string>, and otherAsHashSet.Comparer returns a GenericEqualityComparer<string> that was never wrapped. I think these two non-reference equal instances of GenericEqualityComparer<string> are considered equal by Equals, leading to the bucket data copy in ConstructFrom. But it's not safe to copy bucket data, since the two sets are actually using comparers that use different hash codes.

For the bucket data copy scenario, I think it would be more correct to check set1._comparer.Equals(set2._comparer) so the unwrapping doesn't hide the fact that the comparers are actually different.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Aug 31, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@wnayes wnayes changed the title HashSet<string> breaks if copied from one using non-singleton default equality comparer HashSet<string> breaks if copied from one that uses a non-singleton default equality comparer Sep 1, 2024
@KalleOlaviNiemitalo
Copy link

Can the bug be reproduced without serialization?

  1. Create one HashSet<string> with StringComparer.Ordinal, which it recognizes and replaces with NonRandomizedStringEqualityComparer.WrappedAroundStringComparerOrdinal.
  2. Add hash collisions until it rehashes with forceNewHashCodes: true.
  3. Create a second HashSet<string> with the HashSet(IEnumerable<T>, IEqualityComparer<T>?) constructor, passing the first HashSet<string> and again StringComparer.Ordinal.

In this scenario, I think EqualityComparersAreEqual would return true, and the constructor would call ConstructFrom(otherAsHashSet), which would clone the buckets even though the second HashSet<string> is using NonRandomizedStringEqualityComparer and the first one is not.

@wnayes
Copy link
Author

wnayes commented Sep 3, 2024

Can the bug be reproduced without serialization?

It seems like this is the case. I tried a few brute attempts at hitting the forceNewHashCodes code path, but I wasn't able to get it to hit the code path and change comparers internally.

If I trigger the code path by reflection, it seems like it does cause the issue.

#if !NETFRAMEWORK
[TestMethod]
public void HashSet_BreaksAfterRandomizeHashes()
{
    HashSet<string> set = new HashSet<string>() { "a", "b" };

    // Get the Resize(int, bool) method via reflection:
    MethodInfo method = typeof(HashSet<string>).GetMethod("Resize", BindingFlags.NonPublic | BindingFlags.Instance, new Type[] { typeof(int), typeof(bool) });
    method.Invoke(set, new object[] { 4, true });

    Assert.IsTrue(set.Contains("a"));

    HashSet<string> setClone = new HashSet<string>(set);

    Assert.IsTrue(set.Contains("a"));
    Assert.IsTrue(setClone.Contains("a"), "Fails!");
}
#endif

The copy constructor notably has to not only head through ConstructFrom but also satisfy the threshold >= capacity check, so I had to be a little careful above with how I resize it.

@KalleOlaviNiemitalo
Copy link

Copying a Dictionary<TKey, TValue> uses AddRange, which compares if (source._comparer == _comparer), so it doesn't seem to have the same bug as HashSet<string>.

@eiriktsarpalis
Copy link
Member

If I'm understanding the issue correctly, the problem lies with the EqualityComparersAreEqual method which determines equivalence based on the publicly advertised Comparer property instead of the actual _comparer being currently used? Seems like something we should try fixing.

@eiriktsarpalis eiriktsarpalis added bug and removed untriaged New issue has not been triaged by the area owner labels Sep 9, 2024
@eiriktsarpalis eiriktsarpalis added this to the 10.0.0 milestone Sep 9, 2024
@eiriktsarpalis eiriktsarpalis added the help wanted [up-for-grabs] Good issue for external contributors label Sep 9, 2024
@eiriktsarpalis
Copy link
Member

I've marked this as up-for-grabs as it seems like a straightforward change that should be accompanied with relevant testing. Perhaps it might be worth auditing other potential misuses of the Comparer property inside the implementation.

@jeffhandley
Copy link
Member

@wnayes Thank you for the exceptionally detailed issue and the investigation you put into this! We will backport this fix into .NET 9 RC2, which releases in October. .NET 9's GA release is in November. Will you be able to target .NET 9 for your scenario once that release is available?

@wnayes
Copy link
Author

wnayes commented Sep 11, 2024

Glad it is being fixed! I was able to work around the particular instance of the issue I had encountered by updating an in-house serializer, but it will be good to not have to worry about the issue occurring in other ways. The product I work on targets LTS releases, so I would be looking forward to it on .NET 10.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Collections bug help wanted [up-for-grabs] Good issue for external contributors in-pr There is an active PR which will close this issue when it is merged
Projects
None yet
4 participants