Data Corruption With Ref Locals, Punning, and Pinned Object Heap #76929

@kevin-montrose

Description

This was a bear to diagnose, and I'm still not 100% sure what exactly is happening, but the scenario is:

  • Have a bunch of byte[]s allocated on the POH
  • Those byte[]s are referenced by a ConcurrentDictionary
  • Have a bunch of threads getting those byte[]s and punning them via MemoryMarshal.Cast
    • An earlier version used a ref byte and some unsafe code, but I've removed the unsafe code to eliminate it as a possible cause
  • Have some other threads removing byte[]s from the ConcurrentDictionary
  • After some time, data corruption occurs
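
For readers unfamiliar with the terms, here is a minimal single-threaded sketch of what "allocated on the POH" and "punning via MemoryMarshal.Cast" mean in the list above. The `PunSketch` class, its `Punned` struct, and the method names are hypothetical and exist only for illustration. This sketch is not expected to reproduce the corruption by itself; the full multi-threaded repro is under Reproduction Steps.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PunSketch
{
    // Hypothetical 8-byte struct, used only for this illustration.
    [StructLayout(LayoutKind.Sequential)]
    public struct Punned
    {
        public ulong A;
    }

    // Allocates a zeroed byte[] on the Pinned Object Heap (pinned: true).
    public static byte[] AllocateOnPoh(int size) =>
        GC.AllocateArray<byte>(size, pinned: true);

    // Reinterprets the start of the byte[] as a Punned and reads its field.
    // Indexing throws if data holds fewer bytes than one Punned.
    public static ulong ReadThroughPun(byte[] data)
    {
        Span<Punned> punned = MemoryMarshal.Cast<byte, Punned>(data.AsSpan());
        ref Punned first = ref punned[0];
        return first.A;
    }

    // Writes through the same kind of ref; the bytes of data change in place.
    public static void WriteThroughPun(byte[] data, ulong value)
    {
        Span<Punned> punned = MemoryMarshal.Cast<byte, Punned>(data.AsSpan());
        ref Punned first = ref punned[0];
        first.A = value;
    }
}
```

Calling `PunSketch.ReadThroughPun(PunSketch.AllocateOnPoh(8))` should return 0, since `GC.AllocateArray` hands back zeroed memory. That is the invariant the checking threads in the repro rely on.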

I first discovered this as random-looking pointers getting written into those byte[] arrays, but in the process of winnowing down to a smaller reproduction, null reference exceptions, seg faults, and other "you've corrupted the process"-style errors became more likely. I interpret this as the same corruption happening: because my punned arrays are now smaller, the stray writes are more likely to hit something else.

I first noticed this in .NET 7 RC (7.0.0-rc.1.22427.1 specifically) but it has also been reproduced in .NET 6.

Reproduction Steps

I have a gist I used to winnow down the repro some.

Latest is copied here:

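// drop this into an xUnit test class in a test project
// usings needed at the top of the file:
//   System, System.Collections.Concurrent, System.Diagnostics, System.Linq,
//   System.Runtime.CompilerServices, System.Runtime.InteropServices,
//   System.Threading, Xunit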

[StructLayout(LayoutKind.Explicit, Size = Size)]
private struct Punned
{
    internal const int Size = 8;

    [FieldOffset(0)]
    public ulong A;
}

/// <summary>
/// This spawns a bunch of threads, half of which do integrity checks on a punned byte[]
/// and half of which randomly replace a referenced byte[].
/// 
/// Sometimes things just break: either field corruption, null ref, or access violation.
/// 
/// Reproduces in DEBUG builds and RELEASE builds.
/// 
/// Tends to take < 10 iterations, but not more than 100.  You know, on my machine.
/// 
/// Only reproduces if you use the POH; SOH and LOH are fine.
/// </summary>
[Fact]
public void Repro()
{
    // DOES repro with ALLOC_SIZE >= Punned.Size
    //        and with USE_POH == true
    //
    // does not repro if USE_POH == false

    // tweak these to mess with alignment and heap
    const int ALLOC_SIZE = Punned.Size;
    const bool USE_POH = true;

    Assert.True(Punned.Size == Unsafe.SizeOf<Punned>(), "Hey, this isn't right");
    Assert.True(ALLOC_SIZE >= Punned.Size, "Hey, this isn't right");

    const int MAX_KEY = 1_000_000;

    var iter = 0;
    while (true)
    {
        Debug.WriteLine($"Iteration: {iter}");
        iter++;

        var dict = new ConcurrentDictionary<int, byte[]>();

        // allocate
        for (var i = 0; i < MAX_KEY; i++)
        {
            // use ALLOC_SIZE so that tweaking the constant above actually changes the allocation
            dict[i] = GC.AllocateArray<byte>(ALLOC_SIZE, pinned: USE_POH);
        }

        // start all the threads
        using var startThreads = new SemaphoreSlim(0, Environment.ProcessorCount);

        var modifyThreads = new Thread[Environment.ProcessorCount / 2];
        for (var i = 0; i < modifyThreads.Length; i++)
        {
            modifyThreads[i] = ModifyingThread(i, startThreads, dict);
        }

        var checkThreads = new Thread[Environment.ProcessorCount - modifyThreads.Length];
        using var stopCheckThreads = new SemaphoreSlim(0, checkThreads.Length);
        for (var i = 0; i < checkThreads.Length; i++)
        {
            checkThreads[i] = IntegrityThread(i, MAX_KEY / checkThreads.Length, startThreads, stopCheckThreads, dict);
        }

        // let 'em go
        startThreads.Release(modifyThreads.Length + checkThreads.Length);

        // wait for modifying threads to finish...
        for (var i = 0; i < modifyThreads.Length; i++)
        {
            modifyThreads[i].Join();
        }

        // stop check threads..
        stopCheckThreads.Release(checkThreads.Length);
        for (var i = 0; i < checkThreads.Length; i++)
        {
            checkThreads[i].Join();
        }
    }

    static Thread IntegrityThread(
        int threadIx,
        int step,
        SemaphoreSlim startThreads,
        SemaphoreSlim stopThreads,
        ConcurrentDictionary<int, byte[]> dict
    )
    {
        using var threadStarted = new SemaphoreSlim(0, 1);

        var t =
            new Thread(
                () =>
                {
                    threadStarted.Release();

                    startThreads.Wait();

                    while (!stopThreads.Wait(0))
                    {
                        for (var i = 0; i < MAX_KEY; i++)
                        {
                            var keyIx = (threadIx * step + i) % MAX_KEY;

                            ref Punned punned = ref Pun(dict[keyIx]);

                            Check(ref punned);
                        }
                    }
                }
             );
        t.Name = $"{nameof(Repro)} Integrity #{threadIx}";
        t.Start();

        threadStarted.Wait();

        return t;
    }

    static Thread ModifyingThread(int threadIx, SemaphoreSlim startThreads, ConcurrentDictionary<int, byte[]> dict)
    {
        using var threadStarted = new SemaphoreSlim(0, 1);

        var t = new
            Thread(
                () =>
                {
                    threadStarted.Release();

                    var rand = new Random(threadIx);

                    startThreads.Wait();

                    for (var i = 0; i < 1_000_000; i++)
                    {
                        var keyIx = rand.Next(MAX_KEY);

                        // use ALLOC_SIZE so that tweaking the constant actually changes the allocation
                        var newArr = GC.AllocateArray<byte>(ALLOC_SIZE, pinned: USE_POH);
                        Assert.True(newArr.All(x => x == 0));

                        // make sure it comes up reasonable
                        ref Punned punned = ref Pun(newArr);
                        Assert.Equal(0UL, punned.A);

                        // this swaps out the only reference to a byte[]
                        // EXCEPT for any of the checking threads, which only
                        // grab it through a ref
                        dict.AddOrUpdate(keyIx, static (_, passed) => passed, static (_, _, passed) => passed, newArr);
                    }
                }
            );
        t.Name = $"{nameof(Repro)} Modify #{threadIx}";
        t.Start();

        threadStarted.Wait();

        return t;
    }

    static ref Punned Pun(byte[] data)
    {
        var span = data.AsSpan();

        var punned = MemoryMarshal.Cast<byte, Punned>(span);

        return ref punned[0];
    }

    static void Check(ref Punned val)
    {
        // all possible bit patterns are well known
        var a = val.A;
        Assert.True(a == 0);
    }
}

This will fail either in Check, with a null ref in an impossible place (usually AddOrUpdate), or with some variant of "runtime has become corrupt". The NRE is most common with the above, but earlier revisions usually failed in Check.

In my testing this only happens if the POH is used (toggle USE_POH to verify), and at all (legal) sizes for the byte[]s (change ALLOC_SIZE to verify).

Expected behavior

I would expect the attached code to run fine forever.

Actual behavior

Crashes with some sort of data corruption.

Regression?

No, this reproduces (at least in part) on .NET 6.

Known Workarounds

Don't use the POH I guess?
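
As a rough sketch of that workaround, allocation can be routed through one helper so the heap choice is a single switch. `BufferAllocator` and `UsePinnedObjectHeap` are hypothetical names, not part of any library. Note that arrays allocated with `pinned: false` can be moved by the GC, so this only suits callers that do not rely on a fixed address.

```csharp
using System;

public static class BufferAllocator
{
    // Hypothetical switch: false avoids the POH, true restores POH allocation.
    public static bool UsePinnedObjectHeap { get; set; } = false;

    // Allocates a zeroed byte[] on either the normal GC heaps or the POH.
    public static byte[] Allocate(int size) =>
        GC.AllocateArray<byte>(size, pinned: UsePinnedObjectHeap);
}
```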

Configuration

This was first noticed on:

  • Microsoft Windows 11 Home Insider Preview: 10.0.25151 N/A Build 25151
  • AMD64 Family 23 Model 96 Stepping 1 AuthenticAMD ~2000 Mhz: AMD Ryzen™ 7 4980U
  • .NET 7: Microsoft.WindowsDesktop.App 7.0.0-rc.1.22427.1

It is also reproducing, at least in part, on .NET 6.

It has been reproduced on a colleague's machine as well, but I don't have the specifics beyond it also being x64, Windows, and .NET 7 & 6.

Other information

When I've found a corrupted byte[] (instead of an NRE or other crash), its contents look very pointer-y but seem to point to memory outside of any heap.

This makes me suspect some sort of GC bug, perhaps as part of growing or shrinking the POH, but that is ~98% guesswork.
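
For anyone trying to eyeball the same thing, a small diagnostic helper along these lines can format the first 8 bytes of a suspect array as a 64-bit hex value, which makes a pointer-like pattern easier to spot. `CorruptionDump` and `FirstWordAsHex` are hypothetical names. The value is read in the machine's native byte order, and the helper says nothing about which heap, if any, the value points into.

```csharp
using System;

public static class CorruptionDump
{
    // Hypothetical diagnostic helper: formats the first 8 bytes of data
    // as a 64-bit hex value, read in the machine's native byte order.
    public static string FirstWordAsHex(byte[] data)
    {
        if (data.Length < sizeof(ulong))
        {
            throw new ArgumentException("Need at least 8 bytes.", nameof(data));
        }

        ulong word = BitConverter.ToUInt64(data, 0);
        return "0x" + word.ToString("X16");
    }
}
```

An uncorrupted array from the repro should come back as `0x0000000000000000`; anything else is worth logging before the assertion in Check fires.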
