
Conversation

@AntonLapounov
Member

@AntonLapounov AntonLapounov commented Aug 30, 2022

Under some conditions we may either create or try to create too many GC heaps, which may lead to buffer overruns in GC code. For example, a user may override the number of processors available to the process (e.g., by setting DOTNET_PROCESSOR_COUNT=100 and DOTNET_GCNoAffinitize=1). Or an app may be running under Windows 11 on a machine with asymmetric processor groups, where GetSystemInfo and GetProcessAffinityMask calls may return inconsistent results¹. In those cases, g_num_active_processors (the number of heaps we create if not additionally overridden) may become greater than g_num_processors (the number of slots we allocate for handle tables). That leads to buffer overruns when a thread whose home heap number is greater than or equal to g_num_processors tries to create a GC handle.
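
To make the hazard concrete, here is a minimal self-contained sketch. The two global names mirror the runtime's globals, but the handle-table sizing is a hypothetical simplification, not the actual runtime code:

    // Minimal sketch of the mismatch: tables are sized from one count but
    // indexed by home heap numbers derived from the other.
    #include <cstdint>
    #include <vector>

    static uint32_t g_num_processors = 8;          // slots allocated for handle tables
    static uint32_t g_num_active_processors = 100; // e.g., DOTNET_PROCESSOR_COUNT=100

    int main()
    {
        // Slots are sized from g_num_processors...
        std::vector<void*> handle_table_slots(g_num_processors);

        // ...but a thread's home heap number can reach
        // g_num_active_processors - 1, which indexes far past the end.
        uint32_t home_heap_number = g_num_active_processors - 1; // 99 >= 8
        return (home_heap_number < handle_table_slots.size()) ? 0 : 1;
    }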

Additionally, the heap_select::init function in gc.cpp may overrun the proc_no_to_numa_node array at this line:

                proc_no_to_numa_node[proc_no[i]] = cur_node_no;

if the previous loop exited early and left the proc_no[i] value uninitialized.

The fix is to ensure that g_num_active_processors is never greater than g_num_processors and to iterate only through the initialized part of the proc_no array.
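
A minimal sketch of the shape of both fixes, with a stub standing in for GCToOSInterface::GetProcessorForHeap and simplified types and counts (the real loops live in heap_select::init in gc.cpp):

    #include <cstdint>
    #include <cstdio>

    const int MAX_SUPPORTED_CPUS = 1024; // stand-in value for the sketch

    static uint32_t g_num_processors = 8;
    static uint32_t g_num_active_processors = 100;

    // Stub for GCToOSInterface::GetProcessorForHeap, which can fail and
    // make the first loop below exit early.
    static bool GetProcessorForHeap(uint16_t heap, uint16_t* proc, uint16_t* node)
    {
        if (heap >= 4) return false; // simulate an early failure
        *proc = heap;
        *node = 0;
        return true;
    }

    int main()
    {
        // Fix 1: never let the active count exceed the number of slots we
        // allocate based on g_num_processors.
        if (g_num_active_processors > g_num_processors)
            g_num_active_processors = g_num_processors;

        uint16_t proc_no[MAX_SUPPORTED_CPUS];
        uint16_t node_no[MAX_SUPPORTED_CPUS];
        uint16_t proc_no_to_numa_node[MAX_SUPPORTED_CPUS] = {};
        int n_heaps = 8;

        // First loop: track how many entries were actually initialized.
        int initialized = 0;
        for (int i = 0; i < n_heaps; i++)
        {
            if (!GetProcessorForHeap((uint16_t)i, &proc_no[i], &node_no[i]))
                break; // leaves proc_no[i..n_heaps) uninitialized
            initialized++;
        }

        // Fix 2: bound the second loop by `initialized`, not n_heaps, so we
        // never read an uninitialized proc_no[i] and overrun
        // proc_no_to_numa_node.
        for (int i = 0; i < initialized; i++)
            proc_no_to_numa_node[proc_no[i]] = node_no[i];

        printf("initialized %d of %d heaps\n", initialized, n_heaps);
        return 0;
    }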

Footnotes

  1. We may want to address that separately.

@AntonLapounov AntonLapounov added this to the 7.0.0 milestone Aug 30, 2022
@AntonLapounov AntonLapounov requested a review from Maoni0 August 30, 2022 23:17
@ghost ghost assigned AntonLapounov Aug 30, 2022
@ghost

ghost commented Aug 30, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


      uint16_t procIndex = 0;
      size_t cnt = heap_number;
    - for (uint16_t i = 0; i < GCToOSInterface::GetTotalProcessorCount(); i++)
    + for (uint16_t i = 0; i < MAX_SUPPORTED_CPUS; i++)
@AntonLapounov
Member Author

For consistency with the implementation in gcenv.os.cpp; see also #206 for context.

@AntonLapounov AntonLapounov requested a review from cshung August 30, 2022 23:29
@AntonLapounov
Member Author

AntonLapounov commented Aug 31, 2022

The failure in the Linux_musl arm leg seems unrelated; it has been reported (#69927 (comment), #69204 (comment)) in the past.

Assert failure(PID 22 [0x00000016], Thread: 59 [0x003b]): RawGetMethodTable()
    File: /__w/1/s/src/coreclr/gc/gc.cpp Line: 4539
    Image: /root/helix/work/correlation/dotnet

@jkotas
Member

jkotas commented Aug 31, 2022

> The failure in the Linux_musl arm leg seems unrelated; it has been reported (#69927 (comment), #69204 (comment)) in the past.

This assert is a generic sign of GC heap corruption that can have many different root causes. All past instances of this crash are believed to be fixed. We do not have an active issue on it.

If you believe that this is unrelated, you should get a new active issue created on it.

@AntonLapounov
Member Author

AntonLapounov commented Aug 31, 2022

I have opened #74895, but I cannot get the call stack from the dump. LLDB on Ubuntu ARM64 displays "error: Don't know how to parse core file. Unsupported OS.", and WinDbg displays just these two frames:

00  ld_musl_armhf_so!sigsetjmp+0x35
01  ld_musl_armhf_so!raise+0x2e

@janvorli
Member

janvorli commented Sep 1, 2022

You cannot open a dump taken on a musl-based distro on a glibc-based one. And in fact, to get a meaningful stack trace, you need to use exactly the same distro and version of the distro (it also has to have the same version of glibc) where the core was taken.

@janvorli
Member

janvorli commented Sep 1, 2022

Docker is what I use in these cases.

Member

@janvorli janvorli left a comment

LGTM, thank you!

@AntonLapounov AntonLapounov merged commit 2184b18 into dotnet:main Sep 2, 2022
@AntonLapounov AntonLapounov deleted the FixBufferOverrunsInGC branch September 2, 2022 00:00
@AntonLapounov
Member Author

/backport to release/7.0

@github-actions
Contributor

github-actions bot commented Sep 2, 2022

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/2975767222

@AntonLapounov
Member Author

> And in fact, to get a meaningful stack trace, you need to use exactly the same distro and version of the distro (it also has to have the same version of glibc) where the core was taken.

Do we have instructions written anywhere? If not, we should document them.

@ghost ghost locked as resolved and limited conversation to collaborators Oct 2, 2022