Skip to content

Conversation

@LebedevRI
Copy link
Contributor

@gbaraldi @vchuravy @vtjnash @d-netto once again, latest release branch is broken. Does nobody care?

Results in reliable julia segfault immediately upon startup. I don't know what is going on or why it is crashing or what is wrong, but this is very reproducible and bisection result is reliable.

Notably, does not seem to be happening in master.

See the following issue for more details:
Fixes #54560

This reverts commit 53180e4.

This is a respin of #56342

… queue (JuliaLang#53424)"

Results in reliable `julia` segfault immediately upon startup.
I don't know what is going on or why it is crashing or what is wrong,
but this is very reproducible and bisection result is reliable.

Notably, does not seem to be happening in `master`.

See the following issue for more details:
Fixes JuliaLang#54560

This reverts commit 53180e4.
@Keno
Copy link
Member

Keno commented Feb 7, 2025

Hi @LebedevRI, sorry for the frustration. A couple of points

Does nobody care?

First, please refrain from these kinds of side comments, they are counter-productive and we prefer to not use them here.

Now for the actual issue.

I don't know what is going on or why it is crashing or what is wrong

That is part of the problem. We generally avoid changing things based on observation alone without understanding the root cause. An additional issue here is that it does not reproduce for the developers or on CI, so it was somewhat low priority.

That said, I just read through the thread, and I think there's a conclusion to be had:

  1. I agree with the general assessment that malloc/calloc does not guarantee the appropriate alignment. We should add a checked version that catches this in the future.
  2. However, given the size of the allocated struct, I would have ordinarily expected the resulting allocation to be cache line aligned, just because of the internal large object allocation policies. Therefore, there is likely an additional root cause, either a change in glibc allocation policies or a compiler bug. It would be good to track this down if we can, so that we understand why you're seeing it and nobody else.
  3. For the time being, I concur that we should revert this change both on master and on 1.11.
  4. We should then attempt to revive this change on master (and possibly 1.11) by switching the allocation of the thread state to an allocator that can guarantee appropriate alignment.

@d-netto
Copy link
Member

d-netto commented Feb 7, 2025

Thanks for reviving the PR @LebedevRI and thanks for your comments @Keno.

@Keno: indeed, alignment issues in libc's allocator were my first guess here, but the last time I checked this doesn't reproduce when you start Julia with a single thread -- #54560 (comment).

If there were indeed alignment issues, not sure if I could understand why this bug would manifest with multiple threads, but not with a single thread.

@gbaraldi
Copy link
Member

gbaraldi commented Feb 7, 2025

It's not super surpising to me. Lots of threading specific data structures don't really get utilized when running single threaded, so I imagine it's just luck

@d-netto
Copy link
Member

d-netto commented Feb 7, 2025

The segfault happens when memsetting a struct in jl_init_thread_heap.

We run this code independently of thread counts.

@LebedevRI
Copy link
Contributor Author

Does nobody care?

First, please refrain from these kinds of side comments, they are counter-productive and we prefer to not use them here.

Yeah, i'm sorry, that should have been worded better.
It's just too frustrating, and i'm just getting back to things...

Now for the actual issue.

I don't know what is going on or why it is crashing or what is wrong

That is part of the problem. We generally avoid changing things based on observation alone without understanding the root cause. An additional issue here is that it does not reproduce for the developers or on CI, so it was somewhat low priority.

That said, I just read through the thread, and I think there's a conclusion to be had:

  1. I agree with the general assessment that malloc/calloc does not guarantee the appropriate alignment. We should add a checked version that catches this in the future.
  2. However, given the size of the allocated struct, I would have ordinarily expected the resulting allocation to be cache line aligned, just because of the internal large object allocation policies. Therefore, there is likely an additional root cause, either a change in glibc allocation policies or a compiler bug. It would be good to track this down if we can, so that we understand why you're seeing it and nobody else.
  3. For the time being, I concur that we should revert this change both on master and on 1.11.
  4. We should then attempt to revive this change on master (and possibly 1.11) by switching the allocation of the thread state to an allocator that can guarantee appropriate alignment.

Hurray.
Note that it was also backported to 1.10 in 996fe30. CC @KristofferC.

The segfault happens when memsetting a struct in jl_init_thread_heap.

We run this code both independently of thread counts.

This, unfortunately, doesn't mean much.
First, the debug info could simply be wrong.
Second, UB has a tendency to travel back in time, the real issue would have happened later in code,
but optimizations back-propagated it up to the first externally-observable side-effect (the memset),
that acted as a barrier/catchment for it.

@Keno
Copy link
Member

Keno commented Feb 8, 2025

Closing in favor of #57310

@Keno Keno closed this Feb 8, 2025
@LebedevRI LebedevRI deleted the revert-pr53424-for-1.11 branch February 8, 2025 00:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants