Skip to content

Conversation

@timholy
Copy link
Member

@timholy timholy commented Feb 25, 2022

Fixes #44338

I tested this with

(pkgs) pkg> st
Status `/tmp/tim/pkgs/Project.toml`
  [5c1252a2] GeometryBasics v0.4.1
  [7073ff75] IJulia v1.23.2
  [98e50ef6] JuliaFormatter v0.22.4
  [add582a8] MLJ v0.17.1
  [d491faf4] MLJModels v0.15.4
  [ccf2f8ad] PlotThemes v2.0.1
  [7792a7ef] StrideArraysCore v0.2.12

which only intermittently triggers #44338, but the frequency was well above 1 in 10 and probably more like 1 in 3. After a few manual runs, I did this:

$ for i in {1..50}
> do
>     rm -rf ~/.julia/compiled/v1.9/*
>     ./julia --startup=no -e 'using Pkg; Pkg.activate("/tmp/tim/pkgs"); Pkg.precompile()'
>     echo "$i done"
> done

49 of these 50 runs completed successfully. The one exception threw this backtrace:

signal (11): Segmentation fault
in expression starting at none:0
jl_collect_backedges at /home/tim/src/julia-master/src/dump.c:1229
ijl_save_incremental at /home/tim/src/julia-master/src/dump.c:2676
jl_write_compiler_output at /home/tim/src/julia-master/src/precompile.c:65
ijl_atexit_hook at /home/tim/src/julia-master/src/init.c:207
jl_repl_entrypoint at /home/tim/src/julia-master/src/jlapi.c:707
jl_load_repl at /home/tim/src/julia-master/cli/loader_lib.c:271
main at /home/tim/src/julia-master/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/tim/src/julia-master/usr/bin/julia-debug (unknown line)
Allocations: 8546957 (Pool: 8542808; Big: 4149); GC: 10
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types ~/src/julia-master/usr/share/julia/stdlib/v1.9/Pkg/src/Types.jl:67
 [2] precompile(ctx::Pkg.Types.Context, pkgs::Vector{String}; internal_call::Bool, strict::Bool, warn_loaded::Bool, already_instantiated::Bool, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Pkg.API ~/src/julia-master/usr/share/julia/stdlib/v1.9/Pkg/src/API.jl:1427
 [3] precompile
   @ ~/src/julia-master/usr/share/julia/stdlib/v1.9/Pkg/src/API.jl:1060 [inlined]
 [4] #precompile#225
   @ ~/src/julia-master/usr/share/julia/stdlib/v1.9/Pkg/src/API.jl:1057 [inlined]
 [5] precompile (repeats 2 times)
   @ ~/src/julia-master/usr/share/julia/stdlib/v1.9/Pkg/src/API.jl:1057 [inlined]
 [6] top-level scope
   @ none:1

which is also htable-related, but one that we've used this way for a long time. In particular, the crash probably occurred in a part of that line that has been part of Julia for 5 years:

julia/src/dump.c

Line 1077 in fb85cb3

if (callees != HT_NOTFOUND && module_in_worklist(caller->def.method->module)) {

One other interesting case: while almost all of these runs took between 90-92s each to compile, on one run it took 436s. I don't have any data about what caused that, so it's not really actionable.

@timholy timholy added the backport 1.8 Change should be backported to release-1.8 label Feb 25, 2022
@timholy timholy mentioned this pull request Feb 25, 2022
47 tasks
@truedichotomy
Copy link

Thanks for acting on this bug so quickly!

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be correct, since it doesn't change the access patterns, just the initialization size

@timholy
Copy link
Member Author

timholy commented Feb 25, 2022

Operationally it fixes the problem, though I agree it's not the cause. Just checking, you saw #44338 (comment), right? The hang happens when the htable resizes, so I figured preventing the need to resize is not an unreasonable way to fix this.

Is there a lock I need to acquire or something?

@JeffBezanson
Copy link
Member

Looks like memory corruption? E.g. some of malloc's internal state getting clobbered.

@timholy
Copy link
Member Author

timholy commented Feb 25, 2022

I'll try firing up valgrind. FWIW analyzegc comes clean on #43990 (if I'm interpreting its output correctly).

@truedichotomy
Copy link

Tested this PR and it works with my system & all the packages.

@timholy
Copy link
Member Author

timholy commented Feb 28, 2022

Valgrind suggests there's a leak:

==2151518== 13,312 bytes in 8 blocks are definitely lost in loss record 11,916 of 12,166
==2151518==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2151518==    by 0x51D167E: equalhash_lookup_bp_r (equalhash.c:22)
==2151518==    by 0x51D16EF: equalhash_put_r (equalhash.c:22)
==2151518==    by 0x51D2202: fl_table_put (table.c:116)
==2151518==    by 0x51C37CD: apply_cl (flisp.c:1230)
==2151518==    by 0x51C7540: _applyn (flisp.c:685)
==2151518==    by 0x51C7D9C: fl_map1 (flisp.c:2223)
==2151518==    by 0x51C7D9C: fl_map1 (flisp.c:2198)
==2151518==    by 0x51C37CD: apply_cl (flisp.c:1230)
==2151518==    by 0x51C7540: _applyn (flisp.c:685)
==2151518==    by 0x51C7D9C: fl_map1 (flisp.c:2223)
==2151518==    by 0x51C7D9C: fl_map1 (flisp.c:2198)
==2151518==    by 0x51C37CD: apply_cl (flisp.c:1230)
==2151518==    by 0x51C8114: do_trycatch (flisp.c:901)

There are several "copies" of this report. But of all the htable uses, the only leak seems to come from here.

Consequently, I suspect this isn't directly related. valgrind seems to shed no light on the problem. Maybe ASAN would? But see #44361. I'm pretty stuck until that's fixed.

@JeffBezanson
Copy link
Member

Definitely looks like a job for rr.

@timholy
Copy link
Member Author

timholy commented Mar 2, 2022

This was tricky. In case it helps someone else, I usually get fatal rr errors, like

tim@flash:~/src/julia-master$ rr record ./julia --startup=no --project=/tmp/tim/pkgs -e 'using Pkg; Pkg.offline(true); Pkg.precompile()'
rr: Saving execution to trace directory `/home/tim/.local/share/rr/julia-0'.
Precompiling project...
  Progress [===========================>             ]  319/489
  ✗ NNlib
  ◒ GMT
  ◐ DataFrames
[FATAL /home/roc/rr/rr/src/record_syscall.cc:5984:rec_process_syscall_arch()] 
 (task 24400 (rec:24400) at time 7654240)
...

which is kind of a bummer. I found one system where I was able to capture a trace, but then during replay

Precompiling project...
  Progress [==========>                              ]  111/489
  ◑ ColorTypes
  ◒ Parsers
  ◓ BinDeps
  ◒ TimerOutputs
  ◒ TableTraits
[FATAL /home/tim/src/rr/rr/src/ReplaySession.cc:1187:prepare_syscallbuf_records()] 
 (task 2360459 (rec:2355138) at time 1108050)
...

☹️

So I ran a bash for loop, repeatedly recompiling a single package (Crayons, which precompiles quickly but occasionally exhibits this bug) while recording, until it got stuck:

for i in {1..500}
do
    rm -rf ~/.julia/compiled/v1.9/*
    date -Iseconds && rr record ./julia --startup=no --project=/tmp/tim/pkgs2 -e 'using Pkg; Pkg.offline(true); Pkg.precompile()'
done

(the date helps you figure out how long it's been running if you page away from the window). If it hangs, hit Ctrl-C twice and use rr on the next-to-last recording (it always seems to try to start the next one before intercepting the 2nd Ctrl-C).

Now it's tricky. The error is in malloc.c so I installed the glibc-sources package for my distro and unpackged the .xz file. The error appears to be in the equivalent of this line; info locals is

(rr) info locals
next = 0x555f9d4240a0
iters = 1
nb = <optimized out>
idx = 80
bin = <optimized out>
victim = 0x555f9d423890
size = 2064
victim_index = <optimized out>
remainder = <optimized out>
remainder_size = <optimized out>
block = <optimized out>
bit = <optimized out>
map = <optimized out>
fwd = <optimized out>
bck = 0x3
tcache_unsorted_count = 1
tcache_nb = 0
tc_idx = 127
return_cached = 0
__PRETTY_FUNCTION__ = "_int_malloc"

and

(rr) p bck->fd
Cannot access memory at address 0x13

which seems like it might explain the crash. AFAICT (and I do not have a clear understanding of how malloc.c works), mchunkptr->bk=0x3 is used as sentinel:

Thread 1 hit Watchpoint 3: -location *(mchunkptr)0x55f4fffe8a00

Old value = {mchunk_prev_size = 0, mchunk_size = 2065, fd = 0x55f4ffed6cb0, bk = 0x3, fd_nextsize = 0x0, bk_nextsize = 0x0}
New value = {mchunk_prev_size = 0, mchunk_size = 2065, fd = 0x7f2a0368cbe0 <main_arena+96>, bk = 0x3, fd_nextsize = 0x0, bk_nextsize = 0x0}
_int_malloc (av=av@entry=0x7f2a0368cb80 <main_arena>, bytes=bytes@entry=2048) at malloc.c:3792
3792	in malloc.c
(rr) c
Continuing.

Thread 1 hit Watchpoint 5: -location *(mchunkptr)0x55f4ffed6cb0

Old value = {mchunk_prev_size = 18446744073709547520, mchunk_size = 1025, fd = 0x7f2a0368cbe0 <main_arena+96>, bk = 0x55f4fffe8a00, fd_nextsize = 0x0, bk_nextsize = 0x0}
New value = {mchunk_prev_size = 18446744073709547520, mchunk_size = 1025, fd = 0x7f2a0368cfd0 <main_arena+1104>, bk = 0x55f500037720, fd_nextsize = 0x0, bk_nextsize = 0x0}
_int_malloc (av=av@entry=0x7f2a0368cb80 <main_arena>, bytes=bytes@entry=2048) at malloc.c:3884
3884	in malloc.c
(rr) c
Continuing.

Thread 1 received signal SIGSEGV, Segmentation fault.
_int_malloc (av=av@entry=0x7f2a0368cb80 <main_arena>, bytes=bytes@entry=2048) at malloc.c:3742
3742	in malloc.c
(rr) p victim
$16 = (mchunkptr) 0x55f4fffe8a00
(rr) p bck
$17 = (mchunkptr) 0x3

So it looks a bit as if a check for a potential error condition failed to make sure that the check was valid?

I don't know if anyone else has any other thoughts, but I may have reached the limit of how far I can carry this. If there is a possibility that this is a malloc bug and/or no one else has any good ideas, then it seems we have two options:

Thoughts?

@timholy timholy changed the title Pre-allocate htables when serializing external methodinstances Fix memory error during precompilation Mar 2, 2022
@timholy
Copy link
Member Author

timholy commented Mar 2, 2022

Found it! Phew.

There's no reason I can see not to do the pre-allocation, so I decided to add the real fix to this PR. But I've tested this change in isolation, and it solves the problem on its own (>1000 successful precompilation of Crayons without a hitch).

@timholy timholy added the merge me PR is reviewed. Merge when all tests are passing label Mar 2, 2022
@timholy timholy dismissed vtjnash’s stale review March 2, 2022 23:35

Problem fixed

@DilumAluthge
Copy link
Member

@vtjnash Does this look good to you now?

@timholy
Copy link
Member Author

timholy commented Mar 3, 2022

This definitively fixes a serious bug and there is otherwise not much to this. I don't think it needs to wait on a review.

@timholy timholy merged commit b4ea0f7 into master Mar 3, 2022
@timholy timholy deleted the teh/prealloc_htable branch March 3, 2022 08:32
@timholy timholy removed the merge me PR is reviewed. Merge when all tests are passing label Mar 3, 2022
KristofferC pushed a commit that referenced this pull request Mar 3, 2022
Comment on lines -289 to +293
htable_new(&visited, 0);
if (list) {
assert(jl_is_array(list));
size_t n0 = jl_array_len(list);
htable_new(&visited, n0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The htable_free below now access uninitialized memory and must be moved also

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed in #44446

@vtjnash
Copy link
Member

vtjnash commented Mar 3, 2022

I don't know this part of the code, but the change makes sense now

KristofferC pushed a commit that referenced this pull request Mar 7, 2022
@KristofferC KristofferC removed the backport 1.8 Change should be backported to release-1.8 label Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pkg does not complete precompilation, randomly hangs on different packages

6 participants