
Conversation

@krynju (Member) commented Sep 11, 2021

Since the UID was generated with rand(), weird things like KeyErrors would very rarely happen, especially when rerunning tests or using @btime.

I noticed this by running those use cases and by inspecting EAGER_ID_MAP and thunk_dict.

I'm about 50% sure it fixes the occasional hangs as well, but I'll try to confirm that next week.

fixes #267
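
As a minimal sketch of the idea (hypothetical names, not necessarily the exact code in this PR): instead of drawing UIDs from rand(), hand them out from a process-local atomic counter so they are unique by construction.

```julia
# Hypothetical sketch: eager-thunk UIDs from an atomic counter instead of
# rand(), so repeated test runs or @btime loops never produce a duplicate ID.
const NEXT_UID = Threads.Atomic{UInt64}(1)

# atomic_add! returns the previous value and increments in one atomic step,
# so concurrent callers always receive distinct IDs.
next_uid() = Threads.atomic_add!(NEXT_UID, UInt64(1))
```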

@codecov-commenter commented Sep 11, 2021

Codecov Report

Merging #281 (b9a4d41) into master (6a1aa1f) will not change coverage.
The diff coverage is 0.00%.


@@           Coverage Diff           @@
##           master    #281    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files          34      35     +1     
  Lines        2754    2913   +159     
=======================================
- Misses       2754    2913   +159     
Impacted Files            | Coverage Δ
src/thunk.jl              | 0.00% <0.00%> (ø)
src/Dagger.jl             | 0.00% <0.00%> (ø)
src/sch/Sch.jl            | 0.00% <0.00%> (ø)
src/lib/util.jl           | 0.00% <0.00%> (ø)
src/processor.jl          | 0.00% <0.00%> (ø)
src/sch/eager.jl          | 0.00% <0.00%> (ø)
src/lib/logging.jl        | 0.00% <0.00%> (ø)
src/lib/logging-extras.jl | 0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6a1aa1f...b9a4d41.

@krynju (Member, Author) commented Sep 12, 2021

I checked for hangs and overall it no longer hangs, but the scheduler slows down dramatically once roughly 100k thunks have been processed, which can feel like a hang. That will need a separate issue for tracking.

@jpsamaroo (Member)

I don't understand how this can fix the hangs, since the RNG is thread-safe. Is there some insight that I'm missing?

@krynju (Member, Author) commented Sep 12, 2021

I'm not sure myself, and I wouldn't consider the hangs 100% gone yet, but they definitely happen less often now; usually I just get severe slowdowns instead.

Whenever I Ctrl-C'd out of a hang, the stack trace would usually point at a wait on a future. My initial theory was that if EAGER_ID_MAP sometimes pointed at the wrong tid, the wait would be called on a future that never notifies the waiting task (maybe the task behind that future hadn't started yet, producing a deadlock). I'm not sure, though, so I'd have to investigate more.

But as I said, my synthetic hang scenarios would usually lock up after roughly 150 iterations; with this change I can easily go above 1k iterations and usually hit a slowdown after 2k iterations instead of a hang.
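
To make the suspected failure mode concrete (hypothetical names only, not Dagger's actual internals): if two thunks ever draw the same random UID, the second registration silently overwrites the first one in the id map, so later lookups for the first thunk resolve to the wrong thunk id and the caller ends up waiting on the wrong future.

```julia
# Hypothetical illustration of what a UID collision does to an id map.
id_map = Dict{UInt64,Int}()   # uid => thunk id (a stand-in for EAGER_ID_MAP)

uid = UInt64(42)              # suppose both thunks draw the same random UID
id_map[uid] = 1               # first thunk registered
id_map[uid] = 2               # second registration silently overwrites it

id_map[uid] == 2              # true: lookups for the first thunk now hit thunk 2
```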

@krynju (Member, Author) commented Sep 24, 2021

Hey, so do we merge this at some point, or is there a reason not to?
I've been running it since I opened this PR and it's all good: no KeyErrors.
My groupby branch has this merged as well and has been passing consistently.

@jpsamaroo (Member)

Ok, I'll merge this now, although I still don't really understand why it fixes it.

@jpsamaroo merged commit 343f74a into JuliaParallel:master on Sep 24, 2021
@jpsamaroo (Member)

Thanks again!

@krynju (Member, Author) commented Sep 25, 2021

It fixes the KeyError, which should have been rare in the first place since the UID was a rand over UInt64, yet it wasn't rare, which is still weird to me.
As for the hangs, now that we've found their actual cause, they don't seem to be affected by this change at all, so my initial comment about them is invalid at this point, I guess.
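
For a rough sense of scale (a back-of-the-envelope estimate, not a measurement): by the birthday bound, the probability of any collision among n IDs drawn uniformly from UInt64 is roughly n^2 / 2^65, which is vanishingly small even at a million thunks, so chance collisions alone don't obviously explain how often the KeyErrors showed up.

```julia
# Back-of-the-envelope birthday bound: probability of any collision among
# n IDs drawn uniformly at random from UInt64 (2^64 values) is ≈ n^2 / 2^65.
collision_prob(n) = n^2 / 2.0^65

collision_prob(1_000_000)   # ≈ 2.7e-8, i.e. effectively never by pure chance
```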



Development: successfully merging this pull request may close the issue "Eager scheduler error when @spawn inside a thunk".
