Skip to content

Conversation

@vtjnash
Copy link
Member

@vtjnash vtjnash commented Nov 1, 2019

This seems to be exacerbating Windows CI problems such as #31810 and failing its own test. Disable it by default for this release (e.g. reverts 75cb006), so we can get this into the next RC, and explore fixes on master.

This seems to be exacerbating Windows CI problems such as #31810
and failing its own test. Disable it by default for this release.
@mbauman
Copy link
Member

mbauman commented Nov 1, 2019

Is the feature buggy or are the tests buggy?

@vtjnash
Copy link
Member Author

vtjnash commented Nov 1, 2019

I don't know—I'm just trying to get CI back to a stable state. Here's an example failure: https://build.julialang.org/#/builders/65/builds/5492/steps/2/logs/stdio

@vtjnash vtjnash closed this Nov 4, 2019
@vtjnash vtjnash reopened this Nov 4, 2019
@StefanKarpinski
Copy link
Member

Why not just fix the tempfile generation on Windows instead of letting a single buggy platform (Win32) dictate our API everywhere? Or, I dunno, disable the tests on the buggy platform?

@vtjnash
Copy link
Member Author

vtjnash commented Nov 4, 2019

It was merged too close to the freeze and so seems that there's been inadequate time to stabilize this.

@StefanKarpinski
Copy link
Member

That's some serious bullcrap. This functionality works fine on all platforms that don't have broken tempname; for some reason tempname on Win32 generates file names with ONLY TWO CHARACTERS OF RANDOM NAME. Why don't you fix that bug instead of this? Or, you know, we don't have to run these overly thorough tests on that broken platform. What actual problem has this change caused, aside from triggering test failures on Win32 due to a pre-existing bug on that platform?

@StefanKarpinski
Copy link
Member

This PR demonstrates the actual problem: #33765. Which is, as I've said for months now, that on Windows tempname does not generate unique names. Even with only 1000 names they are not unique. This is a serious bug that predates the change I made, which didn't cause this problem, it just exposed that pre-existing problem. So reverting my change is just sweeping the problem under the carpet, not fixing anything. If you want to sweep it under the rug rather than fixing it, then sweep it under the rug honestly—by skipping the tests that expose the problem on the broken platforms.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 5, 2019

I'm not entirely sure what that's supposed to prove. It's documented that function is not expected to return unique names. Also chill out and quit shouting bad information. It uses four hex characters to represent 65k unique names, as you'd see if you'd been looking at the logs or the docs.

It appears to be breaking many tests, not just this section, or I would have done as you suggested and just wrapped them in test_broken.

@davidanthoff
Copy link
Contributor

Why not just use UUIDs for the temporary file names on Windows? Even the MS docs seem to suggest that. Wouldn't that significantly simplify things, plus any danger of non unique file names would be gone? If the UUID stdlib is not available in base, couldn't one just call one of the GUID creation APIs from the win32 API, given that this would probably only be for win in any case?

@KristofferC KristofferC mentioned this pull request Nov 7, 2019
19 tasks
@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 7, 2019

@vtjnash what it proves is that the cleanup code is not the culprit. Cleanup works fine regardless of whether non-unique tempnames are generated or not. The tests on the other hand, assume that tempnames are actually unique—which they aren't on Windows—and fail if that assumption is wrong. Not because the cleanup code isn't working—it's working just fine—but because the number of unique names to cleanup is unpredictable. The exact reasons why don't really matter for the purposes of this PR, but clearly tempname does not return even 65k unique names, regardless of what you're saying and I recall seeing logs where tempnames on Windows only had two hex digis of random characters, which would mean only 256 random names, not 65k.

Rather than changing this default, which isn't causing any problems that didn't already exist, and which is helping to cleanup a lot of temp junk that was previously being left around by Julia processes, I'm suggesting that we:

  1. Turn of the failing tests on platforms with broken tempname (i.e. Windows)
  2. Open an issue about tempname brokenness, with a note to turn the cleanup tests back on when the issue is fixed.
  3. Fix the tempname problem, e.g. by using GUID's on Windows as David suggested.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 7, 2019

Even if there are 65k different names returned randomly by tempname, that's still a 50% chance of collision at around 256 tempnames. If I'm right about the two hex digits thing, then it's a 50% chance of collision at 16 tempnames. Either way, I think we should use GUIDs on Windows.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 7, 2019

Yeah, I'd prefer to see that fixed, but if we don't have that soon, I'm still proposing merging this as a stop-gap measure.

isn't causing any problems that didn't already exist
While almost true, it seems to be wreaking havoc on CI. A PR sometimes needs to be more that just conceptually better than before, it also needs to not break existing (even buggy) code

@vtjnash
Copy link
Member Author

vtjnash commented Nov 7, 2019

If I'm right about the two hex digits thing
You're not, so I don't know why you keep insisting on it.

There's another factor though, which might help show why this is so frequently failing as a direct result of this new code:

julia> @testset "foo" begin @show tempname() end;
tempname() = "C:\\cygwin64\\tmp\\jl_5410.tmp"

julia> @testset "foo" begin @show tempname() end;
tempname() = "C:\\cygwin64\\tmp\\jl_5410.tmp"

julia> @testset "foo" begin @show tempname() end;
tempname() = "C:\\cygwin64\\tmp\\jl_5410.tmp"

Yep, you saw that's right: a @testset currently guarantees that all of your tempname() calls will return the same value.

@davidanthoff
Copy link
Contributor

That is because @testset is always setting the same random number seed? I think the whole idea of using the standard rng from Julia to generate a random file name is not sound, to be honest...

In terms of options to fix this: is it important to have a 8.3 filename? If not, the solution really seems super simple: just generate a GUID and use that as the filename and call it done. If 8.3 filenames are important, then something like .Net's GetRandomFilename seems to be the best option.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 7, 2019

In terms of options to fix this: is it important to have a 8.3 filename?
Longer names could end up being more likely to collide (because internally they might get truncated), but I think that's only on a FAT filesystem, so probably not too much of a concern.

If I did the math correctly, it seems that the chance of collision means #33765 is 200x more likely to pass on macOS and 1000x more likely on Linux, so if that PR was measured to be broken on Win32 half of the time, you would need to run the test on posix about 1000 times on MacOS and 20_000 times on Linux to see the same sorts of failures popping up a few times.

@davidanthoff
Copy link
Contributor

I think a simple call UuidCreate might be the easiest way to create that GUID/UUID on Windows?

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 7, 2019

Yeah, I'd prefer to see that fixed, but if we don't have that soon, I'm still proposing merging this as a stop-gap measure.

I don't get it. If we can't fix the root issue, the most sensible thing to do is disable the flaky tests on Windows. There's no actual problem with the tempfile cleanup functionality so why disable it? It's only the tests that are problematic. The tempfile cleanup works like a charm on all platforms, and I've yet to see any evidence otherwise.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 7, 2019

The actual problem is with the tempfile cleanup functionality. The test are probably fine and doing a good job demonstrating the issue. While I'd reported the issue before (#31810), the occurrence of it (such as https://build.julialang.org/#/builders/65/builds/5335/steps/2/logs/stdio) seems to have increased. This is probably due to a confluence of bad factors, (mostly that all processes are in fact racing to try to generate the same tempfile name, since 5e2ff12 and 736e13e), but that means the default to true may be multiply the failure rate by introducing many new races (and it also can affect code that was previously race-free through correct use of mktemp/rm).

@StefanKarpinski
Copy link
Member

The actual problem is with the tempfile cleanup functionality.

Please back that up with some kind of evidence. The cleanup functionality has eager cleanup and non-eager cleanup, as indicated by the boolean flag associated with each path to be cleaned up. The non-eager cleanup doesn't delete anything until the process exits, which is not a problem even if the same tempfile has been used in multiple places. Eager cleanup is only used for higher-order variants of mktemp and mkdir where there's a delimited point after which you know the temp path should be deleted as soon as possible. But neither of those has the non-uniqueness problem, only tempname does—and paths generated by tempname don't get cleaned up until process exit.

The only corner case that might cause problems is if tempname and higher order mktemp generate the same temp path and mktemp registers the path for eager cleanup. Then the path could get cleaned up before the tempname code is done with it. However, that seems like a pretty unlikely corner case. It's also pretty easy to make sure that doesn't happen. I can make a PR.

@StefanKarpinski
Copy link
Member

Here. #33786 fixes that corner case: if there is a temp name collision and one usage wants "asap" cleanup while the other wants cleanup on process exit, the non-asap one wins. That means that even if there collisions between the paths generated by tempname and higher order mktemp nothing gets cleanup up until the process exits.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 7, 2019

Makes sense, though I think if we were reaching that code we'd have been printing a warning?

@StefanKarpinski
Copy link
Member

What should the warning say? It seems like it will just be telling the user that Julia itself is doing something broken, which is usually just annoying rather than helpful.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 7, 2019

@error "mktemp cleanup"—or however that formats

@mbauman
Copy link
Member

mbauman commented Nov 7, 2019

I think you're talking past each other.

Jameson is saying he thinks that #33786 will only fix the issue that results in this error log getting printed. But that's not happening, so he doesn't think #33786 will actually fix the issue that's been plaguing the buildbots.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 8, 2019

Jameson is saying he thinks that #33786 will only fix the issue that results in this error log getting printed.

Those seem entirely unrelated: that happens on Windows when someone uses the do-block mktemp/dir and leaves some file open that prevents cleanup when the block is done. The only fix for that issue is to not leave temp files open preventing cleanup. The cleanup functionality will try to delete the paths later if possible, but what's the problem with that? It keeps trying until until it succeeds and should do so silently without complaining if it can't delete things because it calls rm with the force=true flag. At least that's my understanding of what rm(force=true, recursive=true) should do, but Windows is a mysterious beast and I don't have access to a Windows system, so I don't know if that's what it does or not. If the problem is that when trying to delete "locked" paths, rm with force=true throws, then that can probably be fixed with a judicious application of try/catch in the purge code. (Edit: implemented in #33794.)

It would help a lot if @vtjnash could explain what he thinks the actual problem is. Is it a problem that the error log is getting printed? If so, we can just delete the line that prints the error log (although that seems less than ideal since then people will definitely not fix the cases where they leave temp files open). But the error logging is unrelated to the cleanup. What is the actual problem that's been hitting the build bots? The thing that's been mentioned and linked to repeatedly is the failure where the length of TEMP_CLEANUP is wrong, which is strictly a symptom of non-unique temp path generation. If there's some other problem that's the concern, please explain what it is.

@StefanKarpinski
Copy link
Member

If this rm call failing is an issue on Windows (it's unclear if it is), then this PR should fix it: #33794.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 8, 2019

For what it's worth, the original failure linked in this thread might have been caused by temp path collisions between mktemp and tempname; that's the only cause I can think of that's related to the temp file cleanup. If that is the case then #33786 should have fixed it. It does not look like a failure of rm in temp_cleanup_purge which would not exhibit that way—that would look like a call to generate a new temp path throwing an exception from within the rm call.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 14, 2019

Please back that up with some kind of evidence

OK, please look at CI occasionally. As has been happening frequently lately, here's yet another instance where the Test test failed that seems plausibly attributable to this cleanup=true functionality triggering (this test usually runs while there's a couple other processes concurrently exiting): https://build.julialang.org/#/builders/32/builds/6084/steps/2/logs/stdio

This should probably be improved by #33785, but in the meantime, we've had yet another week of unusable CI because we didn't merge this temporary workaround.

@vtjnash vtjnash closed this Nov 14, 2019
@vtjnash vtjnash reopened this Nov 14, 2019
@vtjnash
Copy link
Member Author

vtjnash commented Nov 20, 2019

Yet another example of the frequent CI failures this has been causing: https://build.julialang.org/#/builders/32/builds/6253/steps/2/logs/stdio

This one is independently interesting though, since it failed at a different point than we've seen more commonly, which appears to show that #33794 is necessary but not sufficient:

      From worker 2:	IOError: stat: permission denied (EACCES) for file "C:\\cygwin\\tmp\\jl_3E82.tmp"
      From worker 2:	Stacktrace:
      From worker 2:	 [1] stat(::String) at .\stat.jl:69
      From worker 2:	 [2] ispath at .\stat.jl:311 [inlined]
      From worker 2:	 [3] temp_cleanup_purge(::Bool) at .\file.jl:488
      From worker 2:	 [4] temp_cleanup_purge() at .\file.jl:481
      From worker 2:	 [5] _atexit() at .\initdefs.jl:316
      From worker 2:	 [6] exit at .\initdefs.jl:28 [inlined]
      From worker 2:	 [7] exit() at .\initdefs.jl:29
      From worker 2:	 [8] (::Distributed.var"#114#116"{Distributed.RemoteDoMsg})() at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:315
      From worker 2:	 [9] run_work_thunk(::Distributed.var"#114#116"{Distributed.RemoteDoMsg}, ::Bool) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Distributed\src\process_messages.jl:79
      From worker 2:	 [10] (::Distributed.var"#113#115"{Distributed.RemoteDoMsg})() at .\task.jl:349

@StefanKarpinski
Copy link
Member

Ref #33996

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants