-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
RFC: for the release, change tempfile cleanup default back to false #33741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This seems to be exacerbating Windows CI problems such as #31810 and failing its own test. Disable it by default for this release.
|
Is the feature buggy or are the tests buggy? |
|
I don't know—I'm just trying to get CI back to a stable state. Here's an example failure: https://build.julialang.org/#/builders/65/builds/5492/steps/2/logs/stdio |
|
Why not just fix the tempfile generation on Windows instead of letting a single buggy platform (Win32) dictate our API everywhere? Or, I dunno, disable the tests on the buggy platform? |
|
It was merged too close to the freeze and so seems that there's been inadequate time to stabilize this. |
|
That's some serious bullcrap. This functionality works fine on all platforms that don't have broken |
|
This PR demonstrates the actual problem: #33765. Which is, as I've said for months now, that on Windows |
|
I'm not entirely sure what that's supposed to prove. It's documented that function is not expected to return unique names. Also chill out and quit shouting bad information. It uses four hex characters to represent 65k unique names, as you'd see if you'd been looking at the logs or the docs. It appears to be breaking many tests, not just this section, or I would have done as you suggested and just wrapped them in test_broken. |
|
Why not just use UUIDs for the temporary file names on Windows? Even the MS docs seem to suggest that. Wouldn't that significantly simplify things, plus any danger of non unique file names would be gone? If the UUID stdlib is not available in base, couldn't one just call one of the GUID creation APIs from the win32 API, given that this would probably only be for win in any case? |
|
@vtjnash what it proves is that the cleanup code is not the culprit. Cleanup works fine regardless of whether non-unique tempnames are generated or not. The tests on the other hand, assume that tempnames are actually unique—which they aren't on Windows—and fail if that assumption is wrong. Not because the cleanup code isn't working—it's working just fine—but because the number of unique names to cleanup is unpredictable. The exact reasons why don't really matter for the purposes of this PR, but clearly tempname does not return even 65k unique names, regardless of what you're saying and I recall seeing logs where tempnames on Windows only had two hex digis of random characters, which would mean only 256 random names, not 65k. Rather than changing this default, which isn't causing any problems that didn't already exist, and which is helping to cleanup a lot of temp junk that was previously being left around by Julia processes, I'm suggesting that we:
|
|
Even if there are 65k different names returned randomly by |
|
Yeah, I'd prefer to see that fixed, but if we don't have that soon, I'm still proposing merging this as a stop-gap measure.
|
There's another factor though, which might help show why this is so frequently failing as a direct result of this new code: Yep, you saw that's right: a |
|
That is because In terms of options to fix this: is it important to have a 8.3 filename? If not, the solution really seems super simple: just generate a GUID and use that as the filename and call it done. If 8.3 filenames are important, then something like .Net's |
If I did the math correctly, it seems that the chance of collision means #33765 is 200x more likely to pass on macOS and 1000x more likely on Linux, so if that PR was measured to be broken on Win32 half of the time, you would need to run the test on posix about 1000 times on MacOS and 20_000 times on Linux to see the same sorts of failures popping up a few times. |
|
I think a simple call UuidCreate might be the easiest way to create that GUID/UUID on Windows? |
I don't get it. If we can't fix the root issue, the most sensible thing to do is disable the flaky tests on Windows. There's no actual problem with the tempfile cleanup functionality so why disable it? It's only the tests that are problematic. The tempfile cleanup works like a charm on all platforms, and I've yet to see any evidence otherwise. |
|
The actual problem is with the tempfile cleanup functionality. The test are probably fine and doing a good job demonstrating the issue. While I'd reported the issue before (#31810), the occurrence of it (such as https://build.julialang.org/#/builders/65/builds/5335/steps/2/logs/stdio) seems to have increased. This is probably due to a confluence of bad factors, (mostly that all processes are in fact racing to try to generate the same tempfile name, since 5e2ff12 and 736e13e), but that means the default to |
Please back that up with some kind of evidence. The cleanup functionality has eager cleanup and non-eager cleanup, as indicated by the boolean flag associated with each path to be cleaned up. The non-eager cleanup doesn't delete anything until the process exits, which is not a problem even if the same tempfile has been used in multiple places. Eager cleanup is only used for higher-order variants of The only corner case that might cause problems is if |
|
Here. #33786 fixes that corner case: if there is a temp name collision and one usage wants "asap" cleanup while the other wants cleanup on process exit, the non-asap one wins. That means that even if there collisions between the paths generated by |
|
Makes sense, though I think if we were reaching that code we'd have been printing a warning? |
|
What should the warning say? It seems like it will just be telling the user that Julia itself is doing something broken, which is usually just annoying rather than helpful. |
|
|
|
I think you're talking past each other. Jameson is saying he thinks that #33786 will only fix the issue that results in this error log getting printed. But that's not happening, so he doesn't think #33786 will actually fix the issue that's been plaguing the buildbots. |
Those seem entirely unrelated: that happens on Windows when someone uses the do-block It would help a lot if @vtjnash could explain what he thinks the actual problem is. Is it a problem that the error log is getting printed? If so, we can just delete the line that prints the error log (although that seems less than ideal since then people will definitely not fix the cases where they leave temp files open). But the error logging is unrelated to the cleanup. What is the actual problem that's been hitting the build bots? The thing that's been mentioned and linked to repeatedly is the failure where the length of |
|
If this |
|
For what it's worth, the original failure linked in this thread might have been caused by temp path collisions between |
OK, please look at CI occasionally. As has been happening frequently lately, here's yet another instance where the Test test failed that seems plausibly attributable to this This should probably be improved by #33785, but in the meantime, we've had yet another week of unusable CI because we didn't merge this temporary workaround. |
|
Yet another example of the frequent CI failures this has been causing: https://build.julialang.org/#/builders/32/builds/6253/steps/2/logs/stdio This one is independently interesting though, since it failed at a different point than we've seen more commonly, which appears to show that #33794 is necessary but not sufficient: |
|
Ref #33996 |
This seems to be exacerbating Windows CI problems such as #31810 and failing its own test. Disable it by default for this release (e.g. reverts 75cb006), so we can get this into the next RC, and explore fixes on master.