- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 393
radicle tuning #1726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
          
     Merged
      
        
      
    
  
     Merged
                    radicle tuning #1726
Changes from all commits
      Commits
    
    
            Show all changes
          
          
            3 commits
          
        
        Select commit
          Hold shift + click to select a range
      
      
    File filter
Filter by extension
Conversations
          Failed to load comments.   
        
        
          
      Loading
        
  Jump to
        
          Jump to file
        
      
      
          Failed to load files.   
        
        
          
      Loading
        
  Diff view
Diff view
There are no files selected for viewing
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
              
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
              
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
              
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
              
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
              
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
              
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
              
      
      Oops, something went wrong.
        
    
  
  Add this suggestion to a batch that can be applied as a single commit.
  This suggestion is invalid because no changes were made to the code.
  Suggestions cannot be applied while the pull request is closed.
  Suggestions cannot be applied while viewing a subset of changes.
  Only one suggestion per line can be applied in a batch.
  Add this suggestion to a batch that can be applied as a single commit.
  Applying suggestions on deleted lines is not supported.
  You must change the existing code in this line in order to create a valid suggestion.
  Outdated suggestions cannot be applied.
  This suggestion has been applied or marked resolved.
  Suggestions cannot be applied from pending reviews.
  Suggestions cannot be applied on multi-line comments.
  Suggestions cannot be applied while the pull request is queued to merge.
  Suggestion cannot be applied right now. Please check back later.
  
    
  
    
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the journey test CI flakiness observed here be related to the journey test CI problems discussed in #1634 and worked around in #1725? Since, here, it is connecting, and just not seeing the same refs as are expected, my first thought is that this is separate. But if somehow a previous run of
git-daemoncould still be listening, then maybe it would be possible to connect to it and not are the expected refs.In my experiments on the earlier breakage, I was unable to observe any process with an image name like
git-daemon, and in your experiments commented in #1634, while it seemed like the port was not yet available to be bound by a new process, it was not actually still open for connections (which agrees with what I observed). So I suspect that the problem here, with connecting but seeing different refs, may be unrelated. But I figured I'd bring it up just in case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for chiming in!
I think the successful connection is
ncthat connected to a server that is on the way down.And then the
gixeven managed to connect, but it was doing so while the server shut down, so it unwound while the connection was used.This probably means that
gixshould offergix servesoon to be able to be able to deal with sockets properly :D.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, I think running
git daemonin such a way that we can detect if it is immediately failing may make it possible to avoid this problem entirely. There are a few ways to do that:&in the same way as that is done now, but check the status immediately.git daemonwith--detach. I think this might be sufficient to ensure it synchronously fails if it cannot bind to the port, I'm not sure. This would then require different logic to stop the daemon.git daemonwith--inetd. Something to look at if way 3 seems like it's on the right track but ends up being unsuitable.I'll look into this further to see if it can be done in a simple and reliable way.
In addition, we may be able to have to choose a different port based on what is available. But I am unsure if that can easily be done in a portable (no pun intended) way. It would have the benefit of allowing the journey tests to be run in parallel, such as with each other on different builds, and of allowing them to be run at the same time as other tests or processes that use
git-daemonwith a different port. The latter seems potentially valuable.Ah, I see. What's interesting is that, in my testing,
git-daemonwas never running (according topgrep) when a newgit -c ... daemon ...was about to be run, as well as not accepting any connections. However, it may be that the race condition that happened here could in principle have happened before but (at least in my testing) never did.That sounds like a great idea. I had the thought that it should not be specific to the
git://protocol, since there would be a benefit in testing to being able to use it to test multiple protocols. But if I understand the description in #307, that seems already to be the plan (or maybe the plan, at that time, even excludedgit://--should #307 be updated?).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can write a
--pid-filewhich should make it trivial to take it down.I tried this:
However, it never failed to startup, and still ran into the same issue repeatedly. The first time it usually worked, but the second launch now sees "Connection reset by peer" repeatedly.
That would be useful, but would also make it necessary to configure the port used for the
gitprotocol. It's no big deal, just something I thought I'd mention.Right, I forgot about that issue 😅. But at least I have updated it - it's a server and it can reasonably handle multiple transports by abstracting over them, similar on how the client already does it. It's still far enough away to have to use
git-daemonfor the foreseeable future though.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.
Could the
nc -zconnection(s) actually be triggering a bug ingit-daemonwhere, if it receives a connection with no data, it malfunctions later, such as by confusing later connections with the trivialncconnection and sending them RST packets?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Although I ran it many times, however it affected things was probably nondeterministic and definitely brittle.
Since the only reason for disabling caching was to avoid this problem, and the relevant tests are turned back off on CI now, I'll open a PR shortly to reenable caching in the
test-journeyjob.Edit: I've opened #1728 for that.
It seems to me that the time between
*.shruns may be very relevant here.I do not mean immediately preceding test cases in the same run of the journey test script, but rather earlier runs of the journey tests in builds with different features.
On CI, within a single run of the same
test-journeyjob, nontrivialcargo buildcommands that compile artifacts do run between separate invocations ofjourney.sh. The helper function that launchesgit-daemonis called at least once in each of thejourney.shruns that tests at least one thing related togit://.So the timing between two launches of
git-daemonis sometimes affected by how long those build commands take. This will happen when one of the launches is the last one before a nontrivial build command and the other launch is the first one after that same nontrivial build command. (I don't know if it happens at other times, but I don't think so.)Details follow.
The per-feature recipes
On CI, several different feature sets are tested by running the journey tests, in the same CI job, and artifacts are compiled in between the runs. There are four feature sets for which
journey.shis run separately and, ahead of each one, there are explicitcargo buildcommands:gitoxide/justfile
Lines 200 to 222 in 69ee6a3
All four of those build +
journey.shrecipes are run in sequence (though in a different order) from theci-journey-testsrecipe, which thetest-journeyCI job runs. The build commands are non-trivial; they build artifacts, and sometimes download source code. Without caching, I believe they take longer.Verification in logs
That
git://protocol journey tests are run, then later artifacts are compiled, then later moregit://protocol journey tests run, can be verified in the logs of CI runs that have journey tests enabled, such as this run in the last commit on main that journey tests.Specifically, in this log, produced by viewing and saving the full log from that CI run,
git://protocol tests are logged at lines 1236, 1295, 1783, 1815, 2128, and 2187, while builds that report compiling at least one artifact are logged at lines 717, 1055, 1431, 1651, and 1919. (Searching forgit://andCompilingmay be easier, that checking those, though.)A more useful view of the logs
The progression of the most relevant events is:
max-purebuild../tests/journey.shrun formax-pure, in whichgit-daemonis used in agix remote refstest and again agix free pack receivetest.smallbuild../tests/journey.shrun forsmall, in whichgit-daemonis not used.lean-asyncbuild../tests/journey.shrun forlean-async, in whichgit-daemonis used in agix remote refstest and agix free pack receivetest.http-client-curl-rustlsbuild../tests/journey.shrun for all default features plushttp-client-curl-rustls, in whichgit-daemonis used in agix remote refstest and agix-free-pack-receivetest.Within the individual steps listed above that run
./tests/journey.sh, nothing is built. But in between them, builds take place, and I suspect that the variation in how long those take is a factor that contributes to whatever strange behavior ofgit-daemonwe are seeing here.This progression also applies in failing runs
The log examined above is from a run that succeeded, so a bit more is needed to make the case that build times are plausibly relevant to failures, in the way that failures have actually occurred.
Consider the failing run from #1634, linked from #1634 (comment). This runs journey tests for different feature sets on in the same order (no changes to
justfilehave been made since somewhat before then), with the test of all default features plushttp-client-curl-rustlscoming last. GitHub Actions logs are eventually deleted, so just in case, here's a gist of that run's log. (It should not be confused with the above gist.)The same progression is happening there: journey tests on builds with some feature sets run, including journey tests that run and use
git-daemon, then other builds run, then the journey test of default features plushttp-client-curl-rustlsrunsgit-daemonand fails to communicate with it.Interleaved builds cover some cache-eligible artifacts
I counted build steps are non-trivial above when they compiled at least one artifact. I did this to make clear that artifacts really do get built between runs of
git-daemonin the journey tests, due to the way the journey tests are run in four sub-recipes defined in thejustfile, each of which starts withcargo buildcommands.But
rust-cacheavoids caching workspace crates (Swatinem/rust-cache#37 (comment), #1668 (comment)). So the kind of cachingrust-cacheprovides is unlikely to have a significant effect on build times unless the artifacts that have to be rebuilt include artifacts for crates that are obtained from the registry rather than being part of this workspace.It turns out they do. Probably most significantly, because it precedes the
./tests/journey.shcommand in which failures occur in runs that have failures, thecargo build --features http-client-curl-rustlscommand builds 29 crates eligible for caching, as well as a few more interleaved withgix-*crates that are ineligible for caching. Of the crates eligible for caching that it builds, it even downloads 6 of them.(I think there may be ways other than timing that this affects networking on the test machine. They would also be very strange and unintuitive. But the problem usually or always happens on GNU/Linux, at least on CI, and does not usually happen on macOS, at least in CI tests of macOS, as mentioned in #1634 (comment). So the cause, whatever it is, is something unintuitive.)
Synthesis (not really a conclusion)
I do very much agree that some bug in
git-daemon, or some problem related to it that it not specific to gitoxide, is probably at the root of this.But the order in which
cargo buildand./tests/journey.shcommands run, as well as their observed effects, shows that some delays betweengit-daemonruns are affected by how much workcargo buildsteps do. Furthermore, acargo buildstep betweengit daemonusing tests that always succeed, and those that have been observed to fail, has to download some cache-eligible crates and to build artifacts for them and a number of other cache-eligible crates, whenrust-cacheis omitted.Both due to those builds and more generally, timing could, as far as I know, still be a significant trigger.
I should be able to test how plausible this is by splitting the tests into one job per feature-set sub-recipe, or by changing the order in which the
ci-journey-testsrecipe lists those recipes.Addendum 1: the
127.0.0.1vs.::1connection (pun intended)Although I didn't know until this PR that the workaround in #1725 would prove insufficient, I had hoped to find a better way, and while looking for one I tried forcing
git-daemonto listen on the IPv6::1rather than the IPv4127.0.0.1. The most relevant of the changed code was this (where$portjust expands to9418):gitoxide/tests/helpers.sh
Lines 77 to 82 in 387a106
In #1634 (comment), I had made this partially inaccurate claim:
I now believe that what was really happening is that whenever the problem connecting to
git-daemonis going to occur, the changing of forcing::1causesncto block forever. Runs that succeeded before appear to have succeeded, still to have succeeded, while the journey tests aftercargo build --features http-client-curl-rustlshave it block forever.This is shown in the logs from the commit with the
::1experiment (not to be confused with any other logs cited above). The "The operation was canceled" message documents automatic cancellation of the job after it blocked for 360 minutes.One way to describe this is that forcing IPv6 causes the
nccommand, which is intended to ensuregixis able to connect togit-daemon, to fail when agixcommand would fail. That is, ifgixwould not be able to usegit-daemon, at least as testedncunder the::1modifications seems to be doing its intended job of blocking--just, the condition it is waiting for never happens. Whether this::1experiment really illuminated anything meaningful about the cause of the problem, I am not sure.Addendum 2:
rust-cachealso affects code generationThis seems even weirder and less plausible, but maybe code in gitoxide crates or their dependencies really is the issue and the problem varies depending on whether intermediate artifacts are created.
The
rust_cacheaction setsCARGO_INCREMENTAL=0. If unsafe code somewhere has undefined behavior and optimizations are turned on, then I think that could cause a bug due to UB to manifest. This seems unlikely for debug builds--such as those exercised in the journey tests. But some crates are configured to useopt-level = 3even in debug builds.(This is also something I should be able to test, by creating a situation where the tests run but the failures do not occur and then setting
CARGO_INCREMENTAL=0to see if it has any effect.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the detailed analysis - stunning work!
As the removal of
rust_cachehas seemingly solved the problem, it very much felt like a timing issue, possibly also related to network use when downloading crates.Even though I definitely like this explanation more, that artefacts differ in meaningful ways when a cache is involved.
It still doesn't make much sense to me though, especially since most failures seem to stem from a trivial connection error, something that I'd trust to be 'correct'.
These days it's easy for me to reproduce the issue, it's basically not working for me anymore at all, most of the time.
Something that changed is that it now errors due to connection reset, not to connection refused.
It's also always the second time the
git-daemonstarts up when it fails.Just for fun I added this patch:
It still failed, so timing shouldn't be an issue. I could also see, while it was waiting, that
git-daemonwas up and running.Inversely, putting the
sleep 10beforelaunch-git-daemonallowed to validate that the previousgit-daemonwasn't running anymore.Both invocations should be completely independent.
To see if it's related to
remote refsorfree pack receive, I replaced thefree pack receiveportion with a copy ofremote refs. And that works.A clear indication that the
git-daemonis crashing or terminating the connection forcefully.When replacing the
remote refspart with a copy of the seemingly failingfree pack receivepart, it seems to work but hangs at this line:After cranking up debug-logging in the daemon I get this:
The last invocation is a fetch request without any
want <object-id>line, but thedoneline (as sent bygix) should indicate the negotiation ended, send me the pack. But because nothing is wanted, it seems to just hang - it's likely a deadlock -gixwould now try to read the response, but theupload-packprogram isn't sending anything, but apparently reads as well as it's also hanging there.When
kill -9is applied to the upload-pack process, the test just passes and it continues to the second round. There it failed - in this case - because the previous git-daemon didn't shutdown.I will investigate this further, as
gixshould probably not submit fetch requests without a single want.Besides that, I don't think this will fix the actual issue, as thus far the first invocation of
git-daemonseems to work much better, still.It remains a very illusive issue.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a recently updated dependency is a factor, and what happens if
--lockedis used on a system where the problem recently began to occur more often.I wonder also if a system (or Homebrew-provided) package is a factor.
curlwas recently updated on many systems, to fix a CVE (though I don't know how those specific changes could affect this).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After my recent trial session I don't even feel it's worth trying those fringes of possible causes as it left me with something very clear to pursue: a hang when a fetch is performed without a single 'want'.
I also learned that
gix remote refsworks twice, but oncegix free pack receiveis involved, it starts failing and resetting the connection. If it is run first, it hangs, and if it runs second, the first connection to the daemon is reset. It also became clear that the daemon is up and operational and that it deliberately resets the connection.This would make sense as
free pack receivechanged the most, it now uses an entirely new and much more faithful protocol implementation, and one that never got tested with the puregit://. It's probably a legitimate bug.Now that I am writing this I also realise that I can turn on packetline tracing on the client so it would be clear what it says to the server before it shuts down.
I hope I get to it today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if this works.
The culprit seems to have been that
free pack receive, albeit relying ongix-protocol, could still mess it up by forgetting to properly shut down the connection (by sending and EOF/flush packet), and it was able to request a pack without ever specifying an object to receive, causing a deadlock (upload pack kept reading even though the client indicate it was done). Especially the latter sounds like an attack vector for a server, as it feels a client could cause any amount of connections to be opened and held open, which is much more expensive for the server than it would be for the client.If it was absolutely no effort, I'd definitely feel curious enough to try if it really does work like I think it would… and… I tried it, but via HTTPS I was only able to go into a negotiation loop. However, thinking about it, it's still trivial to hang the server forever by implementing the protocol incorrectly, so I suppose everything server must protect against that. The bug I ran into locally I couldn't reproduce using HTTP though.
Ultimately, it was quite useful to have an inferior implementation of
fetchas implemented bygix free pack receiveas it helped to make the underlyinggix-protocolimplementation more sturdy. It now incorporates more of what thegix::fetchimplementation already checks.