-
Notifications
You must be signed in to change notification settings - Fork 25
[VC-35738] Use errgroup for all go routines #601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
group, gctx := errgroup.WithContext(ctx) | ||
defer func() { | ||
// TODO: replace Fatalf log calls with Errorf and return the error | ||
cancel() | ||
if err := group.Wait(); err != nil { | ||
logs.Log.Fatalf("failed to wait for controller-runtime component to stop: %v", err) | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was cut and pasted earlier in the file so that the errgroup can be initialized and used for the management server (/metrics
, /healthz
, /debug/pprof
) and for the VenafiConnection manager; both of which were previously run in unsupervised goroutines.
a6d67f4
to
0cb53e6
Compare
case <-gctx.Done(): | ||
return | ||
case <-time.After(config.Period): | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any of the errgroup go routines exit (with nil or error) the main context will be cancelled, which will cause this blocking loop to exit immediately instead of waiting for the time period.
// The agent must stop if the management server stops | ||
cancel() | ||
return nil | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a future PR we should add a signal handler to catch sigterm (ctrl-c) and cancel the parent context, and add another Go routine which calls server.Stop
after the context is cancelled.
As it stands, sigterm will cause the process (all threads) to immediately exit without any graceful shutdown.
0cb53e6
to
54dc42e
Compare
// TODO(wallrj): Pass a context to gatherAndOutputData, so that we don't | ||
// have to wait for it to finish before exiting the process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done in another PR.
Instead of using log.Fatal in some of them Signed-off-by: Richard Wall <[email protected]>
54dc42e
to
864edd3
Compare
} | ||
}() | ||
// The agent must stop if the management server stops | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of calling cancel
, can we return an error when the server is stopped unexpectedly early?
Normally when the context is not canceled, the server should keep running, and when the context is canceled; we do not have to call cancel
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, when we return an error, the group context will be automatically canceled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree, that an error should be returned.
Something like fmt.Errorf("the API server stopped unexpectedly")
.
But I'll come back to that in another PR.
This cancel
will work for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
In the a future PR it might be good to also move the main server definition in a different function/file, and just have the server's start and stop calls in the go routines.
return fmt.Errorf("failed to start %q data gatherer %q: %v", kind, dgConfig.Name, err) | ||
} | ||
// The agent must stop if any of the data gatherers stops | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This caused a bug.
Most of the DataGatherer.Run
implementations return immediately.
But one (Dynamic) runs an informer which blocks until the supplied channel is closed.
By calling cancel here, I stop all the errgroup go routines including the VenafiConnection helper which is needed later when the data gets posted to Venafi.
As a result, the POST hangs and the agent gets stuck.
jetstack-secure/pkg/datagatherer/local/local.go
Lines 41 to 44 in 1f00f09
func (g *DataGatherer) Run(stopCh <-chan struct{}) error { // no async functionality, see Fetch return nil } jetstack-secure/pkg/datagatherer/k8s/discovery.go
Lines 50 to 53 in 1f00f09
func (g *DataGathererDiscovery) Run(stopCh <-chan struct{}) error { // no async functionality, see Fetch return nil } jetstack-secure/pkg/datagatherer/k8s/dynamic.go
Lines 263 to 287 in 1f00f09
// Run starts the dynamic data gatherer's informers for resource collection. // Returns error if the data gatherer informer wasn't initialized, Run blocks // until the stopCh is closed. func (g *DataGathererDynamic) Run(stopCh <-chan struct{}) error { if g.informer == nil { return fmt.Errorf("informer was not initialized, impossible to start") } // attach WatchErrorHandler, it needs to be set before starting an informer err := g.informer.SetWatchErrorHandler(func(r *k8scache.Reflector, err error) { if strings.Contains(fmt.Sprintf("%s", err), "the server could not find the requested resource") { logs.Log.Printf("server missing resource for datagatherer of %q ", g.groupVersionResource) } else { logs.Log.Printf("datagatherer informer for %q has failed and is backing off due to error: %s", g.groupVersionResource, err) } }) if err != nil { return fmt.Errorf("failed to SetWatchErrorHandler on informer: %s", err) } // start shared informer g.informer.Run(stopCh) return nil }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of the effort to remove the use of the legacy
log
module and replace it withklog
.Instead of creating free goroutines and relying on
log.Fatal
to terminate the process if any of them fail,this PR instead adds those goroutines to the existing
errgroup
and replaces those instances oflog.Fatal
with returned errors which can be bubbled up to the parent Cobra command and handled centrally, in #599.These changes have been extracted from #593 for easier review.
Testing
Occupy the server listening port and observe that the process exits with a meaningful error message.