
Conversation

@wallrj (Member) commented on Oct 30, 2024

This is part of the effort to remove the use of the legacy log module and replace it with klog.

Instead of creating unsupervised goroutines and relying on log.Fatal to terminate the process if any of them fail,
this PR adds those goroutines to the existing errgroup and replaces the log.Fatal calls with returned errors, which can be bubbled up to the parent Cobra command and handled centrally in #599.

These changes have been extracted from #593 for easier review.
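
For illustration, here is a minimal sketch of the intended shape; the Cobra wiring and the newAgentCommand/runAgent names below are hypothetical, not the actual code:

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/spf13/cobra"
    "golang.org/x/sync/errgroup"
)

// newAgentCommand returns a command whose RunE bubbles errors up to Cobra
// instead of letting individual goroutines call log.Fatal.
func newAgentCommand() *cobra.Command {
    return &cobra.Command{
        Use: "agent",
        RunE: func(cmd *cobra.Command, args []string) error {
            return runAgent(cmd.Context())
        },
    }
}

func runAgent(ctx context.Context) error {
    group, gctx := errgroup.WithContext(ctx)
    group.Go(func() error {
        // A supervised component that fails returns its error to the
        // errgroup instead of calling log.Fatal; errgroup then cancels
        // gctx for the rest of the group.
        return fmt.Errorf("component stopped unexpectedly")
    })
    group.Go(func() error {
        // A second component that exits cleanly once the group context
        // is cancelled.
        <-gctx.Done()
        return nil
    })
    // The first error is returned to the Cobra command and handled there.
    return group.Wait()
}

func main() {
    if err := newAgentCommand().Execute(); err != nil {
        os.Exit(1)
    }
}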

Testing

Occupy the server listening port and observe that the process exits with a meaningful error message.

python3 -m http.server 8081 &

$ POD_NAMESPACE=venafi ./preflight agent --output-path /dev/null --api-token should-not-be-required  --agent-config-file examples/cert-manager-agent.yaml --enable-pprof --period 1s
2024/10/30 15:41:00 Preflight agent version: development ()
2024/10/30 15:41:00 Using the Jetstack Secure API Token auth mode since --api-token was specified.
2024/10/30 15:41:00 Using deprecated Endpoint configuration. User Server instead.
2024/10/30 15:41:00 pprof profiling was enabled.
2024/10/30 15:41:00 error messages will not show in the pod's events because the POD_NAME environment variable is empty
2024/10/30 15:41:00 starting "k8s/secrets.v1" datagatherer
2024/10/30 15:41:00 starting "k8s/certificates.v1alpha2.cert-manager.io" datagatherer
2024/10/30 15:41:00 starting "k8s/ingresses.v1beta1.networking.k8s.io" datagatherer
2024/10/30 15:41:00 starting "k8s/certificaterequests.v1alpha2.cert-manager.io" datagatherer
2024/10/30 15:41:00 failed to complete initial sync of "k8s-dynamic" data gatherer "k8s/secrets.v1": timed out waiting for Kubernetes caches to sync
2024/10/30 15:41:00 failed to complete initial sync of "k8s-dynamic" data gatherer "k8s/certificates.v1alpha2.cert-manager.io": timed out waiting for Kubernetes caches to sync
2024/10/30 15:41:00 failed to complete initial sync of "k8s-dynamic" data gatherer "k8s/ingresses.v1beta1.networking.k8s.io": timed out waiting for Kubernetes caches to sync
2024/10/30 15:41:00 failed to complete initial sync of "k8s-dynamic" data gatherer "k8s/certificaterequests.v1alpha2.cert-manager.io": timed out waiting for Kubernetes caches to sync
2024/10/30 15:41:00 successfully gathered 0 items from "k8s/secrets.v1" datagatherer
2024/10/30 15:41:00 successfully gathered 0 items from "k8s/certificates.v1alpha2.cert-manager.io" datagatherer
2024/10/30 15:41:00 successfully gathered 0 items from "k8s/ingresses.v1beta1.networking.k8s.io" datagatherer
2024/10/30 15:41:00 successfully gathered 0 items from "k8s/certificaterequests.v1alpha2.cert-manager.io" datagatherer
2024/10/30 15:41:00 Data saved to local file: /dev/null
2024/10/30 15:41:00 failed to wait for controller-runtime component to stop: failed to run the health check server: listen tcp :8081: bind: address already in use

Comment on lines +77 to +84
group, gctx := errgroup.WithContext(ctx)
defer func() {
    // TODO: replace Fatalf log calls with Errorf and return the error
    cancel()
    if err := group.Wait(); err != nil {
        logs.Log.Fatalf("failed to wait for controller-runtime component to stop: %v", err)
    }
}()
wallrj (Member Author):

This code was cut and pasted earlier in the file so that the errgroup can be initialized and used both for the management server (/metrics, /healthz, /debug/pprof) and for the VenafiConnection manager, both of which were previously run in unsupervised goroutines.
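
Roughly, the resulting shape looks like the sketch below; runManagementServer and runVenafiConnectionManager are hypothetical names standing in for the two components:

group, gctx := errgroup.WithContext(ctx)

group.Go(func() error {
    // Serves /metrics, /healthz and /debug/pprof (hypothetical helper).
    return runManagementServer(gctx)
})
group.Go(func() error {
    // Runs the VenafiConnection manager (hypothetical helper).
    return runVenafiConnectionManager(gctx)
})

// If either component fails, gctx is cancelled and group.Wait returns the
// first error, which is bubbled up instead of calling log.Fatal.
if err := group.Wait(); err != nil {
    return err
}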

@wallrj wallrj force-pushed the VC-35738/use-errgroup branch 2 times, most recently from a6d67f4 to 0cb53e6 on October 30, 2024 at 16:12
case <-gctx.Done():
    return
case <-time.After(config.Period):
}
wallrj (Member Author):

If any of the errgroup goroutines exits (whether with nil or an error), the main context is cancelled, which causes this blocking loop to exit immediately instead of waiting for the full period.
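
In other words, the surrounding loop looks roughly like this simplified sketch (not the exact code):

for {
    // ... gather and post data ...

    select {
    case <-gctx.Done():
        // Another goroutine in the group has exited and the context was
        // cancelled, so stop now rather than sleeping out the period.
        return nil
    case <-time.After(config.Period):
        // The period elapsed; run the next gather cycle.
    }
}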

    // The agent must stop if the management server stops
    cancel()
    return nil
})
wallrj (Member Author):

In a future PR we should add a signal handler to catch SIGTERM and SIGINT (Ctrl-C) and cancel the parent context, and add another goroutine which calls server.Stop after the context is cancelled.

As it stands, SIGTERM causes the process (all threads) to exit immediately without any graceful shutdown.
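
A possible sketch for that future PR, using signal.NotifyContext from os/signal; server.Start and server.Stop are hypothetical stand-ins for the management server's lifecycle calls:

// Cancel the parent context on SIGINT (Ctrl-C) or SIGTERM.
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

group, gctx := errgroup.WithContext(ctx)
group.Go(func() error {
    // Blocks until the server stops (hypothetical call).
    return server.Start()
})
group.Go(func() error {
    // When the context is cancelled, ask the server to shut down gracefully
    // so the goroutine above can return.
    <-gctx.Done()
    return server.Stop(context.Background())
})
return group.Wait()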

@wallrj wallrj force-pushed the VC-35738/use-errgroup branch from 0cb53e6 to 54dc42e on October 30, 2024 at 16:30
Comment on lines +205 to +206
// TODO(wallrj): Pass a context to gatherAndOutputData, so that we don't
// have to wait for it to finish before exiting the process.
wallrj (Member Author):

This can be done in another PR.
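
For later reference, one way that TODO could look; gatherOnce and its gather parameter are hypothetical, not the agreed design:

// gatherOnce wraps a blocking gather call so the caller can stop waiting as
// soon as ctx is cancelled (gather stands in for gatherAndOutputData).
func gatherOnce(ctx context.Context, gather func() error) error {
    done := make(chan error, 1)
    go func() { done <- gather() }()
    select {
    case <-ctx.Done():
        return ctx.Err()
    case err := <-done:
        return err
    }
}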

Base automatically changed from VC-35738/pprof-on-8081 to VC-35738/feature October 31, 2024 09:31
@wallrj wallrj mentioned this pull request Oct 31, 2024
Instead of using log.Fatal in some of them

Signed-off-by: Richard Wall <[email protected]>
@wallrj wallrj force-pushed the VC-35738/use-errgroup branch from 54dc42e to 864edd3 on October 31, 2024 at 09:40
@wallrj wallrj changed the title WIP: [VC-35738] Use errgroup for all go routines [VC-35738] Use errgroup for all go routines Oct 31, 2024
@wallrj wallrj requested a review from tfadeyi October 31, 2024 09:42
}
}()
// The agent must stop if the management server stops
cancel()
Contributor:

Instead of calling cancel, can we return an error when the server stops unexpectedly early?
Normally, when the context is not canceled the server should keep running, and when the context is canceled we do not need to call cancel ourselves.

Contributor:

Also, when we return an error, the group context will be automatically canceled.

wallrj (Member Author):

I think I agree that an error should be returned, something like fmt.Errorf("the API server stopped unexpectedly").
But I'll come back to that in another PR; this cancel will work for now.
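
For reference, a sketch of what that could look like (server.Start here is a hypothetical blocking call):

group.Go(func() error {
    if err := server.Start(); err != nil {
        return fmt.Errorf("the management server failed: %w", err)
    }
    // Returning a non-nil error cancels gctx for the whole group, so an
    // explicit cancel() is no longer needed here.
    return fmt.Errorf("the API server stopped unexpectedly")
})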

@tfadeyi (Contributor) left a comment:

lgtm

In a future PR it might be good to also move the main server definition into a separate function/file, and just have the server's start and stop calls in the goroutines.

@wallrj wallrj merged commit c87f7c3 into VC-35738/feature Nov 1, 2024
2 checks passed
@wallrj wallrj deleted the VC-35738/use-errgroup branch November 1, 2024 12:06
    return fmt.Errorf("failed to start %q data gatherer %q: %v", kind, dgConfig.Name, err)
}
// The agent must stop if any of the data gatherers stops
cancel()
@wallrj (Member Author) commented on Nov 6, 2024:

This caused a bug.
Most of the DataGatherer.Run implementations return immediately, but one (Dynamic) runs an informer which blocks until the supplied channel is closed.
By calling cancel here, I stop all the errgroup goroutines, including the VenafiConnection helper which is needed later when the data gets posted to Venafi.
As a result, the POST hangs and the agent gets stuck. (A possible adjustment is sketched after the snippets below.)

  • func (g *DataGatherer) Run(stopCh <-chan struct{}) error {
        // no async functionality, see Fetch
        return nil
    }

  • func (g *DataGathererDiscovery) Run(stopCh <-chan struct{}) error {
        // no async functionality, see Fetch
        return nil
    }

  • // Run starts the dynamic data gatherer's informers for resource collection.
    // Returns error if the data gatherer informer wasn't initialized, Run blocks
    // until the stopCh is closed.
    func (g *DataGathererDynamic) Run(stopCh <-chan struct{}) error {
        if g.informer == nil {
            return fmt.Errorf("informer was not initialized, impossible to start")
        }

        // attach WatchErrorHandler, it needs to be set before starting an informer
        err := g.informer.SetWatchErrorHandler(func(r *k8scache.Reflector, err error) {
            if strings.Contains(fmt.Sprintf("%s", err), "the server could not find the requested resource") {
                logs.Log.Printf("server missing resource for datagatherer of %q ", g.groupVersionResource)
            } else {
                logs.Log.Printf("datagatherer informer for %q has failed and is backing off due to error: %s", g.groupVersionResource, err)
            }
        })
        if err != nil {
            return fmt.Errorf("failed to SetWatchErrorHandler on informer: %s", err)
        }

        // start shared informer
        g.informer.Run(stopCh)
        return nil
    }
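
One possible adjustment (a sketch only, not the actual fix): keep each Run in the errgroup, but only treat an error return as fatal, so the non-blocking implementations don't tear down the group. newDg is a hypothetical variable for the data gatherer:

group.Go(func() error {
    if err := newDg.Run(gctx.Done()); err != nil {
        return fmt.Errorf("failed to start %q data gatherer %q: %w", kind, dgConfig.Name, err)
    }
    // Most Run implementations return nil immediately; do not cancel here,
    // otherwise the VenafiConnection goroutine that is still needed for the
    // POST to Venafi would be stopped as well.
    return nil
})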

