
Conversation

@wallrj (Member) commented on Oct 30, 2024

This is part of the effort to remove the use of the legacy log module and replace it with klog.

Instead of creating unsupervised goroutines and relying on log.Fatal to terminate the process if any of them fail,
this PR adds those goroutines to the existing errgroup and replaces the log.Fatal calls with returned errors, which can be bubbled up to the parent Cobra command and handled centrally in #599.

These changes have been extracted from #593 for easier review.
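
For illustration, here is a minimal sketch of the intended shape; the Cobra wiring and the newAgentCommand/runAgent names below are hypothetical, not the actual code:

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/spf13/cobra"
    "golang.org/x/sync/errgroup"
)

// newAgentCommand returns a command whose RunE bubbles errors up to Cobra
// instead of letting individual goroutines call log.Fatal.
func newAgentCommand() *cobra.Command {
    return &cobra.Command{
        Use: "agent",
        RunE: func(cmd *cobra.Command, args []string) error {
            return runAgent(cmd.Context())
        },
    }
}

func runAgent(ctx context.Context) error {
    group, gctx := errgroup.WithContext(ctx)
    group.Go(func() error {
        // A supervised component that fails returns its error to the
        // errgroup instead of calling log.Fatal; errgroup then cancels
        // gctx for the rest of the group.
        return fmt.Errorf("component stopped unexpectedly")
    })
    group.Go(func() error {
        // A second component that exits cleanly once the group context
        // is cancelled.
        <-gctx.Done()
        return nil
    })
    // The first error is returned to the Cobra command and handled there.
    return group.Wait()
}

func main() {
    if err := newAgentCommand().Execute(); err != nil {
        os.Exit(1)
    }
}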

Testing

Occupy the server listening port and observe that the process exits with a meaningful error message.

python3 -m http.server 8081 &

$ POD_NAMESPACE=venafi ./preflight agent --output-path /dev/null --api-token should-not-be-required  --agent-config-file examples/cert-manager-agent.yaml --enable-pprof --period 1s
2024/10/30 15:41:00 Preflight agent version: development ()
2024/10/30 15:41:00 Using the Jetstack Secure API Token auth mode since --api-token was specified.
2024/10/30 15:41:00 Using deprecated Endpoint configuration. User Server instead.
2024/10/30 15:41:00 pprof profiling was enabled.
2024/10/30 15:41:00 error messages will not show in the pod's events because the POD_NAME environment variable is empty
2024/10/30 15:41:00 starting "k8s/secrets.v1" datagatherer
2024/10/30 15:41:00 starting "k8s/certificates.v1alpha2.cert-manager.io" datagatherer
2024/10/30 15:41:00 starting "k8s/ingresses.v1beta1.networking.k8s.io" datagatherer
2024/10/30 15:41:00 starting "k8s/certificaterequests.v1alpha2.cert-manager.io" datagatherer
2024/10/30 15:41:00 failed to complete initial sync of "k8s-dynamic" data gatherer "k8s/secrets.v1": timed out waiting for Kubernetes caches to sync
2024/10/30 15:41:00 failed to complete initial sync of "k8s-dynamic" data gatherer "k8s/certificates.v1alpha2.cert-manager.io": timed out waiting for Kubernetes caches to sync
2024/10/30 15:41:00 failed to complete initial sync of "k8s-dynamic" data gatherer "k8s/ingresses.v1beta1.networking.k8s.io": timed out waiting for Kubernetes caches to sync
2024/10/30 15:41:00 failed to complete initial sync of "k8s-dynamic" data gatherer "k8s/certificaterequests.v1alpha2.cert-manager.io": timed out waiting for Kubernetes caches to sync
2024/10/30 15:41:00 successfully gathered 0 items from "k8s/secrets.v1" datagatherer
2024/10/30 15:41:00 successfully gathered 0 items from "k8s/certificates.v1alpha2.cert-manager.io" datagatherer
2024/10/30 15:41:00 successfully gathered 0 items from "k8s/ingresses.v1beta1.networking.k8s.io" datagatherer
2024/10/30 15:41:00 successfully gathered 0 items from "k8s/certificaterequests.v1alpha2.cert-manager.io" datagatherer
2024/10/30 15:41:00 Data saved to local file: /dev/null
2024/10/30 15:41:00 failed to wait for controller-runtime component to stop: failed to run the health check server: listen tcp :8081: bind: address already in use

Comment on lines +77 to +84
group, gctx := errgroup.WithContext(ctx)
defer func() {
    // TODO: replace Fatalf log calls with Errorf and return the error
    cancel()
    if err := group.Wait(); err != nil {
        logs.Log.Fatalf("failed to wait for controller-runtime component to stop: %v", err)
    }
}()
wallrj (Member Author):

This code was cut and pasted earlier in the file so that the errgroup can be initialized and used both for the management server (/metrics, /healthz, /debug/pprof) and for the VenafiConnection manager, both of which were previously run in unsupervised goroutines.
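
Roughly, the resulting shape looks like the sketch below; runManagementServer and runVenafiConnectionManager are hypothetical names standing in for the two components:

group, gctx := errgroup.WithContext(ctx)

group.Go(func() error {
    // Serves /metrics, /healthz and /debug/pprof (hypothetical helper).
    return runManagementServer(gctx)
})
group.Go(func() error {
    // Runs the VenafiConnection manager (hypothetical helper).
    return runVenafiConnectionManager(gctx)
})

// If either component fails, gctx is cancelled and group.Wait returns the
// first error, which is bubbled up instead of calling log.Fatal.
if err := group.Wait(); err != nil {
    return err
}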

@wallrj wallrj force-pushed the VC-35738/use-errgroup branch 2 times, most recently from a6d67f4 to 0cb53e6 on October 30, 2024 at 16:12
case <-gctx.Done():
    return
case <-time.After(config.Period):
}
wallrj (Member Author):

If any of the errgroup goroutines exits (whether with nil or an error), the main context is cancelled, which causes this blocking loop to exit immediately instead of waiting for the full period.
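
In other words, the surrounding loop looks roughly like this simplified sketch (not the exact code):

for {
    // ... gather and post data ...

    select {
    case <-gctx.Done():
        // Another goroutine in the group has exited and the context was
        // cancelled, so stop now rather than sleeping out the period.
        return nil
    case <-time.After(config.Period):
        // The period elapsed; run the next gather cycle.
    }
}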

    // The agent must stop if the management server stops
    cancel()
    return nil
})
wallrj (Member Author):

In a future PR we should add a signal handler to catch SIGTERM and SIGINT (Ctrl-C) and cancel the parent context, and add another goroutine which calls server.Stop after the context is cancelled.

As it stands, SIGTERM causes the process (all threads) to exit immediately without any graceful shutdown.
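
A possible sketch for that future PR, using signal.NotifyContext from os/signal; server.Start and server.Stop are hypothetical stand-ins for the management server's lifecycle calls:

// Cancel the parent context on SIGINT (Ctrl-C) or SIGTERM.
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

group, gctx := errgroup.WithContext(ctx)
group.Go(func() error {
    // Blocks until the server stops (hypothetical call).
    return server.Start()
})
group.Go(func() error {
    // When the context is cancelled, ask the server to shut down gracefully
    // so the goroutine above can return.
    <-gctx.Done()
    return server.Stop(context.Background())
})
return group.Wait()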

@wallrj wallrj force-pushed the VC-35738/use-errgroup branch from 0cb53e6 to 54dc42e on October 30, 2024 at 16:30
Comment on lines +205 to +206
// TODO(wallrj): Pass a context to gatherAndOutputData, so that we don't
// have to wait for it to finish before exiting the process.
wallrj (Member Author):

This can be done in another PR.
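
For later reference, one way that TODO could look; gatherOnce and its gather parameter are hypothetical, not the agreed design:

// gatherOnce wraps a blocking gather call so the caller can stop waiting as
// soon as ctx is cancelled (gather stands in for gatherAndOutputData).
func gatherOnce(ctx context.Context, gather func() error) error {
    done := make(chan error, 1)
    go func() { done <- gather() }()
    select {
    case <-ctx.Done():
        return ctx.Err()
    case err := <-done:
        return err
    }
}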

Base automatically changed from VC-35738/pprof-on-8081 to VC-35738/feature October 31, 2024 09:31
@wallrj wallrj mentioned this pull request Oct 31, 2024
Instead of using log.Fatal in some of them

Signed-off-by: Richard Wall <[email protected]>
@wallrj wallrj force-pushed the VC-35738/use-errgroup branch from 54dc42e to 864edd3 on October 31, 2024 at 09:40
@wallrj wallrj changed the title WIP: [VC-35738] Use errgroup for all go routines [VC-35738] Use errgroup for all go routines Oct 31, 2024
@wallrj wallrj requested a review from tfadeyi October 31, 2024 09:42
}
}()
// The agent must stop if the management server stops
cancel()
Contributor:

Instead of calling cancel, can we return an error when the server stops unexpectedly early?
Normally, when the context is not canceled the server should keep running, and when the context is canceled we do not need to call cancel ourselves.

Contributor:

Also, when we return an error, the group context will be automatically canceled.

wallrj (Member Author):

I think I agree that an error should be returned, something like fmt.Errorf("the API server stopped unexpectedly").
But I'll come back to that in another PR; this cancel will work for now.
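
For reference, a sketch of what that could look like (server.Start here is a hypothetical blocking call):

group.Go(func() error {
    if err := server.Start(); err != nil {
        return fmt.Errorf("the management server failed: %w", err)
    }
    // Returning a non-nil error cancels gctx for the whole group, so an
    // explicit cancel() is no longer needed here.
    return fmt.Errorf("the API server stopped unexpectedly")
})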

@tfadeyi (Contributor) left a comment:

lgtm

In a future PR it might be good to also move the main server definition into a separate function/file, and just have the server's start and stop calls in the goroutines.

@wallrj wallrj merged commit c87f7c3 into VC-35738/feature Nov 1, 2024
2 checks passed
@wallrj wallrj deleted the VC-35738/use-errgroup branch November 1, 2024 12:06
    return fmt.Errorf("failed to start %q data gatherer %q: %v", kind, dgConfig.Name, err)
}
// The agent must stop if any of the data gatherers stops
cancel()
@wallrj (Member Author) commented on Nov 6, 2024:

This caused a bug.
Most of the DataGatherer.Run implementations return immediately, but one (Dynamic) runs an informer which blocks until the supplied channel is closed.
By calling cancel here, I stop all the errgroup goroutines, including the VenafiConnection helper which is needed later when the data gets posted to Venafi.
As a result, the POST hangs and the agent gets stuck. (A possible adjustment is sketched after the snippets below.)

  • func (g *DataGatherer) Run(stopCh <-chan struct{}) error {
        // no async functionality, see Fetch
        return nil
    }

  • func (g *DataGathererDiscovery) Run(stopCh <-chan struct{}) error {
        // no async functionality, see Fetch
        return nil
    }

  • // Run starts the dynamic data gatherer's informers for resource collection.
    // Returns error if the data gatherer informer wasn't initialized, Run blocks
    // until the stopCh is closed.
    func (g *DataGathererDynamic) Run(stopCh <-chan struct{}) error {
        if g.informer == nil {
            return fmt.Errorf("informer was not initialized, impossible to start")
        }

        // attach WatchErrorHandler, it needs to be set before starting an informer
        err := g.informer.SetWatchErrorHandler(func(r *k8scache.Reflector, err error) {
            if strings.Contains(fmt.Sprintf("%s", err), "the server could not find the requested resource") {
                logs.Log.Printf("server missing resource for datagatherer of %q ", g.groupVersionResource)
            } else {
                logs.Log.Printf("datagatherer informer for %q has failed and is backing off due to error: %s", g.groupVersionResource, err)
            }
        })
        if err != nil {
            return fmt.Errorf("failed to SetWatchErrorHandler on informer: %s", err)
        }

        // start shared informer
        g.informer.Run(stopCh)
        return nil
    }
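
One possible adjustment (a sketch only, not the actual fix): keep each Run in the errgroup, but only treat an error return as fatal, so the non-blocking implementations don't tear down the group. newDg is a hypothetical variable for the data gatherer:

group.Go(func() error {
    if err := newDg.Run(gctx.Done()); err != nil {
        return fmt.Errorf("failed to start %q data gatherer %q: %w", kind, dgConfig.Name, err)
    }
    // Most Run implementations return nil immediately; do not cancel here,
    // otherwise the VenafiConnection goroutine that is still needed for the
    // POST to Venafi would be stopped as well.
    return nil
})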

