Description
/kind bug
1. What kops version are you running? The command kops version, will display
this information.
1.30.0-beta.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.29.6
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --yes && kops rolling-update cluster --yes
5. What happened after the commands executed?
After the update, dns-controller reports the following in its logs:
W0702 18:21:16.134048 1 dnscontroller.go:134] Unexpected error in DNS controller, will retry in 2m40s: error querying for zones: error querying for DNS zones: error listing hosted zones: operation error Route 53: ListHostedZones, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region
The DNS records for the cluster are never updated after new control plane nodes are brought up during the rolling update, so the rolling update eventually fails:
I0702 18:17:33.236104 2974 instancegroups.go:553] Cluster did not validate within deadline: error listing nodes: Get "https://my.cluster.com/api/v1/nodes": dial tcp x.y.z.a:443: i/o timeout.
E0702 18:17:33.236525 2974 instancegroups.go:512] Cluster did not validate within 30m0s
6. What did you expect to happen?
dns-controller should have updated the DNS records and the rolling update should have completed successfully.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
It's a pretty vanilla AWS cluster, can provide if needed though.
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
The rolling update isn't the problem here; dns-controller is.
9. Anything else we need to know?
Manually editing the dns-controller deployment and adding the AWS_DEFAULT_REGION environment variable is sufficient to get dns-controller to start updating DNS records successfully again.
Slack thread for context: https://kubernetes.slack.com/archives/C3QUFP0QM/p1719945935453279
Relevant code is here:
func newRoute53() (*Interface, error) {
Based on the dns-controller error message, it seems that IMDS is never queried for the region and that the cfg.Region == "" check does not take effect.