dns-controller fails to update Route 53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1

/kind bug

**1. What `kops` version are you running? The command `kops version`, will display
 this information.**
1.30.0-beta.1

**2. What Kubernetes version are you running? `kubectl version` will print the
 version if a cluster is running or provide the Kubernetes version specified as
 a `kops` flag.**
1.29.6

**3. What cloud provider are you using?**
AWS

**4. What commands did you run?  What is the simplest way to reproduce this issue?**
`kops update cluster --yes && kops rolling-update cluster --yes`

**5. What happened after the commands executed?**
After the update, dns-controller reports the following in its logs:
```
W0702 18:21:16.134048       1 dnscontroller.go:134] Unexpected error in DNS controller, will retry in 2m40s: error querying for zones: error querying for DNS zones: error listing hosted zones: operation error Route 53: ListHostedZones, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region
```

The DNS records for the cluster are never updated after new control plane nodes are brought up during the rolling update and so eventually the rolling update fails:
```
I0702 18:17:33.236104    2974 instancegroups.go:553] Cluster did not validate within deadline: error listing nodes: Get "https://my.cluster.com/api/v1/nodes": dial tcp x.y.z.a:443: i/o timeout.
E0702 18:17:33.236525    2974 instancegroups.go:512] Cluster did not validate within 30m0s
```

**6. What did you expect to happen?**
dns-controller should have updated the DNS records and the rolling update should have completed successfully.

**7. Please provide your cluster manifest. Execute
  `kops get --name my.example.com -o yaml` to display your cluster manifest.
  You may want to remove your cluster name and other sensitive information.**

It's a pretty vanilla AWS cluster, can provide if needed though.

**8. Please run the commands with most verbose logging by adding the `-v 10` flag.
  Paste the logs into this report, or in a gist and provide the gist link here.**

The rolling update isn't the problem here, it's dns-controller.

**9. Anything else we need to know?**

Manually editing the dns-controller deployment and adding the `AWS_DEFAULT_REGION` environment variable is sufficient to get dns-controller to start updating DNS records successfully again.

Slack thread for context: https://kubernetes.slack.com/archives/C3QUFP0QM/p1719945935453279

Relevant code is here: https://github.com/kubernetes/kops/blob/c40a9d268986428d884189cb59ddeed736a1ed5c/dnsprovider/pkg/dnsprovider/providers/aws/route53/route53.go#L47
Based on the dns-controller error message, it seems like IMDS is not queried for the region, nor does the `cfg.Region == ""` check work.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

dns-controller fails to update Route 53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1 #16645

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

dns-controller fails to update Route 53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1 #16645

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions