Skip to content

dns-controller fails to update Route 53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1 #16645

@danports

Description

@danports

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

1.30.0-beta.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.29.6

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --yes && kops rolling-update cluster --yes

5. What happened after the commands executed?
After the update, dns-controller reports the following in its logs:

W0702 18:21:16.134048       1 dnscontroller.go:134] Unexpected error in DNS controller, will retry in 2m40s: error querying for zones: error querying for DNS zones: error listing hosted zones: operation error Route 53: ListHostedZones, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region

The DNS records for the cluster are never updated after new control plane nodes are brought up during the rolling update and so eventually the rolling update fails:

I0702 18:17:33.236104    2974 instancegroups.go:553] Cluster did not validate within deadline: error listing nodes: Get "https://my.cluster.com/api/v1/nodes": dial tcp x.y.z.a:443: i/o timeout.
E0702 18:17:33.236525    2974 instancegroups.go:512] Cluster did not validate within 30m0s

6. What did you expect to happen?
dns-controller should have updated the DNS records and the rolling update should have completed successfully.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

It's a pretty vanilla AWS cluster, can provide if needed though.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

The rolling update isn't the problem here, it's dns-controller.

9. Anything else we need to know?

Manually editing the dns-controller deployment and adding the AWS_DEFAULT_REGION environment variable is sufficient to get dns-controller to start updating DNS records successfully again.

Slack thread for context: https://kubernetes.slack.com/archives/C3QUFP0QM/p1719945935453279

Relevant code is here:

func newRoute53() (*Interface, error) {

Based on the dns-controller error message, it seems like IMDS is not queried for the region, nor does the cfg.Region == "" check work.

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/bugCategorizes issue or PR as related to a bug.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions