GDC Connected Cluster Provisioner

This solution automates the provisioning and configuration of Google Distributed Cloud (GDC) connected clusters at scale, as part of the pre-staging process when edge zones are turned up.

Overview

The GDC connected cluster provisioner is automation that optimizes for:

Declared intent. Cluster parameters should be specified well ahead of when the cluster can be provisioned. It should be possible to define cluster parameters months in advance, with cluster creation happening once the Edge Zone is available.
Safety. By design, errors should not have fleet-wide impact. To that end, the solution prefers manual remediation over automated remediation. Once a cluster is created, only a few update actions are supported, and delete operations are not supported.
End-to-end automation. With declared intent configured in advance, the automation is designed to run without any human intervention.
Extensibility. This solution is an opinionated deployment pipeline and will not cover 100% of provisioning workflows or GCP environment requirements. Extension of the provisioning logic is expected.

High Level Architecture - Cluster Creation

Zone Watcher: A Cloud Function that polls the Cluster Intent Data and the available Edge Zones. If a cluster is declared for a new zone, it kicks off a Cloud Build job to provision the cluster. Otherwise, it does nothing.
Edge Zone: Two GDC APIs are used to detect the availability of an Edge Zone:
- The Zone created as part of an order. This provides the globallyUniqueId, or edge zone node location, used during cluster provisioning, as well as the state of the zone.
- The available machines in a given GCP project. The hostedNode property of the machines shows whether a cluster is already running on them. If a cluster is already provisioned, provisioning is not triggered.
Cluster Intent Data: A CSV file which holds the parameters necessary for cluster creation. Example: example-source-of-truth.csv
Cloud Build Job: A bash script that queries the cluster intent data to read the parameters needed to create a cluster, bootstraps ConfigSync and other fleet services, and validates that the provisioning process completed.
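
The Zone Watcher decision described above can be sketched roughly as follows. This is a minimal illustration, not the watcher's actual code: the `Zone` class, the `has_hosted_node` flag, and the choice of `ACTIVE` as the ready state are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical, simplified view of a zone as seen through the GDC APIs."""
    store_id: str
    state: str             # assumed state name, e.g. "ACTIVE"
    has_hosted_node: bool  # True if a machine in the zone reports hostedNode


def zones_to_provision(intent_store_ids, zones, ready_state="ACTIVE"):
    """Return store_ids with declared intent, a ready zone, and no cluster yet."""
    declared = set(intent_store_ids)
    selected = []
    for zone in zones:
        if zone.store_id not in declared:
            continue  # no declared cluster for this zone: skip
        if zone.state != ready_state:
            continue  # zone is not available yet
        if zone.has_hosted_node:
            continue  # a cluster already runs on these machines
        selected.append(zone.store_id)
    return selected
```

Each returned `store_id` would correspond to one Cloud Build job being started; zones that fail any check are simply skipped until the next poll.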

High Level Architecture - Cluster Modification

Cluster Watcher: A Cloud Function that polls the Cluster Intent Data and the available clusters. If any supported modifications need to be made, it kicks off the Cloud Build job.
GDC Clusters: The GDC Cluster resource. The Cluster Watcher function queries this API to compare cluster parameters against the cluster intent data, while the Cloud Build job calls the appropriate update commands to modify the cluster.
Cluster Intent Data: A CSV file which holds the parameters necessary for cluster creation. Example: example-source-of-truth.csv
Cloud Build Job: A bash script that queries the cluster intent data to read the parameters needed to modify the cluster.

Supported Modifications

By design, the solution does not support destructive operations across the fleet. Beyond cluster creation, these are the supported actions:

Adding new VLANs
Updating or removing the maintenance window
Updating or removing the maintenance exclusion window(s)

Modifications not listed here, such as deleting a VLAN or reconfiguring ConfigSync, should be scripted outside of this solution.
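
To illustrate how the supported and unsupported modifications could be told apart, here is a minimal sketch that diffs intent against the current cluster state. The dictionary shape (`vlans`, `maintenance_window`, `exclusions`) and the action names are hypothetical and chosen only for this example; they are not the solution's actual data model.

```python
def plan_modifications(intent, current):
    """Return (actions, unsupported) for one cluster.

    Both arguments are hypothetical dicts with the keys:
      "vlans": list of VLAN IDs
      "maintenance_window": a (start, end, recurrence) tuple, or None
      "exclusions": dict mapping exclusion name to a (start, end) tuple
    """
    actions = []
    unsupported = []

    # Adding VLANs is supported; deleting them is not.
    added = sorted(set(intent["vlans"]) - set(current["vlans"]))
    removed = sorted(set(current["vlans"]) - set(intent["vlans"]))
    if added:
        actions.append(("add_vlans", added))
    if removed:
        unsupported.append(("delete_vlans", removed))

    # The maintenance window can be updated or removed.
    if intent["maintenance_window"] != current["maintenance_window"]:
        if intent["maintenance_window"] is None:
            actions.append(("remove_maintenance_window", None))
        else:
            actions.append(
                ("update_maintenance_window", intent["maintenance_window"]))

    # Maintenance exclusion windows can be updated or removed.
    if intent["exclusions"] != current["exclusions"]:
        if not intent["exclusions"]:
            actions.append(("remove_maintenance_exclusions", None))
        else:
            actions.append(
                ("update_maintenance_exclusions", intent["exclusions"]))

    return actions, unsupported
```

Anything that ends up in `unsupported` would be left for scripting outside of this solution rather than applied automatically.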

Pre-Requisites

The solution is designed to run within a GCP organization. It is expected that the user will have the following:

A GCP project to host the solution resources.
A GCP project to host the GDC clusters.
A GCP project to host the GDC machines.
A git repository containing the cluster intent data.
A git token to authenticate with the git repository.
Adequate permissions to deploy the GCP resources with Terraform.

Required Roles for Terraform Agent

GCP Role Name	Projects
roles/cloudbuild.builds.editor	Main
roles/cloudfunctions.admin	Main
roles/cloudscheduler.admin	Main
roles/iam.serviceAccountAdmin	Main
roles/resourcemanager.projectIamAdmin	All
roles/iam.serviceAccountUser	Main
roles/serviceusage.serviceUsageAdmin	All
roles/storage.admin	Main

Cluster Intent Git PAT Token

When using GitHub, a personal access token must be created and uploaded to Secret Manager. When using GitLab, a project access token must be configured and uploaded to Secret Manager. The automated cluster provisioning solution uses these tokens to read the cluster intent data, which is a CSV file stored in a git repository.

ConfigSync

This project assumes the usage of ConfigSync for handling declared cluster configuration and any necessary workload configuration in the pre-staging environment.

Installation

cd bootstrap

cp terraform.tfvars.example terraform.tfvars
# update the terraform.tfvars as needed

terraform init -backend-config=env/prod.gcs.tfbackend
terraform plan -var="environment=prod"
terraform apply -var="environment=prod"

This will deploy all the GCP resources for the automated cluster provisioning solution. Use the environment=... Terraform variable to separate multiple instances of the solution. For example, keeping separate dev and prod instances helps validate changes and ensures development work doesn't disrupt active provisioning.

Usage

Once the solution is deployed, most interaction is expected to happen through the cluster intent data CSV file. An example can be found in example-source-of-truth.csv, where each row is one cluster for a given location. The expected sequence is:

In a GCP project, place an order through the UI or API. This will generate a corresponding Zone resource.
Add a new row to the cluster intent data CSV file. Fill in store_id, machine_project_id, and location, which together form the key used to find the appropriate edge zone, then fill in all the other required parameters.
Wait for the next reconciliation loop... and done! If this is a new cluster, you'll see a new Cloud Build job that runs the provisioning logic.

Cluster Intent

Cluster Intent Data Format

Parameter	Required	Description
store_id	yes	This is the same as the order's zone name. It is used to look up state and the corresponding EdgeContainer Zone.
zone_name	no	(Optional.) When no order has been placed, or when you want to bypass the GDC Hardware Management API logic, specify the zone_name here. Doing so skips all calls to the Hardware Management API.
machine_project_id	yes	The GCP project that hosts the edge zone.
fleet_project_id	yes	The GCP project that will host the cluster.
cluster_name	yes	The name of the cluster.
location	yes	The GCP region. Note that this has to be the same region that the order was placed in. Order region == Edge Zone region == Cluster region.
node_count	yes	The number of nodes in a cluster
cluster_ipv4_cidr	yes	The desired IPv4 CIDR block for Kubernetes pods.
services_ipv4_cidr	yes	The desired IPv4 CIDR block for Kubernetes services.
external_load_balancer_ipv4_address_pools	yes	The desired IPv4 CIDR block for ingress traffic of GDC load balancers.
sync_repo	yes	The git repository used for ConfigSync's RootSync object.
sync_branch	yes	The branch used for ConfigSync's RootSync object.
sync_dir	yes	The path within the repository used for ConfigSync's RootSync object.
git_token_secrets_manager_name	yes	Name of the Secret Manager secret holding the git PAT token that is deployed into the cluster so ConfigSync can pull configuration from the git repository.
cluster_version	yes	Initial version with which to provision the cluster.
maintenance_window_start	no	(Optional.) Start time of the maintenance window.
maintenance_window_end	no	(Optional.) End time of the maintenance window.
maintenance_window_recurrence	no	(Optional.) Recurrence (frequency) of the maintenance window.
maintenance_exclusion_name_1	no	(Optional.) Name of maintenance exclusion window. Supports up to 3 exclusion windows by specifying additional columns `maintenance_exclusion_name_2` and `maintenance_exclusion_name_3`
maintenance_exclusion_start_1	no	(Optional.) Start of maintenance exclusion window. Supports up to 3 exclusion windows by specifying additional columns `maintenance_exclusion_start_2` and `maintenance_exclusion_start_3`
maintenance_exclusion_end_1	no	(Optional.) End of maintenance exclusion window. Supports up to 3 exclusion windows by specifying additional columns `maintenance_exclusion_end_2` and `maintenance_exclusion_end_3`
subnet_vlans	no	(Optional.) Used by the cluster provisioning automation to call the edge network API and create VLANs for a particular edge zone.
recreate_on_delete	yes	Whether to recreate a cluster with a zone state of `ACTIVE`. This can be used for automated re-provisioning (delete the cluster and it'll automatically re-create).
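
As a rough illustration of the format above, the following sketch loads intent rows with Python's standard `csv` module, checks the required columns, and collects the numbered maintenance exclusion columns. The function names are hypothetical and the solution's own scripts may read the file differently.

```python
import csv
import io

# Columns marked "yes" in the Required column of the table above.
REQUIRED_COLUMNS = [
    "store_id", "machine_project_id", "fleet_project_id", "cluster_name",
    "location", "node_count", "cluster_ipv4_cidr", "services_ipv4_cidr",
    "external_load_balancer_ipv4_address_pools", "sync_repo", "sync_branch",
    "sync_dir", "git_token_secrets_manager_name", "cluster_version",
    "recreate_on_delete",
]


def load_intent(csv_text):
    """Parse cluster intent CSV text into a dict keyed by store_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    intent = {}
    for row in reader:
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            raise ValueError(
                f"line {reader.line_num}: empty required values: {empty}")
        intent[row["store_id"]] = row
    return intent


def exclusion_windows(row):
    """Collect up to three maintenance exclusion windows from one row."""
    windows = []
    for i in (1, 2, 3):
        name = (row.get(f"maintenance_exclusion_name_{i}") or "").strip()
        if name:
            windows.append({
                "name": name,
                "start": row.get(f"maintenance_exclusion_start_{i}") or "",
                "end": row.get(f"maintenance_exclusion_end_{i}") or "",
            })
    return windows
```

Keying the result by `store_id` mirrors how each row describes one cluster for one location; optional columns that are absent from the file are simply treated as empty.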

Cluster Intent Validation

We recommend validating cluster intent for proper format and values as part of the PR process. A number of validation tools are available, and we provide an example validation GitHub Action that uses the csv-validator tool. For more information, see the validation model and the validation GitHub Action.
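
To give a feel for the kind of checks a PR-time validation might run, here is a small sketch using Python's standard `ipaddress` module. It is not the csv-validator schema shipped with the project; the function name and the specific rules (including treating overlapping pod and service ranges as an error) are assumptions made for illustration.

```python
import ipaddress


def validate_row(row):
    """Return a list of human-readable problems found in one intent row."""
    problems = []

    # node_count must be a positive integer.
    node_count = row.get("node_count") or ""
    if not node_count.isdecimal() or int(node_count) < 1:
        problems.append(
            f"node_count must be a positive integer, got {node_count!r}")

    # Pod and service ranges must each parse as an IPv4 CIDR block.
    networks = {}
    for column in ("cluster_ipv4_cidr", "services_ipv4_cidr"):
        try:
            networks[column] = ipaddress.IPv4Network(row.get(column) or "")
        except ValueError as err:
            problems.append(f"{column}: {err}")

    # Assumption for this sketch: overlapping ranges are reported as an error.
    if len(networks) == 2 and networks["cluster_ipv4_cidr"].overlaps(
            networks["services_ipv4_cidr"]):
        problems.append("cluster_ipv4_cidr overlaps services_ipv4_cidr")

    return problems
```

A check like this could run over every row in the CSV and fail the PR if any row returns a non-empty list.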

Operations

Metrics

This table describes the metrics available to monitor cluster provisioning.

Name	Type	Tags	Description
unknown-zones-${environment}	Count	zone	Zones found in the environment that are not specified in the cluster intent
ready-stores-${environment}	Count	store_id	Store edge zones ready for provisioning
cluster-creation-success-${environment}	Count	cluster_name	Cluster Creation Success Count
cluster-creation-failure-${environment}	Count	cluster_name	Cluster Creation Failure Count
cluster-modify-success-${environment}	Count	cluster_name	Cluster Modify Success Count
cluster-modify-failure-${environment}	Count	cluster_name	Cluster Modify Failure Count

Alerts

This table describes the alerts created to monitor cluster provisioning. These alerts are intended as examples and should be tuned for your environment.

Name	Description
unknown-zone-alert	Alerts whenever an unknown zone not defined in the cluster intent source of truth has been found in the environment.
cluster-creation-failure	Alerts when cluster creation has failed
cluster-modify-failure	Alerts when cluster modification has failed

Automated Retries

Automated retries can be configured to address intermittent build failures. To enable them, set the cluster_creation_max_retries Terraform variable to a value greater than 0 but less than 5. The solution tracks the number of failed builds for a zone and retries them until that number exceeds the configured maximum.

Note

If you decrease cluster_creation_max_retries, in-progress builds may fail to call the zone's signal endpoint correctly. Be sure to manually check that any failed builds are properly retried. This is not a concern when increasing the value.
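
The retry rule described in this section can be sketched as below. This is one reading of the text, not the solution's code: the function names are hypothetical, and it assumes `failed_builds` counts every failed build for the zone, including the most recent one.

```python
FAILED_STATE = "CUSTOMER_FACTORY_TURNUP_CHECKS_FAILED"


def validate_max_retries(max_retries):
    """Accept 0 (retries disabled) through 4, per the range described above."""
    if not 0 <= max_retries < 5:
        raise ValueError("cluster_creation_max_retries must be between 0 and 4")
    return max_retries


def next_action(failed_builds, max_retries):
    """Decide what happens after a cluster creation build fails.

    Returns "retry" while the failure count has not exceeded max_retries,
    otherwise the zone state that marks the zone as failed.
    """
    validate_max_retries(max_retries)
    if failed_builds <= max_retries:
        return "retry"
    return FAILED_STATE
```

With `max_retries` set to 2, for example, the first and second failures lead to a retry and the third marks the zone as failed; with the default of 0, the first failure is final.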

Terraform Details

Providers

Name	Version
archive	2.4.2
google	5.26.0
random	3.6.1

Modules

No modules.

Inputs

Name	Description	Type	Default	Required
environment	Deployment environment. Used to build resource names to partition GCP resources if deploying multiple ACP instances into the same project.	`string`	`"stg"`	no
node_location	Default GDCE zone used by Cloud Build	`string`	n/a	yes
project_id	The Google Cloud Platform (GCP) project id in which the solution resources will be provisioned	`string`	`"cloud-alchemists-sandbox"`	no
project_id_fleet	Optional ID of the GCP project hosting the Google Kubernetes Engine (GKE) fleet or Google Distributed Cloud Edge (GDCE) machines. Defaults to the value of 'project_id'.	`string`	`null`	no
project_id_secrets	Optional ID of the GCP project containing the Secret Manager entry storing Git repository credentials. Defaults to the value of 'project_id'.	`string`	`null`	no
project_services	GCP Service APIs (.googleapis.com) to enable for this project	`list(string)`	[ "cloudbuild.googleapis.com", "cloudfunctions.googleapis.com", "cloudscheduler.googleapis.com", "run.googleapis.com", "storage.googleapis.com" ]	no
project_services_fleet	GCP Service APIs (.googleapis.com) to enable for this project	`list(string)`	[ "anthos.googleapis.com", "anthosaudit.googleapis.com", "anthosconfigmanagement.googleapis.com", "anthosgke.googleapis.com", "artifactregistry.googleapis.com", "cloudbuild.googleapis.com", "cloudfunctions.googleapis.com", "cloudresourcemanager.googleapis.com", "cloudscheduler.googleapis.com", "connectgateway.googleapis.com", "container.googleapis.com", "edgecontainer.googleapis.com", "gkeconnect.googleapis.com", "gkehub.googleapis.com", "gkeonprem.googleapis.com", "iam.googleapis.com", "iamcredentials.googleapis.com", "logging.googleapis.com", "monitoring.googleapis.com", "opsconfigmonitoring.googleapis.com", "run.googleapis.com", "secretmanager.googleapis.com", "serviceusage.googleapis.com", "stackdriver.googleapis.com", "storage.googleapis.com", "sts.googleapis.com" ]	no
project_services_secrets	GCP Service APIs (.googleapis.com) to enable for this project	`list(string)`	[ "secretmanager.googleapis.com" ]	no
region	GCP region to deploy resources	`string`	n/a	yes
source_of_truth_repo	Repository containing source of truth cluster intent registry	`string`	n/a	yes
source_of_truth_branch	Repository branch containing source of truth cluster intent registry	`string`	n/a	yes
source_of_truth_path	Path to cluster intent registry file in repository	`string`	n/a	yes
git_secret_id	Git token to authenticate with source of truth	`string`	n/a	yes
cluster_creation_timeout	Cloud Build timeout in seconds for cluster creation. This should account for the time to create the cluster, configure core services (ConfigSync, Robin, VMRuntime, etc.), and apply any workload configuration needed before the health checks pass.	`number`	28800	no
cluster_creation_max_retries	The maximum number of retries upon cluster creation failure before marking the zone state as CUSTOMER_FACTORY_TURNUP_CHECKS_FAILED	`number`	0	no
default_config_sync_version	Default ConfigSync version to use for provisioned clusters. If left empty, no version is specified at the cluster level, and either the fleet-configured version or the latest version of ConfigSync is installed.	`string`	""	no
opt_in_build_messages	Opt in to sending build steps and failure messages to Google. These messages help Google provide support on issues during the provisioning process.	`bool`	false	no

Outputs

No outputs.

Disclaimer

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
