# GDC Connected Cluster Provisioner

This solution automates the provisioning and configuration of Google Distributed Cloud (GDC) connected clusters at scale during pre-staging, as edge zones are turned up.

The GDC connected cluster provisioner is automation which optimizes for:
- Declared intent: Cluster parameters should be specified well ahead of when the cluster can be provisioned. Cluster parameters can be defined months in advance, with cluster creation happening once the Edge Zone is available.
- Safety: By design, errors should not result in fleet-wide impact. We do this by preferring manual remediation over automated remediation. Once a cluster is created, there are only a few supported update actions and no supported delete operations.
- End-to-end automation: With preconfigured declared intent, the automation is designed to run without any human intervention.
- Extensibility: This solution is an opinionated deployment pipeline and will not cover 100% of provisioning workflows or GCP environment requirements. Extension of the provisioning logic is expected.
- Zone Watcher: A Cloud Function which polls the Cluster Intent Data and the available Edge Zones. If there is a declared cluster for a new zone, it kicks off a Cloud Build job to provision the cluster. Otherwise, it skips the zone.
- Edge Zone: Two GDC APIs are leveraged to detect the availability of an Edge Zone:
  - The Zone created as part of an order. This provides the `globallyUniqueId` (the edge zone node location used during cluster provisioning) as well as the `state` of the zone.
  - The available machines in a given GCP project. The `hostedNode` property is used to determine whether a cluster is already running on a set of machines (a manual version of this check is sketched after this list). If a cluster is already provisioned, provisioning is not triggered.
- Cluster Intent Data: A CSV file which holds the parameters necessary for cluster creation. Example: example-source-of-truth.csv
- Cloud Build Job: A bash script which queries the cluster intent data to read the necessary parameters to create a cluster, bootstraps ConfigSync and other fleet services, and validates the completion of the provisioning process.
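For illustration, the machine-availability check can be reproduced by hand. The following is a minimal sketch, assuming `gcloud edge-cloud container machines list` and a local `jq` install; the environment variables are placeholders, and the `--filter` expression assumes the machine resource exposes a `zone` field:

```bash
# Hypothetical spot-check mirroring the Zone Watcher's machine query.
# MACHINE_PROJECT_ID, REGION, and ZONE_NAME are placeholders for your values.
gcloud edge-cloud container machines list \
  --project="${MACHINE_PROJECT_ID}" \
  --location="${REGION}" \
  --filter="zone=${ZONE_NAME}" \
  --format=json \
  | jq '[.[] | select((.hostedNode // "") != "")] | length'
# 0  => no machine hosts a cluster node yet, so provisioning may proceed
# >0 => a cluster already runs on these machines, so the watcher skips them
```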
- Cluster Watcher: A Cloud Function which polls the Cluster Intent Data and the available clusters. If there are any supported modifications to be made, it kicks off the Cloud Build job.
- GDC Clusters: The GDC Cluster resource. The Cluster Watcher function queries this API to compare live parameters against the cluster intent data, while the Cloud Build job calls the appropriate update commands to modify the cluster (see the example after this list).
- Cluster Intent Data: A CSV file which holds the parameters necessary for cluster creation. Example: example-source-of-truth.csv
- Cloud Build Job: A bash script which queries the cluster intent data to read the necessary parameters to modify the cluster.
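As a concrete illustration of the watcher's comparison input, you can inspect a cluster's current maintenance policy manually. A minimal sketch, assuming `gcloud edge-cloud container clusters describe` and placeholder variable names:

```bash
# Hypothetical manual version of the Cluster Watcher's comparison input:
# read the live maintenance policy that intent changes would be diffed against.
gcloud edge-cloud container clusters describe "${CLUSTER_NAME}" \
  --project="${FLEET_PROJECT_ID}" \
  --location="${REGION}" \
  --format="yaml(maintenancePolicy)"
```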
By design, the solution does not support destructive operations across the fleet. Beyond cluster creation, these are the supported actions:
- Adding new VLANs
- Updating or removing the maintenance window
- Updating or removing the maintenance exclusion window(s)
Other modifications not listed here, such as deleting a VLAN or reconfiguring ConfigSync, should be scripted outside of this solution. A hedged example of one supported update is sketched below.
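For instance, a maintenance window update can be applied through the clusters update command. This is a sketch under the assumption that your gcloud version exposes these maintenance-window flags (verify with `gcloud edge-cloud container clusters update --help`); all values are placeholders:

```bash
# Hypothetical maintenance window update; flag names are assumed, so verify
# them against your installed gcloud version before use.
gcloud edge-cloud container clusters update "${CLUSTER_NAME}" \
  --project="${FLEET_PROJECT_ID}" \
  --location="${REGION}" \
  --maintenance-window-start="2024-01-01T08:00:00Z" \
  --maintenance-window-end="2024-01-01T12:00:00Z" \
  --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SU"
```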
The solution is designed to run within a GCP organization. It is expected that the user will have the following:
- A GCP project to host the solution resources.
- A GCP project to host the GDC clusters.
- A GCP project to host the GDC machines.
- A git repository containing the cluster intent data.
- A git token to authenticate with the git repository.
- Adequate permissions to deploy the GCP resources via Terraform.
| GCP Role Name | Projects |
|---|---|
| roles/cloudbuild.builds.editor | Main |
| roles/cloudfunctions.admin | Main |
| roles/cloudscheduler.admin | Main |
| roles/iam.serviceAccountAdmin | Main |
| roles/resourcemanager.projectIamAdmin | All |
| roles/iam.serviceAccountUser | Main |
| roles/serviceusage.serviceUsageAdmin | All |
| roles/storage.admin | Main |
When using GitHub, a personal access token must be created and uploaded to Secret Manager. When using GitLab, a project access token must be configured and uploaded to Secret Manager. The automated cluster provisioning solution uses these tokens to query the cluster intent data, which is a CSV file stored in a git repository.
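A minimal sketch of uploading such a token, assuming a secret named `git-token` (the name is a placeholder; use whatever name you reference in `git_secret_id`):

```bash
# Hypothetical upload of a Git access token to Secret Manager.
# "git-token" is a placeholder secret name; PROJECT_ID and GIT_TOKEN are yours.
gcloud secrets create git-token \
  --project="${PROJECT_ID}" \
  --replication-policy="automatic"

printf '%s' "${GIT_TOKEN}" | gcloud secrets versions add git-token \
  --project="${PROJECT_ID}" \
  --data-file=-
```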
This project assumes the usage of ConfigSync for handling declared cluster configuration and any necessary workload configuration in the pre-staging environment.
```bash
cd bootstrap
cp terraform.tfvars.example terraform.tfvars
# update terraform.tfvars as needed
terraform init -backend-config=env/prod.gcs.tfbackend
terraform plan -var="environment=prod"
terraform apply -var="environment=prod"
```
This deploys all the GCP resources for the automated cluster provisioning solution. Use the `environment` Terraform variable to separate multiple instances of the solution. For example, keeping separate dev and prod instances helps validate changes without disrupting active provisioning.
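For example, a second, isolated dev instance could be deployed like this (the `env/dev.gcs.tfbackend` backend file is hypothetical; only the prod backend file is shown above):

```bash
# Hypothetical dev instance alongside prod; resource names are partitioned
# by the environment variable, so the two instances do not collide.
terraform init -backend-config=env/dev.gcs.tfbackend -reconfigure
terraform plan -var="environment=dev"
terraform apply -var="environment=dev"
```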
Once the solution is deployed, most interaction is expected to happen through the cluster intent data CSV file. An example can be found here, where each row is one cluster for a given location. The expected sequence is:
- In a GCP project, place an order through the UI or API. This will generate a corresponding Zone resource.
- Add a new line to the cluster intent data CSV file, filling out `store_id`, `machine_project_id`, and `location` as the key used to find the appropriate edge zone. Then fill out all the other required parameters in the CSV file.
- Wait for the next reconciliation loop... and done! If this is a new cluster, you'll see a new Cloud Build job which contains the provisioning logic.
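For illustration, a single intent row might look like the following (all values are hypothetical and the columns are abridged; see the parameter table below for the full set):

```csv
store_id,machine_project_id,fleet_project_id,cluster_name,location,node_count,cluster_version
store-0001,my-machine-project,my-fleet-project,store-0001-cluster,us-central1,3,1.7.1
```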
| Parameter | Required | Description |
|---|---|---|
| store_id | yes | This is the same as the order's zone name. It is used to look up state and the corresponding EdgeContainer Zone. |
| zone_name | no | (Optional.) In situations where no order has been placed, or when you want to bypass the GDC Hardware Management API logic, specify the zone name here; this skips all calls to the Hardware Management API. |
| machine_project_id | yes | The GCP project that hosts the edge zone. |
| fleet_project_id | yes | The GCP project that will host the cluster. |
| cluster_name | yes | The name of the cluster. |
| location | yes | The GCP region. Note that this has to be the same region that the order was placed in: order region == Edge Zone region == cluster region. |
| node_count | yes | The number of nodes in the cluster. |
| cluster_ipv4_cidr | yes | The desired IPv4 CIDR block for Kubernetes pods. |
| services_ipv4_cidr | yes | The desired IPv4 CIDR block for Kubernetes services. |
| external_load_balancer_ipv4_address_pools | yes | The desired IPv4 CIDR block for ingress traffic of GDC load balancers. |
| sync_repo | yes | The git repository used for ConfigSync's RootSync object. |
| sync_branch | yes | The branch used for ConfigSync's RootSync object. |
| sync_dir | yes | The path within the repository used for ConfigSync's RootSync object. |
| git_token_secrets_manager_name | yes | Secret Manager secret holding the Git personal access token deployed into the cluster for ConfigSync to pull configuration. |
| cluster_version | yes | Initial cluster version used to provision the cluster. |
| maintenance_window_start | no | (Optional.) Start time of the maintenance window. |
| maintenance_window_end | no | (Optional.) End time of the maintenance window. |
| maintenance_window_recurrence | no | (Optional.) Recurrence of the maintenance window. |
| maintenance_exclusion_name_1 | no | (Optional.) Name of a maintenance exclusion window. Up to 3 exclusion windows are supported via the additional columns maintenance_exclusion_name_2 and maintenance_exclusion_name_3. |
| maintenance_exclusion_start_1 | no | (Optional.) Start of a maintenance exclusion window. Up to 3 exclusion windows are supported via the additional columns maintenance_exclusion_start_2 and maintenance_exclusion_start_3. |
| maintenance_exclusion_end_1 | no | (Optional.) End of a maintenance exclusion window. Up to 3 exclusion windows are supported via the additional columns maintenance_exclusion_end_2 and maintenance_exclusion_end_3. |
| subnet_vlans | no | Used by the cluster provisioning automation to call the Edge Network API and create VLANs for a particular edge zone. |
| recreate_on_delete | yes | Whether to recreate a cluster whose zone state is ACTIVE. This can be used for automated re-provisioning (delete the cluster and it will automatically be re-created). |
We recommend validating the cluster intent as part of the PR process for proper format and values. There are a number of validation tools available; we provide an example validation GitHub Action that uses the csv-validator tool. For more information, view the validation model and the validation GitHub Action.
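Independent of the csv-validator tooling referenced above, a cheap structural pre-check can also be scripted. A minimal sketch (not the provided GitHub Action) that only verifies every row has the same number of fields as the header:

```bash
# Hypothetical sanity check: flag rows whose field count differs from the header.
# Note: this simple check does not handle quoted fields containing commas.
awk -F',' '
  NR == 1 { n = NF }
  NF != n { printf "line %d: expected %d fields, got %d\n", NR, n, NF; bad = 1 }
  END { exit bad }
' example-source-of-truth.csv
```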
This table describes the metrics available for monitoring cluster provisioning.

| Name | Type | Tags | Description |
|---|---|---|---|
| unknown-zones-${environment} | Count | zone | Zones found in the environment but not specified as part of the cluster intent |
| ready-stores-${environment} | Count | store_id | Store edge zones ready for provisioning |
| cluster-creation-success-${environment} | Count | cluster_name | Cluster creation success count |
| cluster-creation-failure-${environment} | Count | cluster_name | Cluster creation failure count |
| cluster-modify-success-${environment} | Count | cluster_name | Cluster modification success count |
| cluster-modify-failure-${environment} | Count | cluster_name | Cluster modification failure count |
This table describes the alerts created to monitor cluster provisioning. These alerts are intended as examples and should be tuned for your environment.

| Name | Description |
|---|---|
| unknown-zone-alert | Alerts whenever an unknown zone not defined in the cluster intent source of truth is found in the environment |
| cluster-creation-failure | Alerts when cluster creation has failed |
| cluster-modify-failure | Alerts when cluster modification has failed |
Automated retries can be configured to address intermittent build failures. To enable them, set the `cluster_creation_max_retries` Terraform variable to a value greater than 0 and less than 5. The solution tracks the number of failed builds for a zone and retries them until the count exceeds the specified maximum.
Note: If you decrease `cluster_creation_max_retries`, in-progress builds may fail to properly call the zone's signal endpoint. Be sure to manually check that any failed builds are retried. This is not a concern when increasing the value.
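For example, enabling up to 3 retries is a one-line change in `terraform.tfvars` followed by a re-apply (the variable name is taken from the inputs table below):

```bash
# Enable up to 3 automated retries for failed cluster creation builds.
echo 'cluster_creation_max_retries = 3' >> terraform.tfvars
terraform apply -var="environment=prod"
```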
| Name | Version |
|---|---|
| archive | 2.4.2 |
| google | 5.26.0 |
| random | 3.6.1 |
No modules.
| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| environment | Deployment environment. Used to build resource names to partition GCP resources if deploying multiple ACP instances into the same project. | string | "stg" | no |
| node_location | Default GDCE zone used by Cloud Build. | string | n/a | yes |
| project_id | The GCP project ID in which the solution resources will be provisioned. | string | "cloud-alchemists-sandbox" | no |
| project_id_fleet | Optional ID of the GCP project hosting the Google Kubernetes Engine (GKE) fleet or Google Distributed Cloud Edge (GDCE) machines. Defaults to the value of project_id. | string | null | no |
| project_id_secrets | Optional ID of the GCP project containing the Secret Manager entry storing Git repository credentials. Defaults to the value of project_id. | string | null | no |
| project_services | GCP service APIs (.googleapis.com) to enable for this project. | list(string) | [...] | no |
| project_services_fleet | GCP service APIs (.googleapis.com) to enable for the fleet project. | list(string) | [...] | no |
| project_services_secrets | GCP service APIs (.googleapis.com) to enable for the secrets project. | list(string) | [...] | no |
| region | GCP region to deploy resources into. | string | n/a | yes |
| source_of_truth_repo | Repository containing the source-of-truth cluster intent registry. | string | n/a | yes |
| source_of_truth_branch | Repository branch containing the source-of-truth cluster intent registry. | string | n/a | yes |
| source_of_truth_path | Path to the cluster intent registry file in the repository. | string | n/a | yes |
| git_secret_id | Git token used to authenticate with the source of truth. | string | n/a | yes |
| cluster_creation_timeout | Cloud Build timeout in seconds for cluster creation. This should account for the time to create the cluster, configure core services (ConfigSync, Robin, VMRuntime, etc.), and complete any workload configuration needed before the health checks pass. | number | 28800 | no |
| cluster_creation_max_retries | The maximum number of retries upon cluster creation failure before marking the zone state as CUSTOMER_FACTORY_TURNUP_CHECKS_FAILED. | number | 0 | no |
| default_config_sync_version | Sets a default ConfigSync version to use for provisioned clusters. If left empty, no version is specified at the cluster level, and either the fleet-configured version or the latest version of ConfigSync is installed. | string | "" | no |
| opt_in_build_messages | Opt in to sending build steps and failure messages to Google. These messages help Google provide support on issues during the provisioning process. | bool | false | no |
No outputs.
This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.