diff --git a/jobs/auctioneer/templates/indicators.yml.erb b/jobs/auctioneer/templates/indicators.yml.erb index 73496783ba..a97899ca04 100644 --- a/jobs/auctioneer/templates/indicators.yml.erb +++ b/jobs/auctioneer/templates/indicators.yml.erb @@ -1,109 +1,113 @@ --- -apiVersion: v0 - -product: - name: diego - version: latest +apiVersion: indicatorprotocol.io/v1 +kind: IndicatorDocument metadata: - deployment: <%= spec.deployment %> - -indicators: -- name: auctioneer_lrp_auctions_failed - promql: rate(AuctioneerLRPAuctionsFailed{source_id="auctioneer"}[5m]) * 60 - documentation: - title: Auctioneer - App Instance (AI) Placement Failures - description: | - The number of Long Running Process (LRP) instances that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job. - - Use: This metric can indicate that Diego is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work. - - This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled. - - This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state. - - Origin: Firehose - Type: Counter (Integer) - Frequency: During each auction - recommended_response: | - 1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error and resource constraint, you may also find a failure reason in the Cloud Controller (CC) API. - 2. Investigate the health of your Diego cells to determine if they are the resource type causing the problem. - 3. Consider scaling additional cells. - -- name: auctioneer_states_duration - promql: max_over_time(AuctioneerFetchStatesDuration{source_id="auctioneer"}[5m]) / 1000000000 - documentation: - title: Auctioneer - Time to Fetch Cell State - description: | - Time in ns that the auctioneer took to fetch state from all the Diego cells when running its auction. - - Use: Indicates how the cells themselves are performing. Alerting on this metric helps alert that app staging requests to Diego may be failing. - - Origin: Firehose - Type: Gauge, integer in ns - Frequency: During event, during each auction - recommended_response: | - 1. Check the health of the cells by reviewing the logs and looking for errors. - 2. Review IaaS console metrics. - 3. Inspect the Auctioneer logs to determine if one or more cells is taking significantly longer to fetch state than other cells. Relevant log lines will have wording like `fetched cell state`. - -- name: auctioneer_lrp_auctions_started - promql: rate(AuctioneerLRPAuctionsStarted{source_id="auctioneer"}[5m]) * 60 - documentation: - title: Auctioneer - App Instance Starts - description: | - The number of LRP instances that the auctioneer successfully placed on Diego cells. This metric is cumulative over the lifetime of the auctioneer job. - - Use: Provides a sense of running system activity levels in your environment. Can also give you a sense of how many app instances have been started over time. The provided measurement can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window. 
- - This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled. - - Origin: Firehose - Type: Counter (Integer) - Frequency: During event, during each auction - recommended_response: | - When observing a significant amount of container churn, do the following: - - 1. Look to eliminate explainable causes of temporary churn, such as a deployment or increased developer activity. - 2. If container churn appears to continue over an extended period, inspect Diego Auctioneer and BBS logs. - - When observing extended periods of high or low activity trends, scale up or down CF components as needed. - -- name: auctioneer_task_auctions_failed - promql: rate(AuctioneerTaskAuctionsFailed{source_id="auctioneer"}[5m]) * 60 - documentation: - title: Auctioneer - Task Placement Failures - description: | - The number of Tasks that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job. - - Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work. - - This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no tasks being scheduled. - - This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state. - - Origin: Firehose - Type: Counter (Float) - Frequency: During event, during each auction - recommended_response: | - In order to best determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you may also find a failure reason in the CC API. - - 1. Investigate the health of Diego cells. - 2. Consider scaling additional cells. - -- name: auctioneer_lock_held - promql: max_over_time(LockHeld{source_id="auctioneer"}[5m]) - documentation: - title: Auctioneer - Lock Held - description: | - Whether an Auctioneer instance holds the expected Auctioneer lock (in Locket). 1 means the active Auctioneer holds the lock, and 0 means the lock was lost. - - Use: This metric is complimentary to Active Locks, and it offers an Auctioneer-level version of the Locket metrics. Although it is emitted per Auctioneer instance, only 1 active lock is held by Auctioneer. Therefore, the expected value is 1. The metric may occasionally be 0 when the Auctioneer instances are performing a leader transition, but a prolonged value of 0 indicates an issue with Auctioneer. - - Origin: Firehose - Type: Gauge - Frequency: Periodically - recommended_response: | - 1. Run monit status on the instance group that the Auctioneer job is running on to check for failing processes. - 2. If there are no failing processes, then review the logs for Auctioneer. - - Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push. 
+ labels: + deployment: <%= spec.deployment %> + component: auctioneer + +spec: + product: + name: diego + version: latest + + indicators: + - name: auctioneer_lrp_auctions_failed + promql: rate(AuctioneerLRPAuctionsFailed{source_id="auctioneer"}[5m]) * 60 + documentation: + title: Auctioneer - App Instance (AI) Placement Failures + description: | + The number of Long Running Process (LRP) instances that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job. + + Use: This metric can indicate that Diego is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work. + + This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled. + + This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state. + + Origin: Firehose + Type: Counter (Integer) + Frequency: During each auction + recommended_response: | + 1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error and resource constraint, you may also find a failure reason in the Cloud Controller (CC) API. + 2. Investigate the health of your Diego cells to determine if they are the resource type causing the problem. + 3. Consider scaling additional cells. + + - name: auctioneer_states_duration + promql: max_over_time(AuctioneerFetchStatesDuration{source_id="auctioneer"}[5m]) / 1000000000 + documentation: + title: Auctioneer - Time to Fetch Cell State + description: | + Time in ns that the auctioneer took to fetch state from all the Diego cells when running its auction. + + Use: Indicates how the cells themselves are performing. Alerting on this metric helps alert that app staging requests to Diego may be failing. + + Origin: Firehose + Type: Gauge, integer in ns + Frequency: During event, during each auction + recommended_response: | + 1. Check the health of the cells by reviewing the logs and looking for errors. + 2. Review IaaS console metrics. + 3. Inspect the Auctioneer logs to determine if one or more cells is taking significantly longer to fetch state than other cells. Relevant log lines will have wording like `fetched cell state`. + + - name: auctioneer_lrp_auctions_started + promql: rate(AuctioneerLRPAuctionsStarted{source_id="auctioneer"}[5m]) * 60 + documentation: + title: Auctioneer - App Instance Starts + description: | + The number of LRP instances that the auctioneer successfully placed on Diego cells. This metric is cumulative over the lifetime of the auctioneer job. + + Use: Provides a sense of running system activity levels in your environment. Can also give you a sense of how many app instances have been started over time. The provided measurement can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window. + + This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled. 
+ + Origin: Firehose + Type: Counter (Integer) + Frequency: During event, during each auction + recommended_response: | + When observing a significant amount of container churn, do the following: + + 1. Look to eliminate explainable causes of temporary churn, such as a deployment or increased developer activity. + 2. If container churn appears to continue over an extended period, inspect Diego Auctioneer and BBS logs. + + When observing extended periods of high or low activity trends, scale up or down CF components as needed. + + - name: auctioneer_task_auctions_failed + promql: rate(AuctioneerTaskAuctionsFailed{source_id="auctioneer"}[5m]) * 60 + documentation: + title: Auctioneer - Task Placement Failures + description: | + The number of Tasks that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job. + + Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work. + + This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no tasks being scheduled. + + This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state. + + Origin: Firehose + Type: Counter (Float) + Frequency: During event, during each auction + recommended_response: | + To best determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you may also find a failure reason in the CC API. + + 1. Investigate the health of Diego cells. + 2. Consider scaling additional cells. + + - name: auctioneer_lock_held + promql: max_over_time(LockHeld{source_id="auctioneer"}[5m]) + documentation: + title: Auctioneer - Lock Held + description: | + Whether an Auctioneer instance holds the expected Auctioneer lock (in Locket). 1 means the active Auctioneer holds the lock, and 0 means the lock was lost. + + Use: This metric is complementary to Active Locks, and it offers an Auctioneer-level version of the Locket metrics. Although it is emitted per Auctioneer instance, only 1 active lock is held by Auctioneer. Therefore, the expected value is 1. The metric may occasionally be 0 when the Auctioneer instances are performing a leader transition, but a prolonged value of 0 indicates an issue with Auctioneer. + + Origin: Firehose + Type: Gauge + Frequency: Periodically + recommended_response: | + 1. Run monit status on the instance group that the Auctioneer job is running on to check for failing processes. + 2. If there are no failing processes, then review the logs for Auctioneer. + - Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push.
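For reference, this is a minimal sketch of what the reworked v1 header renders to once BOSH evaluates the ERB template; the deployment value is hypothetical (BOSH substitutes the actual deployment name for <%= spec.deployment %>) and only the first indicator is shown.

apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument
metadata:
  labels:
    deployment: cf-example   # hypothetical; rendered from <%= spec.deployment %>
    component: auctioneer
spec:
  product:
    name: diego
    version: latest
  indicators:
  - name: auctioneer_lrp_auctions_failed
    promql: rate(AuctioneerLRPAuctionsFailed{source_id="auctioneer"}[5m]) * 60
    documentation:
      title: Auctioneer - App Instance (AI) Placement Failures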
diff --git a/jobs/bbs/templates/indicators.yml.erb b/jobs/bbs/templates/indicators.yml.erb index 51c9af5755..6bb70930a8 100644 --- a/jobs/bbs/templates/indicators.yml.erb +++ b/jobs/bbs/templates/indicators.yml.erb @@ -1,148 +1,152 @@ --- -apiVersion: v0 - -product: - name: diego - version: latest +apiVersion: indicatorprotocol.io/v1 +kind: IndicatorDocument metadata: - deployment: <%= spec.deployment %> - -indicators: -- name: convergence_lrp_duration - promql: max_over_time(ConvergenceLRPDuration{source_id="bbs"}[15m]) / 1000000000 - documentation: - title: BBS - Time to Run LRP Convergence - description: | - Time in ns that the BBS took to run its LRP convergence pass. - - Use: If the convergence run begins taking too long, apps or Tasks may be crashing without restarting. This symptom can also indicate loss of connectivity to the BBS database. - - Origin: Firehose - Type: Gauge (Integer in ns) - Frequency: 30 s - recommended_response: | - 1. Check BBS logs for errors. - 2. Try vertically scaling the BBS VM resources up. For example, add more CPUs or memory depending on its system.cpu/system.memory metrics. - 3. Consider vertically scaling the backing database, if system.cpu and system.memory metrics for the database instances are high. - -- name: request_latency - promql: avg_over_time(RequestLatency{source_id="bbs"}[15m]) / 1000000000 - documentation: - title: BBS - Time to Handle Requests - description: | - The maximum observed latency time over the past 60 seconds that the BBS took to handle requests across all its API endpoints. - - Diego is now aggregating this metric to emit the max value observed over 60 seconds. - - Use: If this metric rises, the BBS API is slowing. Response to certain operations is slow if request latency is high. - - Origin: Firehose - Type: Gauge (Integer in ns) - Frequency: 60 s - recommended_response: | - 1. Check CPU and memory statistics. - 2. Check BBS logs for faults and errors that can indicate issues with BBS. - 3. Try scaling the BBS VM resources up. For example, add more CPUs/memory depending on its system.cpu/system.memory metrics. - 4. Consider vertically scaling the backing database, if system.cpu and system.memory metrics for the database instances are high. - -- name: lrps_extra - promql: avg_over_time(LRPsExtra{source_id="bbs"}[5m]) - documentation: - title: BBS - More App Instances Than Expected - description: | - Total number of LRP instances that are no longer desired but still have a BBS record. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs. LRPsExtra is the total number of LRP instances that are no longer desired but still have a BBS record. - - Use: If Diego has more LRPs running than expected, there may be problems with the BBS. - - Deleting an app with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsExtra is unusual and should be investigated. - - Origin: Firehose - Type: Gauge (Float) - Frequency: 30 s - recommended_response: | - 1. Review the BBS logs for proper operation or errors, looking for detailed error messages. - 2. Check the Domain freshness. - -- name: lrps_missing - promql: avg_over_time(LRPsMissing{source_id="bbs"}[5m]) - documentation: - title: BBS - Fewer App Instances Than Expected - description: | - Total number of LRP instances that are desired but have no record in the BBS. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs. 
LRPsMissing is the total number of LRP instances that are desired but have no BBS record. - - Use: If Diego has less LRP running than expected, there may be problems with the BBS. - - An app push with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsMissing is unusual and should be investigated. - - Origin: Firehose - Type: Gauge (Float) - Frequency: 30 s - recommended_response: | - 1. Review the BBS logs for proper operation or errors, looking for detailed error messages. - 2. Check the Domain freshness. - -- name: crashed_actual_lrps - promql: avg_over_time(CrashedActualLRPs{source_id="bbs"}[5m]) - documentation: - title: BBS - Crashed App Instances - description: | - Total number of LRP instances that have crashed. - - Use: Indicates how many instances in the deployment are in a crashed state. An increase in bbs.CrashedActualLRPs can indicate several problems, from a bad app with many instances associated, to a platform issue that is resulting in app crashes. Use this metric to help create a baseline for your deployment. After you have a baseline, you can create a deployment-specific alert to notify of a spike in crashes above the trend line. Tune alert values to your deployment. - - Origin: Firehose - Type: Gauge (Float) - Frequency: 30 s - recommended_response: | - 1. Look at the BBS logs for apps that are crashing and at the cell logs to see if the problem is with the apps themselves, rather than a platform issue. - -- name: lrps_running - promql: avg_over_time(LRPsRunning{source_id="bbs"}[1h]) - avg_over_time(LRPsRunning{source_id="bbs"}[1h] offset 1h) - documentation: - title: BBS - Running App Instances, Rate of Change - description: | - Rate of change in app instances being started or stopped on the platform. It is derived from bbs.LRPsRunning and represents the total number of LRP instances that are running on Diego cells. - - Use: Delta reflects upward or downward trend for app instances started or stopped. Helps to provide a picture of the overall growth trend of the environment for capacity planning. You may want to alert on delta values outside of the expected range. - - Origin: Firehose - Type: Gauge (Float) - Frequency: During event, emission should be constant on a running deployment. - recommended_response: | - 1. Scale components as necessary. - -- name: bbs_lock_held - promql: max_over_time(LockHeld{source_id="bbs"}[5m]) - documentation: - title: BBS - Lock Held - description: | - Whether a BBS instance holds the expected BBS lock (in Locket). 1 means the active BBS server holds the lock, and 0 means the lock was lost. - - Use: This metric is complimentary to Active Locks, and it offers a BBS-level version of the Locket metrics. Although it is emitted per BBS instance, only 1 active lock is held by BBS. Therefore, the expected value is 1. The metric may occasionally be 0 when the BBS instances are performing a leader transition, but a prolonged value of 0 indicates an issue with BBS. - - Origin: Firehose - Type: Gauge - Frequency: Periodically - recommended_response: | - 1. Run monit status on the instance group that the BBS job is running on to check for failing processes. - 2. If there are no failing processes, then review the logs for BBS. - - A healthy BBS shows obvious activity around starting or claiming LRPs. - - An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer. 
- -- name: domain_cf_apps - promql: max_over_time(Domain_cf_apps{source_id="bbs"}[5m]) - documentation: - title: BBS - Cloud Controller and Diego in Sync - description: | - Indicates if the cf-apps Domain is up-to-date, meaning that CF App requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution. - - 1 means cf-apps Domain is up-to-date - - No data received means cf-apps Domain is not up-to-date - - Use: If the cf-apps Domain does not stay up-to-date, changes requested in the Cloud Controller are not guaranteed to propagate throughout the system. If the Cloud Controller and Diego are out of sync, then apps running could vary from those desired. - - Origin: Firehose - Type: Gauge (Float) - Frequency: 30 s - recommended_response: | - 1. Check the BBS and Clock Global (Cloud Controller clock) logs. + labels: + deployment: <%= spec.deployment %> + component: bbs + +spec: + product: + name: diego + version: latest + + indicators: + - name: convergence_lrp_duration + promql: max_over_time(ConvergenceLRPDuration{source_id="bbs"}[15m]) / 1000000000 + documentation: + title: BBS - Time to Run LRP Convergence + description: | + Time in ns that the BBS took to run its LRP convergence pass. + + Use: If the convergence run begins taking too long, apps or Tasks may be crashing without restarting. This symptom can also indicate loss of connectivity to the BBS database. + + Origin: Firehose + Type: Gauge (Integer in ns) + Frequency: 30 s + recommended_response: | + 1. Check BBS logs for errors. + 2. Try vertically scaling the BBS VM resources up. For example, add more CPUs or memory depending on its system.cpu/system.memory metrics. + 3. Consider vertically scaling the backing database, if system.cpu and system.memory metrics for the database instances are high. + + - name: request_latency + promql: avg_over_time(RequestLatency{source_id="bbs"}[15m]) / 1000000000 + documentation: + title: BBS - Time to Handle Requests + description: | + The maximum observed latency time over the past 60 seconds that the BBS took to handle requests across all its API endpoints. + + Diego is now aggregating this metric to emit the max value observed over 60 seconds. + + Use: If this metric rises, the BBS API is slowing. Response to certain operations is slow if request latency is high. + + Origin: Firehose + Type: Gauge (Integer in ns) + Frequency: 60 s + recommended_response: | + 1. Check CPU and memory statistics. + 2. Check BBS logs for faults and errors that can indicate issues with BBS. + 3. Try scaling the BBS VM resources up. For example, add more CPUs/memory depending on its system.cpu/system.memory metrics. + 4. Consider vertically scaling the backing database, if system.cpu and system.memory metrics for the database instances are high. + + - name: lrps_extra + promql: avg_over_time(LRPsExtra{source_id="bbs"}[5m]) + documentation: + title: BBS - More App Instances Than Expected + description: | + Total number of LRP instances that are no longer desired but still have a BBS record. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs. LRPsExtra is the total number of LRP instances that are no longer desired but still have a BBS record. + + Use: If Diego has more LRPs running than expected, there may be problems with the BBS. + + Deleting an app with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsExtra is unusual and should be investigated. 
+ + Origin: Firehose + Type: Gauge (Float) + Frequency: 30 s + recommended_response: | + 1. Review the BBS logs for proper operation or errors, looking for detailed error messages. + 2. Check the Domain freshness. + + - name: lrps_missing + promql: avg_over_time(LRPsMissing{source_id="bbs"}[5m]) + documentation: + title: BBS - Fewer App Instances Than Expected + description: | + Total number of LRP instances that are desired but have no record in the BBS. When Diego wants to add more apps, the BBS sends a request to the auctioneer to spin up additional LRPs. LRPsMissing is the total number of LRP instances that are desired but have no BBS record. + + Use: If Diego has fewer LRPs running than expected, there may be problems with the BBS. + + An app push with many instances can temporarily spike this metric. However, a sustained spike in bbs.LRPsMissing is unusual and should be investigated. + + Origin: Firehose + Type: Gauge (Float) + Frequency: 30 s + recommended_response: | + 1. Review the BBS logs for proper operation or errors, looking for detailed error messages. + 2. Check the Domain freshness. + + - name: crashed_actual_lrps + promql: avg_over_time(CrashedActualLRPs{source_id="bbs"}[5m]) + documentation: + title: BBS - Crashed App Instances + description: | + Total number of LRP instances that have crashed. + + Use: Indicates how many instances in the deployment are in a crashed state. An increase in bbs.CrashedActualLRPs can indicate several problems, from a bad app with many instances associated, to a platform issue that is resulting in app crashes. Use this metric to help create a baseline for your deployment. After you have a baseline, you can create a deployment-specific alert to notify of a spike in crashes above the trend line. Tune alert values to your deployment. + + Origin: Firehose + Type: Gauge (Float) + Frequency: 30 s + recommended_response: | + 1. Look at the BBS logs for apps that are crashing and at the cell logs to see if the problem is with the apps themselves, rather than a platform issue. + + - name: lrps_running + promql: avg_over_time(LRPsRunning{source_id="bbs"}[1h]) - avg_over_time(LRPsRunning{source_id="bbs"}[1h] offset 1h) + documentation: + title: BBS - Running App Instances, Rate of Change + description: | + Rate of change in app instances being started or stopped on the platform. It is derived from bbs.LRPsRunning and represents the total number of LRP instances that are running on Diego cells. + + Use: Delta reflects upward or downward trend for app instances started or stopped. Helps to provide a picture of the overall growth trend of the environment for capacity planning. You may want to alert on delta values outside of the expected range. + + Origin: Firehose + Type: Gauge (Float) + Frequency: During event, emission should be constant on a running deployment. + recommended_response: | + 1. Scale components as necessary. + + - name: bbs_lock_held + promql: max_over_time(LockHeld{source_id="bbs"}[5m]) + documentation: + title: BBS - Lock Held + description: | + Whether a BBS instance holds the expected BBS lock (in Locket). 1 means the active BBS server holds the lock, and 0 means the lock was lost. + + Use: This metric is complementary to Active Locks, and it offers a BBS-level version of the Locket metrics. Although it is emitted per BBS instance, only 1 active lock is held by BBS. Therefore, the expected value is 1.
The metric may occasionally be 0 when the BBS instances are performing a leader transition, but a prolonged value of 0 indicates an issue with BBS. + + Origin: Firehose + Type: Gauge + Frequency: Periodically + recommended_response: | + 1. Run monit status on the instance group that the BBS job is running on to check for failing processes. + 2. If there are no failing processes, then review the logs for BBS. + - A healthy BBS shows obvious activity around starting or claiming LRPs. + - An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer. + + - name: domain_cf_apps + promql: max_over_time(Domain_cf_apps{source_id="bbs"}[5m]) + documentation: + title: BBS - Cloud Controller and Diego in Sync + description: | + Indicates if the cf-apps Domain is up-to-date, meaning that CF App requests from Cloud Controller are synchronized to bbs.LRPsDesired (Diego-desired AIs) for execution. + - 1 means cf-apps Domain is up-to-date + - No data received means cf-apps Domain is not up-to-date + + Use: If the cf-apps Domain does not stay up-to-date, changes requested in the Cloud Controller are not guaranteed to propagate throughout the system. If the Cloud Controller and Diego are out of sync, then apps running could vary from those desired. + + Origin: Firehose + Type: Gauge (Float) + Frequency: 30 s + recommended_response: | + 1. Check the BBS and Clock Global (Cloud Controller clock) logs. diff --git a/jobs/locket/templates/indicators.yml.erb b/jobs/locket/templates/indicators.yml.erb index 8e41862fa1..889f94c906 100644 --- a/jobs/locket/templates/indicators.yml.erb +++ b/jobs/locket/templates/indicators.yml.erb @@ -1,54 +1,58 @@ --- -apiVersion: v0 - -product: - name: diego - version: latest +apiVersion: indicatorprotocol.io/v1 +kind: IndicatorDocument metadata: - deployment: <%= spec.deployment %> - -indicators: -- name: locket_active_locks - promql: max_over_time(ActiveLocks{source_id="locket"}[5m]) - documentation: - title: Locket - Active Locks - description: | - Total count of how many locks the system components are holding. - - Use: If the ActiveLocks count is not equal to the expected value, there is likely a problem with Diego. - - Origin: Firehose - Type: Gauge - Frequency: 60s - recommended_response: | - 1. Run monit status to inspect for failing processes. - 2. If there are no failing processes, then review the logs for the components using the Locket service: BBS, Auctioneer, TPS Watcher, Routing API, and Clock Global (Cloud Controller clock). Look for indications that only one of each component is active at a time. - 3. Focus triage on the BBS first: - A healthy BBS shows obvious activity around starting or claiming LRPs. - An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer. - Reference the BBS-level Locket metric Locks Held by BBS. A value of 0 indicates Locket issues at the BBS level. - 4. If the BBS appears healthy, then check the Auctioneer to ensure it is processing auction payloads. - Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push. - Reference the Auctioneer-level Locket metric Locks Held by Auctioneer. A value of 0 indicates Locket issues at the Auctioneer level. - 5. The TPS Watcher is primarily active when app instances crash. 
Therefore, if the TPS Watcher is suspected, review the most recent logs. - -- name: locket_active_presences - promql: max_over_time(ActivePresences{source_id="locket"}[15m]) - documentation: - title: Locket - Active Presences - description: | - Total count of active presences. Presences are defined as the registration records that the cells maintain to advertise themselves to the platform. - - Use: If the Active Presences count is far from the expected, there might be a problem with Diego. - - The number of active presences varies according to the number of cells deployed. Therefore, during purposeful scale adjustments, this alerting threshold should be adjusted. - Establish an initial threshold by observing the historical trends for the deployment over a brief period of time, Increase the threshold as more cells are deployed. During a rolling deploy, this metric shows variance during the BOSH lifecycle when cells are evacuated and restarted. Tolerable variance is within the bounds of the BOSH max inflight range for the instance group. - - Origin: Firehose - Type: Gauge - Frequency: 60s - recommended_response: | - 1. Ensure that the variance is not the result of an active rolling deploy. Also ensure that the alert threshold is appropriate to the number of cells in the current deployment. - 2. Run monit status to inspect for failing processes. - 3. If there are no failing processes, then review the logs for the components using the Locket service itself on Diego BBS instances. + labels: + deployment: <%= spec.deployment %> + component: locket + +spec: + product: + name: diego + version: latest + + indicators: + - name: locket_active_locks + promql: max_over_time(ActiveLocks{source_id="locket"}[5m]) + documentation: + title: Locket - Active Locks + description: | + Total count of how many locks the system components are holding. + + Use: If the ActiveLocks count is not equal to the expected value, there is likely a problem with Diego. + + Origin: Firehose + Type: Gauge + Frequency: 60s + recommended_response: | + 1. Run monit status to inspect for failing processes. + 2. If there are no failing processes, then review the logs for the components using the Locket service: BBS, Auctioneer, TPS Watcher, Routing API, and Clock Global (Cloud Controller clock). Look for indications that only one of each component is active at a time. + 3. Focus triage on the BBS first: + A healthy BBS shows obvious activity around starting or claiming LRPs. + An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer. + Reference the BBS-level Locket metric Locks Held by BBS. A value of 0 indicates Locket issues at the BBS level. + 4. If the BBS appears healthy, then check the Auctioneer to ensure it is processing auction payloads. + Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push. + Reference the Auctioneer-level Locket metric Locks Held by Auctioneer. A value of 0 indicates Locket issues at the Auctioneer level. + 5. The TPS Watcher is primarily active when app instances crash. Therefore, if the TPS Watcher is suspected, review the most recent logs. 
+ + - name: locket_active_presences + promql: max_over_time(ActivePresences{source_id="locket"}[15m]) + documentation: + title: Locket - Active Presences + description: | + Total count of active presences. Presences are defined as the registration records that the cells maintain to advertise themselves to the platform. + + Use: If the Active Presences count is far from the expected, there might be a problem with Diego. + + The number of active presences varies according to the number of cells deployed. Therefore, during purposeful scale adjustments, this alerting threshold should be adjusted. + Establish an initial threshold by observing the historical trends for the deployment over a brief period of time, Increase the threshold as more cells are deployed. During a rolling deploy, this metric shows variance during the BOSH lifecycle when cells are evacuated and restarted. Tolerable variance is within the bounds of the BOSH max inflight range for the instance group. + + Origin: Firehose + Type: Gauge + Frequency: 60s + recommended_response: | + 1. Ensure that the variance is not the result of an active rolling deploy. Also ensure that the alert threshold is appropriate to the number of cells in the current deployment. + 2. Run monit status to inspect for failing processes. + 3. If there are no failing processes, then review the logs for the components using the Locket service itself on Diego BBS instances. diff --git a/jobs/rep/templates/indicators.yml.erb b/jobs/rep/templates/indicators.yml.erb index 66405a9730..8b49c05352 100644 --- a/jobs/rep/templates/indicators.yml.erb +++ b/jobs/rep/templates/indicators.yml.erb @@ -1,84 +1,88 @@ --- -apiVersion: v0 - -product: - name: diego - version: latest +apiVersion: indicatorprotocol.io/v1 +kind: IndicatorDocument metadata: - deployment: <%= spec.deployment %> - -indicators: -- name: capacity_remaining_memory - promql: min_over_time(CapacityRemainingMemory{source_id="rep"}[5m]) / 1024 - documentation: - title: Diego Cell - Remaining Memory Available - Overall Remaining Memory Available - description: | - Remaining amount of memory in MiB available for this Diego cell to allocate to containers. - - Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning. - - Origin: Firehose - Type: Gauge (Integer in MiB) - Frequency: 60 s - recommended_response: | - 1. Assign more resources to the cells - 2. Assign more cells. - -- name: capacity_remaining_disk - promql: min_over_time(CapacityRemainingDisk{source_id="rep"}[5m]) / 1024 - documentation: - title: Diego Cell - Remaining Disk Available - Overall Remaining Disk Available - description: | - Remaining amount of disk in MiB available for this Diego cell to allocate to containers. - - Use: Low disk capacity can prevent app scaling and new deployments. Because Diego staging Tasks can fail without at least 6 GB free, the recommended red threshold is based on the minimum disk capacity across the deployment falling below 6 GB in the previous 5 minutes. - - It can also be meaningful to assess how many chunks of free disk space are above a given threshold, similar to rep.CapacityRemainingMemory. - - Origin: Firehose - Type: Gauge (Integer in MiB) - Frequency: 60 s - recommended_response: | - 1. Assign more resources to the cells. - 2. Assign more cells. 
- -- name: garden_health_check_failed - promql: max_over_time(GardenHealthCheckFailed{source_id="rep"}[5m]) - documentation: - title: Diego Cell - Garden Healthcheck Failed - description: | - The Diego cell periodically checks its health against the garden backend. For Diego cells, 0 means healthy, and 1 means unhealthy. - - Use: Set an alert for further investigation if multiple unhealthy Diego cells are detected in the given time window. If one cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple cells are impacted, this can indicate a larger problem with Diego, and should be considered a more critical investigation need. - - Suggested alert threshold based on multiple unhealthy cells in the given time window. - - Although end-user impact is usually low if only one cell is impacted, this should still be investigated. Particularly in a lower capacity environment, this situation could result in negative end-user impact if left unresolved. - - Origin: Firehose - Type: Gauge (Float, 0-1) - Frequency: 30 s - recommended_response: | - 1. Investigate Diego cell servers for faults and errors. - 2. If a particular cell or cells appear problematic: - a. Determine a time interval during which the metrics from the cell changed from healthy to unhealthy. - b. Pull the logs that the cell generated over that interval. The Cell ID is the same as the BOSH instance ID. - c. Pull the BBS logs over that same time interval. - 3. As a last resort, it sometimes helps to recreate the cell by running bosh recreate. See the BOSH documentation for bosh recreate command syntax. - -- name: rep_bulk_sync_duration - promql: max_over_time(RepBulkSyncDuration{source_id="rep"}[15m]) / 1000000000 - documentation: - title: Diego Cell - Time to Sync - description: | - Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual garden containers. - - Use: Sync times that are too high can indicate issues with the BBS. - - Origin: Firehose - Type: Gauge (Float in ns) - Frequency: 30 s - recommended_response: | - 1. Investigate BBS logs for faults and errors. - 2. If a particular cell or cells appear problematic, investigate logs for the cells. + labels: + deployment: <%= spec.deployment %> + component: rep + +spec: + product: + name: diego + version: latest + + indicators: + - name: capacity_remaining_memory + promql: min_over_time(CapacityRemainingMemory{source_id="rep"}[5m]) / 1024 + documentation: + title: Diego Cell - Remaining Memory Available - Overall Remaining Memory Available + description: | + Remaining amount of memory in MiB available for this Diego cell to allocate to containers. + + Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning. + + Origin: Firehose + Type: Gauge (Integer in MiB) + Frequency: 60 s + recommended_response: | + 1. Assign more resources to the cells + 2. Assign more cells. + + - name: capacity_remaining_disk + promql: min_over_time(CapacityRemainingDisk{source_id="rep"}[5m]) / 1024 + documentation: + title: Diego Cell - Remaining Disk Available - Overall Remaining Disk Available + description: | + Remaining amount of disk in MiB available for this Diego cell to allocate to containers. + + Use: Low disk capacity can prevent app scaling and new deployments. 
Because Diego staging Tasks can fail without at least 6 GB free, the recommended red threshold is based on the minimum disk capacity across the deployment falling below 6 GB in the previous 5 minutes. + + It can also be meaningful to assess how many chunks of free disk space are above a given threshold, similar to rep.CapacityRemainingMemory. + + Origin: Firehose + Type: Gauge (Integer in MiB) + Frequency: 60 s + recommended_response: | + 1. Assign more resources to the cells. + 2. Assign more cells. + + - name: garden_health_check_failed + promql: max_over_time(GardenHealthCheckFailed{source_id="rep"}[5m]) + documentation: + title: Diego Cell - Garden Healthcheck Failed + description: | + The Diego cell periodically checks its health against the garden backend. For Diego cells, 0 means healthy, and 1 means unhealthy. + + Use: Set an alert for further investigation if multiple unhealthy Diego cells are detected in the given time window. If one cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple cells are impacted, this can indicate a larger problem with Diego, and should be considered a more critical investigation need. + + Suggested alert threshold based on multiple unhealthy cells in the given time window. + + Although end-user impact is usually low if only one cell is impacted, this should still be investigated. Particularly in a lower capacity environment, this situation could result in negative end-user impact if left unresolved. + + Origin: Firehose + Type: Gauge (Float, 0-1) + Frequency: 30 s + recommended_response: | + 1. Investigate Diego cell servers for faults and errors. + 2. If a particular cell or cells appear problematic: + a. Determine a time interval during which the metrics from the cell changed from healthy to unhealthy. + b. Pull the logs that the cell generated over that interval. The Cell ID is the same as the BOSH instance ID. + c. Pull the BBS logs over that same time interval. + 3. As a last resort, it sometimes helps to recreate the cell by running bosh recreate. See the BOSH documentation for bosh recreate command syntax. + + - name: rep_bulk_sync_duration + promql: max_over_time(RepBulkSyncDuration{source_id="rep"}[15m]) / 1000000000 + documentation: + title: Diego Cell - Time to Sync + description: | + Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual garden containers. + + Use: Sync times that are too high can indicate issues with the BBS. + + Origin: Firehose + Type: Gauge (Float in ns) + Frequency: 30 s + recommended_response: | + 1. Investigate BBS logs for faults and errors. + 2. If a particular cell or cells appear problematic, investigate logs for the cells. diff --git a/jobs/rep_windows/templates/indicators.yml.erb b/jobs/rep_windows/templates/indicators.yml.erb index 66405a9730..be9090a67c 100644 --- a/jobs/rep_windows/templates/indicators.yml.erb +++ b/jobs/rep_windows/templates/indicators.yml.erb @@ -1,84 +1,88 @@ --- -apiVersion: v0 - -product: - name: diego - version: latest +apiVersion: indicatorprotocol.io/v1 +kind: IndicatorDocument metadata: - deployment: <%= spec.deployment %> - -indicators: -- name: capacity_remaining_memory - promql: min_over_time(CapacityRemainingMemory{source_id="rep"}[5m]) / 1024 - documentation: - title: Diego Cell - Remaining Memory Available - Overall Remaining Memory Available - description: | - Remaining amount of memory in MiB available for this Diego cell to allocate to containers. 
- - Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning. - - Origin: Firehose - Type: Gauge (Integer in MiB) - Frequency: 60 s - recommended_response: | - 1. Assign more resources to the cells - 2. Assign more cells. - -- name: capacity_remaining_disk - promql: min_over_time(CapacityRemainingDisk{source_id="rep"}[5m]) / 1024 - documentation: - title: Diego Cell - Remaining Disk Available - Overall Remaining Disk Available - description: | - Remaining amount of disk in MiB available for this Diego cell to allocate to containers. - - Use: Low disk capacity can prevent app scaling and new deployments. Because Diego staging Tasks can fail without at least 6 GB free, the recommended red threshold is based on the minimum disk capacity across the deployment falling below 6 GB in the previous 5 minutes. - - It can also be meaningful to assess how many chunks of free disk space are above a given threshold, similar to rep.CapacityRemainingMemory. - - Origin: Firehose - Type: Gauge (Integer in MiB) - Frequency: 60 s - recommended_response: | - 1. Assign more resources to the cells. - 2. Assign more cells. - -- name: garden_health_check_failed - promql: max_over_time(GardenHealthCheckFailed{source_id="rep"}[5m]) - documentation: - title: Diego Cell - Garden Healthcheck Failed - description: | - The Diego cell periodically checks its health against the garden backend. For Diego cells, 0 means healthy, and 1 means unhealthy. - - Use: Set an alert for further investigation if multiple unhealthy Diego cells are detected in the given time window. If one cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple cells are impacted, this can indicate a larger problem with Diego, and should be considered a more critical investigation need. - - Suggested alert threshold based on multiple unhealthy cells in the given time window. - - Although end-user impact is usually low if only one cell is impacted, this should still be investigated. Particularly in a lower capacity environment, this situation could result in negative end-user impact if left unresolved. - - Origin: Firehose - Type: Gauge (Float, 0-1) - Frequency: 30 s - recommended_response: | - 1. Investigate Diego cell servers for faults and errors. - 2. If a particular cell or cells appear problematic: - a. Determine a time interval during which the metrics from the cell changed from healthy to unhealthy. - b. Pull the logs that the cell generated over that interval. The Cell ID is the same as the BOSH instance ID. - c. Pull the BBS logs over that same time interval. - 3. As a last resort, it sometimes helps to recreate the cell by running bosh recreate. See the BOSH documentation for bosh recreate command syntax. - -- name: rep_bulk_sync_duration - promql: max_over_time(RepBulkSyncDuration{source_id="rep"}[15m]) / 1000000000 - documentation: - title: Diego Cell - Time to Sync - description: | - Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual garden containers. - - Use: Sync times that are too high can indicate issues with the BBS. - - Origin: Firehose - Type: Gauge (Float in ns) - Frequency: 30 s - recommended_response: | - 1. Investigate BBS logs for faults and errors. - 2. 
If a particular cell or cells appear problematic, investigate logs for the cells. + labels: + deployment: <%= spec.deployment %> + component: rep_windows + +spec: + product: + name: diego + version: latest + + indicators: + - name: capacity_remaining_memory + promql: min_over_time(CapacityRemainingMemory{source_id="rep"}[5m]) / 1024 + documentation: + title: Diego Cell - Remaining Memory Available - Overall Remaining Memory Available + description: | + Remaining amount of memory in MiB available for this Diego cell to allocate to containers. + + Use: Can indicate low memory capacity overall in the platform. Low memory can prevent app scaling and new deployments. The overall sum of capacity can indicate that you need to scale the platform. Observing capacity consumption trends over time helps with capacity planning. + + Origin: Firehose + Type: Gauge (Integer in MiB) + Frequency: 60 s + recommended_response: | + 1. Assign more resources to the cells + 2. Assign more cells. + + - name: capacity_remaining_disk + promql: min_over_time(CapacityRemainingDisk{source_id="rep"}[5m]) / 1024 + documentation: + title: Diego Cell - Remaining Disk Available - Overall Remaining Disk Available + description: | + Remaining amount of disk in MiB available for this Diego cell to allocate to containers. + + Use: Low disk capacity can prevent app scaling and new deployments. Because Diego staging Tasks can fail without at least 6 GB free, the recommended red threshold is based on the minimum disk capacity across the deployment falling below 6 GB in the previous 5 minutes. + + It can also be meaningful to assess how many chunks of free disk space are above a given threshold, similar to rep.CapacityRemainingMemory. + + Origin: Firehose + Type: Gauge (Integer in MiB) + Frequency: 60 s + recommended_response: | + 1. Assign more resources to the cells. + 2. Assign more cells. + + - name: garden_health_check_failed + promql: max_over_time(GardenHealthCheckFailed{source_id="rep"}[5m]) + documentation: + title: Diego Cell - Garden Healthcheck Failed + description: | + The Diego cell periodically checks its health against the garden backend. For Diego cells, 0 means healthy, and 1 means unhealthy. + + Use: Set an alert for further investigation if multiple unhealthy Diego cells are detected in the given time window. If one cell is impacted, it does not participate in auctions, but end-user impact is usually low. If multiple cells are impacted, this can indicate a larger problem with Diego, and should be considered a more critical investigation need. + + Suggested alert threshold based on multiple unhealthy cells in the given time window. + + Although end-user impact is usually low if only one cell is impacted, this should still be investigated. Particularly in a lower capacity environment, this situation could result in negative end-user impact if left unresolved. + + Origin: Firehose + Type: Gauge (Float, 0-1) + Frequency: 30 s + recommended_response: | + 1. Investigate Diego cell servers for faults and errors. + 2. If a particular cell or cells appear problematic: + a. Determine a time interval during which the metrics from the cell changed from healthy to unhealthy. + b. Pull the logs that the cell generated over that interval. The Cell ID is the same as the BOSH instance ID. + c. Pull the BBS logs over that same time interval. + 3. As a last resort, it sometimes helps to recreate the cell by running bosh recreate. See the BOSH documentation for bosh recreate command syntax. 
+ + - name: rep_bulk_sync_duration + promql: max_over_time(RepBulkSyncDuration{source_id="rep"}[15m]) / 1000000000 + documentation: + title: Diego Cell - Time to Sync + description: | + Time in ns that the Diego Cell Rep took to sync the ActualLRPs that it claimed with its actual garden containers. + + Use: Sync times that are too high can indicate issues with the BBS. + + Origin: Firehose + Type: Gauge (Float in ns) + Frequency: 30 s + recommended_response: | + 1. Investigate BBS logs for faults and errors. + 2. If a particular cell or cells appear problematic, investigate logs for the cells. diff --git a/jobs/route_emitter/templates/indicators.yml.erb b/jobs/route_emitter/templates/indicators.yml.erb index 938926261e..3d38038932 100644 --- a/jobs/route_emitter/templates/indicators.yml.erb +++ b/jobs/route_emitter/templates/indicators.yml.erb @@ -1,28 +1,32 @@ --- -apiVersion: v0 - -product: - name: diego - version: latest +apiVersion: indicatorprotocol.io/v1 +kind: IndicatorDocument metadata: - deployment: <%= spec.deployment %> + labels: + deployment: <%= spec.deployment %> + component: route_emitter + +spec: + product: + name: diego + version: latest -indicators: -- name: route_emitter_sync_duration - promql: max_over_time(RouteEmitterSyncDuration{source_id="route_emitter"}[15m]) / 1000000000 - documentation: - title: Route Emitter - Sync Duration - description: | - Time in ns that the active Route Emitter took to perform its synchronization pass. + indicators: + - name: route_emitter_sync_duration + promql: max_over_time(RouteEmitterSyncDuration{source_id="route_emitter"}[15m]) / 1000000000 + documentation: + title: Route Emitter - Sync Duration + description: | + Time in ns that the active Route Emitter took to perform its synchronization pass. - Use: Increases in this metric indicate that the Route Emitter may have trouble maintaining an accurate routing table to broadcast to the Gorouters. Tune alerting values to your deployment based on historical data and adjust based on observations over time. The suggested starting point is ≥ 5 for the yellow threshold and ≥ 10 for the critical threshold. + Use: Increases in this metric indicate that the Route Emitter may have trouble maintaining an accurate routing table to broadcast to the Gorouters. Tune alerting values to your deployment based on historical data and adjust based on observations over time. The suggested starting point is ≥ 5 for the yellow threshold and ≥ 10 for the critical threshold. - Origin: Firehose - Type: Gauge (Float in ns) - Frequency: 60s - recommended_response: | - If all or many jobs showing as impacted, there is likely an issue with Diego. - 1. Investigate the Route Emitter and Diego BBS logs for errors. - 2. Verify that app routes are functional by making a request to an app, pushing an app and pinging it, or if applicable, checking that your smoke tests have passed. - If one or a few jobs showing as impacted, there is likely a connectivity issue and the impacted job should be investigated further. + Origin: Firehose + Type: Gauge (Float in ns) + Frequency: 60s + recommended_response: | + If all or many jobs showing as impacted, there is likely an issue with Diego. + 1. Investigate the Route Emitter and Diego BBS logs for errors. + 2. Verify that app routes are functional by making a request to an app, pushing an app and pinging it, or if applicable, checking that your smoke tests have passed. 
+ If one or a few jobs showing as impacted, there is likely a connectivity issue and the impacted job should be investigated further. diff --git a/jobs/route_emitter_windows/templates/indicators.yml.erb b/jobs/route_emitter_windows/templates/indicators.yml.erb index 938926261e..55aced6d09 100644 --- a/jobs/route_emitter_windows/templates/indicators.yml.erb +++ b/jobs/route_emitter_windows/templates/indicators.yml.erb @@ -1,28 +1,32 @@ --- -apiVersion: v0 - -product: - name: diego - version: latest +apiVersion: indicatorprotocol.io/v1 +kind: IndicatorDocument metadata: - deployment: <%= spec.deployment %> + labels: + deployment: <%= spec.deployment %> + component: route_emitter_windows + +spec: + product: + name: diego + version: latest -indicators: -- name: route_emitter_sync_duration - promql: max_over_time(RouteEmitterSyncDuration{source_id="route_emitter"}[15m]) / 1000000000 - documentation: - title: Route Emitter - Sync Duration - description: | - Time in ns that the active Route Emitter took to perform its synchronization pass. + indicators: + - name: route_emitter_sync_duration + promql: max_over_time(RouteEmitterSyncDuration{source_id="route_emitter"}[15m]) / 1000000000 + documentation: + title: Route Emitter - Sync Duration + description: | + Time in ns that the active Route Emitter took to perform its synchronization pass. - Use: Increases in this metric indicate that the Route Emitter may have trouble maintaining an accurate routing table to broadcast to the Gorouters. Tune alerting values to your deployment based on historical data and adjust based on observations over time. The suggested starting point is ≥ 5 for the yellow threshold and ≥ 10 for the critical threshold. + Use: Increases in this metric indicate that the Route Emitter may have trouble maintaining an accurate routing table to broadcast to the Gorouters. Tune alerting values to your deployment based on historical data and adjust based on observations over time. The suggested starting point is ≥ 5 for the yellow threshold and ≥ 10 for the critical threshold. - Origin: Firehose - Type: Gauge (Float in ns) - Frequency: 60s - recommended_response: | - If all or many jobs showing as impacted, there is likely an issue with Diego. - 1. Investigate the Route Emitter and Diego BBS logs for errors. - 2. Verify that app routes are functional by making a request to an app, pushing an app and pinging it, or if applicable, checking that your smoke tests have passed. - If one or a few jobs showing as impacted, there is likely a connectivity issue and the impacted job should be investigated further. + Origin: Firehose + Type: Gauge (Float in ns) + Frequency: 60s + recommended_response: | + If all or many jobs showing as impacted, there is likely an issue with Diego. + 1. Investigate the Route Emitter and Diego BBS logs for errors. + 2. Verify that app routes are functional by making a request to an app, pushing an app and pinging it, or if applicable, checking that your smoke tests have passed. + If one or a few jobs showing as impacted, there is likely a connectivity issue and the impacted job should be investigated further.
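The Route Emitter description above suggests starting thresholds of ≥ 5 (yellow) and ≥ 10 (critical) for the sync duration in seconds. If those values should live in the document itself rather than only in prose, the sketch below shows the indicator as an entry under spec.indicators with v1 thresholds attached; the operator/value field names follow my reading of the indicatorprotocol.io/v1 threshold schema and should be checked against the protocol documentation, with the description's yellow level mapped to warning.

  - name: route_emitter_sync_duration
    promql: max_over_time(RouteEmitterSyncDuration{source_id="route_emitter"}[15m]) / 1000000000
    thresholds:            # assumed v1 threshold syntax; verify against the indicator protocol schema
    - level: warning       # suggested yellow starting point from the description (>= 5 s)
      operator: gte
      value: 5
    - level: critical      # suggested critical starting point from the description (>= 10 s)
      operator: gte
      value: 10
    documentation:
      title: Route Emitter - Sync Duration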