216 changes: 110 additions & 106 deletions jobs/auctioneer/templates/indicators.yml.erb
@@ -1,109 +1,113 @@
---
apiVersion: v0

product:
name: diego
version: latest
apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument

metadata:
deployment: <%= spec.deployment %>

indicators:
- name: auctioneer_lrp_auctions_failed
promql: rate(AuctioneerLRPAuctionsFailed{source_id="auctioneer"}[5m]) * 60
documentation:
title: Auctioneer - App Instance (AI) Placement Failures
description: |
The number of Long Running Process (LRP) instances that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: This metric can indicate that Diego is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled.

This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state.

Origin: Firehose
Type: Counter (Integer)
Frequency: During each auction
recommended_response: |
1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error and resource constraint, you may also find a failure reason in the Cloud Controller (CC) API.
2. Investigate the health of your Diego cells to determine if they are the resource type causing the problem.
3. Consider scaling additional cells.

- name: auctioneer_states_duration
promql: max_over_time(AuctioneerFetchStatesDuration{source_id="auctioneer"}[5m]) / 1000000000
documentation:
title: Auctioneer - Time to Fetch Cell State
description: |
Time in ns that the auctioneer took to fetch state from all the Diego cells when running its auction.

Use: Indicates how the cells themselves are performing. Alerting on this metric helps alert that app staging requests to Diego may be failing.

Origin: Firehose
Type: Gauge, integer in ns
Frequency: During event, during each auction
recommended_response: |
1. Check the health of the cells by reviewing the logs and looking for errors.
2. Review IaaS console metrics.
3. Inspect the Auctioneer logs to determine if one or more cells is taking significantly longer to fetch state than other cells. Relevant log lines will have wording like `fetched cell state`.

- name: auctioneer_lrp_auctions_started
promql: rate(AuctioneerLRPAuctionsStarted{source_id="auctioneer"}[5m]) * 60
documentation:
title: Auctioneer - App Instance Starts
description: |
The number of LRP instances that the auctioneer successfully placed on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: Provides a sense of running system activity levels in your environment. Can also give you a sense of how many app instances have been started over time. The provided measurement can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled.

Origin: Firehose
Type: Counter (Integer)
Frequency: During event, during each auction
recommended_response: |
When observing a significant amount of container churn, do the following:

1. Look to eliminate explainable causes of temporary churn, such as a deployment or increased developer activity.
2. If container churn appears to continue over an extended period, inspect Diego Auctioneer and BBS logs.

When observing extended periods of high or low activity trends, scale up or down CF components as needed.

- name: auctioneer_task_auctions_failed
promql: rate(AuctioneerTaskAuctionsFailed{source_id="auctioneer"}[5m]) * 60
documentation:
title: Auctioneer - Task Placement Failures
description: |
The number of Tasks that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no tasks being scheduled.

This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state.

Origin: Firehose
Type: Counter (Float)
Frequency: During event, during each auction
recommended_response: |
In order to best determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you may also find a failure reason in the CC API.

1. Investigate the health of Diego cells.
2. Consider scaling additional cells.

- name: auctioneer_lock_held
promql: max_over_time(LockHeld{source_id="auctioneer"}[5m])
documentation:
title: Auctioneer - Lock Held
description: |
Whether an Auctioneer instance holds the expected Auctioneer lock (in Locket). 1 means the active Auctioneer holds the lock, and 0 means the lock was lost.

Use: This metric is complementary to Active Locks, and it offers an Auctioneer-level version of the Locket metrics. Although it is emitted per Auctioneer instance, only 1 active lock is held by Auctioneer. Therefore, the expected value is 1. The metric may occasionally be 0 when the Auctioneer instances are performing a leader transition, but a prolonged value of 0 indicates an issue with Auctioneer.

Origin: Firehose
Type: Gauge
Frequency: Periodically
recommended_response: |
1. Run monit status on the instance group that the Auctioneer job is running on to check for failing processes.
2. If there are no failing processes, then review the logs for Auctioneer.
- Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push.
labels:
deployment: <%= spec.deployment %>
component: auctioneer

spec:
product:
name: diego
version: latest

indicators:
- name: auctioneer_lrp_auctions_failed
promql: rate(AuctioneerLRPAuctionsFailed{source_id="auctioneer"}[5m]) * 60
documentation:
title: Auctioneer - App Instance (AI) Placement Failures
description: |
The number of Long Running Process (LRP) instances that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: This metric can indicate that Diego is out of container space or that there is a lack of resources within your environment. This indicator also increases when the LRP is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled.

This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state.

Origin: Firehose
Type: Counter (Integer)
Frequency: During each auction
recommended_response: |
1. To best determine the root cause, examine the Auctioneer logs. Depending on the specific error and resource constraint, you may also find a failure reason in the Cloud Controller (CC) API.
2. Investigate the health of your Diego cells to determine if they are the resource type causing the problem.
3. Consider scaling additional cells.

- name: auctioneer_states_duration
promql: max_over_time(AuctioneerFetchStatesDuration{source_id="auctioneer"}[5m]) / 1000000000
documentation:
title: Auctioneer - Time to Fetch Cell State
description: |
Time in ns that the auctioneer took to fetch state from all the Diego cells when running its auction.

Use: Indicates how the cells themselves are performing. Alerting on this metric helps alert that app staging requests to Diego may be failing.

Origin: Firehose
Type: Gauge, integer in ns
Frequency: During event, during each auction
recommended_response: |
1. Check the health of the cells by reviewing the logs and looking for errors.
2. Review IaaS console metrics.
3. Inspect the Auctioneer logs to determine if one or more cells is taking significantly longer to fetch state than other cells. Relevant log lines will have wording like `fetched cell state`.

- name: auctioneer_lrp_auctions_started
promql: rate(AuctioneerLRPAuctionsStarted{source_id="auctioneer"}[5m]) * 60
documentation:
title: Auctioneer - App Instance Starts
description: |
The number of LRP instances that the auctioneer successfully placed on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: Provides a sense of running system activity levels in your environment. Can also give you a sense of how many app instances have been started over time. The provided measurement can help indicate a significant amount of container churn. However, for capacity planning purposes, it is more helpful to observe deltas over a long time window.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no app instances being scheduled.

Origin: Firehose
Type: Counter (Integer)
Frequency: During event, during each auction
recommended_response: |
When observing a significant amount of container churn, do the following:

1. Look to eliminate explainable causes of temporary churn, such as a deployment or increased developer activity.
2. If container churn appears to continue over an extended period, inspect Diego Auctioneer and BBS logs.

When observing extended periods of high or low activity trends, scale up or down CF components as needed.

- name: auctioneer_task_auctions_failed
promql: rate(AuctioneerTaskAuctionsFailed{source_id="auctioneer"}[5m]) * 60
documentation:
title: Auctioneer - Task Placement Failures
description: |
The number of Tasks that the auctioneer failed to place on Diego cells. This metric is cumulative over the lifetime of the auctioneer job.

Use: Failing Task auctions indicate a lack of resources within your environment and that you likely need to scale. This indicator also increases when the Task is requesting an isolation segment, volume drivers, or a stack that is unavailable, either not deployed or lacking sufficient resources to accept the work.

This metric is emitted on event, and therefore gaps in receipt of this metric can be normal during periods of no tasks being scheduled.

This error is most common due to capacity issues, for example, if cells do not have enough resources, or if cells are going back and forth between a healthy and unhealthy state.

Origin: Firehose
Type: Counter (Float)
Frequency: During event, during each auction
recommended_response: |
In order to best determine the root cause, examine the Auctioneer logs. Depending on the specific error or resource constraint, you may also find a failure reason in the CC API.

1. Investigate the health of Diego cells.
2. Consider scaling additional cells.

- name: auctioneer_lock_held
promql: max_over_time(LockHeld{source_id="auctioneer"}[5m])
documentation:
title: Auctioneer - Lock Held
description: |
Whether an Auctioneer instance holds the expected Auctioneer lock (in Locket). 1 means the active Auctioneer holds the lock, and 0 means the lock was lost.

Use: This metric is complementary to Active Locks, and it offers an Auctioneer-level version of the Locket metrics. Although it is emitted per Auctioneer instance, only 1 active lock is held by Auctioneer. Therefore, the expected value is 1. The metric may occasionally be 0 when the Auctioneer instances are performing a leader transition, but a prolonged value of 0 indicates an issue with Auctioneer.

Origin: Firehose
Type: Gauge
Frequency: Periodically
recommended_response: |
1. Run monit status on the instance group that the Auctioneer job is running on to check for failing processes.
2. If there are no failing processes, then review the logs for Auctioneer.
- Recent logs for Auctioneer should show all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push.
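
Reviewer note on the promql expressions in this file: the counter indicators use `rate(...[5m]) * 60`, which turns a per-second rate into events per minute, and `auctioneer_states_duration` divides by 1000000000 to turn nanoseconds into seconds. The Python below is only a rough sketch of that arithmetic, not how Prometheus implements `rate()` (the real function also extrapolates to the window boundaries). The function names and the sample data are hypothetical and appear nowhere in this PR.

```python
def per_minute_rate(samples):
    """Approximate rate(counter[window]) * 60 for one series.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplified assumption: no extrapolation to the window edges.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value is treated as a counter reset (for example an
        # auctioneer restart): count the new value as the increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    # increase / elapsed is per second; multiply by 60 for per minute.
    return increase * 60 / elapsed


def ns_to_seconds(duration_ns):
    """Convert a nanosecond gauge value to seconds."""
    return duration_ns / 1_000_000_000


# Hypothetical AuctioneerLRPAuctionsFailed samples over a 5-minute window.
failed = [(0, 10), (60, 12), (120, 15), (300, 20)]
print(per_minute_rate(failed))
print(ns_to_seconds(2_500_000_000))
```

Because the counters are cumulative over the lifetime of the auctioneer job, the rate form is what makes them usable for alerting: a restart resets the raw value, but the per-minute rate stays comparable across restarts.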