This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit fda458b

Merge pull request #238 from grafana/add-bucket-index-observability
Add bucket index observability
2 parents 6bb6fe1 + 05094da commit fda458b

4 files changed: +78 -2 lines changed


CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,9 @@
 ## master / unreleased

 * [ENHANCEMENT] Added `unregister_ingesters_on_shutdown` config option to disable unregistering ingesters on shutdown (default is enabled). #213
+* [ENHANCEMENT] Improved blocks storage observability: #237
+  - Cortex / Queries: added bucket index load operations and latency (available only when bucket index is enabled)
+  - Alerts: added "CortexBucketIndexNotUpdated" (bucket index only) and "CortexTenantHasPartialBlocks"

 ## 1.6.0 / 2021-01-05

cortex-mixin/alerts/blocks.libsonnet

Lines changed: 27 additions & 0 deletions
@@ -184,6 +184,33 @@
             message: 'Cortex Store Gateway {{ $labels.namespace }}/{{ $labels.instance }} has not successfully synched the bucket since {{ $value | humanizeDuration }}.',
           },
         },
+        {
+          // Alert if the bucket index has not been updated for a given user.
+          alert: 'CortexBucketIndexNotUpdated',
+          expr: |||
+            min by(namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
+          |||,
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: 'Cortex bucket index for tenant {{ $labels.user }} in {{ $labels.namespace }} has not been updated since {{ $value | humanizeDuration }}.',
+          },
+        },
+        {
+          // Alert if we consistently find partial blocks for a given tenant over a relatively large time range.
+          alert: 'CortexTenantHasPartialBlocks',
+          'for': '6h',
+          expr: |||
+            max by(namespace, user) (cortex_bucket_blocks_partials_count) > 0
+          |||,
+          labels: {
+            severity: 'warning',
+          },
+          annotations: {
+            message: 'Cortex tenant {{ $labels.user }} in {{ $labels.namespace }} has {{ $value }} partial blocks.',
+          },
+        },
       ],
     },
   ],
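
The `CortexBucketIndexNotUpdated` expression above can be read as: for each `(namespace, user)` pair, take the smallest age of the last successful update across all reporting series, and fire if even that smallest age exceeds 7200 seconds. The following is a minimal Python sketch of that aggregation, not how Prometheus evaluates the rule. The `instance` label, the sample data, and the function name `stale_tenants` are assumptions made for illustration.

```python
STALE_AFTER_SECONDS = 7200  # threshold taken from the alert expression


def stale_tenants(last_update, now, threshold=STALE_AFTER_SECONDS):
    """Return {(namespace, user): age_seconds} for tenants with a stale index.

    last_update maps (namespace, user, instance) to the Unix timestamp of the
    last successful bucket index update reported by that instance. The
    'instance' label is hypothetical and only illustrates the aggregation.
    """
    youngest_age = {}
    for (namespace, user, _instance), timestamp in last_update.items():
        age = now - timestamp
        key = (namespace, user)
        # min by(namespace, user): keep the most recent update across series.
        youngest_age[key] = min(age, youngest_age.get(key, age))
    # "> 7200": keep only tenants whose freshest update is still too old.
    return {key: age for key, age in youngest_age.items() if age > threshold}


# Hypothetical samples: tenant-a was updated recently by one instance,
# tenant-b has not been updated by any instance for 9000 seconds.
samples = {
    ("prod", "tenant-a", "compactor-0"): 1_000,
    ("prod", "tenant-a", "compactor-1"): 9_000,
    ("prod", "tenant-b", "compactor-0"): 1_000,
}
print(stale_tenants(samples, now=10_000))  # {('prod', 'tenant-b'): 9000}
```

Using `min` rather than `max` means a single fresh series is enough to keep the alert quiet for that tenant, which matches the intent of alerting only when nobody has updated the index.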

cortex-mixin/dashboards/queries.libsonnet

Lines changed: 26 additions & 2 deletions
@@ -136,7 +136,7 @@ local utils = import 'mixin-utils/utils.libsonnet';
       )
     )
     .addRowIf(
-      std.member($._config.storage_engine, 'chunks'),
+      std.member($._config.storage_engine, 'blocks'),
       $.row('Querier - Blocks storage')
       .addPanel(
         $.panel('Number of store-gateways hit per Query') +
@@ -156,7 +156,31 @@ local utils = import 'mixin-utils/utils.libsonnet';
     )
     .addRowIf(
       std.member($._config.storage_engine, 'blocks'),
-      $.row('Store-gateway - Blocks')
+      $.row('')
+      .addPanel(
+        $.panel('Bucket indexes loaded (per querier)') +
+        $.queryPanel([
+          'max(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier),
+          'min(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier),
+          'avg(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier),
+        ], ['Max', 'Min', 'Average']) +
+        { yaxes: $.yaxes('short') },
+      )
+      .addPanel(
+        $.successFailurePanel(
+          'Bucket indexes load / sec',
+          'sum(rate(cortex_bucket_index_loads_total{%s}[$__rate_interval])) - sum(rate(cortex_bucket_index_load_failures_total{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.querier), $.jobMatcher($._config.job_names.querier)],
+          'sum(rate(cortex_bucket_index_load_failures_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.querier),
+        )
+      )
+      .addPanel(
+        $.panel('Bucket indexes load latency') +
+        $.latencyPanel('cortex_bucket_index_load_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.querier)),
+      )
+    )
+    .addRowIf(
+      std.member($._config.storage_engine, 'blocks'),
+      $.row('Store-gateway - Blocks storage')
       .addPanel(
         $.panel('Blocks queried / sec') +
         $.queryPanel('sum(rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.store_gateway), 'blocks') +
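
The `Bucket indexes load / sec` panel above has no dedicated success counter to query, so its success series is computed as total loads minus failed loads. A rough Python sketch of that arithmetic follows. The `per_second_rate` helper is a deliberately simplified stand-in for PromQL `rate()` (it ignores counter resets and extrapolation), and the querier names and counter values are made up for illustration.

```python
def per_second_rate(first, last, window_seconds):
    """Simplified stand-in for rate(): counter increase divided by the window.

    Unlike PromQL, this ignores counter resets and boundary extrapolation.
    """
    return (last - first) / window_seconds


def success_failure_rates(loads, failures, window_seconds):
    """Return (success_per_sec, failure_per_sec) summed over all queriers.

    loads and failures map a querier name to (first, last) counter samples
    observed at the start and end of the window.
    """
    total = sum(per_second_rate(a, b, window_seconds) for a, b in loads.values())
    failed = sum(per_second_rate(a, b, window_seconds) for a, b in failures.values())
    # Mirrors the panel query: sum(rate(loads_total)) - sum(rate(failures_total)).
    return total - failed, failed


# Hypothetical counter samples taken 60 seconds apart on two queriers.
loads = {"querier-0": (100, 160), "querier-1": (50, 110)}
failures = {"querier-0": (0, 30), "querier-1": (0, 0)}
print(success_failure_rates(loads, failures, window_seconds=60))  # (1.5, 0.5)
```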

cortex-mixin/docs/playbooks.md

Lines changed: 22 additions & 0 deletions
@@ -226,6 +226,28 @@ gsutil mv gs://BUCKET/TENANT/BLOCK gs://BUCKET/TENANT/corrupted-BLOCK

 Same as [`CortexCompactorHasNotUploadedBlocks`](#CortexCompactorHasNotUploadedBlocks).

+### CortexBucketIndexNotUpdated
+
+This alert fires when the bucket index for a given tenant has not been updated for a long time. The bucket index is expected to be periodically updated by the compactor and is used by queriers and store-gateways to get a nearly up-to-date view of the bucket store.
+
+How to **investigate**:
+- Ensure the compactor is running successfully
+- Look for any errors in the compactor logs
+
+### CortexTenantHasPartialBlocks
+
+This alert fires when Cortex finds partial blocks for a given tenant. A partial block is a block that is missing its `meta.json`; this usually happens in one of two circumstances:
+
+1. A block upload has been interrupted and not cleaned up or retried
+2. A block deletion has been interrupted and `deletion-mark.json` has been deleted before `meta.json`
+
+How to **investigate**:
+- Look for the block ID in the logs
+- Find out which Cortex component last operated on the block (e.g. uploaded by ingester/compactor, or deleted by compactor)
+- Determine whether it was a partial upload or a partial delete
+- If it was a partial delete or a failed upload by a compactor, it is safe to manually delete the block from the bucket
+- Investigate further if it was a failed upload by an ingester that was not retried later (ingesters are expected to retry uploads until they succeed)
+
 ### CortexWALCorruption

 This alert is only related to the chunks storage. This can happen because of 2 reasons: (1) Non graceful shutdown of ingesters. (2) Faulty storage or NFS.
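
The `CortexTenantHasPartialBlocks` playbook above defines a partial block as one missing its `meta.json`. When investigating by hand, that check can be approximated from a plain listing of object keys. The sketch below assumes keys shaped like `TENANT/BLOCK/FILE` (following the `gs://BUCKET/TENANT/BLOCK` layout used in the playbook commands) and assumes block IDs are 26-character ULIDs. The function name `find_partial_blocks` is hypothetical, and this is not how Cortex itself detects partial blocks.

```python
import re
from collections import defaultdict

# Assumption: block IDs are ULIDs, i.e. 26 characters of Crockford base32.
ULID = re.compile(r"^[0-9A-HJKMNP-TV-Z]{26}$")


def find_partial_blocks(object_keys):
    """Return sorted (tenant, block_id) pairs whose block has no meta.json.

    object_keys is an iterable of keys relative to the bucket root, for
    example the output of a recursive listing with the bucket prefix removed.
    """
    files_by_block = defaultdict(set)
    for key in object_keys:
        parts = key.split("/", 2)
        # Skip tenant-level objects and directories that are not blocks.
        if len(parts) != 3 or not ULID.match(parts[1]):
            continue
        tenant, block_id, filename = parts
        files_by_block[(tenant, block_id)].add(filename)
    return sorted(
        block for block, files in files_by_block.items() if "meta.json" not in files
    )
```

A block reported by this function still has to be classified manually as a partial upload or a partial delete, as the playbook describes, before anything is removed from the bucket.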
