LOG-7896: Add alert when forwarder sink is generating errors #3137

jcantrill · 2025-10-16T20:03:47Z

Description

This PR:

Adds an alert for when sinks are generating errors (e.g. connection refused)
Creates the alert for each pod to identify issues with specific nodes

Links

https://issues.redhat.com/browse/LOG-7896

@xperimental @r2d2rnd

Does it make sense for the threshold to be zero?
Does it make sense to have one alert for each collector pod, or to collapse them into a single alert per CLF definition?

openshift-ci-robot · 2025-10-16T20:03:52Z

@jcantrill: This pull request references LOG-7896 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.8.0" version, but no target version was set.


jcantrill · 2025-10-16T20:03:57Z

/hold

openshift-ci · 2025-10-16T20:04:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcantrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jcantrill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

r2d2rnd · 2025-10-17T08:51:13Z

Hello @jcantrill ,

About your questions:

Question 1: Does it make sense for the threshold to be zero?
From my point of view, if that metric can't take a negative value, then a threshold of zero is good.

Question 2: Does it make sense to have one alert for each collector pod or collapse them into a single CLF definition?
It's better by collector. Sometimes a delivery problem affects all the collectors, for example a bad definition, SSL/TLS errors, or an unavailable destination. But other times the problem is related to a single node, or only some of them and not all.

So yes, it makes sense for the alert to be individual by collector pod.

jcantrill · 2025-10-17T13:29:52Z

But other times, the problem is related to a single node or only some of them and not all.

My original impl was an alert for a CLF, but I modified it to be for the pods in a CLF for exactly this reason. I was thinking of the case where there is an issue with a single node.

r2d2rnd · 2025-10-17T14:32:55Z

Hello @jcantrill ,
I was thinking about this again and reviewing real scenarios, and I need to contradict myself. We need to put the alert at the CLF level and not at the pod level. It's common to define a pipeline with some inputs (application namespaces) whose pods don't run on all the nodes, and to send only those specific inputs, usually some specific application pods or containers, to specific outputs. Such an output is also not used for receiving other logs. Something like:

inputs:
  - name: inputA
    namespaces:
      - ns1
      - ns2
  - name: inputB
    namespaces:
      - ns3
      - ns4
outputs:
  - name: out1
  - name: out2
pipelines:
  - name: A
    inputRefs:
      - inputA
    outputRefs:
      - out1
  - name: B
    inputRefs:
      - inputB
    outputRefs:
      - out2

Then, to avoid generating a lot of fake errors, we need to do it by CLF: when an output is defined, we should expect that output to receive some logs from at least one of the inputs.
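To make the routing in the example concrete, here is a minimal Python sketch that resolves which namespaces can reach each output. The dictionaries are a simplified, hypothetical mirror of the example above and are not the real ClusterLogForwarder schema; `namespaces_per_output` is an illustrative helper, not part of any operator API.

```python
from collections import defaultdict

# Hypothetical, simplified mirror of the example configuration above.
inputs = {"inputA": ["ns1", "ns2"], "inputB": ["ns3", "ns4"]}
pipelines = [
    {"name": "A", "inputRefs": ["inputA"], "outputRefs": ["out1"]},
    {"name": "B", "inputRefs": ["inputB"], "outputRefs": ["out2"]},
]


def namespaces_per_output(inputs, pipelines):
    """Map each output name to the sorted namespaces whose logs reach it."""
    result = defaultdict(set)
    for pipeline in pipelines:
        for out in pipeline["outputRefs"]:
            for ref in pipeline["inputRefs"]:
                result[out].update(inputs[ref])
    return {out: sorted(ns) for out, ns in result.items()}


feeds = namespaces_per_output(inputs, pipelines)
print(feeds)  # {'out1': ['ns1', 'ns2'], 'out2': ['ns3', 'ns4']}
```

In this sketch, a collector on a node that runs no pods from ns3 or ns4 would never have anything to send to out2, which is the situation described above.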

jcantrill · 2025-10-17T14:54:57Z

I believe the way this alert triggers satisfies what is needed:

The pod '{{ $labels.pod }}' owned by ClusterLogForwarder "{{ $labels.namespace }}/{{ $labels.app_kubernetes_io_instance }}" for output "{{ $labels.component_id }}" is generating the error: "{{ $labels.error_kind }}".

It will fire an alert which identifies the specific pod for a given CLF namespace/name which should cover all scenarios. I can imagine the tedious part would be on a cluster with a significant number of nodes and generating one alert for each; that may be the deal breaker here as it would be a significant number of alerts to dismiss. The current implementation allows identification of "the one ocp node" with problems but does not handle well all of them having issues at once.

My ideal implementation would be a single alert that identifies all the pods which are exhibiting the same issue.
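As a rough illustration of that idea, the following Python sketch collapses per-pod error samples into one entry per CLF, output, and error kind, with the affected pods listed together. The label names come from the alert message above; the sample values and the `group_by_forwarder` helper are hypothetical, and this is plain Python rather than the alert expression used in this PR.

```python
from collections import defaultdict


def group_by_forwarder(samples):
    """Collapse per-pod samples into one entry per (namespace, CLF, output, error)."""
    grouped = defaultdict(list)
    for s in samples:
        key = (
            s["namespace"],
            s["app_kubernetes_io_instance"],
            s["component_id"],
            s["error_kind"],
        )
        grouped[key].append(s["pod"])
    return {key: sorted(pods) for key, pods in grouped.items()}


# Hypothetical samples: two pods of the same CLF failing the same way.
samples = [
    {"namespace": "openshift-logging", "app_kubernetes_io_instance": "collector",
     "component_id": "out1", "error_kind": "connection refused", "pod": "collector-b"},
    {"namespace": "openshift-logging", "app_kubernetes_io_instance": "collector",
     "component_id": "out1", "error_kind": "connection refused", "pod": "collector-a"},
]

alerts = group_by_forwarder(samples)
for (ns, clf, output, kind), pods in alerts.items():
    print(f'{ns}/{clf} output "{output}" error "{kind}": {", ".join(pods)}')
```

With this grouping, a cluster-wide outage produces one entry listing every affected pod, while a single-node problem still names the one pod involved.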

jcantrill · 2025-10-20T20:28:18Z

/test functional-target

jcantrill · 2025-10-21T14:37:52Z

/test functional-target

openshift-ci · 2025-10-21T16:14:31Z

@jcantrill: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/functional-target	`1e8f836`	link	true	`/test functional-target`


openshift-ci-robot added the jira/valid-reference label Oct 16, 2025

openshift-merge-robot added the needs-rebase label Oct 16, 2025

openshift-ci bot added the do-not-merge/hold label Oct 16, 2025

openshift-ci bot requested review from alanconway and cahartma October 16, 2025 20:04

openshift-ci bot added the approved label Oct 16, 2025

jcantrill force-pushed the log7896 branch from 8443e7b to 86532ad Compare October 16, 2025 20:05

openshift-merge-robot removed the needs-rebase label Oct 16, 2025

jcantrill force-pushed the log7896 branch 2 times, most recently from 241cfcf to ad34141 Compare October 17, 2025 18:49

LOG-7896: Add alert when forwarder sink is generating errors

1e8f836

jcantrill force-pushed the log7896 branch from ad34141 to 1e8f836 Compare October 17, 2025 18:56

jcantrill added the release/6.4 label Oct 21, 2025
