Skip to content

Conversation

@jcantrill
Copy link
Contributor

Description

This PR:

  • adds an alert for when sinks are generating errors (i.e. connection refused)
  • Creates the alert for each pod to identify issues with specific nodes

Links

https://issues.redhat.com/browse/LOG-7896

image image

@xperimental @r2d2rnd

  • does it makes sense for the threshold to be zero?
  • does it make sense to have one alert for each collector pod or collapse them into a single CLF definition

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 16, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 16, 2025

@jcantrill: This pull request references LOG-7896 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.8.0" version, but no target version was set.

In response to this:

Description

This PR:

  • adds an alert for when sinks are generating errors (i.e. connection refused)
  • Creates the alert for each pod to identify issues with specific nodes

Links

https://issues.redhat.com/browse/LOG-7896

image image

@xperimental @r2d2rnd

  • does it makes sense for the threshold to be zero?
  • does it make sense to have one alert for each collector pod or collapse them into a single CLF definition

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2025
@jcantrill
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 16, 2025
@openshift-ci openshift-ci bot requested review from alanconway and cahartma October 16, 2025 20:04
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcantrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2025
@r2d2rnd
Copy link
Contributor

r2d2rnd commented Oct 17, 2025

Hello @jcantrill ,

About your questions:

Question 1: does it makes sense for the threshold to be zero?
From my point of view, if that metric can't take "negative" value, then, the threshold of zero is good

Question 2: does it make sense to have one alert for each collector pod or collapse them into a single CLF definition
It's better by collector. Sometimes a problem to deliver is in all the collectors as a bad definition, SSL/TLS errors, destination not available, etc. But other times, the problem is related to a single node or only some of them and not all.

Then, yes, it makes sense to be individual by collector pod.

@jcantrill
Copy link
Contributor Author

jcantrill commented Oct 17, 2025

But other times, the problem is related to a single node or only some of them and not all.

My original impl was an alert for a CLF but I modified it to be for the pods in a CLF for exactly this reason. I was thinking of the case where there is issue with a single node

@r2d2rnd
Copy link
Contributor

r2d2rnd commented Oct 17, 2025

Hello @jcantrill ,
I was thinking again on this and reviewing real scenarios and I need to contradict myself. We need to put the alert at the CLF level and not at the pod level. It's common to have in the pipeline defined in the pipeline where you define some inputs (application namespaces) that they don't run in all the nodes and send only specific inputs, usually some specific application pods or containers to specific outputs. This output is also not used for receiving other logs. Something like:

input A:
    - namespaces:
      - ns1
      - ns2
input B
    - namespaces:
     - ns3
      - ns4
 outputs:
    - out1
    - out2
 pipeline:
  - name A
     inputRefs:
     - inputA
     outputrefs:
     - out1  
   - name B
     inputRefs:
     - inputB
     outputrefs:
     - out2

Then, for generating a lot of fake errors, we need to do it by CLF as an output is defined, we should expect that this output should receive some logs from any of the outputs

@jcantrill
Copy link
Contributor Author

I believe the way this alert triggers this satisfies what is needed:

The pod '{{ $labels.pod }}' owned by ClusterLogForwarder "{{ $labels.namespace }}/{{ $labels.app_kubernetes_io_instance }}" for output "{{ $labels.component_id }}" is generating the error: "{{ $labels.error_kind }}".

It will fire an alert which identifies the specific pod for a given CLF namespace/name which should cover all scenarios. I can imagine the tedious part would be on a cluster with a significant number of nodes and generating one alert for each; that may be the deal breaker here as it would be a significant number of alerts to dismiss. The current implementation allows identification of "the one ocp node" with problems but does not handle well all of them having issues at once.

My ideal implementation would be a single alert that identifies all the pods which are exhibiting the same issue.

@jcantrill jcantrill force-pushed the log7896 branch 2 times, most recently from 241cfcf to ad34141 Compare October 17, 2025 18:49
@jcantrill
Copy link
Contributor Author

/test functional-target

1 similar comment
@jcantrill
Copy link
Contributor Author

/test functional-target

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 21, 2025

@jcantrill: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/functional-target 1e8f836 link true /test functional-target

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. release/6.4

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants