Syslog drain app error messages in app log stream #633

corporatemax · 2024-11-25T12:35:07Z

Description

This change should resolve #579.

I moved the SyslogConnector.emitLoggregatorErrorLog function into the app_log_emitter.go file, moved its dependent variables along with it, and encapsulated it in the AppLogEmitter struct.
Furthermore, I used the AppLogEmitter in the RetryWriter and the WriterFactory to provide error messages to an application developer who configures a syslog drain but does not know why no logs show up at the syslog drain target.
I also did this in the FilteredBindingFetcher, as misconfigurations lead to errors there.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Testing performed?

Unit tests
Integration tests
Acceptance tests

Checklist:

This PR is being made against the main branch, or relevant version branch
I have made corresponding changes to the documentation
I have added testing for my changes

If you have any questions, or want to get attention for a PR or issue please reach out on the #logging-and-metrics channel in the cloudfoundry slack

linux-foundation-easycla · 2024-11-25T12:35:12Z

The committers listed above are authorized under a signed CLA.

✅ login: corporatemax / name: Maximilian Stefanac (Corporate) (29533d9, 2ab4406, 6bb279e, a45b295, dd8f517, fc1737e)

chombium

@corporatemax Thanks for the PR and the time invested in it. Generally it looks good. I have only a few minor comments, and I would wait for the other approvers to say what they think.

src/pkg/egress/syslog/loggregator_emitter.go

chombium · 2024-11-25T13:31:26Z

src/pkg/egress/syslog/loggregator_emitter.go

+
+// WriteLog writes a message in the application log stream using a LogClient.
+func (appLogEmitter *LoggregatorEmitter) WriteLog(appID string, message string) {
+	if appLogEmitter.logClient == nil || appID == "" {


I wonder if we should write a warning in the Syslog Agent's logs that the emitter is not set and the app dev won't know that something went wrong... On the other hand, I understand that this is only to safeguard the EmitLog call as everything is initialized properly in the syslog_agent.go cmd.

I feel like if we don't write a log message for the syslog agent, we are covering up the misconfiguration... Of course, this might not be the right place, as the log would be written on every WriteLog call.

@ctlong Any thoughts on this?

I am not sure if the location of this file is suitable, as I am also calling it in the filtered binding fetcher. Should I move this into its own package? If yes, where exactly?

Let's leave it as is for now until we think of something better.

@chombium it sounds like you're worried that we do not have any mechanisms to prevent future misconfiguration of the AppLogEmitter when it's initialized in syslog_agent.go, is that right?

I think we could adequately cover this case with a new test in syslog_agent_test.go. That would seem like a valuable addition to me.

I added a test for this.

src/pkg/egress/syslog/retry_writer_test.go

chombium · 2024-11-25T13:43:47Z

src/pkg/egress/syslog/syslog_connector.go

-	"log"
-
 	"golang.org/x/net/context"
+	"log"


Was this reformatting? If so please run golangci-lint run ./... in the /src folder

No, this was GoLand. I ran the formatter on my side and committed the changes.

This looks very strange to me. Usually, the standard libraries are grouped at the top... can you please double-check the formatting here, thanks

I re-sorted the imports.

src/pkg/ingress/bindings/filtered_binding_fetcher_test.go

chombium · 2024-11-25T13:59:41Z

src/pkg/egress/syslog/syslog_connector.go

 		drainDroppedMetric.Add(float64(missed))

-		w.emitLoggregatorErrorLog(b.AppId, fmt.Sprintf("%d messages lost for application %s in user provided syslog drain with url %s", missed, b.AppId, anonymousUrl.String()))
+		w.loggregatorEmitter.WriteLog(b.AppId, fmt.Sprintf("%d messages lost for application %s in user provided syslog drain with url %s", missed, b.AppId, anonymousUrl.String()))


This is a system/operator message and we should adjust the text as we have a concrete App for which some logs are lost. IMO x messages lost for syslog drain with url y is enough. On the other hand if the same syslog drain url is used for multiple apps the message would be clearer this way. Though, the app id will be sent anyhow as label/tag of the log. I'm not totally sure if we should change the message.

@ctlong any thought on this?

What is the final log message output to the user in this case? I think the log client, or some of the other logic in AppLogEmitter may prefix the message with some information. In any case, you're right, we should evaluate the full log envelope and see if anything should be tweaked to ensure that it is legible to an app developer.

chombium

@corporatemax Thanks for looking into the review and making adjustments. I'm happy with how things look.

I wonder what @ctlong has to say about this PR.

chombium · 2024-12-02T15:02:56Z

@corporatemax Can you please check the unit tests? Do they fail because they run as a GitHub Action, or is it something else?

corporatemax · 2024-12-03T12:10:51Z

@chombium the test states

unexpected response from internal bindings endpoint. status code: 404

I assume this test is flaky with regard to the opened ports.
Can you please rerun the unit tests?

chombium · 2024-12-06T14:55:20Z

@corporatemax I was able to reproduce the error locally on Ubuntu Linux 20.04 LTS when running both all the tests and only the failing test suite.

All tests run:

go run github.com/onsi/ginkgo/v2/ginkgo -r --randomize-all --randomize-suites --fail-on-pending --keep-going --race --trace

and running the single suite:

go run github.com/onsi/ginkgo/v2/ginkgo -r --randomize-all --randomize-suites --fail-on-pending --keep-going --race --trace --focus "Syslog Agent App Suite"

At first I thought the problem appeared only when all of the tests run together, but it is isolated: it seems to lie within the Syslog Agent App Suite.

I've also tried running a single test like:

go run github.com/onsi/ginkgo/v2/ginkgo -r  --fail-on-pending --keep-going --race --trace  --focus "generates metrics" cmd/syslog-agent/app

and that ran properly.

It seems to me there are problems with how the tests are set up and run when running the whole Syslog Agent App Suite.

I've rerun the tests as they run in the GitHub Action and I could reproduce the problem. Therefore, I will revoke my approval until all tests are running properly.

src/pkg/egress/syslog/writer_factory.go

chombium

Generally it looks ok, but I have some objections about the implementation as it differs slightly from the rest of the code.

src/pkg/egress/syslog/app_log_emitter.go

chombium · 2025-01-16T15:08:50Z

src/pkg/egress/syslog/app_log_emitter.go

+}
+
+// AppLogEmitter abstracts the sending of a log to the application log stream.
+type AppLogEmitter interface {


Do we have a need for an interface here? Who else implements the EmitLog method? Could this interface be deleted and the method attached to the DefaultAppLogEmitter?

removed the interface.

chombium · 2025-01-16T15:11:56Z

src/pkg/egress/syslog/app_log_emitter.go

+}
+
+// DefaultAppLogEmitter is an implementation of AppLogEmitter which sends logs to an instance of a LogClient
+type DefaultAppLogEmitter struct {


Can we rename this to something else please? The word default caught my eye...

Usually in the Loggregator code we have a single struct, from which the default implementation is created, and then we apply options... We should be consistent here as well with the naming and the implementation.

If the AppLogEmitter interface above is removed we have a good name for the struct.

adjusted it.
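
One reading of the convention described in this thread is the functional-options pattern: a single concrete struct, a constructor that sets defaults, and options applied on top. The sketch below shows that pattern only; the fields, the option name, and the Format method are hypothetical and do not reflect the actual AppLogEmitter.

```go
package main

import "fmt"

// AppLogEmitter is a single concrete struct with no separate interface and
// no "Default" variant. Fields are hypothetical.
type AppLogEmitter struct {
	sourceType  string
	sourceIndex string
}

// Option adjusts an AppLogEmitter during construction.
type Option func(*AppLogEmitter)

// WithSourceIndex overrides the default source index (hypothetical option).
func WithSourceIndex(index string) Option {
	return func(e *AppLogEmitter) { e.sourceIndex = index }
}

// NewAppLogEmitter builds the default emitter, then applies the options.
func NewAppLogEmitter(opts ...Option) *AppLogEmitter {
	e := &AppLogEmitter{sourceType: "syslog-drain", sourceIndex: "0"}
	for _, opt := range opts {
		opt(e)
	}
	return e
}

// Format is attached directly to the struct, so no interface is needed
// while there is only one implementation.
func (e *AppLogEmitter) Format(message string) string {
	return fmt.Sprintf("[%s/%s] %s", e.sourceType, e.sourceIndex, message)
}

func main() {
	fmt.Println(NewAppLogEmitter().Format("drain unreachable"))
	fmt.Println(NewAppLogEmitter(WithSourceIndex("3")).Format("drain unreachable"))
}
```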

corporatemax · 2025-02-06T09:18:29Z

@chombium I addressed the feedback regarding the DefaultAppLogEmitter.

chombium

Hi @corporatemax, I'm terribly sorry for keeping you waiting so long. Thanks for the changes so far. I have two more things that need to be taken care of. Ah, and please squash the commits.

chombium · 2025-03-05T16:10:59Z

src/cmd/syslog-agent/app/syslog_agent.go

 	cfg Config,
 	m Metrics,
 	l *log.Logger,
+	factory syslog.AppLogEmitterFactory,


Can you be more specific in the naming? We already have a writerFactory further down in the code, so it would be good if the variable name were more precise.

renamed this factory.

src/pkg/egress/syslog/app_log_emitter.go

chombium · 2025-03-12T15:29:19Z

src/pkg/egress/syslog/app_log_emitter.go

+	}
+}
+
+func NewDefaultAppLogEmitterFactory() DefaultAppLogEmitterFactory {


Same as above. Please remove "default" from the function name.

removed the term Default from this function name.

chombium

LGTM!

Thanks for the cooperation @corporatemax

The unit and the linting tests fail.

chombium · 2025-03-20T15:53:31Z

@corporatemax Everything looks good now, but please fix the unit tests and the linting errors before we merge this change.

chombium

LGTM! Thanks for the quick fix @corporatemax

ctlong

On further review, I can't accept this change due to the log load that I believe it would add at the FilteredBindingFetcher layer. That loop occurs every ~30s on every Syslog Agent in every VM, which seems like an excessive amount of new logs. For example, if there were 150 VMs in the system, then 150 new, duplicate log lines will be generated for every invalid syslog drain every 30 seconds.

➡️ If you want to remove that portion of the change, I can reevaluate the rest of the PR. Alternatively, fixing that loop could be worthwhile but is a more involved change (likely need to push the drain validation to the writer level so that it only triggers if the drain is activated).

add error messages from filtered binding fetcher to the app log stream
rename LoggregatorEmitter to AppLogEmitter
change format of NewFilteredBindingFetcher calls
rename references to AppLogEmitter
move spyLogClient to testhelper package
use grpc port for ingress client from config
add comment
rearrange imports
introduce AppLogEmitterFactory
remove AppLogEmitter interface and refactor to struct
rename factory
rename NewAppLogEmitterFactory method

No longer active in the project

🥺🥺🥺🥺🥺🥺
All the best Carson! ❤️

chombium

Hi @corporatemax, thanks for this iteration. I've found some unusual formatting and some lines that were moved without being changed. Can you please take a look at that?

src/cmd/syslog-agent/app/syslog_agent.go

src/cmd/syslog-agent/app/syslog_agent_mtls_test.go

src/internal/testhelper/spy_log_client.go

src/pkg/egress/syslog/syslog_connector.go

src/pkg/ingress/bindings/filtered_binding_fetcher_test.go

chombium

Hi @corporatemax ,

I've checked the PR, and we no longer have the duplication problem, because the error message is now emitted deep down in the TCPWriter. The problem now is that the error is written only for connection problems, whereas most of the problems come from bad configuration. As we've already discussed, the validation happens in the Filtered Binding Fetcher and we should not spam the users. I've tried to find a way to check whether an app instance is running on a particular Diego Cell, so that we can emit the message from the Syslog Agent running on that cell. The only useful thing I've found is the CAPI v3 /v3/apps/:guid/processes call. The problem is that we would need a UAA user for it.

Today, we discussed another, more viable option with @fhambrec and checked whether we can move the validation from the Syslog Agent's Filtered Binding Fetcher to the Syslog Binding Cache. The Cache runs only on scheduler VMs, so we won't have the spam multiplication problems caused by the number of VMs running the Syslog Agent and the number of application instances. We concluded that if we move the Syslog Drain URL validation logic to the Get method in the Binding Cache Poller, the logs would be emitted once per app every minute per Scheduler node. This is much less than before, as there is only one Scheduler VM per availability zone.

Can you please proceed with this new idea? I think you have everything in this PR; you will only need to decide whether the validation logic has to be moved to the binding cache and done once, so that only valid Syslog Drain configs get into the cache.

Wdyt about this approach?
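
A rough sketch of the proposed approach, validating drain URLs once where bindings enter the cache and keeping only the valid ones, could look like the following. Everything here is hypothetical: the Binding type, the set of accepted schemes, the checks performed, and the report callback standing in for the app log emitter. It does not mirror the Binding Cache Poller's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// Binding is a simplified, hypothetical view of a syslog drain binding.
type Binding struct {
	AppID    string
	DrainURL string
}

// allowedSchemes is an assumed set of accepted drain schemes.
var allowedSchemes = map[string]bool{"syslog": true, "syslog-tls": true, "https": true}

// validateDrain performs basic checks on a drain URL. Error texts omit the
// URL itself so that credentials are not leaked into log streams.
func validateDrain(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return errors.New("syslog drain url cannot be parsed")
	}
	if !allowedSchemes[u.Scheme] {
		return fmt.Errorf("syslog drain url has unsupported scheme %q", u.Scheme)
	}
	if u.Hostname() == "" {
		return errors.New("syslog drain url has no host")
	}
	return nil
}

// filterValid keeps only valid bindings and reports each invalid one once,
// so that agents downstream never see (or log about) invalid drains.
func filterValid(bindings []Binding, report func(appID, message string)) []Binding {
	valid := make([]Binding, 0, len(bindings))
	for _, b := range bindings {
		if err := validateDrain(b.DrainURL); err != nil {
			report(b.AppID, err.Error())
			continue
		}
		valid = append(valid, b)
	}
	return valid
}

func main() {
	bindings := []Binding{
		{AppID: "app-1", DrainURL: "syslog-tls://logs.example.com:6514"},
		{AppID: "app-2", DrainURL: "ftp://logs.example.com"},
	}
	valid := filterValid(bindings, func(appID, message string) {
		fmt.Printf("app %s: %s\n", appID, message)
	})
	fmt.Println("valid bindings:", len(valid))
}
```

With this shape, the number of emitted messages scales with the number of cache instances rather than with the number of Syslog Agents.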

chombium requested changes Nov 25, 2024

corporatemax requested a review from chombium November 25, 2024 16:38

corporatemax changed the title ~~Syslog drain app error messages~~ Syslog drain app error messages in app log stream Nov 26, 2024

corporatemax marked this pull request as ready for review November 27, 2024 10:53

corporatemax requested a review from a team as a code owner November 27, 2024 10:53

corporatemax force-pushed the syslog-drain-app-error-messages branch from bf6dd19 to 9f789db Compare December 2, 2024 08:39

chombium previously approved these changes Dec 2, 2024

ctlong reviewed Dec 10, 2024

src/pkg/egress/syslog/writer_factory.go

corporatemax force-pushed the syslog-drain-app-error-messages branch from fb4fc55 to 7336dcf Compare January 8, 2025 16:27

chombium requested changes Jan 16, 2025

corporatemax force-pushed the syslog-drain-app-error-messages branch from 8660c84 to ca33637 Compare February 4, 2025 13:10

corporatemax requested a review from chombium February 6, 2025 09:18

chombium requested changes Mar 12, 2025

corporatemax force-pushed the syslog-drain-app-error-messages branch from 78a1984 to d36e467 Compare March 14, 2025 16:55

corporatemax requested a review from chombium March 18, 2025 15:41

chombium mentioned this pull request Mar 19, 2025

add section for syslog-drain-url protocols, certificates and troubleshooting hints cloudfoundry/docs-dev-guide#548

Merged

corporatemax force-pushed the syslog-drain-app-error-messages branch 2 times, most recently from 4860796 to 3bc0a08 Compare March 20, 2025 15:45

chombium previously approved these changes Mar 20, 2025

chombium requested a review from ctlong March 20, 2025 15:50

corporatemax force-pushed the syslog-drain-app-error-messages branch from 3bc0a08 to c4a5487 Compare March 21, 2025 08:59

chombium previously approved these changes Mar 21, 2025

ctlong previously requested changes Apr 2, 2025

corporatemax added 2 commits June 27, 2025 16:17

remove AppLogEmitter from filtered_binding_fetcher

dd8f517

chombium mentioned this pull request Jul 8, 2025

feat: Add the custom headers syslog http drain RFC cloudfoundry/community#1228

Open

only emit logs in retry writer

a45b295

corporatemax dismissed chombium’s stale review via a45b295 July 28, 2025 09:27

corporatemax force-pushed the syslog-drain-app-error-messages branch from c4a5487 to a45b295 Compare July 28, 2025 09:27

corporatemax requested a review from chombium July 29, 2025 06:56

remove unused field of struct (lint issue)

6bb279e

chombium requested changes Aug 14, 2025

read forwarder agend address from environment/bpm template

fc1737e

corporatemax requested a review from chombium August 19, 2025 14:32

chombium requested changes Aug 28, 2025

wip add syslog drain validation logic to syslog binding cache

2ab4406
