Skip to content

Conversation

vparfonov
Copy link
Contributor

@vparfonov vparfonov commented Sep 15, 2025

Summary

This PR this is a continuation of work on the PR started here: #21307
Changes:

  • a new DynamicOrStatic<T> enum was introduced to allow facility and severity to be configured in multiple ways:
    • as a static name (e.g., "user")
    • as a static number (e.g., 16)
    • ss a dynamic field path (e.g., "$.level")
  • custom serde deserializers were implemented to handle this complex, multi-format input, providing clear error messages for invalid values.
  • support except_fields, this allows users to remove specified fields from the LogEvent before it is processed, which is useful for stripping internal or sensitive data before it's included in the final payload.
  • added #[serde(alias = "tag")] to the app_name field, allowing users to use the more familiar tag option for RFC3164 configurations.
  • RFC3164 Tag Handling: The logic for generating the RFC3164 TAG was improved with a regex to prevent incorrectly duplicating the proc_id if the app_name field was already formatted.
  • added unit tests

Vector configuration

[sinks.example]
type = "socket"
inputs = ["example_parse_encoding"]
address = "logserver:514"
mode = "xyz"

[sinks.example.encoding]
codec = "syslog"
except_fields = ["_internal"]
rfc = "rfc5424"
facility = "$$._syslog.facility"
severity = "$$._syslog.severity"
app_name = "$$._syslog.app_name"
msg_id = "$$._syslog.msg_id"
proc_id = "$$._syslog.proc_id"

How did you test this PR?

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details here.

syedriko and others added 22 commits March 15, 2024 17:23
Original commit from syedriko
This is only a temporary change to make the diffs for future commits easier to follow.
- Introduce a `Pri` struct with fields for severity and facility as enum values.
  - `Pri` uses `strum` crate to parse string values into their appropriate enum variant.
  - Handles the responsibility of encoding the two enum values ordinal values into the `PRIVAL` value for the encoder.
- As `Facility` and `Severity` enums better represent their ordinal mapping directly
  - The `Fixed` + `Field` subtyping with custom deserializer isn't necessary. Parsing a string that represents the enum by name or its ordinal representation is much simpler.
  - Likewise this removes the need for the get methods as the enum can provide both the `String` or `u8` representation as needed.
`SyslogSerializer::encode()` has been simplified.
- Only  matching `Event::Log` is relevant, an `if let` bind instead of `match` helps remove a redundant level of nesting.
- This method only focuses on boilerplate now, delegating the rest to `ConfigDecanter` (_adapt `LogEvent` + encoder config_) and `SyslogMessage` (_encode into syslog message string_).
- This removes some complexity during actual encoding logic, which should only be concerned about directly encoding from one representation to another, not complimentary features related to Vector config or it's type system.

The new `ConfigDecanter` is where many of the original helper methods that were used by `SyslogSerializer::encode()` now reside. This change better communicates the scope of their usage.
- Any interaction with `LogEvent` is now contained within the methods of this new struct. Likewise for the consumption of the encoder configuration (instead of queries to config throughout encoding).
- The `decant_config()` method better illustrates an overview of the data we're encoding and where that's being sourced from via the new `SyslogMessage` struct, which splits off the actual encoding responsibility (see next commit).
`SyslogSerializerConfig` has been simplified.
- Facility / Severity deserializer methods aren't needed, as per their prior refactor with `strum`.
- The `app_name` default is set via `decant_config()` when not configured explicitly.
- The other two fields calling a `default_nil_value()` method instead use an option value which encodes `None` into the expected `-` value.
- Everything else does not need a serde attribute to apply a default, the `Default` trait on the struct is sufficient.
- `trim_prefix` was removed as it didn't seem relevant. `tag` was also removed as it's represented by several subfields in RFC 5424 which RFC 3164 can also use.

`SyslogMessage::encode()` refactors the original PR encoding logic:
- Syslog Header fields focused, the PRI and final message value have already been prepared prior. They are only referenced at the end of `encode()` to combine into the final string output.
- While less efficient than `push_str()`, each match variant has a clear structure returned via the array `join(" ")` which minimizes the noise of `SP` from the original PR. Value preparation prior to this is clear and better documented.
- `Tag` is a child struct to keep the main logic easy to grok. `StructuredData` is a similar case.
No changes beyond relocating the code into a single file.
- Drop notes referring to original PR differences + StructuredData adaption references. None of it should be relevant going forward.
- Revise some other notes.
- Drop `add_log_source` method (introduced from the original PR author) in favor of using `StructuredData` support instead.
This should be simple and lightweight enough to justify for the DRY benefit?

This way the method doesn't need to be duplicated redundantly. That was required because there is no trait for `FromRepr` provided via `strum`. That would require a similar amount of lines for the small duplication here.

The `akin` macro duplicates the `impl` block for each value in the `&enums` array.
- `ConfigDecanter::get_message()` replaces the fallback method in favor of `to_string_lossy()` (a dedicated equivalent for converting `Value` type to a String type (_technically it is a CoW str, hence the follow-up with `to_string()`_)).
  - This also encodes the value better, especially for the default `log_namespace: false` as the message value (when `String`) is not quote wrapped, which matches the behaviour of the `text` encoder output.
  - Additionally uses the `LogEvent` method `get_message()` directly from `lib/vector-core/src/event
/log_event.rs`. This can better retrieve the log message regardless of the `log_namespace` setting.
- Encoding of RFC 5424 fields has changed to inline the `version` constant directly, instead of via a redundant variable. If there's ever multiple versions that need to be supported, it could be addressed then.
- The RFC 5424 timestamp has a max precision of microseconds, thus this should be rounded and `AutoSi` can be used (_or `Micros` if it should have fixed padding instead of truncating trailing `000`_).
- The original PR author appears to have relied on a hard-coded timestamp key here.
- `DateTime<Local>` would render the timestamp field with the local timezone offset, but other than that `DateTime<Utc>` would seem more consistent with usage in Vector, especially since any original TZ context is lost by this point?
- Notes adjusted accordingly, with added TODO query for each encoding mode to potentially support configurable timezone.
- Move encoder config settings under a single `syslog` config field. This better mirrors configuration options for existing encoders like Avro and CSV.
- `ConfigDecanter::value_by_key()` appears to accomplish roughly the same as the existing helper method `to_string_lossy()`. Prefer that instead. This also makes the `StructuredData` helper `value_to_string()` redundant too at a glance?
- Added some reference for the priority value `PRIVAL`.
- `Pri::from_str_variants()` uses the existing defaults for fallback, communicate that more clearly. Contextual note is no longer useful, removed.
To better communicate the allowed values, these two config fields can change from the `String` type to their appropriate enum type.
- This relies on serde to deserialize the config value to the enum which adds a bit more noise to grok.
- It does make `Pri::from_str_variants()` redundant, while the `into_variant()` methods are refactored to `deserialize()` with a proper error message emitted to match the what serde would normally emit for failed enum variant deserialization.
- A drawback of this change is that these two config fields lost the ability to reference a different value path in the `LogEvent`. That'll be addressed in a future commit.
In a YAML config a string can optionally be wrapped with quotes, while a number that isn't quote wrapped will be treated as a number type.

The current support was only for string numbers, this change now supports flexibility for config using ordinal values in YAML regardless of quote usage.

The previous `Self::into_variant(&s)` logic could have been used instead of bringing in `serde-aux`, but the external helper attribute approach seems easier to grok/follow as the intermediary container still seems required for a terse implementation.

The match statement uses a reference (_which requires a deref for `from_repr`_) to appease the borrow checker for the later borrow needed by `value` in the error message.
This seems redundant given the context? Mostly adds unnecessary noise.

Could probably `impl Configurable` or similar to try workaround the requirement. The metadata description could generate the variant list similar to how it's been handled for error message handling?
Not sure if this is worthwhile, but it adopts error message convention elsewhere I've seen by managing them via Snafu.
Signed-off-by: Vitalii Parfonov <[email protected]>
…rity dynamic, payload_key optional

Signed-off-by: Vitalii Parfonov <[email protected]>
Signed-off-by: Vitalii Parfonov <[email protected]>
Signed-off-by: Vitalii Parfonov <[email protected]>
Signed-off-by: Vitalii Parfonov <[email protected]>
@vparfonov vparfonov requested review from a team as code owners September 15, 2025 08:40
@github-actions github-actions bot added the domain: releasing Anything related to releasing Vector label Sep 15, 2025
@vparfonov vparfonov marked this pull request as draft September 15, 2025 08:56
@github-actions github-actions bot added domain: codecs Anything related to Vector's codecs (encoding/decoding) and removed domain: releasing Anything related to releasing Vector labels Sep 15, 2025
@github-actions github-actions bot added the domain: sinks Anything related to the Vector's sinks label Sep 15, 2025
@vparfonov vparfonov marked this pull request as ready for review September 15, 2025 12:27
Copy link
Contributor

@drichards-87 drichards-87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. Since there are no documentation updates in this PR, there's nothing for me to review, so I'm approving on behalf of the Docs Team.

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @vparfonov, we will do a more thorough review shortly.

chrono.workspace = true
csv-core = { version = "0.1.12", default-features = false }
derivative.workspace = true
derivative = { version = "2", default-features = false }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use workspace dependencies if they are used in >1 places.

snafu.workspace = true
syslog_loose = { version = "0.23", default-features = false, optional = true }
strum = { version = "0.26", features = ["derive"], optional = true }
syslog_loose = { version = "0.22", default-features = false, optional = true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intentionally downgrade here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, i think it was mistake i will fix

@thomasqueirozb thomasqueirozb added the meta: awaiting author Pull requests that are awaiting their author. label Sep 16, 2025
@polarathene
Copy link

polarathene commented Sep 20, 2025

TL;DR: It looks like this PR is trying to mix in an additional feature for the encoder config to support referencing LogEvent data with different keys, when VRL remap transform should be capable of abstracting that out? (if it's something worthwhile to contribute perhaps extract that out to a separate PR/discussion after landing the encoder support?)

My feedback below is regarding changes related to the TLDR concern raised above, when comparing to my stale PR and it's review feedback.

EDIT: The concern appears to have been raised internally, so further review should wait until the refactor is pushed. I can also update my PR if that'd help with changes I note below.

Regex for RFC 3164 tag encoding

  • RFC3164 Tag Handling: The logic for generating the RFC3164 TAG was improved with a regex to prevent incorrectly duplicating the proc_id if the app_name field was already formatted.

Could you please explain when the app_name would also carry the proc_id, such that this regex check was necessary?

I assume this is specifically for the use-case where you want the static encoder config to additionally support referencing a different LogEvent field? Or something else specific to how you're providing the data?

You have this as:

// This regex pattern checks for "something[digits]".
// It's created lazily to be compiled only once.
static RFC3164_TAG_REGEX: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^\S+\[\d+]$").unwrap());

impl Tag {
    fn encode_rfc_3164(&self) -> String {
        if RFC3164_TAG_REGEX.is_match(&self.app_name) {
            // If it's already formatted
            format!("{}:", self.app_name)
        } else if let Some(proc_id) = self.proc_id.as_deref() {
            format!("{}[{}]:", self.app_name, proc_id)
        } else {
            format!("{}:", self.app_name)
        }
    }

    // ...
}

Compared to my implementation:

impl Tag {
    // Roughly equivalent - RFC 5424 fields can compose the start of
    // an RFC 3164 MSG part (TAG + CONTENT fields):
    // https://datatracker.ietf.org/doc/html/rfc5424#appendix-A.1
    fn encode_rfc_3164(&self) -> String {
        let Self { app_name, proc_id, msg_id } = self;

        match proc_id.as_deref().or(msg_id.as_deref()) {
            Some(context) => [&app_name, "[", &context, "]:"].concat(),
            None => [&app_name, ":"].concat()
        }
    }

    // ...
}

NOTE: I don't use the convenience of format!() since IIRC performance was worse, so concat on a fixed array of strings was better 😅

From the looks of it with your regex addition, you're adding support for when app_name has bundled the proc_id formatting but without the :, but that's more of an issue/workaround for how you may be providing the log data to the encoder on your end? app_name really shouldn't have [ (or :) in it's value to begin with.

As this doesn't actually seem like a valid requirement for the encoder to support (see reference links in next section for both RFCs on this encoding step), I would discourage it and defer to correction of the input before it is processed by the encoder.


Serde alias tag for app_name field (Tag struct)

  • added #[serde(alias = "tag")] to the app_name field, allowing users to use the more familiar tag option for RFC3164 configurations.
/// App Name. Can be a static string or a dynamic field path like "$.app".
/// `tag` is supported as an alias for `app_name` for RFC3164 compatibility.
#[serde(alias = "tag")]
app_name: Option<String>,

I don't see why the alias is needed?

In my PR I have relevant comment links:

  • RFC 3164 section 4.1.3 explains the MSG part, which is comprised of the TAG and CONTENT fields.

    The value in the TAG field will be the name of the program or process that generated the message.
    The CONTENT contains the details of the message.

  • RFC 5424 - Appendix 1 explains that compared to RFC 3164:

    In this document, MSG is what was called CONTENT in RFC 3164.

    • The TAG is now part of the header, but not as a single field.
    • The TAG has been split into APP-NAME, PROCID, and MSGID.

Why is it important to serialize the app_name from a tag field?

Is it not redundant with your desire for config values to support this custom reference of alternative keys from LogEvent data already? A feature that is already unclear for why it is required when Remap transform can be used with the Vector config?

Personally with Vector already having it's own standard approach to handle transforming the input, it would be helpful to understand what value these two alternatives are providing that aren't specifically to support your personal usage when some config changes could be made to keep the implementation simple on Vector's side?


except_fields config feature

  • support except_fields, this allows users to remove specified fields from the LogEvent before it is processed, which is useful for stripping internal or sensitive data before it's included in the final payload.

This feature might have a bit of extra convenience, but it sounds more like a generic feature than SysLog support specific, how much of a value add is that over the equivalent config in VRL? (a remap transform should be able to fairly easily do the equivalent via remove()?)

AFAIK, the encoder should just focus on it's task to encode input received, and the config is meant to support that but any need to transform data prior to fit the encoders expected input shape is a separate task that should be generic not encoder specific?


DynamicOrStatic<T> enum

  • a new DynamicOrStatic<T> enum was introduced to allow facility and severity to be configured in multiple ways:

    • as a static name (e.g., "user")
    • as a static number (e.g., 16)
    • ss a dynamic field path (e.g., "$.level")
  • custom serde deserializers were implemented to handle this complex, multi-format input, providing clear error messages for invalid values.

/// A configuration value that can be either a static value or a dynamic path.
#[configurable_component]
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum DynamicOrStatic<T: 'static> {
    /// A static, fixed value.
    Static(T),
    /// A dynamic value read from a field in the event using `$.` path syntax.
    Dynamic(String),
}

Dynamic variant support aside (since concerns for that were covered above already), this has deviated quite a bit from what my PR had, was there a reason to do so?

I did have the Dynamic variant support working perfectly fine IIRC, if not in the current iteration of the PR, there would be some commits back (when config struct handled a string dynamic key lookup for these enums in the decant method).

This is what you have for the deserializing support:

// Generic helper.
fn deserialize_syslog_code<'de, D, T>(
    deserializer: D,
    type_name: &'static str,
    max_value: usize,
    from_repr_fn: fn(usize) -> Option<T>,
) -> Result<DynamicOrStatic<T>, D::Error>
where
    D: Deserializer<'de>,
    T: FromStr + VariantNames,
{
    let s = String::deserialize(deserializer)?;
    if s.starts_with("$.") {
        Ok(DynamicOrStatic::Dynamic(s))
    } else {
        parse_syslog_code(&s, from_repr_fn)
            .map(DynamicOrStatic::Static)
            .ok_or_else(|| {
                serde::de::Error::custom(format!(
                    "Invalid {type_name}: '{s}'. Expected a name, integer 0-{max_value}, or path."
                ))
            })
    }
}

fn parse_syslog_code<T>(s: &str, from_repr_fn: fn(usize) -> Option<T>) -> Option<T>
where
    T: FromStr,
{
    if let Ok(value_from_name) = s.parse::<T>() {
        return Some(value_from_name);
    }
    if let Ok(value_from_number) = s.parse::<u64>() {
        return from_repr_fn(value_from_number as usize);
    }
    None
}

fn deserialize_facility<'de, D>(deserializer: D) -> Result<DynamicOrStatic<Facility>, D::Error>
where
    D: Deserializer<'de>,
{
    deserialize_syslog_code(deserializer, "facility", 23, Facility::from_repr)
}

fn deserialize_severity<'de, D>(deserializer: D) -> Result<DynamicOrStatic<Severity>, D::Error>
where
    D: Deserializer<'de>,
{
    deserialize_syslog_code(deserializer, "severity", 7, Severity::from_repr)
}

And here is what I had with my existing PR, with some review feedback applied (switch from akin crate to plain macro_rules + some minor revision):

macro_rules! deserialize_impl {
    ($enum:ty) => {
	impl $enum {
	    fn deserialize<'de, D>(deserializer: D) -> Result<Self, D::Error>
        where
            D: Deserializer<'de>,
        {
            let value = NumberOrString::deserialize(deserializer)?;
            Self::try_from(value).map_err(D::Error::custom)
        }
    }

    impl TryFrom<NumberOrString> for $enum {
        type Error = StrumParseError;

        fn try_from(value: NumberOrString) -> Result<Self, <Self as TryFrom<NumberOrString>>::Error> {
            let variant: Option<Self> = match &value {
                NumberOrString::Number(num) => Self::from_repr(*num),
                NumberOrString::String(s) => Self::from_str(&s.to_ascii_lowercase()).ok(),
            };
  
            variant.with_context(|| InvalidVariantSnafu {
                input: value.to_string(),
                variants: Self::VARIANTS.join("`, `"),
            })
        }
    }
}

deserialize_impl!(Facility);
deserialize_impl!(Severity);

// An intermediary container to deserialize config value into.
// Ensures that a string number is properly deserialized to the `usize` variant.
#[derive(derive_more::Display, Deserialize)]
#[serde(untagged)]
enum NumberOrString {
    Number(
        #[serde(deserialize_with = "deserialize_number_from_string")]
        usize
    ),
    String(String)
}

#[derive(Debug, Snafu)]
enum StrumParseError {
    #[snafu(display("Unknown variant `{input}`, expected one of `{variants}`"))]
    InvalidVariant { input: String, variants: String },
}

NOTES:

  • I split out logic from my current PR's deserializer to separate out a TryFrom impl as shown above, as I think I intended to support that conversion somewhere where serde wasn't used.
  • I had also moved the deserializer annotation to the Pri struct, with the parent config struct deserializing with flattened fields:
    #[configurable_component]
    #[derive(Clone, Debug, Default)]
    #[serde(default)]
    pub struct SyslogSerializerOptions {
        /// RFC
        rfc: SyslogRFC,
    
        #[serde(flatten)]
        #[configurable(derived)]
        priority: Pri,
    
        #[serde(flatten)]
        #[configurable(derived)]
        tag: Tag,
    }
    
    /// Priority Value
    #[derive(Clone, Default, Debug)]
    #[configurable_component]
    #[serde(default)]
    struct Pri {
        /// Facility
        #[serde(deserialize_with = "Facility::deserialize")]
        facility: Facility,
        /// Severity
        #[serde(deserialize_with = "Severity::deserialize")]
        severity: Severity,
    }
  • Usage of serde(untagged) attribute for NumberOrString could be replaced in favor of more performance AFAIK via some additional verbosity with the serde-untagged crate instead, or similar to your approach with parse_syslog_code().

This is also technically an inconsistency with your SyslogSerializerOptions since DynamicOrStatic<T> and Option<String> are both being used with that intent of dynamic/static string value.

So our approaches are mostly similar, except mine allows for keeping the type consistent as the enums in both config struct and the later derived decant/encoding struct. Doing so I think is better to grok/maintain?

@pront
Copy link
Member

pront commented Sep 24, 2025

Thank you for the thorough feedback @polarathene.

Some general comments from me to help move things forward:

  • I prefer starting with the simplest possible version, e.g. except_fields can be added later if we deem it necessary.
  • When giving review feedback prefer multiple comments (one per topic) and when possible attach them on the code itself. It's hard to track what was discussed and what was resolved.
  • I would not worry about micro-optimizations such as concat vs format at this stage.

@polarathene
Copy link

  • When giving review feedback prefer multiple comments (one per topic) and when possible attach them on the code itself. It's hard to track what was discussed and what was resolved.

I normally do :)

However I was mostly focused on the PR description statements and comparing to my own equivalent snippets from my earlier PR. Along with the understanding that the current PR was going to see some notable refactoring... it was easier for me to structure it as I did for my own reference.

As such it was more of a discussion with the PR authors decisions, to get on the same page rather than review the existing PR. IIRC when reviewing inline on proposed changes, it doesn't always convey the scope of lines the comment is for (only highlights the last one and a few above it), so without a suggested change proposed, it was clearer for me to use as reference for comparison, especially after any rework was done, I'm not particularly interested in that extra leg work to dig up the old context intended 😅


  • I would not worry about micro-optimizations such as concat vs format at this stage.

I was more curious about the deviation from what I already had there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: codecs Anything related to Vector's codecs (encoding/decoding) domain: sinks Anything related to the Vector's sinks meta: awaiting author Pull requests that are awaiting their author.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants