Conversation
@slvrtrn slvrtrn commented Jun 20, 2025

Summary

Implements RBWNAT support for insert and inserter.

Unfortunately, there will be quite a lot of breaking changes. Specifically, we absolutely need to make Query::insert async so that we can properly fetch the schema, since the RBWNAT header must be written to the stream before the first record. This also implies making Inserter::write async (but at least that is aligned with how insert already works).

On the flip side, the useless Result return types of the new insert and inserter constructors are now removed.
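To make the ordering constraint concrete, here is a minimal, self-contained sketch of the RowBinaryWithNamesAndTypes header layout (this is an illustration of the wire format, not the crate's actual code): a LEB128-encoded column count, followed by the column names and then the column types, each as a length-prefixed string. Since this header precedes the first record, the schema must be known before anything is serialized, which is why the constructor has to be async:

```rust
// Writes an unsigned integer in LEB128 (7 bits per byte, high bit = "more follows").
fn write_leb128(buf: &mut Vec<u8>, mut value: u64) {
    loop {
        let mut byte = (value & 0x7f) as u8;
        value >>= 7;
        if value != 0 {
            byte |= 0x80;
        }
        buf.push(byte);
        if value == 0 {
            break;
        }
    }
}

// A ClickHouse string is a LEB128 length followed by the raw bytes.
fn write_string(buf: &mut Vec<u8>, s: &str) {
    write_leb128(buf, s.len() as u64);
    buf.extend_from_slice(s.as_bytes());
}

/// Writes the RBWNAT header for the given (name, type) pairs:
/// column count, then all names, then all type names.
fn write_rbwnat_header(buf: &mut Vec<u8>, columns: &[(&str, &str)]) {
    write_leb128(buf, columns.len() as u64);
    for (name, _) in columns {
        write_string(buf, name);
    }
    for (_, ty) in columns {
        write_string(buf, ty);
    }
}

fn main() {
    let mut buf = Vec::new();
    write_rbwnat_header(&mut buf, &[("id", "UInt64"), ("name", "String")]);
    // 1 count byte + (1+2) + (1+4) for names + (1+6) + (1+6) for types = 23
    assert_eq!(buf.len(), 23);
    assert_eq!(buf[0], 2); // two columns
}
```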

Current progress:

  • RowMetadata cache - initially I wanted to make it static and global, but that is extremely unfriendly for tests: the cache is shared within the test process and collides for tables with the same name, leading to cryptic errors
  • Insert support
  • Inserter support
  • Writing RBWNAT header
  • Serializer validation calls
  • Tests
  • README update
  • Adjust mocked_insert benchmark (TLDR: ~2x slower with validation)

Related issues

A checkmark means that we have all necessary integration tests in place.

Resolved after this PR

Previously closed issues

These issues were closed due to schema mismatch in the code, but now should have clearer error (panic) messages:

Wrapping up RBWNAT

After #221, we had a few leftovers:

  • Test (de)serialization of various map types (e.g., BTreeMap, IndexMap, etc.)
  • Figure out the final version of Mock API after all the breaking changes

Follow-ups

The flatten attribute is non-trivial and has been extracted as #264 to reduce the scope of this PR; it will be implemented later.

@slvrtrn slvrtrn requested review from loyd and serprex June 20, 2025 19:47
@slvrtrn slvrtrn changed the base branch from row-binary-header-check to main June 23, 2025 14:55
@slvrtrn slvrtrn changed the title feat(insert): RowBinaryWithNamesAndTypes feat!(insert): RowBinaryWithNamesAndTypes Jun 23, 2025
@mshustov mshustov requested a review from loyd July 13, 2025 10:57
slvrtrn commented Jul 30, 2025

Figure out the final version of Mock API after all the breaking changes

Perhaps we should do it as a follow-up.

slvrtrn commented Jul 30, 2025

Added the mocked_insert benchmark with validation; the results are a bit odd:

insert/validation=true/compression=None
                        time:   [79.181 ns 84.123 ns 89.261 ns]
                        thrpt:  [598.31 MiB/s 634.85 MiB/s 674.47 MiB/s]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
insert/validation=true/compression=Lz4
                        time:   [88.744 ns 91.117 ns 94.152 ns]
                        thrpt:  [567.23 MiB/s 586.12 MiB/s 601.80 MiB/s]
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe
insert/validation=false/compression=None
                        time:   [43.984 ns 44.218 ns 44.454 ns]
                        thrpt:  [1.1732 GiB/s 1.1795 GiB/s 1.1858 GiB/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
insert/validation=false/compression=Lz4
                        time:   [41.431 ns 41.623 ns 41.817 ns]
                        thrpt:  [1.2472 GiB/s 1.2530 GiB/s 1.2588 GiB/s]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

inserter/validation=true/compression=None
                        time:   [101.51 ns 107.46 ns 113.71 ns]
                        thrpt:  [469.67 MiB/s 496.97 MiB/s 526.12 MiB/s]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
inserter/validation=true/compression=Lz4
                        time:   [88.339 ns 92.919 ns 97.824 ns]
                        thrpt:  [545.94 MiB/s 574.76 MiB/s 604.56 MiB/s]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
inserter/validation=false/compression=None
                        time:   [65.365 ns 65.697 ns 66.001 ns]
                        thrpt:  [809.17 MiB/s 812.91 MiB/s 817.04 MiB/s]
Found 19 outliers among 100 measurements (19.00%)
  19 (19.00%) low severe
inserter/validation=false/compression=Lz4
                        time:   [61.713 ns 64.738 ns 67.457 ns]
                        thrpt:  [791.70 MiB/s 824.96 MiB/s 865.38 MiB/s]

inserter-period/validation=true/compression=None
                        time:   [104.17 ns 110.77 ns 118.18 ns]
                        thrpt:  [451.92 MiB/s 482.12 MiB/s 512.70 MiB/s]
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
inserter-period/validation=true/compression=Lz4
                        time:   [72.856 ns 74.972 ns 76.716 ns]
                        thrpt:  [696.15 MiB/s 712.34 MiB/s 733.03 MiB/s]
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high severe
inserter-period/validation=false/compression=None
                        time:   [67.356 ns 68.250 ns 68.976 ns]
                        thrpt:  [774.27 MiB/s 782.50 MiB/s 792.89 MiB/s]
inserter-period/validation=false/compression=Lz4
                        time:   [62.827 ns 65.378 ns 67.786 ns]
                        thrpt:  [787.86 MiB/s 816.87 MiB/s 850.05 MiB/s]

QazCetelic commented Aug 2, 2025

This indeed fixes #266

Comment on lines +133 to +144
if columns.len() != T::COLUMN_NAMES.len() {
    panic!(
        "While processing struct {}: database schema has {} columns, \
        but the struct definition has {} fields.\
        \n#### All struct fields:\n{}\n#### All schema columns:\n{}",
        T::NAME,
        columns.len(),
        T::COLUMN_NAMES.len(),
        join_panic_schema_hint(T::COLUMN_NAMES),
        join_panic_schema_hint(&columns),
    );
}
Contributor
This will break inserts that leave out columns with default values, and relatedly, it will make it impossible to insert into tables with MATERIALIZED or ALIAS columns altogether. Perhaps this function should use the DESCRIBE TABLE output rather than the RowBinaryWithNamesAndTypes header?

Contributor Author

Thanks, this is a good find; it is also worth testing with both the current and the updated version.
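As a hedged sketch of the reviewer's suggestion above: DESCRIBE TABLE reports each column's default kind (e.g. DEFAULT, MATERIALIZED, ALIAS), so the client could exclude the columns that cannot be written before comparing against the struct fields. The type and function names here are illustrative, not the crate's actual API:

```rust
// Simplified representation of one row of DESCRIBE TABLE output.
#[derive(Debug, Clone)]
struct DescribedColumn {
    name: String,
    default_kind: String, // "", "DEFAULT", "MATERIALIZED", or "ALIAS"
}

/// Keeps only the columns a client is expected to send in an INSERT:
/// MATERIALIZED and ALIAS columns are computed server-side, and DEFAULT
/// columns may be legitimately omitted by the caller.
fn insertable_columns(described: &[DescribedColumn]) -> Vec<&DescribedColumn> {
    described
        .iter()
        .filter(|c| c.default_kind != "MATERIALIZED" && c.default_kind != "ALIAS")
        .collect()
}

fn main() {
    let described = vec![
        DescribedColumn { name: "id".into(), default_kind: String::new() },
        DescribedColumn { name: "ts_date".into(), default_kind: "MATERIALIZED".into() },
        DescribedColumn { name: "alias_col".into(), default_kind: "ALIAS".into() },
    ];
    let insertable = insertable_columns(&described);
    // Only "id" survives; the server-computed columns are filtered out.
    assert_eq!(insertable.len(), 1);
    assert_eq!(insertable[0].name, "id");
}
```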

  }
- let mut insert = client.insert("some")?;
+ let mut insert = client.insert::<MyRow>("some").await?;
Contributor

@abonander Sep 17, 2025

Instead of ```rust,ignore we could use ```rust,no_run so that these at least get compile-tested.

I could take that on as a task.

Contributor Author

This can be done as part of finalizing #160.

)
ENGINE = MergeTree
PRIMARY KEY (instrument_id, received_time)
"#,
Contributor

Using raw string literals is also idiomatic in SQLx, though strictly speaking it is only necessary when you need to quote identifiers, since it avoids the need for escapes: https://docs.rs/sqlx/latest/sqlx/macro.query.html#type-overrides-output-columns

Since it appears ClickHouse uses backticks instead of double quotes for quoted identifiers, there's not much benefit to using raw string literals here.
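A quick, self-contained illustration of that point: raw string literals only pay off when the SQL itself contains double quotes that would otherwise need escaping, and backtick-quoted identifiers need no escaping in an ordinary Rust string literal:

```rust
fn main() {
    // Identical strings; the raw literal just avoids the backslashes.
    let escaped = "SELECT \"weird name\" FROM t";
    let raw = r#"SELECT "weird name" FROM t"#;
    assert_eq!(escaped, raw);

    // ClickHouse-style backtick-quoted identifiers sit comfortably in a
    // normal string literal, so a raw literal buys nothing here.
    let backticked = "SELECT `weird name` FROM t";
    assert!(backticked.contains('`'));
}
```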

}
}

/// Cache for [`RowMetadata`] to avoid allocating it for the same struct more than once
Contributor

Something to be careful about with this kind of caching is that the user may make concurrent changes to the table which invalidate the cached metadata. This could easily happen if they run migrations during a rolling upgrade of an application.

We've had problems with this in SQLx with regard to prepared statement caching.

This isn't necessarily a bad idea, but at the very least you should provide a method for the user to manually invalidate the cache.

Contributor Author

@slvrtrn Sep 18, 2025

Indeed! I think ideally we should:

  • provide a TTL (maybe one or a few hours by default)
  • provide the ability to drop the caches manually
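The two ideas above could be combined along these lines; this is a hypothetical sketch (the MetadataCache name and its methods are illustrative, not the crate's actual cache):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Per-client metadata cache: entries expire after a TTL, and the user
/// can also drop everything manually (e.g. after running migrations).
struct MetadataCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> MetadataCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns a cached value unless it has expired.
    fn get(&self, table: &str) -> Option<V> {
        self.entries.get(table).and_then(|(inserted_at, value)| {
            (inserted_at.elapsed() < self.ttl).then(|| value.clone())
        })
    }

    fn insert(&mut self, table: &str, value: V) {
        self.entries.insert(table.to_owned(), (Instant::now(), value));
    }

    /// Manual invalidation for when the schema is known to have changed.
    fn clear(&mut self) {
        self.entries.clear();
    }
}

fn main() {
    let mut cache = MetadataCache::new(Duration::from_secs(3600));
    cache.insert("events", vec!["id", "ts"]);
    assert!(cache.get("events").is_some());
    cache.clear();
    assert!(cache.get("events").is_none());

    // A zero TTL makes every entry immediately stale.
    let mut stale = MetadataCache::new(Duration::ZERO);
    stale.insert("events", vec!["id"]);
    assert!(stale.get("events").is_none());
}
```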

// https://clickhouse.com/docs/en/sql-reference/syntax#identifiers
let sql = format!("INSERT INTO {table}({fields}) FORMAT RowBinary");
let format = if row_metadata.is_some() {
"RowBinaryWithNamesAndTypes"
Contributor

I'd like to see the format types defined as an enum, or at least as constants somewhere. That's something I could do. If we want the user to be able to specify formats that we don't know about (e.g., if new ones get added), there are options for that.

Contributor Author

Currently, this is internal only; we could add some popular formats as an enum provided by the crate. This might be useful with fetch_bytes: https://github.com/ClickHouse/clickhouse-rs/blob/main/examples/stream_arbitrary_format_rows.rs
and its INSERT counterpart, which is yet to be implemented: #174

So perhaps it can indeed be introduced in the scope of #174. However, it should implement Into<String> and not break the existing API that expects an impl Into<String>, because we want to export these enums only as a hint of what is available; this library might not be in sync with the latest ClickHouse release that adds a new format, which the end user might want to use as soon as possible.
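One way the enum could satisfy that constraint is sketched below; the Format name, its variants, and fetch_bytes here are illustrative stand-ins, not the crate's actual API:

```rust
/// Known formats as variants, plus an escape hatch for format names this
/// library does not know about yet (e.g. added in a newer ClickHouse release).
enum Format {
    RowBinary,
    RowBinaryWithNamesAndTypes,
    Custom(String),
}

// Implementing From<Format> for String gives us Into<String> for free,
// so the existing `impl Into<String>` API keeps working unchanged.
impl From<Format> for String {
    fn from(format: Format) -> Self {
        match format {
            Format::RowBinary => "RowBinary".to_owned(),
            Format::RowBinaryWithNamesAndTypes => "RowBinaryWithNamesAndTypes".to_owned(),
            Format::Custom(name) => name,
        }
    }
}

// Stand-in for an API method taking `impl Into<String>`: it accepts both
// the enum variants and plain strings for formats the crate doesn't list.
fn fetch_bytes(format: impl Into<String>) -> String {
    format!("... FORMAT {}", format.into())
}

fn main() {
    assert_eq!(
        fetch_bytes(Format::RowBinaryWithNamesAndTypes),
        "... FORMAT RowBinaryWithNamesAndTypes"
    );
    assert_eq!(fetch_bytes("JSONEachRow"), "... FORMAT JSONEachRow");
    assert_eq!(
        fetch_bytes(Format::Custom("SomeFutureFormat".into())),
        "... FORMAT SomeFutureFormat"
    );
}
```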


slvrtrn commented Sep 22, 2025

Let's squash this one as it became too big and continue in the follow-ups. I created a few issues based on the feedback in the comments:

These issues should be addressed before 0.14.0.

We can release #294 (changing the panic to an error instead) as 0.14.1, because it is not a breaking change.

@slvrtrn slvrtrn merged commit 3947c14 into main Sep 22, 2025
7 checks passed
slvrtrn added a commit that referenced this pull request Sep 22, 2025
slvrtrn added a commit that referenced this pull request Oct 3, 2025
@slvrtrn slvrtrn deleted the rbwnat-insert branch October 8, 2025 19:18

Development

Successfully merging this pull request may close these issues.

  • Consideration of Type Safety
  • Support the FixedString type
  • Use RowBinaryWithNamesAndTypes
