Skip to content

Conversation

muzarski
Copy link
Contributor

@muzarski muzarski commented Apr 18, 2025

Fixes: #257
Ref: #132

This PR implements cass_session_get_metrics and enables two integration tests.

Metrics tests

To be quite honest, I thought we'll be able to enable more tests, but:

  1. StatsConnections

Requires cass_cluster_set_num_threads_io, cass_cluster_set_core _connections_per_host
and cass_cluster_set_constant_reconnect.

  1. ErrorsConnectionTimeouts

Requires cass_cluster_set_core_connections_per_host.

  1. SpeculativeExecutionRequests

Requires cass_session_get_speculative_execution_metrics.

  1. Requests

This one is interesting. It turns out that cpp-driver stores latency
stats as microseconds, while rust-driver stores them as milliseconds.

Because of that, the mean and median latency is rounded to 0 (at least for my machine).
The test expects them to be greater than 0, which makes sense assuming
the driver collects the stats with microsecond precision.

I'm not sure how to address this one. Is there any way to force
higher >=1ms latencies in the test?
Thanks to @Lorak-mmk suggestion, I was able to apply the HistoryListener hack in the test - this allows us to enforce >=1ms latencies for the requests.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have implemented Rust unit tests for the features/changes introduced.
  • I have enabled appropriate tests in .github/workflows/build.yml in gtest_filter.
  • I have enabled appropriate tests in .github/workflows/cassandra.yml in gtest_filter.

@muzarski muzarski self-assigned this Apr 18, 2025
@muzarski muzarski added the P1 P1 priority item - very important label Apr 18, 2025
@muzarski muzarski added this to the 0.5 milestone Apr 18, 2025
@muzarski muzarski requested review from Lorak-mmk and wprzytula April 18, 2025 15:24
@muzarski
Copy link
Contributor Author

HeartbeatFailed test failed. It worked for me locally - I'll investigate.

@muzarski muzarski marked this pull request as draft April 18, 2025 16:01
@muzarski muzarski force-pushed the metrics branch 3 times, most recently from 55ca2d9 to 3b5129d Compare April 18, 2025 18:55
@muzarski
Copy link
Contributor Author

Ok, it looks like the HearbeatFailed test is totally unreliable under vagrind. This makes sense, because this test is timeout-sensitive and valgrind increases the execution time significantly. This is similar to the test that used cass_future_wait_timed.

I've enabled the test under [C*/SCYLLA]_NO_VALGRIND_TEST_FILTER.

@muzarski muzarski marked this pull request as ready for review April 18, 2025 19:20
@muzarski muzarski requested review from Lorak-mmk and wprzytula April 18, 2025 19:20
@muzarski
Copy link
Contributor Author

v1.1: Enabled Requests test as well. This test does not work for me locally without valgrind (because sub-millisecond latencies are rounded to 0ms). But it seems to always pass under valgrind (latencies are higher than 1ms). I think it's worth giving it a shot and run it in our CI - the tests are running under valgrind and I expect GH actions runner to be slower than my local machine.

@muzarski
Copy link
Contributor Author

muzarski commented Apr 18, 2025

v1.1: Enabled Requests test as well. This test does not work for me locally without valgrind (because sub-millisecond latencies are rounded to 0ms). But it seems to always pass under valgrind (latencies are higher than 1ms). I think it's worth giving it a shot and run it in our CI - the tests are running under valgrind and I expect GH actions runner to be slower than my local machine.

v1.2: Well, I gave it a shot, but it failed for one of the Scylla releases..... So it unfortunately is flaky (and will be, unless we address microsecond granularity on the rust-driver side). Disabled this test again.

Copy link
Contributor

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to address this one. Is there any way to force
higher >=1ms latencies in the test?

What about your idea with setting coalescing to > 1ms?

@muzarski
Copy link
Contributor Author

muzarski commented Apr 22, 2025

What about your idea with setting coalescing to > 1ms?

It is not reliable enough.. Notice that there may be a burst of requests that are handled (and appended to request queue) just before the coalescing timer expires. Then such delay has almost no effect on the mean request latency value.

@Lorak-mmk
Copy link
Contributor

Hmm.... When is the "before" measurement of the latency done? If it is before LBP is consulted you could put sleep(1ms) in custom LBP.

@Lorak-mmk
Copy link
Contributor

Another possible place for that would be history listener.

@muzarski
Copy link
Contributor Author

Hmm.... When is the "before" measurement of the latency done? If it is before LBP is consulted you could put sleep(1ms) in custom LBP.

I see that we measure latency per-attempt. We completely ignore failing attempts when measuring the latency. Is it expected behaviour?

@Lorak-mmk
Copy link
Contributor

I have no idea tbh. @piodul ?

This means that both approaches (LBP, history) are not viable, right?

@muzarski
Copy link
Contributor Author

This means that both approaches (LBP, history) are not viable, right?

LBP will not work for sure. But history hack might work actually. I'll try it out.

@piodul
Copy link
Contributor

piodul commented Apr 22, 2025

I have no idea tbh. @piodul ?

Me neither. What do other drivers measure?

@muzarski
Copy link
Contributor Author

v1.3: Applied the HistoryListener hack to increase the request latency during the test. The idea behind this is explained in the commit message. Thanks to that, we can enable Requests metrics test. All changes are contained in the last commit.

@muzarski muzarski requested a review from Lorak-mmk April 23, 2025 10:03
@muzarski
Copy link
Contributor Author

Rebased on master

@muzarski muzarski mentioned this pull request Apr 24, 2025
6 tasks
We 0-initialize deprecated fields, so we retain the binary compatibility
with cpp-driver.
Since we implemented metrics, we can enable HeartbeatFailed test with some
adjustments to log filtering.

This test seems to fail under valgrind. This is why I enable it to run
next to other test that cannot be run under valgrind.

Note: The original test seems to be flaky for Cassandra. The following scenario
occurred:
1. node2 is paused
2. keepaliver notifies the pool refiller about that
3. refiller removes the connection to node2 (metrics::total_connections -= 1)
4. in the test, we read get_metrics().total_connections < initial_connections - we go out of the loop
5. refiller tries to open a connection again (metrics::total_connections += 1)
6. we read get_metrics().total_connections, and expect total_connections to be
less than initial_connections - but it is not.

This is why, to combat this, I adjusted the test so the same metrics snapshot
is used to leave the loop and make an assertion. In this case, the aforementioned
"unlucky" scenario will not happen.
To be quite honest, I thought we'll be able to enable more tests, but:

1. StatsConnections
Requires `cass_cluster_set_num_threads_io`, `cass_cluster_set_core_connections_per_host`
and `cass_cluster_set_constant_reconnect`.

2. ErrorsConnectionTimeouts
Requires `cass_cluster_set_core_connections_per_host`.

3. SpeculativeExecutionRequests
Requires `cass_session_get_speculative_execution_metrics`.

4. Requests
This one is interesting. It turns out that cpp-driver stores latency
stats as microseconds, while rust-driver stores them as milliseconds.

Because of that, the mean and median latency is rounded to 0 (at least for my machine).
The test expects them to be greater than 0, which makes sense assuming
the driver collects the stats with microsecond precision.

I'm not sure how to address this one. Is there any way to force
higher >=1ms latencies in the test?
Without this, valgrind complains about access to uninitialized memory.
Why does this test not work without adjustments? Well, this is because
rust-driver collects latencies with millisecond graunularity. In result,
most of the latencies during the tests in local setup are rounded to 0ms.

This is why, we somehow need to simulate higher latencies during the test.
There is one piece of code that user controls and is executed in rust-driver
in between start and end time measurements - namely `HistoryListener::log_attempt_start`.

This is where we can add a sleep to simulate higher latencies in local setup.
And so, I implemented `SleepingHistoryListener` that does just that. In addition,
I implemented the testing API to set such listener on the statement.

The "Requests" test is adjusted accordingly, and enabled.

Note: Since all latencies during the test in local setup are now expected to be around 1ms,
I loosened the stddev check to be `>= 0` instead of `> 0`.
@muzarski
Copy link
Contributor Author

Rebased on master (sped up ci)

@muzarski muzarski mentioned this pull request Apr 24, 2025
5 tasks
@muzarski muzarski requested a review from wprzytula April 25, 2025 15:33
@muzarski
Copy link
Contributor Author

@wprzytula I addressed all of your commits. There was nothing to change.

@muzarski muzarski merged commit 3c28a13 into scylladb:master Apr 28, 2025
12 checks passed
@muzarski muzarski deleted the metrics branch April 28, 2025 10:56
@wprzytula wprzytula mentioned this pull request Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P1 P1 priority item - very important

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement metrics on cpp-rust-driver side (cass_session_get_metrics)

4 participants