-
Notifications
You must be signed in to change notification settings - Fork 21
CNDB-15260: Add SAI-specific execution info to slow query logger #2044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Checklist before you submit for review
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did first round of review and left some questions and suggestions, mostly trivial.
I am wondering how good of an idea is to have to restart a node to disable it if we sense too much noise or performance degradation once enabled?
Sonar didn't run but we can check it with the next CI run, it didn't prevent me from doing initial review.
src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/plan/QueryMonitorableDetails.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java
Outdated
Show resolved
Hide resolved
test/distributed/org/apache/cassandra/distributed/test/sai/SlowSAIQueryLoggerTest.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
Outdated
Show resolved
Hide resolved
| public Index.Searcher indexSearcher() | ||
| { | ||
| return indexQueryPlan == null ? null : indexQueryPlan.searcherFor(this); | ||
| if (indexQueryPlan == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not see tests that show that things look ok if we have queries involving non-sai indexes. Can you add some, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, and also for regular queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems solar is still complaining on the test coverage for this one and the next check. Can you take a look, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added an artificial test for the sake of increasing coverage from the current 96.6%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why artificial? I see queries and real test case?
for the sake of increasing coverage from the current 96.6%.
This was not about the number at all. There are valid cases as unreachable checks which would exhibit a bug, but this wasn't one of them. Thanks for the added test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw by looking into the new test - this was probably just missing queries in the already existing test which would have hit those paths. As this is about testing the slow query logger, they should be checking probably whether we log or not, as expected.
d742940 to
5b0a230
Compare
We don't need to restart the node with the current approach. The system property that toggles logging SAI execution details is read every time we detect a slow query, and the system property can be changed at any time. Relevant bits here: cassandra/src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java Lines 205 to 210 in 5c33fd2
|
5c33fd2 to
13bf6ad
Compare
ekaterinadimitrova2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are almost ready.
A few small comments plus complaint on test coverage from the tool. I feel it is worth to check it as even before the tool I mentioned we have to test those if clauses. Thanks
The test failures seem to be known to me flakies not related to what we do here.
About the below comment
I am wondering how good of an idea is to have to restart a node to disable it if we sense too much noise or performance degradation once enabled?
We don't need to restart the node with the current approach. The system property that toggles logging SAI execution details is read every time we detect a slow query, and the system property can be changed at any time. Relevant bits here:
I am clear how we check it to decide whether to log or not, but how do you set that property in runtime? I don't see any setter or anything? What did I miss?
src/java/org/apache/cassandra/db/monitoring/MonitoringTask.java
Outdated
Show resolved
Hide resolved
test/distributed/org/apache/cassandra/distributed/test/sai/SlowSAIQueryLoggerTest.java
Show resolved
Hide resolved
| public Index.Searcher indexSearcher() | ||
| { | ||
| return indexQueryPlan == null ? null : indexQueryPlan.searcherFor(this); | ||
| if (indexQueryPlan == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems solar is still complaining on the test coverage for this one and the next check. Can you take a look, please?
You didn't miss anything, I did. I have added a new mbean just for SAI and added the new config property there. Hopefully we can incrementally keep adding other similar past and future properties to it. I have named it |
|
After this Slack discussion, I'm removing the SAI mbean since we have other ways to manipulate |
b47f01b to
b271894
Compare
36077f6 to
cb8ed72
Compare
ekaterinadimitrova2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems very neat with the supplier but I do not think volatile is enough now for it. Please check the comment.
I am surprised Sonar complaints about code test coverage for the empty execution info, didn't we have it covered already on the previous run?
About the other issue flagged by Sonar - instance can definitely not be final in MonitoringTask.
|
|
||
| @VisibleForTesting | ||
| static MonitoringTask instance = make(REPORT_INTERVAL_MS, MAX_OPERATIONS); | ||
| public static MonitoringTask instance = make(Math.max(0, CassandraRelevantProperties.SLOW_QUERY_LOG_MONITORING_REPORT_INTERVAL_IN_MS.getInt()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there is unused import now
| @Nullable | ||
| protected final Index.QueryPlan indexQueryPlan; | ||
|
|
||
| private volatile Supplier<ExecutionInfo> executionInfoSupplier = ExecutionInfo.EMPTY_SUPPLIER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The supplier is neat, but now volatile is not enough. Also pointed by Sonar. On a quick look, probably just adding AtomicReference should be probably enough? The fields in QueryMonitorableExecutionInfo are final.
I have also read that generic warning in Sonar, but wouldn't |
Agreed, volatile should be enough actually |
|
Let's check the test failures on the other PR and I can approve this one. |
ba65451 to
8445e70
Compare
|
Thanks for the feedback. I have removed that unused import and rebased without conflicts. I don't see related failures in CI in either this PR or the one for CNDB. |
Just to confirm - after #15280 slow query logging will always redact user data? I did not see any parameters added in |
Yes, #15280 will always redact the query user data printed to logs. When either this issue or #15280 get merged, the other will need an update, whichever gets second. I thought #15280 was going to be first, but it is taking longer than initially expected. I understand that while #15280 is not merged, redaction doesn't exist for this ticket. The tests added here will fail when redaction is ready, and I have just modified the one in the CNDB PR so it will also fail with redaction, so we don't miss it. Note by the way that here we are not printing any additional user data to the logs, since the slow queries were already being logged, the metrics don't contain user data, and the plan contains the same user data as the query (#15280 also redacts the plan). |
|
@ekaterinadimitrova2 I have added the requested JMH microbenchmark to measure the impact of the extended logging. Here are the results: We can see that the generation of the extended SAI Producing the larger log messages adds some ~20 microseconds to the previous ~10 microseconds per query, making them ~3x slower. However, the cost of the slow query logging is ~4 orders of magnitude lower than the faster of the slow queries, which take at least 500ms by default. In fact, the impact of the extended logging is lower than the error of the benchmark. So, as it was expected, the extended logging dilutes into almost nothing when compared to the cost of the reported slow queries themselves. |
ekaterinadimitrova2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the benchmark test, LGTM
...bench/org/apache/cassandra/test/microbench/index/sai/QueryMonitorableExecutionInfoBench.java
Outdated
Show resolved
Hide resolved
Adds the metrics in IndexContext and the tree view of the query's Plan to the log messages produced for slow SAI queries. There is a new system property named cassandra.sai.slow_query_log.execution_info_enabled to toggle this feature, which is enabled by default. This property can be changed dynamically at any time, without a restart. When SAI-specific information is added to the slow query logger, the log messages for slow queries can take up almost 10x the (uncompressed) disk space. In the worst of the worst cases, with all queries being permanently slower than 500ms, and at least 50 different queries, reporting every 5 seconds (the default), the slow query logger can produce around 120MB of text per day and instance. With the detailed logging, it would be 1.14GB per day. # Conflicts: # src/java/org/apache/cassandra/db/monitoring/MonitoringTask.java # src/java/org/apache/cassandra/index/sai/plan/Plan.java
92241c9 to
0897f42
Compare
|
I have rebased this to include the redaction of logged SELECT queries introduced by CNDB-15280, and updated |
|
✔️ Build ds-cassandra-pr-gate/PR-2044 approved by ButlerApproved by Butler |
ekaterinadimitrova2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Adds the metrics in IndexContext and the tree view of the query's Plan to the log messages produced for slow SAI queries. There is a new system property named cassandra.sai.slow_query_log.execution_info_enabled to toggle this feature, which is enabled by default. This property can be changed dynamically at any time, without a restart. When SAI-specific information is added to the slow query logger, the log messages for slow queries can take up almost 10x the (uncompressed) disk space. In the worst of the worst cases, with all queries being permanently slower than 500ms, and at least 50 different queries, reporting every 5 seconds (the default), the slow query logger can produce around 120MB of text per day and instance. With the detailed logging, it would be 1.14GB per day.
… logger (#2044) Adds the metrics in IndexContext and the tree view of the query's Plan to the log messages produced for slow SAI queries. There is a new system property named cassandra.sai.slow_query_log.execution_info_enabled to toggle this feature, which is enabled by default. This property can be changed dynamically at any time, without a restart. When SAI-specific information is added to the slow query logger, the log messages for slow queries can take up almost 10x the (uncompressed) disk space. In the worst of the worst cases, with all queries being permanently slower than 500ms, and at least 50 different queries, reporting every 5 seconds (the default), the slow query logger can produce around 120MB of text per day and instance. With the detailed logging, it would be 1.14GB per day.



What is the issue
The slow query logger is one of the most useful tools when dealing with performance-related incidents. Currently it simply prints the CQL representation of the slow query and its running time. So while it can allow us to identify problematic SAI queries, we have to go to the aggregated metrics where it's difficult to see what matches the problematic query.
I think it would be useful if the slow query logger also printed the SAI query metrics and plan. That way we could examine log records at the time of the incident to see what's going on with individual slow queries, without needing to extract conclusions from aggregated metrics.
What does this PR fix and why was it fixed
This PR adds SAI-specific information to the log messages produced when slow queries are detected. That information consists on the metrics in the
QueryContextand the tree view of the queryPlan. There is acassandra.cassandra.sai.slow_query_log.details_enabledproperty to disable this extended logging.This is how slow queries are currently logged:
And this is how they would look like with SAI details:
Note that the queries are printed as invalid CQL and include user data. That is being dealt with in CNDB-15280, which is under review.