HBASE-29090: Add server-side load metrics to client results #6623

hgromer · 2025-01-21T19:39:45Z

No description provided.

hgromer · 2025-01-21T20:10:56Z

krconv

This will be very useful! Just some nitpicks, feel free to ignore

krconv · 2025-01-21T22:41:35Z

hbase-protocol-shaded/src/main/protobuf/client/Client.proto

  }
  optional ReadType readType = 23 [default = DEFAULT];
  optional bool need_cursor_result = 24 [default = false];
+  optional bool resultMetricsEnabled = 25 [default = false];


Nit; enable_result_metrics matches the existing fields better (and same note for other protobuf changes)

Suggested change

optional bool resultMetricsEnabled = 25 [default = false];

optional bool enable_result_metrics = 25 [default = false];

Updated. This file has both camel and snake case littered everywhere; I should probably avoid adding more inconsistencies.

krconv · 2025-01-21T23:00:41Z

hbase-protocol-shaded/src/main/protobuf/client/Client.proto

+* Statistics about the Result's server-side metrics
+*/
+message ResultMetrics {
+  required uint64 blockBytesScanned = 1;


Not sure what the standard is for HBase, but I know that optional values with a default are more future proof than required fields, and I think that would be better

wasn't familiar with the move away from required fields, I'm happy to make this optional.

krconv · 2025-01-21T23:02:29Z

hbase-client/src/main/java/org/apache/hadoop/hbase/shaded/protobuf/ProtobufUtil.java

    builder.setStale(result.isStale());
    builder.setPartial(result.mayHaveMoreCellsInRow());

+    if (result.getMetrics() != null) {


Seems like the result metrics will get lost for exists calls; thoughts on handling that more explicitly? Maybe throwing an error if someone tries to enable metrics on a Get with checkExistenceOnly == true

Throwing an error might be a bit aggressive, though I'd be curious to know what others think. I'm not sure if we're happy throwing an unchecked exception from setCheckExistenceOnly and if we wanted to throw an IOException, we'd have to change the signature.

Maybe it's enough that the metrics returned will be null
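
If null metrics end up being the contract for existence-only Gets, callers would need to read them defensively. A minimal sketch of that caller-side handling follows; `SketchResult`, `SketchQueryMetrics`, and `NullSafeMetrics` are hypothetical stand-ins written for illustration, not the real HBase client classes.

```java
import java.util.Optional;

// Hypothetical stand-in for the metrics object discussed above; not the real HBase class.
final class SketchQueryMetrics {
  private final long blockBytesScanned;

  SketchQueryMetrics(long blockBytesScanned) {
    this.blockBytesScanned = blockBytesScanned;
  }

  long getBlockBytesScanned() {
    return blockBytesScanned;
  }
}

// Hypothetical stand-in for a client Result that may or may not carry metrics.
final class SketchResult {
  private final SketchQueryMetrics metrics; // null when no metrics came back

  SketchResult(SketchQueryMetrics metrics) {
    this.metrics = metrics;
  }

  SketchQueryMetrics getMetrics() {
    return metrics;
  }
}

final class NullSafeMetrics {
  // Returns the scanned byte count, or the fallback when the result has no metrics.
  static long blockBytesScannedOr(SketchResult result, long fallback) {
    return Optional.ofNullable(result.getMetrics())
      .map(SketchQueryMetrics::getBlockBytesScanned)
      .orElse(fallback);
  }
}
```

The fallback value makes the "no metrics" case explicit at the call site instead of surfacing as a NullPointerException.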

krconv · 2025-01-21T23:04:27Z

hbase-client/src/main/java/org/apache/hadoop/hbase/client/Get.java

    this.setFilter(get.getFilter());
    this.setReplicaId(get.getReplicaId());
    this.setConsistency(get.getConsistency());
+    this.setResultMetricsEnabled(get.isResultMetricsEnabled());


Nit; Thoughts on the name QueryMetrics? I think that ResultMetrics could be interpreted as "metrics about the result" instead of "metrics about the query operation"

Query metrics makes sense, I mostly just wanted a way to distinguish between the existing scan metrics, and these new metrics

krconv · 2025-01-21T23:13:37Z

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java

            boolean mayHaveMoreCellsInRow = scannerContext.mayHaveMoreCellsInRow();
            Result r = Result.create(values, null, stale, mayHaveMoreCellsInRow);
+
+            if (request.getScan().getResultMetricsEnabled()) {


What's the reason for not directly setting it on the result for scans?

good question, scans are a bit tricky just due to how the results are constructed on the client side. you can take a look at how the response converter builds these responses. They're created from the cells, so we need to send the metrics separately through the wire in this case and hydrate them client-side.
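
To make the "send separately, hydrate client-side" idea concrete, here is a rough sketch. It assumes the metrics arrive as a list that runs parallel to the per-result cell groups; that layout, and the names `HydratedResult` and `ScanHydration`, are assumptions for illustration rather than the actual response converter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a client-side result; not the real HBase Result class.
final class HydratedResult {
  final List<String> cells;
  Long blockBytesScanned; // stays null until metrics are attached

  HydratedResult(List<String> cells) {
    this.cells = cells;
  }
}

final class ScanHydration {
  // Rebuilds results from cell groups, then attaches the i-th metric to the i-th result.
  static List<HydratedResult> hydrate(List<List<String>> cellGroups, List<Long> metrics) {
    List<HydratedResult> results = new ArrayList<>();
    for (int i = 0; i < cellGroups.size(); i++) {
      HydratedResult r = new HydratedResult(cellGroups.get(i));
      // Assumed convention: a shorter (or empty) metrics list means metrics were not sent.
      if (i < metrics.size()) {
        r.blockBytesScanned = metrics.get(i);
      }
      results.add(r);
    }
    return results;
  }
}
```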

hgromer · 2025-01-22T17:38:19Z

I'm actively working through the unit test failures that come up

hgromer · 2025-01-22T22:55:16Z

hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestOnlineLogRecord.java

    scan.withStartRow(Bytes.toBytes(123));
    scan.withStopRow(Bytes.toBytes(456));
-    String expectedOutput =
-      "{\n" + "  \"startTime\": 1,\n" + "  \"processingTime\": 2,\n" + "  \"queueTime\": 3,\n"


it seems like adding a field completely changed the order of the serialized fields. all I did here was add the queryMetricsEnabled field

This assertion is dumb and could be replaced by literally any modern library for asserts on JSON blobs.
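
For illustration, the underlying idea is to compare parsed structures instead of serialized strings, so field order stops mattering. The sketch below assumes the JSON has already been parsed into maps by whatever library the test uses; `OrderInsensitiveCompare` is a hypothetical helper, not something in the HBase test code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OrderInsensitiveCompare {
  // Builds a map whose iteration order follows the argument order,
  // mimicking the field order of a serialized JSON object.
  static Map<String, Object> ordered(Object... keyValues) {
    Map<String, Object> m = new LinkedHashMap<>();
    for (int i = 0; i < keyValues.length; i += 2) {
      m.put((String) keyValues[i], keyValues[i + 1]);
    }
    return m;
  }

  // Map.equals compares entries, not iteration order, so reordered fields still match.
  static boolean sameFields(Map<String, Object> expected, Map<String, Object> actual) {
    return expected.equals(actual);
  }
}
```

With a comparison like this, adding a field changes the assertion only where the expected map is built, not in every character of an expected string.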

ndimiduk

So high-level question: why is this a new QueryMetrics instead of expanding use of the existing ScanMetrics to all query types? That class even includes the comment,

* Some of these metrics are general for any client operation such as put However, there is no need
* for this. So they are defined under scan operation for now.

hgromer · 2025-02-11T16:28:54Z

So high-level question: why is this a new QueryMetrics instead of expanding use of the existing ScanMetrics to all query types? That class even includes the comment,
* Some of these metrics are general for any client operation such as put However, there is no need
* for this. So they are defined under scan operation for now.

My main reasoning was that some metrics exclusively make sense for scans. I thought it'd be confusing to include those metrics in Result objects that were fetched via Get or MultiGet.

Another reason for keeping them separate was to make it easier to expand one and not the other in the future. I worry that having one large object will make things harder to extend.

ndimiduk

Your changes are clean and precise, I have little to say about the execution. However as far as the object model here, I think that you're sitting between two different implementations.

If the plan is to only include a QueryMetrics for Results produced by Gets, I think that you can reduce the scope of this class. Make it a GetMetrics and have it be only applicable to Get-based queries. Push its accessors down to the Get object, remove references from Query and Scan. This approach might become weird because of how Gets are implemented as Scans internally, or maybe it'll be fine -- that's essentially what you have already done here.

On the other hand, you can keep it as a QueryMetrics and also gather this data for Result instances produced by a Scan. Then Scan users can opt-into this metric so as to obtain fine-grained metrics over the internals of the query. You probably also need to update the object model of ScanMetrics, because if a Scan is a Query, then a ScanMetrics should be a QueryMetrics. You'll also need to think about whether enabling QueryMetrics also enables ScanMetrics and vice-versa... I expect that there would be a configuration state that enables only ScanMetrics, mimicking the behavior existing today.

Whichever path you choose, you also need to wire this new metrics thing into other parts of the system. Inspecting where ScanMetrics are used will guide you -- MapReduce and PerformanceEvaluation tool(s) should also expose this new feature.

Personally, I think that the ambition of a QueryMetrics is a good one and it maintains the feature parity for users who prefer schemas built around Scans vs. built around Multi-Gets.

ndimiduk · 2025-02-26T11:18:58Z

hbase-client/src/main/java/org/apache/hadoop/hbase/client/QueryMetrics.java

+import org.apache.yetus.audience.InterfaceAudience;
+
+@InterfaceAudience.Public
+public class QueryMetrics {


Also decorate this class with IS.Unstable or Evolving. Whichever you choose, also place the same annotation on the public accessor methods on all the IA.Public classes. Let's give ourselves a chance to explore this feature and make changes faster than once every 5 years.

Why not use the existing ServerSideMetric class? It has loose typing for metrics via a map and protobuf logic is already there. The only advantage for having QueryMetrics is that the metrics as fields will be more efficient than a map, but it can be improved by defining predefined metric "schemas" for ServerSideMetric. E.g., you can have "QueryMetrics" as one of the schemas and it would predefine a fixed size counters map with "blockBytesScanned" as the only metric.

ServerSideMetric currently has a fixed schema of counters, but this can be changed such that a different schema can be specified.

We at Salesforce have a similar requirement to track scan metrics at a higher granularity of region level. Currently scan metrics for all the regions scanned are combined together as they are available and so we lose the granularity. Thus, we have a proposal to maintain the per region metrics along with capturing region name and RS where the region lies. A new method will be exposed to retrieve the metrics with granularity, but the existing method will behave the same as it combines all the per region metrics into one structure. We are reusing the existing ScanMetrics class due to following reasons:

Our use case involves doing scans and it already has support for this class, so this is an incremental change on top of it.

We are not changing the type of metrics, but simply preserving the granularity. So, we are capturing list of ScanMetric objects, one per each region.

This allows leveraging the existing integration of scan metrics on server and client side, such as protobuf serialization.

Why not use the existing ServerSideMetric class? It has loose typing for metrics via a map and protobuf logic is already there.

I'm hesitant to do this. What's the value in obfuscating the actual metrics themselves by using a generic map when we can efficiently and more clearly represent whichever fields are being served to the user? In my opinion, this is also less error-prone, as the fields are clearly stated, and it's harder to remove or modify something unintentionally.

Thus, we have a proposal to maintain the per region metrics along with capturing region name and RS where the region lies

I talked a little bit about why I opted for creating a new class, rather than re-using an existing one here.

I worry about coupling the new metrics and Scans with this shared metrics class which could make it hard to iterate in the future.

It makes sense to re-use the ScanMetrics for more granular scan metrics, but I think it makes sense to create a new metric type for metrics that will back both Get(s) and Scans. Additionally, the granularity is so coarse that we'll be adding bloat to the metric by requiring fields such as countOfRegions, which wouldn't apply to this new type of metric I'm proposing.

We'll still need to introduce this new pattern to enable metrics at the granularity that we want (per-result). So I don't think re-using a base class, and implementing an inheritance hierarchy avoids this. I'm not sure I see the benefit of avoiding a new protobuf definition, or avoiding a new converter. Both are pretty trivial to create, and allow us to keep these concepts (result metrics, vs aggregated scan metrics) separate. Perhaps there's some value in a base metrics class that simply stores counters, but it requires a bit of refactoring and touches existing code.

For the sake of discussion, I took a stab at what it would look like if we did create this inheritance hierarchy. We can look at this commit to compare both implementations. I'm not sure we get much value from it, but am happy to continue discussing.

👍 Agreed with @hgromer. I think these are two separate things, and there's not a lot of value in shoehorning inheritance. Wishy washy inheritance feels like something we'll wish we hadn't done two or three iterations down the road

We'll still need to introduce this new pattern to enable metrics at the granularity that we want (per-result).

I am not sure if I am missing the point, but of course, how else will you get instrumentation other than actually using the class?

So I don't think re-using a base class, and implementing an inheritance hierarchy avoids this.

The suggestion was about reusing the existing pattern by making it more generic, instead of introducing a new class. If we keep adding new standalone classes for every new scenario, it can quickly get confusing. Also, the existing counter map based approach is easier to introspect and process. Just my 2¢.

For the sake of discussion, I took a stab at what it would look like if we did create this inheritance hierarchy. We can look at this commit to compare both implementations

Looks good, but I would rename ServerSideMetricsCounter as just MetricsCounter (or MetricsBase) as there is nothing server specific in it. Also, it is not the right place to have those public static constants as it is independent of any specific metric.

I feel that the original implementation is still far cleaner. We need to add a QueryMetric class regardless; the only thing we save on is an additional protobuf, but I don't think that's a huge benefit. I think generic counters make sense when the schema or shape of the data is abstract, or unknown (such as Configuration), but I think it's neater to leverage explicit typings whenever possible.

I've reverted back
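
The trade-off argued over in this thread can be sketched side by side. Both classes below are hypothetical illustrations (`TypedQueryMetrics` and `CounterMapMetrics` are not the classes in the PR or the alternate commit); they only show how a typed field and a name-keyed counter map differ for the caller.

```java
import java.util.HashMap;
import java.util.Map;

// Typed variant: each metric is a named field, so a misspelled accessor fails to compile.
final class TypedQueryMetrics {
  private final long blockBytesScanned;

  TypedQueryMetrics(long blockBytesScanned) {
    this.blockBytesScanned = blockBytesScanned;
  }

  long getBlockBytesScanned() {
    return blockBytesScanned;
  }
}

// Map-backed variant: new metrics need no new fields and are easy to iterate over,
// but names are only checked at runtime.
final class CounterMapMetrics {
  private final Map<String, Long> counters = new HashMap<>();

  // Adds delta to the named counter, creating the counter on first use.
  void add(String name, long delta) {
    counters.merge(name, delta, Long::sum);
  }

  // Returns null for an unknown (or misspelled) counter name.
  Long get(String name) {
    return counters.get(name);
  }
}
```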

ndimiduk · 2025-02-26T11:43:14Z

hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestOnlineLogRecord.java

    scan.withStartRow(Bytes.toBytes(123));
    scan.withStopRow(Bytes.toBytes(456));
-    String expectedOutput =
-      "{\n" + "  \"startTime\": 1,\n" + "  \"processingTime\": 2,\n" + "  \"queueTime\": 3,\n"


This assertion is dumb and could be replaced by literally any modern library for asserts on JSON blobs.

ndimiduk · 2025-02-26T11:47:58Z

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java

+    Result r =
+      Result.create(results, get.isCheckExistenceOnly() ? !results.isEmpty() : null, stale);
+    if (get.isQueryMetricsEnabled()) {
+      long blockBytesScanned = context.getBlockBytesScanned() - blockBytesScannedBefore;


it would be cleaner if we pushed the blockBytesScanned counter into the RegionScanner. It could provide an accessor that is updated with each call to next(). This maintenance of a subtractive counter seems odd.

That's probably outside of scope of this PR.
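
For readers unfamiliar with the pattern being questioned, the subtractive counter amounts to snapshotting a shared, ever-growing counter before an operation and subtracting afterwards. A minimal sketch, where `SketchCallContext` and `DeltaCounter` are hypothetical stand-ins and not the real RPC call context:

```java
// Hypothetical stand-in for a per-call context that accumulates bytes scanned.
final class SketchCallContext {
  private long blockBytesScanned;

  void incrementBlockBytesScanned(long bytes) {
    blockBytesScanned += bytes;
  }

  long getBlockBytesScanned() {
    return blockBytesScanned;
  }
}

final class DeltaCounter {
  // Runs one operation and returns how many bytes it added to the shared counter.
  static long measure(SketchCallContext context, Runnable operation) {
    long before = context.getBlockBytesScanned();
    operation.run();
    return context.getBlockBytesScanned() - before;
  }
}
```

The delta is only attributable to the operation if nothing else increments the same context while it runs, which is part of why a scanner-owned counter could be cleaner.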

ndimiduk · 2025-02-26T11:53:02Z

Requesting @Apache9 as he's spent a lot of time around the client and I think was involved in the improvements that introduced scanner metrics.

hgromer · 2025-02-26T13:59:47Z

If the plan is to only include a QueryMetrics for Results produced by Gets, I think that you can reduce the scope of this class. Make it a GetMetrics and have it be only applicable to Get-based queries. Push its accessors down to the Get object, remove references from Query and Scan. This approach might become weird because of how Gets are implemented as Scans internally, or maybe it'll be fine -- that's essentially what you have already done here.

I think it's nice to implement this behavior for all queries, which is what I've done in this PR. I think it'd be weird to have

You probably also need to update the object model of ScanMetrics, because if a Scan is a Query, then a ScanMetrics should be a QueryMetrics.

Just thinking this through a little bit, this makes sense b/c it enables us to add metrics at a result level granularity specific to either scans or gets. The only thing that gets tricky here is that you need to do class casting to get the exact class you care about. So for example:

class QueryMetrics {
  public long getBlockBytesScanned();
}

class GetMetrics extends QueryMetrics {
  public long getGetSpecificValue();
}

class Result {
  public QueryMetrics getQueryMetrics();
}

Then, to access the metric

class MyDriver {
  public static void main(String[] args) {
    Get get = new Get(myRk).setQueryMetricsEnabled(true);
    Result r = table.get(get);

    // Downcast to the Get-specific subtype to reach the Get-only accessor.
    GetMetrics getMetrics = (GetMetrics) r.getQueryMetrics();
    long getSpecificValue = getMetrics.getGetSpecificValue();
  }
}

but maybe that's the price to pay for the added flexibility

On the other hand, you can keep it as a QueryMetrics and also gather this data for Result instances produced by a Scan. Then Scan users can opt-into this metric so as to obtain fine-grained metrics over the internals of the query.

You'll also need to think about whether enabling QueryMetrics also enables ScanMetrics and vice-versa... I expect that there would be a configuration state that enables only ScanMetrics, mimicking the behavior existing today.

This PR already opts for keeping the default behavior. The finer-grained query metrics are only returned if setQueryMetricsEnabled is set to true. However, that is distinct from setScanMetricsEnabled, which enables the existing scan metrics. This allows the user to enable each set of metrics independently of one another, which I thought would be best.

Whichever path you choose, you also need to wire this new metrics thing into other parts of the system. Inspecting where ScanMetrics are used will guide you -- MapReduce and PerformanceEvaluation tool(s) should also expose this new feature.

Sounds good, will take a deeper look here

ndimiduk · 2025-04-17T14:30:46Z

Sorry, I've been away for a while.

The only thing that gets tricky here is that you need to do class casting to get the exact class you care about. So for example ...

I was thinking about whether this could be made easier with generics. I think the answer is yes, but we really don't want to make Result generic over QueryMetrics and we don't have enough Get- or Scan-specific stuff to justify specialising to a GetResult extends Result and ScanResult extends Result.

So I think the down-casting approach is fine. Maybe there's some helper method that would make it a little safer? I.e., push the cast into methods Optional<GetMetrics> getGetMetrics() and Optional<ScanMetrics> getScanMetrics(). Or a single method with templates, like

class Result {
  QueryMetrics getQueryMetricsInternal() { ... }

  @SuppressWarnings("unchecked")
  <T extends QueryMetrics> Optional<T> getQueryMetrics() {
    return Optional.ofNullable((T) getQueryMetricsInternal());
  }
}

Result r = table.get(get);
Optional<GetMetrics> getMetrics = r.<GetMetrics>getQueryMetrics();
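
One caveat with the generic-method idea: because of type erasure, an unchecked `(T)` cast does not fail at the cast site, so a wrong type argument would only surface later at the point of use. A variant that takes a class token can check at runtime instead. The sketch below is one possible shape under that assumption; `BaseQueryMetrics`, `GetOnlyMetrics`, `ScanOnlyMetrics`, and `TypedResult` are hypothetical names, not proposed API.

```java
import java.util.Optional;

// Hypothetical base type carrying the metric shared by all query kinds.
class BaseQueryMetrics {
  private final long blockBytesScanned;

  BaseQueryMetrics(long blockBytesScanned) {
    this.blockBytesScanned = blockBytesScanned;
  }

  long getBlockBytesScanned() {
    return blockBytesScanned;
  }
}

// Hypothetical Get-specific subtype with one extra value.
final class GetOnlyMetrics extends BaseQueryMetrics {
  private final long getSpecificValue;

  GetOnlyMetrics(long blockBytesScanned, long getSpecificValue) {
    super(blockBytesScanned);
    this.getSpecificValue = getSpecificValue;
  }

  long getGetSpecificValue() {
    return getSpecificValue;
  }
}

// Hypothetical Scan-specific subtype, present only to show a mismatched request.
final class ScanOnlyMetrics extends BaseQueryMetrics {
  ScanOnlyMetrics(long blockBytesScanned) {
    super(blockBytesScanned);
  }
}

final class TypedResult {
  private final BaseQueryMetrics metrics; // null when metrics were not enabled

  TypedResult(BaseQueryMetrics metrics) {
    this.metrics = metrics;
  }

  // Checked narrowing: empty when metrics are absent or of a different subtype.
  <T extends BaseQueryMetrics> Optional<T> getQueryMetrics(Class<T> type) {
    return Optional.ofNullable(metrics).filter(type::isInstance).map(type::cast);
  }
}
```

A caller would then write something like `result.getQueryMetrics(GetOnlyMetrics.class)` and handle the empty case explicitly.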

ndimiduk · 2025-04-22T08:44:36Z

I don't think it's a requirement that we have strong typing of the Metrics object. I think it would be okay to have a single type that includes all the possible accessors. The ones that are not applicable for a query type return null, and the javadoc makes this clear.

hgromer · 2025-04-22T11:27:09Z

I don't think it's a requirement that we have strong typing of the Metrics object. I think it would be okay to have a single type that includes all the possible accessors. The ones that are not applicable for a query type return null, and the javadoc makes this clear.

Sounds good, I've added some javadocs!
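
A rough sketch of what the single-type approach could look like, with an accessor that returns null when it does not apply to the query type. `UnifiedMetrics` and its `scanOnlyValue` field are hypothetical and exist only to illustrate the nullable-accessor convention; the real class may carry different fields.

```java
// Hypothetical single metrics type covering every query kind.
final class UnifiedMetrics {
  private final long blockBytesScanned;
  private final Long scanOnlyValue; // hypothetical scan-only metric; null for Gets

  private UnifiedMetrics(long blockBytesScanned, Long scanOnlyValue) {
    this.blockBytesScanned = blockBytesScanned;
    this.scanOnlyValue = scanOnlyValue;
  }

  // Factory for Get-produced results: the scan-only value is left unset.
  static UnifiedMetrics forGet(long blockBytesScanned) {
    return new UnifiedMetrics(blockBytesScanned, null);
  }

  // Factory for Scan-produced results: both values are populated.
  static UnifiedMetrics forScan(long blockBytesScanned, long scanOnlyValue) {
    return new UnifiedMetrics(blockBytesScanned, scanOnlyValue);
  }

  /** @return block bytes scanned while serving the query; applies to every query type */
  long getBlockBytesScanned() {
    return blockBytesScanned;
  }

  /** @return the scan-only value, or null when the query was not a Scan */
  Long getScanOnlyValue() {
    return scanOnlyValue;
  }
}
```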

ndimiduk · 2025-04-22T14:29:00Z

hbase-client/src/main/java/org/apache/hadoop/hbase/client/RawAsyncTableImpl.java

          .action((controller, loc, stub) -> RawAsyncTableImpl.mutate(controller, loc, stub, put,
            (rn, p) -> RequestConverter.buildMutateRequest(rn, row, family, qualifier, op, value,
-              null, timeRange, p, HConstants.NO_NONCE, HConstants.NO_NONCE),
+              null, timeRange, p, HConstants.NO_NONCE, HConstants.NO_NONCE, false),


A carrot to encourage folks to move on :)

ndimiduk · 2025-04-23T10:34:56Z

Heya @hgromer the spotless nit looks legitimate. There have been some upstream breakages lately, so to be sure, please rebase and then run spotless:apply one more time. Thanks.

hgromer · 2025-04-23T12:49:22Z

Heya @hgromer the spotless nit looks legitimate. There have been some upstream breakages lately, so to be sure, please rebase and then run spotless:apply one more time. Thanks.

Just rebased and ran spotless, thanks for the shout

Apache-HBase · 2025-04-23T13:43:11Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 29s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	detsecrets	0m 0s		detect-secrets was not available.
+0 🆗	buf	0m 0s		buf was not available.
+0 🆗	buf	0m 0s		buf was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	hbaseanti	0m 0s		Patch does not have any anti-patterns.
			_ master Compile Tests _
+0 🆗	mvndep	0m 35s		Maven dependency ordering for branch
+1 💚	mvninstall	3m 26s		master passed
+1 💚	compile	4m 29s		master passed
+1 💚	checkstyle	0m 59s		master passed
+1 💚	spotbugs	4m 31s		master passed
+1 💚	spotless	0m 45s		branch has no errors when running spotless:check.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 11s		Maven dependency ordering for patch
+1 💚	mvninstall	3m 1s		the patch passed
+1 💚	compile	4m 23s		the patch passed
+1 💚	cc	4m 23s		the patch passed
+1 💚	javac	4m 23s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 58s		the patch passed
+1 💚	spotbugs	4m 54s		the patch passed
+1 💚	hadoopcheck	11m 55s		Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚	hbaseprotoc	1m 32s		the patch passed
+1 💚	spotless	0m 43s		patch has no errors when running spotless:check.
			_ Other Tests _
+1 💚	asflicense	0m 26s		The patch does not generate ASF License warnings.
		51m 8s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6623/12/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#6623
JIRA Issue	HBASE-29090
Optional Tests	dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless cc buflint bufcompat hbaseprotoc
uname	Linux a2106dc0875b 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `61a81a5`
Default Java	Eclipse Adoptium-17.0.11+9
Max. process+thread count	85 (vs. ulimit of 30000)
modules	C: hbase-protocol-shaded hbase-client hbase-server U: .
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6623/12/console
versions	git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Co-authored-by: Hernan Gelaf-Romer <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Signed-off-by: Ray Mattingly <[email protected]>

ndimiduk · 2025-04-23T15:34:35Z

Heya @hgromer can you put up a backport for branch-2? We may need some additional unit test to ensure things are plumbed all the way through. Thanks.

Apache-HBase · 2025-04-23T17:27:34Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 28s		Docker mode activated.
-0 ⚠️	yetus	0m 3s		Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
			_ Prechecks _
			_ master Compile Tests _
+0 🆗	mvndep	0m 35s		Maven dependency ordering for branch
+1 💚	mvninstall	3m 39s		master passed
+1 💚	compile	1m 59s		master passed
+1 💚	javadoc	1m 1s		master passed
+1 💚	shadedjars	6m 10s		branch has no errors when building our shaded downstream artifacts.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 12s		Maven dependency ordering for patch
+1 💚	mvninstall	3m 21s		the patch passed
+1 💚	compile	2m 0s		the patch passed
+1 💚	javac	2m 0s		the patch passed
+1 💚	javadoc	0m 58s		the patch passed
+1 💚	shadedjars	6m 8s		patch has no errors when building our shaded downstream artifacts.
			_ Other Tests _
+1 💚	unit	0m 35s		hbase-protocol-shaded in the patch passed.
+1 💚	unit	1m 42s		hbase-client in the patch passed.
+1 💚	unit	241m 14s		hbase-server in the patch passed.
		275m 34s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6623/12/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR	#6623
JIRA Issue	HBASE-29090
Optional Tests	javac javadoc unit compile shadedjars
uname	Linux d4766ea9b8ee 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `61a81a5`
Default Java	Eclipse Adoptium-17.0.11+9
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6623/12/testReport/
Max. process+thread count	5039 (vs. ulimit of 30000)
modules	C: hbase-protocol-shaded hbase-client hbase-server U: .
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6623/12/console
versions	git=2.34.1 maven=3.9.8
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.
