
Conversation

@discordianfish
Member

Hi,

We would like to use the elasticsearch_exporter in our infrastructure. Since I'm also a Prometheus contributor and have written a few exporters, I thought I'd first do some cleanup and make changes to better follow Prometheus and Go best practices.

Besides some minor (but IMO important) changes, mainly to naming, probably the biggest change is the removal of the node label from the metrics.

Since the exporter returns stats for the local node only, it's not necessary to set this as a label; Prometheus will include the instance label to identify the node. If a custom node name is necessary, it can be specified in the Prometheus target configuration:

http://prometheus.io/docs/operating/configuration/#target-groups-target_group
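
For illustration, a minimal sketch of what that target configuration could look like, with a hypothetical job name, hostname, and exporter port (the docs linked above call these target_groups; newer Prometheus releases rename them to static_configs):

```yaml
scrape_configs:
  - job_name: 'elasticsearch'
    target_groups:
      # Hostname and port are placeholders; run one exporter next to each ES node.
      - targets: ['es1.example.com:9114']
        labels:
          # Optional custom node name, attached to every series scraped from this target.
          node: 'es1'
```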

What do you think?

Member Author

@ewr Are you sure this is a gauge? I couldn't find any documentation about whether this is a gauge or counter. It looks like a counter to me. Same for the filter_cache_evictions below.

Collaborator

I bet you're right. Those do sound like counters.

@ewr
Collaborator

ewr commented Jun 16, 2015

Sounds good to me.

Here's my counter-argument for the node name, though I'm not set one way or the other: the exporter doesn't have to use an Elasticsearch instance on localhost. That's the default argument value, but it isn't required.

In fact, the README reflects my original intent that one exporter doesn't even have to pull just one node... If you hit /_nodes/stats instead of /_nodes/_local/stats, you get the same stats for all nodes in the cluster. I didn't end up implementing that, but it's a possible case for how you might want to run it.

@ewr
Collaborator

ewr commented Jun 16, 2015

For discussion: A while back I had tweaked this exporter to roll up all the time and memory metrics into a time counter and memory gauge. I had apparently never committed this, but just pushed it to get some feedback:

https://github.com/ewr/elasticsearch_exporter/compare/SummaryMetrics

It made for nice graphing compared to having to list out all the different bytes and seconds metrics, but if it's considered a bad pattern I'm not tied to it.

@ewr
Collaborator

ewr commented Jun 16, 2015

And actually #1 is a PR to implement the all nodes functionality I mentioned. Oops.

@discordianfish
Member Author

@ewr In general I would just run one exporter on every ES instance. If you run one exporter per cluster, you somehow need to make it HA, which is more complicated, etc.
I know that @brian-brazil agrees and probably has some even better arguments.

That being said, it could of course support both, but I wouldn't recommend running it with cluster-wide stats, and it makes the code more complicated.

Re: rolling up the metrics, quoting the best practices:

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not.

@ewr
Collaborator

ewr commented Jun 16, 2015

Yeah, graphing sum(elasticsearch_memory) and seeing all the different pieces roll up was neat, as compared to sum(elasticsearch_jvm_heap_used + elasticsearch_segments_used + ... )

I'm not sure I was comprehensively pulling everything, but it did seem to pull together likes with likes.

@ewr
Collaborator

ewr commented Jun 16, 2015

Re: cluster vs node—My gut is to leave it in. I wouldn't recommend it for production, like you said, but the code complexity is really just one level in the JSON parsing and a different URL. For someone just getting their feet wet, it seems a nice feature.

@discordianfish
Member Author

@ewr I don't think that adding up all the times, for example, makes sense. It's all time, but it doesn't mean much when it's all added up, right? Same for the memory. It also shows in the help string you used: 'Roll up of various' kind of shows that it's not a logical thing you can give a single name, right?

Re: cluster vs node: Since they would otherwise duplicate the instance label (which is ugly and might collide), we really should omit them for the 'one exporter per node' use case. That's what would make the code more complicated.

@brian-brazil

If you run one exporter per cluster, you somehow need to make it HA, which is more complicated, etc.

That's correct; the best way to run this type of exporter is one per node, as that keeps the failure domains aligned.

Adding labels such as node and cluster may clash with other labels the user plans on using (for example, they'd be noise in my setup, as I'd be using the standard job and instance labels for that purpose). Unless you've got a type of exporter where there are multiple instances being scraped (which is discouraged), leave the instance and job class of labels up to the Prometheus configuration.

@ewr
Collaborator

ewr commented Jun 16, 2015

Well, there's a finite amount of work time per node, so the higher the cumulative rate goes, the more likely it is that the cluster won't be very responsive. Added up, it's "How much time per second did ES spend working?"

- Use CamelCase for allStats
- Preallocate gauge and counter maps
- Rename Uri to URL

Since the exporter returns stats for the local node only, it's not necessary to set this as a label since prometheus will include the instance label to identify the node. If a custom node name is necessary, this can be specified in the prometheus target configuration:

http://prometheus.io/docs/operating/configuration/#target-groups-target_group
@ewr
Collaborator

ewr commented Jun 16, 2015

For the labels, I get that running the exporter on the same instance and running one exporter per node is preferred, but I'm a little uneasy about removing that support entirely.

For cluster, that's something you never want to report an incorrect value for just because your monitoring config is wrong. ES knows its cluster, and it seems like it should be the authority on it. Plus, I don't have a clean way to convey that through my config (which gets generated from Consul service registrations).

@brian-brazil

Well, there's a finite amount of work time per node, so the higher the cumulative rate goes, the more likely it is that the cluster won't be very responsive. Added up, it's "How much time per second did ES spend working?"

If the time is partitioned over the labels then this could be okay.

Is CPU time spent by the process exposed anywhere? That's http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuTime()

@brian-brazil

For cluster, that's something you never want to report an incorrect value for just because your monitoring config is wrong. ES knows its cluster, and it seems like it should be the authority on it. Plus, I don't have a clean way to convey that through my config (which gets generated from Consul service registrations).

What ES thinks its name is isn't necessarily what its name is within the rest of the system - targets should pretty much never choose what their labels are for monitoring. I have a setup where I have two distinct ES clusters with the same internal name, for example.
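
To make that concrete, a hedged sketch (hostnames, port, and cluster names are made up): the cluster label can come from the Prometheus configuration rather than from ES, so two clusters that share an internal name still end up with distinct labels:

```yaml
scrape_configs:
  - job_name: 'elasticsearch'
    target_groups:
      - targets: ['es-a1:9114', 'es-a2:9114']
        labels:
          cluster: 'search-us'
      - targets: ['es-b1:9114', 'es-b2:9114']
        labels:
          cluster: 'search-eu'  # same internal ES name, distinct monitoring label
```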

@discordianfish
Member Author

What ES thinks its name is isn't necessarily what its name is within the rest of the system - targets should pretty much never choose what their labels are for monitoring. I have a setup where I have two distinct ES clusters with the same internal name, for example.

Good point, we have the same actually.

@discordianfish
Member Author

So I'd be happy with this as it is. We can address 'rolling up' the timer/memory metrics later, as well as having a global cluster view. Keep in mind that isn't working right now anyway; current master also only exposes metrics for the local node.

Collaborator

My error, but: this should be "Filter cache memory"

@ewr
Collaborator

ewr commented Jun 17, 2015

I'm 👍 on all of this except the removal of node and cluster. I understand that they aren't needed in the happy case where you're running everything via best practices, but it complicates or removes two potential modes of operation: scraping the whole cluster and scraping a remote node.

Is the driver here just that we don't want extra labels being generated when they aren't necessary? Couldn't you overwrite or drop them via relabeling anyway?

@brian-brazil

I understand that they aren't needed in the happy case where you're running everything via best practices, but it complicates or removes two potential modes of operation: scraping the whole cluster and scraping a remote node.

Scraping the cluster is a bit more complicated than adding some labels; you'll also need a special configuration on the Prometheus end to deal with the instance label clash.

If you must have the remote node different from the ES node, target discovery does allow controlling the instance label separately from the address that's scraped.
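
As a rough example of that (hostnames and port are hypothetical), the instance label can be set explicitly per target so it identifies the ES node even though the scraped address is the exporter host:

```yaml
scrape_configs:
  - job_name: 'elasticsearch'
    target_groups:
      # The exporter runs on a separate host but monitors es1.
      - targets: ['exporter-host.example.com:9114']
        labels:
          instance: 'es1.example.com'  # overrides the default instance (the scraped address)
```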

Couldn't you overwrite or drop them via relabeling anyway?

You could, with a feature added earlier in the week; however, that feature is intended only as a stopgap until you can fix the exporter.
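
For reference, in current Prometheus versions that kind of stopgap looks roughly like the sketch below, which drops an exporter-supplied node label at scrape time (this uses today's metric_relabel_configs syntax and may not be the exact feature referred to above; host and port are placeholders):

```yaml
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['es1.example.com:9114']
    metric_relabel_configs:
      # Remove the node label from every scraped series before ingestion.
      - action: labeldrop
        regex: 'node'
```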

@ewr ewr merged commit 9329c16 into prometheus-community:master Jun 17, 2015
@ewr
Collaborator

ewr commented Jun 17, 2015

In the interest of pushing the ball forward on this, I've re-added node and cluster labels and merged everything else. I'm fine to continue talking about those, but I didn't want the other bits stuck in that conversation.
