
Conversation

@discordianfish
Member

Hi,

We would like to use the elasticsearch_exporter in our infrastructure. Since I'm also a Prometheus contributor and have written a few exporters, I thought I'd first do some cleanup and make changes to better follow Prometheus and Go best practices.

Besides some minor (but IMO important) changes, mainly to naming, probably the biggest change is the removal of the node label from the metrics.

Since the exporter returns stats for the local node only, it's not necessary to set this as a label; Prometheus will include the instance label to identify the node. If a custom node name is necessary, it can be specified in the Prometheus target configuration:

http://prometheus.io/docs/operating/configuration/#target-groups-target_group
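
For illustration, a minimal sketch of what that target configuration could look like, with a hypothetical job name, hostname, and exporter port (the docs linked above call these target_groups; newer Prometheus releases rename them to static_configs):

```yaml
scrape_configs:
  - job_name: 'elasticsearch'
    target_groups:
      # Hostname and port are placeholders; run one exporter next to each ES node.
      - targets: ['es1.example.com:9114']
        labels:
          # Optional custom node name, attached to every series scraped from this target.
          node: 'es1'
```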

What do you think?

Member Author

@ewr Are you sure this is a gauge? I couldn't find any documentation about whether this is a gauge or counter. It looks like a counter to me. Same for the filter_cache_evictions below.

Collaborator

I bet you're right. Those do sound like counters.

@ewr
Collaborator

ewr commented Jun 16, 2015

Sounds good to me.

Here's my counter-argument for the node name, though I'm not set one way or the other: the exporter doesn't have to use an Elasticsearch instance on localhost. That's the default argument value, but it isn't required.

In fact, the README reflects my original intent that one exporter doesn't even have to pull just one node... If you hit /_nodes/stats instead of /_nodes/_local/stats, you get the same stats for all nodes in the cluster. I didn't end up implementing that, but it's a possible case for how you might want to run it.

@ewr
Collaborator

ewr commented Jun 16, 2015

For discussion: A while back I had tweaked this exporter to roll up all the time and memory metrics into a time counter and memory gauge. I had apparently never committed this, but just pushed it to get some feedback:

https://github.com/ewr/elasticsearch_exporter/compare/SummaryMetrics

It made for nice graphing compared to having to list out all the different bytes and seconds metrics, but if it's considered a bad pattern I'm not tied to it.

@ewr
Collaborator

ewr commented Jun 16, 2015

And actually #1 is a PR to implement the all nodes functionality I mentioned. Oops.

@discordianfish
Member Author

@ewr In general I would just run one exporter on every ES instance. If you run one exporter per cluster, you somehow need to make it HA, which is more complicated, etc.
I know that @brian-brazil agrees and probably has some even better arguments.

That being said, it could of course support both, but I wouldn't recommend running it with cluster-wide stats, and it makes the code more complicated.

Re: rolling up the metrics, quoting the best practices:

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not.

@ewr
Collaborator

ewr commented Jun 16, 2015

Yeah, graphing sum(elasticsearch_memory) and seeing all the different pieces roll up was neat, as compared to sum(elasticsearch_jvm_heap_used + elasticsearch_segments_used + ... )

I'm not sure I was comprehensively pulling everything, but it did seem to pull together likes with likes.

@ewr
Collaborator

ewr commented Jun 16, 2015

Re: cluster vs node—My gut is to leave it in. I wouldn't recommend it for production, like you said, but the code complexity is really just one level in the JSON parsing and a different URL. For someone just getting their feet wet, it seems a nice feature.

@discordianfish
Member Author

@ewr I don't think that adding up all the times, for example, makes sense. It's all time, but it doesn't mean much when it's all added up, right? Same for the memory. It also shows in the help string you used: 'Roll up of various' kind of shows that it's not a logical thing you can give a single name, right?

Re: cluster vs node: Since they would otherwise duplicate the instance label (which is ugly and might collide), we really should omit them for the 'one exporter per node' use case. That's what would make the code more complicated.

@brian-brazil

If you run one exporter per cluster, you somehow need to make it HA, which is more complicated, etc.

That's correct; the best way to run this type of exporter is one per node, as that keeps the failure domains aligned.

Adding labels such as node and cluster may clash with other labels the user plans on using (for example, they'd be noise in my setup, as I'd be using the standard job and instance labels for that purpose). Unless you've got a type of exporter where there are multiple instances being scraped (which is discouraged), leave the instance and job class of labels up to the Prometheus configuration.

@ewr
Collaborator

ewr commented Jun 16, 2015

Well, there's a finite amount of work time per node, so the higher the cumulative rate goes, the more likely it is that the cluster won't be very responsive. Added up, it's "How much time per second did ES spend working?"

- Use CamelCase for allStats
- Preallocate gauge and counter maps
- Rename Uri to URL

Since the exporter returns stats for the local node only, it's not necessary to set this as a label since prometheus will include the instance label to identify the node. If a custom node name is necessary, this can be specified in the prometheus target configuration:

http://prometheus.io/docs/operating/configuration/#target-groups-target_group
@ewr
Collaborator

ewr commented Jun 16, 2015

For the labels, I get that running the exporter on the same instance and running one exporter per node is preferred, but I'm a little uneasy about removing that support entirely.

For cluster, that's something you never want to report an incorrect value for just because your monitoring config is wrong. ES knows its cluster, and it seems like it should be the authority on it. Plus, I don't have a clean way to convey that through my config (which gets generated from Consul service registrations).

@brian-brazil

Well, there's a finite amount of work time per node, so the higher the cumulative rate goes, the more likely it is that the cluster won't be very responsive. Added up, it's "How much time per second did ES spend working?"

If the time is partitioned over the labels then this could be okay.

Is CPU time spent by the process exposed anywhere? That's http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuTime()

@brian-brazil

For cluster, that's something you never want to report an incorrect value for just because your monitoring config is wrong. ES knows its cluster, and it seems like it should be the authority on it. Plus, I don't have a clean way to convey that through my config (which gets generated from Consul service registrations).

What ES thinks its name is isn't necessarily what its name is within the rest of the system - targets should pretty much never choose what their labels are for monitoring. I have a setup where I have two distinct ES clusters with the same internal name, for example.
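
To make that concrete, a hedged sketch (hostnames, port, and cluster names are made up): the cluster label can come from the Prometheus configuration rather than from ES, so two clusters that share an internal name still end up with distinct labels:

```yaml
scrape_configs:
  - job_name: 'elasticsearch'
    target_groups:
      - targets: ['es-a1:9114', 'es-a2:9114']
        labels:
          cluster: 'search-us'
      - targets: ['es-b1:9114', 'es-b2:9114']
        labels:
          cluster: 'search-eu'  # same internal ES name, distinct monitoring label
```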

@discordianfish
Member Author

What ES thinks its name is isn't necessarily what its name is within the rest of the system - targets should pretty much never choose what their labels are for monitoring. I have a setup where I have two distinct ES clusters with the same internal name, for example.

Good point, we have the same actually.

@discordianfish
Member Author

So I'd be happy with this as it is. We can address 'rolling up' the timer/memory metrics later, as well as having a global cluster view. Keep in mind that isn't working right now anyway; current master also only exposes metrics for the local node.

Collaborator

My error, but: this should be "Filter cache memory"

@ewr
Collaborator

ewr commented Jun 17, 2015

I'm 👍 on all of this except the removal of node and cluster. I understand that they aren't needed in the happy case where you're running everything via best practices, but it complicates or removes two potential modes of operation: scraping the whole cluster and scraping a remote node.

Is the driver here just that we don't want extra labels being generated when they aren't necessary? Couldn't you overwrite or drop them via relabeling anyway?

@brian-brazil

I understand that they aren't needed in the happy case where you're running everything via best practices, but it complicates or removes two potential modes of operation: scraping the whole cluster and scraping a remote node.

Scraping the cluster is a bit more complicated than adding some labels; you'll also need a special configuration on the Prometheus end to deal with the instance label clash.

If you must have the remote node different from the ES node, target discovery does allow controlling the instance label separately from the address that's scraped.
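
As a rough example of that (hostnames and port are hypothetical), the instance label can be set explicitly per target so it identifies the ES node even though the scraped address is the exporter host:

```yaml
scrape_configs:
  - job_name: 'elasticsearch'
    target_groups:
      # The exporter runs on a separate host but monitors es1.
      - targets: ['exporter-host.example.com:9114']
        labels:
          instance: 'es1.example.com'  # overrides the default instance (the scraped address)
```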

Couldn't you overwrite or drop them via relabeling anyway?

You could, with a feature added earlier in the week; however, that feature is intended only as a stopgap until you can fix the exporter.
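
For reference, in current Prometheus versions that kind of stopgap looks roughly like the sketch below, which drops an exporter-supplied node label at scrape time (this uses today's metric_relabel_configs syntax and may not be the exact feature referred to above; host and port are placeholders):

```yaml
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['es1.example.com:9114']
    metric_relabel_configs:
      # Remove the node label from every scraped series before ingestion.
      - action: labeldrop
        regex: 'node'
```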

@ewr ewr merged commit 9329c16 into prometheus-community:master Jun 17, 2015
@ewr
Collaborator

ewr commented Jun 17, 2015

In the interest of pushing the ball forward on this, I've re-added node and cluster labels and merged everything else. I'm fine to continue talking about those, but I didn't want the other bits stuck in that conversation.
