Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a collector for ZFS, currently focussed on ARC stats. #213

Closed
wants to merge 24 commits into from

Conversation

problame
Copy link
Contributor

It is tested on FreeBSD 10.2-RELEASE and Linux (ZFS on Linux 0.6.5.4).
It does not require root privileges to run on either of the above systems.

On FreeBSD, Solaris, etc. ZFS metrics are exposed through sysctls.
ZFS on Linux exposes the same metrics through procfs /proc/spl/....

In addition to sysctl metrics, computed metrics are exposed by
the collector, which are based on several sysctl values.
There is some conditional logic involved in computing these metrics
which cannot be easily mapped to PromQL.

Not all 92 ARC sysctls are exposed right now but this can be changed
with one additional LOC each.

Questions about behavior:

  1. What should happen if -collectors.enabled zfs is set but the stats cannot be found (e.g. because the kernel module is not loaded).
    Currently node_exporter terminates with an error message.

  2. If the behavior described above is not desired, should the collector determine at runtime whether data is available?

    2.1. Should it expose no metrics if no data is available?
    2.2. Could the zfs collector be enabled by default if it does nothing on a non-ZFS system?

It is tested on FreeBSD 10.2-RELEASE and Linux (ZFS on Linux 0.6.5.4).

On FreeBSD, Solaris, etc. ZFS metrics are exposed through sysctls.
ZFS on Linux exposes the same metrics through procfs `/proc/spl/...`.

In addition to sysctl metrics, 'computed metrics' are exposed by
the collector, which are based on several sysctl values.
There is some conditional logic involved in computing these metrics
which cannot be easily mapped to PromQL.

Not all 92 ARC sysctls are exposed right now but this can be changed
with one additional LOC each.
fetchedMetrics := make(map[zfsSysctl]prometheus.Gauge)

insertGauge := func(sysctl, name string) {
fetchedMetrics[zfsSysctl(sysctl)] = prometheus.NewGauge(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be using ConstMetrics rather than creating Gauges. Gauges are the like are for direction instrumentation, whereas a custom collector like this should use ConstMetrics.

@brian-brazil
Copy link
Contributor

The README will also need updating.

@problame
Copy link
Contributor Author

What was causing go vet to fail in the Travis CI build?

@brian-brazil
Copy link
Contributor

Looks like a false positive.

Use ConstMetric with UntypedValue.
Handle nozfs build flag.
Remove computed values.
collector now does not expose metrics if stats cannot be fetched.
@problame
Copy link
Contributor Author

Please review the above commits. Will squash them together as soon as you confirm all issues mentioned have been addressed.

Side note: Novice users will have trouble making sense of / interpreting ZFS statistics. Should we include a link to some resources, e.g. the implementation of zfs-stats?

(It is a popular tool on FreeBSD for getting a quick overview. Currently, node_exporter exposes a subset of zfs-stat's metrics, some of which are computed).

//------------------------------------------------------------------------------

type zfsMetric struct {
subsystem zfsSubsystemName // The Prometheus subsystem name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments should end in periods

@brian-brazil
Copy link
Contributor

Side note: Novice users will have trouble making sense of / interpreting ZFS statistics. Should we include a link to some resources,

Links to reference docs are good. As someone who uses ZFS on linux, the zfs-stats source doesn't seem to add anything that's not already obvious from the names.

return zfsErrorValue, fmt.Errorf("sysctl '%s' found")
}

func (p *zfsMetricProvider) prepareUpdateArcstats(zfsArcstatsProcpath string) (err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would benefit from a unittest

@mdlayher
Copy link
Contributor

Neat, did not realize this work was happening. @eliothedeman and I had previously started work on our own ZFS exporter, but integrating the information into node_exporter makes sense to me too.

We expose metrics using go-zfs, but as has been stated, many of these do require root on Linux. I run ZoL and would be very interested in making these stats accessible on Linux as well.

What would be an appropriate way to go about doing this? It seems that creating some sort of "ZFS helper" binary which can independently run as root and spit out metrics for the textfile collector could work.

func (c *zfsCollector) zfsAvailable() (err error) {
file, err := c.openArcstatsFile()
if err != nil {
file.Close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File needs to be closed in the err == nil case as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's new to me. Do you have any links on this one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check out this blog: http://blog.golang.org/defer-panic-and-recover.

You could also just use os.Stat() to check for the /proc file instead of actually opening it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And actually you only need to close the file when the opening was successful. When there was an error, you won't even get a valid *File back, but nil. So the code above would actually crash the entire node exporter with a nil pointer dereference panic in case the file can't be read.

@brian-brazil
Copy link
Contributor

What would be an appropriate way to go about doing this? It seems that creating some sort of "ZFS helper" binary which can independently run as root and spit out metrics for the textfile collector could work.

Has someone asked ZoL why all the stats we're interested in require root? No other filesystem requires that that I'm aware of.

@problame
Copy link
Contributor Author

@mdlayher
Copy link
Contributor

Filed an issue with ZoL: openzfs/zfs#4468.

@@ -0,0 +1,109 @@
package collector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include the same copyright/license header here as in all the other Node Exporter files. Same for the other Go files here.

@discordianfish
Copy link
Member

@problame @mdlayher It looks like delegation will be supported in 0.7.0. Maybe we can revive this PR?

@mdlayher
Copy link
Contributor

I'm definitely in favor. 👍

@problame
Copy link
Contributor Author

@discordianfish regarding the requirement for superuser privileges to run the zfs binary: even if it is dropped, the Prometheus policy of not shelling out prohibits using this functionality.

Remember that the part about parsing /proc/spl/kstat/zfs/arcstats (Linux) or alternatively sysctl (FreeBSD) works without superuser privileges on both platforms.

Only the per-dataset / per-pool metrics cannot be gathered without either

  • shelling out to zfs(8) (will work on ZoL 0.7, works on FreeBSD) (prohibited by Prometheus policy)
  • using libzfs (reimplementing zfs(8) functionality, relying on an unstable interface)

About the current state of this patch:

In the Linux-part, the cleanups mention by @brian-brazil probably need to be done. Should work as non-root for now.
In the FreeBSD-part, we violate the shelling-out policy right now, so we'd need to use the FreeBSD sysctl(2) API to first dicover and then read all the arcstats properties.

I am short on time right now and won't touch the FreeBSD side in the foreseeable future (sysctl(2) is rather tedious in comparison to some string processing of procfs pseudo-files).

Feel free to strip the FreeBSD-part and try to upstream the currently existing Linux-part, however ;)

@cbstewart9393
Copy link
Contributor

@problame My team at HPE would like to see the Linux part of this get upstreamed. Our thoughts are to do exactly what you said: get rid of the FreeBSD stuff and do the cleanups on the Linux stuff that @brian-brazil has mentioned. Would you be okay with these changes being made to try to get this PR accepted? I can start a new branch if it would be better to preserve the FreeBSD stuff on this one, or I can commit directly to this branch.
Anything else that needs to be done, aside from removing FreeBSD support and making the changes that have been requested via reviews in this PR?

@problame
Copy link
Contributor Author

@cbstewart9393 It doesn't look like I'm going to have the time to implement the FreeBSD part for this any time soon.
Hence, I would be glad to see this PR modified / merged following the procedure you described.

However, I would suggest to keep the implementation easily extensible to non-Linux OSes, as is now.

@discordianfish
Copy link
Member

Going to close this in favor of #369

@dominikh dominikh mentioned this pull request Sep 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants