
ST2 Metrics Collection #4004

Merged
merged 62 commits into from
May 15, 2018

Conversation

bigmstone
Contributor

@bigmstone bigmstone commented Feb 16, 2018

Fixes #2974

  • Changelog Entry
  • Tests
  • Example Config
  • Documentation Update
  • Prometheus Exploration (Or abandoning)

@Kami
Member

Kami commented Feb 16, 2018

Just some of my quick thoughts - if I worked on something like this, I would start in the following order:

  1. Code instrumentation (make sure we have instrumentation in place and all the metrics we care about)
  2. Expose those metrics via an HTTP endpoint - e.g. /v1/health/metrics or similar.
  3. Third party tool integration (Prometheus or similar)

I think integration with 3rd party tools is important, but it's even more important to have a "generic" API endpoint with metrics which everyone can consume (this includes our WebUI, Prometheus, or any other monitoring system a user might use).

On a related note, it would be good to identify / list somewhere the metrics we should collect. Perhaps we can just keep a list in the description of this PR.
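To make step 1 concrete: the instrumentation primitive this PR ends up using is a counter-plus-timer wrapper around a block of code. A minimal sketch of that idea, assuming a pluggable driver interface (the class shapes here are illustrative, not the PR's final API):

```python
import time


class BaseMetricsDriver(object):
    """No-op driver; concrete drivers (e.g. statsd) override these."""

    def inc_counter(self, key, amount=1):
        pass

    def time(self, key, value):
        pass


class CounterWithTimer(object):
    """Context manager that bumps a counter on entry and records the
    elapsed time for the wrapped block under `<key>_timer` on exit."""

    def __init__(self, key, driver=None):
        self._key = key
        self._driver = driver or BaseMetricsDriver()

    def __enter__(self):
        self._start = time.time()
        self._driver.inc_counter(self._key)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._driver.time(self._key + '_timer', time.time() - self._start)
        return False  # never swallow exceptions from the wrapped block
```

A statsd or Prometheus driver would then override `inc_counter` / `time` to actually emit the samples, which keeps the instrumented code agnostic about where metrics go.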

@bigmstone
Contributor Author

My original intent was to have an API endpoint where metrics were collected over a stream across our bus. Thinking that through, it would require us to begin collecting that information somewhere - mongo isn't the right tool, postgres isn't the right tool. We'd need to rely on another service to collect that information, which is yet another dependency for st2. We can make this dependency optional, but it also puts us in a position where we're opinionated about how to collect these metrics. That isn't desirable in environments that are already collecting stats in a specific way.

One thing I'm still considering is having a new st2 service that will be able to aggregate the collection of these metrics. This is primarily to accommodate both streaming style collection (statsd) as well as scraping type collection (prometheus). We could potentially do running average/in-memory stats and provide that via a health API there, but it will be important to note to the end user that the data won't persist anywhere, so if the service goes down for whatever reason the numbers won't be reliable. Also, what's our HA story with this type of implementation? A consensus algo running between the instances to constantly keep metrics in sync?

@enykeev
Member

enykeev commented Feb 19, 2018

I'm wondering if it is at all possible to expose these metrics some way outside of normal StackStorm communication channels (something along the lines of dtrace hooks) and leave gathering these metrics to an external process.

Contributor

@m4dcoder m4dcoder left a comment

Is there any way to implement this at the runner abstraction? I understand the need for runner-specific metrics like the python wrapper execution, but for the python runner, that can probably be generic.

# Note: We should eventually refactor this code to make runner standalone and not
# depend on a db connection (as it was in the past) - this param should be passed
# to the runner by the action runner container
with CounterWithTimer(PYTHON_RUNNER_EXECUTION):
Contributor

Can we do this with a decorator?

Contributor Author

ACK, yes we can.

Member

I'm personally not a big fan of decorators, at least when they are used in this manner (wrapping large blocks of code) - it adds another level of indentation and makes the code harder to read (imo, it increases cyclomatic complexity).

I'm fine when they are used to wrap function / method definitions. One way to reduce this level of indentation would be to move this code block into a utility method, then call that utility method inside the decorator. This way the original code block is indented as it was before.

Member

Edit: It looks like we are wrapping whole run method, in this case we should just add a decorator around the whole method :)
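Supporting both styles is straightforward: a class can act as a context manager and, via `__call__`, as a decorator, which is what lets a whole `run()` method be wrapped without adding a level of indentation. A sketch of the idea, with the actual driver emission stubbed out:

```python
import functools
import time


class CounterWithTimer(object):
    """Usable either as `with CounterWithTimer(key):` around a block
    or as `@CounterWithTimer(key)` around a whole method."""

    def __init__(self, key):
        self._key = key

    def __enter__(self):
        self._start = time.time()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # A real driver would emit this under self._key + '_timer'
        self.last_elapsed = time.time() - self._start
        return False  # propagate exceptions from the wrapped code

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Fresh instance per invocation so concurrent calls
            # don't share a start timestamp
            with self.__class__(self._key):
                return func(*args, **kwargs)
        return wrapper
```

`functools.wraps` keeps the decorated method's name and docstring intact, which matters for logging and introspection.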

Contributor

@Kami LOL

def send_time(self, key=None):
""" Send current time from start time.
"""
time_delta = datetime.now() - self._start_time
Contributor

How come this is not in UTC?

Contributor Author

Good catch.
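Worth noting: the timezone only matters here because two wall-clock reads are being subtracted. On Python 2 (st2's target at the time), switching to `datetime.utcnow()` is the direct fix; on Python 3, `time.monotonic()` sidesteps both the UTC question and wall-clock jumps entirely. A hypothetical Python 3 rework, with the driver call stubbed:

```python
import time


class Timer(object):
    """Hypothetical timer using a monotonic clock for durations."""

    def __init__(self):
        # monotonic() is immune to timezone and wall-clock changes,
        # so the UTC-vs-local question never arises for a duration
        self._start_time = time.monotonic()

    def send_time(self, key=None):
        """Return elapsed seconds since the timer started."""
        time_delta = time.monotonic() - self._start_time
        # A real driver would emit time_delta under `key` here
        return time_delta
```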

from st2common.metrics.drivers.statsd_driver import StatsdDriver
return StatsdDriver()

return BaseMetricsDriver()
Contributor

Can't we use stevedore to load the drivers?

Contributor Author

Perhaps, need to look into it.
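For reference, stevedore resolves a driver name to a class through setuptools entry points registered in `setup.py`, then instantiates it. The same name-to-class lookup can be hand-rolled with `importlib`, which is roughly what the mechanism amounts to (the st2 registry entry below is hypothetical):

```python
import importlib

# Minimal stand-in for stevedore's DriverManager: map a driver name
# to a "module:Class" path, import the module, instantiate the class.
# The statsd entry here is hypothetical, for illustration only.
DRIVER_REGISTRY = {
    'statsd': 'st2common.metrics.drivers.statsd_driver:StatsdDriver',
}


def load_driver(name):
    """Look up, import, and instantiate the driver registered under name."""
    path = DRIVER_REGISTRY[name]
    module_name, class_name = path.split(':')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()
```

With stevedore itself, the registry lives in package metadata instead of a dict, so third parties can ship drivers without patching st2common.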

# file but don't have oslo context.
from st2common.metrics.metrics import CounterWithTimer
from st2common.constants.metrics import METRICS_REGISTER_RUNNER
with CounterWithTimer(METRICS_REGISTER_RUNNER):
Contributor

Use a decorator?

Contributor Author

Ack

METRICS_TIMER_SUFFIX = "_timer"

PYTHON_RUNNER_EXECUTION = "python_runner_execution"
PYTHON_WRAPPER_EXECUTION = "python_wrapper_execution"
Contributor

Metrics for other runners?

Contributor Author

Yes, I like the idea of doing generic runner metrics at the runner caller as you mentioned. I also need to incorporate the feedback from the community and internal teams on what to track. This is progressing currently.

@m4dcoder
Contributor

@bigmstone, what happens if there's a failure during metrics collection, or a failure with the statsd server? Can you add some tests that show errors will be handled gracefully, exceptions will be logged, and execution of the action won't be affected? Thanks

@bigmstone
Contributor Author

bigmstone commented Mar 16, 2018

@m4dcoder nothing. It's UDP, so it serializes the packet and that's all. That said, I should see if there's anything the package is set to raise and catch/log it - but that wouldn't come from a network error or the like. It would come from an internal error or similar, possibly also from a poor configuration (non-usable port, non-usable IP, etc.)
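One defensive pattern for exactly this: wrap every driver call so any exception (bad config, serialization error, etc.) is logged and swallowed, guaranteeing a metrics failure can never break execution of the action. A sketch, with the failing statsd call simulated:

```python
import functools
import logging

LOG = logging.getLogger(__name__)


def ignore_and_log_exception(func):
    """Log any exception from a metrics operation and return None
    instead of propagating, so metrics failures stay non-fatal."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            LOG.exception('Metrics operation %s failed', func.__name__)
            return None
    return wrapper


@ignore_and_log_exception
def send_counter(key):
    # Stand-in for a statsd driver call blowing up on bad config
    raise ValueError('non-usable port')
```

A test for the PR could then assert that an action run completes even when the driver raises, mirroring this behavior.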

@@ -119,7 +123,8 @@ def pre_run(self):
if self._log_level == PYTHON_RUNNER_DEFAULT_LOG_LEVEL:
self._log_level = cfg.CONF.actionrunner.python_runner_log_level

def run(self, action_parameters):
@CounterWithTimer(PYTHON_RUNNER_EXECUTION)
Member

Thanks 👍


def metrics_initialize():
"""Initialize metrics constant
"""
Contributor

I believe you need to reference the METRICS constant as global. See this example as reference https://github.com/StackStorm/st2/blob/39517c0fb80359c48324a5b6ba4088972fbcb9de/st2common/st2common/services/coordination.py.

@bigmstone bigmstone force-pushed the issue-2974/metrics branch from 5aae4ae to e65d268 Compare May 7, 2018 21:03
Makefile Outdated
@@ -275,6 +275,9 @@ requirements: virtualenv .sdist-requirements
# new version of requests) which we cant resolve at this moment
$(VIRTUALENV_DIR)/bin/pip install "prance==0.6.1"

# Install st2common to register metrics drivers
Contributor

How about s/register metrics drivers/register common plugins?

Makefile Outdated
@@ -275,6 +275,9 @@ requirements: virtualenv .sdist-requirements
# new version of requests) which we cant resolve at this moment
$(VIRTUALENV_DIR)/bin/pip install "prance==0.6.1"

# Install st2common to register metrics drivers
(cd ${ROOT_DIR}/st2common; ${ROOT_DIR}/$(VIRTUALENV_DIR)/bin/python setup.py install)
Contributor

I'm ok with this. I thought `make requirements` already did this, because the last time I tested this branch I didn't have to explicitly run the setup. Can you use python setup.py develop instead?

Contributor Author

@m4dcoder I'm getting all kinds of nasty side effects from doing this. I've tried develop and install trying to work around it. Found a strange WebOb bug that manifests itself only when st2common is installed, and was able to resolve it by bumping the version. Now I'm investigating these config issues. Long story short, just ignore my commits until I get the tests passing.

Contributor

@m4dcoder m4dcoder left a comment

LGTM. Just need to rebase and get CI to pass.

@bigmstone bigmstone merged commit 49eb46d into master May 15, 2018
@bigmstone bigmstone deleted the issue-2974/metrics branch May 15, 2018 19:04