Add ATM dev environment docker-compose and API doc #3171

Merged
merged 8 commits into from
Aug 4, 2021
Changes from 7 commits
187 changes: 187 additions & 0 deletions docker-compose/monitor/README.md
@@ -0,0 +1,187 @@
# Aggregated Trace Metrics Development/Demo Environment

Aggregated Trace Metrics (ATM) is an opt-in feature introduced to Jaeger that provides Request, Error and Duration (RED) metrics grouped by service name and operation that are derived from span data. These metrics are programmatically available through an API exposed by jaeger-query along with a "Monitor" UI tab that visualizes these metrics as graphs. For more details on this feature, please refer to the [tracking Issue](https://github.com/jaegertracing/jaeger/issues/2954) documenting the proposal and status.

This environment serves two purposes. It lets developers test Jaeger UI or their own applications against jaeger-query's metrics query API, and it gives users a quick and simple way to bring up the entire stack required to visualize RED metrics from simulated traces (or their own), much like Jaeger All-in-one.

This environment consists of four backend components:

- [MicroSim](https://github.com/yurishkuro/microsim): a program to simulate traces.
- [Jaeger All-in-one](https://www.jaegertracing.io/docs/1.24/getting-started/#all-in-one): the full Jaeger stack in a single docker image.
- [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/): vendor agnostic integration layer for traces and metrics. Its main role in this particular development environment is to receive Jaeger spans, forward these spans untouched to Jaeger All-in-one while simultaneously aggregating metrics out of this span data. To learn more about span metrics aggregation, please refer to the [spanmetrics processor documentation](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/spanmetricsprocessor).
- [Prometheus](https://prometheus.io/): a metrics collection and query engine. It scrapes the metrics computed by OpenTelemetry Collector and exposes an API that Jaeger All-in-one uses to query these metrics.

The following diagram illustrates the relationship between these components:

![ATMDev (1)](https://user-images.githubusercontent.com/26584478/127763924-4ae2ce88-8fc2-4def-90c2-55358a433905.png)

# Getting Started

## Bring up/down the dev environment
```
docker compose up
docker compose down
```

## Example 1
Fetch call rates for both the driver and frontend services, grouped by operation, from now,
looking back 1 second with a sliding rate-calculation window of 1m and step size of 100 milliseconds.

```bash
curl http://localhost:16686/api/metrics/calls\?service\=driver\&service\=frontend\&groupByOperation\=true\&endTs\="$(date +%s)"000\&lookback\=1000\&step\=100\&ratePer\=60000 | jq .
```
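
The same request can also be assembled programmatically. The following is a minimal Python sketch, not part of this PR: the helper name `calls_url` is hypothetical, and the base URL assumes jaeger-query is published on `localhost:16686` as in the compose file. It mirrors the curl command by expressing `endTs` in milliseconds and repeating the `service` key once per service.

```python
import time
from urllib.parse import urlencode

BASE = "http://localhost:16686"  # assumption: port published by docker-compose.yml


def calls_url(services, end_ts_ms=None, lookback_ms=1000, step_ms=100, rate_per_ms=60000):
    """Build the /api/metrics/calls URL from Example 1 (hypothetical helper)."""
    if end_ts_ms is None:
        # "now" as POSIX milliseconds; the curl example appends "000" to seconds instead.
        end_ts_ms = int(time.time() * 1000)
    params = [("service", s) for s in services]  # one 'service=' pair per service
    params += [
        ("groupByOperation", "true"),
        ("endTs", end_ts_ms),
        ("lookback", lookback_ms),  # 1000 ms = 1 second
        ("step", step_ms),          # 100 ms between data points
        ("ratePer", rate_per_ms),   # 60000 ms = 1 minute rate window
    ]
    return f"{BASE}/api/metrics/calls?{urlencode(params)}"


print(calls_url(["driver", "frontend"]))
```

Passing a fixed `end_ts_ms` makes the output reproducible, which can be handy when comparing results across runs.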


## Example 2
Fetch P95 latencies for both the driver and frontend services from now,
looking back 1 second with a sliding rate-calculation window of 1m and step size of 100 milliseconds, where the span kind is either "server" or "client".

```bash
curl http://localhost:16686/api/metrics/latencies\?service\=driver\&service\=frontend\&quantile\=0.95\&endTs\="$(date +%s)"000\&lookback\=1000\&step\=100\&ratePer\=60000\&spanKind\=server\&spanKind\=client | jq .
```

## Example 3
Fetch error rates for both driver and frontend services using default parameters.
```bash
curl http://localhost:16686/api/metrics/errors\?service\=driver\&service\=frontend | jq .
```

## Example 4
Fetch the minimum step size supported by the underlying metrics store.
```bash
curl http://localhost:16686/api/metrics/minstep | jq .
```

# HTTP API
Contributor Author

I'm planning to add this section into https://www.jaegertracing.io/docs/latest/apis/, would that be an appropriate place or is it sufficient to leave this API doc here?

Contributor

It would be there as well, IMO.


## Query Metrics

`/api/metrics/{metric_type}?{query}`

Where (Backus-Naur form):
```
metric_type = 'latencies' | 'calls' | 'errors'

query = services , [ '&' optionalParams ]

optionalParams = param | param '&' optionalParams

param = groupByOperation | quantile | endTs | lookback | step | ratePer | spanKinds

services = service | service '&' services
service = 'service=' strValue
- The list of services to include in the metrics selection filter, which are logically 'OR'ed.
- Mandatory.

quantile = 'quantile=' floatValue
- The quantile to compute the latency 'P' value. Valid range (0,1].
- Mandatory for 'latencies' type.

groupByOperation = 'groupByOperation=' boolValue
boolValue = '1' | 't' | 'T' | 'true' | 'TRUE' | 'True' | '0' | 'f' | 'F' | 'false' | 'FALSE' | 'False'
- A boolean value which will determine if the metrics query will also group by operation.
- Optional with default: false

endTs = 'endTs=' intValue
- The POSIX timestamp, in milliseconds, of the end of the time range of the metrics query.
- Optional with default: now

lookback = 'lookback=' intValue
- The duration, in milliseconds, from endTs to look back on for metrics data points.
- For example, if set to `3600000` (1 hour), the query would span from `endTs - 1 hour` to `endTs`.
- Optional with default: 3600000 (1 hour).

step = 'step=' intValue
- The duration, in milliseconds, between data points of the query results.
- For example, if set to `5000` (5 seconds), the results would produce a data point every 5 seconds from `endTs - lookback` to `endTs`.
- Optional with default: 5000 (5 seconds).

ratePer = 'ratePer=' intValue
- The duration, in milliseconds, in which the per-second rate of change is calculated for a cumulative counter metric.
- Optional with default: 600000 (10 minutes).

spanKinds = spanKind | spanKind '&' spanKinds
spanKind = 'spanKind=' spanKindType
spanKindType = 'unspecified' | 'internal' | 'server' | 'client' | 'producer' | 'consumer'
- The list of spanKinds to include in the metrics selection filter, which are logically 'OR'ed.
- Optional with default: 'server'
```
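
To make the grammar concrete, here is a small Python sketch that builds a query string and enforces the constraints stated above (mandatory `service`, `quantile` in (0,1] for `latencies`, known `spanKind` values). The function name `metrics_query` and its signature are illustrative assumptions, not part of Jaeger; the server remains the authority on what it accepts.

```python
from urllib.parse import urlencode

METRIC_TYPES = {"latencies", "calls", "errors"}
SPAN_KINDS = {"unspecified", "internal", "server", "client", "producer", "consumer"}
OPTIONAL = {"groupByOperation", "endTs", "lookback", "step", "ratePer"}


def metrics_query(metric_type, services, quantile=None, span_kinds=(), **optional):
    """Return the path and query string for a metrics request (hypothetical helper)."""
    if metric_type not in METRIC_TYPES:
        raise ValueError(f"unknown metric_type: {metric_type!r}")
    if not services:
        raise ValueError("at least one service is mandatory")
    params = [("service", s) for s in services]
    if metric_type == "latencies":
        # quantile is mandatory for latencies and must lie in (0, 1].
        if quantile is None or not 0 < quantile <= 1:
            raise ValueError("latencies requires a quantile in (0, 1]")
        params.append(("quantile", quantile))
    # In this sketch, quantile is simply not sent for the other metric types.
    for kind in span_kinds:
        if kind not in SPAN_KINDS:
            raise ValueError(f"unknown spanKind: {kind!r}")
        params.append(("spanKind", kind))
    for name, value in optional.items():
        if name not in OPTIONAL:
            raise ValueError(f"unknown parameter: {name!r}")
        params.append((name, value))
    return f"/api/metrics/{metric_type}?{urlencode(params)}"


print(metrics_query("latencies", ["driver", "frontend"], quantile=0.95,
                    span_kinds=["server", "client"], lookback=1000))
```

Omitted optional parameters are left out of the query so that the server-side defaults listed above apply.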


## Min Step

`/api/metrics/minstep`

Returns the minimum time resolution, in milliseconds, supported by the backing metrics store. This is the smallest value that can be used in the `step` parameter.
For example, a min step of 1 means the backend can only return data points that are at least 1ms apart, not closer.
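
A client might use this value to avoid requesting a finer resolution than the store supports. The sketch below is an assumption about client-side usage rather than documented Jaeger behavior: `effective_step` is a hypothetical helper that takes the min step as a plain integer (how it is extracted from the endpoint's response is left out) and raises the requested step to that minimum when needed.

```python
def effective_step(requested_step_ms, min_step_ms):
    """Return a step no smaller than the store's minimum resolution (hypothetical helper)."""
    if requested_step_ms < 0 or min_step_ms < 0:
        raise ValueError("step values must be non-negative milliseconds")
    return max(requested_step_ms, min_step_ms)


# With a min step of 1 ms, a 5 second step is already valid and is kept as-is.
print(effective_step(5000, 1))
```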

## Responses

The response data model is based on [`MetricsFamily`](https://github.com/jaegertracing/jaeger/blob/master/model/proto/metrics/openmetrics.proto#L53).

For example:
```
{
  "name": "service_call_rate",
  "type": "GAUGE",
  "help": "calls/sec, grouped by service",
  "metrics": [
    {
      "labels": [
        {
          "name": "service_name",
          "value": "driver"
        }
      ],
      "metricPoints": [
        {
          "gaugeValue": {
            "doubleValue": 0.005846808321083344
          },
          "timestamp": "2021-06-03T09:12:06Z"
        },
        {
          "gaugeValue": {
            "doubleValue": 0.006960443672323934
          },
          "timestamp": "2021-06-03T09:12:11Z"
        },
        ...
```
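
As an illustration of consuming this structure, the Python sketch below turns one metric series into `(timestamp, value)` pairs. The `family` dictionary is a hand-completed copy of the truncated example above, and `points` is a hypothetical helper; it assumes timestamps always use the whole-second `...Z` form shown here, which real responses may not guarantee.

```python
from datetime import datetime, timezone

# Sample shaped like the example response; values are copied from it.
family = {
    "name": "service_call_rate",
    "type": "GAUGE",
    "help": "calls/sec, grouped by service",
    "metrics": [
        {
            "labels": [{"name": "service_name", "value": "driver"}],
            "metricPoints": [
                {"gaugeValue": {"doubleValue": 0.005846808321083344},
                 "timestamp": "2021-06-03T09:12:06Z"},
                {"gaugeValue": {"doubleValue": 0.006960443672323934},
                 "timestamp": "2021-06-03T09:12:11Z"},
            ],
        }
    ],
}


def points(metric):
    """Yield (datetime, float) pairs for one metric series (hypothetical helper)."""
    for p in metric["metricPoints"]:
        # Assumes the whole-second UTC format shown in the example.
        ts = datetime.strptime(p["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        yield ts.replace(tzinfo=timezone.utc), p["gaugeValue"]["doubleValue"]


for ts, value in points(family["metrics"][0]):
    print(ts.isoformat(), value)
```

The two sample points are 5 seconds apart, which matches the default `step` of 5000 milliseconds.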

If the `groupByOperation=true` parameter is set, the response will include the operation name in the labels like so:
```
"labels": [
  {
    "name": "operation",
    "value": "/FindNearest"
  },
  {
    "name": "service_name",
    "value": "driver"
  }
],
```
```
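
Because labels arrive as a list of name/value pairs, a client will usually want to flatten them before grouping series. A minimal Python sketch follows; `label_dict` and `series_key` are hypothetical helper names, and the sample `metric` only reproduces the label names shown above.

```python
def label_dict(metric):
    """Flatten the list of {"name", "value"} pairs into a plain dict."""
    return {label["name"]: label["value"] for label in metric["labels"]}


def series_key(metric):
    """Return (service_name, operation); operation is None when not grouped by operation."""
    labels = label_dict(metric)
    return labels["service_name"], labels.get("operation")


metric = {
    "labels": [
        {"name": "operation", "value": "/FindNearest"},
        {"name": "service_name", "value": "driver"},
    ]
}
print(series_key(metric))
```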

# Disabling Metrics Querying

As this feature is opt-in only, disabling metrics querying simply involves omitting the `METRICS_STORAGE_TYPE` environment variable when starting up jaeger-query or jaeger all-in-one.

For example, try removing the `METRICS_STORAGE_TYPE=prometheus` environment variable from the [docker-compose.yml](./docker-compose.yml) file.

Querying any metrics endpoint then results in an error message:
```
$ curl http://localhost:16686/api/metrics/minstep | jq .
{
  "data": null,
  "total": 0,
  "limit": 0,
  "offset": 0,
  "errors": [
    {
      "code": 405,
      "msg": "metrics querying is currently disabled"
    }
  ]
}
```
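
A client can check for this condition before trying to read metrics. The Python sketch below is illustrative only: `metrics_errors` is a hypothetical helper, and it assumes that a response without an `errors` array (such as the `MetricsFamily` example earlier) indicates success.

```python
import json


def metrics_errors(body):
    """Return the error messages found in a response body, or [] if there are none."""
    payload = json.loads(body)
    # 'errors' may be absent or null on success, so fall back to an empty list.
    return [err["msg"] for err in payload.get("errors") or []]


disabled = """
{
  "data": null, "total": 0, "limit": 0, "offset": 0,
  "errors": [{"code": 405, "msg": "metrics querying is currently disabled"}]
}
"""
print(metrics_errors(disabled))
```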
39 changes: 39 additions & 0 deletions docker-compose/monitor/docker-compose.yml
@@ -0,0 +1,39 @@
version: "3.5"
services:
  jaeger:
    networks:
      - backend
    image: jaegertracing/all-in-one:latest
Contributor

Is this file being tested via CI? If not, it might be better to use a fixed version.

Contributor Author

> Is this file being tested via CI?

Nope.

> it might be better to use a fixed version.

I'd like to consider one hypothetical scenario, keeping in mind that the ATM/Monitor feature is still under active development (the UI isn't ready yet).

We set the jaeger all-in-one tag to v1.24 here. If someone makes a change now and merges to master before v1.25 is released, if I understand correctly, will users need to wait until v1.26 to obtain/use the features just merged?

    environment:
      - METRICS_STORAGE_TYPE=prometheus
      - PROMETHEUS_SERVER_URL=http://prometheus:9090
    ports:
      - "14250:14250"
      - "14268:14268"
      - "6831:6831/udp"
      - "16686:16686"
      - "16685:16685"
  otel_collector:
    networks:
      - backend
    image: otel/opentelemetry-collector-contrib:latest
Contributor

Same here.

    volumes:
      - "./otel-collector-config.yml:/etc/otelcol/otel-collector-config.yml"
    command: --config /etc/otelcol/otel-collector-config.yml
  microsim:
    networks:
      - backend
    image: yurishkuro/microsim:0.2.0
    command: "-j http://otel_collector:14278/api/traces -d 24h -s 500ms"
    depends_on:
      - otel_collector
  prometheus:
    networks:
      - backend
    image: prom/prometheus:latest
    volumes:
      - "./prometheus.yml:/etc/prometheus/prometheus.yml"
    ports:
      - "9090:9090"
networks:
  backend:
36 changes: 36 additions & 0 deletions docker-compose/monitor/otel-collector-config.yml
@@ -0,0 +1,36 @@
receivers:
  jaeger:
    protocols:
      thrift_http:
        endpoint: "0.0.0.0:14278"

  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:65535"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

  jaeger:
    endpoint: "jaeger:14250"
    insecure: true

processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus

service:
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [spanmetrics, batch]
      exporters: [jaeger]
    # The exporter name in this pipeline must match the spanmetrics.metrics_exporter name.
    # The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
9 changes: 9 additions & 0 deletions docker-compose/monitor/prometheus.yml
@@ -0,0 +1,9 @@
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

scrape_configs:
  - job_name: aggregated-trace-metrics
    static_configs:
      - targets: ['otel_collector:8889']