Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats/opencensus: Add OpenCensus metrics support #5923

Merged
merged 4 commits into from
Feb 7, 2023

Conversation

zasweq
Copy link
Contributor

@zasweq zasweq commented Jan 11, 2023

This PR adds OpenCensus metrics to gRPC Go. The e2e tests all test at the exporter level and expect certain metrics emissions from RPC's being performed. Based on #5919.

RELEASE NOTES:

  • stats/opencensus: Add OpenCensus metrics support

@zasweq zasweq requested a review from dfawley January 11, 2023 14:50
@zasweq zasweq self-assigned this Jan 11, 2023
@zasweq zasweq added this to the 1.53 Release milestone Jan 11, 2023
@zasweq zasweq added the Type: Feature New features or improvements in behavior label Jan 11, 2023
@zasweq zasweq assigned dfawley and unassigned zasweq Jan 11, 2023
@zasweq zasweq force-pushed the opencensus-metrics-support branch from 9210455 to bc6f8d4 Compare January 11, 2023 19:27
stats/opencensus/e2e_test.go Outdated Show resolved Hide resolved
codes/codes.go Outdated
@@ -195,6 +195,27 @@ const (
_maxCode = 17
)

// CodeToStr maps from the code type to strings with only uppercase letters.
var CodeToStr = map[Code]string{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not export this as a map, and make it a permanent part of our API.

Options? One easy one is putting it in internal instead. (Sadly we decided to output a Go-looking name from String(), and changing it to report the canonical name instead would probably break too many tests.)

Maybe a CanonicalString method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Plumbed through internal with a CanonicalString method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete this now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Whoops forgot to delete sorry.

keyClientStatus = tag.MustNewKey("grpc_client_status")
)

// Measures, which are recorded by client stats handler (and interceptor?): Note
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and interceptor?)

Let's not be unsure about what our code does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has question mark at the end of it because I don't know if it's worth noting if it hasn't happened yet (see comment about Yashes per call latency stuff elsewhere).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no current measures being recorded at the interceptor level, I simply deleted this comment.

}
// ClientServerLatencyView is the distribution of server latency in
// milliseconds per RPC, keyed on method.
ClientServerLatencyView = &view.View{ // TODO: Same recording point as round trip latency, perhaps I should just delete.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the legacy metric?

Is this the per-attempt one and the other one is per-call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the contexts are still scoped to per attempt, as that is the scope of the TagRPC/HandleRPC calls. The per-call will need to be added later in the interceptor. I plan on adding the per call metrics and all the new tracing stuff at the per call level needed nearer GA date. I decided to delete this metric. It's not used in o11y currently and not defined in Yashes GA metrics, and it doesn't seem useful, as it adds no value.

if vi == nil && vi2 == nil {
return true
}
if (vi != nil) != (vi2 != nil) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, nit: if vi == nil || vi2 == nil { return false } is equivalent and will save a comp and allow short circuiting sometimes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming short circuiting means to exit early from the first comparison since it's an or (and the comparison we can do because of passing the check above). Switched here and elsewhere. (need to add this to my traces PR).

stats/opencensus/e2e_test.go Show resolved Hide resolved
Comment on lines +80 to +83
d := &rpcData{
startTime: time.Now(),
method: info.FullMethodName,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this get moved into setRPCData since it's common between client & server?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure...I have a bit of hesitancy due to the possibility that these fork in logic needed (doesn't seem like it for GA), and even for retry delay per call (in a45 if we ever get to parity) it seems like start time is still scoped correctly as these calls happen per attempt client side. Switched to setRPCData.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I tried switching it, and I realized this needs to read info.methodName. I could add it to setRPCData if you'd like and declare an inline struct &rpcData{ startTime: time.Now(), method: string passed in} but keeping it as is seems better from a maintainability perspective. Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just thought you'd also pass info into setRPCData -- does that not work? Not a big deal, just a minor optional suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I like it as is now. Passing a stats tag rpc struct into a context helper which to me should be to stick a piece of information in the context feels off. The info from this stats tag rpc struct gets converted into rpc data struct, the subset of information from tag rpc + additional (timestamp) needed to successfully recorded measurements throughout the RPC lifecycle. Thus coupling these two types of data where you pass in one to context setter and it converts automatically feels off. Plus, same comment about client/server partition/maintainability over time.

stats/opencensus/stats.go Show resolved Hide resolved
stats/opencensus/stats.go Outdated Show resolved Hide resolved
Comment on lines 132 to 137
d, ok := ctx.Value(rpcDataKey{}).(*rpcData)
if !ok {
// Shouldn't happen, as gRPC calls TagRPC which populates the rpcData in
// context.
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving this into recordRPCData so it only needs to happen in one place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I swear I had this thought and I made a mental TODO. Switched over.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I remember why I didn't do this now. Context is used as an arg in ocstats.RecordWithOptions. However, you can simply read the pointer in recordRPCData, pass just the pointer to the other functions, and pass the pointer plus context to recordDataBegin and recordDataEnd.

stats/opencensus/stats.go Show resolved Hide resolved
Copy link
Contributor Author

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pass :)!

if vi == nil && vi2 == nil {
return true
}
if (vi != nil) != (vi2 != nil) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming short circuiting means to exit early from the first comparison since it's an or (and the comparison we can do because of passing the check above). Switched here and elsewhere. (need to add this to my traces PR).

stats/opencensus/e2e_test.go Show resolved Hide resolved
stats/opencensus/e2e_test.go Outdated Show resolved Hide resolved
stats/opencensus/stats.go Outdated Show resolved Hide resolved
stats/opencensus/stats.go Show resolved Hide resolved
ctx = tag.NewContext(ctx, tags)
}
}
ctx, _ = tag.New(ctx, tag.Upsert(keyServerMethod, removeLeadingSlash(info.FullMethodName)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function errors if the tag map in the context is populated with an invalid key or value path or the mutator fails to work. Both of these values are set by tag.Decode earlier (and it not erroring), and tag.Upsert, both from their tag library, and thus the error shouldn't hit. If the error does hit, it means that the tag map does not have the correct KeyServerMethod. This isn't used for any metrics, and is really just populated for the caller to pull out for their own use. This package will continue to record measurements correctly if this error occurs. However, since this won't really ever happen, and this package will continue to record measurements correctly, I think it's best to ignore the error, and return the context with rpc data irregardless if this tag got populated correctly. This code block is logically equialent to what is currently present:
var newContext context.Context
var err error
if newContext, err = tag.New(ctx, tag.Upsert(keyServerMethod, removeLeadingSlash(info.FullMethodName))); err != nil {
return setRPCData(ctx, d)
}
return setRPCData(newContext, d)

stats/opencensus/stats.go Outdated Show resolved Hide resolved
}
// ClientServerLatencyView is the distribution of server latency in
// milliseconds per RPC, keyed on method.
ClientServerLatencyView = &view.View{ // TODO: Same recording point as round trip latency, perhaps I should just delete.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the contexts are still scoped to per attempt, as that is the scope of the TagRPC/HandleRPC calls. The per-call will need to be added later in the interceptor. I plan on adding the per call metrics and all the new tracing stuff at the per call level needed nearer GA date. I decided to delete this metric. It's not used in o11y currently and not defined in Yashes GA metrics, and it doesn't seem useful, as it adds no value.

bytesDistributionBounds = []float64{1024, 2048, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864, 268435456, 1073741824, 4294967296}
bytesDistribution = view.Distribution(bytesDistributionBounds...)
millisecondsDistribution = view.Distribution(0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 20, 25, 30, 40, 50, 65, 80, 100, 130, 160, 200, 250, 300, 400, 500, 650, 800, 1000, 2000, 5000, 10000, 20000, 50000, 100000)
countDistributionBounds = []float64{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536} // float64 correct here?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md#measures this just lists distributions. As you can see, milliseconds distribution is the one with fractions of integers. Since the others bounds are discrete intervals, and their counts are discrete intervals (that are all positive), I'm going to switch this and bytes distribution to uint64 type.

bytesDistributionBounds = []float64{1024, 2048, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864, 268435456, 1073741824, 4294967296}
bytesDistribution = view.Distribution(bytesDistributionBounds...)
millisecondsDistribution = view.Distribution(0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 20, 25, 30, 40, 50, 65, 80, 100, 130, 160, 200, 250, 300, 400, 500, 650, 800, 1000, 2000, 5000, 10000, 20000, 50000, 100000)
countDistributionBounds = []float64{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536} // float64 correct here?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, sorry, all their functions (view.Distribution), and the bounds they pass in Export View are all of float64 type, thus I must leave as such.

@zasweq zasweq assigned dfawley and unassigned zasweq Jan 31, 2023
codes/codes.go Outdated
@@ -195,6 +195,27 @@ const (
_maxCode = 17
)

// CodeToStr maps from the code type to strings with only uppercase letters.
var CodeToStr = map[Code]string{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete this now

s, _ := status.FromError(e.Error) // ignore second argument because codes.Unknown is fine
st = codes.CodeToStr[s.Code()]
s, _ := status.FromError(e.Error)
st = internal.CanonicalString.(func(codes.Code) string)(s.Code())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do this cast once globally:

var canonicalString = internal.CanonicalString.(func(codes.Code) string)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Curious: why do you want this cast globally? I have similar typecasts in the gcp/observability module (GlobalDialOptions/ServerOptions plumbed through internal/), and I use them more than once, but you never mentioned it there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming it's so if typecast fails, fails at init time rather than runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the big reason. Also, to avoid doing the cast every time you call it, since it isn't free.

ctx = tag.NewContext(ctx, tags)
}
}
ctx, _ = tag.New(ctx, tag.Upsert(keyServerMethod, removeLeadingSlash(info.FullMethodName)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the error does hit, it means that the tag map does not have the correct KeyServerMethod. This isn't used for any metrics, and is really just populated for the caller to pull out for their own use

If it returns an error, is the context it returns valid? Or does it return nil? If the latter, then we need to check the error just in case.

Comment on lines +80 to +83
d := &rpcData{
startTime: time.Now(),
method: info.FullMethodName,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just thought you'd also pass info into setRPCData -- does that not work? Not a big deal, just a minor optional suggestion.

Comment on lines 714 to 715
for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel that strongly, but you're running the same test code multiple times and just checking different results, which seems strange to me. If it fails, you know what was wrong by what was different.

bytesDistributionBounds = []float64{1024, 2048, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864, 268435456, 1073741824, 4294967296}
bytesDistribution = view.Distribution(bytesDistributionBounds...)
millisecondsDistribution = view.Distribution(0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 20, 25, 30, 40, 50, 65, 80, 100, 130, 160, 200, 250, 300, 400, 500, 650, 800, 1000, 2000, 5000, 10000, 20000, 50000, 100000)
countDistributionBounds = []float64{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536} // float64 correct here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG; I was just wanting you to follow-up on your question/comment and delete it. :)

@zasweq zasweq assigned zasweq and unassigned dfawley Jan 31, 2023
@zasweq zasweq assigned dfawley and unassigned zasweq Feb 3, 2023
@dfawley dfawley assigned zasweq and unassigned dfawley Feb 7, 2023
@zasweq zasweq merged commit f69e9ad into grpc:master Feb 7, 2023
arvindbr8 pushed a commit to arvindbr8/grpc-go that referenced this pull request Feb 8, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants