feat: add profiler for request execution details for write api connection worker #2555
Conversation
The usage of the new API will be added in the next PR.
Force-pushed from b4e467d to ef12805
So if I understand correctly, once this profiler is available, subsequent commits will then start using it in various places within the codebase. There is definitely overlap with the metrics being generated for OpenTelemetry. Perhaps eventually we could automatically profile those metrics with this tool.
private static final Logger log = Logger.getLogger(RequestProfiler.class.getName());

// Control per how many requests we log one time for a dropped operation.
I'm not clear on what is meant by "dropped". Does this mean a request was rejected?
When we cache too many requests we would drop the new ones; added a comment.
* </pre>
*/
public class RequestProfiler {
  enum OperationName {
Why does the name of the operation need to be pre-defined or known in advance to the profiler? Is this just to ensure consistency in how it is used?
Yeah, consistency, so that we wouldn't have to deal with random names.
Let's ensure there is consistency between the metric names identified here, and those currently used and future ones to be added for OpenTelemetry metrics. The latter are defined in this document: http://go/writeapi-telemetry. See the names used under "List of instruments".
If there is a need for profiling metrics there is likely a need for the same OpenTelemetry metric.
I think the current overlap is the backend latency; let's change it to NETWORK_RESPONSE_LATENCY.
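For reference, a minimal sketch of what the OperationName enum could look like with that rename applied, using the operation set shown in the diff snippets in this PR. The getter and the final constant names are illustrative and may not match the merged code.

```java
enum OperationName {
  // Json to proto conversion time.
  JSON_TO_PROTO_CONVERSION("json_to_proto_conversion"),
  // Time spent to fetch the table schema when the user didn't provide it.
  SCHEMA_FETCHING("schema_fetching"),
  // Time spent within the wait queue before the request gets picked up.
  WAIT_QUEUE("wait_queue"),
  // Time between sending the request and receiving the response; name per the rename suggested above.
  NETWORK_RESPONSE_LATENCY("network_response_latency");

  private final String operationName;

  OperationName(String operationName) {
    this.operationName = operationName;
  }

  String getOperationName() {
    return operationName;
  }
}
```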
// Control per how many requests we log one time for a dropped operation.
// An operation can be dropped if we are caching too many requests (by default 100000) in memory. If we are in
// that state, any new operation for new requests would be dropped.
private static final int LOG_PER_DROPPED_OPERATION = 50;
OK. Now I understand this constant better. It really just helps with reducing the volume of log messages generated around dropped operations. Maybe it would be more clear if you defined this constant after defining the MAX_CACHED_REQUEST. Perhaps a better name would be DROPPED_OPERATION_LOG_FREQUENCY.
I think let's just remove this variable and the related logic. The upper cache bound is really large; I would say safeguarding against such an extreme case does not add much value.
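For context, a rough sketch of how the constant was intended to be used before this thread settled on removing it: a counter rate-limits the dropped-operation log lines. The class name, MAX_CACHED_REQUEST, and droppedOperationCount below are taken from the surrounding discussion and are illustrative only.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

class DroppedOperationLoggingSketch {
  private static final Logger log =
      Logger.getLogger(DroppedOperationLoggingSketch.class.getName());

  // Upper bound on in-memory cached requests (100000 by default, per the code comment above).
  private static final int MAX_CACHED_REQUEST = 100000;
  // Log only once per this many dropped operations, to keep log volume manageable.
  private static final int LOG_PER_DROPPED_OPERATION = 50;

  private final AtomicLong droppedOperationCount = new AtomicLong();

  void dropOperation(String requestUniqueId) {
    long dropped = droppedOperationCount.incrementAndGet();
    if (dropped % LOG_PER_DROPPED_OPERATION == 0) {
      log.warning(
          String.format(
              "Dropped operation for request %s; %d operations dropped so far.",
              requestUniqueId, dropped));
    }
  }
}
```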
// Json to proto conversion time.
JSON_TO_PROTO_CONVERSION("json_to_proto_conversion"),
// Time spent to fetch the table schema when user didn't provide it.
SCHEMA_FECTCHING("schema_fetching"),
typo here, should be SCHEMA_FETCHING
Done
Force-pushed from 082c376 to a8f8d5a
I'm wondering if tracking every single request will result in excessive memory consumption. To deal with this, you may need to:
(a) use histogram buckets to break down individual values into aggregate values, or
(b) dynamically keep track of only the K longest-latency requests per type (see the sketch below).
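A minimal sketch of what option (b) could look like, assuming a bounded min-heap per operation type; this approach was ultimately not adopted (see the resolution below), and all names and sizes here are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Bounded min-heap that keeps only the K largest latencies seen so far.
class TopKLatencyTracker {
  private final int k;
  // Min-heap: the smallest tracked latency sits at the head and is evicted first when a
  // slower request arrives while the heap is full.
  private final PriorityQueue<Long> latenciesMillis = new PriorityQueue<>();

  TopKLatencyTracker(int k) {
    this.k = k;
  }

  synchronized void record(long latencyMillis) {
    if (latenciesMillis.size() < k) {
      latenciesMillis.offer(latencyMillis);
    } else if (!latenciesMillis.isEmpty() && latenciesMillis.peek() < latencyMillis) {
      latenciesMillis.poll();
      latenciesMillis.offer(latencyMillis);
    }
  }

  // Returns the tracked latencies, slowest first.
  synchronized List<Long> snapshotSlowestFirst() {
    List<Long> result = new ArrayList<>(latenciesMillis);
    result.sort(Collections.reverseOrder());
    return result;
  }
}
```

This bounds memory at K entries per operation type, at the cost of losing the full distribution of request latencies.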
Force-pushed from daac318 to f0cbac8
Discussed offline that it's not easy to come up with a clean solution that would keep only the top k requests at all times to save memory. Since it's a debugging-only class and each flush releases the memory of past requests, we think it's fine to continue with the current solution.
+ "seen before, this is possible when " | ||
+ "we are recording too much ongoing requests. So far we has dropped %s operations.", | ||
requestUniqueId, droppedOperationCount)); | ||
droppedOperationCount.incrementAndGet(); |
Should this be incremented here as well? What if the calling code always calls startOperation() and endOperation()?
Done
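A hedged sketch of the fix being asked for above: count a drop on the endOperation() path too when the request id was never registered. The field and type names below are placeholders, not the actual implementation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class EndOperationSketch {
  private final ConcurrentHashMap<String, Object> idToIndividualOperation = new ConcurrentHashMap<>();
  private final AtomicLong droppedOperationCount = new AtomicLong();

  void endOperation(String requestUniqueId) {
    Object tracked = idToIndividualOperation.get(requestUniqueId);
    if (tracked == null) {
      // The request was never registered (for example, it was dropped earlier because too many
      // requests were already cached), so count it as dropped here as well, mirroring the
      // startOperation() path.
      droppedOperationCount.incrementAndGet();
      return;
    }
    // ... otherwise record the end timestamp for the tracked request ...
  }
}
```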
* -----------------------------
* Request uuid: request_1 with total time 1000 milliseconds
* Operation name json_to_proto_conversion starts at: 1720566109971, ends at: 1720566109971, total time: 200 milliseconds
* Operation name backend_latency starts at: 1720566109971, ends at: 1720566109971, total time: 800 milliseconds
I am wondering if we can name this better; this is not pure backend_latency, it is backend_latency + network_latency.
Another problem is that we still cannot join this with full backend performance stats.
The current way we can join is through timestamps, unless this request id can be passed to the backend.
Let's call it response_latency to make it clear.
* ...
* </pre>
*/
public class RequestProfiler {
Should this have package visibility? Maintaining a public class is a lot more difficult.
Done
I don't see the change? It is still a public class.
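For clarity, package visibility in Java just means dropping the public modifier from the class declaration, e.g.:

```java
// Package-private: visible only within its own package, so the profiler does not become part of
// the library's supported public surface.
class RequestProfiler {
  // ...
}
```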
// Time spent to fetch the table schema when user didn't provide it.
SCHEMA_FECTCHING("schema_fetching"),
// Time spent within wait queue before it gets picked up.
WAIT_QUEUE("wait_queue"),
I am wondering if additional stats, such as wait queue length, can be attached.
We could, but that requires a really hard interface change; I want to shrink the scope to time-related profiling for now.
feat: add profiler for request execution details for write api connection worker (googleapis#2555) * Add profiler for request execution details. The usage of the new API will be added in the next PR
Add a profiler to generate a periodic report.
In the next PR, we will start adding the points of interest at which we want to examine performance.
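A hypothetical usage sketch of how callers might instrument a request once the profiler is wired in, assuming startOperation()/endOperation() keyed by a request id (as mentioned in the review thread) and a periodic flush that emits the report; exact method names and signatures may differ from the merged code.

```java
// Hypothetical usage; method names and signatures are assumptions, not the merged API.
RequestProfiler profiler = new RequestProfiler();
String requestUniqueId = "request_1";

profiler.startOperation(RequestProfiler.OperationName.JSON_TO_PROTO_CONVERSION, requestUniqueId);
// ... convert the JSON rows to proto ...
profiler.endOperation(RequestProfiler.OperationName.JSON_TO_PROTO_CONVERSION, requestUniqueId);

profiler.startOperation(RequestProfiler.OperationName.WAIT_QUEUE, requestUniqueId);
// ... request waits in the queue until the connection worker picks it up ...
profiler.endOperation(RequestProfiler.OperationName.WAIT_QUEUE, requestUniqueId);

// Periodic flush: emits a report like the example in the class javadoc and releases the memory
// held for past requests. The method name here is a placeholder.
profiler.flushReport();
```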