xds: add Outlier Detection Balancer #5435
6a1f44b to b364c84
It seems this still needs to be rebased due to outlierdetection files.
b364c84 to 0c2ef7c
aeb3c19 to d8e42c2
Just skimmed through. Expect more comments as I spend more time on this.
v3clusterpb "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
v3endpointpb "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
v3listenerpb "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
v3routepb "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
"google.golang.org/grpc/internal"
"google.golang.org/grpc/internal/envconfig"
"google.golang.org/grpc/internal/stubserver"
"google.golang.org/grpc/internal/testutils/xds/e2e"
testgrpc "google.golang.org/grpc/test/grpc_testing"
testpb "google.golang.org/grpc/test/grpc_testing"
"google.golang.org/protobuf/types/known/durationpb"
"google.golang.org/protobuf/types/known/wrapperspb"
Nit: please group renamed proto imports separately.
// TestOutlierDetection tests an xDS configured ClientConn with an Outlier
// Detection present in the system which is a logical no-op. An RPC should
// proceed as normal.
func (s) TestOutlierDetection(t *testing.T) {
Maybe TestOutlierDetection_EmptyConfig or TestOutlierDetection_NoopConfig to be explicit about what is being tested here?
Chose the no-op name: the management server update is empty, but the resulting OD config is better termed "no-op". Thanks for the suggestion.
"google.golang.org/protobuf/types/known/wrapperspb"
)

// TestOutlierDetection tests an xDS configured ClientConn with an Outlier
How about?
// TestOutlierDetectionXxx tests the scenario where the outlier detection feature
// is enabled on the gRPC client, but it receives no outlier detection configuration
// from the management server. This should result in a no-op outlier configuration
// being used. This test verifies that an RPC is able to proceed normally with this
// configuration.
Done (with a little rewording).
}
}

// defaultClientResourcesSpecifyingMultipleBackendsAndOutlierDetection returns
And the award for the author with the longest identifier name in our codebase goes to ...... lol
Lol. A Java book I read, Clean Code, told me to be verbose and explanatory in my variable names. Switched to "defaultClientResourcesMultipleBackendsAndOD". Is that better, or still too verbose? If so, do you have a better suggestion?
Interval: &durationpb.Duration{
    Nanos: 500000000,
},
Nit: single line
And probably a comment with a human readable value for this time duration.
Done. Added comment // .5 seconds.
}
}

type object struct {
object is not a very descriptive name for a type which is quite central to the implementation. Maybe something like subConnState or subConnInfo or subConnRunTimeInfo or subConnRunTimeState or something better. And a comment for the type describing its purpose would be useful too.
Also, this type does not contain any mutexes. It would be helpful for the above comment to talk about how synchronization is handled for this type.
This struct represents information about a certain address, not a SubConn. Each SubConn does hold a ref to this though. As such, I chose the name "addressInfo".
Added comment for addressInfo explaining what it is and also explaining how it is synchronized. Let me know if it's too verbose or if you like it.
// The call result counter object
callCounter *callCounter

// The latest ejection timestamp, or null if the address is currently not
Replace with
// The latest ejection timestamp, or zero if the address is currently not ejected.
And get rid of the // We represent the branching logic on the null with a time.Zero() value.
Please terminate comment sentences with periods. go/go-style/decisions#comment-sentences
Ah, good merging. Switched. Added periods to every comment about each field in this struct.
type subConnWrapper struct {
    balancer.SubConn

// "The subchannel wrappers created by the outlier_detection LB policy will
Please consider not quoting the gRFC verbatim. The gRFC text makes sense in its context. But here (and everywhere else in this PR), please make comments self-explanatory and self-contained. Thanks.
Switched as many as I could over. I left the one for swap() though, since that defines an implementation detail that I didn't follow with a subsequent explanation. I also left the ones in NewSubConn() as I thought they made sense, and the actual Outlier Detection Interval algorithm itself as it's explaining the algorithm step by step.
// eject(): "The wrapper will report a state update with the TRANSIENT_FAILURE
// state, and will stop passing along updates from the underlying subchannel."
func (scw *subConnWrapper) eject() {
    scw.scUpdateCh.Put(&ejectedUpdate{
        scw: scw,
        ejected: true,
    })
}

// uneject(): "The wrapper will report a state update with the latest update
// from the underlying subchannel, and resume passing along updates from the
// underlying subchannel."
func (scw *subConnWrapper) uneject() {
    scw.scUpdateCh.Put(&ejectedUpdate{
        scw: scw,
        ejected: false,
    })
}
Do these have to be methods on the subConnWrapper type?
Can callsites of scw.eject() be replaced by:
b.scUpdateCh.Put(&ejectedUpdate{scw: scw, ejected: true})
and similarly for uneject().
They don't have to be, technically, but A50 clearly states to have methods on the subChannelWrapper type: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md#subchannel-wrapper. To keep implementations similar, I think it's best to do it this way. Plus, there are 8 call sites for these two methods, and I think scw.eject()/scw.uneject() is cleaner than putting an update on the channel at each call site. However, the main reason is that it is explicitly defined in the gRFC.
Thanks for the comments :D!
odAddrs map[string]*object
odCfg *LBConfig
Deleted od prefix. Kept outlierDetectionBalancer though, as clusterImplBalancer had prefix on Balancer type. GracefulSwitch doesn't though so let me know if you want me to switch outlierDetectionBalancer too.
c688b0d to 73528ce
}

func (b *outlierDetectionBalancer) successRateAlgorithm() {
Nit: ejectLowSuccessRateBackends? or something like that to indicate more precisely what the function is doing. Same with the other Algorithm functions.
Ummmmmmm, this is what the configuration and A50 call the algorithm. I agree that would be nicer; I'll write a comment for each xAlgorithm function.
return numAddrs
}

// meanAndStdDevOfSucceseesAtLeastRequestVolume returns the mean and std dev of
Succesees?
Whoops. Switched.
// Reject whole config if any errors, don't persist it for later
bb := balancer.Get(lbCfg.ChildPolicy.Name)
if bb == nil {
    return fmt.Errorf("balancer %q not registered", lbCfg.ChildPolicy.Name)
Include the name of this LB policy, please.
Switched to "outlier detection: child balancer...".
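A minimal sketch of the prefixed error the reply describes. The `registry` map and `validateChildPolicy` are hypothetical stand-ins for the real balancer registry lookup, and the message wording after the "outlier detection: child balancer" prefix is assumed, since the reply elides it.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for the balancer registry.
var registry = map[string]bool{"round_robin": true}

// validateChildPolicy returns an error naming the parent LB policy when the
// child policy is not registered, so the source of the failure is obvious.
func validateChildPolicy(name string) error {
	if !registry[name] {
		return fmt.Errorf("outlier detection: child balancer %q not registered", name)
	}
	return nil
}

func main() {
	fmt.Println(validateChildPolicy("round_robin"))
	fmt.Println(validateChildPolicy("unknown_policy"))
}
```

Prefixing the message with the policy name lets a reader of a wrapped error chain tell which balancer rejected the config.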
if b.child != nil {
    b.child.Close()
}
// What if this is nil? Seems fine
Remove.
Done.
// Note that 0 addresses is a valid update/state for a SubConn to be in.
// This is correctly handled by this algorithm (handled as part of a non singular
// old address/new address).
if len(scw.addresses) == 1 {
This would probably be a little clearer as:
if len(scw.addresses) == 1 && len(addrs) == 1 {
    // single address to single address
} else if len(scw.addresses) == 1 {
    // single address to multiple/no addresses
} else if len(addrs) == 1 {
    // multiple address to single address
} // else multiple/no addresses to multiple/no addresses; ignore
or:
switch {
case len(scw.addresses) == 1 && len(addrs) == 1:
case len(scw.addresses) == 1:
case len(addrs) == 1:
}
Wow, great clean suggestion. I chose option two, Easwar has made me use switches for scenarios like this and he knows go style so I'll go with option 2 thanks.
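The switch form chosen above can be sketched as a standalone function. This is only an illustration: `transition` and its string results are hypothetical, and the real code acts on each case instead of returning a label.

```go
package main

import "fmt"

// transition classifies a SubConn address update by the number of old and
// new addresses, mirroring the case order suggested in the review.
func transition(oldAddrs, newAddrs []string) string {
	switch {
	case len(oldAddrs) == 1 && len(newAddrs) == 1:
		return "single to single"
	case len(oldAddrs) == 1:
		return "single to multiple/none"
	case len(newAddrs) == 1:
		return "multiple/none to single"
	default:
		// Nothing to do for multiple/none to multiple/none.
		return "multiple/none to multiple/none"
	}
}

func main() {
	fmt.Println(transition([]string{"a"}, []string{"b"}))
	fmt.Println(transition([]string{"a"}, nil))
	fmt.Println(transition([]string{"a", "b"}, []string{"c"}))
	fmt.Println(transition(nil, []string{"a", "b"}))
}
```

Case order matters here: the combined condition has to come first, otherwise the single-to-single update would be caught by one of the broader cases.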
obj := b.appendIfPresent(addrs[0].Addr, scw)
// 3. Relay state with eject() recalculated (using the corresponding
// map entry to see if it's currently ejected).
if obj == nil { // uneject unconditionally because could have come from an ejected address
This says "uneject unconditionally" but goes on to eject?
Whoops, good catch. Switched to uneject().
// ejection_timestamp + min(base_ejection_time (type: time.Time) *
// multiplier (type: int), max(base_ejection_time (type: time.Time),
// max_ejection_time (type: time.Time))), un-eject the address.
if !obj.latestEjectionTimestamp.IsZero() && now().After(obj.latestEjectionTimestamp.Add(time.Duration(min(b.cfg.BaseEjectionTime.Nanoseconds()*obj.ejectionTimeMultiplier, max(b.cfg.BaseEjectionTime.Nanoseconds(), b.cfg.MaxEjectionTime.Nanoseconds()))))) {
Even with the use of max and min, this is still too long to follow. Please simplify into multiple statements or temporary variables if possible.
I tried my best. Refactored. Let me know if it looks off still.
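One way the long condition could be split into named steps is sketched below. This is not the refactor that landed in the PR, which is not shown in this thread. The helper names and example values are hypothetical; only the formula, ejection timestamp + min(base * multiplier, max(base, max ejection time)), comes from the quoted comment.

```go
package main

import (
	"fmt"
	"time"
)

// Example values (hypothetical); 30s and 300s match the test config quoted in this review.
const (
	exampleBase = 30 * time.Second
	exampleMax  = 300 * time.Second
)

var (
	exampleEjectedAt = time.Unix(1000, 0) // hypothetical ejection timestamp
	neverEjected     time.Time            // zero value means "not currently ejected"
)

// ejectionDuration returns min(base*multiplier, max(base, maxEjection)).
func ejectionDuration(base, maxEjection time.Duration, multiplier int64) time.Duration {
	limit := maxEjection
	if base > limit {
		limit = base
	}
	d := base * time.Duration(multiplier)
	if d > limit {
		d = limit
	}
	return d
}

// shouldUneject reports whether an ejected address has served its ejection time.
func shouldUneject(ejectedAt, now time.Time, base, maxEjection time.Duration, multiplier int64) bool {
	if ejectedAt.IsZero() {
		return false // not ejected, nothing to undo
	}
	unejectAt := ejectedAt.Add(ejectionDuration(base, maxEjection, multiplier))
	return now.After(unejectAt)
}

func main() {
	fmt.Println(ejectionDuration(exampleBase, exampleMax, 2))
	fmt.Println(ejectionDuration(exampleBase, exampleMax, 20))
	now := exampleEjectedAt.Add(61 * time.Second)
	fmt.Println(shouldUneject(exampleEjectedAt, now, exampleBase, exampleMax, 2))
}
```

Working in `time.Duration` throughout avoids the round trip through `Nanoseconds()` in the original one-liner.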
// This conditional only for testing (since the interval timer algorithm is
// called manually), will never hit in production.
Maybe the test can set this field manually first, then?
This fixed a race condition where the intervalTimer algorithm would go off at an arbitrary point in the future if Stop() was not called manually here. We call the intervalTimerAlgorithm manually in the test (much cleaner than setting the field in the test, since it'll get written to in UpdateClientConnState()), so there would still be a configured timer from UpdateClientConnState() that would leak if not stopped here.
obj, ok := b.addrs[addr]
if !ok { // Shouldn't happen
Can we pass b.addrs[addr] here (and unejectAddress) instead to guarantee this won't happen?
Oh I just realized at the call site you have access to the addrInfo (for addr, addrInfo := range b.addrs). Passed that instead.
od, tcc, _ := setup(t)
defer internal.UnregisterOutlierDetectionBalancerForTesting()

// This first config update should a child to be built and forwarded it's
Missing verb between should and a child?
Oh lol, sorry. Added "should".
BalancerConfig: &LBConfig{
    Interval: 10 * time.Second,
    BaseEjectionTime: 30 * time.Second,
    MaxEjectionTime: 300 * time.Second,
    MaxEjectionPercent: 10,
    SuccessRateEjection: &SuccessRateEjection{
        StdevFactor: 1900,
        EnforcementPercentage: 100,
        MinimumHosts: 5,
        RequestVolume: 100,
    },
    FailurePercentageEjection: &FailurePercentageEjection{
        Threshold: 85,
        EnforcementPercentage: 5,
        MinimumHosts: 5,
        RequestVolume: 50,
    },
You could probably reduce some lines in the test by making a variable for the configuration without the child policy config.
var baseCfgWithoutChildConfig = &LBConfig{
Interval: 10 * time.Second,
BaseEjectionTime: 30 * time.Second,
MaxEjectionTime: 300 * time.Second,
MaxEjectionPercent: 10,
SuccessRateEjection: &SuccessRateEjection{
StdevFactor: 1900,
EnforcementPercentage: 100,
MinimumHosts: 5,
RequestVolume: 100,
},
FailurePercentageEjection: &FailurePercentageEjection{
Threshold: 85,
EnforcementPercentage: 5,
MinimumHosts: 5,
RequestVolume: 50,
},
}
And change just the child policy config here and down below when calling UpdateClientConnState.
Ummmm sorry, I don't like this suggestion. It feels hacky and takes away readability in my opinion, since it's appending to something declared earlier in the function (think of your comment on the other file about moving var declarations nearer to where they're used). I'll switch it to a noop config though, since it doesn't need all the other fields, similar to your comment down below.
}
scw2, err = bd.ClientConn.NewSubConn([]resolver.Address{{Addr: "address2"}}, balancer.NewSubConnOptions{})
if err != nil {
    t.Fatalf("error in od.NewSubConn call: %v", err)
Just FYI: Calling t.Fatal from goroutines other than the main test goroutine only causes that particular goroutine to exit. The main test goroutine will keep running. So, it is better to call t.Error in these cases, to make it clear to the reader of the test.
Also, the err != nil check is repeated here.
Oh, interesting. I actually did not know that. Switched to error and deleted duplicate err != nil check.
This actually breaks the package documentation: "FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test."
Wait. This isn't created in a new goroutine, this is called in the main test goroutine. The calls down to the child all happen inline from the test goroutine, which then calls this. Switched regardless.
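For cases where work really does run on another goroutine, a common pattern is to send the error back over a channel and let the goroutine running the test decide whether to fail. The sketch below uses plain Go without the testing package so it runs on its own; `runInBackground` and `errNewSubConn` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errNewSubConn is a hypothetical error standing in for a failed NewSubConn call.
var errNewSubConn = errors.New("NewSubConn failed")

// runInBackground runs work on a new goroutine and returns a buffered channel
// that delivers its result, so the goroutine never blocks on send.
func runInBackground(work func() error) <-chan error {
	errCh := make(chan error, 1)
	go func() { errCh <- work() }()
	return errCh
}

func main() {
	errCh := runInBackground(func() error { return errNewSubConn })
	// The receiving side is the original goroutine. In a test, this is where
	// calling t.Fatalf would be safe, because it runs on the test goroutine.
	if err := <-errCh; err != nil {
		fmt.Println("error from background goroutine:", err)
	}
}
```

Inside the background goroutine itself, only non-fatal reporting such as t.Error fits the constraint in the testing documentation quoted above.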
Interval: 8 * time.Second,
BaseEjectionTime: 30 * time.Second,
MaxEjectionTime: 300 * time.Second,
MaxEjectionPercent: 10,
SuccessRateEjection: &SuccessRateEjection{
    StdevFactor: 1900,
    EnforcementPercentage: 100,
    MinimumHosts: 5,
    RequestVolume: 100,
},
FailurePercentageEjection: &FailurePercentageEjection{
    Threshold: 85,
    EnforcementPercentage: 5,
    MinimumHosts: 5,
    RequestVolume: 50,
},
ChildPolicy: &internalserviceconfig.BalancerConfig{
    Name: t.Name(),
    Config: emptyChildConfig{},
},
Is the complete config required for this test case? Can some fields be omitted?
Good point. Deleted everything but interval and SuccessRateEjection, since that causes it to be a non noop config. For this and the next UpdateClientConnState() below.
},
},
BalancerConfig: &LBConfig{
    Interval: 1<<63 - 1, // so the interval will never run unless called manually in test.
Use maxInt here as well. Thanks.
Done here and everywhere else.
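A small sketch of why a named constant reads better than `1<<63 - 1` here. How `maxInt` is defined in the PR is not shown in this thread, so `neverFires` below is a hypothetical equivalent built from `math.MaxInt64`.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// neverFires is the largest representable time.Duration (roughly 292 years),
// so a timer configured with it will not fire during a test run.
const neverFires = time.Duration(math.MaxInt64)

func main() {
	// Same value as the literal in the diff, but the intent is stated by the name.
	fmt.Println(neverFires == 1<<63-1)
	fmt.Println(neverFires.Hours() > 24*365*100) // longer than a century
}
```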
okAddresses := []resolver.Address{
    {Addr: addresses[0]},
    {Addr: addresses[1]},
}
Move this definition closer to where it is used.
Done.
}

// The full list of addresses.
fullAddresses := []resolver.Address{
Same here.
Done.
defer mr.Close()

sc := internal.ParseServiceConfig.(func(string) *serviceconfig.ParseResult)(test.odscJSON)

Nix newline here.
Done.
// The addresses which don't return errors.
okAddresses := []resolver.Address{
    {Addr: addresses[0]},
    {Addr: addresses[1]},
}

// The full list of addresses.
fullAddresses := []resolver.Address{
    {Addr: addresses[0]},
    {Addr: addresses[1]},
    {Addr: addresses[2]},
}
Same here.
Done.
"childPolicy": [
    {
        "round_robin": {}
    }
]
And this one.
Done here and everywhere else.
Thanks for the comments :D
//
// Returns a non-nil error if context deadline expires before RPCs start to get
// roundrobined across the given backends.
func checkRoundRobinRPCs(ctx context.Context, client testpb.TestServiceClient, addrs []resolver.Address) error {
Actually, unfortunately this can't. I had to switch the logic for this one, because I start bad backends which return errors that I know will get called. Yours expect the EmptyCall RPC to return a non nil error.
if err := backend.StartServer(); err != nil {
    t.Fatalf("Failed to start backend: %v", err)
}
t.Logf("Started good TestService backend at: %q", backend.Address)
Lol, good point. Switched.
BalancerConfig: &LBConfig{ | ||
Interval: 10 * time.Second, | ||
BaseEjectionTime: 30 * time.Second, | ||
MaxEjectionTime: 300 * time.Second, | ||
MaxEjectionPercent: 10, | ||
SuccessRateEjection: &SuccessRateEjection{ | ||
StdevFactor: 1900, | ||
EnforcementPercentage: 100, | ||
MinimumHosts: 5, | ||
RequestVolume: 100, | ||
}, | ||
FailurePercentageEjection: &FailurePercentageEjection{ | ||
Threshold: 85, | ||
EnforcementPercentage: 5, | ||
MinimumHosts: 5, | ||
RequestVolume: 50, | ||
}, |
Ummmm sorry, I don't like this suggestion. It feels hacky and takes away readability in my opinion, since it's appending to something declared earlier in the function (think of your comment on the other file about moving var declarations nearer to where they're used). I'll switch it to a noop config though, since it doesn't need all the other fields, similar to your comment down below.
Interval: 8 * time.Second, | ||
BaseEjectionTime: 30 * time.Second, | ||
MaxEjectionTime: 300 * time.Second, | ||
MaxEjectionPercent: 10, | ||
SuccessRateEjection: &SuccessRateEjection{ | ||
StdevFactor: 1900, | ||
EnforcementPercentage: 100, | ||
MinimumHosts: 5, | ||
RequestVolume: 100, | ||
}, | ||
FailurePercentageEjection: &FailurePercentageEjection{ | ||
Threshold: 85, | ||
EnforcementPercentage: 5, | ||
MinimumHosts: 5, | ||
RequestVolume: 50, | ||
}, | ||
ChildPolicy: &internalserviceconfig.BalancerConfig{ | ||
Name: t.Name(), | ||
Config: emptyChildConfig{}, | ||
}, |
Good point. Deleted everything but interval and SuccessRateEjection, since that causes it to be a non noop config. For this and the next UpdateClientConnState() below.
} | ||
scw2, err = bd.ClientConn.NewSubConn([]resolver.Address{{Addr: "address2"}}, balancer.NewSubConnOptions{}) | ||
if err != nil { | ||
t.Fatalf("error in od.NewSubConn call: %v", err) |
Oh, interesting. I actually did not know that. Switched to error and deleted duplicate err != nil check.
This actually breaks the package documentation: "FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test."
Wait. This isn't created in a new goroutine, this is called in the main test goroutine. The calls down to the child all happen inline from the test goroutine, which then calls this. Switched regardless.
pr.Done(di) | ||
} | ||
} | ||
// Shouldn't happen, defensive programming. |
That is illegal (though it should be handled gracefully here, of course). SubConn is required to be valid or the channel will re-pick after a new picker is provided. Can you make the tests not set it to nil?
pickerUpdateCh: buffer.NewUnbounded(), | ||
} | ||
b.logger = prefixLogger(b) | ||
b.logger.Infof("Created") |
Should this show the config too?
Umm I wouldn't mind that, but I just triaged the rest of the xDS balancer tree (which I based this on in the first place), and every balancer logs this word exactly. Thus, for the sake of consistency, I would prefer to keep it as such. Of course, if you feel strongly about this I'll go ahead and switch it over.
return nil | ||
b := &outlierDetectionBalancer{ | ||
cc: cc, | ||
bOpts: bOpts, |
This field seems to be unused?
Good point. Graceful Switch gets constructed with this so we are good here. Removed this field.
err := b.child.SwitchTo(bb) | ||
if err != nil { | ||
b.childMu.Unlock() | ||
return err |
You should wrap this error so there is more context whenever it's printed.
Switched to (pun unintended): return fmt.Errorf("outlier detection: error switching to child of type %q: %v", lbCfg.ChildPolicy.Name, err)
Thanks for the pass bossman!
// childMu protects the closing of the child and also guarantees updates to | ||
// the child are sent synchronously (to uphold the balancer.Balancer API | ||
// guarantee of synchronous calls). | ||
// | ||
// For example, run() could read that the child is not nil while processing | ||
// SubConn updates, and then Close() could write to the the child, clearing | ||
// the child, making it nil, then you try and update a cleared and already | ||
// closed child, which breaks the balancer.Balancer API. |
How about:
// childMu guards calls into child and also synchronizes reads of child in run() and the write in Close().
If you make sure b.run() exits before calling b.child.Close(), though, you don't need that last part of the sentence.
Yup, waited for b.run() to exit. Changed docstring to just the first part of that sentence.
Also removed the clearing of the child as an invariant of close, and the reading of the child (determining whether it's been cleared or not).
🎉
This PR adds the Outlier Detection Balancer as part of the xDS configured tree of balancers (see https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md).
Contains #5371.
RELEASE NOTES: