Prevent the PassthroughCluster for clients/workloads in the service mesh #3711

Merged

israel-hdez
Contributor

What this PR does / why we need it

Background

The KServe Ingress VirtualServices are created with configurations targeting only the Gateways. Although this works, omitting the Istio sidecars has the following downsides for clients/workloads that belong to the Istio mesh (i.e. have an Istio sidecar):

  • Requests to InferenceServices are treated as going to external services (i.e. not part of the mesh), because the sidecars are unaware of the routing rules.
  • As a consequence, the requests are handled like those from any external (non-mesh) workload: the ingress gateway first receives the request and then forwards it to itself, rewriting the URL to the relevant -predictor, -explainer or -transformer hostname. For mesh workloads, this forwarding can be avoided and the rewrite can be performed by the sidecars, given the right VirtualService configuration.

This can be verified in the metrics that Istio emits. For example, the istio_requests_total metric would be emitted like this (some labels omitted for brevity):

istio_requests_total{
 destination_canonical_revision="latest",
 destination_canonical_service="unknown",
 destination_cluster="unknown",
 destination_principal="unknown",
 destination_service="sklearn-v2-iris.kserve-test.svc.cluster.local",
 destination_service_name="PassthroughCluster",
 destination_service_namespace="unknown",
 destination_workload="unknown",
 destination_workload_namespace="unknown",
 [...]}

Specifically, destination_service_name="PassthroughCluster" reveals that the requests are being handled as leaving the service mesh, even though the client workload has a sidecar. Such requests would be blocked if Istio's outbound traffic policy were set to REGISTRY_ONLY.
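
As a rough illustration of the check described above, the following Go sketch inspects a set of istio_requests_total labels and reports whether the request was handled through the PassthroughCluster. The function name and the label maps are hypothetical; the label values are copied from the metric samples in this description.

```go
package main

import "fmt"

// passthroughClusterName is the destination_service_name value seen in the
// metric sample above when a sidecar has no mesh route for the request.
const passthroughClusterName = "PassthroughCluster"

// isPassthrough (hypothetical helper) reports whether the given metric labels
// describe a request that the sidecar treated as leaving the mesh.
func isPassthrough(labels map[string]string) bool {
	return labels["destination_service_name"] == passthroughClusterName
}

func main() {
	// Labels as emitted before the fix (subset).
	before := map[string]string{
		"destination_service":      "sklearn-v2-iris.kserve-test.svc.cluster.local",
		"destination_service_name": "PassthroughCluster",
		"destination_workload":     "unknown",
	}
	// Labels as emitted with the fixed configuration (subset).
	after := map[string]string{
		"destination_service":      "knative-local-gateway.istio-system.svc.cluster.local",
		"destination_service_name": "knative-local-gateway",
		"destination_workload":     "istio-ingressgateway",
	}
	fmt.Println("before fix, passthrough:", isPassthrough(before))
	fmt.Println("after fix, passthrough:", isPassthrough(after))
}
```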

When using the Kiali project, its graph reveals that observability is potentially lost. For example, the following image shows that the curl-inside-mesh workload has a sidecar and that all its requests are going to the PassthroughCluster, even though everything is in the mesh.

image

Fix

This PR adds the missing configuration to the KServe-created VirtualService, so that Istio sidecars are aware of the KServe services/hostnames and workloads with an Istio sidecar correctly handle the requests as mesh-internal traffic. As an added benefit, the URL rewrite is done in the sidecar rather than deferred to the Gateway, which saves one request forwarding (and potentially gives slightly better performance).
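
The idea can be sketched in Go as follows. This is not the code from this PR, only a minimal illustration under assumptions: the helper name withMeshGateway and the gateway names in main are hypothetical, and plain string slices stand in for the VirtualService gateways field.

```go
package main

import "fmt"

// meshGateway is the reserved VirtualService gateway name that targets sidecars.
const meshGateway = "mesh"

// withMeshGateway (hypothetical helper) returns a copy of gateways that also
// contains "mesh". The input slice is left untouched and duplicates are avoided.
func withMeshGateway(gateways []string) []string {
	out := make([]string, 0, len(gateways)+1)
	seen := false
	for _, g := range gateways {
		if g == meshGateway {
			seen = true
		}
		out = append(out, g)
	}
	if !seen {
		out = append(out, meshGateway)
	}
	return out
}

func main() {
	// Hypothetical gateway names, for illustration only.
	gateways := []string{
		"knative-serving/knative-ingress-gateway",
		"knative-serving/knative-local-gateway",
	}
	fmt.Println(withMeshGateway(gateways))
}
```

Returning a copy keeps the caller's slice unchanged, so the same gateway list can still be used for routes that should stay gateway-only.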

With the fixed configs, the following is an example of a metric that will be emitted by Istio:

istio_requests_total{
 destination_app="unknown",
 destination_canonical_revision="data-science-smcp",
 destination_canonical_service="istio-ingressgateway",
 destination_cluster="Kubernetes",
 destination_principal="unknown",
 destination_service="knative-local-gateway.istio-system.svc.cluster.local",
 destination_service_name="knative-local-gateway",
 destination_service_namespace="istio-system",
 destination_version="unknown",
 destination_workload="istio-ingressgateway",
 destination_workload_namespace="istio-system",
 [...]}

This metric no longer involves the PassthroughCluster, which means Istio is handling the requests as mesh-internal. Furthermore, if mTLS is enabled in Istio, the metrics will also have the principal labels populated (requests are authenticated); e.g.:

istio_requests_total{
 [...]
 destination_principal="spiffe://cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account",
 source_principal="spiffe://cluster.local/ns/kserve-workloads/sa/default"}

...and this gives the advantage of being able to use Istio security features, if needed (like AuthorizationPolicies). Better observability may also be possible. For example, the Kiali graph would now look like this:

image
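
The principal labels shown earlier follow the spiffe://<trust-domain>/ns/<namespace>/sa/<service-account> layout. As a small sketch of how such a value could be split into its parts (for example, when deciding what to put in an AuthorizationPolicy), the Go snippet below parses it. The type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// principal holds the parts of an Istio workload identity.
type principal struct {
	TrustDomain    string
	Namespace      string
	ServiceAccount string
}

// parsePrincipal (hypothetical helper) splits a value such as
// spiffe://cluster.local/ns/kserve-workloads/sa/default into its parts.
// It returns an error for values like "unknown" that are not SPIFFE IDs.
func parsePrincipal(s string) (principal, error) {
	const scheme = "spiffe://"
	if !strings.HasPrefix(s, scheme) {
		return principal{}, fmt.Errorf("not a SPIFFE ID: %q", s)
	}
	parts := strings.Split(strings.TrimPrefix(s, scheme), "/")
	if len(parts) != 5 || parts[1] != "ns" || parts[3] != "sa" {
		return principal{}, fmt.Errorf("unexpected SPIFFE path: %q", s)
	}
	return principal{
		TrustDomain:    parts[0],
		Namespace:      parts[2],
		ServiceAccount: parts[4],
	}, nil
}

func main() {
	p, err := parsePrincipal("spiffe://cluster.local/ns/kserve-workloads/sa/default")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("namespace=%s serviceAccount=%s\n", p.Namespace, p.ServiceAccount)
}
```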

Type of changes

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

  • Relevant unit tests were adapted.
  • Manual validation with Istio metrics was performed, as mentioned previously.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • [N/A] Have you made corresponding changes to the documentation?

The KServe Ingress VirtualServices are created with configurations targeting only the Gateways. Although this works, the omission of the Istio sidecars has the following downsides for workloads that belong to the Istio mesh:

* Requests to InferenceServices will be treated as going to external services (i.e. not part of the mesh), because the sidecars are unaware of the routing rules.
* As a consequence, the requests will be handled as with any external (non-mesh) workload: the ingress gateway will first receive the request and will forward it to itself, doing the URL rewrite to the relevant -predictor, -explainer or -transformer hostname. Such forwarding can be avoided (for mesh-workloads) and the rewrite can be performed by the sidecars with the right VirtualService configuration.

 This adds the missing configurations in the KServe-created VirtualService, so that Istio sidecars are aware of the KServe services/hostnames and do the rewrite in the sidecar, rather than deferring the rewrite to the Gateway.

 For workloads that belong to the mesh, slightly better performance may be seen (given one request forwarding is saved) and better observability from Istio may also be possible.

Signed-off-by: Edgar Hernández <[email protected]>
@@ -231,6 +231,7 @@ const (

var (
LocalGatewayHost = "knative-local-gateway.istio-system.svc." + network.GetClusterDomainName()
IstioMeshGateway = "mesh"

is it safe to hardcode to mesh?
I mean, it is isolated per namespace, but the MeshConfig can have more than one namespace. I don't think that is the case for KServe, just double checking :)

Contributor Author


It should be safe.

The mesh gateway is a keyword in the VirtualService. It means that the configuration should be applied to the sidecars.


One little nit - it is applied to ALL sidecars in the mesh.
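
To make the semantics discussed in this thread concrete, here is a minimal Go sketch of how the gateways list of a VirtualService determines whether sidecars pick up its rules: when the list is omitted the default is "mesh", and when gateways are listed, "mesh" must be named explicitly. The function name and the gateway names in main are hypothetical.

```go
package main

import "fmt"

// appliesToSidecars (hypothetical helper) reports whether a VirtualService
// with the given gateways list is applied to the sidecars in the mesh.
// An empty list means the default gateway ("mesh") is used.
func appliesToSidecars(gateways []string) bool {
	if len(gateways) == 0 {
		return true
	}
	for _, g := range gateways {
		if g == "mesh" {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical gateway name, for illustration only.
	gatewayOnly := []string{"knative-serving/knative-local-gateway"}
	withMesh := append(append([]string{}, gatewayOnly...), "mesh")

	fmt.Println("gateway only:", appliesToSidecars(gatewayOnly))
	fmt.Println("gateway and mesh:", appliesToSidecars(withMesh))
}
```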

@bartoszmajsak

One thing I wonder is whether "mesh" was deliberately skipped in the original architecture or whether it was overlooked. I can't answer this question, but I wonder if this has any impact on the traffic flow and alters some rules, or whether it is working as expected.

@yuzisun
Member

yuzisun commented May 25, 2024

One thing I wonder is whether "mesh" was deliberately skipped in the original architecture or whether it was overlooked. I can't answer this question, but I wonder if this has any impact on the traffic flow and alters some rules, or whether it is working as expected.

It was probably an oversight. I think this should work fine, as it adds the mesh gateway to resolve the routing in the sidecar instead of going to the passthrough.

@Jooho
Contributor

Jooho commented Jun 6, 2024

@yuzisun looks like this pr is ready to merge. could you please approve it?

Contributor

@spolti spolti left a comment


/lgtm

@israel-hdez
Contributor Author

@yuzisun OK, I have one approval.
Let me know if you need something else before merging this one.

@yuzisun
Member

yuzisun commented Jun 9, 2024

/approve


oss-prow-bot bot commented Jun 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez, spolti, yuzisun


@oss-prow-bot oss-prow-bot bot added the approved label Jun 9, 2024
@yuzisun yuzisun merged commit 212a77c into kserve:master Jun 9, 2024
57 of 58 checks passed
@israel-hdez israel-hdez deleted the virtual-service-fix-mesh-workloads branch June 10, 2024 17:19
asdqwe123zxc pushed a commit to asdqwe123zxc/kserve that referenced this pull request Jun 11, 2024
…esh (kserve#3711)

Prevent the PassthroughCluster for clients in the service mesh

Signed-off-by: Edgar Hernández <[email protected]>
Signed-off-by: asdqwe123zxc <[email protected]>