fix(bug): Ensure windows agent stability using hubble/legacy helm values #1128

Open: wants to merge 6 commits into base: main

Contributor

@BeegiiK BeegiiK commented Dec 11, 2024

Description

This PR fixes stability problems in the Retina Windows agent. Four causes were identified, and each one is resolved by its own commit.

  1. Invalid rendering of the namespace helm value (1st commit)
matmerr@matmerr-cloud-dev: ~/go/src/github.com/Azure/telescope
[06:56:29 PM][matmerr-aks-pktmon-11][matmerr/enable-ama]$ k logs -f retina-agent-win-7f7kb
Starting Retina Agent
starting Retina daemon with legacy control plane v0.0.17
2024/12/02 18:56:22 metricsInterval is deprecated, please use metricsIntervalDuration instead
init client-go
KUBECONFIG set, using kubeconfig:  C:\hpc\kubeconfig
Error: starting daemon: creating controller-runtime manager: error loading config file "C:\hpc\kubeconfig": yaml: invalid map key: map[interface {}]interface {}{".Values.namespace":interface {}(nil)}
  2. The operator is enabled by default, which causes RBAC errors for the Windows agents (2nd commit)
ts=2024-12-10T16:58:48.634Z level=info caller=hnsstats/hnsstats_windows.go:212 msg="Start hnsstats plugin..."
W1210 16:58:49.990792    7108 reflector.go:547] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:232: failed to list *v1alpha1.MetricsConfiguration: metricsconfigurations.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "metricsconfigurations" in API group "retina.sh" at the cluster scope
  3. With telemetry enabled, the agent panics if Application Insights is not defined. Users can change the config map as desired, but the default values should not cause the agent to crash (3rd commit)

  4. The kubeconfig file cannot be found with the legacy chart values. The container setup has to execute setkubeconfigpath.ps1 first (4th commit)

beegii@bignamboi:~/src/retina$ k logs retina-agent-win-4tl7m -n kube-system
Starting Retina Agent
starting Retina daemon with legacy control plane v0.0.17
2024/12/11 18:40:15 metricsInterval is deprecated, please use metricsIntervalDuration instead
init client-go
KUBECONFIG set, using kubeconfig:  C:\hpc\kubeconfig
Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.
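
The error in the first log is consistent with a Helm expression reaching the kubeconfig unrendered: a YAML parser reads the double braces as nested flow mappings, so the inner mapping ends up as a map key. A minimal sketch of that situation, assuming a `namespace` field for illustration (the actual field in the kubeconfig template may differ):

```yaml
# Sketch only: the `namespace` key and the example value are assumptions.
# Unrendered template text left in the file. The outer braces open a flow
# mapping whose key is the inner mapping {.Values.namespace: null}, which
# the parser rejects with "invalid map key".
namespace: {{ .Values.namespace }}
---
# What the parser needs to see once the value has been rendered:
namespace: kube-system
```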

Related Issue

#1122

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

An image was built for each commit and tested on the cluster to confirm that each fix works!

image

Additional Notes

The first three problems occurred when deploying Retina using the hubble path; the last one occurred when deploying Retina using the legacy path.


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

@BeegiiK BeegiiK requested a review from a team as a code owner December 11, 2024 18:12
@BeegiiK BeegiiK requested review from jimassa and matmerr December 11, 2024 18:12
@BeegiiK BeegiiK changed the title Ensure windows agent stability using hubble helm values Ensure windows agent stability using hubble/legacy helm values Dec 11, 2024
@BeegiiK BeegiiK requested review from vakalapa and nddq and removed request for jimassa December 11, 2024 19:08
@BeegiiK BeegiiK changed the title Ensure windows agent stability using hubble/legacy helm values fix(bug): Ensure windows agent stability using hubble/legacy helm values Dec 11, 2024
@@ -130,8 +130,8 @@ data:
 enabledPlugin: {{ .Values.enabledPlugin_win }}
 metricsInterval: {{ .Values.metricsInterval }}
 metricsIntervalDuration: {{ .Values.metricsIntervalDuration }}
-enableTelemetry: {{ .Values.enableTelemetry }}
-enablePodLevel: {{ .Values.enablePodLevel }}
+enableTelemetry: false
Contributor

Put back these changes?

Contributor Author

@BeegiiK BeegiiK Dec 12, 2024

On Linux the default hubble values work, but on Windows the agent fails with them. We can either create new Helm values specifically for Windows or leave it as is. I don't mind either way.

enableTelemetry requires Application Insights, and leaving the default as false prevents the agent from crashing if it's not defined. If the consumer wants it enabled, they can simply update it.
enablePodLevel causes RBAC issues with the retina-agent service account. As it's currently not supported on Windows, I think a default of false makes sense.

In legacy, both values are currently set to false.
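
The "new Helm values for Windows" option could be sketched as follows, mirroring the naming of the existing `enabledPlugin_win` value. The keys `enableTelemetry_win` and `enablePodLevel_win` are hypothetical and do not exist in the chart today:

```yaml
# values.yaml (hypothetical Windows-only keys, named after enabledPlugin_win)
enableTelemetry_win: false
enablePodLevel_win: false
---
# Windows agent ConfigMap template excerpt reading the hypothetical keys
enableTelemetry: {{ .Values.enableTelemetry_win }}
enablePodLevel: {{ .Values.enablePodLevel_win }}
```

This would leave the Linux defaults untouched while still letting a consumer opt in on Windows by overriding the two values.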

Contributor

You can define a default like this: {{ .Values.enablePodLevel | default false }}

Contributor Author

@BeegiiK BeegiiK Dec 13, 2024

The original value exists, but it's currently set to true, which causes the Windows agent to crash but not the Linux one. It's not a matter of having a default boolean 😊

- controller.exe --config ./retina/config.yaml
- powershell.exe
- -command
- .\setkubeconfigpath.ps1; ./controller.exe --config ./retina/config.yaml --kubeconfig ./kubeconfig
Contributor

There is an issue with 1.30 and OIDC with this approach. We should remove these changes for those k8s versions. cc @rbtr

Collaborator

The custom kubeconfig is only required with containerd <1.7, and only AKS 1.27 (LTS) is still using that. Maybe we remove it completely?

Contributor Author

@BeegiiK BeegiiK Dec 12, 2024

I'm not sure what the specific ask is here. Should the custom kubeconfig only be enabled for K8s version <= 1.30?

Contributor Author

I added the following code to both agent daemonsets (hubble/legacy) and the windows.yaml manifest, and the agent still crashes:

command:
            - powershell.exe
            - -command
            {{- if semverCompare ">=1.30" .Capabilities.KubeVersion.GitVersion }}
            - $env:CONTAINER_SANDBOX_MOUNT_POINT/controller.exe --config ./retina/config.yaml
            {{- else }}
            - .\setkubeconfigpath.ps1; ./controller.exe --config ./retina/config.yaml --kubeconfig ./kubeconfig
            {{- end }}

image
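
One thing that may be worth checking, offered as an assumption rather than a diagnosis of the crash: PowerShell parses a line that begins with a variable as an expression, so starting an executable through a path built from `$env:CONTAINER_SANDBOX_MOUNT_POINT` generally needs the call operator `&`. An untested sketch of the `>=1.30` branch written that way:

```yaml
command:
  - powershell.exe
  - -command
  # Assumption: the call operator (&) runs the quoted path as a command.
  # The item is single-quoted so YAML does not treat the leading & as an anchor.
  - '& "$env:CONTAINER_SANDBOX_MOUNT_POINT/controller.exe" --config ./retina/config.yaml'
```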

windows/kubeconfigtemplate.yaml (resolved)
4 participants