
Staging - [Alerting] Autoscale: Minutes to scale-up from zero machine alert #9217

Closed
dotnet-eng-status-staging bot opened this issue Apr 28, 2022 · 7 comments
Assignees: premun
Labels: Grafana Alert (Issues opened by Grafana), Inactive Alert (Issues from Grafana alerts that are now "OK"), Ops - First Responder, Staging (Tied to the Staging environment, as opposed to Production)

Comments

@dotnet-eng-status-staging

💔 Metric state changed to alerting

Scale-up issue: a queue has been waiting for a machine to scale up for more than 45 minutes and there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

  • WaitTime {Queue=debian.9.amd64} 52
  • WaitTime {Queue=debian.9.amd64.svc} 51
  • WaitTime {Queue=sles.12.amd64} 51
  • WaitTime {Queue=sles.12.amd64.open.svc} 51
  • WaitTime {Queue=ubuntu.1604.amd64} 52
  • WaitTime {Queue=ubuntu.1604.amd64.svc} 52
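
The values above are presumably minutes waited per queue. The alert condition boils down to: a queue whose wait time exceeds 45 minutes while its machine count is zero. A minimal sketch of that logic (the function, data shapes, and threshold handling are illustrative assumptions, not the actual Grafana rule):

```python
# Illustrative sketch of the alerting condition described above; the data
# shapes and function name are assumptions, not the actual Grafana rule.
SCALE_UP_THRESHOLD_MINUTES = 45

def queues_in_alert(wait_times: dict[str, float], machine_counts: dict[str, int]) -> list[str]:
    """Return queues whose wait time exceeds the threshold while they have no machines."""
    return [
        queue
        for queue, wait_minutes in wait_times.items()
        if wait_minutes > SCALE_UP_THRESHOLD_MINUTES and machine_counts.get(queue, 0) == 0
    ]

# The values reported by this alert (interpreted as minutes):
wait_times = {
    "debian.9.amd64": 52,
    "debian.9.amd64.svc": 51,
    "sles.12.amd64": 51,
    "sles.12.amd64.open.svc": 51,
    "ubuntu.1604.amd64": 52,
    "ubuntu.1604.amd64.svc": 52,
}
machine_counts = {queue: 0 for queue in wait_times}
print(queues_in_alert(wait_times, machine_counts))  # all six queues are in alert
```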

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-54aa0d7e647e46ff9e880bf6ae532b99

dotnet-eng-status-staging bot added the Active Alert, Ops - First Responder, Grafana Alert, and Staging labels on Apr 28, 2022
@MattGal (Member) commented Apr 28, 2022

@premun, based on the queues in question, this makes me concerned about your startup changes. Mind taking a peek?

premun self-assigned this on Apr 29, 2022
@premun (Member) commented Apr 29, 2022

Seems like the machines are stuck in the "Initializing" state. I'll try to connect to them and see what's up.

@premun (Member) commented Apr 29, 2022

I connected to the machines and it seems like they are stuck in a boot loop. However, I wasn't able to find anything in the logs that would hint at why the reboot was requested:

[screenshot: log excerpt from one of the stuck machines]

I removed the old logs and ran the agent; these are the full logs from boot to reboot (apart from installer.log).

@premun (Member) commented Apr 29, 2022

There are two suspects: my change, obviously, and "Forcibly take ownership of the log folder every work item".
Looking at the builds, they started failing after [Merged PR 22581: Forcibly take ownership of the log folder every work item].

It wasn't failing before that PR, and after it the failure rate is 100%:
[screenshot: build history showing the failures starting after PR 22581]

(The purple build is my change, but it fails on some Windows queue, while all the other Linux queues are rebuilt there.)
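
For context on the suspect change: "forcibly take ownership of the log folder every work item" presumably amounts to a recursive ownership reset of the agent's log directory before each work item runs. A minimal sketch of that kind of step, assuming a Linux queue; the path, account name, and helper below are hypothetical, since the actual PR 22581 diff is not shown here:

```python
# Hypothetical sketch of a "take ownership of the log folder" step run before
# each work item; the path and user are assumptions, not the actual PR 22581 code.
import os
import pwd
import grp

LOG_DIR = "/home/helixbot/work/logs"  # hypothetical log folder location
AGENT_USER = "helixbot"               # hypothetical agent account

def take_ownership(path: str, user: str) -> None:
    """Recursively chown `path` and its contents to `user`."""
    uid = pwd.getpwnam(user).pw_uid
    gid = grp.getgrnam(user).gr_gid
    os.chown(path, uid, gid)
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            os.chown(os.path.join(root, name), uid, gid)

# Run once per work item; if a step like this fails or hangs during machine
# startup, it could plausibly leave the machine stuck before it ever comes online.
take_ownership(LOG_DIR, AGENT_USER)
```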

@premun (Member) commented Apr 29, 2022

I opened https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/22601 so that it runs in the meantime while I investigate some more.

@premun (Member) commented Apr 29, 2022

The build is green, so it is something about the ownership change that makes these machines go into a boot loop. I merged a revert.

dotnet-eng-status-staging bot added the Inactive Alert label and removed the Active Alert label on Apr 29, 2022
@dotnet-eng-status-staging (Author)

💚 Metric state changed to ok

Scale-up issue: a queue has been waiting for a machine to scale up for more than 45 minutes and there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Go to rule
