
Staging - [Alerting] Autoscale: Minutes to scale-up from zero machine alert #9217

Closed
dotnet-eng-status-staging bot opened this issue Apr 28, 2022 · 7 comments
Assignees: premun
Labels: Grafana Alert (Issues opened by Grafana), Inactive Alert (Issues from Grafana alerts that are now "OK"), Ops - First Responder, Staging (Tied to the Staging environment, as opposed to Production)

Comments

@dotnet-eng-status-staging

💔 Metric state changed to alerting

Scale-up issue: a queue has been waiting for a machine to scale up for more than 45 minutes and there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

  • WaitTime {Queue=debian.9.amd64} 52
  • WaitTime {Queue=debian.9.amd64.svc} 51
  • WaitTime {Queue=sles.12.amd64} 51
  • WaitTime {Queue=sles.12.amd64.open.svc} 51
  • WaitTime {Queue=ubuntu.1604.amd64} 52
  • WaitTime {Queue=ubuntu.1604.amd64.svc} 52
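
The values above are presumably minutes waited per queue. The alert condition boils down to: a queue whose wait time exceeds 45 minutes while its machine count is zero. A minimal sketch of that logic (the function, data shapes, and threshold handling are illustrative assumptions, not the actual Grafana rule):

```python
# Illustrative sketch of the alerting condition described above; the data
# shapes and function name are assumptions, not the actual Grafana rule.
SCALE_UP_THRESHOLD_MINUTES = 45

def queues_in_alert(wait_times: dict[str, float], machine_counts: dict[str, int]) -> list[str]:
    """Return queues whose wait time exceeds the threshold while they have no machines."""
    return [
        queue
        for queue, wait_minutes in wait_times.items()
        if wait_minutes > SCALE_UP_THRESHOLD_MINUTES and machine_counts.get(queue, 0) == 0
    ]

# The values reported by this alert (interpreted as minutes):
wait_times = {
    "debian.9.amd64": 52,
    "debian.9.amd64.svc": 51,
    "sles.12.amd64": 51,
    "sles.12.amd64.open.svc": 51,
    "ubuntu.1604.amd64": 52,
    "ubuntu.1604.amd64.svc": 52,
}
machine_counts = {queue: 0 for queue in wait_times}
print(queues_in_alert(wait_times, machine_counts))  # all six queues are in alert
```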

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-54aa0d7e647e46ff9e880bf6ae532b99

dotnet-eng-status-staging bot added the Active Alert, Ops - First Responder, Grafana Alert, and Staging labels on Apr 28, 2022
@MattGal (Member) commented Apr 28, 2022

@premun, based on the queues in question, this makes me concerned about your startup changes. Mind taking a peek?

premun self-assigned this on Apr 29, 2022
@premun (Member) commented Apr 29, 2022

Seems like the machines are stuck in the "Initializing" state. I'll try to connect to them and see what's up.

@premun (Member) commented Apr 29, 2022

I connected to the machines and it seems like they are stuck in a boot loop. However, I wasn't able to find anything in the logs that would hint at why the reboot was requested:

[screenshot: log excerpt from one of the stuck machines]

I removed the old logs and ran the agent; these are the full logs from boot to reboot (apart from installer.log).

@premun (Member) commented Apr 29, 2022

There are two suspects: my change, obviously, and "Forcibly take ownership of the log folder every work item".
Looking at the builds, they started failing after [Merged PR 22581: Forcibly take ownership of the log folder every work item].

It wasn't failing before that PR, and after it the failure rate is 100%:
[screenshot: build history showing the failures starting after PR 22581]

(The purple build is my change, but it fails on some Windows queue, while all the other Linux queues are rebuilt there.)
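
For context on the suspect change: "forcibly take ownership of the log folder every work item" presumably amounts to a recursive ownership reset of the agent's log directory before each work item runs. A minimal sketch of that kind of step, assuming a Linux queue; the path, account name, and helper below are hypothetical, since the actual PR 22581 diff is not shown here:

```python
# Hypothetical sketch of a "take ownership of the log folder" step run before
# each work item; the path and user are assumptions, not the actual PR 22581 code.
import os
import pwd
import grp

LOG_DIR = "/home/helixbot/work/logs"  # hypothetical log folder location
AGENT_USER = "helixbot"               # hypothetical agent account

def take_ownership(path: str, user: str) -> None:
    """Recursively chown `path` and its contents to `user`."""
    uid = pwd.getpwnam(user).pw_uid
    gid = grp.getgrnam(user).gr_gid
    os.chown(path, uid, gid)
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            os.chown(os.path.join(root, name), uid, gid)

# Run once per work item; if a step like this fails or hangs during machine
# startup, it could plausibly leave the machine stuck before it ever comes online.
take_ownership(LOG_DIR, AGENT_USER)
```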

@premun (Member) commented Apr 29, 2022

I opened https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/22601 so that it runs in the meantime while I investigate some more.

@premun (Member) commented Apr 29, 2022

The build is green, so it is something about the ownership change that makes these machines go into a boot loop. I merged a revert.

dotnet-eng-status-staging bot added the Inactive Alert label and removed the Active Alert label on Apr 29, 2022
@dotnet-eng-status-staging (Author)

💚 Metric state changed to ok

Scale-up issue: a queue has been waiting for a machine to scale up for more than 45 minutes and there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Go to rule
