Initial discussion around CI failure handling #1
BTW on why I chose to include the full URL in the
@joyeecheung @nodejs/tsc Should this/could this be a strategic initiative? Also, what is the current state? Flakes seem to be tracked as issues in nodejs/node, and I'd never seen this repo until moments ago! FYI, I've contacted some folks who have offered to look at Windows platform issues in the past, and got promises that some of them will look at the Windows flakiness. https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-master/ is discouraging, and helps explain why I've had so much trouble getting trivial PRs green in CI recently. I've started adding issues and PRing flaky status to tests. We "should" fix them, but while they are not fixed, it's not helpful to have unrelated test failures on PR builds.
Currently @Trott and I get emails when "Daily Master" fails. I wanted to set up a mailing list or an RSS/ATOM feed for that so anyone can easily track its status.
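(As a rough illustration of what such a feed could be built from, here is a minimal sketch that polls the public Jenkins JSON API for the last completed node-daily-master build. It assumes only the standard Jenkins REST API endpoints and fields; nothing in it is existing tooling from this repo.)

```js
// Minimal sketch: report the latest node-daily-master result via the
// standard Jenkins JSON API. Assumption: the endpoint and fields below
// (lastCompletedBuild, number, result, timestamp) are stock Jenkins.
'use strict';
const https = require('https');

const API =
  'https://ci.nodejs.org/job/node-daily-master/lastCompletedBuild/api/json' +
  '?tree=number,result,timestamp';

https.get(API, (res) => {
  let body = '';
  res.setEncoding('utf8');
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const { number, result, timestamp } = JSON.parse(body);
    const when = new Date(timestamp).toISOString();
    console.log(`node-daily-master #${number}: ${result} (${when})`);
    // A cron job could diff this against the previous run and push the
    // change to a mailing list or generate an RSS/ATOM entry.
  });
}).on('error', (err) => {
  console.error('Failed to query Jenkins:', err.message);
});
```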
IMHO that's currently our best way to structure the known-flaky DB as test configuration. But without a champion those can turn from a temporary measure into permanent "solutions", as we do have some tests that have been flaky for a long time, e.g. nodejs/node#23207, nodejs/node#26401, nodejs/node#20750.
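(For readers who haven't seen it, the "known-flaky DB as test configuration" refers to the *.status files under test/ in nodejs/node, which the test runner consults to tolerate known flakes. A hedged, purely illustrative excerpt in that style follows; the test name, platform, and issue number are hypothetical, not taken from this thread.)

```
# Illustrative excerpt in the style of test/parallel/parallel.status
# (hypothetical test name, platform, and tracking issue):
[$system==win32]
# Tracking issue: https://github.com/nodejs/node/issues/NNNNN
test-some-flaky-case: PASS,FLAKY
```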
If it's a strategic initiative, the current count of ignored tests will get reported at every TSC meeting. No guarantee they will get fixed, of course, but at least they will be visible.
I'd like to find a place to point contributors at the flakes issues as good first contributions, but didn't find one. I was hoping to find somewhere in our docs that pointed to them. Perhaps I should simply label all the CI flakes; I think they are good first contributions because they often require no knowledge of Node internals. If the tests are written entirely in terms of the public Node.js API, then someone with sufficient knowledge of the API alone could take a crack at any of them. For example, some expert user of workers might not have the C++ ability to contribute to the worker implementation, but be perfectly capable of reasoning carefully about the worker tests and making them more robust. Fixing a flake also adds appreciable value to the project, so they make great drive-by contributions for those so inclined.
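(To make the "no internals knowledge required" point concrete, many such fixes are pure-JavaScript changes to the test itself, for example replacing a hard-coded timer with an event-based wait. The sketch below is hypothetical and not drawn from any real flake in the tree.)

```js
// Hypothetical sketch: a worker test made robust by waiting for the
// 'exit' event instead of asserting after a fixed delay.
const assert = require('assert');
const { Worker } = require('worker_threads');

const worker = new Worker('process.exit(0)', { eval: true });

// A flaky pattern would be something like:
//   setTimeout(() => { /* assert the worker is done */ }, 100);
// which races against CI machine load.

// Event-based version: no timing assumptions at all.
worker.on('exit', (code) => {
  assert.strictEqual(code, 0);
});
```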
I'd use
Addressing CI failures requires organized and sustained team work. Individual efforts are not sustainable, and people get worn out easily. I agree with @Trott, these are definitely not good-first-issues. Often, the problem determination of these failures takes two phases:
The second part is really hard.
I am +1 on running this through the TSC.
The idea proposed by the OP has never been used in practice. Sometimes I use ... I think it's worth being a strategic initiative, but we'll need a champion (or more) who has enough time to keep an eye on this.
Arguably, this could be a part of the existing strategic initiative for Build WG resources.
I'm definitely +1 on making this a strategic initiative if we can find a champion, or if @Trott is volunteering by integrating it into the existing Build strategic initiative. I agree that raising awareness of the flakes by reviewing the numbers in the TSC meeting would be a great first step.
I'm happy to highlight a particularly problematic test during the announcements section of each meeting when I'm in attendance. Maybe sam-github or someone else can do it when I'm not around. This week, it would have to be the way sequential/test-cpu-prof is now timing out at or near 100% of the time on ubuntu1604_sharedlibs_debug_x64, making it nearly impossible to land anything that isn't a doc change. This started about 48 hours ago, so only stuff that had CI runs before then tends to be landing right now. :-(
I'll think about what I have time to do (vs. what I wish I could do with more time). I looked at the Build strategic initiative, and I'm not sure it is a fit for this. It is related, in that build infrastructure failures can manifest as CI failures: problems with host memory, disk, Jenkins disconnections, tap2xml failures, etc. Those are within the scope of Build resources. However, the flaky tests in particular seem to be solidly non-build; they need developer effort. The Build team doesn't seem to me like it should take on the extra scope of fixing unreliable github.com/nodejs/node unit tests; it's strained already.
Refs: nodejs/node#27611. Also, Joyee has a proposed fix as of 7 hours ago.
@Trott while highlighting an urgent problem will have some benefit, I think reporting on the trend in CI stability would possibly be more useful, i.e. are we slowly degrading or getting better, and how quickly? That is something I think the TSC should be on top of and work to try to influence, as opposed to simply informing the TSC that some issue needs more help right now and hoping the TSC members have time to help out on that specific issue. Just my 2 cents though.
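(For what it's worth, a trend like that could be derived from the same standard Jenkins JSON API as the earlier sketch, by sampling recent node-daily-master builds and computing a success rate. Again, this is a hedged sketch, not existing tooling; only the stock Jenkins `tree` query syntax is assumed.)

```js
// Rough sketch: success rate over the last ~30 node-daily-master builds,
// using the standard Jenkins JSON API with a tree/range query.
'use strict';
const https = require('https');

const API = 'https://ci.nodejs.org/job/node-daily-master/api/json' +
            '?tree=builds[number,result]{0,30}';

https.get(API, (res) => {
  let body = '';
  res.setEncoding('utf8');
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const builds = JSON.parse(body).builds
      .filter((b) => b.result !== null); // skip in-progress builds
    const ok = builds.filter((b) => b.result === 'SUCCESS').length;
    console.log(`${ok}/${builds.length} recent daily-master runs succeeded`);
    // Recording this number weekly would give the TSC a trend line
    // rather than a point-in-time snapshot.
  });
}).on('error', (err) => console.error(err.message));
```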
Specifically:
After we settle on those, I can start implementing the actions in ncu-ci to automate the CI status reporting and flake tracking process.