Skip to content
This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

SEP 18: Release Label Improvements #24

Merged
merged 3 commits into from
Jun 3, 2020

Conversation

waynew
Copy link
Contributor

@waynew waynew commented Jan 13, 2020

@waynew waynew requested a review from a team as a code owner January 13, 2020 20:55
@ghost ghost requested review from Akm0d and removed request for a team January 13, 2020 20:55
@waynew waynew changed the title release label SEP SEP 18: Release Label Improvements Jan 13, 2020
@OrangeDog
Copy link

OrangeDog commented Jan 14, 2020

Immediate thought: "track what should make it into a release" is solved by GitHub milestones, not labels. As are all of these:

  • "when will my bug be fixed?"
  • "what should I be working on next?"
  • "is there anything that still needs to be done before the next release?"
  • and not mentioned in the SEP, "which release fixed this issue?"

@waynew
Copy link
Contributor Author

waynew commented Jan 14, 2020

Immediate thought: "track what should make it into a release" is solved by GitHub milestones, not labels.

That's true - they can be solved by milestones. While I would personally be happy to use milestones, it feels like removing the current release priority labels in favor of milestones would just be confusing. This proposal is to make a simple, easy change that should result in a fairly significant improvement. One might say that this 20% of a change gets us 80% of the way.

  • and not mentioned in the SEP, "which release fixed this issue?"

That's definitely orthogonal to this SEP, and should already be a solved problem.

@OrangeDog
Copy link

removing the current release priority labels in favor of milestones

I do not think that should be done at all. Rather I believe that this proposal doesn't achieve the stated aims. It should be the case that priority correlates with release version, but they are separate things.

Rather than trying to use priority as a proxy to imply a release plan, simply make the release plan explicit (especially as it changes).

@sagetherage
Copy link
Contributor

@OrangeDog I apologize we have not responded sooner. Your comment sparked many varied and colorful internal discussions and those will need to continue as you bring up a good point. However, not to ignore you or your comment, we would like to put this on the agenda for broader discussion at a weekly Open Hour to gain more from you and potentially others.

Is there a date and time that works for you to attend? Those are held weekly on Thursdays at 11 AM Mountain (US).

This is an attempt to make sure you are heard as well as potentially any one else (not sure ATM) and to drive towards the goal of better communication: in our labeling of tickets; our release plans and execution; as well as, what the labels mean, how documentation should be consumed, etc. etc. etc. This list could be much longer and really I want to hear from you. Please let me know if putting this on the agenda and for you attend does not work - also valuable information. I am open to your ideas.

@OrangeDog
Copy link

@sagetherage I'm on UK time (GMT) 10:00-18:00.

I'm not sure what further input is required though. You're simply not using the GitHub features as intended. You can look at other popular OS products on here (e.g. Spring) to see how labels and milestones are supposed to be used. Or just look at the GitHub help.

@sagetherage
Copy link
Contributor

Yes, I realize that we aren't using labels like other OS products and agree with you the GitHub help is fantastic!

If joining a call isn't what you want or isn't needed, that is fine. I asked because it is one avenue. The SEP comments aren't always broadly reviewed or known so I also wanted to give a broader forum.

@OrangeDog
Copy link

As I said, my availability does not coincide with when these calls are.

The SEP comments aren't always broadly reviewed or known

Isn't that a major issue? What's the point of the SEP process if the feedback isn't looked at?

@waynew
Copy link
Contributor Author

waynew commented Feb 10, 2020

The SEP comments aren't always broadly reviewed or known

Isn't that a major issue? What's the point of the SEP process if the feedback isn't looked at?

That is the challenge that we're working to address. While the core team does review SEPs, that's ~15 people. We had 130 contributors to the Neon release alone. It's pretty clear that we have a fairly broad contributor base, and it's also clear that we're not even getting half of those to even click 👍 / 👎 on a SEP.

We're continuing to publish these in Slack, we should probably also make sure they get posted to IRC, we're discussing them in our Community Open Hours. It's probably going to take a while before we reliably get a broad audience reviewing and discussing SEPs. But we want to make sure that as many people are interested have an opportunity to contribute. It is a process, though, so it will take some time before we're at a point where we're getting enough healthy discussion.

@sagetherage
Copy link
Contributor

Currently, labels for priority and severity create a matrix determining the number of users impacted and the impact within Salt, thus creating a complex, difficult to decipher, matrix of when, where, and by whom will issues will be addressed. This SEP is an attempt to remove the confusion and give 4 combined definitions that are aligned with common and industry understanding. Blocker and Release Blocker are not easily understood.

In this SEP the labels an definitions are combined:

  • Release Blocker (combining P1 and Blocker): will be seen by all users and is the highest severity within Salt thus will block the named/labeled release until corrected

  • Critical (combining P2 and Critical): will be seen by a majority of users and may cause data loss, crashes or hangs salt processes, makes the system unresponsive, etc.

  • High (combining P3 and High Severity): will be seen by half of the users and has incorrect functionality, bad functionality, a confusing user experience, etc.

  • Low (combining P4 and Medium Severity): will not be seen by most users and/or is a very specific use case and has cosmetic items, formatting, spelling, colors, etc.

A number of combinations can be used for these 4 categories. It is easily understood to use the full word at the beginning of the label to clearly indicate the use i.e. severity: high. It is easier to simply read either: severity: critical, severity: high, severity: medium, severity: low; or priority:1, priority:2, priority:3, priority:4. I also suggest this with other labels i.e. release:Sodium or team:boto.

This explicitly indicates order and how issues will be worked, prioritized, and determined for release cycles; however, labels do not give a good sense of releases and using Milestones will give a bucket or container.

Instead of the sole indicator for a release for any issue or PR, being the label and the use of a Project Board, I propose moving to the use of Milestones. Although, I would continue the use of the release label, and introduce more Milestones: Consider, Plan, and Commit indicating what phase any issue or PR is in for any known release cycle (many issues in the Consider milestone could have release:Sodium release:Aluminium release:Silicon. These milestone are defined as:

  • Consider: indicates the issue is being considered as part of a release, indicated by the release < code name > label, but the level of effort, priority in general or within the release, and resource assigned may not yet be known or determined. Those are determined in the next phase, Plan.

  • Plan: indicates it is in the planning phase to determine the level of effort, priority in general or within (any) release cycle and what release cycle (for commitment), and resources assigned. This phase will be the largest and the longest.

  • Commit: this is the most deterministic, and indicates the level of effort, priority, release commitment and resource assignment are committed to and the entire team agrees. Many issues may be in the planning phase until the Feature Code Freeze date prior to the project release date. If a particular issue is considered lower priority and assigned resource will attempt to work on the issue to get into a release but cannot commit, it will reside in the Plan milestone until commitment can be fulfilled or committed to for any release cycle.

@sagetherage
Copy link
Contributor

sagetherage commented May 13, 2020

The milestones in my comment above would be determined throughout the release cycle process including triage, weekly planning, as well as, release planning, execution, and delivery. These will be the responsibility of the release manager and the release team, and actionable by the entire Open Core Team, who will thus inform the larger community and other stakeholders by the use of labels with a combination of milestones. I would propose to move to the following labels with these adjusted definitions:

  • Severity: Critical affects all users within the functional area and is the highest severity within Salt will block the release until corrected or determined a lesser severity

  • Severity: High affects a majority of users within the functional area and may cause data loss, crashes or hangs a salt processes, and/or makes the system unresponsive

  • Severity: Medium affects about half of users within the functional area, incorrect functionality, bad functionality, a confusing user experience, or a confusing error message

  • Severity: Low affects few users within the functional area, a specific use case/corner case, or cosmetic fixes, such as formatting, spelling, colors

Severity can also include the amount of time of existence within the code base and adjusted by the release manager and release team to ensure the fix is addressed, especially if a work-around does not exist. Time will not directly equal a higher severity or sooner release cycle fix necessarily, but can be adjusted to determine a combination of severity and priority to determine appropriate resources with appropriate capacity to determine a better flow of addresses all level of issues over many release cycles as well as individually.

This comment is to reference the release process adjustments and look towards improvements and perhaps more Salt Enhancement Proposals with clear, defined process as well as a path towards iteration and improvement.

@sagetherage
Copy link
Contributor

This also references the triage process and the need for iteration and improvement that will also lead more Salt Enhancement Proposals to define and iterate.

@waynew
Copy link
Contributor Author

waynew commented May 13, 2020

@sagetherage your proposal is to just have those 4 labels, correct? If so, I'm totally on board with that - they're simple and easy to grok, and there are only 4 of them, which also aligns fairly well with the Eisenhower matrix.

Granted, it's not exactly dumping things into a matrix... but -- having the 4 buckets does make it pretty clear which things the core team will focus on, and which things are less of a priority.

@sagetherage
Copy link
Contributor

I am a huge fan of the Eisenhower matrix!

@waynew
Copy link
Contributor Author

waynew commented May 14, 2020

@sagetherage thoughts?

Copy link
Contributor

@Akm0d Akm0d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a compelling argument that we would have to retrain ourselves with or without these changes and that it will be easier to learn on a more simple labeling system.

@waynew
Copy link
Contributor Author

waynew commented May 26, 2020

We had some other internal discussion about workflow changes, but I think this one is ready for Final Comment Period now.

(Expected decision date: 2020-06-02)

@waynew waynew added the Final Comment Period Speak now or forever hold your peace. label May 26, 2020
@dkacar-oradian
Copy link

My comment is probably going to be out of scope, but I have a different perspective,

My main objection is to this:

As bug reporter (or someone with the same issue), I just want to know: **when
will my bug be fixed?** In other words, what major or point release will
solve my problem?

That's not necessarily how it is. I'm willing to write the code to fix a bug or add a feature, but I'm rather confused with the whole process. I can write something and create a PR, but I have no idea what would happen then.

As far as I'm concerned, you can assign priorities as you wish, but if the priority is low enough that it won't make the next release (point or major) I'd like to know that as soon as possible. That is, when the report is triaged and priority assigned. The SEP states:

It will also make it easy to see what issues are certain to make it into the next
release, and which issues are *not* the focus for the core team.

I have to say that, after reading the SEP, it's not easy for me to see which issues will make it into the next release. Criticals would, but I'm really not sure about the rest. From the description it looks like Highs also would, but it isn't explicitely spelled out. And I have no clue about Mediums and Lows.

Something with a Low priority probably wouldn't on its own, but it could be in a part of code which will be worked on because there is another High issue reported against it, so it would be reasonable to expect that the Low issue would also be completed by the same programmer. Or at least that's how I do things. But I'm not sure about you.

If Lows can be combined with higher priority issues then there should be a separate label indicating that it would make it into the next release.

Another thing is that I'd like to have an option (could be a label) to mark an issue with "I'm willing to create a PR for this if you won't work on it in the near future", provided that I could be fairly certain that someone would look into that PR and merge it in a reasonable time frame (in practice that usually means before the next point release).

And I'm aware that "someone looking into it" might require a largish amount of time from that someone.

@OrangeDog
Copy link

a separate label indicating that it would make it into the next release

As I keep saying, that's what milestones are supposed to be for.

@waynew
Copy link
Contributor Author

waynew commented May 27, 2020

I have to say that, after reading the SEP, it's not easy for me to see which issues will make it into the next release. Criticals would, but I'm really not sure about the rest. From the description it looks like Highs also would, but it isn't explicitely spelled out. And I have no clue about Mediums and Lows.

That's actually intentional, because (at least currently) there's no way to guarantee. We currently have over 1,500 open issues, and I'd guess more than a release worth of those would be labeled "High".

This SEP is explicitly not trying to overhaul our entire process - just pick this low-hanging fruit. We do have plans and desires to improve the rest of the process, this is just laying the foundation for the rest 🙂

@dkacar-oradian
Copy link

Yeah, but I'd like to know that something isn't going to make the next release. Which is different from asking for commitment for the next release.

If I know that Saltstack the company isn't going to work on something for the next release then I can organize something myself (write a patch, for example). Your labeling doesn't give me that kind of information. If I understand correctly I'd still have to wait until feature freeze to figure out what's in and what isn't. And that is for features. For bugs I really have no clue.

And yes, that's what milestones are supposed to be for.

If I have to wait until feature freeze (or something similar) that doesn't give me a lot of time to organize something myself. And I see that as a pretty big problem.

I'm not opposed to this change, just saying that I don't see a tangible benefit for me. There is definitely a cosmetic one (not being confused with P1 + Medium kind of thing), but that's just cosmetic.

@cachedout
Copy link
Contributor

The next highest priority is for issues that are serious issues, but **not
serious enough to be considered a release blocker**. These affect a many or
most users in the functional area, may cause data loss, crashes or hangs,
and/or makes the system unresponsive. This label will be **High**

My reading of this language is that issues which may cause data loss, crashes or hangs, or cause a system to become unstable are not necessarily considered serious enough to block a release.

This description is so similar to the Critical description that it seems as though the difference is entirely subjective. Granted, to a certain extent, that's always going to be true and I understand that it's hard to draw a crisp line, but I suspect this is going to be a problem at some point. Beyond that, I think it sends a dispiriting message to say that a bug affecting the stability of systems on which Salt is run might knowingly make its way into a release.

Have you considered possibly trying to clarify the language to better describe what types of issues constitute a release blocker and how that decision is made?

@waynew
Copy link
Contributor Author

waynew commented Jun 1, 2020

My reading of this language is that issues which may cause data loss, crashes or hangs, or cause a system to become unstable are not necessarily considered serious enough to block a release.
...
Have you considered possibly trying to clarify the language to better describe what types of issues constitute a release blocker and how that decision is made?

I'm not sure if at this point there are clear enough criteria to make that decision. Consider this contrived example:

The night before a targeted release date an issue is uncovered that, with the right combination of pillar data and configs, Salt erroneously strips the newline from the end of a file.

Under a rigorous definition, that's data loss. It's trivial data loss that could probably be addressed by some kind of onchanges hacks, but it's data loss. If we simply say "all data loss blocks a release", suddenly we have to delay the release for everyone based on some weird corner case. And our current release process is tremendously involved.

The purist in me agrees: it's data loss and should block the release in that case. The pragmatist says that the release process is currently too involved and it would be silly to block the release for such an apparently simple issue.

While I fundamentally I agree with you - I want better and clearer criteria for that decision making - I don't currently have a better approach. To clearly spell out, discuss, and decide the decision criteria would take an unreasonable amount of time and effort, and would likely not be a real improvement over this approach anyway. So my approach is that the default choice for data loss/instability/etc. is Critical, but there may be exceptions to that rule.

To me the current wording in the SEP is clear enough, but if you've got a better way to state that, I'm 👂 👂 👂 👂 👂 👍

@cachedout
Copy link
Contributor

cachedout commented Jun 1, 2020

While I fundamentally I agree with you - I want better and clearer criteria for that decision making - I don't currently have a better approach.

The scenario you've put forward is -- almost to a T -- the reason that the current system is the way that it is. :) While it has its downsides, it's built to handle the case you're describing.

The big issue with way that this proposal is designed is that it seems to blend two concerns. The first is user impact, which is an amalgam of how likely it is to affect users and what the impact to those users might be.

The second is release impact which seems to be only present in the Critical labels and affects go/no-go on a release.

I think @OrangeDog has the right idea here when he says that a second mechanism (such as milestones) would make that distinction between those vectors more clear. (Forgive me if I'm misappropriating that comment, but that's how I read it.)

So, specifically, here is what I propose:

  1. Collapse Critical/High into a single label, since the difference seems to be very vague and has the issues I outline above.
  2. Develop a SEP concurrent to this one which expands on how release items are managed and how blockers are identified and tracked.
  3. Block the merge of this SEP until 1. and 2. are complete.

@waynew
Copy link
Contributor Author

waynew commented Jun 1, 2020

So, specifically, here is what I propose:

1. Collapse Critical/High into a single label, since the difference seems to be very vague and has the issues I outline above.

2. Develop a SEP concurrent to this one which expands on how release items are managed and how blockers are identified and tracked.

3. Block the merge of this SEP until 1. and 2. are complete.

The difference in criteria may not be cut and dried, but the difference in the outcome is clear, useful, and an improvement.

From "what this is not":

It also does not attempt to address the entire development and triage process.

Again, the purpose of this SEP is only to simplify and clarify the labels. We have a very real cost associated with the current confusion around labels, and this SEP improves that. It doesn't fix everything, but it doesn't need to, and that shouldn't be a criteria for acceptance.

I don't see a compelling reason to not simplify the labels as proposed. In fact (1) makes a case for applying this SEP earlier, rather than later. If, in practice, the difference is vague and confusing then we will see that. Because we already have vague and confusing that isn't a step backwards - we're not losing anything, but we get real feedback that we can apply when we do (2). If the existing proposal works, we'll see that. If it doesn't work, then we'll see that, and will address it in (2).

@dkacar-oradian
Copy link

There are a few things about that contrived example I'd like to point out.

You could say "significant data loss" instead of "data loss", but then you have to describe significant. I would say that a syntax error in the configuration file which then prevents some service from starting can be a part of the description.

On its own, the newline at the end of file probably wouldn't cause that. But if something's doing cat cfg_* >config_file.conf, then the missing newline could very likely cause a syntax error and service problem.

So you might want to delay a release. However, if you delay a release (for any reason, including some far better than the contrived example), you're delaying bug fixes for serious issues which are already present in the installed base. I'll give you a non-contrived example of that.

Currently the vault module will return None if any kind of error happens while retreiving a secret from Vault. And then that None as a literal string gets used instead of the data from Vault. So all passwords, ssh keys and other kinds of secrets become literal None in files on the file system. If that happens the server will very likely become completely unusable. And fixing that could take hours. If it's my bug, like firewall misconfiguration, then not so much time. But if it's yours, like what was released in 3000.2, then it could easily be hours.

While my Vault functions perfectly nothing bad's going to happen. But there is a non-zero probability that it won't function perfectly, so the more time passes before the next release the greater the risk that things woud go horribly wrong. (And this is fixed in the current master branch, it's not a contrived example.)

So you'd have to find a way to chose a lesser evil: pain that could be caused by not delaying a release versus pain that could be caused by delaying a release.

But you can't. Noone can figure that one. Even if you had all the data (because a burning bush visited you and told you how things are), that data wouldn't be applicable to the release after that.

You shouldn't be trying to model a process based on "suppose that a day before the release ..." discussion. No matter what you decide to do or not to do bad things could happen. And you don't really have a chance of preventing that. You can't even minimize the damage. Unless you actually have a direct connection with a burning bush. But in that case your process would simply be "We're going to do whatever the burning bush says we should do".

Another view of this is: suppose that report came 24 hours later (so, after the release). Not a big difference, really. And you do have a process for that. So, just apply that process to reports which came 24 hours before the release. It is not your responsibility to report shit on time. It's mine.

But you need to do something else. Promote RC testing. Try to deal with the issues reported against RC. If people see that their issues aren't handled they will stop testing and reporting because it would just look like a waste of time. And you need to publish reports about that, so that people can see that you're handling the stuff and that things are improving. Which should make them more confident about investing their own time into testing, reporting and submitting PRs.

@waynew
Copy link
Contributor Author

waynew commented Jun 3, 2020

this SEP has received the requisite number of ✔️ and is now approved.

@waynew waynew merged commit cd903eb into saltstack:master Jun 3, 2020
@waynew waynew added Accepted and removed Final Comment Period Speak now or forever hold your peace. labels Jun 3, 2020
@gtmanfred

This comment has been minimized.

@sagetherage
Copy link
Contributor

Thank you all for you comments and feedback. We will continue to develop the definitions - and I am drafting a new SEP specific to the planning phase of a release cycle (major, minor, CVE) that will, in part, address, expanded definitions of the labels, here, other labels not mentioned here, milestones, and the question - when will my issue be fixed?

It will also include answers to: where this information live and be accessed, and will that be available to me in a timely manner? This will be a large SEP, but it is clearly needed, wanted, warranted. Please, keep the feedback coming - we want to hear from you!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.