OTA-541: enhancements/update/do-not-block-on-degraded: New enhancement proposal #1719
@wking: This pull request references OTA-541 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
b0c8d2e
to
69eca53
Compare
## Proposal

The cluster-version operator currently has [a mode switch][cvo-degraded-mode-switch] that makes `Degraded` ClusterOperator a non-blocking condition that is still propagated through to `Failing`.
This enhancement proposes making that an unconditional `UpdateEffectReport`, regardless of the CVO's current mode (installing, updating, reconciling, etc.).
openshift/cluster-version-operator#482 is in flight with this change, if folks want to test pre-merge.
/cc |
69eca53
to
11f8243
Compare
### Goals

ClusterVersion updates will no longer block on ClusterOperators solely based on `Degraded=True`.
Does this mean that if no operator is unavailable, the upgrade should always complete?
ClusterOperators aren't the only CVO-manifested resources, and if something else breaks, like we fail to reconcile a RoleBinding or whatever, that will block further update progress. And for ClusterOperators, we'll still block on `status.versions` not being as far along as the manifest claimed, in addition to blocking if `Available` isn't `True`. Personally, `status.versions` seems like the main thing that's relevant, e.g. a component coming after the Kube API server knows it can use 4.18 APIs if the Kube API server has declared 4.18 `versions`. As an example of what the 4.18 Kube API server asks the CVO to wait on:
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.18.0-rc.0-x86_64
Extracted release payload from digest sha256:054e75395dd0879e8c29cd059cf6b782742123177a303910bf78f28880431d1c created at 2024-12-02T21:11:00Z
$ yaml2json <manifests/0000_20_kube-apiserver-operator_07_clusteroperator.yaml | jq -c '.status.versions[]'
{"name":"operator","version":"4.18.0-rc.0"}
{"name":"raw-internal","version":"4.18.0-rc.0"}
{"name":"kube-apiserver","version":"1.31.3"}
A recent example of this being useful is openshift/machine-config-operator#4637, which got the CVO to block until the MCO had rolled out a single-arch -> multi-arch transition, without the MCO needing to touch its `Degraded` or `Available` conditions to slow the CVO down.
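
The blocking rule described in this comment can be sketched roughly as follows. This is an illustrative Python model, not the CVO's actual Go implementation: the function name `evaluate`, the dictionary shapes, and the message strings are all assumptions, and only the ClusterOperator checks are modeled (other manifest failures, such as an unreconciled RoleBinding, are out of scope).

```python
def evaluate(manifest, live):
    """Split ClusterOperator findings into blocking and report-only lists.

    Hypothetical model of the proposed behavior, not the CVO's Go code.
    """
    status = live.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    have = {v["name"]: v["version"] for v in status.get("versions", [])}

    blocking = []
    # Block until every version named in the release manifest is reported.
    for want in manifest.get("status", {}).get("versions", []):
        if have.get(want["name"]) != want["version"]:
            blocking.append(f"version {want['name']} is not {want['version']}")
    # Block unless Available is explicitly True.
    if conditions.get("Available") != "True":
        blocking.append("Available is not True")

    report_only = []
    # Under the proposal, Degraded=True is surfaced but does not block.
    if conditions.get("Degraded") == "True":
        report_only.append("Degraded=True")
    return blocking, report_only


# Versions taken from the 4.18.0-rc.0 kube-apiserver manifest quoted above.
manifest = {"status": {"versions": [
    {"name": "operator", "version": "4.18.0-rc.0"},
    {"name": "raw-internal", "version": "4.18.0-rc.0"},
    {"name": "kube-apiserver", "version": "1.31.3"},
]}}
# Made-up live state: fully leveled and available, but degraded.
live = {"status": {
    "conditions": [
        {"type": "Available", "status": "True"},
        {"type": "Degraded", "status": "True"},
    ],
    "versions": manifest["status"]["versions"],
}}
blocking, report_only = evaluate(manifest, live)
print(blocking, report_only)
```

In this model a leveled, available, but degraded operator produces an empty blocking list, while a missing or stale `status.versions` entry still holds the update.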
So could I say that if `Failing=True` for an upgrade, the reason should not be `ClusterOperatorDegraded` only?
11f8243
to
e10df2a
Compare
## Test Plan

**Note:** *Section not required until targeted at a release.*
The enhancement and the tracking card OTA-541 are not targeted at a release. However, changes in the `dev-guide/cluster-version-operator/user/reconciliation.md` file suggest that the enhancement is targeted at the 4.19 release, and thus the `Test Plan` section should be addressed.
I'm not strongly opinionated on what the test plan looks like. We don't do a lot of intentional-sad-path update testing today in CI, and I'm fuzzy on what QE does in that space that could be expanded into this new space (or maybe they already test pushing a ClusterOperator component to `Degraded=True` mid update to see how the cluster handles that?).
> test pushing a ClusterOperator component to Degraded=True mid update to see how the cluster handles that?

+1, that's also what I want to explore during testing. I had some other rough checkpoints in mind when I first read this enhancement doc, but I still need some input from @wking to help me tidy them up. For example #1719 (comment).
I asked because there are already some `cv.conditions` checks in CI, and I'm wondering whether we could update that logic to help catch issues once the feature is implemented.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale |
…ew enhancement proposal

The cluster-version operator (CVO) uses an update-mode when transitioning between releases, where the manifest operands are sorted into a task-node graph, and the CVO walks the graph reconciling. Since 4.1, the cluster-version operator has blocked during update and reconcile modes (but not during install mode) on Degraded=True ClusterOperator. This enhancement proposes ignoring Degraded when deciding whether to block on a ClusterOperator manifest.

The goal of blocking on manifests with sad resources is to avoid further destabilization. For example, if we have not reconciled a namespace manifest or ServiceAccount RoleBinding, there's no point in trying to update the consuming operator Deployment. Or if we are unable to update the Kube-API-server operator, we don't want to inject unsupported kubelet skew by asking the machine-config operator to update nodes.

However, blocking the update on a sad resource has the downside that later manifest-graph task-nodes are not reconciled, while the CVO waits for the sad resource to return to happiness. We maximize safety by blocking when progress would be risky, while continuing when progress would be safe, and possibly helpful.

Our experience with Degraded=True blocks has turned up cases where blocking is not helpful, so this enhancement proposes no longer blocking on that condition. We will continue to block on Available=False ClusterOperator, or when the ClusterOperator versions have not yet reached the values requested by the ClusterOperator's release manifest.
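
The downside named in this commit message (later task-nodes going unreconciled while the CVO waits on a sad resource) can be illustrated with a deliberately simplified model. The sketch below treats the task-node graph as a plain ordered list, which is an assumption made for brevity; the function `walk`, the node names, and the `is_blocked` callback are hypothetical and not taken from the CVO's Go code.

```python
def walk(nodes, is_blocked):
    """Walk task nodes in order, stopping at the first blocked one.

    Returns (completed, blocked_node, not_reached). Hypothetical linear
    model; a real task graph may have parallel branches.
    """
    completed = []
    for index, node in enumerate(nodes):
        if is_blocked(node):
            # Everything after the blocked node is left unreconciled.
            return completed, node, nodes[index + 1:]
        completed.append(node)
    return completed, None, []


# Illustrative ordering only: the kube-apiserver operator comes before
# the machine-config operator, so a block on the former keeps the latter
# from updating nodes and introducing kubelet skew.
nodes = ["namespace", "kube-apiserver", "machine-config"]

# Old behavior in this model: a Degraded=True kube-apiserver blocks.
old = walk(nodes, lambda node: node == "kube-apiserver")
# Proposed behavior in this model: Degraded=True alone blocks nothing.
new = walk(nodes, lambda node: False)
print(old)
print(new)
```

The point of the comparison is only that the choice of blocking predicate decides how much of the graph gets reconciled; which conditions belong in that predicate is the subject of the proposal.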
e10df2a
to
9498fb9
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: petr-muller The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The goal of blocking on manifests with sad resources is to avoid further destabilization.
For example, if we have not reconciled a namespace manifest or ServiceAccount RoleBinding, there's no point in trying to update the consuming operator Deployment.
Or if we are unable to update the Kube-API-server operator, we don't want to inject [unsupported kubelet skew][kubelet-skew] by asking the machine-config operator to update nodes.
We observed another kind of upgrade blocker here. Applying the `infrastructures.config.openshift.io` manifest failed because the CRD had introduced some validations that needed the apiserver to be upgraded to support them. Unfortunately, the upgrade didn't progress, and we had to manually step in to update the kube-apiserver to let the upgrade proceed. Is there a way to enhance these cases to at least let the apiserver upgrade before blocking?
…pproval list

David Eads suggested these acks to avoid surprising anyone. List generated with:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.19.0-ec.0-x86_64
  $ grep -rl 'kind: ClusterOperator' manifests | while read MANIFEST; do yaml2json < "${MANIFEST}" | jq -r '.[] | select(.kind == "ClusterOperator").metadata.name'; done | sort | uniq
c6616a3
to
111c8fe
Compare
@wking: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |