Allow marking releases stuck in a pending state as failed #116

SimonBaeumer · 2021-11-16T15:37:09Z

PR to discuss how to recover from pending state.
This is our current implementation used in https://github.com/stackrox/helm-operator

Fixes #94

Currently, if a Helm release operation exits unexpectedly (e.g. due to a node crash or a bug), the pending state of the Helm release is never cleared, which leads to an infinite reconciliation loop.

To resolve pending states automatically, the reconciler now takes an option via WithMarkFailedAfter, which configures a timeout after which the pending state is handled as a failure.

ToDo

Add Tests
Only mark failed on Operator owned Helm secrets

coveralls · 2021-11-16T15:48:44Z

Pull Request Test Coverage Report for Build 1593318507

29 of 44 (65.91%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.5%) to 88.06%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/client/actionclient.go	0	7	0.0%
pkg/reconciler/reconciler.go	29	37	78.38%

Totals
Change from base Build 1579656302:	-0.5%
Covered Lines:	1652
Relevant Lines:	1876

💛 - Coveralls

SimonBaeumer · 2021-12-17T17:26:25Z

@joelanford @varshaprasad96 @fabianvf On the topic of recovering only from operator-owned secrets:
I don't think it makes sense to limit recovery from pending states to operator-owned Helm secrets, because the operator always tries to match the desired state.
For example, consider this workflow:

1. Install CR with operator
2. Helm release is installed by operator with CR values
3. Admin installs a release manually, values are updated and do not match desired state from the CR anymore
4. Reconciliation kicks in, compares the manifest diff and reconciles again
5. Admin changes reverted by operator

IMHO, if a user wants to upgrade their release manually, the operator should be disabled.

WDYT?

joelanford · 2021-12-17T17:54:13Z

@SimonBaeumer

> Admin installs a release manually, values are updated and do not match desired state from the CR anymore

In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?

If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.

I was originally thinking of the reverse scenario:

1. Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
2. Create CR for the same release with operator
3. Should operator adopt release and attempt to reconcile?

I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:

1. Just block creation of the CR if a release secret already exists
2. Adopt the release

Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.

joelanford · 2021-12-17T18:34:01Z

pkg/client/actionclient.go

+func (c *actionClient) MarkFailed(rel *release.Release, reason string) error {
+	infoCopy := *rel.Info
+	releaseCopy := *rel
+	releaseCopy.Info = &infoCopy
+	releaseCopy.SetStatus(release.StatusFailed, reason)
+	return c.conf.Releases.Update(&releaseCopy)
+}


While I see how it's convenient to put this functionality into the actionClient, I'm not convinced it makes a ton of sense otherwise. There's really only one way to mark a release as failed (this is it), so I'm wondering if we pull this back out of the action client interface and just put this logic directly into the reconciler.

The only missing piece I see for that is giving the Reconciler an ActionConfigGetter field, which would likely just involve adding the field, adding a WithActionConfigGetter functional option, and then tweaking the addDefaults function slightly to handle the fact that the reconciler may already have an ActionConfigGetter setup via the new functional option.

SimonBaeumer · 2021-12-20

Agreed. I extracted it and am now working on a fake implementation of the ActionConfigGetter. After that it should be finished. The implementation is still a bit rough, though.

SimonBaeumer · 2022-01-28

Sorry for the delay @joelanford.
I tried to fake the ActionConfig, but it was more complex than expected: the fake also has to interact with storage.Storage to call Update for the given release.
I stopped there and wondered whether it would fit the abstraction better to rename MarkFailed to the more generic Update, so that the ActionClient wraps all interactions with Helm in a single struct.
The existing fake client could then be reused and easily extended.

If the Update func does not meet expectations, I would implement a fake ActionConfig with an in-memory release driver in a separate PR.
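To make the trade-off concrete, here is a minimal sketch of how a generic Update method could be faked in a test. Status, Info and Release are simplified stand-ins for Helm's release types, and ReleaseUpdater, fakeUpdater and markFailed are hypothetical names introduced for illustration; none of this is the actual ActionClient interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Status, Info and Release are simplified stand-ins for Helm's release types
// (hypothetical; the real types carry many more fields).
type Status string

const (
	StatusPendingUpgrade Status = "pending-upgrade"
	StatusFailed         Status = "failed"
)

type Info struct {
	Status      Status
	Description string
}

type Release struct {
	Name string
	Info *Info
}

// ReleaseUpdater is the narrow interface the reconciler would depend on if
// MarkFailed were generalized to Update (assumption).
type ReleaseUpdater interface {
	Update(rel *Release) error
}

// fakeUpdater records every release it was asked to persist.
type fakeUpdater struct {
	updated []*Release
}

func (f *fakeUpdater) Update(rel *Release) error {
	f.updated = append(f.updated, rel)
	return nil
}

// markFailed copies the release and its Info before changing the status, so
// the caller's object is left untouched, then persists the copy.
func markFailed(u ReleaseUpdater, rel *Release, reason string) error {
	if rel == nil || rel.Info == nil {
		return errors.New("release and release info must not be nil")
	}
	infoCopy := *rel.Info
	relCopy := *rel
	relCopy.Info = &infoCopy
	relCopy.Info.Status = StatusFailed
	relCopy.Info.Description = reason
	return u.Update(&relCopy)
}

func main() {
	fake := &fakeUpdater{}
	rel := &Release{Name: "demo", Info: &Info{Status: StatusPendingUpgrade}}
	if err := markFailed(fake, rel, "pending for too long"); err != nil {
		panic(err)
	}
	// The original keeps its status; only the persisted copy is marked failed.
	fmt.Println(rel.Info.Status, fake.updated[0].Info.Status)
}
```

With a single-method interface like this, the fake stays a few lines long, whereas faking the whole action configuration would also require a storage backend.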

SimonBaeumer · 2021-12-20T13:31:15Z

> @SimonBaeumer
>
> > Admin installs a release manually, values are updated and do not match desired state from the CR anymore
>
> In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?
>
> If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.
>
> I was originally thinking of the reverse scenario:
>
> 1. Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
> 2. Create CR for the same release with operator
> 3. Should operator adopt release and attempt to reconcile?

I agree. Ideally, an operator should only reconcile resources that belong to it or that are explicitly labeled to allow it (opt-in to reconcile).

> I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:
>
> 1. Just block creation of the CR if a release secret already exists
> 2. Adopt the release
>
> Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.

I think adopting the release is reasonable if opted in. This seems to be outside the scope of this PR, so I created an issue for it: #144

joelanford · 2021-12-22T18:00:41Z

pkg/reconciler/reconciler.go

 	if r.actionClientGetter == nil {
-		actionConfigGetter := helmclient.NewActionConfigGetter(mgr.GetConfig(), mgr.GetRESTMapper(), r.log)
-		r.actionClientGetter = helmclient.NewActionClientGetter(actionConfigGetter)
+		r.actionConfigGetter = helmclient.NewActionConfigGetter(mgr.GetConfig(), mgr.GetRESTMapper(), r.log)


This should be moved out of the r.actionClientGetter == nil if block and into its own, right?

	if r.actionConfigGetter == nil {
		r.actionConfigGetter = helmclient.NewActionConfigGetter(mgr.GetConfig(), mgr.GetRESTMapper(), r.log)
	}
	if r.actionClientGetter == nil {
		r.actionClientGetter = helmclient.NewActionClientGetter(r.actionConfigGetter)
	}

SimonBaeumer · 2022-01-28

Agreed, done.

asmacdo · 2022-02-08T16:22:54Z

@joelanford and @SimonBaeumer to pair on this

SimonBaeumer · 2022-02-19T18:19:53Z

pkg/reconciler/reconciler.go

 		return ctrl.Result{}, err
 	}
+	if state == statePending {
+		return r.handlePending(actionClient, rel, &u, log)


@joelanford To be honest, I haven't found a good solution here and would like you to make the call.

1. handlePending in actionClient
My suggestion would be to move handlePending into the actionClient.
This makes sense because pending is a Helm state, and the actionClient is what abstracts Helm interactions.
To include handlePending in handleReconcile, statePending would have to be checked at a different point in the execution (within the switch on state at line 559).

2. Moving handlePending into the switch on state
As far as I can see, the only disadvantage is that pre-hooks are executed before the pending state is resolved. This is not necessarily bad; it depends on how we define pre-hooks/extension functions.

openshift-ci · 2022-03-07T06:41:49Z

@SimonBaeumer: PR needs rebase.


perdasilva · 2024-09-17T11:45:52Z

Closing as stale. Please re-open if necessary.

openshift-ci bot added the do-not-merge/work-in-progress label Nov 16, 2021

github-actions bot added the area/sdk label Nov 16, 2021

SimonBaeumer force-pushed the recover-from-pending-state branch from 9166fc6 to c26cd1e December 6, 2021 10:15

openshift-ci bot added the needs-rebase label Dec 17, 2021

github-actions bot added the area/testing label Dec 17, 2021

SimonBaeumer force-pushed the recover-from-pending-state branch from a48929e to 8805e70 December 17, 2021 17:09

Allow marking releases stuck in a pending state as failed

2961ecb

SimonBaeumer force-pushed the recover-from-pending-state branch from 8805e70 to 2961ecb December 17, 2021 17:11

openshift-ci bot removed the needs-rebase label Dec 17, 2021

SimonBaeumer added 3 commits December 17, 2021 18:12

Fix style

3deb69b

fix style

aa12c68

fix style

a98ee77

SimonBaeumer marked this pull request as ready for review December 17, 2021 17:28

openshift-ci bot removed the do-not-merge/work-in-progress label Dec 17, 2021

joelanford reviewed Dec 17, 2021

SimonBaeumer mentioned this pull request Dec 20, 2021

How should the Helm operator handle existing Helm installations? #144

Open

WIP

44e9cd1

joelanford reviewed Dec 22, 2021

Rename MarkFailed to Update

3ef3256

SimonBaeumer commented Feb 19, 2022

openshift-ci bot added the needs-rebase label Mar 7, 2022

kovayur mentioned this pull request Jul 31, 2023

getReleaseState may sometimes cause an unwanted rollback #227

Open

perdasilva closed this Sep 17, 2024
