Upgrade and Rollback Procedures¶
How to upgrade CloudTaser components with zero downtime and roll back if something goes wrong.
Prerequisites¶
- `helm` v3.x installed
- `kubectl` access to the target cluster
- The CloudTaser Helm chart repository configured
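Configuring the chart repository typically looks like the following sketch; the repository URL is an assumption (use the address from your CloudTaser distribution):

```bash
# Add and refresh the CloudTaser chart repository
# (URL is an assumption -- substitute your distribution's repo address)
helm repo add cloudtaser https://cloudtaser.github.io/cloudtaser-helm
helm repo update
```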
Upgrade Strategy¶
CloudTaser consists of three independently-versioned components deployed by a single Helm chart:
| Component | Chart Value | Current Default |
|---|---|---|
| Operator | `operator.image.tag` | `v0.5.14-amd64` |
| Wrapper | `wrapper.image.tag` | `v0.0.31-amd64` |
| eBPF Agent | `ebpf.image.tag` | `v0.1.21-amd64` |
| S3 Proxy | `s3proxy.image.tag` | `v0.2.7-amd64` |
The Helm chart version (currently 0.4.34) tracks independently from component versions. Upgrading the chart may update one or more component images.
Zero-Downtime Upgrade¶
Step 1: Check current versions¶
```bash
helm list -n cloudtaser-system

kubectl get deployment -n cloudtaser-system cloudtaser-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

kubectl get daemonset -n cloudtaser-system cloudtaser-ebpf \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```
Step 2: Review the new chart version¶
```bash
# Check available chart versions
helm search repo cloudtaser --versions

# Diff the values between current and new chart
helm diff upgrade cloudtaser cloudtaser/cloudtaser \
  -n cloudtaser-system \
  -f values.yaml
```
**Install the helm-diff plugin**

`helm diff` requires the helm-diff plugin (`helm plugin install https://github.com/databus23/helm-diff`). It shows the exact Kubernetes resource diffs before applying.
Step 3: Upgrade the Helm release¶
```bash
helm upgrade cloudtaser cloudtaser/cloudtaser \
  -n cloudtaser-system \
  -f values.yaml \
  --version <target-chart-version>
```
Step 4: Verify the upgrade¶
```bash
# Check operator is running
kubectl rollout status deployment/cloudtaser-operator -n cloudtaser-system

# Check eBPF daemonset is running on all nodes
kubectl rollout status daemonset/cloudtaser-ebpf -n cloudtaser-system

# Verify webhook is responding
kubectl get mutatingwebhookconfiguration cloudtaser-operator-webhook -o yaml | head -20
```
Component Upgrade Order¶
When upgrading individual components (setting specific image tags), follow this order:
1. **eBPF agent first** -- the agent is backward compatible with older wrapper versions. Upgrading it first ensures enforcement is active during the transition.
2. **Operator second** -- the operator generates injection patches. A new operator version may inject new annotations or environment variables that the wrapper needs.
3. **Wrapper last** -- wrapper upgrades require pod restarts (the wrapper binary is copied via the init container). The new wrapper image is used on the next pod creation.
**Wrapper upgrades require pod restarts**
The wrapper binary is copied into each pod's emptyDir volume by an init container at pod creation time. Existing running pods continue using the old wrapper binary until they are restarted. To roll out a new wrapper version:
```bash
# Restart all deployments with CloudTaser injection
kubectl get deployments --all-namespaces \
  -o jsonpath='{range .items[?(@.spec.template.metadata.annotations.cloudtaser\.io/inject=="true")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | \
while read dep; do
  kubectl rollout restart deployment "${dep##*/}" -n "${dep%%/*}"
done
```
CRD Upgrades¶
The Helm chart includes CRDs for `CloudTaserConfig` and `SecretMapping`. Helm installs CRDs from the chart's `crds/` directory on first install but does not upgrade them afterwards.
To upgrade CRDs:
```bash
# Apply CRDs from the new chart version
kubectl apply -f https://raw.githubusercontent.com/cloudtaser/cloudtaser-helm/main/charts/cloudtaser/crds/api.cloudtaser.io_cloudtaserconfigs.yaml
kubectl apply -f https://raw.githubusercontent.com/cloudtaser/cloudtaser-helm/main/charts/cloudtaser/crds/api.cloudtaser.io_secretmappings.yaml
```
**Apply CRDs before upgrading the Helm release**
If the new operator version references new CRD fields, apply the CRDs first. Otherwise the operator may fail to start because the API server rejects unknown fields.
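After applying, it can be worth confirming the API server now serves the expected CRD versions. A sketch, with CRD names derived from the `api.cloudtaser.io` group in the URLs above:

```bash
# List the served versions of each CloudTaser CRD
kubectl get crd cloudtaserconfigs.api.cloudtaser.io \
  -o jsonpath='{range .spec.versions[*]}{.name}{"\n"}{end}'
kubectl get crd secretmappings.api.cloudtaser.io \
  -o jsonpath='{range .spec.versions[*]}{.name}{"\n"}{end}'
```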
Certificate Rotation During Upgrade¶
The operator manages its own webhook TLS certificates. During an upgrade:
- Certificates are stored in a Kubernetes Secret (`cloudtaser-operator-certs`). All operator replicas share the same certificate.
- The operator checks certificate expiry every 24 hours and rotates certificates 30 days before expiry.
- On startup, the operator patches the `MutatingWebhookConfiguration` and `ValidatingWebhookConfiguration` with the current CA bundle.
- Broker mTLS certificates (stored in `cloudtaser-broker-tls`) follow the same rotation schedule.
No manual certificate action is needed during upgrades. The new operator pod reads the existing certificate Secret and patches the webhook configurations automatically.
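If you want to confirm how much lifetime the current certificate has before a maintenance window, the Secret can be inspected with openssl. A sketch; the `tls.crt` key name inside the Secret is an assumption:

```bash
# Print the webhook certificate's expiry date
# (the "tls.crt" key name is an assumption -- check the Secret's keys first)
kubectl get secret cloudtaser-operator-certs -n cloudtaser-system \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
```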
Rollback¶
Helm rollback¶
```bash
# List revision history
helm history cloudtaser -n cloudtaser-system

# Roll back to the previous revision
helm rollback cloudtaser -n cloudtaser-system

# Roll back to a specific revision
helm rollback cloudtaser <revision-number> -n cloudtaser-system
```
Verify rollback¶
```bash
kubectl rollout status deployment/cloudtaser-operator -n cloudtaser-system
kubectl rollout status daemonset/cloudtaser-ebpf -n cloudtaser-system
```
Rollback considerations¶
| Scenario | Impact | Action |
|---|---|---|
| Operator rollback | New pods will be injected with the previous operator logic. Existing pods are unaffected. | Restart affected deployments if needed |
| eBPF rollback | Previous enforcement behavior is restored on all nodes | No pod restart needed |
| Wrapper rollback | Running pods keep the current wrapper. New pods get the old wrapper. | Restart deployments to pick up old wrapper binary |
| CRD rollback | CRDs are not managed by Helm rollback | Manually reapply old CRD versions if fields were removed |
**CRDs are not rolled back by Helm**

`helm rollback` does not revert CRD changes. If a CRD was updated with new fields, rolling back the chart leaves the new CRD in place. This is generally safe because CRDs are additive, but verify compatibility.
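If the old CRD versions do need to be reapplied, they can be pulled from the matching chart tag in the repository. A sketch; `<old-chart-tag>` is a placeholder, and the tag naming scheme is an assumption:

```bash
# Reapply CRDs from an older chart version (substitute <old-chart-tag>)
kubectl apply -f https://raw.githubusercontent.com/cloudtaser/cloudtaser-helm/<old-chart-tag>/charts/cloudtaser/crds/api.cloudtaser.io_cloudtaserconfigs.yaml
kubectl apply -f https://raw.githubusercontent.com/cloudtaser/cloudtaser-helm/<old-chart-tag>/charts/cloudtaser/crds/api.cloudtaser.io_secretmappings.yaml
```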
Database Migration Considerations¶
SaaS Control Plane¶
The CloudTaser SaaS control plane (cloudtaser-saas) uses an in-memory tenant store. There are no database migrations required when upgrading the SaaS component.
Database Proxy¶
The database proxy (cloudtaser-db-proxy) does not have its own database. It proxies PostgreSQL connections and performs transparent encryption/decryption. Upgrading the proxy is safe because:
- The encrypted value format is versioned (currently version 1). New proxy versions can always read values encrypted by older versions.
- The encryption key lives in Vault Transit, not in the proxy. Key continuity is guaranteed by the vault.
- The proxy is stateless. Restart it at any time.
However, if a new proxy version changes the encryption format:
- The new format version byte ensures old values remain readable
- New writes use the new format
- Rolling back to an older proxy that does not understand the new format will fail to decrypt values written by the new version
**Test proxy upgrades in staging first**
Deploy the new proxy version in a staging environment and verify both reads (of existing encrypted data) and writes work correctly before upgrading production.
Canary Upgrade¶
For large clusters, consider a canary upgrade using namespace selectors:
Step 1: Deploy the new operator version to a canary namespace¶
Override the webhook namespace selector to target only the canary namespace:
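One way to sketch this is to label the canary namespace and scope the webhook to it directly. Both the label key and the direct patch are assumptions -- the chart may expose a namespace-selector value instead (check `helm show values cloudtaser/cloudtaser`), and while the patch is in place injection is skipped for non-canary namespaces:

```bash
# Label the canary namespace (the label key is an assumption)
kubectl label namespace <canary-namespace> cloudtaser.io/canary=true

# Restrict the webhook to the labeled namespace
kubectl patch mutatingwebhookconfiguration cloudtaser-operator-webhook \
  --type='json' \
  -p='[{"op":"add","path":"/webhooks/0/namespaceSelector","value":{"matchLabels":{"cloudtaser.io/canary":"true"}}}]'
```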
Step 2: Restart a test workload in the canary namespace¶
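For example, with a disposable workload in the canary namespace:

```bash
# Recreate the test pods so the new operator version injects them
kubectl rollout restart deployment <test-deployment> -n <canary-namespace>
kubectl rollout status deployment <test-deployment> -n <canary-namespace>
```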
Step 3: Verify protection score¶
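How the protection score is surfaced is deployment-specific; at minimum, the injected init container image confirms the new operator's patch was applied. A sketch, with the label selector as an assumption:

```bash
# Confirm the wrapper init container was injected by the new operator
kubectl get pod -n <canary-namespace> -l app=<test-app> \
  -o jsonpath='{.items[0].spec.initContainers[*].image}'
```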
Step 4: Roll out to all namespaces¶
Once verified, upgrade the Helm release for the full cluster.
Emergency: Disable Injection¶
If an upgrade causes pod creation failures, disable the webhook temporarily:
```bash
# Option 1: Set failurePolicy to Ignore (pods start without injection)
kubectl patch mutatingwebhookconfiguration cloudtaser-operator-webhook \
  --type='json' \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Option 2: Delete the webhook configuration entirely (emergency only)
kubectl delete mutatingwebhookconfiguration cloudtaser-operator-webhook
```
**Disabling the webhook removes protection**

Setting `failurePolicy: Ignore` means new pods start without CloudTaser injection. Existing running pods continue to operate normally. Restore the webhook after resolving the issue.
After resolving the issue, restore the webhook:
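A minimal sketch of restoring enforcement, assuming the webhook's original `failurePolicy` was `Fail`:

```bash
# If the policy was relaxed: restore the original failurePolicy
# (assumes the original value was "Fail")
kubectl patch mutatingwebhookconfiguration cloudtaser-operator-webhook \
  --type='json' \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'

# If the webhook configuration was deleted: re-run the Helm upgrade so the
# chart recreates it; the operator re-patches the CA bundle on startup
helm upgrade cloudtaser cloudtaser/cloudtaser -n cloudtaser-system -f values.yaml
```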