Upgrade and Rollback Procedures¶
How to upgrade cloudtaser components with zero downtime and roll back if something goes wrong.
Upgrading to chart 1.0.20+ (from 0.4.x or 1.0.0-1.0.19)¶
Chart 1.0.20 completes the sealed-TLS-from-t0 cutover. Three contracts changed in load-bearing ways; each has a distinct pre-flight check. Read this section in full BEFORE running helm upgrade. The matching in-cluster NOTES.txt block is reproduced in charts/cloudtaser/templates/NOTES.txt in the cloudtaser-helm repo.
A. Audit stored values before helm upgrade --reuse-values (cloudtaser-helm#230)¶
helm upgrade --reuse-values merges the NEW chart defaults on top of your OLD stored values. Chart 1.0.0 introduced operator.broker.*, which did not exist in 0.4.x. On a fresh release the defaults land cleanly. On an upgraded release the NEW default operator.broker.tls.enabled=true starts serving TLS, and the operator v0.6.30 refuse-to-start guardrail (section D below) aborts helm upgrade if your stored values carry a stale operator.broker.tls.enabled=false from an intermediate pin WITHOUT the matching operator.broker.allowHTTP=true.
The failure surfaces DURING helm upgrade (clear error, no silent drift) - but only after the release has been partially mutated. You want to catch it BEFORE running the command.
Run this first:
helm get values cloudtaser -n cloudtaser-system > /tmp/stored-values.yaml
grep -E '^\s*(broker|tls|allowHTTP|enabled):' /tmp/stored-values.yaml || \
echo 'no broker overrides stored - safe to upgrade'
If you see operator.broker.tls.enabled: false in the output, add operator.broker.allowHTTP: true to the same values file BEFORE upgrading (or, recommended, remove the override entirely and accept the TLS default). If you see only operator.broker.tls.enabled: true or nothing, you are safe to proceed.
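The grep above prints leaf keys such as enabled: false without their parent path, so an unrelated enabled: key can look like a broker override. A minimal sketch of a stricter check follows. It assumes the stored values use two-space indentation and contain no lists under operator.broker; the function names flatten_values and audit_broker_values are hypothetical helpers, not part of the chart or CLI.

```shell
# Flatten a simple YAML mapping (two-space indentation, no lists) into
# dotted key=value lines, e.g. "operator.broker.tls.enabled=false".
flatten_values() {
  awk '
    /^ *#/ || /^ *$/ { next }              # skip comments and blank lines
    {
      indent = match($0, /[^ ]/) - 1       # number of leading spaces
      line = substr($0, indent + 1)
      pos = index(line, ":")
      if (pos == 0) next                   # not a "key: value" line
      key = substr(line, 1, pos - 1)
      val = substr(line, pos + 1)
      sub(/^ +/, "", val)
      depth = int(indent / 2)              # assumes two-space indentation
      path[depth] = key
      if (val != "") {
        full = path[0]
        for (i = 1; i <= depth; i++) full = full "." path[i]
        print full "=" val
      }
    }' "$1"
}

# Classify a stored-values file: "blocked" means TLS is disabled without
# the explicit plain-HTTP opt-in that the guardrail requires.
audit_broker_values() {
  flat=$(flatten_values "$1")
  if printf '%s\n' "$flat" | grep -qxF 'operator.broker.tls.enabled=false'; then
    if printf '%s\n' "$flat" | grep -qxF 'operator.broker.allowHTTP=true'; then
      echo 'plain-http-opt-in'
    else
      echo 'blocked'
    fi
  else
    echo 'safe'
  fi
}
```

Run it as audit_broker_values /tmp/stored-values.yaml after the helm get values step. A result of blocked corresponds to the case where you must add operator.broker.allowHTTP: true or remove the override.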
Either way, bump cloudtaser-cli to v0.16.2+ BEFORE the upgrade. The new CLI auto-fetches the broker CA from the cloudtaser-operator-broker-tls Secret and auto-upgrades to https://. Older CLIs dial http:// and fail with tls: first record does not look like a TLS handshake.
B. Two-phase upgrade order - operator-first wedges running pods (cloudtaser-helm#231)¶
Operator v0.6.30 dials https:// for pod unseal unconditionally. It does NOT fall back to plain HTTP - silent fallback would violate the sealed-TLS-from-t0 invariant. Wrapper v0.1.14 serves :8199 as plain HTTP; wrapper v0.1.15 serves :8199 over TLS.
DO NOT bump operator image first
DO NOT bump operator.image.tag to v0.6.30 (or upgrade the chart with operator-image defaults) while ANY running pod still carries a wrapper v0.1.14 binary. The new operator's HTTPS unseal dial fails against the old wrapper's plain HTTP listener, and injected readiness probes use the HTTPS scheme against old-wrapper pods. The failure mode is connection refused / bad record version and looks like a NetworkPolicy problem - it is not. Mixed-version protected pods will crashloop until wrapper is upgraded AND the pod is recreated.
Follow these four steps in order. No skipping.
Step 1. Upgrade the chart with wrapper image bumped, operator pinned:
helm upgrade cloudtaser cloudtaser/cloudtaser --version 1.0.20 \
--set operator.image.tag=v0.6.29 \
--set wrapper.image.tag=v0.1.15 \
--reuse-values
Step 2. Recreate every already-injected pod so the new wrapper binary AND the new webhook-injected HTTPS probes land. Use kubectl rollout restart (not kubectl delete pod) so Deployments/StatefulSets/DaemonSets respect surge settings:
kubectl get deployments,statefulsets,daemonsets --all-namespaces \
-o jsonpath='{range .items[?(@.spec.template.metadata.annotations.cloudtaser\.io/inject=="true")]}{.kind}/{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | \
while IFS=/ read kind ns name; do
kubectl rollout restart "${kind,,}/${name}" -n "${ns}"
done
Step 3. Verify every injected pod is Ready (nothing stuck in Init or CrashLoopBackOff) BEFORE moving to step 4:
kubectl get pods -A -l cloudtaser.io/protected=true --no-headers | \
  awk '$4 != "Completed" { split($3, r, "/"); if ($4 != "Running" || r[1] != r[2]) { print; bad = 1 } } END { exit (bad ? 1 : 0) }' && \
  echo 'all injected pods Ready - safe to bump operator'
Step 4. Bump the operator image (now safe - every wrapper is v0.1.15):
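The command for this step is not reproduced here. The sketch below assumes it mirrors the step 1 invocation with the operator tag moved to v0.6.30, the version this section pairs with wrapper v0.1.15. The HELM variable is a hypothetical override that lets you preview the command before running it.

```shell
# Sketch of step 4: same release, chart version and wrapper tag as step 1,
# with the operator image moved to v0.6.30.
# HELM is a hypothetical override: HELM=echo prints the command instead of
# running it against the cluster.
bump_operator() {
  "${HELM:-helm}" upgrade cloudtaser cloudtaser/cloudtaser --version 1.0.20 \
    --set operator.image.tag=v0.6.30 \
    --set wrapper.image.tag=v0.1.15 \
    --reuse-values
}
```

Preview with HELM=echo, then call bump_operator with HELM unset once step 3 reports every injected pod Ready.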
Do not merge steps 1 and 4
Each step must complete and be verified independently. DO NOT merge steps 1 and 4 into a single helm upgrade unless you have zero already-running protected pods (fresh install, or all protected Deployments scaled to 0). Fresh installs can skip directly to the 1.0.20 defaults - operator v0.6.30 + wrapper v0.1.15 are a matched pair.
C. Prometheus scrape migration - hostPort removed (cloudtaser-helm#232)¶
The ebpf daemonset no longer publishes a hostPort: 9099. Scrapers configured against node:9099/metrics receive zero samples with no error (the node port is simply closed). The agent now splits its HTTP server into two listeners inside the daemonset pod's network namespace:
| Listener | Endpoint | Reachable |
|---|---|---|
| pod-IP:9099 | /healthz, /readyz | Public via cloudtaser-ebpf-agent-metrics Service |
| 127.0.0.1:9098 | /metrics, /status | Loopback-only (intentional) |
/metrics AND /status are INTENTIONALLY unreachable from off-pod. The cloudtaser-ebpf-agent-metrics Service (which is NOT annotated with prometheus.io/scrape by the chart) fronts pod-IP:9099 only - it cannot reach /metrics on 127.0.0.1:9098 no matter how you configure the ServiceMonitor. See the upstream audit at cloudtaser-ebpf#101 for the security rationale.
You have three honest options. There is no middle path.
Option A - Scrape /healthz + /readyz as an up-indicator only. The only off-pod observable surface on the Service is the readiness of each daemonset pod. This tells you the agent is alive but gives you NO runtime metrics (block counters, kprobe stats, event rates). Acceptable for platform teams who only need an "is the enforcement layer running everywhere" dashboard; NOT sufficient for detection engineering on block-event counts.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: cloudtaser-ebpf-agent-up
namespace: cloudtaser-system
labels:
release: prometheus # whatever your operator selects on
spec:
selector:
matchLabels:
app.kubernetes.io/name: cloudtaser-ebpf
endpoints:
- port: health
path: /healthz
interval: 30s
The up{} series plus an HTTP 200 on /healthz is all you will get over this Service.
Option B - Deploy an in-pod sidecar that scrapes loopback. Add a prometheus-agent (or vmagent, otel-collector - any scraper) sidecar container to the ebpf daemonset pod. With hostNetwork: false the sidecar shares the pod's network namespace with the agent container, so it can scrape 127.0.0.1:9098/metrics directly and remote-write to your long-term store. This is the ONLY way to get the full /metrics firehose off-cluster while preserving the loopback-only security property. The chart does not ship a built-in sidecar values block today; track cloudtaser-helm#232 for the chart-native variant.
Option C - Accept pod-local metrics. If you do not need /metrics aggregated centrally, leave it loopback. An attestation agent running inside the same pod (Falco, custom) can consume 127.0.0.1:9098/metrics and 127.0.0.1:9098/status directly. No chart changes required.
Do not target /metrics on the Service
Do NOT write a ServiceMonitor that targets cloudtaser-ebpf-agent-metrics with path: /metrics or path: /status. The Service only exposes port 9099, which serves /healthz + /readyz only. Prometheus will scrape the endpoint, receive HTTP 404 for /metrics or /status, and drop the series silently - same failure mode as the old node:9099/metrics scrape, just relocated to the Service. Rewrite your scrape config to one of the three options above.
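To catch this mistake before Prometheus does, a scrape manifest can be linted locally. This is a rough sketch that assumes one ServiceMonitor per file and that the manifest mentions cloudtaser-ebpf in its selector or name; lint_ebpf_scrape is a hypothetical helper.

```shell
# Flag a manifest that targets the ebpf agent with a path the Service
# cannot serve (/metrics or /status live on 127.0.0.1:9098 only).
lint_ebpf_scrape() {
  if grep -q 'cloudtaser-ebpf' "$1" &&
     grep -Eq '^[[:space:]-]*path:[[:space:]]*"?/(metrics|status)"?[[:space:]]*$' "$1"; then
    echo 'unreachable-path'
  else
    echo 'ok'
  fi
}
```

A manifest like the Option A example, which scrapes /healthz, reports ok.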
If you already had serviceMonitor.enabled=true in your values, it continues to cover the OPERATOR metrics - you need an additional ServiceMonitor (or equivalent scrape config) for the ebpf agent.
D. operator.broker.tls.enabled=false now requires operator.broker.allowHTTP=true¶
Operator v0.6.30 adds a refuse-to-start guardrail: if you disable broker TLS you MUST explicitly opt in to plain HTTP. The chart enforces this at helm template time with a clear error; you cannot apply a bricked Deployment. Update your values.yaml override so that operator.broker.tls.enabled: false is always paired with operator.broker.allowHTTP: true.
If you have no such historical override, this section does not apply - the default (tls.enabled=true, allowHTTP=false) needs no change.
Section E: May 2026 -- Beacon and proxy TLS hardening¶
The May 2026 wave hardens four components: beacon relay, on-prem bridge, S3 proxy, and DB proxy. Most changes are transparent -- secure defaults tighten without requiring operator action. Two changes require explicit attention if you use development-mode TLS bypass flags.
E1: Beacon hardening (beacon v0.3.x+)¶
Three changes landed in beacon PRs #108, #109, and #110.
IPv6 rate-limiter fix (PR #108)¶
The beacon's per-IP rate limiter now aggregates IPv6 addresses to their /64 prefix. Previously, an attacker rotating through addresses within a single /64 block got a fresh rate-limit bucket per address, effectively bypassing the limiter. IPv4 per-IP behaviour is unchanged.
The rate-limit map is also capped at 10,000 entries with oldest-entry eviction, preventing unbounded memory growth under sustained attack from diverse source IPs.
Operator action: none. The fix is automatic.
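To illustrate the aggregation, the sketch below maps a client address to a bucket key: IPv4 addresses keep one bucket each, while IPv6 addresses collapse to their /64 prefix. The key format and the name rate_limit_bucket are assumptions for illustration; the beacon's internal representation is not shown in this document, and IPv4-mapped IPv6 addresses are not handled.

```shell
# Map a client address to its rate-limit bucket key.
rate_limit_bucket() {
  case "$1" in
    *:*) ;;                                  # IPv6: handled by awk below
    *) printf '%s\n' "$1"; return ;;         # IPv4: one bucket per address
  esac
  printf '%s\n' "$1" | awk '{
    nh = split(tolower($0), half, "::")      # a valid address has at most one "::"
    nl = split(half[1], l, ":")
    nr = (nh == 2) ? split(half[2], r, ":") : 0
    for (i = 1; i <= nl; i++) g[i] = l[i]
    for (i = nl + 1; i <= 8 - nr; i++) g[i] = "0"   # groups elided by "::"
    for (i = 1; i <= nr; i++) g[8 - nr + i] = r[i]
    out = ""
    for (i = 1; i <= 4; i++) {               # first four groups = /64 prefix
      sub(/^0+/, "", g[i]); if (g[i] == "") g[i] = "0"
      out = out g[i] ":"
    }
    print out ":/64"
  }'
}
```

Two addresses in the same /64, such as 2001:db8:0:1::2 and 2001:db8:0:1::ffff, produce the same key and therefore share one bucket.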
Admin drain endpoint now loopback-only (PR #109)¶
The /admin/drain endpoint is now served on a dedicated loopback-only listener when the health server binds to a non-loopback address. Previously, /admin/drain was protected only by a RemoteAddr check, which can be spoofed by a reverse proxy. Now the network layer enforces access: the dedicated listener binds to 127.0.0.1 on the same port as the health server, so only processes running on the beacon host can reach it.
When the health server is already bound to loopback (the default 127.0.0.1:8080), /admin/drain remains on the health mux as before -- the loopback bind already provides the same guarantee.
Operator action: if you call /admin/drain from a remote host (e.g. a load-balancer health check or an external orchestrator), that will no longer work. Move your drain invocation to run on the beacon host itself (e.g. via SSH, a sidecar, or a pre-stop lifecycle hook). Calling drain from outside the beacon host was never recommended -- it exposed an unauthenticated state-mutation endpoint to the network.
TCP keepalive before TLS handshake (PR #110)¶
TCP keepalive probes are now applied immediately after Accept, before the TLS handshake begins. Previously, keepalive was applied only after a successful handshake, which meant half-open connections that stalled during the TLS handshake could accumulate without being cleaned up by keepalive probes.
Operator action: none. The fix is automatic and prevents half-open connection accumulation under adverse network conditions.
E2: On-prem bridge vault TLS dual-flag requirement (onprem v0.6.x+)¶
PR #223 adds a paired-flag guard to the bridge's vault TLS skip-verify behaviour. Previously, setting --vault-skip-verify=true (or VAULT_SKIP_VERIFY=true in the environment) alone was sufficient to disable vault TLS certificate verification. A single stale environment variable copied from a demo .env file could silently disable TLS verification in production.
Now, both --vault-skip-verify=true and --vault-allow-insecure=true must be set. If only --vault-skip-verify is set, the bridge refuses to start with a clear error.
The same guard applies to the environment variable form:
# Old (no longer works alone):
VAULT_SKIP_VERIFY=true
# New (both required):
VAULT_SKIP_VERIFY=true
CLOUDTASER_BRIDGE_VAULT_ALLOW_INSECURE=true
Additionally, the backup/restore scripts and the enroll server now default to https://127.0.0.1:8200 instead of http://127.0.0.1:30200. The admin/root token was previously transmitted in cleartext over loopback.
Operator action for production deployments with proper TLS certificates: none. The flags default to false and proper TLS verification is the default path.
Operator action for development setups using --vault-skip-verify: add --vault-allow-insecure=true alongside --vault-skip-verify=true. This is intentional friction -- two flags must agree before TLS verification is disabled, preventing a single stale env var from degrading security.
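A stale skip-verify flag can be caught before the bridge is restarted by linting the env file. The sketch below assumes a plain KEY=value file (optionally with export); dual_flag_ok is a hypothetical helper, and the variable names are passed in so the same check can cover other components.

```shell
# Exit 0 when the env file satisfies the dual-flag guard: either the
# skip-verify variable is not "true", or the allow-insecure variable is
# also "true". Exit non-zero when skip-verify is set alone.
dual_flag_ok() {
  env_file=$1 skip_var=$2 allow_var=$3
  if grep -Eq "^(export +)?${skip_var}=\"?true\"?[[:space:]]*\$" "$env_file"; then
    grep -Eq "^(export +)?${allow_var}=\"?true\"?[[:space:]]*\$" "$env_file"
  else
    return 0
  fi
}
```

For the bridge, call dual_flag_ok .env VAULT_SKIP_VERIFY CLOUDTASER_BRIDGE_VAULT_ALLOW_INSECURE. The S3 proxy and DB proxy pairs (CLOUDTASER_S3PROXY_VAULT_SKIP_VERIFY and CLOUDTASER_DBPROXY_VAULT_SKIP_VERIFY with their ALLOW_INSECURE counterparts) can be checked the same way.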
E3: S3 proxy security defaults (s3-proxy v0.3.x+)¶
PR #62 hardens three areas:
Dual-flag vault TLS guard¶
Same pattern as the bridge (E2 above). CLOUDTASER_S3PROXY_VAULT_SKIP_VERIFY=true now requires CLOUDTASER_S3PROXY_VAULT_ALLOW_INSECURE=true. The proxy refuses to start if only the skip-verify flag is set.
Health and metrics servers bound to loopback¶
The health server (/healthz, /readyz) default address changed from :8191 to 127.0.0.1:8191. The Prometheus metrics server default changed from :9090 to 127.0.0.1:9090. Both are now loopback-only by default, preventing exposure of health and metrics endpoints to the pod network.
If you need to expose these endpoints to Kubernetes probes or a ServiceMonitor, override the addresses explicitly:
env:
- name: CLOUDTASER_S3PROXY_HEALTH_ADDR
value: "0.0.0.0:8191" # needed if kubelet probes target the pod IP
- name: CLOUDTASER_S3PROXY_METRICS_ADDR
value: "0.0.0.0:9090" # needed if Prometheus scrapes the pod IP
Kubelet probes may need updating
If your liveness/readiness probes target the pod IP (the default for most Helm-deployed pods), they cannot reach a loopback-bound health server. Either override CLOUDTASER_S3PROXY_HEALTH_ADDR to bind to all interfaces, or use an exec probe that curls 127.0.0.1:8191/healthz from inside the container.
TLS 1.3 minimum for vault connections¶
All TLS connections from the S3 proxy to vault now enforce TLS 1.3 as the minimum version. Previously, the minimum was left at the Go default (TLS 1.2 for clients since Go 1.18, TLS 1.0 on older toolchains). This closes the window for protocol-downgrade attacks on the proxy-to-vault path.
Operator action: ensure your vault server supports TLS 1.3. All OpenBao versions and HashiCorp Vault 1.12+ support TLS 1.3 by default. If your vault is behind a TLS-terminating proxy, verify that the proxy negotiates TLS 1.3.
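One way to verify this ahead of the upgrade is to force a TLS 1.3 handshake with openssl s_client (the -tls1_3 option needs OpenSSL 1.1.1 or newer). The wrapper below is a sketch; supports_tls13 and the OPENSSL override are hypothetical names, and the check covers protocol negotiation only, not certificate validity.

```shell
# Return 0 when a TLS 1.3 handshake with host:port succeeds.
# OPENSSL is a hypothetical override so the binary can be stubbed.
supports_tls13() {
  "${OPENSSL:-openssl}" s_client -tls1_3 -connect "$1" </dev/null >/dev/null 2>&1
}
```

Example: supports_tls13 vault.example.internal:8200 && echo 'TLS 1.3 ok', where the hostname is a placeholder for your vault address or the TLS-terminating proxy in front of it.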
E4: DB proxy security defaults (db-proxy v0.2.x+)¶
PR #52 applies the same three hardening patterns as the S3 proxy, plus a fail-closed encryption guard:
Fail-closed on vault transit outage¶
The DB proxy now rejects queries when encryption fails (vault unreachable, transit key missing, auth expired), instead of silently forwarding plaintext to the upstream database. This is the fail-closed default (CLOUDTASER_DBPROXY_ENCRYPT_FAIL_MODE=fail-closed).
For mid-migration scenarios where some queries must continue during vault unavailability, set CLOUDTASER_DBPROXY_ENCRYPT_FAIL_MODE=fail-open explicitly. This preserves the old behaviour but must be an intentional opt-in.
fail-open leaks plaintext
fail-open mode means a vault outage silently passes plaintext to the cloud-provider-hosted database. Use it only during migration windows and revert to fail-closed (the default) as soon as migration is complete.
Dual-flag vault TLS guard¶
Same pattern as E2 and E3. CLOUDTASER_DBPROXY_VAULT_SKIP_VERIFY=true requires CLOUDTASER_DBPROXY_VAULT_ALLOW_INSECURE=true.
Health server bound to loopback¶
Default health address changed from :9091 to 127.0.0.1:9091. Same considerations as E3 -- override if kubelet probes target the pod IP.
TLS 1.3 minimum for vault and database connections¶
All TLS connections (vault client, PostgreSQL upstream, MySQL upstream) now enforce TLS 1.3 minimum.
E-series summary¶
| Change | Component | Breaking? | Action required |
|---|---|---|---|
| IPv6 /64 rate-limiter aggregation | Beacon | No | None |
| Admin drain loopback-only | Beacon | Yes, if called remotely | Move drain calls to beacon host |
| TCP keepalive before TLS | Beacon | No | None |
| Vault TLS dual-flag guard | Bridge (onprem) | Yes, if using --vault-skip-verify alone | Add --vault-allow-insecure=true |
| Vault TLS dual-flag guard | S3 proxy | Yes, if using skip-verify alone | Add CLOUDTASER_S3PROXY_VAULT_ALLOW_INSECURE=true |
| Vault TLS dual-flag guard | DB proxy | Yes, if using skip-verify alone | Add CLOUDTASER_DBPROXY_VAULT_ALLOW_INSECURE=true |
| Health/metrics loopback bind | S3 proxy | Yes, if probing via pod IP | Override health/metrics addr to 0.0.0.0 |
| Health loopback bind | DB proxy | Yes, if probing via pod IP | Override health addr to 0.0.0.0 |
| TLS 1.3 minimum | S3 proxy, DB proxy | Possibly, if vault < TLS 1.3 | Verify vault supports TLS 1.3 |
| Fail-closed encryption | DB proxy | Yes, if relying on silent plaintext fallback | Set fail-open explicitly during migration, or fix vault connectivity |
| Backup/restore HTTPS default | Bridge (onprem) | No (loopback only) | None |
Section F: May 2026 -- Wrapper health server hardening and crash-loop prevention¶
The wrapper health server received three hardening changes and a crash-loop prevention fix. All changes are backward-compatible with no operator action required for standard deployments.
F1: Health server timeouts and body limits (wrapper v0.2.12+, PR #213)¶
The health server now enforces server-side timeouts to prevent slowloris and connection-holding attacks:
| Setting | Previous | New |
|---|---|---|
| ReadTimeout | Not set (unlimited) | 30s |
| WriteTimeout | Not set (unlimited) | 30s |
| IdleTimeout | Not set (unlimited) | 120s |
| MaxHeaderBytes | Not set (Go default 1MB) | 64 KB |
| Unseal body limit | 256 KB | 64 KB |
| Body read deadline | Not set | 10s per request |
Operator action: none. The limits are generous for legitimate traffic. The 64 KB unseal body limit accommodates the ECDH-encrypted payload (cert chain + key + token + nonce + ciphertext) with room to spare.
F2: Per-source-IP rate limiting on /v1/unseal (wrapper v0.2.12+, PR #213)¶
The unseal rate limiter is now per-source-IP (100 req/min per IP, burst of 10) with a global cap (1000 req/min, burst of 50), replacing the previous global-only 10 req/min limiter. This prevents a co-located attacker from exhausting the rate limit budget and locking out the legitimate operator.
Rate limiting is applied after JWT authentication, so unauthenticated requests are rejected before consuming rate limit tokens.
Operator action: none. The new limits are strictly more permissive for legitimate operator traffic (the operator's single IP gets 100 req/min instead of sharing a global budget of 10 req/min).
F3: Secret-not-provisioned stay-alive behaviour (wrapper v0.2.12+, PR #209)¶
When the upstream secret has not been provisioned in vault yet, the wrapper now stays alive indefinitely with bounded-backoff retries instead of calling os.Exit(1). This prevents the CrashLoopBackOff cascade where kubelet restart backoff escalates to 5 minutes, so the pod can stay down for minutes after the secret is finally provisioned.
Health endpoint behaviour during this state:
- /healthz returns 200 (wrapper PID 1 is alive)
- /readyz returns 503 with {"ready":false,"reason":"secret_not_provisioned"}
Only configuration errors (auth failure, invalid vault path) still trigger os.Exit.
Operator action: none. This is strictly better behaviour -- pods that previously crash-looped on missing secrets now self-heal when the secret appears.
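Monitoring can use this health contract to tell a wrapper that is waiting on a secret from one that is down. The sketch below classifies a wrapper from a /healthz status code and a /readyz body that you have already fetched; readyz_reason and wrapper_state are hypothetical helpers, and only the secret_not_provisioned body shown above is taken from this document.

```shell
# Extract the "reason" field from a /readyz JSON body (prints nothing if absent).
readyz_reason() {
  printf '%s\n' "$1" | sed -n 's/.*"reason"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Classify a wrapper from its /healthz status code and /readyz body.
wrapper_state() {
  healthz_code=$1 readyz_body=$2
  if [ "$healthz_code" != "200" ]; then
    echo 'down'
  elif [ "$(readyz_reason "$readyz_body")" = "secret_not_provisioned" ]; then
    echo 'waiting-for-secret'
  else
    echo 'other'
  fi
}
```

A pod reported as waiting-for-secret needs the secret provisioned in vault, not a restart.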
F4: RLIMIT_CORE=0 and envmap format injection guard (wrapper v0.2.12+, PR #208)¶
Two additional security hardening fixes:
- RLIMIT_CORE=0 -- The wrapper sets both soft and hard core dump rlimits to zero as defense-in-depth alongside PR_SET_DUMPABLE(0). This blocks core dump generation via systemd-coredump, core_pattern pipes, or container runtimes that override the dumpable bit.
- Envmap format injection guard -- The CLOUDTASER_ENV_MAP parser now uses explicit type handling instead of Go's %v format verb. String values containing newline, carriage return, or NUL bytes are rejected to prevent environment variable injection through crafted vault values.
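Operator action: none for standard deployments. Both are transparent hardening improvements unless a vault value contains newline, carriage return, or NUL bytes; such values are now rejected and must be cleaned up in vault.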
F-series summary¶
| Change | Component | Breaking? | Action required |
|---|---|---|---|
| Server-side timeouts (Read/Write/Idle) | Wrapper | No | None |
| Unseal body limit reduced (256KB to 64KB) | Wrapper | No | None |
| Per-IP rate limiting on unseal | Wrapper | No (more permissive for legitimate traffic) | None |
| Secret-not-provisioned stay-alive | Wrapper | No (strictly better) | None |
| RLIMIT_CORE=0 defense-in-depth | Wrapper | No | None |
| Envmap format injection guard | Wrapper | Possibly, if vault values contain newline/CR/NUL | Remove injection characters from vault values |
Section G: May 2026 -- Beacon unauthenticated surface hardening (E#16)¶
Four changes harden the beacon's unauthenticated attack surface against scanner noise, gossip-based DoS, and connection exhaustion. All landed in beacon PR #114 (2026-05-15) and are included in beacon v0.4.x+.
G1: Scanner log demotion (beacon #41)¶
Ten noisy log lines triggered by port scanners and health probes (TLS handshake failures, unexpected protocol bytes, premature connection close) have been demoted from WARN to DEBUG in internal/beacon/server.go.
Operator action: if your log aggregation pipeline filters on WARN-level beacon entries to detect attacks, those scanner-triggered lines will no longer appear at WARN. Adjust your filters to include DEBUG if you relied on them, or -- preferably -- alert on the rate-limit and concurrency-cap events described below, which are a more reliable signal of sustained abuse.
G2: Gossip size cap (beacon #67)¶
ACME gossip entries (used by the beacon's distributed Let's Encrypt solver) are now capped at 8,192 bytes. Both the Store() write path and the handleUserMsg() receive path reject values exceeding the cap and return errGossipValueTooLarge.
This prevents a gossip-based DoS where a malicious or misconfigured peer floods the cluster with oversized gossip values that consume memory across all beacon replicas.
Operator action: none under normal operation. ACME certificates and challenge tokens are well within 8 KB. If you see errGossipValueTooLarge in beacon logs, a peer is sending malformed or malicious gossip -- investigate the source.
G3: Forward rate limit (beacon #74)¶
Each client IP is limited to 100 forward connections per minute. IPv6 addresses are bucketed by /64 prefix (consistent with the existing rate-limiter aggregation from Section E1). Connections exceeding the limit receive a rateLimit error and are closed.
This caps the blast radius of a single-source connection flood. The 100/min default is generous for legitimate bridge and broker traffic (a single cluster typically maintains 1-3 persistent connections).
Operator action: none for standard deployments. If you operate a beacon serving a very large number of clusters behind a shared NAT (where many distinct clusters share the same source IP), the 100/min limit may be reached. This is an unlikely topology for production -- each cluster's broker maintains a small number of long-lived connections, not bursts of short-lived ones. If you encounter rateLimit errors in the beacon log for legitimate traffic, contact the cloudtaser team to discuss raising the default.
G4: Concurrency cap (beacon #79)¶
The beacon enforces a maximum of 1,000 simultaneous inner forwarded connections via a buffered channel semaphore in NewForwardServer. When the cap is reached, new connections block (they are not dropped) until a slot frees up. This is a backpressure mechanism, not a hard reject -- legitimate connections that arrive during a spike will proceed once capacity is available.
The cap prevents unbounded goroutine and file-descriptor growth under sustained connection floods, protecting the beacon process from OOM kills and file-descriptor exhaustion.
Operator action: none for standard deployments. A single bridge-broker pair uses a small number of forwarded connections. If you host a single beacon for hundreds of clusters and observe sustained blocking (visible as elevated connection latency in beacon metrics), increase the beacon replica count or deploy per-region beacon instances.
G-series summary¶
| Change | Issue | Breaking? | Action required |
|---|---|---|---|
| Scanner log demotion (WARN to DEBUG) | #41 | No (log level only) | Adjust WARN-based log filters if needed |
| Gossip size cap (8,192 bytes) | #67 | No | None |
| Forward rate limit (100/min per IP, IPv6 /64) | #74 | Possibly, behind shared NAT with many clusters | None for standard deployments |
| Concurrency cap (1,000 simultaneous connections) | #79 | No (blocks, not drops) | None for standard deployments |
Section H: May 2026 -- Operator broker auth hardening (E#13)¶
Five security fixes in operator PRs #418 and #441 harden the broker's HTTP surface. Three of the five changes are breaking for automation that interacts with the broker directly; the other two are transparent improvements.
H1: /v1/bridge/status now requires Bearer token authentication (operator v0.12.x+, PR #418)¶
The /v1/bridge/status endpoint previously returned bridge connectivity state to any unauthenticated in-cluster caller. This leaked the beacon address and connection status to any pod that could reach the operator's broker port.
The endpoint now requires a valid Authorization: Bearer <token> header. The token is the same operator auth token used for other authenticated broker endpoints (/v1/bridge/init, /v1/proxy/admin).
Responses without a valid token:
| Condition | HTTP status | Response body |
|---|---|---|
| Missing or malformed Authorization header | 401 | {"error":"unauthorized"} |
| Invalid bearer token | 401 | {"error":"unauthorized"} |
| No auth token configured on the operator (fail-closed) | 403 | {"error":"broker auth not configured"} |
Operator action: if you have automation or monitoring that polls /v1/bridge/status, add the Authorization: Bearer <token> header. The token is the same one used for /v1/bridge/init. Example:
curl -s -H "Authorization: Bearer ${OPERATOR_AUTH_TOKEN}" \
https://cloudtaser-operator.cloudtaser-system:8443/v1/bridge/status
If you only interact with the operator through the CLI (cloudtaser-cli target status), no action is required -- the CLI handles authentication automatically.
H2: Unseal token format validation (operator v0.12.x+, PR #418)¶
The /v1/unseal endpoint now performs a lightweight format check on the vault token before forwarding it to vault. Tokens must be:
- At least 8 characters long
- Composed entirely of printable ASCII characters (0x20-0x7E)
Tokens that fail either check are rejected with HTTP 400 ({"error":"malformed token"}) and never reach vault. Previously, any string -- including empty strings, single characters, or strings containing control characters -- was forwarded to vault, which would reject them with a less informative error.
This is a fail-fast sanity check, not a cryptographic validation. Vault remains the authority for token validity. Real vault/OpenBao tokens (prefixed s., hvs., or raw UUID format) are 26+ characters and always pass this check.
Operator action: none for standard deployments. The operator's unseal flow sends properly formatted tokens. This change only affects tooling that sends hand-crafted requests to /v1/unseal with malformed tokens.
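Tooling that builds unseal requests by hand can apply the same two rules locally before sending anything. This is a sketch of the documented format rule only (length of at least 8, printable ASCII 0x20-0x7E); token_format_ok is a hypothetical helper and does not reflect the operator's actual code.

```shell
# Return 0 when the token would pass the documented format check.
# The function body runs in a subshell so LC_ALL and exit stay local.
token_format_ok() (
  LC_ALL=C
  [ "${#1}" -ge 8 ] || exit 1
  case "$1" in
    *[!\ -~]*) exit 1 ;;   # any byte outside space (0x20) .. tilde (0x7E)
  esac
  exit 0
)
```

Passing this check says nothing about validity; vault remains the authority on whether the token is accepted.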
H3: Error response sanitization (operator v0.12.x+, PR #441)¶
Vault error responses returned through the broker's CreateChildToken path are now sanitized before being returned to callers. Previously, vault error details -- which can contain accessor IDs, policy names, and cluster topology information -- were forwarded to callers. In a US-cloud deployment, these details could propagate to US-hosted SIEMs and logging pipelines, undermining the data sovereignty guarantee.
The caller now receives a generic error that carries the HTTP status code only.
The full vault error is logged at Debug level on the operator side, accessible via kubectl logs for operators who need to diagnose the root cause.
Operator action: if your automation parses vault error messages returned by the broker, update it to handle the generic format. For debugging, check the operator's Debug-level logs:
kubectl logs -n cloudtaser-system -l app.kubernetes.io/name=cloudtaser-operator \
--tail=200 | grep "sanitizedError"
H4: JWT scrubbing from response bodies (operator v0.12.x+, PR #441)¶
The broker's /v1/proxy/secret handler now scrubs the calling pod's service-account JWT from all response bodies before writing them to the wire. A malicious or misconfigured bridge could echo the pod's JWT back in an error message; any in-cluster caller that can read the broker's response (the secret proxy does not use bearer-token auth -- it authenticates via the JWT in the request body) would then obtain the pod's service-account token.
Scrubbing replaces any occurrence of the full JWT string with [scrubbed-jwt] in both the structured response and serialized JSON output.
Operator action: none. This is a transparent defense-in-depth improvement. If you see [scrubbed-jwt] in broker responses during debugging, it means the bridge echoed a JWT that was correctly stripped.
H5: Dynamic-linker environment variable stripping at admission (operator v0.12.x+, PR #441)¶
The mutating admission webhook now strips dangerous dynamic-linker environment variables from all containers in injected pods before the pod is created. The following prefixes are removed:
| Prefix | Risk |
|---|---|
| LD_* | Arbitrary shared-library injection via LD_PRELOAD, LD_LIBRARY_PATH |
| GLIBC_TUNABLES | glibc tunable abuse (CVE-2023-4911) |
| GCONV_PATH | Character-set converter path hijack |
| MALLOC_* | Heap allocator configuration manipulation |
| NSS_* | Name Service Switch module hijack |
| OPENSSL_* | OpenSSL configuration override |
Previously, the wrapper stripped these from its own child's environment at fork+exec time. However, kubectl exec-spawned shells inherit the pod-spec environment directly and never pass through the wrapper -- so a hostile LD_PRELOAD value in the pod spec persisted into debug shells. Stripping at admission closes both vectors.
The cloudtaser operator's own LD_PRELOAD=/cloudtaser/libcloudtaser.so (the getenv interposer) is not affected -- dangerous env vars are stripped from the user's container spec before the operator injects its own environment variables.
Existing pod specs with legitimate loader env vars
If your container specs set LD_LIBRARY_PATH for legitimate reasons (e.g. custom shared-library paths for native dependencies), those values will be stripped after this upgrade without any error on the pod. Move custom library paths into the container image's /etc/ld.so.conf.d/ configuration, or set LD_LIBRARY_PATH with an ENV instruction in the Dockerfile (which the webhook does not modify -- only pod-spec env entries are filtered).
Operator action: review your Deployment/StatefulSet specs for any LD_*, GLIBC_TUNABLES, GCONV_PATH, MALLOC_*, NSS_*, or OPENSSL_* environment variables. If present and intentional, relocate them to the container image. The webhook logs dropped variable names at Info level:
kubectl logs -n cloudtaser-system -l app.kubernetes.io/name=cloudtaser-operator \
--tail=200 | grep "envDropped"
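To find affected workloads before upgrading rather than from the logs afterwards, the variable names in a pod spec can be filtered against the documented list. This sketch assumes the match is case-sensitive and that GLIBC_TUNABLES and GCONV_PATH are matched as exact names; stripped_env_names is a hypothetical helper.

```shell
# Read env var names on stdin (one per line) and print only those the
# admission webhook is documented to strip.
stripped_env_names() {
  while IFS= read -r name; do
    case "$name" in
      LD_*|GLIBC_TUNABLES|GCONV_PATH|MALLOC_*|NSS_*|OPENSSL_*) printf '%s\n' "$name" ;;
    esac
  done
}
```

Feed it the names from a workload, for example with kubectl get deployment NAME -n NAMESPACE -o jsonpath='{range .spec.template.spec.containers[*].env[*]}{.name}{"\n"}{end}' piped into stripped_env_names. Any output is a variable to relocate into the container image.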
H-series summary¶
| Change | PR | Breaking? | Action required |
|---|---|---|---|
| /v1/bridge/status requires Bearer auth | #418 | Yes, if polling without auth | Add Authorization: Bearer <token> header |
| Unseal token format validation | #418 | Possibly, if sending malformed tokens | None for standard deployments |
| Vault error response sanitization | #441 | Possibly, if parsing vault error details | Parse generic status-code errors; check operator Debug logs for details |
| JWT scrubbing from responses | #441 | No | None |
| Dynamic-linker env var stripping | #441 | Yes, if pod specs set LD_* etc. intentionally | Relocate loader env vars to container image |
Section I: May 2026 -- Admin proxy hardening (E#44)¶
Twelve security fixes across four on-prem bridge PRs (#250, #251, #253, #254) harden the bridge's admin proxy, JWT authentication, enrollment flow, and Helm chart defaults. Six changes are breaking for specific configurations; the remaining six are transparent improvements.
I1: JWT issuer and audience validation (onprem v0.7.x+, PR #250)¶
When TRUST_JWT is enabled (--bridge-trust-jwt=true), the bridge now validates iss (issuer) and aud (audience) claims in addition to the JWKS signature verification. Tokens missing either claim, or presenting values that do not match the configured expectations, are rejected with a typed ErrJWTInvalidClaims error.
New flags:
| Flag | Env var | Purpose |
|---|---|---|
| --trust-jwt-issuer | CLOUDTASER_BRIDGE_TRUST_JWT_ISSUER | Expected iss claim value |
| --trust-jwt-audience | CLOUDTASER_BRIDGE_TRUST_JWT_AUDIENCE | Expected aud claim value |
Not-before (nbf) validation is also enforced: tokens presented before their nbf timestamp are rejected with ErrJWTNotYetValid.
When the issuer and audience flags are left empty (the default), claim validation is skipped and only signature verification is performed -- backward compatible with existing deployments.
Operator action: if you use --bridge-trust-jwt=true with a custom JWKS issuer, ensure your tokens set both iss and aud claims to match the values you pass to --trust-jwt-issuer and --trust-jwt-audience. Tokens that previously passed signature-only validation will now be rejected if the claims do not match.
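To confirm what a token actually carries before pointing the bridge at it, you can decode its claims locally. This is a rough sketch under stated assumptions: jwt_payload is a hypothetical helper, it relies on a `base64` binary that accepts `-d`, and it does not verify the signature. It only prints the claims JSON so you can compare iss, aud, and nbf against your flag values.

```shell
# jwt_payload (hypothetical helper): print the decoded claims of a compact JWT.
# This does NOT verify the signature; it is only for inspecting iss/aud/nbf.
jwt_payload() {
  # Take the middle segment and map base64url characters back to base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits '=' padding; restore it so the decoder accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```

Run it as `jwt_payload "$TOKEN"` and check that the printed iss and aud values are exactly the strings passed to --trust-jwt-issuer and --trust-jwt-audience.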
I2: Trust-mode bypass protection (onprem v0.7.x+, PR #250)¶
--bridge-trust-jwt (env: TRUST_JWT) defaults to false. When it is false, the bridge validates JWT signatures using its own mTLS client certificate -- no external JWKS endpoint is needed.
Enabling trust-mode without configuring --trust-jwt-jwks now fails closed: the bridge refuses to start rather than silently accepting unsigned tokens.
Operator action: do not enable --bridge-trust-jwt=true unless you have a functioning JWKS endpoint and have set --trust-jwt-jwks, --trust-jwt-issuer, and --trust-jwt-audience. Enabling trust-mode without the JWKS URL will prevent the bridge from starting.
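The start-up rules from I1 and I2 can be mirrored in a small pre-flight check for deployment scripts. The sketch below is illustrative only: check_trust_mode is a hypothetical function, its four positional arguments stand in for the values you intend to pass to --bridge-trust-jwt, --trust-jwt-jwks, --trust-jwt-issuer, and --trust-jwt-audience, and the messages are not bridge output.

```shell
# check_trust_mode (hypothetical helper): mirror the documented start-up rules.
# Arguments: <trust true|false> <jwks-url> <issuer> <audience>
check_trust_mode() {
  trust=$1 jwks=$2 issuer=$3 audience=$4
  # Trust-mode off: the bridge needs no external JWKS endpoint.
  [ "$trust" = "true" ] || { echo "ok: trust-mode disabled"; return 0; }
  # Trust-mode on without a JWKS URL: the bridge fails closed.
  if [ -z "$jwks" ]; then
    echo "error: trust-mode enabled without --trust-jwt-jwks"
    return 1
  fi
  # Empty issuer or audience: claim validation is skipped.
  if [ -z "$issuer" ] || [ -z "$audience" ]; then
    echo "warning: issuer or audience empty, only the signature is verified"
    return 0
  fi
  echo "ok: JWKS, issuer and audience configured"
}
```

A non-zero exit status corresponds to the configuration that would keep the bridge from starting, so the check can gate a rollout script.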
I3: Fingerprint UUID validation (onprem v0.7.x+, PR #251)¶
All admin operations that accept a cluster fingerprint (/enroll, /wipe, /init, /token) now validate that the fingerprint is a well-formed UUID. Cluster fingerprints are always UUIDs -- they are derived from the kube-system namespace UID. Non-UUID values are rejected with HTTP 400 before reaching vault.
This closes a path-traversal vector: a crafted fingerprint containing ../ could previously reach vault paths outside the intended cloudtaser/data/clusters/<fingerprint>/ prefix.
Operator action: if your automation sends non-UUID fingerprints (unlikely in standard deployments -- the CLI always sends the real kube-system UID), those requests will now fail with HTTP 400. Ensure all fingerprints are valid UUIDs.
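Automation that builds fingerprints itself can check their shape before calling the bridge. The following is a sketch, assuming the canonical 8-4-4-4-12 hexadecimal layout. is_uuid is a hypothetical helper, and the bridge's own validation may be stricter or looser in details such as letter case.

```shell
# is_uuid (hypothetical helper): succeed only if the argument has the
# canonical 8-4-4-4-12 hexadecimal UUID layout.
is_uuid() {
  printf '%s\n' "$1" \
    | grep -Eq '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}
```

For example, read the real value with `kubectl get namespace kube-system -o jsonpath='{.metadata.uid}'` and pass it to `is_uuid` before sending it to /enroll, /wipe, /init, or /token.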
I4: Admin proxy header allow-list (onprem v0.7.x+, PR #251)¶
The admin proxy now forwards only an explicit set of allowlisted headers to vault. All other headers -- including custom X-* headers injected by intermediate proxies, load balancers, or service meshes -- are silently dropped before the request reaches vault.
This prevents header-injection attacks where a caller sets vault-meaningful headers (X-Vault-Token, X-Vault-Namespace, X-Vault-Request) to manipulate vault behavior through the admin proxy.
Operator action: if your vault policies depend on custom headers forwarded through the admin proxy, those headers will no longer reach vault. Standard deployments that interact with vault exclusively through the bridge's API (CLI, operator) are unaffected.
I5: Vault error sanitization (onprem v0.7.x+, PR #251)¶
Bridge error responses to callers no longer include vault internal error details such as vault paths, URLs, or token prefixes. Callers receive a generic error that carries only the HTTP status code.
The full vault error is logged at Debug level on the bridge side.
Operator action: if your automation parses vault error messages returned through the admin proxy, update it to handle the generic format. For debugging, check the bridge's Debug-level logs directly.
I6: Constant-time token comparison (onprem v0.7.x+, PR #253)¶
The admin wipe endpoint's auth token comparison already used crypto/subtle.ConstantTimeCompare. This PR added comprehensive test coverage confirming the behavior: wrong tokens are rejected, empty admin tokens fail closed, and correct tokens pass through.
When no admin token is configured on the bridge, all wipe requests are rejected (fail-closed) rather than accepted.
Operator action: none. This is a defense-in-depth improvement with no behavioral change for operators using properly configured admin tokens.
I7: Enrollment token caps (onprem v0.7.x+, PR #253)¶
mintEnrollmentToken now creates vault tokens with the following restrictions:
| Property | Previous | New |
|---|---|---|
| num_uses | Unlimited | 3 |
| period | Default | 0s (non-periodic) |
| renewable | Default (yes) | false |
The 3-use cap covers: delivery to the wrapper, one retry, and one audit/verification call. This limits the blast radius if an enrollment token leaks during mTLS transit.
Operator action: if you have automation that reuses enrollment tokens for multiple enrollments or renews them, update it to request a fresh token per enrollment. Single-use tokens that were valid before are still valid (1 < 3), but tokens that were reused across many enrollments will exhaust the 3-use cap.
I8: Initialized marker fingerprint scoping and hard-fail (onprem v0.7.x+, PR #253)¶
The /initialized marker path is scoped to the cluster fingerprint: cloudtaser/data/clusters/<fingerprint>/initialized. This was already the case in practice, but is now enforced with an AI-CONTRACT marker confirming the design intent.
The behavioral change: marker write failures now hard-fail. When the bridge cannot write the initialized marker to vault (network error, permission denied), it clears the cert material from the response body before returning it to the caller. This preserves one-time-use replay protection -- previously, a vault 403 on the marker write was silently swallowed, allowing the same init payload to be replayed.
Operator action: ensure vault is reachable and the bridge's vault token has write access to the marker path during the initialization flow. If the marker write fails, the bridge returns a response without cert material; the caller must retry the entire init flow.
I9: Dev-mode default false (onprem v0.7.x+, PR #253)¶
DEV_MODE (env: CLOUDTASER_BRIDGE_DEV_MODE) defaults to false in the ClientConfig zero value and in the CLI flag declaration. Dev mode was never intended for production; this change ensures that misconfigured deployments (accidentally setting DEV_MODE=1 in a production .env file) fail closed rather than bypassing security checks.
validateSecurityFlags enforces mutual exclusion: DevMode=true cannot be combined with EnrollAdminToken (the production auth mechanism).
Operator action: none for production deployments (the default is already correct). If you run the bridge in dev mode for local testing, the flag continues to work -- you must set it explicitly.
I10: X-Forwarded-For injection protection (onprem v0.7.x+, PR #253)¶
extractClientIP now validates all values from X-Forwarded-For and CF-Connecting-IP headers with net.ParseIP. Invalid values -- including path-traversal payloads, UUIDs, and injection attempts -- are rejected, and the function falls back to the raw TCP RemoteAddr.
This prevents per-IP rate limit bypass via header spoofing: an attacker who could set X-Forwarded-For: <unique-value> on each request previously got a fresh rate-limit bucket per forged IP.
Operator action: none. Legitimate proxy chains that set valid IP addresses in X-Forwarded-For are unaffected.
I11: Wildcard service account rejection (onprem v0.7.x+, PR #254)¶
boundServiceAccountNames and boundServiceAccountNamespaces in the cloudtaser-openbao Helm chart (values.yaml) now default to explicit values instead of the wildcard "*":
| Field | Previous default | New default |
|---|---|---|
| boundServiceAccountNames | "*" | cloudtaser-wrapper |
| boundServiceAccountNamespaces | "*" | default |
The Go code already rejected "*" at runtime (since PR #228), so this change closes the chart-level gap -- preventing the wildcard from reaching vault role configuration if the runtime guard were ever bypassed.
Operator action: if you use non-default service account names or namespaces for your wrapper pods, update the boundServiceAccountNames and boundServiceAccountNamespaces values in your values.yaml before upgrading. The wildcard "*" is no longer accepted at either the chart or runtime level.
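A quick scan of your values file can catch a leftover wildcard before the upgrade. This is a minimal sketch: find_wildcard_sa is a hypothetical helper that reads YAML on stdin and handles only the two common layouts, an inline scalar and a simple dash list directly under the key. Nested or templated values are not covered.

```shell
# find_wildcard_sa (hypothetical helper): print "<line>: <text>" for every
# boundServiceAccountNames / boundServiceAccountNamespaces entry containing '*'.
find_wildcard_sa() {
  awk '
    # The key itself, possibly with an inline value such as "*".
    /^[ \t]*boundServiceAccount(Names|Namespaces):/ {
      in_key = 1
      if ($0 ~ /[*]/) print NR ": " $0
      next
    }
    # List items that directly follow one of the two keys.
    in_key && /^[ \t]*-/ {
      if ($0 ~ /[*]/) print NR ": " $0
      next
    }
    # Any other line ends the current key.
    { in_key = 0 }
  '
}
```

Run it as `find_wildcard_sa < values.yaml`. Any output marks a line to replace with explicit service account names or namespaces.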
I12: Anonymous enrollment prevention (onprem v0.7.x+, PR #254)¶
The /token endpoint now verifies that the cluster fingerprint is registered in vault (cloudtaser/data/clusters/<fingerprint>/registered) before minting an enrollment token. Unregistered clusters receive HTTP 403.
Previously, calling /token before /enroll would silently succeed -- minting a scoped token for a cluster that vault had no record of. This allowed anonymous enrollment: any caller with network access to the bridge could obtain a valid vault token without going through the registration flow.
Operator action: ensure your clusters are enrolled (/enroll) before requesting tokens (/token). The standard CLI flow (cloudtaser-cli target install) already does this in the correct order. Automation that called /token directly without prior enrollment will now fail with HTTP 403.
I-series summary¶
| Change | PR | Breaking? | Action required |
|---|---|---|---|
| JWT issuer + audience validation | #250 | Yes, if using trust-mode with custom JWKS | Set --trust-jwt-issuer and --trust-jwt-audience |
| Trust-mode bypass protection | #250 | Yes, if enabling trust-mode without JWKS URL | Configure --trust-jwt-jwks before enabling trust-mode |
| Fingerprint UUID validation | #251 | Possibly, if sending non-UUID fingerprints | Ensure fingerprints are valid UUIDs |
| Admin proxy header allow-list | #251 | Yes, if vault policies depend on custom headers | Move vault-meaningful logic off custom proxy headers |
| Vault error sanitization | #251 | Possibly, if parsing vault error details | Check bridge Debug logs for full errors |
| Constant-time token comparison | #253 | No | None |
| Enrollment token caps (num_uses=3) | #253 | Yes, if reusing enrollment tokens | Request a fresh token per enrollment |
| Initialized marker hard-fail | #253 | Possibly, if vault unreachable during init | Ensure vault reachability during init flow |
| Dev-mode default false | #253 | No (was already the practical default) | None |
| X-Forwarded-For injection protection | #253 | No | None |
| Wildcard service account rejection | #254 | Yes, if using non-default SA names | Update boundServiceAccountNames / boundServiceAccountNamespaces |
| Anonymous enrollment prevention | #254 | Yes, if calling /token before /enroll | Enroll clusters before requesting tokens |
Prerequisites¶
- helm v3.x installed
- kubectl access to the target cluster
- The cloudtaser Helm chart repository configured
Upgrade Strategy¶
cloudtaser consists of four independently-versioned components deployed by a single Helm chart:
| Component | Chart Value | Current Default |
|---|---|---|
| Operator | operator.image.tag | v0.6.9 |
| Wrapper | wrapper.image.tag | v0.1.6 |
| eBPF Agent | ebpf.image.tag | v0.1.30 |
| S3 Proxy | s3proxy.image.tag | v0.2.13 |
The Helm chart version (currently 0.4.77) tracks independently from component versions. Upgrading the chart may update one or more component images.
Zero-Downtime Upgrade¶
Step 1: Check current versions¶
helm list -n cloudtaser-system
kubectl get deployment -n cloudtaser-system cloudtaser-operator \
-o jsonpath='{.spec.template.spec.containers[0].image}'
kubectl get daemonset -n cloudtaser-system cloudtaser-ebpf \
-o jsonpath='{.spec.template.spec.containers[0].image}'
Step 2: Review the new chart version¶
# Check available chart versions
helm search repo cloudtaser --versions
# Diff the values between current and new chart
helm diff upgrade cloudtaser cloudtaser/cloudtaser \
-n cloudtaser-system \
-f values.yaml
Install the helm-diff plugin
The helm diff command above requires the helm-diff plugin. Install it with helm plugin install https://github.com/databus23/helm-diff -- it shows the exact Kubernetes resource diffs before applying.
Step 3: Upgrade the Helm release¶
helm upgrade cloudtaser cloudtaser/cloudtaser \
-n cloudtaser-system \
-f values.yaml \
--version <target-chart-version>
Step 4: Verify the upgrade¶
# Check operator is running
kubectl rollout status deployment/cloudtaser-operator -n cloudtaser-system
# Check eBPF daemonset is running on all nodes
kubectl rollout status daemonset/cloudtaser-ebpf -n cloudtaser-system
# Verify the webhook configuration is registered
kubectl get mutatingwebhookconfiguration cloudtaser-operator-webhook -o yaml | head -20
Component Upgrade Order¶
When upgrading individual components (setting specific image tags), follow this order:
- eBPF agent first -- the agent is backward compatible with older wrapper versions. Upgrading it first ensures enforcement is active during the transition.
- Operator second -- the operator generates injection patches. A new operator version may inject new annotations or environment variables that the wrapper needs.
- Wrapper last -- wrapper upgrades require pod restarts (the wrapper binary is copied via the init container). The new wrapper image is used on the next pod creation.
Wrapper upgrades require pod restarts
The wrapper binary is copied into each pod's emptyDir volume by an init container at pod creation time. Existing running pods continue using the old wrapper binary until they are restarted. To roll out a new wrapper version:
# Restart all deployments with cloudtaser injection
kubectl get deployments --all-namespaces \
-o jsonpath='{range .items[?(@.spec.template.metadata.annotations.cloudtaser\.io/inject=="true")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | \
while read dep; do
kubectl rollout restart deployment "${dep##*/}" -n "${dep%%/*}"
done
CRD Upgrades¶
The Helm chart includes CRDs for CloudTaserConfig and SecretMapping. Helm does not upgrade CRDs automatically after initial install.
To upgrade CRDs:
# Apply CRDs from the new chart version
kubectl apply -f https://raw.githubusercontent.com/cloudtaser/cloudtaser-helm/main/charts/cloudtaser/crds/api.cloudtaser.io_cloudtaserconfigs.yaml
kubectl apply -f https://raw.githubusercontent.com/cloudtaser/cloudtaser-helm/main/charts/cloudtaser/crds/api.cloudtaser.io_secretmappings.yaml
Apply CRDs before upgrading the Helm release
If the new operator version references new CRD fields, apply the CRDs first. Otherwise the operator may fail to start because the API server rejects unknown fields.
Certificate Rotation During Upgrade¶
The operator manages its own webhook TLS certificates. During an upgrade:
- Certificates are stored in a Kubernetes Secret (cloudtaser-operator-certs). All operator replicas share the same certificate.
- The operator checks certificate expiry every 24 hours and rotates certificates 30 days before expiry.
- On startup, the operator patches the MutatingWebhookConfiguration and ValidatingWebhookConfiguration with the current CA bundle.
- Broker mTLS certificates (stored in cloudtaser-broker-tls) follow the same rotation schedule.
No manual certificate action is needed during upgrades. The new operator pod reads the existing certificate Secret and patches the webhook configurations automatically.
Rollback¶
Helm rollback¶
# List revision history
helm history cloudtaser -n cloudtaser-system
# Roll back to the previous revision
helm rollback cloudtaser -n cloudtaser-system
# Roll back to a specific revision
helm rollback cloudtaser <revision-number> -n cloudtaser-system
Verify rollback¶
kubectl rollout status deployment/cloudtaser-operator -n cloudtaser-system
kubectl rollout status daemonset/cloudtaser-ebpf -n cloudtaser-system
Rollback considerations¶
| Scenario | Impact | Action |
|---|---|---|
| Operator rollback | New pods will be injected with the previous operator logic. Existing pods are unaffected. | Restart affected deployments if needed |
| eBPF rollback | Previous enforcement behavior is restored on all nodes | No pod restart needed |
| Wrapper rollback | Running pods keep the current wrapper. New pods get the old wrapper. | Restart deployments to pick up old wrapper binary |
| CRD rollback | CRDs are not managed by Helm rollback | Manually reapply old CRD versions if fields were removed |
CRDs are not rolled back by Helm
helm rollback does not revert CRD changes. If a CRD was updated with new fields, rolling back the chart leaves the new CRD in place. This is generally safe because CRDs are additive, but verify compatibility.
Database Migration Considerations¶
SaaS Control Plane¶
The cloudtaser SaaS control plane (cloudtaser-saas) uses an in-memory tenant store. There are no database migrations required when upgrading the SaaS component.
Database Proxy¶
The database proxy (cloudtaser-db-proxy) does not have its own database. It proxies PostgreSQL connections and performs transparent encryption/decryption. Upgrading the proxy is safe because:
- The encrypted value format is versioned (currently version 1). New proxy versions can always read values encrypted by older versions.
- The encryption key lives in OpenBao Transit, not in the proxy. Key continuity is guaranteed by OpenBao.
- The proxy is stateless. Restart it at any time.
However, if a new proxy version changes the encryption format:
- The new format version byte ensures old values remain readable
- New writes use the new format
- Rollback to an older proxy that does not understand the new format will fail to decrypt values written by the new version
Test proxy upgrades in staging first
Deploy the new proxy version in a staging environment and verify both reads (of existing encrypted data) and writes work correctly before upgrading production.
Canary Upgrade¶
For large clusters, consider a canary upgrade using namespace selectors:
Step 1: Deploy the new operator version to a canary namespace¶
Override the webhook namespace selector so that it targets only the canary namespace.
Step 2: Restart a test workload in the canary namespace¶
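Because injection happens at pod creation, the test workload has to be restarted to pick up the new operator's patches. A minimal sketch, assuming the workload is a Deployment and the canary namespace is named canary as in Step 3: restart_canary is a hypothetical wrapper, and the KUBECTL variable is only there so a different binary or a dry-run stub can be substituted.

```shell
# KUBECTL can be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# restart_canary (hypothetical helper): restart one Deployment and wait for
# the rollout to finish. Arguments: <deployment-name> [namespace, default canary]
restart_canary() {
  $KUBECTL rollout restart "deployment/$1" -n "${2:-canary}" &&
    $KUBECTL rollout status "deployment/$1" -n "${2:-canary}"
}
```

Call it as `restart_canary <deployment-name>`, then continue with the protection score check.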
Step 3: Verify protection score¶
cloudtaser-cli target status -n canary
cloudtaser-cli target audit --secretstore-address https://vault.eu.example.com -n canary
Step 4: Roll out to all namespaces¶
Once verified, upgrade the Helm release for the full cluster.
Emergency: Disable Injection¶
If an upgrade causes pod creation failures, disable the webhook temporarily:
# Option 1: Set failurePolicy to Ignore (pods start without injection)
kubectl patch mutatingwebhookconfiguration cloudtaser-operator-webhook \
--type='json' \
-p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
# Option 2: Delete the webhook configuration entirely (emergency only)
kubectl delete mutatingwebhookconfiguration cloudtaser-operator-webhook
Disabling the webhook removes protection
Setting failurePolicy: Ignore means new pods start without cloudtaser injection. Existing running pods continue to operate normally. Restore the webhook after resolving the issue.
After resolving the issue, restore the webhook:
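If you used Option 1, reversing the patch is usually enough. The sketch below assumes the webhook's failurePolicy was Fail before the emergency change; confirm the value your chart renders (for example with `helm get manifest cloudtaser -n cloudtaser-system`) before applying it. restore_webhook is a hypothetical wrapper, and KUBECTL exists only so the binary can be swapped for a dry run.

```shell
# KUBECTL can be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# restore_webhook (hypothetical helper): set failurePolicy back to Fail on the
# first webhook entry. "Fail" is an assumed original value; verify it first.
restore_webhook() {
  $KUBECTL patch mutatingwebhookconfiguration cloudtaser-operator-webhook \
    --type=json \
    -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'
}
```

If you used Option 2 and deleted the configuration, a patch has nothing to act on. Re-running the helm upgrade command from Step 3 should recreate it, assuming the chart manages the webhook configuration.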