Operational Readiness¶
The most honest question a platform reviewer can ask is: "If your thing breaks, what happens to my cluster?" This page answers that for each moving part of a CloudTaser deployment -- mutating webhook, wrapper, eBPF DaemonSet, beacon, OpenBao -- plus a one-paragraph backout runbook for the worst case.
It is intended to be read before a production install, referenced during an incident, and presented to an SRE or security reviewer who needs to understand CloudTaser's blast radius without reading the source.
Companion pages: Sovereign Deployment Decision Guide, Beacon Trust Model, Preview Status & Roadmap.
Preview status caveats¶
CloudTaser is in Preview. No published uptime SLA exists for the skipOPS-operated beacon. No SOC 2 Type II has been completed. See Preview Status & Roadmap for the full picture. This page documents the current operational posture, not the target GA posture.
1. Mutating webhook failure modes¶
The CloudTaser operator registers a MutatingWebhookConfiguration that intercepts Pod create requests and rewrites container entrypoints to point at the wrapper. If the webhook is unreachable during a pod create, the outcome depends on its failurePolicy.
Default and recommended settings¶
| Setting | Helm value | Default in chart 1.0.20+ | Production recommendation |
|---|---|---|---|
| `failurePolicy` | `operator.webhook.failurePolicy` | `Ignore` | `Ignore` for most clusters; `Fail` only where injection is security-critical and operator HA is guaranteed |
| `timeoutSeconds` | `operator.webhook.timeoutSeconds` | `10` | 10 is correct; lower increases the apparent failure rate, higher blocks the API server |
| `objectSelector` | `operator.webhook.objectSelector` | `matchExpressions: cloudtaser.io/inject=true` | Keep opt-in. Never opt-out-by-default in a shared cluster. |
| `namespaceSelector` | `operator.webhook.namespaceSelector` | unset (cluster-wide for opted-in pods) | Consider restricting to specific namespaces in very large clusters. |
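Taken together, the recommended settings can be expressed as a Helm values fragment. This is a sketch assuming the value paths listed in the table; check them against your chart version before applying.

```yaml
# values.yaml -- webhook settings sketch (paths as listed in the table above)
operator:
  webhook:
    failurePolicy: Ignore    # degrade protection during an operator outage, never block scheduling
    timeoutSeconds: 10       # chart default; do not raise
    objectSelector:
      matchExpressions:
        - key: cloudtaser.io/inject
          operator: In
          values: ["true"]
    # namespaceSelector deliberately unset: cluster-wide for opted-in pods
```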
What happens when the webhook is unreachable¶
| `failurePolicy` | Webhook up | Webhook down (operator pods gone, network partition) |
|---|---|---|
| `Ignore` (default) | Annotated pods are injected. | Annotated pods are created WITHOUT injection. They run as plain containers without CloudTaser protection. Cluster keeps working. |
| `Fail` | Annotated pods are injected. | Pod creation is blocked cluster-wide (or in the selected namespaces). New deployments stall until the operator recovers. |
Why the default is Ignore¶
With Ignore, a CloudTaser operator outage degrades the security posture of newly created pods but does not stop the cluster. Running pods keep running with their already-injected wrappers. This is the right trade-off for most production clusters where the wider blast radius (inability to schedule pods at all) outweighs the narrow window of running a handful of pods without protection.
Use Fail only when:
- You operate the operator with HA (≥3 replicas across zones) and have tested failover.
- An unprotected pod creation is a higher-severity event for your threat model than delayed pod scheduling.
- You have alerting on operator health that will page before the scheduler starts stalling pods.
Manually disabling the webhook (emergency)¶
If the operator is misbehaving -- emitting 500s, stuck in a crashloop, or breaking admission for reasons unrelated to your workload -- unblock the cluster first, debug second:
```shell
# Delete the webhook config. Existing pods unaffected; new pod creates pass straight to the API server.
kubectl delete mutatingwebhookconfiguration cloudtaser-operator

# Later, re-install by rerunning helm upgrade or `kubectl apply` on the operator chart.
```
With `failurePolicy: Ignore` this is rarely necessary -- the API server itself skips unreachable webhooks. With `failurePolicy: Fail` this is the one-liner that unwedges scheduling.
2. Wrapper-as-PID-1 blast radius¶
The wrapper is injected via init container and becomes PID 1 inside the main container. Your application runs as a child of the wrapper. This is the design that allows secret injection without writing to disk or env vars the provider can see -- but it also means a wrapper bug crashes the container.
What kubelet does when the wrapper crashes¶
| Scenario | kubelet behaviour |
|---|---|
| Wrapper crashes once on first start | Normal restart (restartPolicy: Always). |
| Wrapper crashes repeatedly | Exponential backoff (10s, 20s, 40s, ..., capped at 5min). Pod shows CrashLoopBackOff. |
| Wrapper blocks on OpenBao unreachability | Wrapper exits with a dedicated exit code (configurable timeout). Restart loop continues until OpenBao is back. |
| Wrapper is killed by eBPF enforcement hook | Wrapper exits cleanly; pod restarts. If the trigger is a genuine attack, this is the designed behaviour. |
kubelet does NOT "give up" -- the pod stays in CrashLoopBackOff indefinitely unless you delete it, roll the deployment back, or disable injection for it.
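To spot pods stuck in this state, standard kubectl filtering is enough; nothing below is CloudTaser-specific, and `<pod>`/`<ns>` are placeholders.

```shell
# Quick scan for pods stuck in the wrapper restart loop.
kubectl get pods -A | grep CrashLoopBackOff

# For one pod: inspect the wrapper's exit code and its final log lines.
kubectl describe pod <pod> -n <ns> | grep -A3 'Last State'
kubectl logs <pod> -n <ns> --previous
```

The exit code surfaced under `Last State` is what distinguishes an OpenBao-unreachable timeout (the dedicated exit code mentioned above) from an ordinary wrapper crash.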
Bypassing injection for a specific pod¶
Two escape hatches exist. Use the first for a hot fix, the second for systematic exclusion.
Annotation override (one pod):
Setting inject: false skips the mutating webhook for that pod even if the objectSelector would otherwise match. Use this when one specific app is misbehaving and you want to unblock it while keeping the rest of the cluster protected.
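A sketch of the override on a Deployment's pod template. The `cloudtaser.io/inject` key is the one used elsewhere on this page; verify the exact key and whether your chart matches it as an annotation or a label before relying on it.

```yaml
# Deployment pod-template sketch -- key name as used elsewhere on this page.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    metadata:
      annotations:
        cloudtaser.io/inject: "false"   # webhook skips this pod
```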
Un-inject a running pod (kubectl):
A pod that is already injected cannot be "un-injected" without recreating it. The operation is:
# 1. Scale down the deployment.
kubectl scale deploy myapp --replicas=0
# 2. Strip the injection annotation (or set it to false).
kubectl annotate deploy myapp cloudtaser.io/inject-
# 3. Scale back up. New pods are created without wrapper.
kubectl scale deploy myapp --replicas=3
Namespace-wide exclusion:
Add a label to the namespace that excludes it from the operator's namespaceSelector, or opt out explicitly with:
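A hedged sketch of the namespace-level opt-out, assuming the `cloudtaser.io/inject` key from this page is also honoured by the operator's `namespaceSelector`; confirm the label key against your chart values before using it.

```shell
# Label the namespace so the operator's namespaceSelector excludes it.
kubectl label namespace legacy-apps cloudtaser.io/inject=false --overwrite
```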
3. eBPF DaemonSet supply chain¶
The eBPF agent runs as a privileged DaemonSet, one pod per node. Privileged access to the kernel is the price of the runtime enforcement guarantee -- and that means the supply chain for the eBPF image is load-bearing. A malicious or compromised CloudTaser-owned eBPF image would be a kernel-level privilege escalation on every node it runs on.
This is the risk we take most seriously. The defences are layered and all verifiable by you without trusting our word.
What we ship to defend against this class of risk¶
| Defence | Status | How to verify |
|---|---|---|
| Cosign image signing | Shipped (operator v0.9.0) | `cosign verify --certificate-identity-regexp 'https://github.com/cloudtaser/' ghcr.io/cloudtaser/cloudtaser-ebpf:vX.Y.Z` |
| Signature enforcement at admission | Shipped -- Helm value `admission.enforceSignature` | Set to `true`; the operator rejects unsigned images in the `cloudtaser-system` namespace. |
| SBOM per release (SPDX) | Shipped -- via syft | `cosign download sbom ghcr.io/cloudtaser/cloudtaser-ebpf:vX.Y.Z` |
| Reproducible builds | In progress (target Q3 2026) | Build steps published; bit-reproducibility under CI with pinned toolchain. Track via cloudtaser-pipeline#reproducible-builds. |
| Kernel cmdline lockdown | Customer-configurable (recommended) | Add `lockdown=confidentiality` to the node kernel cmdline; see the CIS Kubernetes Benchmark for details. |
| Third-party security audit | Q2 2026 (see Preview Roadmap) | Published report, redacted form linked from the roadmap page. |
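The first and third rows of the table can be run together as a pre-deploy check. The identity regexp is the one from the table; the `--certificate-oidc-issuer` value is the standard one for GitHub Actions keyless signing and is an assumption here, not something the table states.

```shell
# Verify the keyless signature on the eBPF image (issuer flag assumed for GitHub Actions).
cosign verify \
  --certificate-identity-regexp 'https://github.com/cloudtaser/' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  ghcr.io/cloudtaser/cloudtaser-ebpf:vX.Y.Z

# Fetch and skim the SPDX SBOM attached to the same image.
cosign download sbom ghcr.io/cloudtaser/cloudtaser-ebpf:vX.Y.Z | head -n 40
```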
Recommended admission policy¶
In your cluster, enforce that only cosign-signed CloudTaser images can run:
```yaml
# Helm values
admission:
  enforceSignature: true
  signatureIdentityRegexp: "https://github.com/cloudtaser/"
```
This rejects any image that isn't signed by a GitHub Actions workflow from the CloudTaser org. If someone pushes a rogue image to ghcr.io/cloudtaser/cloudtaser-ebpf, it cannot run in your cluster.
Recommended kernel cmdline¶
For nodes running the eBPF DaemonSet, we recommend:
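Concretely, the addition to the node kernel cmdline combines the two parameters explained next:

```
lockdown=confidentiality module.sig_enforce=1
```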
`lockdown=confidentiality` blocks kernel reads that could exfiltrate CloudTaser-loaded eBPF programs. `module.sig_enforce=1` ensures no unsigned kernel modules can be loaded to interfere with eBPF.
4. Beacon SLA and failure modes¶
The skipOPS-operated beacon relay at beacon.cloudtaser.io:443 is the default connectivity layer.
Uptime target¶
- Target: 99.9% monthly uptime (~43 minutes/month allowed downtime).
- Topology: 3-node HA across EU availability zones with gossip for connection state.
- Published SLA: None today. An SLA commitment is contingent on SOC 2 Type I readiness (Q3 2026). Historical uptime statistics will be published on a status page in Q2 2026.
- Self-host alternative: see the Beacon Trust Model.
What happens when the beacon is down¶
This is the failure-mode table every SRE wants:
| Scenario | Impact on running pods | Impact on new pods | Impact on secret rotation |
|---|---|---|---|
| Beacon up, bridge up, OpenBao up | Fully operational. | Fully operational. | Normal lease renewal. |
| Beacon down (all 3 nodes), bridge up | Running pods keep working. Existing secret material is in wrapper memory; applications keep serving. | New pod secret fetches fail until the beacon recovers. Pods created during the outage stall in wrapper boot. | Lease renewals fail silently and degrade. Pods continue until leases expire at the OpenBao policy boundary. |
| Bridge down, beacon up | Running pods keep working (secrets in memory). | New pod fetches fail. | Lease renewals fail. |
| OpenBao down, bridge and beacon up | Running pods keep working. | New pod fetches fail. | Lease renewals fail. |
| Operator down, beacon and bridge up | Running pods keep working. | New pod creates succeed (with `failurePolicy: Ignore`) but WITHOUT injection. | Lease renewals for already-running pods continue. |
Key property: the in-memory secret material in running wrappers is resilient to beacon / bridge / OpenBao outage. This is by design. A brief outage at any layer does not take production down; a sustained outage degrades at the rate of lease expiry.
If the beacon outage is unacceptable¶
Self-host it. Running your own beacon eliminates CloudTaser-operated infrastructure from the critical path entirely. See the Beacon Trust Model → Self-hosting the beacon.
5. Backout procedure¶
The one-paragraph runbook you should rehearse before production install.
One-paragraph backout¶
```shell
# Remove CloudTaser from the cluster.
helm uninstall cloudtaser -n cloudtaser-system

# Delete the operator namespace.
kubectl delete namespace cloudtaser-system

# Delete cluster-scoped CloudTaser CRDs.
kubectl delete crd cloudtaserconfigs.api.cloudtaser.io \
  secretmappings.api.cloudtaser.io \
  debugreports.api.cloudtaser.io
```
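A quick verification pass after the backout commands above, using the same resource names this runbook uses:

```shell
# Webhook gone: pod creates no longer route through CloudTaser.
kubectl get mutatingwebhookconfiguration cloudtaser-operator
# Expect: Error from server (NotFound)

# CRDs gone.
kubectl get crd | grep cloudtaser.io
# Expect: no output

# eBPF programs unloaded (run on a node).
bpftool prog list | grep cloudtaser
# Expect: no output
```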
What gets left behind¶
- Running pods with an already-injected wrapper keep running. They continue to function with their in-memory secrets until they restart or the lease expires. On restart, they will be re-scheduled without injection (because the webhook is gone) and will fail fast if their application expected the wrapper-provided env.
- Annotations on your Deployments / StatefulSets. Cosmetic; they're ignored without the webhook. Clean them up at leisure.
- Secrets in your EU-hosted OpenBao. These are yours; CloudTaser does not delete them. Review whether you want to keep them, rotate them, or migrate them into whatever secret store you're using next.
- Kernel state from the eBPF DaemonSet. DaemonSet pod deletion unloads the eBPF programs; `helm uninstall` handles this. Verify with `bpftool prog list | grep cloudtaser` -- should return empty.
What's restored automatically¶
- The mutating webhook is gone, so subsequent pod creates pass through the API server unchanged.
- eBPF programs unload cleanly; no residual kernel state.
- Your application containers run exactly as their original image defines, once they restart.
Re-enabling K8s Secrets for rollback¶
If your application needs its secrets via K8s Secrets again (because you're rolling back to a pre-CloudTaser state), the usual pattern:
```shell
# 1. Recreate the K8s Secret from your EU OpenBao.
#    Flatten the KV v2 payload to KEY=VALUE lines so each key becomes a secret entry.
bao kv get -format=json secret/myapp/config \
  | jq -r '.data.data | to_entries[] | "\(.key)=\(.value)"' \
  | kubectl create secret generic myapp-config --from-env-file=/dev/stdin -n myapp

# 2. Update the Deployment to mount it.
kubectl set env deploy/myapp --from=secret/myapp-config -n myapp

# 3. Roll the pods.
kubectl rollout restart deploy/myapp -n myapp
```
This reinstates the etcd-storage exposure you were protecting against, so only do it as part of a deliberate rollback -- not as a workaround.
6. OpenBao stability¶
CloudTaser depends on an EU-hosted OpenBao (or HashiCorp Vault) as the secret source of truth. Version pinning and upgrade cadence affect your operational risk.
Why OpenBao, not Vault?¶
In August 2023 HashiCorp relicensed Vault from MPL 2.0 to the Business Source License (BSL) 1.1. OpenBao is the Linux Foundation-hosted MPL 2.0 fork, maintained by a community that includes ex-HashiCorp engineers. CloudTaser defaults to OpenBao because:
- License compatibility. BSL 1.1 has field-of-use restrictions that conflict with CloudTaser being redistributed commercially.
- Sovereignty. An EU-forward, Linux Foundation-governed project is more aligned with the sovereignty story than a US-HQ vendor with a restrictive license.
- API compatibility. OpenBao maintains HashiCorp Vault API compatibility for the subset CloudTaser depends on (KV v2, Kubernetes auth, token renewal, transit -- as of OpenBao 2.0).
Version pinning advice¶
| Component | Pin advice |
|---|---|
| OpenBao server | Pin to a specific minor version (e.g., 2.0.x). Upgrade minor versions deliberately after reading the OpenBao changelog. |
| OpenBao KV v2 schema | Stable; no pinning needed. |
| CloudTaser-side client | The bridge uses OpenBao's Go SDK; pinned per bridge release. |
Upgrade cadence¶
OpenBao currently aligns to HashiCorp Vault's feature cadence for the compat surface. Expect a minor release every 8-12 weeks. CloudTaser's bridge is tested against the three most recent OpenBao minor versions and the three most recent Vault minor versions; a support matrix is published on the platform compatibility page.
We do NOT recommend:
- Running HEAD from the OpenBao main branch in production.
- Running a HashiCorp Vault version newer than what we've tested against (listed in platform compatibility).
- Mixing OpenBao and Vault in the same CloudTaser deployment (HA cluster). Use one or the other.
Upgrade procedure¶
OpenBao upgrades are server-side and do not require CloudTaser restart, as long as the KV v2 and token renewal APIs are unchanged between versions. Standard OpenBao HA upgrade procedure applies:
- Upgrade standbys first.
- Force failover.
- Upgrade the now-standby former leader.
- Verify with `bao status` and `bao kv get secret/canary`.
The CloudTaser bridge reconnects automatically on OpenBao failover -- no manual action needed.
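The verification step above, as commands. The `secret/canary` path is this page's example path; substitute a known-good secret in your own mount.

```shell
# Confirm the cluster is unsealed and the expected version is active.
bao status

# Read a known-good canary secret through the standard KV path.
bao kv get secret/canary
```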
Related pages¶
- Sovereign Deployment Decision Guide -- substrate decisions
- Beacon Trust Model -- what CloudTaser as a company sees
- Preview Status & Roadmap -- audit timeline and honest gaps
- Upgrade & Rollback -- detailed upgrade procedures for the operator chart
- Troubleshooting -- symptom-level debugging
- Troubleshooting Decision Trees -- decision-tree debugging
- Key Rotation -- rotating bridge CA, broker cert, and OpenBao root