Database Proxy: Searchability, Encryption Modes, and What Breaks¶
cloudtaser's database proxy encrypts PII columns transparently on the wire. Unlike an S3 object -- which is an opaque blob and trivially encryptable -- a database column is queried: WHERE, ORDER BY, LIKE, JOIN, GROUP BY, range scans. Client-side column encryption breaks those operators unless you deliberately pick a mode that leaks just enough structure to keep the operator working, and every such mode carries a known weakness you must understand.
This page explains the trade-off honestly, per column, per query type, so you can decide before you configure what the downstream cost will be.
Why client-side database encryption is hard (not like S3)¶
S3 objects are opaque blobs. You encrypt the object, store the ciphertext, read the ciphertext, decrypt in the client. Done. The storage layer never needs to reason about the object's content -- it never sorts it, filters it, or joins it against another object.
A database column is different. The database engine is expected to:
- Filter --
WHERE email = '[email protected]'needs to find matching rows - Range-scan --
WHERE transaction_amount > 1000needs ordered comparison - Sort --
ORDER BY created_at DESC - Join --
JOIN orders ON users.email = orders.customer_emailneeds equality across tables - Aggregate --
GROUP BY country,SUM(amount),COUNT(DISTINCT email) - Pattern-match --
LIKE '%@example.com' - Full-text search --
MATCH ... AGAINST/tsvector
Every one of those operators needs either plaintext or a ciphertext that preserves the structural property the operator tests. "Random" ciphertext -- the cryptographically strongest option -- preserves nothing. "Deterministic" ciphertext preserves equality but leaks frequency. "Order-preserving" ciphertext preserves order but leaks so much structure that it's been shown breakable with an auxiliary dataset.
There is no free lunch. Every client-side DB encryption scheme trades cryptographic strength for some piece of structural leakage that the database engine needs to keep operating. The question is which leakage you can live with on which column.
Encryption-mode decision tree¶
You pick the mode per column, not per database. The proxy supports mixing modes in the same table.
1. Random AES-256-GCM -- the default¶
How it works: each value gets a fresh random 32-byte DEK and 12-byte nonce. The DEK is wrapped by the OpenBao Transit engine. The same plaintext produces a completely different ciphertext every time.
What it protects against: a cloud provider DBA, a CLOUD Act subpoena against the managed database, an exfiltrated snapshot, a compromised replica. None of them can recover plaintext without the EU-hosted KEK.
What it costs you:
- No equality search (
WHERE email = ?fails -- same email → different ciphertext each insert) - No range search (
WHERE transaction_amount > 1000fails -- no order relation in ciphertext) - No ordering (
ORDER BY ssnreturns ciphertext in arbitrary order) - No JOIN on encrypted columns
- No GROUP BY on encrypted columns
- No LIKE / pattern match
- No full-text search
- Only retrievable by a non-encrypted key -- primary key, foreign key, any other plaintext index
Use this mode for: sensitive columns you only ever retrieve by ID (SSN, BSN, passport number, full address, date of birth, free-text notes, medical diagnosis codes). This is the right default for the vast majority of regulated-workload PII because those fields should not be search keys anyway.
2. Deterministic AES -- opt-in per column¶
How it works: same plaintext → same ciphertext. Achieved by deriving the nonce from the plaintext (AES-SIV) or by using a stable nonce keyed from the value. The proxy ships this as blind indexes: a separate column stores HMAC-SHA256(key, plaintext) next to the encrypted column, and the proxy rewrites WHERE encrypted_col = $1 into WHERE encrypted_col_idx = HMAC-SHA256(key, $1) transparently.
What it unlocks:
- Equality search (
WHERE email_idx = HMAC(email)) - Equality JOINs (both sides use the same HMAC key)
- GROUP BY on the blind-index column
- COUNT(DISTINCT ...) on the blind-index column
What it costs you:
- Frequency analysis. If a column has a small cardinality (
gender,marital_status,country), an attacker who can see ciphertext distribution learns the frequency of each value. Matching that against population statistics is trivial (the most common ciphertext in a Dutch citizens table is "Netherlands"). Deterministic encryption on low-entropy columns is catastrophic. - Correlation across tables. If two tables both deterministically encrypt
emailunder the same key, the attacker can JOIN them without the key. That's the feature you asked for -- it's also structural leakage. - Still no ordering, no range, no LIKE, no FTS. Determinism only unlocks equality.
Use this mode for: high-entropy identifiers that are genuinely unique per row (email, phone_number, iban, tax_id). Do NOT use it for low-cardinality columns (gender, country, marital_status, status_code) -- the frequency leak is a PII disclosure in its own right.
Today the proxy exposes this as blind indexes via CLOUDTASER_DBPROXY_BLIND_INDEX_COLUMNS and CLOUDTASER_DBPROXY_BLIND_INDEX_KEY. The ciphertext column remains random AES-GCM; the companion HMAC column provides the deterministic index. This two-column approach is intentional: the random ciphertext is what the cloud provider stores at rest, while the HMAC is what the application joins on -- and the HMAC key never leaves OpenBao.
3. Order-preserving / order-revealing encryption -- NOT shipped, deliberately¶
How it works (in theory): a scheme like CryptDB's OPE or later ORE constructions produces ciphertexts that preserve either the plaintext order (OPE -- the ciphertext can be <-compared directly) or reveal the order upon query (ORE -- a comparison function takes two ciphertexts and returns </=/>).
What it would unlock: range queries (WHERE salary > 50000), ORDER BY salary, min/max aggregates.
Why cloudtaser does not ship it: every practical OPE/ORE construction has a known structural attack. Grubbs, Sekniqi, Kamara, and others (2017+) have shown that order-alone leakage, combined with a moderately-sized auxiliary plaintext dataset (a public census, a leaked breach corpus), is sufficient to recover a large fraction of the protected plaintexts. The leakage isn't just "the order" -- it's a strong signal that reconstructs the distribution. For regulated data (medical, financial, government), this is unacceptable.
Our answer for range queries: don't encrypt the column. If you genuinely need WHERE transaction_amount > 1000, transaction_amount is a low-sensitivity bucket of numeric data; put the access-control policy in the application layer (OPA, OAuth scopes, row-level RLS on the DB side) rather than relying on encryption to gatekeep it. Encrypt the account identifier, not the amount.
4. Format-preserving encryption (FPE) -- NOT currently shipped¶
What it is: encryption that preserves the format of the plaintext -- a 16-digit credit-card number encrypts to another 16-digit string, an IBAN encrypts to a string of the same length and checksum validity. Typically via FF1 or FF3-1 (NIST SP 800-38G).
Why it's useful: legacy systems with rigid schema constraints -- a CHAR(16) column for PAN, a fixed-length field for IBAN -- can store ciphertext in place without schema changes.
Why it's a trade-off: FPE is a small domain, and small-domain encryption has weaker security bounds than AES-GCM on arbitrary-length inputs. FF3-1 had to be repaired in 2017 after a practical attack on specific domain sizes. It remains secure within the parameters NIST specifies, but it's narrower than general-purpose AEAD.
Status in cloudtaser: not currently supported. The proxy stores ciphertext as bytea (PostgreSQL) or VARBINARY (MySQL), which assumes your schema can hold binary of arbitrary length. If you have a hard legacy-format constraint, file an issue on cloudtaser/cloudtaser-db-proxy and we'll scope it -- FF1 integration is straightforward; the question is whether your threat model tolerates small-domain PRP security bounds.
5. Tokenization via OpenBao lookup -- pattern, not first-class feature¶
How it works: instead of encrypting the plaintext in the DB column, replace it with an opaque token (e.g. tok_01HZ...). Store the token → plaintext mapping in OpenBao's KV engine. At read time, the application (or a lookup helper) resolves the token against OpenBao.
What it unlocks: arbitrary DB operations on the token itself -- indexes, joins, aggregation, even LIKE if the token format permits. The DB never holds the plaintext.
What it costs you:
- You can't filter on the underlying value from the DB (
WHERE plaintext_email = 'alice@...') -- the DB doesn't know the mapping. - Every read that needs plaintext roundtrips to OpenBao. For list endpoints with 100 rows, that's 100 KV lookups. Batch API helps; latency floor is bounded by OpenBao roundtrip.
- Token format matters. Using a random UUID maximises unlinkability but blocks
LIKEon token shape. Using a deterministic tokenizer (tokens are HMAC(plaintext)) reintroduces the deterministic-encryption frequency-analysis concern.
Status in cloudtaser: not currently shipped as a first-class proxy feature; described here because it's the pattern we recommend when the DB column needs to participate in queries the encryption modes above cannot support (e.g. LIKE '%foo%' on a customer-name column with a legitimate customer-service UI requirement). Implement it in the application layer against OpenBao KV; the proxy does not need to be involved.
Decision table -- common column types¶
For a typical enterprise customer record. "Mode" is the recommendation; "Why" is the threat-model justification.
| Column | Mode | Why |
|---|---|---|
id (PK) |
plaintext | Primary key, never PII. Needed for every lookup. |
email |
deterministic (blind index) | High-entropy, unique per user, needed for login-by-email. Frequency leak tolerable because cardinality ~= row count. |
password_hash |
do not encrypt -- use bcrypt/argon2 in the application | A password hash is already a one-way function. Encrypting it adds latency without adding security and breaks the auth flow. This is the #1 anti-pattern we see. |
full_name |
random AES-GCM | PII, retrieved by id. Never a search key in well-designed apps. |
date_of_birth |
random AES-GCM | PII. If you need age-range queries, store a non-encrypted age_bucket column (0-17, 18-65, 65+) -- the application computes it on insert. |
ssn / bsn / tax_id |
random AES-GCM | PII. Retrieved by id. Deterministic (blind index) acceptable ONLY if you have a legitimate WHERE ssn = ? query path and you accept the frequency-analysis risk. |
full_address |
random AES-GCM | PII. Address-matching (delivery lookup) should use a separate postcode plaintext column, not the full address. |
order_status (enum: pending/shipped/delivered/cancelled) |
plaintext | Low-cardinality operational state. Encrypting it is catastrophic for determinism (4 ciphertexts for the whole table) and breaks every dashboard query. Enforce access control at the row/RBAC level instead. |
transaction_amount |
plaintext + RBAC | Range queries and sums are core to the business function. OPE is unsafe; tokenization destroys the aggregate story. Encrypt the account_id, not the amount. |
free_text_notes |
random AES-GCM | High-entropy unstructured text. Never a search key. |
iban |
deterministic (blind index) | High-entropy unique identifier, common lookup key. Same reasoning as email. |
credit_card_pan |
random AES-GCM (or FPE if schema demands) | PCI-DSS sensitive. Retrieved for display (often only the last-4 are shown -- store a separate pan_last4 plaintext column for that). |
gender |
plaintext | Low-cardinality (typically 2-4 values). Deterministic encryption would leak frequency trivially. If the column is PII-sensitive under your regulator's definition, drop the column entirely or encrypt at the row level via RBAC. |
country_code |
plaintext | Same reasoning as gender -- cardinality <250 globally, frequency-analysis resistant only in very uniform populations. |
The pattern that keeps repeating: encrypt high-entropy per-record values retrieved by ID; do not encrypt low-entropy operational columns; do not encrypt hashes.
Worked example -- customers(id, email, full_name, dob, country, notes)¶
Starting schema:
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
email TEXT NOT NULL,
full_name TEXT,
dob DATE,
country CHAR(2),
notes TEXT
);
Application queries in production:
-- Login
SELECT id, email FROM customers WHERE email = $1;
-- Customer lookup (CS UI)
SELECT email, full_name, dob, country, notes FROM customers WHERE id = $1;
-- Dashboards
SELECT country, COUNT(*) FROM customers GROUP BY country;
Encrypted schema cloudtaser-style:
CREATE TABLE customers (
id SERIAL PRIMARY KEY, -- plaintext
email BYTEA NOT NULL, -- random AES-GCM (proxy)
email_idx BYTEA NOT NULL UNIQUE, -- HMAC-SHA256 blind index (proxy)
full_name BYTEA, -- random AES-GCM (proxy)
dob BYTEA, -- random AES-GCM (proxy)
country CHAR(2), -- plaintext (low-cardinality; dashboards)
notes BYTEA -- random AES-GCM (proxy)
);
Proxy config (sidecar env, set by the operator from annotations):
CLOUDTASER_DBPROXY_ENCRYPT_COLUMNS=customers.email,customers.full_name,customers.dob,customers.notes
CLOUDTASER_DBPROXY_BLIND_INDEX_COLUMNS=customers.email:customers.email_idx
CLOUDTASER_DBPROXY_BLIND_INDEX_KEY=customers-email-hmac
Equivalent pod annotations:
cloudtaser.io/db-proxy: "true"
cloudtaser.io/db-proxy-upstream: "customer-db.eu-west3.rds.amazonaws.com:5432"
cloudtaser.io/db-proxy-transit-key: "customers"
cloudtaser.io/db-proxy-encrypt-columns: "customers.email,customers.full_name,customers.dob,customers.notes"
cloudtaser.io/db-proxy-blind-index-columns: "customers.email:customers.email_idx"
cloudtaser.io/db-proxy-blind-index-key: "customers-email-hmac"
What the application sees: no change. Queries run as before; the proxy rewrites the WHERE email = $1 into WHERE email_idx = HMAC(key, $1) transparently and decrypts email, full_name, dob, notes on the way back.
What the cloud provider sees:
id-- plaintext integeremail-- 130+ bytes of per-row-unique ciphertext, unlinkable across rowsemail_idx-- 32-byte HMAC; enables JOIN/lookup, leaks equality (intentional)full_name,dob,notes-- ciphertext blobscountry-- plaintext"NL","DE","FR"...
The dashboard GROUP BY country still works against the plaintext column. The login flow still works via email_idx. The CS-UI lookup still works via id. The only new DB-provider-visible signal is email_idx equality, which matches the uniqueness of the email column anyway (every row has a distinct email, and therefore a distinct email_idx) -- no frequency-analysis leverage.
What breaks and how to mitigate¶
WHERE encrypted_col = 'literal'¶
Breaks under random AES-GCM. Works via blind index (deterministic HMAC) when you've configured it.
If you need this but haven't configured a blind index, the options are:
- Add a blind-index column for that specific column (see config above)
- Rewrite the query to look up by a non-encrypted key (
id,foreign key) - If the column is low-cardinality, don't encrypt it -- use RBAC/RLS instead
ORDER BY encrypted_col¶
Breaks for random AES-GCM and deterministic (HMAC output order is unrelated to plaintext order).
Options:
- Fetch rows by a non-encrypted filter, sort application-side after decryption. Works for <10k rows; does not scale to
ORDER BY ... LIMIT 10over millions. - Don't encrypt the sort column. If you need
ORDER BY created_aton a timestamp,created_atis usually not sensitive -- leave it plaintext. - Store a derived non-sensitive sort key alongside the encrypted column (e.g. an
age_bucketnext to an encrypteddob).
LIKE '%foo%'¶
Breaks entirely. No deployable client-side encryption scheme supports arbitrary substring search on encrypted text -- searchable symmetric encryption (SSE) is a live research area with impractical overhead and known leakage attacks at scale.
Options:
- Don't encrypt the column -- put an RBAC / row-level-security / application-policy layer in front instead. For customer-name search in a CS UI, this is usually the right answer: the column is visible to authenticated CS staff only.
- Store a separate sanitised "searchable" column -- e.g.
full_name_search_tokenscontaining a tokenised/normalised n-gram set, plaintext, with application-side access control. Accept that you now have two surfaces of sensitive data; you've traded crypto enforcement for RBAC enforcement. - Use tokenization (see §5 above) with a tokenizer that preserves the substring of interest -- rare, highly app-specific.
JOIN on encrypted columns¶
Works if BOTH sides use deterministic encryption (blind index) with the same HMAC key. Under random AES-GCM the ciphertexts don't line up and the join returns an empty set.
This is why the blind-index key is a named OpenBao Transit key, not a per-column random secret -- two tables joining on email_idx share CLOUDTASER_DBPROXY_BLIND_INDEX_KEY=customers-email-hmac and thus produce matching HMACs for the same email.
Indexes on encrypted columns¶
Work normally at the index level. A B-tree of ciphertext bytes is still a valid B-tree -- the engine can do point lookups by the full ciphertext value. What breaks is range and prefix queries, which the index cannot satisfy when the ciphertext ordering is unrelated to plaintext ordering.
So: CREATE INDEX ON customers(email_idx) is fast (exact-match lookups hit the index). CREATE INDEX ON customers(email) on the random-AES-GCM column is useless because you'll never look up by that ciphertext (you'd need to know the exact ciphertext, which changes on every insert for the same plaintext).
Full-text search¶
Breaks. Encrypted columns cannot feed a Postgres tsvector / MySQL FULLTEXT index in any useful way.
Our honest answer: full-text search on encrypted columns is an open research problem. The credible production answer is "don't encrypt the FTS column; add an OPA/access-policy layer in front of the query path instead, and accept that FTS results are visible to the cloud provider if they can subpoena the index." For most customer-facing search (search transactions by description, search tickets by content), the access-control story carries the sovereignty claim, not the encryption.
If the FTS column carries regulated PII and that isn't acceptable, the architectural answer is to host the FTS engine (Elasticsearch, Meilisearch, typesense) in the sovereign substrate itself -- your EU OpenBao host or a nearby sovereign VM -- and keep the managed-DB copy without the FTS columns.
Supported databases¶
As of the current proxy release:
| Database | Status | Notes |
|---|---|---|
| PostgreSQL (self-hosted, Cloud SQL, RDS, Aurora PG, Azure Database for PostgreSQL, AlloyDB) | Supported | Wire protocol v3, prepared statements, blind indexes, jackc/pgx/v5/pgproto3 for parsing |
| MySQL (self-hosted, Cloud SQL MySQL, RDS MySQL, Aurora MySQL, Azure Database for MySQL) | Supported | MySQL protocol, prepared statements; set CLOUDTASER_DBPROXY_TYPE=mysql |
| MariaDB | Not explicitly tested | Wire-compatible with MySQL protocol; should work but not in CI. File an issue if you need it certified. |
| MongoDB (self-hosted, Atlas, DocumentDB) | Not supported | MongoDB wire protocol is BSON-based, not schema-ful at the protocol level; field-level encryption requires per-collection configuration and a different approach. Roadmap item; check cloudtaser/cloudtaser-db-proxy issues. |
| DynamoDB | Not supported | AWS-proprietary API; client-side field-level encryption requires application SDK integration (AWS offers its own FLE SDK) rather than a wire-protocol proxy. |
| Cosmos DB | Not supported | Same reasoning -- proprietary API + SDK model. |
| Redis / ElastiCache / MemoryStore / Azure Cache | Not supported | Key-value protocol; the application must encrypt values before SET and decrypt after GET. A proxy is technically possible but not on our roadmap -- the redis client libraries already support this pattern. |
| Elasticsearch / OpenSearch / CloudSearch | Not supported | Search indexes are plaintext by design; see the FTS discussion above. |
| Snowflake / Redshift / BigQuery / Databricks | Not supported | Analytics warehouses. Client-side encryption before load is the customer-app responsibility. |
| Kafka topics / MSK / Event Hubs / Pub/Sub / Kinesis | Not supported | Message payloads are opaque to the broker; encrypt in the producer, decrypt in the consumer. The Kafka ecosystem has conventions (schema registry envelope encryption) outside the DB-proxy scope. |
| ClickHouse | Not supported | Columnar engine; no column-level encryption proxy story today. |
| Trino / Presto / Athena | Not supported | Query engines over underlying storage; encryption is a property of the storage layer, not the engine. |
The short version: PostgreSQL and MySQL today; MongoDB is the most-asked next target. Everything else is either a different data model (key-value, document, columnar analytics) that needs its own proxy design, or a SaaS API where client-side encryption is an application concern rather than a proxy concern.
Performance¶
Per-value overhead¶
AES-GCM is AES-NI-accelerated on every x86_64 server CPU shipped since 2010 and on ARMv8-A (ARM Cryptography Extensions) on every GCP Tau T2A / AWS Graviton / Azure Ampere Altra. Per-value encryption is in the 100-500 ns range for the sizes typical of PII columns (email, name, SSN -- tens to hundreds of bytes).
The cost that actually shows up on a profile is the OpenBao Transit roundtrip for DEK wrap/unwrap. Each INSERT or UPDATE of an encrypted column triggers one /v1/transit/encrypt/<key> call; each SELECT of N encrypted values triggers N /v1/transit/decrypt/<key> calls (batched where the proxy can detect batchability).
Realistic numbers on a warm OpenBao:
transit/encryptlatency: 1-3 ms p50, 5-10 ms p99 on a local VPC OpenBaotransit/decryptlatency: same order of magnitude- Batched decrypt of N values: amortises the HTTP roundtrip; ~10% overhead per additional value
The proxy connection-pools to OpenBao, keeps HTTP/2 streams warm, and caches the bao.Client token lifecycle. We do not cache DEKs -- every value has a fresh DEK wrapped under the KEK; caching would re-use DEKs across values and invalidate the security story.
Proxy hop overhead¶
The proxy itself adds ~1-3 ms per query on localhost sidecar (no network hop -- the app and proxy share the pod network namespace). On a separate-pod deployment (which we do not recommend) the overhead is dominated by the pod-to-pod hop.
End-to-end¶
For a typical API request that does one SELECT ... WHERE id = $1 returning 5 encrypted columns:
- Direct-to-DB baseline: ~2 ms (managed-DB p50)
- Via cloudtaser DB proxy: ~2 ms + 1-3 ms (proxy) + 2-8 ms (OpenBao batched-decrypt) = ~6-13 ms total
For bulk operations (list endpoints returning 100 rows × 5 encrypted columns), batching keeps OpenBao roundtrips to 1-2 total, so the additional overhead is approximately linear in value count at ~50-200 µs per value on top of the batch-wrap latency.
See Performance & Benchmarks for the full roadmap on benchmark coverage, cached-key-derivation optimisations, and the open work on reducing the transit/decrypt batch floor.
The one exception you will hit first -- login-by-email¶
The most common legitimate PII-column lookup in every regulated app is:
The clean answer is a blind index on email (mode 2 above). The companion email_idx BYTEA column holds HMAC-SHA256(key, email); the proxy rewrites the WHERE email = $1 into WHERE email_idx = HMAC(key, $1); the DB looks up by HMAC; the proxy decrypts the id and password_hash on the way back.
No application change, no schema gymnastics beyond adding the one extra column, and the cloud provider still only sees ciphertext + HMAC -- both unreadable without the EU-hosted Transit key.
If you can't roll blind indexes yet
Keep the email column unencrypted for the login path and encrypt the rest of the PII (SSN, full name, DOB, address, phone, card number). The sovereignty claim is weaker for email specifically but still covers the bulk of customer PII. This is the staged-rollout pattern most teams follow.
Summary¶
| If your application... | Impact | Action needed |
|---|---|---|
| Looks up records by primary or foreign key | None | None |
| Filters by date, status, type, country | None | Leave those columns plaintext (they shouldn't be encrypted anyway) |
| Retrieves PII for display | None | Proxy decrypts transparently |
| Writes PII on registration/update | None | Proxy encrypts transparently |
| Searches by email for login | Works with blind index | Add an email_idx column and configure BLIND_INDEX_COLUMNS |
| Searches by SSN/tax_id/IBAN | Works with blind index (accept frequency leak if cardinality is high enough) | Same as email |
Runs ORDER BY encrypted_col |
Breaks | Redesign (non-encrypted sort column) or sort application-side |
Runs LIKE '%...%' on encrypted_col |
Breaks | Don't encrypt; use RBAC/RLS; or use tokenization pattern |
| Runs full-text search on PII | Breaks | Don't encrypt; move FTS engine into sovereign substrate |
| Joins two tables on an encrypted column | Works with blind index + shared key | Add matching blind indexes on both sides |
| Uses a non-PG / non-MySQL database | Not currently supported | File an issue for your DB; in the meantime do field-level encryption in the application |
The proxy does not break applications that encrypt high-entropy per-record PII retrieved by ID, add blind indexes for the handful of columns that need equality lookup, and leave operational/low-cardinality columns plaintext with RBAC enforcement. It does break applications that try to encrypt everything indiscriminately -- that's a mis-use, not a product bug.
Related¶
- DB Proxy CLI & Configuration -- environment variables, annotations, sidecar deployment
- Managed-Service Coverage -- which services cloudtaser covers vs which require customer-side encryption
- Sovereign Deployment Decision Guide -- where OpenBao must live for the claim to hold
- Shared Responsibility -- what still sits with you after the proxy does its job
- Performance & Benchmarks -- proxy benchmarks and roadmap
- Blind index design -- source and issue tracker