Skip to content

Database Proxy: Searchability, Encryption Modes, and What Breaks

cloudtaser's database proxy encrypts PII columns transparently on the wire. Unlike an S3 object -- which is an opaque blob and trivially encryptable -- a database column is queried: WHERE, ORDER BY, LIKE, JOIN, GROUP BY, range scans. Client-side column encryption breaks those operators unless you deliberately pick a mode that leaks just enough structure to keep the operator working, and every such mode carries a known weakness you must understand.

This page explains the trade-off honestly, per column, per query type, so you can decide before you configure what the downstream cost will be.


Why client-side database encryption is hard (not like S3)

S3 objects are opaque blobs. You encrypt the object, store the ciphertext, read the ciphertext, decrypt in the client. Done. The storage layer never needs to reason about the object's content -- it never sorts it, filters it, or joins it against another object.

A database column is different. The database engine is expected to:

  • Filter -- WHERE email = '[email protected]' needs to find matching rows
  • Range-scan -- WHERE transaction_amount > 1000 needs ordered comparison
  • Sort -- ORDER BY created_at DESC
  • Join -- JOIN orders ON users.email = orders.customer_email needs equality across tables
  • Aggregate -- GROUP BY country, SUM(amount), COUNT(DISTINCT email)
  • Pattern-match -- LIKE '%@example.com'
  • Full-text search -- MATCH ... AGAINST / tsvector

Every one of those operators needs either plaintext or a ciphertext that preserves the structural property the operator tests. "Random" ciphertext -- the cryptographically strongest option -- preserves nothing. "Deterministic" ciphertext preserves equality but leaks frequency. "Order-preserving" ciphertext preserves order but leaks so much structure that it's been shown breakable with an auxiliary dataset.

There is no free lunch. Every client-side DB encryption scheme trades cryptographic strength for some piece of structural leakage that the database engine needs to keep operating. The question is which leakage you can live with on which column.


Encryption-mode decision tree

You pick the mode per column, not per database. The proxy supports mixing modes in the same table.

1. Random AES-256-GCM -- the default

How it works: each value gets a fresh random 32-byte DEK and 12-byte nonce. The DEK is wrapped by the OpenBao Transit engine. The same plaintext produces a completely different ciphertext every time.

What it protects against: a cloud provider DBA, a CLOUD Act subpoena against the managed database, an exfiltrated snapshot, a compromised replica. None of them can recover plaintext without the EU-hosted KEK.

What it costs you:

  • No equality search (WHERE email = ? fails -- same email → different ciphertext each insert)
  • No range search (WHERE transaction_amount > 1000 fails -- no order relation in ciphertext)
  • No ordering (ORDER BY ssn returns ciphertext in arbitrary order)
  • No JOIN on encrypted columns
  • No GROUP BY on encrypted columns
  • No LIKE / pattern match
  • No full-text search
  • Only retrievable by a non-encrypted key -- primary key, foreign key, any other plaintext index

Use this mode for: sensitive columns you only ever retrieve by ID (SSN, BSN, passport number, full address, date of birth, free-text notes, medical diagnosis codes). This is the right default for the vast majority of regulated-workload PII because those fields should not be search keys anyway.

2. Deterministic AES -- opt-in per column

How it works: same plaintext → same ciphertext. Achieved by deriving the nonce from the plaintext (AES-SIV) or by using a stable nonce keyed from the value. The proxy ships this as blind indexes: a separate column stores HMAC-SHA256(key, plaintext) next to the encrypted column, and the proxy rewrites WHERE encrypted_col = $1 into WHERE encrypted_col_idx = HMAC-SHA256(key, $1) transparently.

What it unlocks:

  • Equality search (WHERE email_idx = HMAC(email))
  • Equality JOINs (both sides use the same HMAC key)
  • GROUP BY on the blind-index column
  • COUNT(DISTINCT ...) on the blind-index column

What it costs you:

  • Frequency analysis. If a column has a small cardinality (gender, marital_status, country), an attacker who can see ciphertext distribution learns the frequency of each value. Matching that against population statistics is trivial (the most common ciphertext in a Dutch citizens table is "Netherlands"). Deterministic encryption on low-entropy columns is catastrophic.
  • Correlation across tables. If two tables both deterministically encrypt email under the same key, the attacker can JOIN them without the key. That's the feature you asked for -- it's also structural leakage.
  • Still no ordering, no range, no LIKE, no FTS. Determinism only unlocks equality.

Use this mode for: high-entropy identifiers that are genuinely unique per row (email, phone_number, iban, tax_id). Do NOT use it for low-cardinality columns (gender, country, marital_status, status_code) -- the frequency leak is a PII disclosure in its own right.

Today the proxy exposes this as blind indexes via CLOUDTASER_DBPROXY_BLIND_INDEX_COLUMNS and CLOUDTASER_DBPROXY_BLIND_INDEX_KEY. The ciphertext column remains random AES-GCM; the companion HMAC column provides the deterministic index. This two-column approach is intentional: the random ciphertext is what the cloud provider stores at rest, while the HMAC is what the application joins on -- and the HMAC key never leaves OpenBao.

3. Order-preserving / order-revealing encryption -- NOT shipped, deliberately

How it works (in theory): a scheme like CryptDB's OPE or later ORE constructions produces ciphertexts that preserve either the plaintext order (OPE -- the ciphertext can be <-compared directly) or reveal the order upon query (ORE -- a comparison function takes two ciphertexts and returns </=/>).

What it would unlock: range queries (WHERE salary > 50000), ORDER BY salary, min/max aggregates.

Why cloudtaser does not ship it: every practical OPE/ORE construction has a known structural attack. Grubbs, Sekniqi, Kamara, and others (2017+) have shown that order-alone leakage, combined with a moderately-sized auxiliary plaintext dataset (a public census, a leaked breach corpus), is sufficient to recover a large fraction of the protected plaintexts. The leakage isn't just "the order" -- it's a strong signal that reconstructs the distribution. For regulated data (medical, financial, government), this is unacceptable.

Our answer for range queries: don't encrypt the column. If you genuinely need WHERE transaction_amount > 1000, transaction_amount is a low-sensitivity bucket of numeric data; put the access-control policy in the application layer (OPA, OAuth scopes, row-level RLS on the DB side) rather than relying on encryption to gatekeep it. Encrypt the account identifier, not the amount.

4. Format-preserving encryption (FPE) -- NOT currently shipped

What it is: encryption that preserves the format of the plaintext -- a 16-digit credit-card number encrypts to another 16-digit string, an IBAN encrypts to a string of the same length and checksum validity. Typically via FF1 or FF3-1 (NIST SP 800-38G).

Why it's useful: legacy systems with rigid schema constraints -- a CHAR(16) column for PAN, a fixed-length field for IBAN -- can store ciphertext in place without schema changes.

Why it's a trade-off: FPE is a small domain, and small-domain encryption has weaker security bounds than AES-GCM on arbitrary-length inputs. FF3-1 had to be repaired in 2017 after a practical attack on specific domain sizes. It remains secure within the parameters NIST specifies, but it's narrower than general-purpose AEAD.

Status in cloudtaser: not currently supported. The proxy stores ciphertext as bytea (PostgreSQL) or VARBINARY (MySQL), which assumes your schema can hold binary of arbitrary length. If you have a hard legacy-format constraint, file an issue on cloudtaser/cloudtaser-db-proxy and we'll scope it -- FF1 integration is straightforward; the question is whether your threat model tolerates small-domain PRP security bounds.

5. Tokenization via OpenBao lookup -- pattern, not first-class feature

How it works: instead of encrypting the plaintext in the DB column, replace it with an opaque token (e.g. tok_01HZ...). Store the token → plaintext mapping in OpenBao's KV engine. At read time, the application (or a lookup helper) resolves the token against OpenBao.

What it unlocks: arbitrary DB operations on the token itself -- indexes, joins, aggregation, even LIKE if the token format permits. The DB never holds the plaintext.

What it costs you:

  • You can't filter on the underlying value from the DB (WHERE plaintext_email = 'alice@...') -- the DB doesn't know the mapping.
  • Every read that needs plaintext roundtrips to OpenBao. For list endpoints with 100 rows, that's 100 KV lookups. Batch API helps; latency floor is bounded by OpenBao roundtrip.
  • Token format matters. Using a random UUID maximises unlinkability but blocks LIKE on token shape. Using a deterministic tokenizer (tokens are HMAC(plaintext)) reintroduces the deterministic-encryption frequency-analysis concern.

Status in cloudtaser: not currently shipped as a first-class proxy feature; described here because it's the pattern we recommend when the DB column needs to participate in queries the encryption modes above cannot support (e.g. LIKE '%foo%' on a customer-name column with a legitimate customer-service UI requirement). Implement it in the application layer against OpenBao KV; the proxy does not need to be involved.


Decision table -- common column types

For a typical enterprise customer record. "Mode" is the recommendation; "Why" is the threat-model justification.

Column Mode Why
id (PK) plaintext Primary key, never PII. Needed for every lookup.
email deterministic (blind index) High-entropy, unique per user, needed for login-by-email. Frequency leak tolerable because cardinality ~= row count.
password_hash do not encrypt -- use bcrypt/argon2 in the application A password hash is already a one-way function. Encrypting it adds latency without adding security and breaks the auth flow. This is the #1 anti-pattern we see.
full_name random AES-GCM PII, retrieved by id. Never a search key in well-designed apps.
date_of_birth random AES-GCM PII. If you need age-range queries, store a non-encrypted age_bucket column (0-17, 18-65, 65+) -- the application computes it on insert.
ssn / bsn / tax_id random AES-GCM PII. Retrieved by id. Deterministic (blind index) acceptable ONLY if you have a legitimate WHERE ssn = ? query path and you accept the frequency-analysis risk.
full_address random AES-GCM PII. Address-matching (delivery lookup) should use a separate postcode plaintext column, not the full address.
order_status (enum: pending/shipped/delivered/cancelled) plaintext Low-cardinality operational state. Encrypting it is catastrophic for determinism (4 ciphertexts for the whole table) and breaks every dashboard query. Enforce access control at the row/RBAC level instead.
transaction_amount plaintext + RBAC Range queries and sums are core to the business function. OPE is unsafe; tokenization destroys the aggregate story. Encrypt the account_id, not the amount.
free_text_notes random AES-GCM High-entropy unstructured text. Never a search key.
iban deterministic (blind index) High-entropy unique identifier, common lookup key. Same reasoning as email.
credit_card_pan random AES-GCM (or FPE if schema demands) PCI-DSS sensitive. Retrieved for display (often only the last-4 are shown -- store a separate pan_last4 plaintext column for that).
gender plaintext Low-cardinality (typically 2-4 values). Deterministic encryption would leak frequency trivially. If the column is PII-sensitive under your regulator's definition, drop the column entirely or encrypt at the row level via RBAC.
country_code plaintext Same reasoning as gender -- cardinality <250 globally, frequency-analysis resistant only in very uniform populations.

The pattern that keeps repeating: encrypt high-entropy per-record values retrieved by ID; do not encrypt low-entropy operational columns; do not encrypt hashes.


Worked example -- customers(id, email, full_name, dob, country, notes)

Starting schema:

CREATE TABLE customers (
    id           SERIAL PRIMARY KEY,
    email        TEXT NOT NULL,
    full_name    TEXT,
    dob          DATE,
    country      CHAR(2),
    notes        TEXT
);

Application queries in production:

-- Login
SELECT id, email FROM customers WHERE email = $1;

-- Customer lookup (CS UI)
SELECT email, full_name, dob, country, notes FROM customers WHERE id = $1;

-- Dashboards
SELECT country, COUNT(*) FROM customers GROUP BY country;

Encrypted schema cloudtaser-style:

CREATE TABLE customers (
    id           SERIAL PRIMARY KEY,        -- plaintext
    email        BYTEA NOT NULL,            -- random AES-GCM (proxy)
    email_idx    BYTEA NOT NULL UNIQUE,     -- HMAC-SHA256 blind index (proxy)
    full_name    BYTEA,                     -- random AES-GCM (proxy)
    dob          BYTEA,                     -- random AES-GCM (proxy)
    country      CHAR(2),                   -- plaintext (low-cardinality; dashboards)
    notes        BYTEA                      -- random AES-GCM (proxy)
);

Proxy config (sidecar env, set by the operator from annotations):

CLOUDTASER_DBPROXY_ENCRYPT_COLUMNS=customers.email,customers.full_name,customers.dob,customers.notes
CLOUDTASER_DBPROXY_BLIND_INDEX_COLUMNS=customers.email:customers.email_idx
CLOUDTASER_DBPROXY_BLIND_INDEX_KEY=customers-email-hmac

Equivalent pod annotations:

cloudtaser.io/db-proxy: "true"
cloudtaser.io/db-proxy-upstream: "customer-db.eu-west3.rds.amazonaws.com:5432"
cloudtaser.io/db-proxy-transit-key: "customers"
cloudtaser.io/db-proxy-encrypt-columns: "customers.email,customers.full_name,customers.dob,customers.notes"
cloudtaser.io/db-proxy-blind-index-columns: "customers.email:customers.email_idx"
cloudtaser.io/db-proxy-blind-index-key: "customers-email-hmac"

What the application sees: no change. Queries run as before; the proxy rewrites the WHERE email = $1 into WHERE email_idx = HMAC(key, $1) transparently and decrypts email, full_name, dob, notes on the way back.

What the cloud provider sees:

  • id -- plaintext integer
  • email -- 130+ bytes of per-row-unique ciphertext, unlinkable across rows
  • email_idx -- 32-byte HMAC; enables JOIN/lookup, leaks equality (intentional)
  • full_name, dob, notes -- ciphertext blobs
  • country -- plaintext "NL", "DE", "FR"...

The dashboard GROUP BY country still works against the plaintext column. The login flow still works via email_idx. The CS-UI lookup still works via id. The only new DB-provider-visible signal is email_idx equality, which matches the uniqueness of the email column anyway (every row has a distinct email, and therefore a distinct email_idx) -- no frequency-analysis leverage.


What breaks and how to mitigate

WHERE encrypted_col = 'literal'

Breaks under random AES-GCM. Works via blind index (deterministic HMAC) when you've configured it.

If you need this but haven't configured a blind index, the options are:

  1. Add a blind-index column for that specific column (see config above)
  2. Rewrite the query to look up by a non-encrypted key (id, foreign key)
  3. If the column is low-cardinality, don't encrypt it -- use RBAC/RLS instead

ORDER BY encrypted_col

Breaks for random AES-GCM and deterministic (HMAC output order is unrelated to plaintext order).

Options:

  1. Fetch rows by a non-encrypted filter, sort application-side after decryption. Works for <10k rows; does not scale to ORDER BY ... LIMIT 10 over millions.
  2. Don't encrypt the sort column. If you need ORDER BY created_at on a timestamp, created_at is usually not sensitive -- leave it plaintext.
  3. Store a derived non-sensitive sort key alongside the encrypted column (e.g. an age_bucket next to an encrypted dob).

LIKE '%foo%'

Breaks entirely. No deployable client-side encryption scheme supports arbitrary substring search on encrypted text -- searchable symmetric encryption (SSE) is a live research area with impractical overhead and known leakage attacks at scale.

Options:

  1. Don't encrypt the column -- put an RBAC / row-level-security / application-policy layer in front instead. For customer-name search in a CS UI, this is usually the right answer: the column is visible to authenticated CS staff only.
  2. Store a separate sanitised "searchable" column -- e.g. full_name_search_tokens containing a tokenised/normalised n-gram set, plaintext, with application-side access control. Accept that you now have two surfaces of sensitive data; you've traded crypto enforcement for RBAC enforcement.
  3. Use tokenization (see §5 above) with a tokenizer that preserves the substring of interest -- rare, highly app-specific.

JOIN on encrypted columns

Works if BOTH sides use deterministic encryption (blind index) with the same HMAC key. Under random AES-GCM the ciphertexts don't line up and the join returns an empty set.

This is why the blind-index key is a named OpenBao Transit key, not a per-column random secret -- two tables joining on email_idx share CLOUDTASER_DBPROXY_BLIND_INDEX_KEY=customers-email-hmac and thus produce matching HMACs for the same email.

Indexes on encrypted columns

Work normally at the index level. A B-tree of ciphertext bytes is still a valid B-tree -- the engine can do point lookups by the full ciphertext value. What breaks is range and prefix queries, which the index cannot satisfy when the ciphertext ordering is unrelated to plaintext ordering.

So: CREATE INDEX ON customers(email_idx) is fast (exact-match lookups hit the index). CREATE INDEX ON customers(email) on the random-AES-GCM column is useless because you'll never look up by that ciphertext (you'd need to know the exact ciphertext, which changes on every insert for the same plaintext).

Breaks. Encrypted columns cannot feed a Postgres tsvector / MySQL FULLTEXT index in any useful way.

Our honest answer: full-text search on encrypted columns is an open research problem. The credible production answer is "don't encrypt the FTS column; add an OPA/access-policy layer in front of the query path instead, and accept that FTS results are visible to the cloud provider if they can subpoena the index." For most customer-facing search (search transactions by description, search tickets by content), the access-control story carries the sovereignty claim, not the encryption.

If the FTS column carries regulated PII and that isn't acceptable, the architectural answer is to host the FTS engine (Elasticsearch, Meilisearch, typesense) in the sovereign substrate itself -- your EU OpenBao host or a nearby sovereign VM -- and keep the managed-DB copy without the FTS columns.


Supported databases

As of the current proxy release:

Database Status Notes
PostgreSQL (self-hosted, Cloud SQL, RDS, Aurora PG, Azure Database for PostgreSQL, AlloyDB) Supported Wire protocol v3, prepared statements, blind indexes, jackc/pgx/v5/pgproto3 for parsing
MySQL (self-hosted, Cloud SQL MySQL, RDS MySQL, Aurora MySQL, Azure Database for MySQL) Supported MySQL protocol, prepared statements; set CLOUDTASER_DBPROXY_TYPE=mysql
MariaDB Not explicitly tested Wire-compatible with MySQL protocol; should work but not in CI. File an issue if you need it certified.
MongoDB (self-hosted, Atlas, DocumentDB) Not supported MongoDB wire protocol is BSON-based, not schema-ful at the protocol level; field-level encryption requires per-collection configuration and a different approach. Roadmap item; check cloudtaser/cloudtaser-db-proxy issues.
DynamoDB Not supported AWS-proprietary API; client-side field-level encryption requires application SDK integration (AWS offers its own FLE SDK) rather than a wire-protocol proxy.
Cosmos DB Not supported Same reasoning -- proprietary API + SDK model.
Redis / ElastiCache / MemoryStore / Azure Cache Not supported Key-value protocol; the application must encrypt values before SET and decrypt after GET. A proxy is technically possible but not on our roadmap -- the redis client libraries already support this pattern.
Elasticsearch / OpenSearch / CloudSearch Not supported Search indexes are plaintext by design; see the FTS discussion above.
Snowflake / Redshift / BigQuery / Databricks Not supported Analytics warehouses. Client-side encryption before load is the customer-app responsibility.
Kafka topics / MSK / Event Hubs / Pub/Sub / Kinesis Not supported Message payloads are opaque to the broker; encrypt in the producer, decrypt in the consumer. The Kafka ecosystem has conventions (schema registry envelope encryption) outside the DB-proxy scope.
ClickHouse Not supported Columnar engine; no column-level encryption proxy story today.
Trino / Presto / Athena Not supported Query engines over underlying storage; encryption is a property of the storage layer, not the engine.

The short version: PostgreSQL and MySQL today; MongoDB is the most-asked next target. Everything else is either a different data model (key-value, document, columnar analytics) that needs its own proxy design, or a SaaS API where client-side encryption is an application concern rather than a proxy concern.


Performance

Per-value overhead

AES-GCM is AES-NI-accelerated on every x86_64 server CPU shipped since 2010 and on ARMv8-A (ARM Cryptography Extensions) on every GCP Tau T2A / AWS Graviton / Azure Ampere Altra. Per-value encryption is in the 100-500 ns range for the sizes typical of PII columns (email, name, SSN -- tens to hundreds of bytes).

The cost that actually shows up on a profile is the OpenBao Transit roundtrip for DEK wrap/unwrap. Each INSERT or UPDATE of an encrypted column triggers one /v1/transit/encrypt/<key> call; each SELECT of N encrypted values triggers N /v1/transit/decrypt/<key> calls (batched where the proxy can detect batchability).

Realistic numbers on a warm OpenBao:

  • transit/encrypt latency: 1-3 ms p50, 5-10 ms p99 on a local VPC OpenBao
  • transit/decrypt latency: same order of magnitude
  • Batched decrypt of N values: amortises the HTTP roundtrip; ~10% overhead per additional value

The proxy connection-pools to OpenBao, keeps HTTP/2 streams warm, and caches the bao.Client token lifecycle. We do not cache DEKs -- every value has a fresh DEK wrapped under the KEK; caching would re-use DEKs across values and invalidate the security story.

Proxy hop overhead

The proxy itself adds ~1-3 ms per query on localhost sidecar (no network hop -- the app and proxy share the pod network namespace). On a separate-pod deployment (which we do not recommend) the overhead is dominated by the pod-to-pod hop.

End-to-end

For a typical API request that does one SELECT ... WHERE id = $1 returning 5 encrypted columns:

  • Direct-to-DB baseline: ~2 ms (managed-DB p50)
  • Via cloudtaser DB proxy: ~2 ms + 1-3 ms (proxy) + 2-8 ms (OpenBao batched-decrypt) = ~6-13 ms total

For bulk operations (list endpoints returning 100 rows × 5 encrypted columns), batching keeps OpenBao roundtrips to 1-2 total, so the additional overhead is approximately linear in value count at ~50-200 µs per value on top of the batch-wrap latency.

See Performance & Benchmarks for the full roadmap on benchmark coverage, cached-key-derivation optimisations, and the open work on reducing the transit/decrypt batch floor.


The one exception you will hit first -- login-by-email

The most common legitimate PII-column lookup in every regulated app is:

SELECT id, password_hash FROM users WHERE email = $1;

The clean answer is a blind index on email (mode 2 above). The companion email_idx BYTEA column holds HMAC-SHA256(key, email); the proxy rewrites the WHERE email = $1 into WHERE email_idx = HMAC(key, $1); the DB looks up by HMAC; the proxy decrypts the id and password_hash on the way back.

No application change, no schema gymnastics beyond adding the one extra column, and the cloud provider still only sees ciphertext + HMAC -- both unreadable without the EU-hosted Transit key.

If you can't roll blind indexes yet

Keep the email column unencrypted for the login path and encrypt the rest of the PII (SSN, full name, DOB, address, phone, card number). The sovereignty claim is weaker for email specifically but still covers the bulk of customer PII. This is the staged-rollout pattern most teams follow.


Summary

If your application... Impact Action needed
Looks up records by primary or foreign key None None
Filters by date, status, type, country None Leave those columns plaintext (they shouldn't be encrypted anyway)
Retrieves PII for display None Proxy decrypts transparently
Writes PII on registration/update None Proxy encrypts transparently
Searches by email for login Works with blind index Add an email_idx column and configure BLIND_INDEX_COLUMNS
Searches by SSN/tax_id/IBAN Works with blind index (accept frequency leak if cardinality is high enough) Same as email
Runs ORDER BY encrypted_col Breaks Redesign (non-encrypted sort column) or sort application-side
Runs LIKE '%...%' on encrypted_col Breaks Don't encrypt; use RBAC/RLS; or use tokenization pattern
Runs full-text search on PII Breaks Don't encrypt; move FTS engine into sovereign substrate
Joins two tables on an encrypted column Works with blind index + shared key Add matching blind indexes on both sides
Uses a non-PG / non-MySQL database Not currently supported File an issue for your DB; in the meantime do field-level encryption in the application

The proxy does not break applications that encrypt high-entropy per-record PII retrieved by ID, add blind indexes for the handful of columns that need equality lookup, and leave operational/low-cardinality columns plaintext with RBAC enforcement. It does break applications that try to encrypt everything indiscriminately -- that's a mis-use, not a product bug.