Embedding Stores at Scale: Using ClickHouse for Vector Search and Cost-Efficient Retrieval
Practical guide (2026) to using ClickHouse as an embedding store: indexing options, ANN strategies, memory tuning, and cloud cost tradeoffs.
Embedding Stores at Scale: Why ClickHouse Now Matters for Vector Search
You need reproducible, high-throughput retrieval for thousands-to-millions of embeddings without paying top dollar for purpose-built vector DBs or GPUs for every query. You also need predictable costs, tight access controls, and an easy way to integrate vector search into existing OLAP pipelines. In 2026, ClickHouse is a practical, production-ready option for many embedding-store patterns — if you architect and tune it correctly.
The context in 2026
By late 2025 and into 2026, two trends reshaped choices for embedding stores: (1) large-scale OLAP systems (ClickHouse among them) added richer array/vector support and were adapted to host embeddings; and (2) the economics of GPUs shifted — GPUs are still essential for training and index construction, but many inference/ANN serving workloads moved to optimized CPU or disk-based approaches to reduce costs. ClickHouse’s rapid growth and funding in 2025 accelerated community and vendor investments that make it realistic to use as a durable embedding store and part of a hybrid ANN architecture.
When ClickHouse is a good fit (and when it isn't)
- Good fit: High-throughput retrieval where embeddings are joined with large tabular metadata, analytics, or filters (e.g., filtering by tenant, time, or feature flags).
- Consider hybrid: Use ClickHouse for storage + pre-filtering + analytics, and a dedicated ANN engine for ultra-low-latency millisecond nearest-neighbor search at massive scale.
- Not ideal: Extremely low-latency (sub-ms) global nearest-neighbor search over tens of millions of vectors, where the working set cannot fit in memory and latency is the only metric — purpose-built vector DBs or in-memory ANN clusters may still be better.
Core design patterns for ClickHouse as an embedding store
1) Single-phase: brute-force vector search inside ClickHouse
Store vectors as Array(Float32) columns and compute distance/similarity in SQL. This pattern is simple and reproducible, and works well up to a few million vectors if queries can accept latencies in the tens to hundreds of milliseconds. It also requires no external indexing infrastructure.
Example table and query (simplified):
CREATE TABLE embeddings (
id UInt64,
tenant_id UInt32,
embedding Array(Float32),
created_at DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, id);
-- Cosine similarity query (pass the query vector as a parameter of type Array(Float32))
WITH {query_vec:Array(Float32)} AS query_vec,
     sqrt(arraySum(arrayMap(x -> x*x, query_vec))) AS query_norm
SELECT id,
       dot / (sqrt(sum_sq) * query_norm) AS cosine
FROM (
    SELECT id,
           arraySum(arrayMap((x, y) -> x*y, embedding, query_vec)) AS dot,
           arraySum(arrayMap(x -> x*x, embedding)) AS sum_sq
    FROM embeddings
    WHERE tenant_id = 42
)
ORDER BY cosine DESC
LIMIT 100;
-- Newer ClickHouse releases also ship a built-in cosineDistance(a, b),
-- which reduces the expression above to 1 - cosineDistance(embedding, query_vec).
Pros: simple, ACID-ish storage, consistent backups, easy analytics. Cons: full-scan cost unless you reduce candidate set with filters or smaller tables.
2) Two-stage hybrid: pre-filter in ClickHouse, exact/ANN on a dedicated search node
Use ClickHouse to store metadata and to do coarse filtering (tenant, timestamp, category), produce a candidate set (e.g., 10k–100k ids), then fetch vectors to an in-memory ANN (Faiss/HNSWlib) or run vector computation in a dedicated service. This is the most common cost-performance tradeoff in 2026.
Workflow:
- ClickHouse: run fast indexed filters to narrow candidates to N (configurable).
- Bulk-request vectors with a single batched query (SELECT id, embedding FROM embeddings WHERE id IN (...)).
- Perform ANN search in memory on those N vectors and return ranked results.
Why it works: ClickHouse excels at filtering and throughput; specialized ANN libs excel at nearest-neighbor math. Combining them reduces memory requirements for ANN nodes and lowers cloud costs while maintaining accuracy. Use established tooling and runbooks — for example, include Faiss builds into your CI/CD and IaC pipelines (see sample IaC templates and verification patterns).
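The two-stage flow can be sketched in a few lines of Python. The ClickHouse fetch is stubbed out here with an in-memory dict (in production it would be the batched SELECT described above), and the exact-scoring stage uses NumPy rather than Faiss to keep the sketch self-contained; names and sample data are illustrative.

```python
import numpy as np

def rerank(candidates, query_vec, top_k=2):
    """Stage two of the hybrid pattern: exact cosine ranking over the
    candidate set that ClickHouse pre-filtering produced.
    `candidates` maps id -> embedding (list of floats)."""
    ids = list(candidates.keys())
    mat = np.asarray([candidates[i] for i in ids], dtype=np.float32)
    q = np.asarray(query_vec, dtype=np.float32)
    # cosine(a, b) = dot(a, b) / (|a| * |b|); epsilon guards zero vectors
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)[:top_k]
    return [(ids[i], float(sims[i])) for i in order]

# Stand-in for the batched fetch:
#   SELECT id, embedding FROM embeddings WHERE id IN (...)
candidates = {
    101: [1.0, 0.0, 0.0],
    102: [0.0, 1.0, 0.0],
    103: [0.9, 0.1, 0.0],
}
print(rerank(candidates, [1.0, 0.0, 0.0]))
```

At production scale the `rerank` step would be replaced by a Faiss or HNSWlib index built over the same candidate vectors, but the data flow is identical.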
3) Disk-backed ANN + ClickHouse metadata
Store a quantized, disk-optimized ANN index (IVF+PQ, HNSW with compressed vectors) on NVMe or cloud SSDs, and keep metadata and joinable attributes in ClickHouse. This approach minimizes RAM costs and is attractive when embeddings count grows into tens or hundreds of millions.
Tradeoffs: Higher lookup latency vs RAM-resident HNSW, but much lower cost. Use GPU only for index construction and heavy rebuilds; serve with CPUs and NVMe.
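To see why disk-backed PQ is attractive, compare bytes per vector. This toy calculation assumes a common Faiss-style layout of 8 subquantizer codes at 8 bits each; the parameters are illustrative and ignore coarse-index overhead.

```python
def pq_bytes_per_vector(dim, m_subvectors=8, bits=8):
    """Serving-memory estimate: raw Float32 storage vs. a PQ code."""
    raw = dim * 4                  # one Float32 (4 bytes) per dimension
    pq = m_subvectors * bits // 8  # m codes of `bits` bits each
    return raw, pq

raw, pq = pq_bytes_per_vector(768)  # e.g. a 768-dim embedding
print(raw, pq, raw // pq)  # 3072 8 384
```

A ~384x reduction per vector is what lets hundred-million-vector indexes live on NVMe instead of RAM.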
Indexing and ANN options: algorithms and tradeoffs
ANN has three dominant patterns to understand in 2026: HNSW (graph-based), IVF+PQ (inverted-file + product quantization), and LSH or random-projection filters. Each has different memory and throughput profiles.
- HNSW: Excellent recall at low latency when the index fits in RAM. Memory-intensive: each vector carries dozens of graph edges at roughly 4–8 bytes per edge, on top of the raw vector data. Best for low-latency, high-recall use cases.
- IVF+PQ: Lower RAM footprint because PQ compresses vectors aggressively. Slightly higher latency; build time and parameter tuning (nlist, nprobe) matter. Best when you want to trade a small recall drop for large cost savings.
- LSH/Random projection: Useful as a first-stage filter; cheap to compute and can be stored as compact bitmaps or bloom filters inside ClickHouse to prune candidates before exact scoring.
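A minimal random-projection first-stage filter looks like this. The signature is a small integer that could be stored as a UInt16 column in ClickHouse and pruned by Hamming distance; the dimension and bit count are illustrative.

```python
import numpy as np

DIM, BITS = 64, 16
rng = np.random.default_rng(0)
# One random hyperplane per signature bit.
planes = rng.standard_normal((BITS, DIM)).astype(np.float32)

def rp_signature(vec):
    """Sign of the projection onto each hyperplane gives one bit."""
    bits = (planes @ np.asarray(vec, dtype=np.float32)) > 0
    return sum(int(b) << i for i, b in enumerate(bits))

def hamming(a, b):
    """Cheap pruning metric over packed signatures."""
    return bin(a ^ b).count("1")

v = rng.standard_normal(DIM)
# Scaling preserves signs, so the signature is invariant:
print(hamming(rp_signature(v), rp_signature(0.99 * v)))  # 0
```

In practice you would keep only candidates whose signature is within a small Hamming radius of the query's signature, then score those exactly.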
Implementation patterns using ClickHouse
- Store full-precision vectors in ClickHouse for provenance and batch analytics, but keep a compressed ANN index externally for serving.
- Use ClickHouse secondary indices (bloom filters, minmax for primary key ranges) to reduce candidate sets. ClickHouse’s MergeTree primary key creates a minmax index per mark that helps prune segments.
- Materialize periodic snapshots of vectors into an ANN-friendly binary format (Faiss mmap files or HNSW binary). Rebuild incrementally on GPUs or distributed CPU clusters.
Memory tuning and ClickHouse internals that matter
Three ClickHouse configuration knobs have outsized impact on vector workloads:
- max_memory_usage and max_memory_usage_for_user: Control the maximum RAM per query and per user, respectively. Increase cautiously for vector scans, but combine with query-level timeouts to avoid runaway memory usage.
- mark_cache_size: Caches MergeTree mark files (index marks) and reduces random IO. A larger mark cache reduces disk seeks for filtered scans and speeds scanning segmented data. (See edge and serverless tuning notes for related cache considerations.)
- uncompressed_cache_size: Caches decompressed column blocks. Useful when scanning large vector columns compressed on disk — cache hot blocks to avoid decompressing repeatedly.
Additional MergeTree parameters:
- index_granularity (rows per mark): Lower values produce more marks → finer-grained pruning but larger mark files; larger values reduce index size but increase scanned rows per segment. For embedding tables, start with 8192 rows per mark, then test with 2048 and 16384 to measure tradeoffs. If you manage infrastructure with automated templates, incorporate index tuning into your IaC (see sample IaC templates).
- partitioning: Use time or tenant partitions to quickly drop old data and reduce scan scope.
- primary key: Choose a composite key (tenant_id, id) to make tenant-level pruning efficient and use minmax indexing.
Storage formats and compression
Store arrays as Array(Float32) to keep computation native. To save RAM/disk, consider quantization:
- Store 8-bit quantized vectors (Array(Int8) or Array(UInt8)) plus a per-vector scale factor. Decode and dequantize in the candidate set when you need higher precision.
- Use ClickHouse compression codecs (ZSTD, LZ4) tuned for CPU vs compression ratio. ZSTD with level 3–5 balances CPU and size.
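The per-vector scale quantizer described above can be sketched as follows (symmetric int8; the encode/decode pair is illustrative, not a fixed wire format):

```python
import numpy as np

def quantize_int8(vec):
    """Symmetric per-vector int8 quantization: Array(Int8) codes plus one
    Float32 scale factor, stored alongside (or instead of) full precision."""
    v = np.asarray(vec, dtype=np.float32)
    scale = float(np.abs(v).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero vector
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Decode back to float32 for exact scoring on the candidate set."""
    return q.astype(np.float32) * scale

v = np.array([0.5, -1.27, 0.01], dtype=np.float32)
q, s = quantize_int8(v)
err = float(np.max(np.abs(v - dequantize_int8(q, s))))
print(err)  # reconstruction error is bounded by scale / 2
```

This cuts the vector column to a quarter of its Float32 size before codec compression even runs.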
Throughput, latency and benchmarking methodology
Measure three key metrics consistently:
- P95/P99 latency for single-query retrievals (end-to-end: ClickHouse + ANN stage).
- QPS at target latency (how many queries per second you can sustain within SLO).
- Cost per 1M queries (in cloud spend) including instance costs, storage, and index rebuild amortization.
Benchmark approach (recommended):
- Prepare a realistic dataset (same vector dimension, distribution, and metadata cardinality as production).
- Run single-threaded throughput tests to measure latency floor, then scale concurrency until latency SLO breaks.
- Compare strategies: pure ClickHouse scan, ClickHouse pre-filter + in-memory ANN, ClickHouse + disk-based IVF+PQ.
- Measure CPU, memory, disk IO, and network per query. Use system tables in ClickHouse to correlate resource metrics with query phases. For tools and benchmark kits that help run these tests, see relevant tool roundups and kits listed in industry reviews.
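A minimal single-threaded harness for the latency-floor step might look like this; the query function is a stand-in for the real end-to-end call, and the nearest-rank percentile is one reasonable choice among several.

```python
import time

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def benchmark(query_fn, n=200):
    """Single-threaded latency floor: run n queries back to back and
    report p95/p99 latency plus sustained QPS."""
    latencies = []
    t0 = time.perf_counter()
    for _ in range(n):
        start = time.perf_counter()
        query_fn()  # stand-in for the full ClickHouse + ANN round trip
        latencies.append(time.perf_counter() - start)
    wall = time.perf_counter() - t0
    return {"p95_s": percentile(latencies, 95),
            "p99_s": percentile(latencies, 99),
            "qps": n / wall}

stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Scale concurrency from this baseline until the latency SLO breaks, as described above.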
Cost tradeoffs and decision matrix
Cost decisions are driven by three levers in 2026:
- RAM size — big driver for HNSW; if your ANN must be RAM-resident, you pay for RAM-heavy instances.
- SSD vs NVMe — faster and more IOPS-friendly NVMe reduces latency for disk-backed PQ/IVF.
- GPU use — needed for fast index builds; build on-demand with transient GPU instances (spot/preemptible) to cut costs.
Decision matrix (simplified):
- Small dataset (< 5M vectors) + high recall: prefer HNSW in RAM; you can often serve without dedicated GPUs. ClickHouse handles metadata and pre-filtering.
- Medium dataset (5–50M): hybrid — ClickHouse for filtering + Faiss/IVF+PQ for serving; build indexes on GPUs, serve on CPUs with NVMe.
- Large dataset (>50–100M): disk-backed PQ or distributed ANN clusters. Keep ClickHouse for joins and analytics; serve via scalable ANN clusters with compressed vectors.
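The cost-per-1M-queries KPI reduces to simple arithmetic. This sketch assumes a single always-on serving node at a sustained QPS; amortized index-rebuild and storage costs would be added on top.

```python
def cost_per_million_queries(instance_hourly_usd, sustained_qps):
    """Amortized instance cost to serve one million queries."""
    seconds = 1_000_000 / sustained_qps  # time to serve 1M queries
    return instance_hourly_usd * seconds / 3600

# e.g. a $2/hour node sustaining 500 QPS within SLO
print(round(cost_per_million_queries(2.0, 500), 2))  # 1.11
```

Running this for a RAM-heavy HNSW instance vs. a cheaper NVMe node serving IVF+PQ makes the tradeoff concrete.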
Practical implementation checklist
- Design schema: include tenant_id, partition key, timestamp, and Array(Float32) embedding. Pick a MergeTree ORDER BY tuned for your common filters.
- Configure ClickHouse memory and cache settings: set max_memory_usage per query and bump mark/uncompressed caches according to working set size.
- Decide index strategy: brute-force in CH (for small sets), or hybrid (ClickHouse filters → ANN service) for larger sets.
- Quantize if needed: create a pipeline to produce compressed indexes and optionally store a quantized column for cheaper scans.
- Plan rebuilds: use scheduled batch jobs and GPU spot instances for index construction. Snapshot ClickHouse tables and generate ANN files in a controlled deploy pipeline.
- Implement rate limiting, query-level memory caps, and timeouts to protect cluster stability.
- Instrument: collect p95/p99, bytes read, cpu secs, disk IO per query, and cost per 1M queries; iterate on parameters. See vendor and community benchmark kits and reviews for reproducible methodology (for example, industry roundups and toolkits referenced in reviews).
Example: End-to-end hybrid deployment (pattern)
Architecture steps:
- Ingestion: embeddings written to ClickHouse (Array(Float32)) + metadata. Use bulk inserts and partitioning to keep small partitions for hot data.
- Snapshot & index build: nightly snapshot exports to Parquet + Faiss build on GPU instances. Export incremental updates frequently (e.g., hourly) and maintain small incremental HNSW overlay for new items.
- Serving: API layer receives query -> ClickHouse pre-filters by tenant/filters -> returns candidate IDs and optionally vectors -> ANN layer ranks candidates -> API returns results with metadata joined from ClickHouse.
Operationally, this gives you:
- Predictable ClickHouse costs for durable storage and analytics.
- Flexible ANN serving costs: use GPUs for build windows and cheaper CPU/NVMe for serving.
- Reproducible experiments by keeping canonical vectors in ClickHouse and derived indexes in S3/registry.
Tuning examples and knobs to try (hands-on)
Tune index_granularity
Start at index_granularity = 8192. If scans read many marks but mark files are small, reduce to 2048 and measure reduced scanned rows vs increased mark cache pressure. If you manage infrastructure with templates, parameterize index_granularity in your IaC.
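A toy model makes the tradeoff concrete. It assumes worst-case scattered matches, each pulling in a full granule; real pruning depends on how well matches cluster under the primary key.

```python
def granularity_tradeoff(total_rows, matching_rows, granularity):
    """Worst case: every matching row lands in its own granule, so the scan
    reads a full granule per match; mark count grows as granularity shrinks."""
    rows_read = min(total_rows, matching_rows * granularity)
    marks = total_rows // granularity
    return rows_read, marks

for g in (2048, 8192, 16384):
    print(g, granularity_tradeoff(10_000_000, 100, g))
```

Smaller granules read fewer wasted rows per match but inflate the mark files that mark_cache_size must hold.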
Configure caches
Server-wide caches are set in config.xml (values in bytes), while the per-query memory cap belongs in a users.xml profile:
<!-- config.xml: server-wide caches -->
<mark_cache_size>17179869184</mark_cache_size>                  <!-- 16 GiB -->
<uncompressed_cache_size>25769803776</uncompressed_cache_size>  <!-- 24 GiB -->
<!-- users.xml profile: per-query memory cap -->
<max_memory_usage>85899345920</max_memory_usage>                <!-- 80 GiB -->
Batching and vector transfer
When moving candidate vectors out of ClickHouse, fetch them in batches (e.g., 1k–10k ids per batch) to amortize network and serialization overhead. Use binary formats (Arrow/Parquet) for large transfers.
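Batched fetches can be expressed as a simple generator; each batch then becomes one `SELECT id, embedding FROM embeddings WHERE id IN (...)` round trip.

```python
def batched(ids, batch_size=1000):
    """Yield id batches for candidate-vector fetches, amortizing network
    and serialization overhead across many ids per round trip."""
    for i in range(0, len(ids), batch_size):
        yield ids[i:i + batch_size]

batches = list(batched(list(range(2500)), batch_size=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

For the transfer itself, prefer a binary format (Arrow/Parquet or ClickHouse native) over row-oriented text.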
Security, governance and reproducibility
Because ClickHouse is an OLAP DB with established access controls and audit trails, it helps address compliance needs:
- Use roles and row-level filters for tenant isolation.
- Version or tag snapshot exports so experiments can reference the exact vector set used.
- Store provenance metadata with each vector: model_version, pipeline_hash, build_time.
2026 Trends & future predictions
Expect continued convergence of analytics DBs and vector workloads:
- More OLAP vendors will add first-class vector primitives and disk-optimized ANN indices by 2026–2027.
- Quantization and compression techniques will keep improving, shifting more production workloads from RAM to NVMe with little recall loss.
- Spot/GPU-as-a-service models will be adopted for on-demand index building, further lowering amortized index cost.
- ClickHouse’s ecosystem will continue to mature (as reflected by big 2025 funding rounds), meaning richer integrations and managed offerings through 2026.
"Hybrid architectures (OLAP store + dedicated ANN engines) are the dominant pragmatic pattern in 2026 — they balance cost, accuracy and observability."
Common pitfalls and how to avoid them
- Full-table scans on big tables: Always design partitioning and keys to minimize full scans. Add bloom filters or lightweight LSH tokens to the schema for initial pruning.
- Uncontrolled memory allocation: Set query memory caps and use timeouts. Use a staging cluster to test worst-case queries.
- Index rebuild cost surprises: Automate rebuilds and use spot GPU instances. Measure rebuild time and amortize cost into pricing.
- Mixing precision without tracking: When you quantize vectors, track the model_version and quantization parameters to reproduce results.
Actionable takeaways
- Start small, benchmark early: Implement the two-stage hybrid pattern first — ClickHouse for filters, single-node ANN for ranking — then scale to disk-backed or distributed ANN as needed.
- Tune index_granularity and caches: Small changes to index granularity, mark_cache_size and uncompressed_cache_size can flip latency and cost tradeoffs.
- Optimize index lifecycle: Build indexes on demand using transient GPUs, keep snapshots in object storage, and serve compressed indexes from NVMe to lower RAM needs.
- Measure cost per 1M queries: Use this as the core KPI when choosing between HNSW in RAM and IVF+PQ on disk.
Getting started: a minimal checklist
- Schema: implement embedding column as Array(Float32), add tenant and timestamp partitions.
- Config: set sensible max_memory_usage and allocate mark/uncompressed caches.
- Prototype: run a hybrid pipeline (ClickHouse filter -> batch vectors -> Faiss/HNSWlib) and measure baseline latency/QPS.
- Iterate: try quantization, tune index_granularity, and automate rebuild jobs.
Conclusion & Call to action
ClickHouse is a powerful component in cost-conscious, reproducible embedding stores in 2026 — especially when deployed as part of a hybrid strategy that leverages ClickHouse’s strengths in filtering, analytics, and governance while using specialized ANN tooling for the heavy lifting of nearest-neighbor math.
If you’re evaluating options for production vector search, run a short hybrid proof-of-concept: store canonical vectors in ClickHouse, build an ANN index offline with Faiss/HNSW, and measure p95/p99 latency and cost per 1M queries. If you want a jumpstart, try our reference implementation and benchmark kit at smart-labs.cloud/repo — it includes ClickHouse configs, sample ingestion pipelines, and Faiss/HNSW build scripts tuned for cloud spot GPUs.
Ready to prototype? Deploy a 2-stage POC today and compare cost/latency vs a managed vector DB. Contact our engineering team for an architecture review or request the benchmark kit — we’ll help you choose the right index strategy and cost model for your workload.
Related Reading
- Beyond Serverless: Designing Resilient Cloud‑Native Architectures for 2026
- IaC templates for automated software verification: Terraform/CloudFormation patterns for embedded test farms
- Field Review: Affordable Edge Bundles for Indie Devs (2026)
- Running Large Language Models on Compliant Infrastructure: SLA, Auditing & Cost Considerations