ClickHouse for ML Analytics: Architecture Patterns, Indexing, and Embedding Storage

smart labs
2026-01-27
11 min read

Design ClickHouse schemas and ingestion pipelines to handle large-scale LLM embeddings, feature stores, and fast OLAP analytics with hybrid GPU strategies.

Speed up ML experiments and OLAP analytics with ClickHouse — without blowing your cloud GPU budget

If your team is struggling with slow, brittle experiment setups, exploding cloud costs for vector search, and nondeterministic feature pipelines, you aren’t alone. In 2026, production AI stacks demand both fast OLAP analytics and scalable embedding handling — and ClickHouse can be the anchor that unifies feature stores, metadata analytics, and high-throughput ingestion.

In this guide you’ll get concrete architecture patterns, schema examples, ingestion recipes, indexing tactics, and performance tuning advice to run large-scale LLM embeddings and feature stores on ClickHouse — while minimizing GPU/cloud spend and keeping queries snappy for analysts and models. For teams focused on cost-aware querying and alerts, see our operational note on cost-aware querying.

Key takeaways (quick)

  • Use ClickHouse for high-throughput metadata filtering, OLAP, and feature aggregation, and combine it with a GPU-backed vector index for kNN when you need sub-ms ANN at massive scale.
  • Design schemas with coarsely-quantized routing columns (cluster_id / shard_id) to limit candidate sets and avoid full-table vector scans; coarse quantization also reduces per-query cost when you follow cost-aware routing.
  • Choose storage formats intelligently: Array(Float32) for simplicity; Float16 or PQ for cost/size reduction; store heavy blob copies in S3 and only keep compressed vectors for hot datasets.
  • Tune MergeTree, compression, and partitioning for write-heavy ingestion and fast range queries; use TTL policies to offload cold data to S3.

Why ClickHouse in 2026 for ML analytics?

ClickHouse continues to gain enterprise traction as an OLAP powerhouse — buoyed by increased funding and a rush to integrate analytics and ML workflows. As reported in early 2026, ClickHouse raised a significant round that reflects its fast adoption among analytics teams and cloud-native stacks.

"ClickHouse ... raised $400M led by Dragoneer at a $15B valuation." — Bloomberg, January 2026

That momentum translated into product innovations: better vector integrations, tighter cloud-native storage policies, and richer indexing primitives. For teams building LLM-powered features and analytics, ClickHouse offers:

  • Extreme write and query throughput for billions of rows
  • Columnar compression and disk-tiering to reduce storage costs
  • Flexible table engines (MergeTree variants), data skipping indices and projections
  • Streaming ingestion via Kafka engine and low-latency materialized views

Architecture patterns: pick the right model

Below are three practical patterns you’ll see in production. Pick one or combine them depending on scale, cost, and latency needs.

Pattern A — ClickHouse-first (small-to-medium scale)

Store embeddings directly in ClickHouse (Array(Float32) or Float16) and rely on ClickHouse’s internal vector functions / HNSW (if available) for ANN. Best for up to low-hundreds of millions of vectors when you want single-system simplicity.

  • Pros: Simple architecture, single query engine for filtering + similarity.
  • Cons: Larger storage footprint and ANN may be less GPU-optimized compared to dedicated vector DBs.

Pattern B — Hybrid ClickHouse + GPU-backed vector index (large scale)

Use ClickHouse as the metadata/feature store and filter engine; use a GPU-backed vector index (FAISS on GPU, Milvus with GPUs, or a managed vector DB) to perform ANN. ClickHouse stores compressed embeddings or cluster_ids to prune candidates, and full embeddings live in the vector system.

  • Pros: Best cost/latency balance. Run bulk analytics and joins in ClickHouse, run kNN on GPUs only when needed.
  • Cons: Requires coordination (IDs, sync processes) between systems.

Pattern C — Sharded ClickHouse + Local ANN per shard (extreme scale)

Shard embeddings by tenant/time and run lightweight ANN indices co-located with each ClickHouse shard. Use cluster-level cataloging in ClickHouse to route queries to candidate shards; this approach benefits from edge-aware distribution and operational patterns similar to portfolio ops & edge distribution.

  • Pros: Scales horizontally and reduces cross-node traffic for ANN searches.
  • Cons: More complex operationally (shard balancing, placement).

Schema design: practical DDL and patterns

Schema choices determine cost and query speed. Below are production-ready table definitions for embedding storage and a feature store design with upsert semantics.

Embedding table (hybrid-ready)

Store lightweight routing columns (cluster_id), metadata filters, and either a compressed vector or a pointer to an external vector index.

CREATE TABLE embeddings (
  tenant_id UInt32,
  doc_id UInt64,
  created_at DateTime,
  cluster_id UInt32,
  embedding Array(Float32), -- or FixedString for quantized bytes
  emb_quant_key String,     -- e.g. pointer to external vector DB or bucket id
  metadata JSON,
  is_active UInt8
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (tenant_id, cluster_id, doc_id)
SETTINGS index_granularity = 8192;

Notes: partition by time for retention; ORDER BY (tenant_id, cluster_id, doc_id) improves locality for filtered queries; cluster_id is computed at ingest (coarse quantization).
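
As a hedged sketch of that coarse quantization step done in SQL (for backfills or small datasets), assume a hypothetical centroids table (centroid_id UInt32, centroid Array(Float32)) loaded from the versioned k-means artifacts; at high ingest rates you would compute this in the producer instead:

-- Assign each document to its nearest centroid (illustrative backfill only).
SELECT
  e.doc_id,
  argMin(c.centroid_id, L2Distance(e.embedding, c.centroid)) AS cluster_id
FROM embeddings AS e
CROSS JOIN centroids AS c          -- centroids: hypothetical small table of k-means centers
GROUP BY e.doc_id;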

Feature store table (Replacing/upsert-friendly)

CREATE TABLE feature_store (
  entity_id UInt64,
  feature_name String,
  feature_value Float64,
  version UInt64,
  event_time DateTime,
  updated_at DateTime
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(updated_at)
ORDER BY (entity_id, feature_name)   -- sorting key = upsert key for ReplacingMergeTree
SETTINGS index_granularity = 4096;

Pattern: use ReplacingMergeTree with a version or updated_at timestamp for idempotent upserts. For high ingest rates, ingest into a Buffer table or Kafka engine and materialize into this table.
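
Because ReplacingMergeTree deduplicates only in background merges, reads should pick the latest version explicitly. A minimal sketch against the table above (the entity id is illustrative):

SELECT
  entity_id,
  feature_name,
  argMax(feature_value, version) AS latest_value
FROM feature_store
WHERE entity_id = 12345
GROUP BY entity_id, feature_name;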

Ingestion pipelines — reliable, low-latency flows

ClickHouse has first-class streaming and batch ingestion paths. For production ML use-cases you’ll want:

  1. Fast, ordered ingest for features & embeddings (Kafka or bulk S3).
  2. Materialized Views to transform events into feature tables.
  3. Batch jobs to compute cluster_ids or quantize embeddings before insert.

Kafka -> ClickHouse (streaming) example

CREATE TABLE kafka_embeddings_engine (
  tenant_id UInt32,
  doc_id UInt64,
  created_at String,    -- ISO-8601 timestamp string; parsed in the MV below
  raw_embedding String, -- received as base64 or binary
  metadata JSON
) ENGINE = Kafka('kafka:9092', 'embeddings-topic', 'group-1', 'JSONEachRow');

CREATE MATERIALIZED VIEW mv_ingest_to_embeddings
TO embeddings
AS
SELECT
  tenant_id,
  doc_id,
  parseDateTimeBestEffort(created_at) AS created_at,
  -- compute_cluster_id() and the decode helpers below are placeholder UDFs;
  -- in practice compute them in a pre-processing service before Kafka (see note below)
  compute_cluster_id(decodeRawEmbedding(raw_embedding)) AS cluster_id,
  decodeRawEmbeddingToArrayFloat32(raw_embedding) AS embedding,
  '' AS emb_quant_key,    -- set when using an external vector index
  metadata,
  1 AS is_active
FROM kafka_embeddings_engine;

Use an external pre-processing step (Kafka Connect or a Python microservice) to compute cluster_id and lightweight quantization to avoid doing heavy CPU work inside ClickHouse.

Embedding storage strategies (tradeoffs)

How you store embeddings affects cost and query patterns. Here are 3 practical approaches:

1) Array(Float32) (simple, accurate)

  • Pros: Direct, simple, precise.
  • Cons: Large: a 1536-d Float32 vector is ~6 KB per row (1536 × 4 bytes), which multiplies quickly across billions of vectors.

2) Float16 / binary FixedString (reduced size)

Quantize vectors to Float16 or pack into binary and store as FixedString(3072) or String. This roughly halves storage and I/O.

-- example for a packed binary / Float16 column: 1536 dims × 2 bytes = 3072 bytes
ALTER TABLE embeddings ADD COLUMN embedding_bin FixedString(3072);

3) Product Quantization / Int8 (max compression)

Compute PQ codes offline and store as FixedString. Keep the PQ centroids outside ClickHouse (S3 or vector DB) for reconstruction when needed. Use cluster_id to avoid full reconstruction for analytics.

Recommendation: Use Float16 or PQ for historical/cold embeddings and keep a hot small set (e.g., most recent or most-relevant embeddings) in Array(Float32) on NVMe for low-latency hybrid scans.
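
One way to express that hot/cold split is a TTL move, assuming a storage policy named hot_cold (NVMe volume 'hot', S3-backed volume 'cold') is defined in the server config and set on the table, e.g. via SETTINGS storage_policy = 'hot_cold' at CREATE time:

-- Move parts older than 90 days to the S3-backed volume, delete after a year.
ALTER TABLE embeddings
  MODIFY TTL created_at + INTERVAL 90 DAY TO VOLUME 'cold',
             created_at + INTERVAL 365 DAY DELETE;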

Indexing and vector search strategies

ClickHouse is optimized for filtering via data skipping indices. For vector similarity, combine multiple layers:

  1. Coarse quantization column: (cluster_id) computed at ingest with k-means (an IVF-style coarse quantizer). This reduces candidate sets by 10x–1000x.
  2. Metadata indices: use Bloom filter / set data-skipping indices for boolean and low-cardinality filters to prune quickly (see the sketch after this list).
  3. External ANN for heavy lifting: use GPU-backed FAISS/Milvus for top-k retrieval, then join back into ClickHouse for analytics and joins on metadata. For GPU strategy and data-center considerations, see guidance on designing data centers for GPU pods.
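
For item 2, a sketch of data-skipping indices against the embeddings table above; the granularity values and false-positive rate are illustrative and should be tuned per workload:

ALTER TABLE embeddings ADD INDEX idx_active is_active TYPE set(2) GRANULARITY 4;
ALTER TABLE embeddings ADD INDEX idx_quant_key emb_quant_key TYPE bloom_filter(0.01) GRANULARITY 4;
-- build the index for already-inserted parts
ALTER TABLE embeddings MATERIALIZE INDEX idx_active;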

Example: filter by tenant + cluster_id before computing cosine similarity in SQL:

-- query_emb is an Array(Float32) parameter / literal supplied by the caller
SELECT doc_id, 1 - cosineDistance(embedding, query_emb) AS cos_sim
FROM embeddings
WHERE tenant_id = 42 AND cluster_id IN (101, 102, 103)
ORDER BY cos_sim DESC
LIMIT 100;

Note: computing dot products across many rows in SQL is CPU-intensive. Use the filtered candidate set approach to keep work small. For sub-100ms top-k at large scale, offload ANN to a GPU-backed index and use ClickHouse only for enrichment and aggregation.
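
When the external ANN service returns its top-k ids, the enrichment join back in ClickHouse can look like the sketch below; ann_results is a hypothetical temporary table (or external data) holding the (doc_id, score) pairs returned by FAISS:

SELECT
  e.doc_id,
  r.score,
  e.metadata
FROM embeddings AS e
INNER JOIN ann_results AS r USING (doc_id)   -- ann_results: hypothetical top-k candidate set
WHERE e.tenant_id = 42;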

Performance tuning checklist (ClickHouse + cloud)

  • Partitioning: by time (monthly) to make TTL-based cleanup efficient.
  • ORDER BY: include tenant_id and cluster_id to localize reads.
  • index_granularity: lower values for selective queries; default 8192 is a good starting point.
  • Compression: ZSTD(level) or LZ4 for write-heavy workloads. ZSTD(10) balances density and CPU.
  • Disks and tiers: NVMe SSDs for hot nodes; S3-backed long-term cold storage with TTL move. Use ClickHouse storage policy to transparently tier data.
  • MergeTree tuning: adjust max_parts_in_total, background_pool_size, and merges to reduce write amplification.
  • Memory and threads: tune max_memory_usage and max_threads per query; favor higher parallelism for analytics queries but cap per user on multi-tenant clusters (per-query caps are sketched after this list).
  • Batching: ingest in batches of at least 1–10k rows (larger is generally better) to avoid the overhead of many small parts.
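
A minimal sketch of per-query caps for a multi-tenant analytics profile; the limits are illustrative and should be sized to your nodes:

SELECT count()
FROM embeddings
WHERE tenant_id = 42
SETTINGS max_threads = 8, max_memory_usage = 20000000000;  -- ~20 GB per query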

GPU & cloud resource optimization

With 2026's cheaper GPU availability and better virtualization, teams still must optimize to avoid waste:

  • Only run GPU ANN for queries needing sub-100ms top-k; otherwise use CPU with aggressive pruning.
  • Use spot/preemptible GPU instances for asynchronous re-indexing and offline PQ/cluster computation.
  • Co-locate vector indices with GPUs, and use ClickHouse on CPU instances optimized for NVMe I/O.
  • Leverage serverless inference (for embeddings) to scale compute elastically without persistent GPU costs.

Feature store patterns in ClickHouse

ClickHouse excels at time-series aggregations, so it’s a natural fit for feature stores with real-time and batch paths.

Pattern:

  1. Stream events -> Kafka -> Materialized Views producing raw features.
  2. Compute rolling aggregations in materialized views or projections (e.g., 7-day mean).
  3. Upsert summarized features into ReplacingMergeTree table for online lookup.

-- NOTE: an incremental materialized view only sees newly inserted blocks, so this
-- aggregates per insert batch rather than over a true trailing 7-day window. For a
-- genuine rolling window, use a refreshable materialized view (REFRESH EVERY ...)
-- or write -State aggregate functions into an AggregatingMergeTree target.
CREATE MATERIALIZED VIEW mv_user_7d_features TO user_features AS
SELECT
  user_id,
  avg(click_value) AS avg_click_7d,
  max(click_value) AS max_click_7d,
  now() AS updated_at
FROM events
WHERE event_time >= now() - INTERVAL 7 DAY
GROUP BY user_id;

Use this for model serving pipelines that query ClickHouse online (low-latency gateway) or export snapshot tables to feature provisioners (e.g., Feast) for model training. If you’re building edge-aware pipelines or local retraining loops, consider edge-first model serving patterns.
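
For the snapshot-export path, one option is writing a Parquet snapshot straight to object storage with the s3 table function; a sketch assuming the user_features target table above, with an illustrative bucket and credentials:

INSERT INTO FUNCTION s3(
  'https://my-bucket.s3.amazonaws.com/feature_snapshots/user_features_2026_01.parquet',
  'ACCESS_KEY', 'SECRET_KEY', 'Parquet')
SELECT *
FROM user_features;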

Operational considerations & reliability

  • Backups & disaster recovery: snapshot key parts to S3 and verify restore procedures; test full restore regularly. For release and DR playbooks, see zero-downtime release pipelines guidance.
  • Schema migrations: use rolling migrations and default values; avoid wide table rewrites by adding new columns with defaults and migrating gradually.
  • Monitoring: track parts count, merge queue, disk utilization, and query latencies. Alert on slow merges or high read amplification.
  • Security: use RBAC, Network policies, and column-level encryption for PII; integrate with your IAM for auditing.
  • Reproducibility: version cluster_id/quantizer models, store quantizer centroids in S3 with manifest versions, and snapshot them alongside data for reproducible nearest-neighbor results. For data provenance patterns, see responsible web data bridges.
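
A minimal backup/restore sketch using the built-in BACKUP command available in recent ClickHouse releases; the bucket, credentials, and database name are illustrative:

BACKUP TABLE default.embeddings
  TO S3('https://my-bucket.s3.amazonaws.com/backups/embeddings/', 'ACCESS_KEY', 'SECRET_KEY');

RESTORE TABLE default.embeddings
  FROM S3('https://my-bucket.s3.amazonaws.com/backups/embeddings/', 'ACCESS_KEY', 'SECRET_KEY');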

Example end-to-end flow (hybrid)

Here’s a practical flow for a production LLM recommender with 200M embeddings:

  1. Offline: compute k-means with k=10k centroids, store centroids on S3 and version them.
  2. Ingest pipeline: embed text -> quantize to cluster_id -> send to Kafka with base64 embedding and metadata.
  3. Materialized view writes into ClickHouse, storing cluster_id, metadata, and a Float16 compressed embedding for hot candidates.
  4. When doing online recommendation: query ClickHouse for top candidates by cluster_id + metadata filter -> fetch candidate ids -> call GPU FAISS service for top-k exact ANN -> enrich results with ClickHouse joins for analytics and A/B metrics.

This minimizes GPU calls and keeps ClickHouse the single source of truth for metadata, analytics, and feature joining.

Concrete tuning examples

Two quick knobs that often yield big wins:

  • Lower index_granularity from 8192 to 4096 for tenant-level queries that scan small ranges — this reduces false-positive IO during filter scanning.
  • Set compression codecs per column: large binary embeddings -> ZSTD(10), small categorical columns -> LZ4 to balance CPU and IO (sketched below).
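
Per-column codecs can be applied in place; a sketch against the embeddings table above (existing parts are recompressed as they merge, or eagerly via OPTIMIZE ... FINAL):

ALTER TABLE embeddings
  MODIFY COLUMN embedding Array(Float32) CODEC(ZSTD(10)),
  MODIFY COLUMN emb_quant_key String CODEC(LZ4);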

Checklist: production readiness

  • Compute and store cluster_id at ingest to enable cheap pruning.
  • Use materialized views for deterministic feature enriching from event streams.
  • Tier cold embeddings to S3 and keep only a hot working set on NVMe.
  • Offload large ANN computations to GPU-backed vector DB; use ClickHouse as orchestrator for filtering and enrichment.
  • Version quantizers and PQ centroids in S3 for reproducible embeddings pipelines.

What to watch next

  • Broader adoption of hybrid architectures where OLAP and vector search co-exist; vendors are shipping tighter connectors between ClickHouse and vector stores.
  • More CPU vector acceleration (AVX-512 / SVE) and vectorized SQL functions for dot products in ClickHouse to speed small-scale in-DB similarity checks.
  • Improved tiering primitives and storage policies that make it trivial to keep hot embeddings on NVMe and archival copies on object stores.
  • Stronger multi-tenant features and RBAC for enterprise feature stores built on ClickHouse.

Closing: actionable next steps

If you manage an ML platform or run experiments with LLM embeddings, start by building a small hybrid proof-of-concept:

  1. Ingest 10M embeddings into ClickHouse with cluster_id quantization and a hot Float32 slice for the top 1M entries.
  2. Run a GPU-backed FAISS/managed vector DB for ANN on the 10M set and measure end-to-end latency and cost.
  3. Experiment with Float16 and PQ on a historic dataset to estimate storage cost savings.
  4. Implement TTL policies to push older data to S3 and validate restore and query patterns on cold data.

These steps will show you how ClickHouse can anchor analytics, feature engineering, and scalable embeddings without unsustainable GPU spend.

Ready to prototype?

Contact our team at smart-labs.cloud for a hands-on lab: we’ll help you design an ingest pipeline, test quantization strategies, and deploy a hybrid ClickHouse + GPU ANN proof-of-concept in your cloud account.

Actionable takeaway: start with coarse quantization and metadata filtering in ClickHouse, offload heavy ANN to GPUs, and use tiered storage + materialized views to keep costs low while enabling fast analytics.


Related Topics

#databases #performance #analytics

smart labs

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
