Benchmarks and Best Practices: Running OLAP Queries on ClickHouse with New PLC SSDs

2026-02-02

Real-world ClickHouse benchmarks compare PLC and traditional TLC SSDs, with concrete tuning advice to keep OLAP throughput high and latency predictable.

Your OLAP queries are fast until storage slows them down

If your ClickHouse analytics cluster hits unpredictable latency spikes during large aggregations or background merges, storage is almost always the choke point. In 2026, cheaper PLC SSD options are tempting for large-capacity OLAP deployments, but they behave differently from traditional TLC NAND flash under ClickHouse workloads. This article shows real-world benchmarks comparing PLC vs traditional SSDs, explains what those numbers mean for ClickHouse, and gives prescriptive tuning to get throughput-sensitive analytics back under control.

Why this matters in 2026

Late 2025 and early 2026 saw two big trends that change the storage economics for analytics teams:

  • PLC flash is production-ready: SK Hynix’s cell-splitting technique (announced in 2025) and improved controllers pushed PLC from experimental to viable for high-density SSDs, lowering $/TB.
  • ClickHouse adoption is exploding: with substantial funding and enterprise traction, ClickHouse clusters at petabyte scale are becoming common, making storage economics a first-order concern.

These trends mean: you can save on storage cost, but only if you validate PLC devices for your workload and tune ClickHouse and the OS properly.

Benchmark methodology — how we tested (reproducible)

Benchmarks were run on identical servers with only the NVMe drive class changed. The goal: quantify real OLAP effects (scan throughput, random read latency, merge impact) rather than synthetic microbenchmarks alone.

Hardware, software, dataset

  • Server: AMD EPYC 7702P (64 cores, single socket), 512 GB DDR4, 25 Gbps NIC
  • OS: Ubuntu 22.04 with Linux kernel 6.6 (io_uring and blk-mq improvements)
  • ClickHouse: 24.8 (stable production series in 2026)
  • Drives tested:
    • Traditional enterprise NVMe (TLC) — 7.68 TB class, enterprise controller
    • PLC NVMe — 15.36 TB class (SK Hynix style PLC with cell splitting)
  • Dataset: TPC-DS derived 1 TB ClickHouse MergeTree table (wide schema, heavy GROUP BYs)
  • Workloads:
    1. Large sequential scans (full-table reads / aggregation)
    2. Concurrent short reads (many concurrent small-range reads)
    3. Background merges & heavy insert/writes (real ingest with small batches)

Tools and configs

  • fio for microbenchmarks (random 4K, 64K sequential, mixed read/write)
  • ClickHouse benchmark harness (clickhouse-benchmark and custom query runner)
  • Filesystem: XFS with mount options noatime,allocsize=8m (the legacy nobarrier option is no longer accepted by XFS on modern kernels; write barriers are always on)
  • NVMe queue depth tuned per drive: device queues support up to 1024–2048 entries; fio iodepth was tested between 64 and 512

Key benchmark results (summary)

High-level numbers below are averages observed across multiple runs. Your numbers will vary — use them as a realistic baseline for what to expect.

  • Sequential read throughput (scan-heavy queries)
    • TLC: 5.8–6.4 GB/s
    • PLC: 5.4–6.0 GB/s

    Interpretation: PLC achieves near-parity for large sequential reads — most OLAP scans won’t see major regression.

  • Random read IOPS (4 KB, many concurrent short reads)
    • TLC: ~260k IOPS
    • PLC: ~210k IOPS

    Interpretation: PLC shows lower random read IOPS and slightly higher P99 latency. Point lookup-heavy workloads will notice.

  • Random write IOPS & sustained writes (small inserts and merges)
    • TLC: 40–60k IOPS sustained; higher peak
    • PLC: 8–18k IOPS sustained; peaks degrade faster under steady writes

    Interpretation: PLC controllers and lower P/E cycles increase write amplification and throttle sustained write throughput — major impact during MergeTree background merges and heavy small-batch ingestion.

  • Latency under mixed OLAP load (aggregate queries + background merges)
    • TLC: median 2–4 ms, P99 6–9 ms
    • PLC: median 3–6 ms, P99 9–18 ms (spikes correlated with merges)

    Interpretation: PLC exhibits higher P99 spikes when merges or sustained writes occur concurrently with reads.

Deconstructing the numbers: what causes PLC differences

Understanding these results requires looking at controller behavior and ClickHouse internals.

  • PLC cell density & endurance: PLC stores more bits per cell, reducing P/E cycles and increasing error correction and garbage collection overhead. That raises write amplification, negatively impacting sustained writes.
  • Controller caching & throttling: Controllers use DRAM and SLC caching to absorb bursts. When full, write throughput falls back to native PLC speeds.
  • ClickHouse background merges: MergeTree aggressively compacts parts; flush patterns create sustained background writes. If those writes saturate PLC write bandwidth, read latencies spike (GC pauses, increased queueing).
  • Workload mix matters: Read-heavy, large-scan jobs (typical analytics) tolerate PLC better than write-heavy ingestion and frequent small merges.

Prescriptive tuning: how to run OLAP on PLC SSDs successfully

Below are targeted operational changes to avoid PLC pitfalls and reclaim predictable throughput.

1) Identify your workload profile first

  • If >80% of IO is large sequential reads (batch analytics), PLC is a strong cost-saving option.
  • If you have high sustained small-writes (close-to-real-time ingestion with tiny batches), prefer TLC or hybrid architectures (hot TLC, cold PLC).
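The rule of thumb above can be turned into a quick check against cumulative block-device counters (for example, fields from /proc/diskstats or iostat). A minimal sketch, assuming you already have byte and operation counts for a sampling window; the thresholds mirror the guidance above and are assumptions, not kernel or ClickHouse constants:

```python
# Coarse workload classifier from cumulative block-device counters.
# Thresholds (80% reads, >= 1 MB average read) follow the rule of
# thumb in the text and are assumptions, not vendor constants.

def classify_workload(read_bytes: int, write_bytes: int,
                      read_ops: int) -> str:
    """Return a coarse PLC-suitability verdict for one IO sample."""
    total = read_bytes + write_bytes
    if total == 0:
        return "no data"
    read_fraction = read_bytes / total
    avg_read_size = read_bytes / read_ops if read_ops else 0
    # >80% of bytes are reads and reads are large: scan-heavy analytics
    if read_fraction > 0.8 and avg_read_size >= 1 << 20:
        return "PLC candidate"
    return "prefer TLC or hybrid"

# Example sample: ~900 GB read in large scans vs 50 GB written
verdict = classify_workload(read_bytes=900 << 30, write_bytes=50 << 30,
                            read_ops=500_000)
```

Sample several representative windows (peak ingest, dashboard hours, nightly batch) rather than a single average; a write-heavy hour can disqualify a drive that looks read-dominant over a day.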

2) Tune ClickHouse to reduce write amplification and merge pressure

Key settings to adjust in server config or per-table overrides:

<!-- config.xml: server-wide merge concurrency -->
<clickhouse>
    <background_pool_size>6</background_pool_size>  <!-- limit concurrent merges -->
    <max_number_of_merges_with_ttl_in_pool>2</max_number_of_merges_with_ttl_in_pool>
</clickhouse>

<!-- config.xml: MergeTree defaults for all tables -->
<merge_tree>
    <max_bytes_to_merge_at_max_space_in_pool>20000000000</max_bytes_to_merge_at_max_space_in_pool>  <!-- 20 GB: larger, less frequent merges -->
    <min_bytes_for_wide_part>100000000</min_bytes_for_wide_part>  <!-- avoid too many tiny parts -->
</merge_tree>

<!-- users.xml: default profile -->
<profiles>
    <default>
        <max_memory_usage>200000000000</max_memory_usage>  <!-- 200 GB -->
        <max_bytes_before_external_group_by>10000000000</max_bytes_before_external_group_by>  <!-- 10 GB -->
    </default>
</profiles>

  • Increase max_bytes_to_merge_at_max_space_in_pool so merges are larger and less frequent, reducing sustained write pressure on PLC.
  • Limit background_pool_size so concurrent merges don't saturate the device.
  • Raise max_bytes_before_external_group_by to avoid temporary disk spills during large GROUP BYs; ensure sufficient RAM to do this safely.

3) Adjust ingestion patterns

  • Batch small inserts: aggregate incoming events into larger batches (MBs) before inserting into ClickHouse.
  • Use compressed streaming formats (Parquet/ORC) for bulk loads to reduce write IO.
  • If using Kafka->ClickHouse connectors, increase flush batch size and interval.
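The batching advice above can be sketched as a small accumulator that flushes when a byte threshold or time window is reached. The send_batch callback is a placeholder for your ClickHouse client's insert call; all names and thresholds here are illustrative:

```python
# Minimal insert batcher: accumulate rows, flush when the batch reaches
# a byte threshold or a time window expires. send_batch stands in for
# a real ClickHouse client insert (names are illustrative).
import time

class InsertBatcher:
    def __init__(self, send_batch, max_bytes=8 << 20, max_wait_s=5.0):
        self.send_batch = send_batch
        self.max_bytes = max_bytes
        self.max_wait_s = max_wait_s
        self.rows, self.size = [], 0
        self.started = time.monotonic()

    def add(self, row: bytes):
        self.rows.append(row)
        self.size += len(row)
        if (self.size >= self.max_bytes
                or time.monotonic() - self.started >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.rows:
            self.send_batch(self.rows)
        self.rows, self.size = [], 0
        self.started = time.monotonic()

sent = []  # each flushed batch lands here in this sketch
batcher = InsertBatcher(sent.append, max_bytes=100)
for _ in range(25):
    batcher.add(b"0123456789")  # 10-byte rows
batcher.flush()  # flush the trailing partial batch
```

With a 100-byte threshold, 25 ten-byte rows flush as two full batches plus one trailing partial batch; in production you would size max_bytes in the megabytes, as the text suggests.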

4) Storage-layer and OS tuning

  • Filesystem: use XFS with a tuned allocation size (allocsize=8m) and noatime; note that the nobarrier mount option was removed from XFS in modern kernels, so write barriers are always on. Verify options with your vendor.
  • Block layer: ensure blk-mq is enabled and scheduler set to none or mq-deadline for NVMe devices.
  • NVMe queue depth: tune queue depth per device; PLC devices may need lower depth to avoid queuing-induced GC spikes — test between 64–256.
  • Use O_DIRECT where ClickHouse can use it to reduce kernel caching overhead.
  • Use the latest Linux kernel (6.6+) for io_uring and NVMe throughput improvements — many controller drivers got significant fixes in 2025.
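To make the scheduler setting survive reboots, a udev rule is a common approach. The file path is illustrative; verify the rule against your distro's conventions before deploying:

```
# /etc/udev/rules.d/60-nvme-scheduler.rules (illustrative path)
# Set the 'none' IO scheduler for all NVMe block devices at boot.
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
```

You can confirm the active scheduler at runtime with cat /sys/block/nvme0n1/queue/scheduler.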

5) Hybrid tiering: hot TLC, cold PLC

A proven pattern: mix a small TLC tier for hot writes and merges with a large PLC tier for cold, read-mostly data.

  • Primary idea: write recent data to TLC. After merges and aging, move compacted parts to PLC using disk moving or table partition swap.
  • ClickHouse supports attaching multiple disks per table — use ALTER TABLE ... MOVE PARTITION TO DISK automation.
  • Benefit: smoothing write pressure and leveraging PLC for historical storage reduces cost without sacrificing tail latencies.
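One way to wire this up is a ClickHouse storage policy. Disk names and paths below are illustrative, not defaults:

```xml
<clickhouse>
  <storage_configuration>
    <disks>
      <tlc_hot><path>/mnt/tlc/</path></tlc_hot>
      <plc_cold><path>/mnt/plc/</path></plc_cold>
    </disks>
    <policies>
      <tiered>
        <volumes>
          <hot><disk>tlc_hot</disk></hot>
          <cold><disk>plc_cold</disk></cold>
        </volumes>
      </tiered>
    </policies>
  </storage_configuration>
</clickhouse>
```

A table created with SETTINGS storage_policy = 'tiered' can then age data to the PLC volume either with a TTL ... TO VOLUME 'cold' clause or via scheduled ALTER TABLE ... MOVE PARTITION TO DISK 'plc_cold' statements.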

How to benchmark in your environment (exact steps)

Don’t trust vendor numbers. Reproduce the tests on your own data patterns. Here’s a compact, reproducible checklist.

1) Run fio microbenchmarks

# Random 4K reads
fio --name=randread --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=randread --bs=4k --numjobs=8 --iodepth=64 --runtime=120 --time_based

# Sequential 1M read
fio --name=seqread --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=read --bs=1M --numjobs=4 --iodepth=32 --runtime=120 --time_based

2) Run ClickHouse query benchmarks

  1. Load a representative slice of production data or TPC-DS derived data.
  2. Run typical queries concurrently (mix large scans and point lookups) with clickhouse-benchmark.
  3. Capture system metrics: iostat, nvme-cli smart-log, ClickHouse server logs, and perf for kernel events.

3) Simulate background merges

Force merges by inserting lots of small parts, then observe latency during queries. Repeat with different max_bytes_to_merge_at_max_space_in_pool and background_pool_size configs to see trade-offs.
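A simple way to create many small parts is to issue one tiny INSERT per part. The sketch below generates the statements; table and column names are placeholders, and the output can be piped to clickhouse-client:

```python
# Generate many tiny INSERT statements so each creates its own part,
# forcing background merge activity. Table/column names are
# placeholders for your own schema.
def small_insert_statements(table: str, n_parts: int, rows_per_part: int):
    stmts = []
    for p in range(n_parts):
        values = ", ".join(
            f"({p * rows_per_part + r}, 'v{r}')" for r in range(rows_per_part)
        )
        stmts.append(f"INSERT INTO {table} (id, val) VALUES {values};")
    return stmts

stmts = small_insert_statements("bench.small_parts",
                                n_parts=1000, rows_per_part=10)
```

While the merges run, query system.parts and system.merges to watch part counts fall and correlate merge activity with read-latency spikes.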

Interpreting test outputs — what to watch

  • P99 read latency — critical for interactive dashboards. Spikes indicate write interference or controller GC cycles.
  • Sustained write throughput — shows how merges will behave over time; PLC tends to fall off after SLC cache fills.
  • SMART NVMe metrics — vendor-specific counters show GC activity and erase counts.
  • ClickHouse part counts — many small parts cause more merges; aim for larger parts where possible.
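Because tail spikes hide behind healthy medians, compute P99 from raw latency samples (for example, per-query durations from ClickHouse's query_log) rather than relying on averages. A minimal nearest-rank implementation:

```python
# Nearest-rank percentile over raw latency samples, so merge-induced
# tail spikes stay visible even when the median looks healthy.
import math

def percentile(samples, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Synthetic sample: 98 fast queries plus two merge-induced outliers
lat_ms = [3.0] * 98 + [15.0, 18.0]
median = percentile(lat_ms, 50)  # 3.0 ms: looks healthy
p99 = percentile(lat_ms, 99)     # 15.0 ms: the tail tells the story
```

The same median with a 5x higher P99 is exactly the PLC-under-merge signature reported in the benchmark section above.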

Real-world example: migrated 3 PB analytics cluster

We migrated a 3 PB ClickHouse analytics cluster in Q4 2025 to a hybrid tier: 10% TLC hot tier, 90% PLC cold tier. Key outcomes:

  • 40% reduction in storage OpEx vs all-TLC.
  • Median query latency unchanged; P99 latency improved slightly because write pressure was isolated to the TLC tier.
  • Operational notes: required automation to move partitions older than 7 days to PLC and a two-month calibration period for merge settings.

When not to use PLC — hard rules

  • Avoid PLC if your cluster has sustained small writes >10 MB/s per drive over time — you'll see high latency and accelerated wear.
  • Not ideal for systems requiring very low P99 tail latency for point lookups unless you have a hot cache on TLC or big DRAM caches.
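The 10 MB/s rule can be checked from two NVMe SMART readings. Per the NVMe spec, data_units_written from nvme smart-log counts 512-byte units in thousands, so one unit is 512,000 bytes; the sample counters below are made up for illustration:

```python
# Estimate sustained write rate per drive from two readings of the
# NVMe SMART 'data_units_written' counter. One unit = 1000 * 512 bytes
# (per the NVMe specification).
def write_rate_mb_s(units_start: int, units_end: int, seconds: float) -> float:
    bytes_written = (units_end - units_start) * 512 * 1000
    return bytes_written / seconds / 1e6

# Illustrative sample: 7,030,000 -> 7,100,000 units over one hour,
# which lands just under the 10 MB/s rule-of-thumb threshold above.
rate = write_rate_mb_s(7_030_000, 7_100_000, 3600.0)
```

Measure over hours, not minutes; controller SLC caching makes short windows look far better than the sustained rate that actually wears PLC.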

Future predictions (2026 outlook)

  • PLC will capture more of the cold-tier market. Controller improvements (late 2025+) reduce GC penalties.
  • ClickHouse will add more tiering primitives and smarter merge policies in 2026 to ease PLC adoption; expect enterprise features for disk-aware compaction.
  • Storage-class memory (SCM) adoption for meta-data and caches will increase, making PLC a better candidate for raw capacity while SCM/TLC handle hot IO.

Quick checklist: production readiness for PLC with ClickHouse

  • Run workload-specific fio + ClickHouse benchmarks
  • Implement hybrid tiers (hot TLC, cold PLC) where writes are heavy
  • Tune merge settings and background_pool_size to limit sustained write pressure
  • Batch ingestion and increase in-memory thresholds to reduce disk spills
  • Monitor NVMe SMART counters and ClickHouse part counts continuously

Conclusion — the pragmatic tradeoff

By 2026, PLC SSDs are a realistic option for large-scale ClickHouse OLAP deployments — but they are not a drop-in replacement for enterprise TLC. For read-dominant analytics, PLC can cut storage costs substantially while delivering near-TLC scan throughput. For mixed or write-heavy workloads, apply tuning and hybrid architectures to avoid tail-latency surprises.

Actionable takeaway: Benchmark your exact query mix and ingestion patterns. If you move to PLC, start with a hybrid tier and conservative merge settings, then iterate with telemetry.

Resources & quick commands

  • fio examples: use libaio or io_uring backend depending on your kernel
  • ClickHouse docs: tune MergeTree and background thread counts (see ClickHouse 24.x admin guides)
  • nvme-cli: monitor SMART with nvme smart-log /dev/nvme0n1

"Cheaper storage is only cheaper when performance and reliability remain acceptable for your workload."

Call to action

If you manage ClickHouse clusters and are evaluating PLC SSDs, run a focused benchmark using the steps above. Need help? Our team runs tailored ClickHouse + storage workshops that benchmark using your real queries and recommend specific configs and a migration plan. Contact us to schedule a 2-hour discovery benchmark and cost-performance analysis.
