Benchmarking PLC Flash vs Traditional SSDs for ML Workloads
2026-03-05

Design and run PLC vs TLC/QLC SSD benchmarks for dataset loads, checkpoints, and shuffle-heavy ML workloads—practical scripts, interpretation, and infra guidance.

Why your ML experiments stall at storage (and how to fix it)

Slow dataset loads, checkpoint stalls, and unpredictable shuffle performance are some of the most common productivity killers for teams running GPU-backed ML workflows in 2026. As model sizes and dataset volumes balloon, infrastructure teams face a trade-off: buy higher-density, lower-cost NAND (like SK Hynix's new PLC approach) or stick with more mature TLC/QLC devices that offer better endurance and predictable performance. This article shows you how to design and run practical benchmarks that compare SK Hynix PLC flash against TLC/QLC SSDs for the three most critical ML workload patterns: dataset loading, checkpointing, and shuffle-heavy training. You’ll come away with actionable criteria for infrastructure decisions and concrete benchmarking artifacts you can run in your lab.

The 2026 context: density vs endurance amid a memory squeeze

Late 2025 and early 2026 saw two trends converge. First, NAND manufacturers accelerated higher-density designs (SK Hynix's PLC innovations that effectively "chop cells" to improve PLC viability were widely reported), unlocking lower $/GB options for data centers. Second, AI-driven demand kept memory and SSD pricing elevated following supply pressures highlighted at CES 2026. The upshot: teams can now consider PLC-class devices for affordable capacity, but must weigh this against performance under sustained writes and endurance—factors that directly affect ML training throughput and operational risk.

What to benchmark and why

Different ML I/O patterns stress storage differently. A single benchmark won’t capture trade-offs. Focus on three targeted tests that map to day-to-day ML operations:

  • Dataset loading (read-heavy) — large sequential reads and many small random reads from many files (image/TFRecord/Parquet loads).
  • Checkpointing (write-heavy, often sequential but occasionally random) — periodic large writes and fsyncs, durability and tail latency matter.
  • Shuffle-heavy training (mixed random read/write) — heavy random I/O, small block sizes, frequent fsync or barrier operations depending on framework.

Design principles for fair SSD comparisons

To make results actionable for procurement and architecture, follow these rules:

  1. Compare like-for-like: same capacity class and form factor (e.g., U.2 or U.3 NVMe), same controller generation where possible. If controllers differ, note that results reflect device+controller performance.
  2. Test both burst and sustained scenarios: many high-density NAND SSDs (QLC/PLC) rely on SLC caches that mask poor sustained write performance—measure both.
  3. Measure across queue depths: ML loaders often use low QD (1-8), but cloud I/O stacks and parallel dataloaders can create higher QDs; sample QD 1, 4, 16, 32.
  4. Include fill-level and warmed-up states: run tests at fresh-out-of-box and after the drive is ~70-90% full and after sustained writes to observe GC-induced degradations.
  5. Collect latency percentiles: not just average IOPS/throughput but P50, P95, P99, and tail latencies (important for checkpoint tail stalls).

Testbed and tooling (practical setup)

Use a consistent, reproducible environment. Example testbed:

  • Host: Dual-socket x86 server or a standard cloud GPU node (e.g., 8–64 vCPU, 512GB RAM) depending on target deployment.
  • GPUs: optional for mixed tests; most storage tests can be done CPU-only.
  • SSDs: one SK Hynix PLC device (same capacity class) and comparable TLC and QLC SSDs from other vendors.
  • Software: Linux 6.x (or a recent stable kernel), fio, iostat/sysstat, blktrace or bpftrace for deep analysis, nvme-cli for SMART logs. Use Python and PyTorch to create realistic dataset-shuffle workloads.

Essential commands

Install tools:

sudo apt update && sudo apt install -y fio sysstat nvme-cli blktrace python3-pip
pip3 install torch torchvision

Benchmark recipes

1) Dataset loading (sequential and many-small-files)

Goal: measure throughput and latency for large sequential reads (e.g., TFRecords/parquet) and many small file reads (image datasets).

fio config for large sequential reads

; note: jobs that write to /dev/nvme0n1 destroy all data on the device; use a dedicated test drive
[seq-read-1m]
ioengine=libaio
rw=read
bs=1m
direct=1
size=100G
runtime=300
iodepth=16
filename=/dev/nvme0n1
group_reporting=1
stonewall

[randread-4k-smallfiles]
ioengine=libaio
rw=randread
bs=4k
direct=1
size=20G
runtime=300
iodepth=4
directory=/mnt/dataset_many_small_files
nrfiles=1000
group_reporting=1
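The small-file job reads from a populated directory. If you don't already have a dataset on the device, a throwaway generator like this can create one (the path, file count, and file size are illustrative, not fio requirements):

```python
import os
import random

def make_small_file_dataset(root, n_files=1000, size_kb=128, seed=0):
    """Create n_files files of size_kb KB each, filled with deterministic
    pseudo-random bytes, to serve as a small-file read benchmark target."""
    rng = random.Random(seed)
    os.makedirs(root, exist_ok=True)
    for i in range(n_files):
        path = os.path.join(root, f"sample_{i:06d}.bin")
        with open(path, 'wb') as f:
            f.write(rng.randbytes(size_kb * 1024))
    return n_files

# Example: make_small_file_dataset('/mnt/dataset_many_small_files')
```

Point the fio `directory=` option and the dataloader exercise below at the same directory so every measurement sees identical data.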

Also run a PyTorch data-loader style exercise to simulate parallel workers and random access:

import os
from torch.utils.data import Dataset, DataLoader

class FileDataset(Dataset):
    """Returns raw file bytes, mimicking an image-folder loader's I/O pattern."""
    def __init__(self, root):
        self.files = sorted(os.path.join(root, f) for f in os.listdir(root))
    def __len__(self):
        return len(self.files)
    def __getitem__(self, idx):
        with open(self.files[idx], 'rb') as f:
            return f.read()

if __name__ == '__main__':  # guard required for num_workers > 0 with the spawn start method
    loader = DataLoader(FileDataset('/mnt/dataset_many_small_files'),
                        batch_size=32, num_workers=8,
                        collate_fn=list)  # keep raw bytes; skip tensor collation
    for i, batch in enumerate(loader):
        if i > 100:
            break

2) Checkpointing

Checkpointing is write-dominated and sensitive to fsync and tail latency: a single stalled checkpoint can delay iterations, wasting GPU time.

fio config for sustained sequential writes

[seq-write-1m]
ioengine=libaio
rw=write
bs=1m
direct=1
size=200G
runtime=600
iodepth=8
filename=/dev/nvme0n1
group_reporting=1

Python loop to simulate periodic model.save with fsync

import os, time

os.makedirs('/mnt/checkpoints', exist_ok=True)
buf = b'0' * (64 * 1024 * 1024)  # 64 MB chunk per checkpoint
for ckpt in range(50):
    with open('/mnt/checkpoints/model.ckpt', 'ab') as f:
        f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # durability barrier; this is where tail stalls surface
    time.sleep(10)  # typical checkpoint interval
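To turn that loop into a measurement, time each write-plus-fsync cycle and look at the tail. A minimal sketch (helper names are illustrative):

```python
import os
import statistics
import time

def time_checkpoints(path, n_ckpts=20, chunk_mb=64):
    """Append chunk_mb MB and fsync, n_ckpts times; return per-checkpoint
    durations in seconds."""
    buf = b'0' * (chunk_mb * 1024 * 1024)
    durations = []
    for _ in range(n_ckpts):
        t0 = time.perf_counter()
        with open(path, 'ab') as f:
            f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        durations.append(time.perf_counter() - t0)
    return durations

def p99(samples):
    """P99 of a list of samples (needs at least two data points)."""
    return statistics.quantiles(samples, n=100)[98]
```

On an SLC-cached PLC/QLC drive, expect the later samples to be markedly slower than the early ones once the cache fills.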

3) Shuffle-heavy training (random reads and writes)

Shuffle stress combines small random writes and reads; frameworks like DataLoader with persistent workers, Lightning’s shuffling, or DALI create this load pattern.

[randrw-4k]
ioengine=libaio
rw=randrw
rwmixread=60
bs=4k
direct=1
size=50G
runtime=600
iodepth=32
filename=/dev/nvme0n1
group_reporting=1
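If you would rather generate the same mixed pattern from Python (for example inside a training container without fio), a rough equivalent of the 60/40 randrw job looks like this. Note it uses buffered I/O, so absolute numbers will be kinder than fio's direct=1 results; the access pattern is what matters:

```python
import os
import random

def shuffle_io_pass(path, file_mb=64, block=4096, n_ops=10000, read_frac=0.6, seed=0):
    """Issue n_ops mixed 4K random reads/writes against a scratch file,
    with read_frac of operations being reads (matching rwmixread=60)."""
    rng = random.Random(seed)
    size = file_mb * 1024 * 1024
    with open(path, 'wb') as f:  # pre-size the scratch file
        f.truncate(size)
    reads = writes = 0
    with open(path, 'r+b') as f:
        for _ in range(n_ops):
            f.seek(rng.randrange(0, size - block))
            if rng.random() < read_frac:
                f.read(block)
                reads += 1
            else:
                f.write(os.urandom(block))
                writes += 1
        os.fsync(f.fileno())
    return reads, writes
```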

What metrics to collect

Record these per test:

  • Throughput (MB/s) and IOPS at different block sizes and QDs.
  • Latency percentiles (P50, P95, P99, P99.9) in ms.
  • Tail stalls (observed long fsyncs or write stalls >100ms).
  • Performance over time — plots showing SLC-cache depletion and sustained behavior.
  • NAND health / SMART (use nvme-cli to read firmware-reported metrics and TBW counters).
  • Power usage (if available) — energy per GB transferred can be material in large clusters.
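Run fio with --output-format=json and the latency percentiles fall out mechanically. This parser assumes the clat_ns/percentile layout used by recent fio releases; verify the field names against your fio version's output:

```python
import json

def fio_p99_ms(fio_json_text, op='read'):
    """Extract per-job P99 completion latency (ms) from fio
    --output-format=json output, for the given op ('read' or 'write')."""
    doc = json.loads(fio_json_text)
    out = {}
    for job in doc['jobs']:
        pct = job[op]['clat_ns'].get('percentile', {})
        if '99.000000' in pct:
            out[job['jobname']] = pct['99.000000'] / 1e6  # ns -> ms
    return out
```

Call it once with op='read' and once with op='write' for the randrw job to see both sides of the mix.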

Interpreting likely outcomes — what to expect

Below are outcome patterns we’ve seen when comparing high-density NAND (PLC/QLC) to TLC in 2025–2026 lab runs. Use these as a decision rubric.

Dataset loading (read-heavy)

  • PLC/QLC devices generally deliver competitive sequential read throughput, making them attractive for storing large archived datasets or read-mostly shards. If your workload is mostly large sequential reads (e.g., preprocessed TFRecords), PLC can cut $/GB significantly.
  • Small-file random reads depend on controller and DRAM cache. TLC often outperforms PLC in random-read latency and tail percentiles, affecting small-batch loading. If many small reads with low QD matter, TLC remains safer.

Checkpointing (write-heavy)

  • PLC devices typically expose a large SLC cache mechanism for high burst throughput, but sustained write performance collapses once the cache is exhausted. That can turn checkpointing from a 100ms operation into seconds or worse, stalling training.
  • TLC SSDs usually provide more stable sustained write performance and higher endurance (TBW), reducing operational risk for frequent checkpoints.

Shuffle-heavy workloads (random mixed I/O)

  • Random write performance and latency consistency are critical. PLC/QLC devices can show widely variable tail latency under mixed loads. If your shuffle pattern is high-frequency and write-heavy, favor TLC or use an architecture that avoids sustained writes to the same device.

In short: PLC buys capacity and lower $/GB for read-dominant dataset storage. For write-dominant or latency-sensitive operations—checkpoints and shuffle—TLC (or tiered approaches) is often the safer choice.

Actionable infrastructure strategies

Use benchmark results to design a cost-optimized, robust storage architecture for ML workloads:

  1. Tier data by access pattern
    • Store cold or read-mostly datasets on PLC (cheap capacity). Validate sequential read behavior through your dataset-loading benchmark.
    • Place active training volumes (checkpoints, shuffle scratch) on TLC or enterprise NVMe to avoid tail latencies and endurance surprises.
  2. Use ephemeral SLC-like burst buffers — either node-local NVMe with guaranteed SLC-style performance or in-memory (ramdisk) for very write-heavy shuffle operations; persist asynchronously to cheaper PLC storage.
  3. Adjust checkpointing frequency and sharding — fewer, incremental checkpoints or chunked checkpointing reduces write amplification. Use atomic multi-file schemes to avoid large fsyncs.
  4. Overprovision and tune firmware — increase overprovisioning (reserve capacity) on PLC/QLC drives to improve sustained performance and longevity; talk to vendors about firmware modes geared to data center ML workloads.
  5. Monitor SMART/TBW closely — integrate nvme-cli based checks into your node health pipeline to catch endurance cliffs and schedule replacements before failures affect experiments.
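Strategy 3 (chunked checkpointing) can be sketched as fixed-size shards published via an atomic manifest rename, so readers never observe a partial checkpoint. The names and layout here are illustrative, not a standard:

```python
import os

def write_checkpoint_sharded(data: bytes, dest_dir, shard_mb=64):
    """Write a checkpoint as fixed-size shards, fsyncing each, then publish
    atomically via a MANIFEST rename so a crash never exposes a partial set."""
    os.makedirs(dest_dir, exist_ok=True)
    shard_size = shard_mb * 1024 * 1024
    names = []
    for i in range(0, max(len(data), 1), shard_size):
        name = f"shard_{i // shard_size:05d}.bin"
        tmp = os.path.join(dest_dir, name + '.tmp')
        with open(tmp, 'wb') as f:
            f.write(data[i:i + shard_size])
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, os.path.join(dest_dir, name))
        names.append(name)
    manifest_tmp = os.path.join(dest_dir, 'MANIFEST.tmp')
    with open(manifest_tmp, 'w') as f:
        f.write('\n'.join(names))
        f.flush()
        os.fsync(f.fileno())
    os.rename(manifest_tmp, os.path.join(dest_dir, 'MANIFEST'))
    return names
```

Smaller shards also spread fsyncs out over time, which softens the single large tail stall a monolithic checkpoint can produce.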

How to convert benchmarks into buying decisions

Procurement should evaluate SSDs across three axes: performance consistency, endurance, and cost. Here’s a simple decision matrix derived from benchmark outputs:

  • If dataset loading is >80% of I/O and sequential throughput is dominant: prioritize lowest $/GB (PLC ok), but set strict SLAs around sustained read tests and perform fill-level tests.
  • If checkpointing or shuffle latency directly blocks GPUs: prioritize predictable P99 latency and TBW (TLC/enterprise class). Cost-per-GPU-hour saved by avoiding stalls often outweighs raw $/GB savings.
  • If mixed and variable workloads: consider mixed deployments with per-node caching and automated tiering—use PLC for long-term storage, TLC for hot volumes.
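The "cost-per-GPU-hour vs $/GB" comparison in the second bullet is simple arithmetic. All prices and rates below are placeholders to replace with your own measurements and quotes:

```python
def stall_cost_per_month(gpu_hourly_usd, stall_s_per_ckpt, ckpts_per_hour,
                         gpus=8, hours=720):
    """Monthly cost of GPU time lost to checkpoint stalls on one node."""
    stall_hours = stall_s_per_ckpt * ckpts_per_hour * hours / 3600
    return stall_hours * gpu_hourly_usd * gpus

def capacity_savings_per_month(tb, plc_usd_per_gb, tlc_usd_per_gb, months=36):
    """Monthly savings of PLC over TLC for tb terabytes, amortized over
    the drive's expected service life."""
    return tb * 1000 * (tlc_usd_per_gb - plc_usd_per_gb) / months
```

If stall_cost_per_month exceeds capacity_savings_per_month for a node class, the cheaper NAND is a false economy for that tier.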

Sample analysis: reading benchmark outputs (what to look for)

When you run fio and your PyTorch simulation, look for these red flags:

  • Sudden drops in throughput mid-test — likely SLC cache exhaustion on PLC/QLC devices.
  • Large divergence between average and P99 latencies — indicates tail latency risk that affects checkpoints.
  • SMART-reported media errors or rapid TBW accumulation — indicates poor endurance for your write pattern.
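The first red flag (SLC-cache exhaustion) is easy to detect programmatically from a throughput time series. The 50% drop threshold here is an assumption; tune it for your devices:

```python
def find_throughput_cliff(samples_mbs, window=10, drop_ratio=0.5):
    """Return the first index where the rolling-average throughput falls
    below drop_ratio times the initial rolling average, or None."""
    if len(samples_mbs) < 2 * window:
        return None
    baseline = sum(samples_mbs[:window]) / window
    for i in range(window, len(samples_mbs) - window + 1):
        avg = sum(samples_mbs[i:i + window]) / window
        if avg < drop_ratio * baseline:
            return i
    return None
```

Feed it per-second throughput samples (e.g., from iostat or fio's log files) and the returned index approximates when the cache ran out.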

Operational playbook — example checklist

  1. Run cold tests (fresh drive, no warm-up), then sustained write tests for 1–2 hours to reveal SLC exhaustion behavior.
  2. Load realistic datasets with your actual dataloader configuration and measure epoch wall-clock.
    • Measure end-to-end epoch time with and without local caching to estimate practical impact.
  3. Simulate checkpoint intervals and measure GPU stall time. If checkpoint P99 > checkpoint interval, you must adjust architecture.
  4. Estimate TBW extrapolated from observed write rate: project replacement cadence and factor into TCO.
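Step 4's TBW extrapolation is one line of arithmetic. A sketch (the rating and write rate are examples, not vendor data):

```python
def projected_lifetime_years(tbw_rating_tb, observed_write_mb_s, duty_cycle=1.0):
    """Years until the rated TBW is consumed at the observed sustained
    write rate, scaled by the fraction of time the drive is actually writing."""
    tb_per_year = observed_write_mb_s * duty_cycle * 86400 * 365 / 1e6
    return tbw_rating_tb / tb_per_year
```

Compare the result against your hardware refresh cycle: a projected lifetime shorter than the depreciation window means the drive's replacement cost belongs in the TCO model.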

Future-looking guidance (2026 and beyond)

Higher-density NAND trends (PLC and beyond) will continue to change the cost calculus. Expect the following through 2026–2027:

  • More variants and firmware classes: vendors will ship PLC with different firmware profiles (read-optimized, mixed-use, endurance-boost), offering better choices for ML workloads.
  • Tiering and software-defined storage will become standard: automated hot/cold movement reduces the need to compromise between capacity and latency manually.
  • Hardware-accelerated compression and smarter controllers will change real-world throughput and TBW behavior; always test the final vendor SKU.

Mount options and sysctl tweaks to reduce write amplification and improve predictability:

# Mount read-heavy dataset with noatime
sudo mount -o noatime,nodiratime /dev/nvme0n1p1 /mnt/dataset

# For NVMe log monitoring (add to cron/Prometheus exporter)
nvme smart-log /dev/nvme0n1

Closing recommendations

Use PLC where it makes sense: cost-effective storage for large, read-dominant datasets. But for checkpointing and shuffle-heavy operations that directly interrupt GPU throughput, favor TLC or an architecture that provides low-latency burst buffers. Benchmarks are the only way to know how a given SSD model performs in your particular stack—controller firmware differences matter a lot.

Actionable next steps

  1. Run the three recipes above on representative hardware and collect throughput, IOPS and latency percentiles.
  2. Compare TBW projections to expected write rates to compute replacement cadence and TCO.
  3. Deploy hybrid tiers: PLC for dataset capacity + TLC for hot volumes. Validate by running two-week soak tests under real training schedules.

Want help executing these benchmarks at scale? Our team at smart-labs.cloud helps engineering teams design lab experiments, run reproducible benchmarks across node fleets, and translate results into procurement and architecture decisions. Contact us for a pilot that quantifies the real cost of SSD choices on your GPU utilization and experiment velocity.

Call to action

Run a controlled PLC vs TLC/QLC benchmark in your environment this week. If you want a repeatable suite and analysis dashboard pre-built for your cluster, request a smart-labs.cloud pilot — we’ll help you choose the right mix of capacity and performance so your teams spend more time iterating models and less time waiting on storage.
