Engineering

Scaling generation to millions of samples per hour

How our distributed scheduler turns 100 hours of generation into 6 — and what we learned profiling diffusion at scale.

Engineering · Jan 12, 2026

When a customer asks for 10 million synthetic medical images, the generation itself isn't the hard problem — the scheduling is. A single diffusion model on a single GPU produces roughly 200 images per minute at clinical resolution (512×512, 16-bit). At that rate, 10 million images would take 35 days. Our customers need them in hours.
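The arithmetic behind those figures is easy to reproduce. The sketch below is only a back-of-the-envelope calculation: the function names are ours for illustration, and the GPU count assumes perfect linear scaling, which no real cluster achieves.

```python
import math

IMAGES_PER_MINUTE = 200  # single-GPU rate quoted above


def single_gpu_days(n_images: int, images_per_minute: float = IMAGES_PER_MINUTE) -> float:
    """Wall-clock days for one GPU to generate n_images at a constant rate."""
    minutes = n_images / images_per_minute
    return minutes / (60 * 24)


def gpus_needed(n_images: int, deadline_hours: float,
                images_per_minute: float = IMAGES_PER_MINUTE) -> int:
    """Minimum GPU count to hit a deadline, assuming ideal linear scaling."""
    images_per_gpu = images_per_minute * 60 * deadline_hours
    return math.ceil(n_images / images_per_gpu)


print(round(single_gpu_days(10_000_000), 1))  # 34.7
print(gpus_needed(10_000_000, 6))             # 139
```

In practice the number of GPUs required is higher than this lower bound, because warm-up, validation, and storage all take their share.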

The naive approach — spinning up hundreds of GPUs and running independent generation jobs — works until it doesn't. Without coordination, you get duplicate seeds, unbalanced class distributions, inconsistent quality metrics, and a storage pipeline that chokes on bursty writes. Scaling generation is a distributed systems problem as much as a machine learning one.

We learned this the hard way. Our first large-scale generation run — 2 million synthetic chest X-rays for a healthcare partner — took 11 days instead of the projected 3. The culprit wasn't GPU throughput; it was everything around the GPUs. Storage writes were contending with checkpoint saves. Quality validation was running synchronously, blocking generation. The class distribution drifted because independent workers weren't coordinating on which classes still needed samples. We hit every scaling pathology in the distributed systems playbook.

Our scheduler, internally called Cascade, was born from that failure. It breaks generation requests into work units defined by three axes: the target distribution (what classes and proportions to generate), the generation parameters (model checkpoint, guidance scale, seed ranges, conditioning signals), and the validation criteria (quality thresholds, distributional alignment bounds, privacy budget constraints). Each work unit is small enough to complete on a single GPU in under 60 seconds.

The work unit abstraction is key. A 10-million-image dataset becomes roughly 200,000 work units. Each unit specifies exactly what to generate (class, conditioning, seeds), what quality bar to hit, and where to write the results. Units are independent and idempotent — if a GPU fails mid-unit, the unit is simply re-issued to another node with no side effects. This makes the system naturally fault-tolerant without complex distributed consensus.
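To make the abstraction concrete, here is a minimal sketch of how a request might be split into units. It is not Cascade's actual data model: the `WorkUnit` fields, the `plan_units` helper, and the output path layout are hypothetical, and the unit size of 50 is simply what 10 million images divided by 200,000 units implies.

```python
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass(frozen=True)
class WorkUnit:
    """Hypothetical work unit: immutable, so re-issuing it is always safe."""
    unit_id: int
    class_label: str
    seed_start: int      # first seed of a range owned only by this unit
    n_samples: int
    checkpoint: str
    output_prefix: str   # where results are written (illustrative layout)


def plan_units(targets: Dict[str, int], checkpoint: str,
               unit_size: int = 50) -> Iterator[WorkUnit]:
    """Split per-class sample counts into units with disjoint seed ranges."""
    unit_id = 0
    next_seed = 0
    for label, count in sorted(targets.items()):
        remaining = count
        while remaining > 0:
            n = min(unit_size, remaining)
            yield WorkUnit(unit_id, label, next_seed, n, checkpoint,
                           f"staging/{label}/{unit_id:08d}")
            unit_id += 1
            next_seed += n
            remaining -= n


units = list(plan_units({"normal": 120, "pneumonia": 80}, "ckpt-v1"))
print(len(units))  # 5
```

Because seed ranges are allocated centrally and never overlap, two workers cannot produce the same sample, which addresses the duplicate-seed problem of uncoordinated jobs.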

Cascade distributes work units across a heterogeneous GPU pool — A100s (80GB), H100s (80GB), and occasionally spot L40S instances (48GB) — using a priority queue that accounts for per-unit-type throughput, current pool utilisation, memory requirements, and the customer's deadline. The scheduler knows that an H100 completes a medical imaging work unit in 38 seconds while an A100 takes 52 seconds, and allocates accordingly. Failed units are automatically retried on a different node with a different seed.
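A heavily simplified sketch of throughput-aware assignment follows. It is a greedy earliest-finish heuristic of our own choosing, not Cascade's algorithm: it ignores memory requirements and pool utilisation, the `assign` function is hypothetical, and the L40S timing is a placeholder we made up (only the H100 and A100 figures come from the text above).

```python
import heapq
from typing import Dict, List, Tuple

# Seconds per medical-imaging work unit. H100 and A100 values are from the
# post; the L40S value is an invented placeholder.
UNIT_SECONDS = {"H100": 38.0, "A100": 52.0, "L40S": 75.0}


def assign(jobs: List[Tuple[float, str, int]],
           gpus: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign units to GPUs.

    jobs: (deadline_seconds, job_name, n_units) tuples.
    gpus: gpu_id -> GPU type.
    Jobs are served earliest deadline first; each unit goes to the GPU
    that would finish it soonest given the work already queued on it.
    """
    queue = list(jobs)
    heapq.heapify(queue)  # min-heap ordered by deadline
    free_at = {g: 0.0 for g in gpus}
    plan: Dict[str, List[str]] = {g: [] for g in gpus}
    while queue:
        _, name, n_units = heapq.heappop(queue)
        for _ in range(n_units):
            best = min(gpus, key=lambda g: free_at[g] + UNIT_SECONDS[gpus[g]])
            free_at[best] += UNIT_SECONDS[gpus[best]]
            plan[best].append(name)
    return plan


plan = assign([(3600.0, "later", 2), (600.0, "urgent", 3)],
              {"h0": "H100", "a0": "A100"})
print(plan)
# {'h0': ['urgent', 'urgent', 'later'], 'a0': ['urgent', 'later']}
```

Even this toy version shows the effect described above: the faster H100 absorbs more units, but the A100 still picks up work whenever it would finish a unit sooner than a busy H100.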

Memory management turned out to be surprisingly tricky. Diffusion models at clinical resolution consume 30-40GB of GPU memory during generation. On a shared cluster, this means we can't simply load the model and generate forever, because other jobs need access to the same GPUs. Cascade implements a warm-pool system: frequently used model checkpoints stay loaded on dedicated GPU subsets, while less common configurations are loaded on demand with a 90-second warm-up penalty. The scheduler accounts for this warm-up cost when assigning work units.
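One way to picture the warm-up accounting is as a fixed cost added to a GPU's finish time. The sketch below is an illustration under that assumption; `finish_time` and `pick_gpu` are hypothetical names, and the per-unit timings reuse the H100 and A100 figures quoted earlier.

```python
from typing import List, Tuple

WARMUP_SECONDS = 90.0  # on-demand checkpoint load penalty from the post


def finish_time(n_units: int, unit_seconds: float, warm: bool) -> float:
    """Seconds until n_units are done, including warm-up if the GPU is cold."""
    return (0.0 if warm else WARMUP_SECONDS) + n_units * unit_seconds


def pick_gpu(n_units: int, candidates: List[Tuple[str, float, bool]]) -> str:
    """candidates: (gpu_id, seconds_per_unit, checkpoint_already_loaded)."""
    return min(candidates, key=lambda c: finish_time(n_units, c[1], c[2]))[0]


pool = [("a100-warm", 52.0, True), ("h100-cold", 38.0, False)]
print(pick_gpu(6, pool))  # a100-warm (312s vs 318s)
print(pick_gpu(7, pool))  # h100-cold (364s vs 356s)
```

With these numbers, a cold H100 only beats a warm A100 once at least seven units are queued for it, so short jobs are better served by whatever is already warm.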

The hardest engineering challenge was validation at scale. Every generated sample passes through a lightweight quality gate before being committed to the output dataset. For medical images, this includes anatomical plausibility checks (is the image structurally coherent?), artefact detection (are there generation artefacts like mode collapse patches or high-frequency noise?), distributional alignment scoring (does this sample fit the target distribution?), and privacy verification (does this sample pass a membership inference challenge against the training data?).

Running quality checks synchronously would bottleneck generation: a single validation step takes 200ms per image, on top of the roughly 300ms of generation time per image implied by the single-GPU rate of 200 images per minute, which would cut throughput by about 40%. Running them fully asynchronously risks committing bad samples and having to reprocess entire batches later. We needed something in between.

Our solution: a two-phase commit with streaming validation. Samples are written to a staging buffer in batches of 512. A separate pool of CPU workers (we use c6i.8xlarge instances, 32 vCPUs each) runs quality gates in parallel across the batch. Only after an entire batch passes all quality gates is it promoted to the final dataset with an atomic move operation. If a batch fails — typically 2-5% of batches have at least one failing sample — the failing samples are identified, discarded, and replacement work units are issued.
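The stage-then-promote flow can be sketched in a few lines. This is an in-memory illustration, not our production code: the `try_promote` helper and the toy sharpness gate are hypothetical, a list `extend` stands in for the atomic move, and threads are used only to keep the example short (CPU-bound gates would normally run in separate processes). It also assumes that surviving samples wait in staging until replacements arrive.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, float]
Gate = Callable[[Sample], bool]


def validate_batch(batch: Sequence[Sample], gates: Sequence[Gate],
                   workers: int = 8) -> Tuple[List[Sample], List[Sample]]:
    """Run every gate on every sample in parallel; return (passed, failed)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(lambda s: all(g(s) for g in gates), batch))
    passed = [s for s, ok in zip(batch, verdicts) if ok]
    failed = [s for s, ok in zip(batch, verdicts) if not ok]
    return passed, failed


def try_promote(staged: List[Sample], gates: Sequence[Gate],
                final: List[Sample], batch_size: int = 512) -> int:
    """Promote the batch only if it is full and every sample passes.

    Returns how many replacement samples are still needed (0 = promoted).
    Survivors are re-validated on each call, which is wasteful but simple.
    """
    passed, _failed = validate_batch(staged, gates)
    staged[:] = passed                 # discard failing samples
    missing = batch_size - len(staged)
    if missing == 0:
        final.extend(staged)           # stand-in for the atomic move
        staged.clear()
    return missing


gates = [lambda s: s["sharpness"] > 0.5]  # toy gate for illustration
staged = [{"sharpness": 0.9}, {"sharpness": 0.1}, {"sharpness": 0.8}]
final: List[Sample] = []
print(try_promote(staged, gates, final, batch_size=3))  # 1
staged.append({"sharpness": 0.7})  # replacement sample arrives
print(try_promote(staged, gates, final, batch_size=3))  # 0
```

The important property is that the final dataset only ever sees complete, fully validated batches, while generation keeps running ahead of validation.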

The storage architecture deserves its own post, but the short version: we write to a distributed object store (S3-compatible) with a custom sharding scheme that avoids hotspots. Each shard corresponds to a class/condition combination, and within each shard, files are named by content hash to enable deduplication. A metadata index (stored in DynamoDB) tracks which work units have been committed and enables precise distributional accounting at any point during generation.
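As a rough illustration of content-addressed naming within class/condition shards, consider the sketch below. The key layout, the `.png` extension, and the `put` helper are assumptions made for the example, and a plain dictionary stands in for the object store; our real sharding scheme is more involved.

```python
import hashlib
from typing import Dict


def object_key(class_label: str, condition: str, payload: bytes) -> str:
    """Shard by class/condition, then name the file by its SHA-256 hash."""
    digest = hashlib.sha256(payload).hexdigest()
    return f"{class_label}/{condition}/{digest}.png"


def put(store: Dict[str, bytes], class_label: str, condition: str,
        payload: bytes) -> bool:
    """Write payload unless identical content already exists in the shard.

    Returns True if a new object was written, False if it was a duplicate.
    """
    key = object_key(class_label, condition, payload)
    if key in store:
        return False
    store[key] = payload
    return True


store: Dict[str, bytes] = {}
print(put(store, "pneumonia", "pa-view", b"image-bytes"))  # True
print(put(store, "pneumonia", "pa-view", b"image-bytes"))  # False
```

Identical bytes always map to the same key, so a duplicate write is detected by a simple existence check rather than by comparing images.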

Monitoring at this scale required building custom tooling. Our dashboard shows real-time progress across all active generation jobs: samples generated per second (broken down by class), quality gate pass rates, GPU utilisation across the cluster, estimated time to completion, and distributional coverage maps showing which regions of the target distribution are complete versus still generating. When something goes wrong — a batch of GPUs producing low-quality samples due to a thermal issue, for instance — we can identify and isolate the problem in seconds rather than discovering it hours later in a quality report.

The result: we routinely generate datasets of 5-10 million samples in under 6 hours of wall-clock time, with full distributional guarantees and per-sample quality validation. Our largest single run to date produced 47 million synthetic frames for an autonomous vehicle customer in 19 hours — a task that would have taken over 160 days on a single GPU.

What used to take our partners weeks of manual curation now happens overnight — and the output is more consistent than anything a human curator could produce at this scale. The limiting factor is no longer generation capacity; it's deciding what to generate. Which brings us back to the beginning: the hard problem isn't generating synthetic data. It's knowing exactly what synthetic data your model needs.