SemiAnalysis: Reinforcement Learning Training Efficiency Comes Down to Matching Throughput, Not Just Scaling Compute

Experiments on open-source RL frameworks reveal the critical bottleneck in scaling model capabilities - June 16, 2026

Reinforcement learning post-training has emerged as the secret sauce behind the most capable AI models, but scaling RL is enormously expensive. SemiAnalysis conducted extensive experiments on open-source RL frameworks to understand what truly drives system efficiency in RL training. The surprising answer: it's not about throwing more compute at the problem, but about carefully matching the throughput of two key components - the generator that creates training data and the trainer that learns from it.

The research team ran experiments using models like Qwen3-235B and GLM-5 across different RL frameworks including Prime RL, slime, and verl. What they discovered fundamentally challenges conventional thinking about RL infrastructure design.

The Queue Health Problem That Nobody Talks About

SemiAnalysis frames RL training efficiency through an elegant mental model: a queue where the generator produces rollouts and the trainer consumes them. When the generator is slower, the trainer starves and sits idle. When the generator is faster, samples age in the queue, creating what the team calls "policy staleness" - a phenomenon where the model trains on outputs generated by older versions of itself, which degrades learning quality.

In their first major experiment with Qwen3-235B-Thinking on 64 H200 GPUs for training and 192 GPUs for generation, the system became severely generation-bound. The trainer consumed samples at 2.75 samples per second but waited 30% of wall-clock time, running at just 10.5% model FLOPs utilization. The generator delivered only 1.95 samples per second despite using 3x the trainer's compute. The culprit: the model produced extremely long responses with extended reasoning traces, and the variance in response length created severe tail latency problems.
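The 30% wait figure is roughly what the queue model predicts from the two reported rates alone. A minimal sketch, assuming a steady state in which the trainer can only consume what the generator delivers (the function name is illustrative and does not come from the report):

```python
def trainer_idle_fraction(generator_rate: float, trainer_rate: float) -> float:
    """Fraction of wall-clock time the trainer waits in steady state.

    Assumes the trainer consumes at trainer_rate whenever samples are
    available and idles otherwise. This is a simplification of the
    queue model, not the report's own calculation.
    """
    if generator_rate >= trainer_rate:
        # Generation keeps up, so the trainer never waits;
        # samples age in the queue instead (policy staleness).
        return 0.0
    return 1.0 - generator_rate / trainer_rate


# Rates reported for the Qwen3-235B-Thinking run, in samples per second.
idle = trainer_idle_fraction(generator_rate=1.95, trainer_rate=2.75)
print(f"{idle:.1%}")
```

Under these assumptions the trainer idles about 29% of the time, in line with the reported 30%.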

To handle this, the team had to discard 60% of dispatched rollouts through a technique called oversampling - launching more concurrent rollouts than needed and discarding unfinished ones. This wasteful approach underscores how critical inference efficiency becomes during RL training, a point that appears underappreciated in current discussions about RL infrastructure.
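Oversampling as described above can be sketched with simulated rollouts. This is an illustration under stated assumptions, not code from any of the frameworks tested: `fake_rollout`, `oversampled_batch`, and the heavy-tailed sleep durations are all hypothetical stand-ins.

```python
import asyncio
import random


async def fake_rollout(rollout_id: int, duration: float) -> int:
    """Stand-in for a real rollout; sleeps to simulate generation time."""
    await asyncio.sleep(duration)
    return rollout_id


async def oversampled_batch(needed: int, dispatched: int, seed: int = 0):
    """Dispatch more rollouts than needed, keep the first `needed` to finish."""
    rng = random.Random(seed)
    # Heavy-tailed durations mimic the long-response stragglers.
    tasks = [
        asyncio.create_task(fake_rollout(i, 0.01 * rng.paretovariate(1.5)))
        for i in range(dispatched)
    ]
    kept = []
    for finished in asyncio.as_completed(tasks):
        kept.append(await finished)
        if len(kept) == needed:
            break
    # Cancel whatever is still running; that work is thrown away.
    for task in tasks:
        if not task.done():
            task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return kept, dispatched - len(kept)


kept, discarded = asyncio.run(oversampled_batch(needed=4, dispatched=10))
print(len(kept), discarded)
```

Keeping 4 of 10 dispatched rollouts corresponds to the 60% discard rate in the experiment: the batch no longer waits on its slowest members, at the cost of generation work that never reaches the trainer.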

Model Behavior Drift Creates Moving Targets

A second experiment with GLM-5 on 128 H200 GPUs revealed another dimension of the problem that makes RL system design uniquely challenging: the model's behavior changes during training in ways that shift system constraints. Over the course of training, the average response length per turn grew and the number of tool calls rose roughly 2.5x, from 20 to 51. This pushed sequence lengths up and shifted the workload toward a prefill-heavy profile, fundamentally changing the optimal infrastructure configuration mid-training.

Making matters worse, the curriculum proved too easy for the model - 55% of problems had a 100% solve rate where every rollout in the group passed. When every rollout receives the same reward, the advantage calculation produces zero and the group contributes no training signal. As SemiAnalysis explains, this happens when "the solve rate is near 100% or near 0%" - the task is either too easy or too hard. Average reward stayed flat despite compute expenditure.
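Why a uniformly solved group carries no signal is easiest to see numerically. A minimal sketch, assuming the advantage is each rollout's reward minus the group mean (a common group-relative baseline; the article does not give the exact formula used, and the function names are illustrative):

```python
from statistics import mean


def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout relative to its group's mean reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def has_training_signal(rewards: list[float]) -> bool:
    """A group contributes gradient only if some advantage is nonzero."""
    return any(a != 0 for a in group_advantages(rewards))


all_solved = [1.0] * 8                 # 100% solve rate: task too easy
all_failed = [0.0] * 8                 # 0% solve rate: task too hard
mixed = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]

print(has_training_signal(all_solved))
print(has_training_signal(all_failed))
print(has_training_signal(mixed))
```

Only the mixed group produces nonzero advantages. Under this assumption, with 55% of problems solved by every rollout, at most 45% of groups could contribute to a gradient update.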

The combined effects resulted in a heavily generation-bound system where the trainer spent 74% of wall-clock time waiting, and its consumption rate was 5x the generator's delivered production rate. The effective generator production rate collapsed due to filtered samples that provided no learning signal.

The Sandbox Scaling Wall

A third experiment on GB300 hardware pushed concurrent rollouts from 96 to 960 and hit a hard infrastructure wall that isn't discussed enough: sandbox scaling. Each rollout requires at least one containerized sandbox to execute code and provide rewards. At 960 concurrent rollouts, the team encountered "sandbox initialization dead errors and straggler 1-hour sandbox spin up latency." They had to scale back down to 96 concurrent rollouts, but then observed low rollout efficiency.

This reveals a fundamental constraint in RL training for coding assistants, a market that SemiAnalysis values at over $30 billion in annual recurring revenue today across six major players and that it says is on track to exceed $100 billion by year end. The sandbox infrastructure must scale linearly with concurrent rollouts, and sandbox service companies like Modal face unique challenges including startup latency, fluctuating demand, and robustness against unexpected model behaviors, such as a rollout creating a million files and exhausting memory.

Policy Staleness: The Hidden Tax of Asynchronous Training

Classic policy gradient algorithms assume all rollouts in a group come from the same model weights. This forces synchronous execution where the generator cannot update weights until finishing the current batch, creating massive inefficiency. The industry has moved to asynchronous techniques, particularly PipelineRL, which allows weight updates in-flight while rollouts are still being generated.

But this creates policy staleness - samples are generated by a mixture of old and new policies. SemiAnalysis identifies three levels of staleness: trajectory-level (gap between policy version that started the rollout and current version), token-level (weight updates happening mid-rollout so different tokens come from different policy versions), and environment state-level (particularly relevant for stateful environments).
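The first two staleness levels can be measured if each sampled token is tagged with the policy version that produced it. A minimal sketch, assuming such per-token tagging exists (the `Rollout` class and both helper functions are hypothetical and not taken from any of the frameworks discussed):

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    # Policy version that sampled each token, in generation order.
    token_policy_versions: list[int]


def trajectory_staleness(rollout: Rollout, trainer_version: int) -> int:
    """Gap between the version that started the rollout and the current one."""
    return trainer_version - rollout.token_policy_versions[0]


def has_token_level_staleness(rollout: Rollout) -> bool:
    """True if weights changed mid-rollout, so tokens mix policy versions."""
    return len(set(rollout.token_policy_versions)) > 1


# A rollout that started on version 7 and saw one in-flight update to 8.
rollout = Rollout(token_policy_versions=[7, 7, 7, 8, 8])
print(trajectory_staleness(rollout, trainer_version=9))
print(has_token_level_staleness(rollout))
```

Environment state-level staleness does not reduce to a version counter in the same way, because it lives in the sandbox contents rather than in the token stream.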

The framework slime implements a "partial rollout" feature that saves straggler rollouts to a buffer and restarts them in later batches, mitigating tail latency. But this introduces environment state-level staleness that's particularly insidious. As the team explains, "the sandbox it wakes up in is not a fresh repo. The sandbox holds the half-applied edits, created files, and working-directory state that the old policy produced over its earlier turns. The newer policy now must continue from a situation it didn't create and wouldn't necessarily have created itself." This corrupts the training signal during advantage attribution.

The Economics Tell a Brutal Story

SemiAnalysis conducted a total cost of ownership analysis comparing their experiments to Tinker from Thinking Machines Lab, a managed RL training platform. For H200 infrastructure, they calculate $1.59 per GPU per hour in total ownership cost, with capital cost contributing 72.5%. Server cost remains the dominant factor at $258,000 per server or 71% of total upfront capex of $361,000 per server.
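The quoted figures can be decomposed with simple arithmetic. A short sketch using only the numbers above; the variable names are illustrative, and treating everything outside capital cost as a single remainder is an assumption made here, not a breakdown from the report:

```python
tco_per_gpu_hour = 1.59      # USD per H200 GPU-hour, as quoted
capital_share = 0.725        # capital cost share of TCO, as quoted
server_cost = 258_000        # USD per server, as quoted
upfront_capex = 361_000      # USD per server, as quoted

# Portion of the hourly cost attributable to capital.
capital_per_gpu_hour = tco_per_gpu_hour * capital_share
# Everything else, lumped together (assumption: not itemized here).
non_capital_per_gpu_hour = tco_per_gpu_hour - capital_per_gpu_hour
# Server cost as a fraction of upfront capex.
server_share_of_capex = server_cost / upfront_capex

print(f"capital: ${capital_per_gpu_hour:.2f}/GPU-hr")
print(f"non-capital: ${non_capital_per_gpu_hour:.2f}/GPU-hr")
print(f"server share of capex: {server_share_of_capex:.1%}")
```

This puts capital cost at about $1.15 of the $1.59 per GPU-hour and confirms the server share at roughly 71% of upfront capex.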

On their Qwen3-235B experiment using slime, they landed at $16.23 per million tokens - 4.86x higher than Tinker's published $4.86 per million tokens target. On Prime RL, the gap narrowed to 2.01x at $6.90 per million tokens versus Tinker's $3.43 target. The stark difference between their slime and Prime RL costs underscores how much inference efficiency determines total cost.

SemiAnalysis hypothesizes that Tinker achieves its cost advantage primarily through multi-tenancy. Tinker provides a Low-Rank Adaptation training API where multiple users train models sharing most weights. "On the trainer side, Tinker can greatly increase efficiency by batching training requests across users, with some LoRA-specific modifications. On the generation side, multi-tenancy mitigates the straggler effect by backfilling idle slots with other tenants' rollouts when a run stalls on a straggler."

The team expects Thinking Machines Lab also applies inference optimizations like prefill-decode disaggregation and may be running Blackwell GPUs, which offer significant inference gains over Hopper according to their InferenceX analysis. The multi-tenancy advantage compounds with infrastructure and hardware improvements to create the dramatic cost gap.

Anthropic's Bet on RL Scaling

The report provides important context for why this matters. Anthropic CEO Dario Amodei has described RL as showing "the same kind of scaling pre-training once did, where performance climbs log-linearly with how long you train." But that scaling is enormously expensive, making RL system efficiency critical in determining how much RL can be afforded and therefore how far model capabilities can advance.

Concretely, Claude Opus 4.8 scores 69.2% on SWE-bench Pro and 74.6% on Terminal-Bench 2.1, and RL training is described as "a major part of what drives the score." These agentic coding capabilities don't emerge from pre-training alone - they require extensive and expensive post-training through reinforcement learning.

The open-source community has made remarkable progress. SemiAnalysis traces the lineage from OpenRLHF, one of the early efforts following DeepSeek R1's release, through to popular frameworks like slime and verl. Numerous OpenRLHF maintainers later developed these frameworks, creating vibrant Chinese communities around RL training that the team believes "positively contributed to recent advances of Chinese models." The frameworks have also enabled academic researchers to develop new algorithms and techniques, bringing RL research within reach of academia.

Framework User Experience Matters More Than Expected

The team provides candid assessments of the frameworks they tested. Prime RL receives praise for user ergonomics - most commands run through uv with configuration in TOML files, plus agent skill files for smoother AI agent integration. The hub of RL environments and support for prefill-decode disaggregation stand out as strengths. But heavy reliance on uv created friction, with the team spending significant time "compiling and re-installing flash attention 3 because we couldn't figure out why uv uninstalled it."

Prime Sandbox, still in beta, generated many failed runs late in execution. "Errors include dangling sandboxes using up sandbox quota, out-of-resource errors, out-of-credit issues, many of which can be detected before launching a run."

Slime earns praise for "clean and minimal abstractions" and particularly its hook abstractions that make customization straightforward. The development team receives high marks for responsiveness. The main criticism: focus on co-located mode resulted in sparse documentation of asynchronous modes, forcing the team to figure out mechanisms "mostly through trial and error."

Modal's sandbox API receives praise for documentation quality and service robustness at small scale. Challenges emerged at high concurrency, with dead initialization errors and long-tail startup latencies. These turned out to stem from resource limits on the account rather than hard platform limits: Modal raised the limits and the team verified stability at high concurrency. Still, the experience highlights the need for better sandbox observability tools and scaling documentation.

The brutal honesty about rough edges in open-source tooling stands in contrast to typical vendor marketing, but it serves the institutional audience that needs to understand real-world implementation challenges before committing capital to RL training infrastructure.

DruckFin