An OpenEnv Benchmark Where LLMs Learn to Budget Their Own Thinking Across a Shared-Budget Episode.
Frontier reasoning models — DeepSeek-R1, QwQ, Qwen3 thinking mode, the o-series — over-spend tokens on easy items and under-spend on hard ones. Chain-of-thought length is only weakly correlated with ground-truth difficulty: trivial arithmetic consumes thousands of thinking tokens, and genuinely hard items get truncated before a \boxed{}. The finding is repeated across Han et al.'s Token-Budget-Aware LLM (arXiv:2412.18547), Xu et al.'s Chain-of-Draft (arXiv:2502.18600), and Moonshot AI's Kimi K1.5 Long2Short ablations (arXiv:2501.12599).
Inference tokens are not a per-query resource. In any real deployment — an exam battery, an eval suite, a multi-turn tool loop, a long-horizon agent — they are a shared, capped resource across a sequence of prompts. Misallocation is not just slow; it is lost accuracy per dollar. A deployed reasoner has to infer difficulty from text alone, decide what a prompt is worth given what's left, conserve on easy items so it can invest on hard ones, and pace itself under irrecoverable depletion.
The four families in Prior Work each cover one axis of reasoning-length control, and each leaves the same axis empty.
| Family | What it optimizes | Axis still empty |
|---|---|---|
| Prompt-Guided Token-Budget, Chain-of-Draft, CCoT, Token Complexity | Shorten a single chain via prompting | No cross-prompt budget; no learning |
| RL with Length Reward L1/LCPO, O1-Pruner, Kimi K1.5 Long2Short, DAST, SelfBudgeter | RL-trained per-response length control. Long2Short distills a long-reasoning teacher into a shorter policy; the reward is conditioned on one chain's length. | The policy cannot trade tokens from Q1 to Q7 — it is never shown a shared budget |
| SFT / Distillation CoT-Valve, TokenSkip, Z1 | Bake shorter reasoning into the weights | Still per-prompt; no episode state |
| Dynamic Early Exit Dynasor-CoT, Budget Forcing / s1, DEER | Decoding-time termination within one prompt | The policy has no knowledge that another prompt downstream will also need tokens |
The delta in one sentence. Kimi K1.5 Long2Short asks "when should I stop this chain?"; ReasoningEconomicsEnv asks "how should I split my budget across these N chains?" — a different action space (portfolio over prompts) under a different reward surface (joint accuracy × utilization across a battery). Long2Short has no notion of a shared episode budget; it cannot express the trade "save 400 tokens on Q1 so I have them on Q4" because Q1 and Q4 never share state in its MDP.
Long2Short is the offline, single-chain limit of this MDP (N=1, no shared budget); a numerical comparison would require degenerating our env to N=1, at which point the two methods become equivalent by construction. We therefore frame Long2Short as a special case of our formulation rather than a competing baseline.
ReasoningEconomicsEnv is an OpenEnv-native RL environment where the LLM is both the budget allocator (by choosing how long to think) and the solver (by producing the answer). The environment is a stateless grader and budget accountant. Over a multi-turn episode, the agent learns meta-reasoning: when to think long, when to think cheaply, how to trade correctness against compute under a single shared cap — with no difficulty labels.
To our knowledge, this is the first sequential multi-turn MDP that jointly incentivizes reasoning-trace reduction and answer accuracy under a shared, session-level budget. Every prior family — prompt-guided caps, RL with length rewards including Kimi K1.5 Long2Short, SFT/distillation on compressed traces, dynamic early exit — optimizes compression within a single query. None learn pacing across a sequence of queries.
Per-query budgeting is a local optimization. Shared-budget reasoning is a sequential resource-allocation problem with partial observability over future question difficulty.
| Axis | Prior work (R1 / QwQ / Xu 2025 / Kimi K1.5) | ReasoningEconomicsEnv |
|---|---|---|
| Budget scope | Per-query, isolated | Shared across N questions |
| Difficulty signal | Explicit label or classifier | Inferred from text only |
| Horizon | Single step | Sequential (N steps / episode) |
| Pacing pressure | None | Irrecoverable depletion |
| Training cost | Live API per rollout | Grader-only env (CPU) + local vLLM |
| Decision learned | How short can this answer be? | How should I spend what I have left? |
The failure modes we want to surface are distinctly sequential:
| Failure | What goes wrong |
|---|---|
| Over-invest early | Budget gone before the last (possibly hard) question arrives |
| Over-conserve | Easy questions answered; hard questions starved, cap under-used |
| Fixed pacing | Uniform spend ignores difficulty variance across items |
| Thinking-mode blowup | <think>…</think> runs past max_completion_length; answer truncated, grading returns zero |
| Unit drift | Budget cap and spend tallied in different tokenizers — phantom budget |
Every row in that table is a real failure mode we hit and diagnosed end-to-end (see Engineering Lessons and Training Runs).
The reasoning-economics literature falls into four families. All four optimize per-prompt reasoning length; none expose a shared cross-prompt token budget. ReasoningEconomicsEnv is the missing fifth regime — portfolio allocation under a joint budget.
Inference-time prompting asks the model to self-regulate. No training signal, per-query scope.
| Method | Mechanism | Link |
|---|---|---|
| Token-Budget (Han et al., 2024) | LLM self-estimates a token budget per query and embeds it in the prompt to constrain CoT length; reports ~68% token reduction with minimal accuracy loss | arXiv:2412.18547 |
| Chain-of-Draft (Xu et al., 2025) | Prompts the model to write ≤5 words per reasoning step; matches CoT accuracy at ~7.6% of the tokens | arXiv:2502.18600 |
| CCoT (Renze & Guven, 2024) | Appends "be concise" to CoT prompts; reduces length with a minor accuracy penalty on weaker models | arXiv:2401.05618 |
| Token Complexity (Lee et al., 2025) | Benchmarks compression prompts (word limits, bullets, abbreviations); finds LLMs natively adjust length to difficulty even without sophisticated prompting | arXiv:2503.01141 |
Training-time methods that shape a reward around reasoning length on single-prompt responses.
| Method | Mechanism | Link |
|---|---|---|
| L1 / LCPO (Aggarwal & Welleck, 2025) | GRPO with a length-penalty reward; controls reasoning length via a "Think for N tokens" prompt prefix | arXiv:2503.04697 |
| O1-Pruner (Luo et al., 2025) | PPO with a reward that penalizes token usage relative to a target length; applied to Marco-o1 and QwQ | arXiv:2501.12570 |
| Kimi K1.5 / Long2Short (Moonshot AI, 2025) | Length-conditioned RL distillation of a long-reasoning teacher into a shorter policy. The paper ReasoningEconomicsEnv most directly contrasts with — Long2Short shortens one chain, we allocate across many. | arXiv:2501.12599 |
| DAST (Shu et al., 2025) | SimPO-based preference optimization on constructed short/long preference pairs | arXiv:2503.04472 |
| SelfBudgeter (Li et al., 2025) | Model prepends a self-predicted token budget before reasoning and is trained to respect it | arXiv:2505.11274 |
Supervised fine-tuning on shortened CoT traces. No RL, no budget state.
| Method | Mechanism | Link |
|---|---|---|
| CoT-Valve (Ma et al., 2025) | Single model trained on CoT of varying lengths; inference-time "valve" parameter controls reasoning depth | arXiv:2502.09601 |
| TokenSkip (Xia et al., 2025) | Compresses existing CoT by skipping non-essential tokens, then fine-tunes on the compressed traces | arXiv:2502.12067 |
| Z1 (Zhang et al., 2025) | SFT on compressed-thought data that shortens each reasoning step | arXiv:2504.00810 |
Decoding-time heuristics that terminate a single chain early. The policy has no knowledge of downstream prompts.
| Method | Mechanism | Link |
|---|---|---|
| Dynasor-CoT (Fu et al., 2025) | Probes intermediate answers at fixed intervals; terminates when consecutive answers agree | arXiv:2412.20993 |
| Budget Forcing / s1 (Muennighoff et al., 2025) | Forces end-of-thinking + "Final Answer:" at the max token budget; simple and strong baseline | arXiv:2501.19393 |
| DEER (Yang et al., 2025) | Detects reflection signals (e.g., "Wait,", "Let me check") in the output as dynamic exit points and terminates reasoning if deemed sufficient | arXiv:2504.15895 |
| Component | What we inherit | Link |
|---|---|---|
| OpenEnv | Gym-style reset/step over WebSocket; HF Space deployment; per-session state; concurrent sessions | HF Blog: Introducing OpenEnv |
To our knowledge, this is the first OpenEnv-native RL environment — and the first sequential MDP of any kind — where a single reward function jointly incentivizes reasoning-trace reduction and answer accuracy across a multi-turn, shared-budget episode.

Scope caveat. The novelty is the MDP, reward coupling, and budget accounting. The RL method (GRPO on verifiable math) is shared with the DeepSeekMath / Kimi K1.5 lineages; we reuse those techniques rather than propose new ones.
A stateless grader plus a budget accountant, served over OpenEnv's WebSocket protocol. The LLM is the policy. The reward is verifiable. The MDP is multi-turn. Nothing else is invented.
Each episode samples 10 math questions (configurable, num_questions) from meta-math/MetaMathQA — keyed by type (GSM_SV, MATH_FOBAR, …) and drawn from the first 5 000 rows of the dataset (subset_start_idx=0, subset_size=5000) so every run samples from the same fixed window. The agent receives one question at a time alongside its remaining budget, chooses how long to reason, and emits a single response containing its chain-of-thought and a \boxed{…} final answer. The environment grades the answer against ground truth and returns a reward and the next question — until the episode terminates (10 questions completed or budget exhausted, depending on mode).
Dataset scope today: MetaMathQA only, first 5 000 rows. AI-MO/NuminaMath-TIR is wired into the sampler (NUMINA_PROBLEM_TYPE = "NuminaMath_TIR", numina_subset_size) but kept out of the current training mix; enabling the Numina channel for an even MetaMath + Numina mix is tracked in Future Work.
The agent's action interface is deliberately minimal: raw text / JSON output, no tool-call protocol, no markdown parsing fragility. The LLM outputs a response string; the env parses it. Crucially, the training client (ReasoningEconomicsPT) never imports env Pydantic types — it speaks dict shapes over the wire, matching OpenEnv's client/server contract.
The core contract is two Pydantic types exchanged over the OpenEnv WebSocket:
# Observation (env → agent)
class ReasonBudgetObservation(Observation):
question: str # raw problem text
remaining_budget: int # tokens left in the episode
questions_remaining: int
budget_per_remaining_question: float # pacing signal
accuracy_so_far: float
episode_history: list[HistoryItem] # in-context Q/A memory
done: bool
reward: Optional[float]
metadata: dict # problem_type, total_budget, budget_source,
# budget_mode, min_tokens, max_tokens
# Action (agent → env)
class ReasonBudgetAction(Action):
response: str # thinking trace + \boxed{answer}
metadata: dict # optional tokenizer_name override,
# optional grading_response (visible tail)
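To make the wire contract concrete, here is a minimal sketch of an episode loop from the client's point of view, speaking only dict shapes. The session/reset/step helper names and the exact payload keys are illustrative assumptions that paraphrase the contract above — they are not the actual ReasoningEconomicsPT code; only the observation fields mirror ReasonBudgetObservation.

```python
# Minimal sketch (assumed helper names): a dict-typed episode loop over the
# OpenEnv WebSocket. Observation keys mirror ReasonBudgetObservation; the action
# is just {"response": ..., "metadata": ...}.
def run_episode(session, generate_response):
    obs = session.reset()                                   # plain dict, no Pydantic
    episode_reward = 0.0
    while not obs["done"]:
        prompt = (
            f"{obs['question']}\n"
            f"[remaining budget: {obs['remaining_budget']} tokens, "
            f"{obs['questions_remaining']} questions left]"
        )
        text = generate_response(prompt)                    # policy decides how long to think
        obs = session.step({"response": text, "metadata": {}})
        episode_reward += obs.get("reward") or 0.0
    return episode_reward
```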
The reward has two components, both grounded in the OpenEnv per-step plus terminal-bonus pattern. Both terms are alive at once — the per-step cost penalty rewards shorter traces, the correctness term rewards right answers, and the terminal bonus couples the two multiplicatively so neither can be sacrificed to optimize the other.
Per-step (accumulated every turn):
- correctness: +1.0 if extract_boxed_answer + SymPy equality matches ground truth, else −0.1 — incentivizes answer accuracy (wrong answers carry a small per-step penalty; see Two-repo cheat sheet / compute_reward in env/reward.py).
- efficiency_bonus: reward for cheap correct answers on easy items — incentivizes trace reduction when correctness is preserved.
- cost_penalty: linear in tokens spent this turn — direct trace-length pressure.
- overspend_penalty: active only in soft-budget mode; 0 under hard cap.

Terminal (added to the final step's reward):
r_terminal = λ_ep × episode_accuracy × budget_utilization_score, where budget_utilization_score = max(0, 1 − |spent/total_budget − target_utilization|) rewards finishing close to, but not over, the target utilization. fair_share = total_budget / num_questions is used in both efficiency_bonus and cost_penalty.
Why the product form. Compressing every trace to zero tokens (accuracy = 0) and answering correctly but wasting the budget (utilization bad) are both punished. The only way to maximize r_episode is to spend the budget well and be right.
All runs in this blog use the repo defaults from ReasoningEconomicsEnv/env/config.py and ReasoningEconomicsEnv/env/reward.py — nothing tuned per-run. They are:
| Symbol | Name | Default | Role |
|---|---|---|---|
| β | beta (cost-penalty weight) | 0.05 | Linear per-step token cost: β · max(0, tokens_used / fair_share − 1). Only fires when the step overspends its fair share. |
| γ | gamma (efficiency-bonus weight) | 0.1 | Reward for solving under fair share: γ · (1 − spend_ratio), correct steps only. |
| λ_ep | lambda_ep (terminal weight) | 0.5 | Scales terminal episode_accuracy × budget_utilization_score. Product form prevents unilateral optimization of either factor. |
| — | target_utilization | 0.9 | Utilization peak for budget_utilization_score; rewards finishing close to 90% of the total budget. |
| — | correctness reward | +1.0 / −0.1 | Per-step: +1 on SymPy match, −0.1 on wrong — a small negative signal so trivial "don't answer" policies lose reward. |
| — | soft_overspend_penalty | 0.25 | Active only in soft-budget mode: 0.25 · (overspend_tokens / fair_share). Hard-cap mode zeroes this term. |
| — | budget_ratio | 2.0 | Fallback total-budget multiplier when no total_budget and no tokenizer are passed (budget priority table). |
| — | num_questions / min_tokens / max_tokens / max_tokens_per_step | 10 / 10 / 800 / 2048 | Episode length and per-step token window; min_tokens also sets hard-cap early termination. |
Values are the EnvConfig dataclass defaults and the default kwargs on compute_reward / compute_episode_bonus. They were not swept in this submission; tuning β, γ, and λep jointly against baseline runs is part of Future Work.
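To tie the table to the reward math, here is a compact sketch of the per-step and terminal terms using the defaults above. It paraphrases the prose description of compute_reward / compute_episode_bonus — the actual env/reward.py may clamp or order terms differently, and the exact definition of overspend in soft-budget mode is our reading of the table.

```python
# Sketch of the Scoring math with the table's default weights (not env/reward.py itself).
def step_reward(correct: bool, tokens_used: int, fair_share: float,
                soft_budget: bool = False,
                beta: float = 0.05, gamma: float = 0.1,
                soft_overspend_penalty: float = 0.25) -> float:
    spend_ratio = tokens_used / fair_share
    r = 1.0 if correct else -0.1                       # correctness: +1 / −0.1
    if correct:
        r += gamma * (1.0 - spend_ratio)               # efficiency bonus, correct steps only
    r -= beta * max(0.0, spend_ratio - 1.0)            # cost penalty past fair share
    if soft_budget:
        r -= soft_overspend_penalty * max(0.0, spend_ratio - 1.0)  # soft-budget overspend
    return r

def episode_bonus(episode_accuracy: float, spent: int, total_budget: int,
                  lambda_ep: float = 0.5, target_utilization: float = 0.9) -> float:
    utilization = max(0.0, 1.0 - abs(spent / total_budget - target_utilization))
    return lambda_ep * episode_accuracy * utilization  # product coupling: both must be good
```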
| Mode | Behavior | Use |
|---|---|---|
| Hard-cap (default) | Per-step spend clipped to remaining_budget; episode terminates early when remaining_budget < min_tokens | Final evaluation, competition scoring |
| Soft-budget | No clipping, no early termination; overspend smoothly penalized | Training curriculum — lets the policy experience the whole episode before discipline is enforced |
Dual modes are not a convenience. Hard-cap's early termination produces zero-advantage groups in GRPO: uniform truncation across all generations → std(r)=0 → zero gradient. Soft-budget bridges that window until the policy learns to finish. This is the same pathology we diagnose in full under Engineering Lessons (Pathology 2).
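A minimal sketch of the two modes' accounting, assuming the behavior described above (per-step clipping and min_tokens early termination under hard cap; neither under soft budget). Function and variable names are ours, not the env's.

```python
# Sketch of hard-cap vs soft-budget accounting (names are illustrative).
def account_step(tokens_requested: int, remaining_budget: int,
                 min_tokens: int = 10, hard_cap: bool = True):
    if hard_cap:
        spend = min(tokens_requested, remaining_budget)    # clip to what's left
        remaining = remaining_budget - spend
        done_early = remaining < min_tokens                # episode terminates early
        return spend, remaining, done_early
    # Soft budget: no clipping, no early termination; overspend is handled in the reward.
    return tokens_requested, remaining_budget - tokens_requested, False
```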
A subtle bug we surfaced and fixed: per-step spend was counted with a live AutoTokenizer, but the episode cap (total_budget) was computed from an abstract config formula in an entirely different unit system. Caps and spends did not share units. The environment now resolves total_budget at reset() in priority order:
| Priority | Condition | Formula | budget_source |
|---|---|---|---|
| 1 | Client passes total_budget | Exact integer | "client" |
| 2 | Client passes tokenizer_name | budget_ratio × Σ tokenize(q_i) over all questions | "tokenizer_native" |
| 2b | Tokenizer load fails | Config formula + warning | "config" |
| 3 | Neither passed | budget_ratio × N × (min_tokens + max_tokens) / 2 | "config" |
Observation metadata returns total_budget and budget_source so the client can verify the path taken. Cap and spend now live in the same policy-token unit system. The fix is exactly the tokenizer-mismatch mitigation described in cross-chat handoff Issue 2b: aligning the env's AutoTokenizer id with the policy tokenizer via --env_tokenizer_name (or the Hub id rewritten into REPT_MODEL_HUB_ID when the checkpoint is a local path).
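The priority table translates roughly into the following resolution logic at reset(). This is a sketch of the behavior described above, with our own function name and the config defaults; it is not the env's source.

```python
from transformers import AutoTokenizer

def resolve_total_budget(questions, total_budget=None, tokenizer_name=None,
                         budget_ratio=2.0, min_tokens=10, max_tokens=800):
    # Priority 1: client-provided exact cap.
    if total_budget is not None:
        return int(total_budget), "client"
    # Priority 2: size the cap in the policy's own token units.
    if tokenizer_name is not None:
        try:
            tok = AutoTokenizer.from_pretrained(tokenizer_name)
            cap = budget_ratio * sum(len(tok.encode(q)) for q in questions)
            return int(cap), "tokenizer_native"
        except Exception:
            pass  # Priority 2b: tokenizer load failed — fall back to the config formula.
    # Priority 3: abstract config formula (the unit-drift risk discussed above).
    cap = budget_ratio * len(questions) * (min_tokens + max_tokens) / 2
    return int(cap), "config"
```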
- Standard OpenEnv surface: EnvClient, Environment, Pydantic Observation / Action. No invented abstractions.
- SUPPORTS_CONCURRENT_SESSIONS = True; max_concurrent_envs = 64 — built to be hammered by DDP ranks.
- All extras (total_budget, budget_source, budget_mode, problem_type, per-step tokenizer_name, grading_response) ride on Observation.metadata and Action.metadata. No new method signatures.
- Grading is extract_boxed_answer + SymPy. Per-step reward numerically bounded; episode bonus is a clamped product. No LLM judge, no circularity.

The reward has two layers. Per-step reward fires every turn and sums into an episode total. The terminal bonus is added only at the final step and couples accuracy with budget utilization through a product. An optional scalar --alpha multiplies the per-step reward before episode accumulation (raw_step_reward * alpha inside EpisodeSession); beta is reserved for future shaping and currently has no effect.
| Component | When | What it rewards |
|---|---|---|
| Correctness | per step | Boxed answer matches ground truth under SymPy equality |
| Efficiency bonus | per step | Right answer on an easy item with few tokens |
| Cost penalty | per step | Linear in tokens spent this turn (full decoded response, not just visible tail) |
| Overspend penalty | per step (soft-budget only) | Smooth penalty for going over target utilization |
| Terminal bonus | last step only | λ_ep × (episode_accuracy × budget_utilization_score) |
Why multiplicative coupling. Sum-of-components rewards reward hacking: the policy can drop accuracy to zero as long as it aces utilization, or vice versa. The product kills both shortcuts: if either factor is zero, the terminal bonus is zero, regardless of how well the other is optimized. The agent has to be right and pace itself — which is the entire learning problem.
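A two-line toy comparison makes the point: under an additive bonus a utilization-only policy still collects reward, while the product zeroes it out.

```python
# Toy numbers: a degenerate policy with zero accuracy but perfect utilization.
acc, util, lambda_ep = 0.0, 1.0, 0.5
additive_bonus = lambda_ep * (acc + util) / 2   # 0.25 — the shortcut still pays
product_bonus = lambda_ep * acc * util          # 0.0  — the shortcut is worthless
```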
The project is two strictly separated packages: ReasoningEconomicsEnv (the OpenEnv environment) and ReasoningEconomicsPT (the GRPO training client). They communicate exclusively over WebSocket — no in-process imports. The PT repo subclasses EnvClient as ReasonBudgetClient with plain dict actions and observations, so it never touches env Pydantic types.
Every code reference in this blog lives in exactly one of the two repos below. When a symbol is mentioned in later sections (Engineering Lessons, Stack Split, Quick Start), this table is the canonical resolution.
| Symbol / file | Repo | Path | Role |
|---|---|---|---|
| ReasonBudgetEnvironment | Env | env/reason_budget_env.py | FastAPI + OpenEnv environment served over WebSocket; one instance per session. |
| ReasonBudgetObservation / ReasonBudgetAction | Env | env/models.py | Pydantic wire types; PT never imports these, only dict shapes. |
| EnvConfig | Env | env/config.py | Episode + budget defaults; overridden by REASON_BUDGET_* env vars. |
| compute_reward / compute_episode_bonus | Env | env/reward.py | Per-step and terminal reward math (Scoring). |
| EpisodeSampler / dataset loaders | Env | env/episode_sampler.py, data/loaders.py | MetaMathQA window (subset_size=5000); Numina wired but disabled. |
| start_openenv_server.sh | Env | scripts/start_openenv_server.sh | Spawns the FastAPI WebSocket on 127.0.0.1:8000. |
| ReasonBudgetClient / EpisodeSession | PT | clients/reason_budget_client.py | Dict-typed EnvClient subclass; context manager holds one WebSocket for the full episode. |
| rollout_func | PT | training/rollout.py | TRL 1.0 multi-turn rollout driver; emits env_reward / env_mask. |
| resolve_env_tokenizer_name | PT | training/tokenizer_sync.py | Aligns env tokenizer with policy tokenizer via REPT_MODEL_HUB_ID. |
| _sync_fsdp2_params_to_vllm | PT | training/weight_sync.py | Weight-sync path to trl vllm-serve under Branch A (FSDP2). |
| REPT_* env vars | PT | scripts/run_grpo_lambda.sh | REPT_MODEL, REPT_NUM_GPUS, REPT_VLLM_MODE, REPT_VLLM_TP, REPT_VLLM_PORT, REPT_MODEL_HUB_ID. |
| REASON_BUDGET_* env vars | Env | env/config.py | REASON_BUDGET_NUM_QUESTIONS, REASON_BUDGET_HARD_CAP_MODE, REASON_BUDGET_BUDGET_RATIO, REASON_BUDGET_TOKENIZER_NAME. |
| reward_logs.jsonl | PT writes, mirrors Env schema | runs/<run_id>/reward_logs.jsonl | Per-step reward audit (step_index, remaining_budget_before, visible_response, raw_step_reward, scaled_step_reward, done_after_step, episode_reward). |
Shorthand used below: Env-side = sharma-yash01/ReasoningEconomicsEnv; PT-side = sharma-yash01/ReasoningEconomicsPT.
flowchart LR
Policy["Policy GPUs 0-5, ZeRO-3, CPU optimizer offload"]
VLLM["vLLM server GPUs 6-7, tensor parallel 2"]
Env["OpenEnv FastAPI WebSocket localhost 8000"]
RewardLog["reward_logs.jsonl"]
Policy -->|rollout_func| VLLM
VLLM -->|generations| Policy
Policy -->|reset, step| Env
Env -->|Observation, reward, done| Policy
Env --> RewardLog
Figure 1. 8×A100 production topology for the headline Qwen3-14B run (Branch B). GPUs 0–5 run the DeepSpeed ZeRO-3 trainer with CPU optimizer offload and Unsloth-integrated LoRA; GPUs 6–7 run trl vllm-serve with tensor_parallel_size=2; the OpenEnv server runs on a separate process listening on ws://127.0.0.1:8000. The Qwen2.5-3B tranche (3B results) runs on Branch A instead — FSDP2 sharding, full fine-tune, no LoRA. Configurable via env vars REASON_BUDGET_NUM_QUESTIONS, REASON_BUDGET_HARD_CAP_MODE, REASON_BUDGET_BUDGET_RATIO, REASON_BUDGET_TOKENIZER_NAME; reproducible launch via start_openenv_server.sh.
Training uses GRPO (Group Relative Policy Optimization) via TRL 1.0.0's rollout_func contract, which gives us explicit control over the generate → parse → step loop. We chose rollout_func over environment_factory specifically to avoid TRL 1.0.0's Qwen3-only add_response_schema allowlist — that path is biased toward Qwen3/Qwen3.5 chat-template parsing, while we want to run Qwen2.5 and other families too.
The critical invariant is one WebSocket per episode. _rollout_one_episode runs inside with EpisodeSession(...) as session:, so reset and every step share the same socket. Per turn: trainer.vllm_generation.generate() → decode → session.apply_response(text, …) → remote step({"response": …}). The function returns prompt_ids, completion_ids, logprobs, env_mask, and env_reward; the reward hook reward_from_env(…, **kwargs) simply reads kwargs["env_reward"]. The env's tokenizer id on reset comes from resolve_env_tokenizer_name: --env_tokenizer_name if set, otherwise the tokenizer's name_or_path, falling back to --model. When the checkpoint is a local NFS path, the launcher saves the Hub id into REPT_MODEL_HUB_ID so the remote env receives an HF-resolvable id.
Hybrid-thinking models need per-family wiring. training/model_profiles.json + training/model_profiles.py provide a ModelProfileRegistry keyed on model id (exact match first, then longest-prefix), supplying chat_template_kwargs (e.g. enable_thinking), output_parser (qwen3_think or null), think-tag delimiters, and grading_use_visible_only. Env-side invariant: budget always counts _count_tokens(action.response) on the full string, while grading uses metadata["grading_response"] (visible tail) when non-empty. Budget stays honest; grading stays robust to think traces.
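A sketch of the lookup order described above (exact id first, then longest prefix). The profile contents here are illustrative examples; the real training/model_profiles.json fields and entries may differ.

```python
# Sketch of ModelProfileRegistry resolution: exact match, then longest prefix.
PROFILES = {
    "Qwen/Qwen3": {"chat_template_kwargs": {"enable_thinking": True},
                   "output_parser": "qwen3_think", "grading_use_visible_only": True},
    "Qwen/Qwen2.5": {"chat_template_kwargs": {}, "output_parser": None,
                     "grading_use_visible_only": False},
}
DEFAULT = {"chat_template_kwargs": {}, "output_parser": None,
           "grading_use_visible_only": False}

def resolve_profile(model_id: str) -> dict:
    if model_id in PROFILES:                                  # exact id match first
        return PROFILES[model_id]
    matches = [p for p in PROFILES if model_id.startswith(p)]
    return PROFILES[max(matches, key=len)] if matches else DEFAULT  # longest prefix, else default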
NCCL padding. In vllm_mode=server with world_size > 1, every trainer.vllm_generation.generate() runs accelerate.gather_object → NCCL all_gather_object → broadcast_object_list. Our rollout is a while not session.done loop, so different ranks make different numbers of generate() calls per episode — permanent collective desync. Fix: a fixed DIST_SERVER_GENERATES_PER_EPISODE = 8 cap, with dummy 1-token generates padding each episode to exactly 8 calls. Dummies are discarded; env_reward, completion_ids, and logprobs are byte-identical to the unpadded case. This is active only in server mode with DDP; colocate + TP=1 does not enter the gather_object path.
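A sketch of the padding idea, with assumed helper names (session, generate, build_prompt): every rank issues exactly eight generate() calls per episode, so the gather/broadcast collectives stay in lockstep even when episodes end at different turns on different ranks.

```python
DIST_SERVER_GENERATES_PER_EPISODE = 8  # fixed per-episode call count

def rollout_with_padding(session, generate, build_prompt):
    calls = 0
    obs = session.reset()
    while not obs["done"] and calls < DIST_SERVER_GENERATES_PER_EPISODE:
        text = generate(build_prompt(obs))                    # real turn
        calls += 1
        obs = session.step({"response": text, "metadata": {}})
    while calls < DIST_SERVER_GENERATES_PER_EPISODE:
        generate("pad")                                       # 1-token dummy; output discarded
        calls += 1
    return obs
```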
We observed three named, reproducible pathologies when shipping GRPO + OpenEnv + vLLM on hybrid-thinking models under a shared-budget MDP: NCCL desync under variable-length rollouts in server mode; truncation-induced zero-advantage collapse when every completion in a GRPO group hits the same clip boundary; and stack non-composition across TRL + vLLM + Unsloth + FSDP. Before the headline 14B run, the dominant learning-signal failure was truncation collapse — WebSocket, padding, and tokenizer alignment could all be green while the policy still received a structurally zero gradient.
Full telemetry tables, evidence links, a log-backed truncation episode, root-cause chains, and structural fixes are documented once in Engineering Lessons (Pathology 1, Pathology 2 including an expandable log excerpt, Pathology 3, then Takeaways). This section stays short so the blog does not narrate the same three failures twice.
Runs are organized by what they contribute: (1) the headline Qwen3-14B 8×A100 completed run, our strongest positive-mean-reward evidence; (2) the Qwen2.5-3B tranche on 1×H100 that validates the pipeline end-to-end; (3) boundary / failure runs that delimit the tractable region of the TRL 1.0 / vLLM stack.
Run 14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe (see Engineering Pathology 3 for topology). 20/20 optimizer steps completed with artifacts saved. First completed multi-question shared-budget GRPO training run on 14B.
| Metric | Value |
|---|---|
| Episodes | 480 |
| Env turns | 1920 |
| Mean episode reward | +0.4692 ± 0.9758 |
| Min / Max episode reward | −0.40 / +4.5205 |
| Accuracy (per-question) | 17.76% |
| Cap-hit rate | 13.54% |
| env_step_error | 0 |
Why true-4q (four questions per episode), not the designed ten. The headline run sets num_questions=4, not 10, because Qwen3-14B + Unsloth LoRA + vLLM tensor-parallel inference + multi-turn thinking-mode rollouts saturated 40 GB A100s at four questions per episode with two generations per GRPO group. Pushing to ten questions per episode triggered OOM in the rollout cache on the same hardware. That reduction is a VRAM ceiling, not a claim that the full MDP is solved; the designed 10-question battery remains the target for future hardware or a smaller policy (see Pathology 3 for why LoRA + ZeRO-3 + CPU offload entered the recipe).
Three things this run shows. (1) The multiplicatively-coupled per-step + terminal reward designed in Scoring can produce positive mean under a shared budget — previously the headline 3B runs were all negative-mean. (2) The max episode reward of +4.52 means at least one episode hit both high accuracy and good utilization; the product is only large when both factors are. (3) env_step_error = 0 over 1920 turns means the OpenEnv WebSocket invariant, tokenizer alignment, and DeepSpeed ZeRO-3 / vLLM TP=2 split all held under a full training run, not just a smoke.
We have not yet run baselines against the headline 14B checkpoint. The per-question accuracy (17.76%) is therefore uncalibrated. It should be read only against the designed performance targets for each baseline in Baselines (planned), reproduced here for the reader who doesn't want to scroll:
| Baseline (planned, not run) | Designed performance target | What a converged RL policy should beat |
|---|---|---|
| uniform-random-split | Mean reward ≈ 0, accuracy distributed by dataset-difficulty mean; utilization unshaped | Lower bound — any structured policy should clear this |
| greedy-first | Accuracy capped at ≤ 1/N (25% for 4q, 10% for 10q) because later questions are starved; utilization poor on the depleted tail | Our policy must show cross-prompt pacing, not just solve Q1 |
| always-same-budget | Accuracy approaches the dataset-difficulty mean at the given total_budget / N; zero utilization shaping because allocation is oblivious to difficulty | Our policy's marginal value comes from difficulty-aware allocation, not just from having a budget at all |
| zero-shot LLM (same Qwen3-14B, no RL) | Measurable ceiling achievable without RL; expected to be strong on per-question accuracy but poor on budget_utilization_score because the base model does not pace | RL contribution = product gain (accuracy × utilization) the base model cannot reach |
17.76% per-question accuracy over the true-4q battery is meaningful only once the zero-shot LLM row above is filled in. Until then it is a raw rate, not a claim about RL uplift. Baseline evaluation against the released checkpoint is the next result in Future Work.
Validates the full pipeline end-to-end. Reward mean improves monotonically as max_completion_length and max_tokens_per_step grow — the interesting signal is the std and the positive tail, not the centered mean.
| Run | Model | Episodes | Key settings | Mean (max) |
|---|---|---|---|---|
| qwen25_3b_best_single_h100 | Qwen2.5-3B-Instruct | 87 | batch=2, gens=2, lr=7e-7, mcl=640, turns=4 | −0.4598 (+1.29) |
| qwen25_3b_better | Qwen2.5-3B-Instruct | 285 | 2 epochs, same shape | −0.3156 (+1.33) |
| qwen25_3b_more_context | Qwen2.5-3B-Instruct | 368 | 2 epochs, mcl=1024, tokens/step=512 | −0.2279 (+2.34) |
| qwen25_3b_best_v2 | Qwen2.5-3B-Instruct | 687 | 4 epochs, mcl=1024, tokens/step=1024, lr=5e-7 | −0.1476 (+2.29) |
Plumbing smokes on Qwen2.5-0.5B confirm the trainer path (loss 0.00889 / 0.006714; wall-clock 1246 s / 1279 s). These are infrastructure validation, not research signal.
Where today's stack breaks. Each row is a training-stack limit, not an env or reward-shape limit.
| Attempt | Setup | Outcome | Blocker |
|---|---|---|---|
| Unsloth 14B, 1×A100-40GB | vllm_mode=colocate, max_model_len=3500, gens=2 | 1/120 steps; grad_norm=NaN; selective_log_softmax RuntimeError | Unsloth truncation × thinking-mode completions (Pathology 2) |
| qwen25_7b_4q_ultralow | 1×H100 colocate, mcl=256 | Repeated CUDA OOM | 7B thinking context doesn't fit colocate headroom |
| Qwen3-8B sharded | 8×A100 FSDP + vLLM server | No stable reward summary | TRL/FSDP weight-sync (Pathology 3) |
| Qwen3-30B-A3B-Instruct-2507 | 8×A100 FSDP + server-mode vLLM | No completed summary | vLLM KV-cache / communicator startup |
| Qwen2.5-32B-Instruct | 8×A100 FSDP + server-mode vLLM | No completed summary | Prompt length + vLLM startup + NCCL init |
| Early Qwen2.5-14B smoke | 8×A100 FSDP + server | 2 episodes (reward −0.07, −0.30) | Superseded by the headline ZeRO-3 run |
No baselines have been computed against the headline 14B checkpoint yet. We planned four, in the spirit of LotteryElicitationEnv's baseline set, each isolating one axis of the allocation decision.
| Baseline | Policy | What it isolates |
|---|---|---|
| always-same-budget | Allocate total_budget / N per question; greedy decode within each per-question cap | Difficulty-awareness gain (is it just the total budget that matters, or how it's split?) |
| greedy-first | Spend up to cap on Q1, Q2, … until budget runs out; truncate remainder | Pacing cost (what's lost by no-foresight allocation?) |
| uniform-random-split | Dirichlet-sampled per-question allocations at the same total budget | Lower bound — beating it proves the policy learned anything structured |
| zero-shot LLM | Same base model (Qwen3-14B) with no RL, greedy decode against the same battery at the same budget | RL contribution — measurable upper bound achievable without learning |
Baseline performance targets. By construction we expect uniform-random-split → mean reward ≈ 0 (difficult to ace utilization by accident); greedy-first → accuracy capped at ≤ 1/N because later questions are starved; always-same-budget → mean reward approaches the env-difficulty mean with zero utilization shaping; zero-shot LLM is the measurable ceiling without RL shaping. A converged RL policy should beat all four on the product (accuracy × utilization), not any single factor.
We have: a completed 14B training run with positive mean reward (+0.4692), a validated 3B tranche end-to-end, a mapped set of stack boundaries, and three named pathologies (NCCL desync, truncation collapse, stack non-composition) with reproducible signatures and structural fixes. We do not yet have: baselines evaluated against the headline checkpoint, or cross-family model comparisons. Those are Future Work.
Shipping a real GRPO + OpenEnv + vLLM pipeline on a multi-turn verifiable-reward environment surfaced three major pathologies. Each one is a named learning-signal failure with a reproducible signature and a structural fix. We document them so the next OpenEnv submission can avoid the same dead ends.
Signature. all_gather_object / broadcast_object_list hang on the second optimizer step; heartbeat timeout at ~7200 s; all ranks report env_step_error=0 and ConnectionClosedError=0 on the rollout but block at the post-rollout barrier. NCCL flight recorder shows mismatched last enqueued / last completed sequence numbers across ranks — structural, not configurational.
Root cause. In vllm_mode=server with world_size > 1, every trainer.vllm_generation.generate() runs accelerate.gather_object → NCCL all_gather_object → broadcast_object_list. Our rollout is a while not session.done loop — different ranks make different numbers of generate() calls per episode. NCCL is sequence-numbered; different call counts per rank → permanent desync → eventual _pickle.UnpicklingError or a BROADCAST timeout. A secondary trigger: stale __pycache__ on NFS kept loading the pre-fix .pyc, reintroducing the desync after git pull reported the repo clean.
Evidence. Full write-ups in impl-context/dist-train-desync-issue.md and impl-context/dist-train-issue-hung-gpu.md.
Fix stack.
find $REPT_ROOT -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null
find $REPT_VENV -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null
PYTHONDONTWRITEBYTECODE=1 torchrun --nproc_per_node=7 ...
- generate() padding per episode. Every rank performs exactly DIST_SERVER_GENERATES_PER_EPISODE = 8 calls — real generates for active turns, 1-token dummies (discarded) for the rest. Active only when vllm_mode == "server" and world_size > 1. Reward, logprobs, and credit assignment are byte-identical to the unpadded case.
- Validated by the completed run 14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe.

Why this matters beyond our repo. Any TRL rollout_func user running variable-length rollouts in server mode has this bug latent. Heartbeat and batch tuning only hide it longer.
Signature. From the most recent Lambda run (terminal 36, Qwen3-4B + Unsloth, step 1/120):
Unsloth: Input IDs of shape torch.Size([2, 12986]) with length 12986
> the model's max sequence length of 3500.
We shall truncate it ourselves.
RuntimeError: Size does not match at dimension 1
expected index [2, 12760, 1] to be no larger than
self [2, 3499, 151936] apart from dimension 2 # in selective_log_softmax
Upstream telemetry in the same step: completions/clipped_ratio = 1.0, mean_terminated_length = 0, frac_reward_zero_std = 0.375, importance_sampling_ratio/mean = 0.072, grad_norm = NaN. Every completion is clipped, zero episodes terminate with </think> + \boxed{}, and 37% of GRPO groups have reward_std = 0 → advantages collapse to a near-constant → gradient is noise → NaN.
Root cause. Unsloth silently patches the policy tokenizer to max_seq_length = 3500 even when vLLM is configured at 32768. GRPO computes completion logits against the truncated 3499-token sequence, then tries to gather at the untruncated 12760 completion indices — the shapes disagree and the forward breaks. More broadly: when truncation is uniform across the GRPO group, every completion lands at the same wrong answer → std(r) = 0 → A_i = 0 → zero gradient. The policy also learns to emit filler (repeated !) because that is what survives the hard cap with the least cost penalty.
Fix. Two changes together, not either alone. (1) Keep Unsloth, but only inside the DeepSpeed ZeRO-3 branch we document in Pathology 3 — in our runs, ZeRO-3's all-gather window empirically aligned with Unsloth's forward where FSDP's per-module summon_full_params did not, removing the selective_log_softmax shape mismatch we hit on FSDP. (2) Independently, enforce vllm_max_model_length = tokenizer.model_max_length = actual episode budget at startup, validated via an assertion, and raise max_completion_length to give the closing </think> and \boxed{} room. Start in soft-budget mode; warm into hard-cap. Increase num_generations to break group-level uniformity. LR tuning is the wrong instinct — the gradient is structurally zero, not noisy. Branch A (FSDP, full fine-tune) sidesteps the Unsloth composition entirely and does not hit this pathology.
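Fix (2) can be captured as a startup assertion. The function below is our own sketch of the invariant, not code from the repo; it checks that neither the tokenizer nor vLLM caps sequences below the episode budget (the text above sets them exactly equal).

```python
def assert_length_alignment(tokenizer, vllm_max_model_len: int, episode_budget: int) -> None:
    # If these disagree, GRPO gathers completion logits at indices the truncated
    # forward pass never produced — the Pathology 2 shape mismatch.
    assert tokenizer.model_max_length >= episode_budget, (
        f"policy tokenizer caps sequences at {tokenizer.model_max_length} "
        f"< episode budget {episode_budget}")
    assert vllm_max_model_len >= episode_budget, (
        f"vLLM max_model_len {vllm_max_model_len} < episode budget {episode_budget}")
```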
Why this matters beyond our repo. Any GRPO run on a hybrid-thinking model under a budget-constrained MDP with a partially-right-but-cheap shortcut has this bug latent. The clipped_ratio = 1 + reward_std ≈ 0 pair is the fingerprint.
| Signal | Observed value | What it means |
|---|---|---|
| reward | ≈ −0.1 to −0.47 (flat) | Uniform negative reward across GRPO groups |
| completions/clipped_ratio | 1.0 | Every completion hits max_new_tokens |
| completions/mean_terminated_length | 0 | Nothing terminates naturally — no </think>, no \boxed{} |
| frac_reward_zero_std | 0.375 – 1.0 | Partial to full GRPO group collapse |
| importance_sampling_ratio/mean | ≈ 0.07 | IS ratio collapsing under policy drift |
| entropy, grad_norm | low / NaN | Near-zero (or explicitly broken) gradient signal |
Numbers above are drawn from the most recent Lambda train.log (Qwen3-4B, single A100-40GB, Unsloth LoRA, step 1/120) and reconciled against earlier 2×H100 Qwen3-4B trainer metrics. The headline Qwen3-14B 8×A100 run in Results resolved this signature — it is what a stack that doesn't truncate looks like.
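The fingerprint is cheap to check programmatically from logged trainer metrics. The helper below is a sketch using the metric keys from the table; the thresholds are the values we observed, not universal constants.

```python
def truncation_collapse_fingerprint(metrics: dict) -> bool:
    # Pathology 2 signature: everything clipped, nothing terminates, groups collapse.
    return (
        metrics.get("completions/clipped_ratio", 0.0) >= 0.99
        and metrics.get("completions/mean_terminated_length", 1.0) == 0
        and metrics.get("frac_reward_zero_std", 0.0) >= 0.3
    )
```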
Training emits a fixed reward_logs.jsonl schema on every step (step_index, question, remaining_budget_before, visible_response, raw_step_reward, scaled_step_reward, done_after_step, episode_reward). The sanitized excerpt below — from the single-GPU Unsloth + Qwen3-4B stack — shows episode-level telemetry when truncation collapse is in full effect.
- Step with remaining_budget_before = 1256; question (excerpt): "…(a,b). The answer is 14. What is the value of unknown variable X?"; visible_response: "!!!!!!!!!!!! … (truncated) !!!!!!!!!!!!"; reasoning_trace: "" (empty); raw_step_reward: −0.11; done_after_step: false. compute_reward breakdown: correctness = −0.1 (wrong answer) plus a small cost_penalty from mild overspend past fair_share (≈ −0.01); efficiency_bonus = 0 because the step is incorrect.
- Later step with remaining_budget_before = 451.0; visible_response: same repeated-! pattern; was_correct = false; raw_step_reward: −0.1. Breakdown: correctness = −0.1 alone; cost_penalty ≈ 0 (no meaningful overspend past fair_share on this step).
- Episode end: final_observation.accuracy_so_far: 0.0; episode_reward: −1.0174. Breakdown: correctness = −0.1 per step (≈ −1.0 summed), plus small per-step cost_penalty terms where spend_ratio > 1; the terminal term λ_ep · episode_accuracy · budget_utilization_score is zero because episode_accuracy = 0. The logged −1.0174 is therefore dominated by wrong-answer penalties, not "ten pure cost penalties." Trainer-side telemetry at the same step: completions/clipped_ratio = 1.0, mean_terminated_length = 0, frac_reward_zero_std = 0.375, grad_norm = NaN. Step 2 never runs.
Caption: truncation-collapsed Unsloth episode on 1×A100, episode_reward = −1.0174. Grader behaves correctly; the failure is upstream in the stack.
- max_completion_length = 4096 default + Qwen3 thinking → the <think>…</think> span alone consumes 1500–4000 tokens.
- The completion is cut off before the closing </think> → no \boxed{…} answer.
- With grading_use_visible_only = True, the empty visible tail grades as incorrect; in the latest Lambda run the visible tail is a degenerate string of ! characters (no \boxed{}, reasoning_trace empty).
- Uniform truncation across the GRPO group → std(r) = 0 → A_i = 0 → zero gradient.
- The importance-sampling ratio collapses (mean ≈ 0.07, min = 0) — noise on top of zero signal.

The most recent Lambda run (2026-04-14) adds a second failure mode on top: with Unsloth enabled and max_model_len = 3500, the accumulated prompt + completion reached 12 986 tokens. Unsloth warned "We shall truncate it ourselves" and the trainer then crashed inside _get_per_token_logps_and_entropies → selective_log_softmax with RuntimeError: Size does not match at dimension 1 expected index [2, 12760, 1] to be no larger than self [2, 3499, 151936] apart from dimension 2. Truncated indices and full-length logits disagreed, the forward broke, and grad_norm became NaN. The fix is the same as the root truncation chain above: raise max_completion_length (and the vLLM max_model_len) to match what thinking-mode actually emits, or disable thinking mode outright. LR tuning cannot fix this — the gradient is structurally zero (or undefined).
This failure mode is general, not specific to ReasoningEconomicsEnv. Any GRPO run on a hybrid-thinking model under a budget-constrained multi-turn MDP with a partially-right-but-cheap shortcut has this bug latent. We believe every future sequential-budget RL run on reasoning models needs to start from this diagnosis.
Signature. Unsloth for GRPO is not yet implemented! Just ignore this function. (terminal 36); FSDP2 shard-gather blocking rollout_func; vLLM colocate stealing VRAM from the policy on 40 GB A100s; repeated CUDA OOM at 7B on single H100 (qwen25_7b_4q_ultralow); Qwen3-30B-A3B and Qwen2.5-32B failures at vLLM KV-cache / communicator startup.
Root cause. TRL 1.0.0 + vLLM (colocate) + Unsloth + FSDP do not compose as a four-way intersection. Each pairwise composition has a known sharp edge (TRL PR #3582 FSDP weight-sync, Unsloth's FastLanguageModel.get_peft_model × FSDP, FSDP1 _is_root assertion under TRL's summon_full_params per child module, GuidedDecodingParams movement across vLLM versions). Specifically, Unsloth + FSDP does not work in this stack — FSDP's parameter sharding and Unsloth's fused kernels disagree on tensor shapes during the GRPO log-prob forward pass.
Resolution — two branches, chosen by model scale and LoRA need.
| Branch | Sharding / optimizer | LoRA | Where it's used | What it unlocks |
|---|---|---|---|---|
| Branch A | FSDP2 (or FSDP1) via model-sharding-fsdp2.yaml / model-sharding.yaml, no CPU optimizer offload | None — full fine-tune | Multi-GPU server-mode vLLM; Qwen2.5-3B / Qwen3-4B tranches | Cleanest weight-sync to trl vllm-serve (_sync_fsdp2_params_to_vllm); no Unsloth interaction risk. |
| Branch B | DeepSpeed ZeRO-3 (stage 3 parameter + optimizer sharding) with CPU optimizer offload | Unsloth-integrated LoRA (4-bit QLoRA, Unsloth fused kernels) | Headline Qwen3-14B 8×A100 run (14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe) | Our finding: this is the only configuration we found in which TRL 1.0 + GRPO + Unsloth completes end-to-end at 14B under 40 GB A100s with trainable adapters. ZeRO-3's all-gather window empirically aligned with Unsloth's forward in our runs, where FSDP2's per-module summon_full_params did not. Neither pairing has an upstream-blessed recipe; we report it as an engineering contribution, not a library guarantee. |
Shared infra across both branches.
- Server-mode vLLM: trl vllm-serve pinned to REPT_VLLM_TP GPUs (e.g. GPUs 6–7 with TP=2 on the 14B headline); the policy trains on the remaining ranks. No colocate stealing VRAM from the policy.
- OpenEnv server: launched by start_openenv_server.sh, WebSocket on 127.0.0.1:8000, configurable via REASON_BUDGET_NUM_QUESTIONS, REASON_BUDGET_HARD_CAP_MODE, REASON_BUDGET_BUDGET_RATIO, REASON_BUDGET_TOKENIZER_NAME. Same env binary across every run; only env vars change.
- Ports: trl vllm-serve on 8001 (REPT_VLLM_PORT), NCCL rendezvous on 51216, Accelerate on 29500.
- Pinned dependencies: trl==1.0.0, vllm==0.10.2, transformers>=5.2,<5.4, torch==2.8.* (requirements.lambda.txt / requirements.carc-cu121.txt). The branches differ only in sharding recipe and LoRA presence.
Per-WebSocket invariant (prerequisite for both branches). The pathologies above rest on one OpenEnv invariant: one Environment instance per WebSocket session. EpisodeSession is a context manager held for the full multi-turn episode. Violating it — e.g. opening client.sync() inside reset/step — silently collapses reward to zero (episodes come back with num_steps=1, done_after_step=true, empty final_observation). Tokenizer id alignment between env and policy, via resolve_env_tokenizer_name and REPT_MODEL_HUB_ID, is the second half of that invariant.
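The invariant in code form — a sketch assuming the EpisodeSession constructor and policy callable shown here; the import path is inferred from the two-repo cheat sheet and the exact arguments in clients/reason_budget_client.py may differ.

```python
from reasoning_economics_pt.clients.reason_budget_client import EpisodeSession  # path inferred

# Correct: one WebSocket held open for the entire multi-turn episode.
def run_one_episode(policy, env_base_url: str) -> float:
    total = 0.0
    with EpisodeSession(env_base_url) as session:        # context manager = one socket
        obs = session.reset()
        while not obs["done"]:
            obs = session.step({"response": policy(obs), "metadata": {}})
            total += obs.get("reward") or 0.0
    return total

# Wrong: constructing a fresh client inside reset/step — the two calls land on
# different server-side Environment instances and the episode silently collapses
# to num_steps=1 with an empty final_observation.
```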
The engineering story of this submission reduces to three lessons. Each maps to one of the three pathologies above; together they cover every material failure we encountered, and every subsequent design decision in Environment Design, Scoring, and Architecture follows from one of them.
- Pathology 1: in server-mode vLLM with DDP, variable-length rollouts desynchronize NCCL collectives. Pad every rank to a fixed number of generate() calls per episode and kill stale __pycache__ — heartbeat and batch tuning only hide the hang.
- Pathology 2: clipped_ratio = 1 + reward_std ≈ 0 ⇒ zero gradient. No LR tune can fix that; fix it with max_completion_length, soft-budget warmup, and num_generations.
- Pathology 3: TRL 1.0 + vLLM + Unsloth + FSDP do not compose as a four-way stack. Pick a branch up front — FSDP2 full fine-tune, or DeepSpeed ZeRO-3 + Unsloth LoRA — rather than debugging the intersection.

These three lessons are the only things from this blog worth copying into the next OpenEnv + TRL 1.0 + multi-turn submission before writing a single line of env code.
quadrantChart
title LLMs + Reasoning Economics · Budget Scope vs Horizon
x-axis "Per-query budget" --> "Shared, session-level budget"
y-axis "Single-turn / inference-time" --> "Sequential multi-turn MDP"
quadrant-1 "Our target"
quadrant-2 "Multi-turn, no shared budget"
quadrant-3 "Most prior work"
quadrant-4 "Emerging"
"Token-Budget / CCoT": [0.12, 0.12]
"Chain-of-Draft (Xu 2024)": [0.18, 0.2]
"L1 / LCPO / O1-Pruner": [0.25, 0.25]
"Kimi K1.5 Long2Short": [0.3, 0.3]
"SelfBudgeter": [0.4, 0.3]
"CoT-Valve / TokenSkip": [0.15, 0.18]
"Budget Forcing / s1": [0.22, 0.32]
"Dynasor-CoT": [0.2, 0.15]
"ReasoningEconomicsEnv": [0.85, 0.88]
Figure 2. Positioning relative to prior reasoning-economics work. Our contribution occupies the shared-budget, sequential multi-turn quadrant; every prior system compresses within a single query.
If the quadrant chart fails to render (Mermaid quadrantChart is marked experimental), the intent is: horizontal axis runs from per-query budget (left) to shared session-level budget (right); vertical axis runs from single-turn inference-time methods (bottom) to sequential multi-turn MDPs (top). Prior work clusters in the bottom-left; ReasoningEconomicsEnv is the top-right.
The vertical axis is kept as a sequential multi-turn MDP deliberately: our problem is an online-decision problem, not an offline length-control problem. Every prior family in the bottom-left picks a reasoning length once per prompt, in isolation — prompt-guided caps, RL length rewards (including Kimi K1.5 Long2Short), SFT/distillation on shorter traces, dynamic early exit. That is an offline decision in the RL-theory sense: the policy never sees the consequence of its earlier spend affecting what's available for later prompts.
Under a shared session-level budget, the policy must revise pacing after every single observation. The remaining-budget state is non-stationary by construction: every answer shrinks the feasible set of future allocations, and a wrong call on Q1 can starve Q10 irreversibly. This is exactly the online-learning setting — a stream of observations with no resets between decisions, where the cost of a bad action compounds across the episode rather than being absorbed by an independent prompt.
Three concrete consequences of the online framing:
accuracy × utilization ties every intermediate allocation to a single episode-level outcome; the agent cannot learn local policies in isolation (see Scoring).Everything else in the bottom-left quadrant can be served by a length-annotated fine-tune or a decoding-time heuristic. The top-right quadrant — our target — cannot; it needs online sequential learning under a shared budget, which is the setting ReasoningEconomicsEnv is built to expose.
| Foundation | Role in this project | Citation |
|---|---|---|
| GRPO | Critic-free RL objective with group-relative advantages; ideal for terminal-only and sparse verifiable rewards | Shao et al., arXiv:2402.03300 (DeepSeekMath) |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment, per-session state, concurrent sessions | HF Blog: Introducing OpenEnv |
| TRL 1.0 + rollout_func | Explicit multi-turn stepping; env_reward / env_mask contract; avoids add_response_schema Qwen3/3.5 allowlist | TRL × OpenEnv docs |
| vLLM | High-throughput inference; colocate and server modes; trl vllm-serve weight sync | vllm-project/vllm |
| Kimi K1.5 / Long2Short | State-of-the-art per-query length RL; the strongest "compress within a single trace" baseline we compare against | Moonshot AI, arXiv:2501.12599 (2025) |
| DeepSpeed ZeRO-3 | Stage-3 parameter + optimizer sharding with CPU offload; the sharding backbone of Branch B (14B + Unsloth LoRA) | Rajbhandari et al., arXiv:1910.02054 |
| Unsloth | Fused kernels for QLoRA training; we integrated it only on Branch B (ZeRO-3). Composition with TRL GRPO is an empirical finding in our stack, not a documented upstream pairing. | unslothai/unsloth |
| FSDP2 | Per-parameter fully-sharded data parallel; the sharding backbone of Branch A (full fine-tune, no LoRA) | PyTorch, docs |
| MetaMathQA (active) / NuminaMath-TIR (planned) | Source dataset for episode question sampling — public, verifiable, SymPy-gradable. Current runs draw from the first 5 000 rows of MetaMathQA; NuminaMath-TIR is wired but disabled (Future Work). | meta-math/MetaMathQA, AI-MO/NuminaMath-TIR on Hugging Face |
| LotteryElicitationEnv/PT | Sibling project — structural template for two-repo split, rollout_func, DDP padding | Same monorepo |
Supported quick-start path: single-GPU colocate. Multi-GPU paths (FSDP full fine-tune, and DeepSpeed ZeRO-3 + Unsloth LoRA for 14B) are described in Engineering Pathology 3; this section sticks to the simplest working configuration.
# 1. Environment (HF Space or local Docker)
# Local Docker is the most robust; point ENV_BASE_URL at http://127.0.0.1:8000.
# If you prefer the HF Space, use its direct host (https://<owner>-<space>.hf.space), not the hf.co/spaces page.
export ENV_BASE_URL="http://127.0.0.1:8000"
# 2. Training client (ReasoningEconomicsPT) — single-GPU colocate only
export REPT_ROOT="$PWD"
export REPT_VENV="$PWD/.venv"
export REPT_MODEL="Qwen/Qwen3-4B"
export CUDA_VISIBLE_DEVICES=0
export REPT_NUM_GPUS=1
export REPT_VLLM_MODE=colocate # vLLM and policy share one GPU
export REPT_VLLM_TP=1
bash scripts/bootstrap_lambda.sh
bash scripts/preflight_lambda.sh
bash scripts/run_grpo_lambda.sh --dry-run
bash scripts/run_grpo_lambda.sh
All episodes are seeded. Grading is deterministic (extract_boxed_answer with last-match semantics + SymPy equality). Budget resolution is fully specified by the four-priority table in Environment Design, with budget_source returned in observation metadata for audit.
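For concreteness, a sketch of the grading path as described (last \boxed{} wins, then SymPy equality). This is an illustrative re-implementation, not env/reward.py; in particular the regex does not handle nested braces and the SymPy parse of LaTeX-style answers is simplified.

```python
import re
import sympy

def extract_boxed_answer(text: str):
    # Last-match semantics: the final \boxed{...} in the response is the one graded.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)   # nested braces not handled here
    return matches[-1] if matches else None

def grade(response: str, ground_truth: str) -> bool:
    pred = extract_boxed_answer(response)
    if pred is None:
        return False
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(ground_truth)) == 0
    except (sympy.SympifyError, TypeError, SyntaxError):
        return pred.strip() == ground_truth.strip()       # fall back to exact string match
```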
Dependency pin: trl==1.0.0 + vllm==0.10.2 + transformers>=5.2,<5.4 + torch==2.8.* (see requirements.lambda.txt / requirements.carc-cu121.txt). Multi-GPU variants and the two engineering branches (FSDP / DeepSpeed ZeRO-3) are covered in Engineering Pathology 3.
Mirrors the LotteryElicitationEnv baseline CLI pattern; evaluates the trained policy against the four planned baselines in Baselines.
python -m reasoning_economics_pt.eval.evaluate \
--policy hf --model ./outputs/ckpt-last \
--episodes 200 \
--baselines always_same_budget,greedy_first,uniform_random_split,zero_shot_llm \
--num_questions 4 --budget_ratio 0.8 --hard_cap_mode strict
The harness reports reward_mean, accuracy_mean, budget_utilization_clamped, overspend_tokens, average tokens per question, and questions completed, per baseline and per policy. Episodes are seeded identically across baselines so allocation deltas are directly comparable.
- Scaling the training recipe: raise max_completion_length to 8192, cap thinking via max_thinking_tokens, start in soft-budget mode and warm into hard-cap, monitor the IS ratio, add a small partial-credit term for emitting </think>. The pathology in Training Pathology tells us exactly where to intervene.
- Grading today is math-only: \boxed{} + SymPy equality. The MDP generalizes to any verifiable-answer domain, but we have not validated that claim empirically.
- The NCCL desync fix generalizes: any TRL rollout_func user on variable-length rollouts + server mode has it latent.

ReasoningEconomicsEnv reframes the reasoning-economics question from how short can this answer be? to how should I spend what I have left? A stateless grader, a tokenizer-native budget accountant, and a multiplicatively-coupled per-step-plus-terminal reward give us a sequential MDP where every component is auditable and every dollar of compute is accounted for in the unit system the policy actually sees.
The engineering contribution is summarized in one place: Takeaways. We do not re-enumerate it here.
The research question remains open: can a GRPO-trained LLM learn to pace its own reasoning across a shared-budget episode? The headline Qwen3-14B 8×A100 run — 480 episodes, mean reward +0.4692, max +4.52, env_step_error=0 — is the first evidence we have that the answer is yes, subject to baselines landing against the same checkpoint (designed targets in Baselines). Baseline runs against the released 14B checkpoint and the planned NuminaMath-TIR channel are the next results on the roadmap.