OpenEnv · AgentX OpenEnv Track

ReasoningEconomicsEnv

An OpenEnv benchmark where LLMs learn to budget their own thinking across a shared-budget episode.

Qwen3-14B · 8×A100 · ZeRO-3 + Unsloth LoRA + vLLM TP=2 · Mean reward +0.47 over 480 episodes · OpenEnv · GRPO (TRL 1.0) · Budget modes

Motivation: reasoning LLMs do not allocate tokens; they spend them

Frontier reasoning models — DeepSeek-R1, QwQ, Qwen3 thinking mode, the o-series — over-spend tokens on easy items and under-spend on hard ones. Chain-of-thought length is only weakly correlated with ground-truth difficulty: trivial arithmetic consumes thousands of thinking tokens, and genuinely hard items get truncated before a \boxed{}. The finding is repeated across Han et al.'s Token-Budget-Aware LLM Reasoning (arXiv:2412.18547), Xu et al.'s Chain-of-Draft (arXiv:2502.18600), and Moonshot AI's Kimi K1.5 Long2Short ablations (arXiv:2501.12599).

Inference tokens are not a per-query resource. In any real deployment — an exam battery, an eval suite, a multi-turn tool loop, a long-horizon agent — they are a shared, capped resource across a sequence of prompts. Misallocation is not just slow; it is lost accuracy per dollar. A deployed reasoner has to infer difficulty from text alone, decide what a prompt is worth given what's left, conserve on easy items so it can invest on hard ones, and pace itself under irrecoverable depletion.

What existing work does not solve — the Long2Short delta

The four families in Prior Work each cover one axis of reasoning-length control, and each leaves the same axis empty.

| Family | What it optimizes | Axis still empty |
|---|---|---|
| Prompt-Guided (Token-Budget, Chain-of-Draft, CCoT, Token Complexity) | Shorten a single chain via prompting | No cross-prompt budget; no learning |
| RL with Length Reward (L1/LCPO, O1-Pruner, Kimi K1.5 Long2Short, DAST, SelfBudgeter) | RL-trained per-response length control. Long2Short distills a long-reasoning teacher into a shorter policy; the reward is conditioned on one chain's length. | The policy cannot trade tokens from Q1 to Q7 — it is never shown a shared budget |
| SFT / Distillation (CoT-Valve, TokenSkip, Z1) | Bake shorter reasoning into the weights | Still per-prompt; no episode state |
| Dynamic Early Exit (Dynasor-CoT, Budget Forcing / s1, DEER) | Decoding-time termination within one prompt | The policy has no knowledge that another prompt downstream will also need tokens |

The delta in one sentence. Kimi K1.5 Long2Short asks "when should I stop this chain?"; ReasoningEconomicsEnv asks "how should I split my budget across these N chains?" — a different action space (portfolio over prompts) under a different reward surface (joint accuracy × utilization across a battery). Long2Short has no notion of a shared episode budget; it cannot express the trade "save 400 tokens on Q1 so I have them on Q4" because Q1 and Q4 never share state in its MDP.

Long2Short is the offline, single-chain limit of this MDP (N=1, no shared budget); a numerical comparison would require degenerating our env to N=1, at which point the two methods become equivalent by construction. We therefore frame Long2Short as a special case of our formulation rather than a competing baseline.

What ReasoningEconomicsEnv is

ReasoningEconomicsEnv is an OpenEnv-native RL environment where the LLM is both the budget allocator (by choosing how long to think) and the solver (by producing the answer). The environment is a stateless grader and budget accountant. Over a multi-turn episode, the agent learns meta-reasoning: when to think long, when to think cheaply, how to trade correctness against compute under a single shared cap — with no difficulty labels.

To our knowledge, this is the first sequential multi-turn MDP that jointly incentivizes reasoning-trace reduction and answer accuracy under a shared, session-level budget. Every prior family — prompt-guided caps, RL with length rewards including Kimi K1.5 Long2Short, SFT/distillation on compressed traces, dynamic early exit — optimizes compression within a single query. None learn pacing across a sequence of queries.

Why a shared budget changes the problem

Per-query budgeting is a local optimization. Shared-budget reasoning is a sequential resource-allocation problem with partial observability over future question difficulty.

| Axis | Prior work (R1 / QwQ / Xu 2025 / Kimi K1.5) | ReasoningEconomicsEnv |
|---|---|---|
| Budget scope | Per-query, isolated | Shared across N questions |
| Difficulty signal | Explicit label or classifier | Inferred from text only |
| Horizon | Single step | Sequential (N steps / episode) |
| Pacing pressure | None | Irrecoverable depletion |
| Training cost | Live API per rollout | Grader-only env (CPU) + local vLLM |
| Decision learned | How short can this answer be? | How should I spend what I have left? |

The failure modes we want to surface are distinctly sequential:

| Failure | What goes wrong |
|---|---|
| Over-invest early | Budget gone before the last (possibly hard) question arrives |
| Over-conserve | Easy questions answered; hard questions starved; cap under-used |
| Fixed pacing | Uniform spend ignores difficulty variance across items |
| Thinking-mode blowup | `<think>…</think>` runs past max_completion_length; answer truncated, grading returns zero |
| Unit drift | Budget cap and spend tallied in different tokenizers — phantom budget |

Every row in that table is a real failure mode we hit and diagnosed end-to-end (see Engineering Lessons and Training Runs).

Prior work & novelty

The reasoning-economics literature falls into four families. All four optimize per-prompt reasoning length; none expose a shared cross-prompt token budget. ReasoningEconomicsEnv is the missing fifth regime — portfolio allocation under a joint budget.

Prompt-Guided

Inference-time prompting asks the model to self-regulate. No training signal, per-query scope.

| Method | Mechanism | Link |
|---|---|---|
| Token-Budget (Han et al., 2024) | LLM self-estimates a token budget per query and embeds it in the prompt to constrain CoT length; reports ~68% token reduction with minimal accuracy loss | arXiv:2412.18547 |
| Chain-of-Draft (Xu et al., 2025) | Prompts the model to write ≤5 words per reasoning step; matches CoT accuracy at ~7.6% of the tokens | arXiv:2502.18600 |
| CCoT (Renze & Guven, 2024) | Appends "be concise" to CoT prompts; reduces length with a minor accuracy penalty on weaker models | arXiv:2401.05618 |
| Token Complexity (Lee et al., 2025) | Benchmarks compression prompts (word limits, bullets, abbreviations); finds LLMs natively adjust length to difficulty even without sophisticated prompting | arXiv:2503.01141 |

RL with Length Reward

Training-time methods that shape a reward around reasoning length on single-prompt responses.

| Method | Mechanism | Link |
|---|---|---|
| L1 / LCPO (Aggarwal & Welleck, 2025) | GRPO with a length-penalty reward; controls reasoning length via a "Think for N tokens" prompt prefix | arXiv:2503.04697 |
| O1-Pruner (Luo et al., 2025) | PPO with a reward that penalizes token usage relative to a target length; applied to Marco-o1 and QwQ | arXiv:2501.12570 |
| Kimi K1.5 / Long2Short (Moonshot AI, 2025) | Length-conditioned RL distillation of a long-reasoning teacher into a shorter policy. The paper ReasoningEconomicsEnv most directly contrasts with — Long2Short shortens one chain, we allocate across many. | arXiv:2501.12599 |
| DAST (Shu et al., 2025) | SimPO-based preference optimization on constructed short/long preference pairs | arXiv:2503.04472 |
| SelfBudgeter (Li et al., 2025) | Model prepends a self-predicted token budget before reasoning and is trained to respect it | arXiv:2505.11274 |

SFT / Distillation

Supervised fine-tuning on shortened CoT traces. No RL, no budget state.

| Method | Mechanism | Link |
|---|---|---|
| CoT-Valve (Ma et al., 2025) | Single model trained on CoT of varying lengths; inference-time "valve" parameter controls reasoning depth | arXiv:2502.09601 |
| TokenSkip (Xia et al., 2025) | Compresses existing CoT by skipping non-essential tokens, then fine-tunes on the compressed traces | arXiv:2502.12067 |
| Z1 (Zhang et al., 2025) | SFT on compressed-thought data that shortens each reasoning step | arXiv:2504.00810 |

Dynamic Early Exit

Decoding-time heuristics that terminate a single chain early. The policy has no knowledge of downstream prompts.

| Method | Mechanism | Link |
|---|---|---|
| Dynasor-CoT (Fu et al., 2025) | Probes intermediate answers at fixed intervals; terminates when consecutive answers agree | arXiv:2412.20993 |
| Budget Forcing / s1 (Muennighoff et al., 2025) | Forces end-of-thinking + "Final Answer:" at the max token budget; simple and strong baseline | arXiv:2501.19393 |
| DEER (Yang et al., 2025) | Detects reflection signals (e.g., "Wait,", "Let me check") in the output as dynamic exit points and terminates reasoning if deemed sufficient | arXiv:2504.15895 |

Framework we inherit

| Component | What we inherit | Link |
|---|---|---|
| OpenEnv | Gym-style reset/step over WebSocket; HF Space deployment; per-session state; concurrent sessions | HF Blog: Introducing OpenEnv |
Every method above optimizes "how long should this chain be". ReasoningEconomicsEnv optimizes "how should I split my budget across these N chains".
A different action space (portfolio over prompts) and a different reward surface (joint accuracy × utilization across a battery). To our knowledge, this is the first OpenEnv-native RL environment — and the first sequential MDP of any kind — where a single reward function jointly incentivizes reasoning-trace reduction and answer accuracy across a multi-turn, shared-budget episode.

Scope caveat. The novelty is the MDP, reward coupling, and budget accounting. The RL method (GRPO on verifiable math) is shared with the DeepSeekMath / Kimi K1.5 lineages; we reuse those techniques rather than propose new ones.

How an episode works

A stateless grader plus a budget accountant, served over OpenEnv's WebSocket protocol. The LLM is the policy. The reward is verifiable. The MDP is multi-turn. Nothing else is invented.

Each episode samples 10 math questions (configurable, num_questions) from meta-math/MetaMathQA — keyed by type (GSM_SV, MATH_FOBAR, …) and drawn from the first 5 000 rows of the dataset (subset_start_idx=0, subset_size=5000) so every run samples from the same fixed window. The agent receives one question at a time alongside its remaining budget, chooses how long to reason, and emits a single response containing its chain-of-thought and a \boxed{…} final answer. The environment grades the answer against ground truth and returns a reward and the next question — until the episode terminates (10 questions completed or budget exhausted, depending on mode).

Dataset scope today: MetaMathQA only, first 5 000 rows. AI-MO/NuminaMath-TIR is wired into the sampler (NUMINA_PROBLEM_TYPE = "NuminaMath_TIR", numina_subset_size) but kept out of the current training mix; enabling the Numina channel for an even MetaMath + Numina mix is tracked in Future Work.

The agent's action interface is deliberately minimal: raw text / JSON output, no tool-call protocol, no markdown parsing fragility. The LLM outputs a response string; the env parses it. Crucially, the training client (ReasoningEconomicsPT) never imports env Pydantic types — it speaks dict shapes over the wire, matching OpenEnv's client/server contract.

Environment design

MDP

The core contract is two Pydantic types exchanged over the OpenEnv WebSocket:

```python
# Observation (env → agent)
class ReasonBudgetObservation(Observation):
    question: str                    # raw problem text
    remaining_budget: int            # tokens left in the episode
    questions_remaining: int
    budget_per_remaining_question: float   # pacing signal
    accuracy_so_far: float
    episode_history: list[HistoryItem]     # in-context Q/A memory
    done: bool
    reward: Optional[float]
    metadata: dict                   # problem_type, total_budget, budget_source,
                                     # budget_mode, min_tokens, max_tokens

# Action (agent → env)
class ReasonBudgetAction(Action):
    response: str                    # thinking trace + \boxed{answer}
    metadata: dict                   # optional tokenizer_name override,
                                     # optional grading_response (visible tail)
```
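The training client never imports these Pydantic types; it drives the episode with plain dicts over the wire. A minimal sketch of that loop — `run_episode` and the `env` / `solve` interfaces are hypothetical stand-ins for the real `EpisodeSession` context manager:

```python
def run_episode(env, solve):
    """Drive one episode against any object exposing reset()/step(dict)
    that returns plain-dict observations shaped like ReasonBudgetObservation.
    solve(question, remaining_budget, pacing) -> response string."""
    obs = env.reset()
    total_reward = 0.0
    while not obs["done"]:
        # The policy only ever sees the question plus pacing signals.
        response = solve(obs["question"],
                         obs["remaining_budget"],
                         obs["budget_per_remaining_question"])
        obs = env.step({"response": response, "metadata": {}})
        if obs["reward"] is not None:
            total_reward += obs["reward"]
    return total_reward
```

The real client holds one WebSocket open for the whole loop (the one-socket-per-episode invariant discussed under Architecture).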

Reward

The reward has two components, both grounded in the OpenEnv per-step plus terminal-bonus pattern. Both terms are alive at once — the per-step cost penalty rewards shorter traces, the correctness term rewards right answers, and the terminal bonus couples the two multiplicatively so neither can be sacrificed to optimize the other.

Per-step (accumulated every turn):

\[ r_{\text{step}} = \text{correctness} + \text{efficiency\_bonus} - \text{cost\_penalty} - \text{overspend\_penalty} \]

Terminal (added to the final step's reward):

\[ r_{\text{episode}} = \lambda_{\text{ep}} \cdot \bigl(\text{episode\_accuracy} \,\times\, \text{budget\_utilization\_score}\bigr) \]

Where budget_utilization_score = max(0, 1 − |spent/total_budget − target_utilization|) rewards finishing close to, but not over, the target utilization. fair_share = total_budget / num_questions is used in both efficiency_bonus and cost_penalty.

Why the product form. Compressing every trace to zero tokens (accuracy = 0) and answering correctly but wasting the budget (utilization bad) are both punished. The only way to maximize r_episode is to spend the budget well and be right.
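Under the defaults (λ_ep = 0.5, target_utilization = 0.9; see Reward hyperparameters), the terminal bonus reduces to a few lines. A sketch of the product form — not the repo's compute_episode_bonus verbatim:

```python
def episode_bonus(episode_accuracy, spent, total_budget,
                  lambda_ep=0.5, target_utilization=0.9):
    """Terminal bonus: accuracy and utilization are coupled multiplicatively,
    so zeroing either factor zeroes the bonus."""
    utilization_score = max(0.0, 1.0 - abs(spent / total_budget - target_utilization))
    return lambda_ep * episode_accuracy * utilization_score

episode_bonus(1.0, 900, 1000)   # perfect accuracy at 90% utilization -> 0.5
episode_bonus(1.0, 400, 1000)   # perfect accuracy, budget half-wasted -> 0.25
episode_bonus(0.0, 900, 1000)   # perfect pacing, all answers wrong   -> 0.0
```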

Reward hyperparameters

All runs in this blog use the repo defaults from ReasoningEconomicsEnv/env/config.py and ReasoningEconomicsEnv/env/reward.py — nothing tuned per-run. They are:

| Symbol | Name | Default | Role |
|---|---|---|---|
| β | beta (cost-penalty weight) | 0.05 | Linear per-step token cost: β · max(0, tokens_used / fair_share − 1). Only fires when the step overspends its fair share. |
| γ | gamma (efficiency-bonus weight) | 0.1 | Reward for solving under fair share: γ · (1 − spend_ratio), correct steps only. |
| λ_ep | lambda_ep (terminal weight) | 0.5 | Scales terminal episode_accuracy × budget_utilization_score. Product form prevents unilateral optimization of either factor. |
| — | target_utilization | 0.9 | Utilization peak for budget_utilization_score; rewards finishing close to 90% of the total budget. |
| — | correctness reward | +1.0 / −0.1 | Per-step: +1 on SymPy match, −0.1 on wrong — a small negative signal so trivial "don't answer" policies lose reward. |
| — | soft_overspend_penalty | 0.25 | Active only in soft-budget mode: 0.25 · (overspend_tokens / fair_share). Hard-cap mode zeroes this term. |
| — | budget_ratio | 2.0 | Fallback total-budget multiplier when no total_budget and no tokenizer are passed (budget priority table). |
| — | num_questions / min_tokens / max_tokens / max_tokens_per_step | 10 / 10 / 800 / 2048 | Episode length and per-step token window; min_tokens also sets hard-cap early termination. |

Values are the EnvConfig dataclass defaults and the default kwargs on compute_reward / compute_episode_bonus. They were not swept in this submission; tuning β, γ, and λep jointly against baseline runs is part of Future Work.
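Putting the per-step defaults together — a sketch of the step reward under our reading of the table (`step_reward` is a hypothetical name; gating the efficiency bonus at fair share follows the "solving under fair share" wording):

```python
def step_reward(correct, tokens_used, fair_share,
                beta=0.05, gamma=0.1, soft_overspend=0.0):
    """Per-step reward with the repo-default weights.
    correctness: +1.0 on SymPy match, -0.1 on a wrong answer."""
    spend_ratio = tokens_used / fair_share
    correctness = 1.0 if correct else -0.1
    # Efficiency bonus only for correct answers that came in under fair share.
    efficiency_bonus = gamma * (1.0 - spend_ratio) if correct and spend_ratio < 1.0 else 0.0
    # Cost penalty only fires past fair share.
    cost_penalty = beta * max(0.0, spend_ratio - 1.0)
    return correctness + efficiency_bonus - cost_penalty - soft_overspend
```

With fair_share = 100: a correct 60-token step scores 1.0 + 0.1·0.4 = 1.04; a wrong 150-token step scores −0.1 − 0.05·0.5 = −0.125.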

Budget modes

| Mode | Behavior | Use |
|---|---|---|
| Hard-cap (default) | Per-step spend clipped to remaining_budget; episode terminates early when remaining_budget < min_tokens | Final evaluation, competition scoring |
| Soft-budget | No clipping, no early termination; overspend smoothly penalized | Training curriculum — lets the policy experience the whole episode before discipline is enforced |

Dual modes are not a convenience. Hard-cap's early termination produces zero-advantage groups in GRPO: uniform truncation across all generations → std(r)=0 → zero gradient. Soft-budget bridges that window until the policy learns to finish. This is the same pathology we diagnose in full under Engineering Lessons (Pathology 2).
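The two modes differ only in their spend accounting. A sketch (`apply_spend` is a hypothetical helper; the real logic lives in the env's step handler, and reading `overspend_tokens` as depletion below zero is our assumption):

```python
def apply_spend(remaining, spend, fair_share, mode="hard",
                min_tokens=10, soft_weight=0.25):
    """Returns (new_remaining, overspend_penalty, terminate_early)."""
    if mode == "hard":
        charged = min(spend, remaining)      # clip to what's left
        remaining -= charged
        return remaining, 0.0, remaining < min_tokens
    # Soft mode: no clipping, no early exit; overspend smoothly penalized.
    remaining -= spend
    overspend = max(0, -remaining)
    return remaining, soft_weight * (overspend / fair_share), False
```

Hard-cap truncates and can end the episode; soft-budget lets the same overspend through and charges 0.25 · (overspend / fair_share) instead.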

Tokenizer-native budgets

A subtle bug we surfaced and fixed: per-step spend was counted with a live AutoTokenizer, but the episode cap (total_budget) was computed from an abstract config formula in an entirely different unit system. Caps and spends did not share units. The environment now resolves total_budget at reset() in priority order:

| Priority | Condition | Formula | budget_source |
|---|---|---|---|
| 1 | Client passes total_budget | Exact integer | "client" |
| 2 | Client passes tokenizer_name | budget_ratio × Σ tokenize(q_i) over all questions | "tokenizer_native" |
| 2b | Tokenizer load fails | Config formula + warning | "config" |
| 3 | Neither passed | budget_ratio × N × (min_tokens + max_tokens) / 2 | "config" |

Observation metadata returns total_budget and budget_source so the client can verify the path taken. Cap and spend now live in the same policy-token unit system. The fix is exactly the tokenizer-mismatch mitigation described in cross-chat handoff Issue 2b: aligning the env's AutoTokenizer id with the policy tokenizer via --env_tokenizer_name (or the Hub id rewritten into REPT_MODEL_HUB_ID when the checkpoint is a local path).
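The priority order reduces to a short resolver. A sketch (`resolve_total_budget` is a hypothetical name; the real logic runs inside `reset()`):

```python
def resolve_total_budget(total_budget=None, tokenizer=None, questions=(),
                         budget_ratio=2.0, min_tokens=10, max_tokens=800,
                         num_questions=10):
    """Resolve the episode cap in priority order: client value >
    tokenizer-native > config formula. Returns (budget, budget_source)."""
    if total_budget is not None:                       # priority 1
        return int(total_budget), "client"
    if tokenizer is not None:                          # priority 2
        try:
            n = sum(len(tokenizer(q)) for q in questions)
            return int(budget_ratio * n), "tokenizer_native"
        except Exception:                              # priority 2b: load/encode failure
            pass
    # Priority 3: abstract config formula.
    return int(budget_ratio * num_questions * (min_tokens + max_tokens) / 2), "config"
```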


Scoring: per-step + terminal, coupled multiplicatively

The reward has two layers. The per-step reward fires every turn and sums into an episode total; the terminal bonus is added only at the final step and couples accuracy with budget utilization through a product. An optional scalar --alpha multiplies the per-step reward before episode accumulation (raw_step_reward * alpha inside EpisodeSession); the client-side --beta flag is reserved for future shaping and currently has no effect (distinct from the reward's β cost-penalty weight).

| Component | When | What it rewards |
|---|---|---|
| Correctness | per step | Boxed answer matches ground truth under SymPy equality |
| Efficiency bonus | per step | Right answer on an easy item with few tokens |
| Cost penalty | per step | Linear in tokens spent this turn (full decoded response, not just visible tail) |
| Overspend penalty | per step (soft-budget only) | Smooth penalty for going over target utilization |
| Terminal bonus | last step only | λ_ep × (episode_accuracy × budget_utilization_score) |

Why multiplicative coupling. Sum-of-components rewards reward hacking: the policy can drop accuracy to zero as long as it aces utilization, or vice versa. The product kills both shortcuts: if either factor is zero, the terminal bonus is zero, regardless of how well the other is optimized. The agent has to be right and pace itself — which is the entire learning problem.

Architecture & training pipeline

The project is two strictly separated packages: ReasoningEconomicsEnv (the OpenEnv environment) and ReasoningEconomicsPT (the GRPO training client). They communicate exclusively over WebSocket — no in-process imports. The PT repo subclasses EnvClient as ReasonBudgetClient with plain dict actions and observations, so it never touches env Pydantic types.

Two-repo cheat sheet

Every code reference in this blog lives in exactly one of the two repos below. When a symbol is mentioned in later sections (Engineering Lessons, Stack Split, Quick Start), this table is the canonical resolution.

| Symbol / file | Repo | Path | Role |
|---|---|---|---|
| ReasonBudgetEnvironment | Env | env/reason_budget_env.py | FastAPI + OpenEnv environment served over WebSocket; one instance per session. |
| ReasonBudgetObservation / ReasonBudgetAction | Env | env/models.py | Pydantic wire types; PT never imports these, only dict shapes. |
| EnvConfig | Env | env/config.py | Episode + budget defaults; overridden by REASON_BUDGET_* env vars. |
| compute_reward / compute_episode_bonus | Env | env/reward.py | Per-step and terminal reward math (Scoring). |
| EpisodeSampler / dataset loaders | Env | env/episode_sampler.py, data/loaders.py | MetaMathQA window (subset_size=5000); Numina wired but disabled. |
| start_openenv_server.sh | Env | scripts/start_openenv_server.sh | Spawns the FastAPI WebSocket on 127.0.0.1:8000. |
| ReasonBudgetClient / EpisodeSession | PT | clients/reason_budget_client.py | Dict-typed EnvClient subclass; context manager holds one WebSocket for the full episode. |
| rollout_func | PT | training/rollout.py | TRL 1.0 multi-turn rollout driver; emits env_reward / env_mask. |
| resolve_env_tokenizer_name | PT | training/tokenizer_sync.py | Aligns env tokenizer with policy tokenizer via REPT_MODEL_HUB_ID. |
| _sync_fsdp2_params_to_vllm | PT | training/weight_sync.py | Weight-sync path to trl vllm-serve under Branch A (FSDP2). |
| REPT_* env vars | PT | scripts/run_grpo_lambda.sh | REPT_MODEL, REPT_NUM_GPUS, REPT_VLLM_MODE, REPT_VLLM_TP, REPT_VLLM_PORT, REPT_MODEL_HUB_ID. |
| REASON_BUDGET_* env vars | Env | env/config.py | REASON_BUDGET_NUM_QUESTIONS, REASON_BUDGET_HARD_CAP_MODE, REASON_BUDGET_BUDGET_RATIO, REASON_BUDGET_TOKENIZER_NAME. |
| reward_logs.jsonl | PT writes, mirrors Env schema | runs/<run_id>/reward_logs.jsonl | Per-step reward audit (step_index, remaining_budget_before, visible_response, raw_step_reward, scaled_step_reward, done_after_step, episode_reward). |

Shorthand used below: Env-side = sharma-yash01/ReasoningEconomicsEnv; PT-side = sharma-yash01/ReasoningEconomicsPT.

```mermaid
flowchart LR
    Policy["Policy GPUs 0-5, ZeRO-3, CPU optimizer offload"]
    VLLM["vLLM server GPUs 6-7, tensor parallel 2"]
    Env["OpenEnv FastAPI WebSocket localhost 8000"]
    RewardLog["reward_logs.jsonl"]
    Policy -->|rollout_func| VLLM
    VLLM -->|generations| Policy
    Policy -->|reset, step| Env
    Env -->|Observation, reward, done| Policy
    Env --> RewardLog
```

Figure 1. 8×A100 production topology for the headline Qwen3-14B run (Branch B). GPUs 0–5 run the DeepSpeed ZeRO-3 trainer with CPU optimizer offload and Unsloth-integrated LoRA; GPUs 6–7 run trl vllm-serve with tensor_parallel_size=2; the OpenEnv server runs on a separate process listening on ws://127.0.0.1:8000. The Qwen2.5-3B tranche (3B results) runs on Branch A instead — FSDP2 sharding, full fine-tune, no LoRA. Configurable via env vars REASON_BUDGET_NUM_QUESTIONS, REASON_BUDGET_HARD_CAP_MODE, REASON_BUDGET_BUDGET_RATIO, REASON_BUDGET_TOKENIZER_NAME; reproducible launch via start_openenv_server.sh.

Training uses GRPO (Group Relative Policy Optimization) via TRL 1.0.0's rollout_func contract, which gives us explicit control over the generate → parse → step loop. We chose rollout_func over environment_factory specifically to avoid TRL 1.0.0's Qwen3-only add_response_schema allowlist — that path is biased toward Qwen3/Qwen3.5 chat-template parsing, while we want to run Qwen2.5 and other families too.

The critical invariant is one WebSocket per episode. _rollout_one_episode runs inside with EpisodeSession(...) as session:, so reset and every step share the same socket. Per turn: trainer.vllm_generation.generate() → decode → session.apply_response(text, …) → remote step({"response": …}). The function returns prompt_ids, completion_ids, logprobs, env_mask, and env_reward; the reward hook reward_from_env(…, **kwargs) simply reads kwargs["env_reward"]. The env's tokenizer id on reset comes from resolve_env_tokenizer_name: --env_tokenizer_name if set, otherwise the tokenizer's name_or_path, falling back to --model. When the checkpoint is a local NFS path, the launcher saves the Hub id into REPT_MODEL_HUB_ID so the remote env receives an HF-resolvable id.

Hybrid-thinking models need per-family wiring. training/model_profiles.json + training/model_profiles.py provide a ModelProfileRegistry keyed on model id (exact match first, then longest-prefix), supplying chat_template_kwargs (e.g. enable_thinking), output_parser (qwen3_think or null), think-tag delimiters, and grading_use_visible_only. Env-side invariant: budget always counts _count_tokens(action.response) on the full string, while grading uses metadata["grading_response"] (visible tail) when non-empty. Budget stays honest; grading stays robust to think traces.
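The budget/grading split above is small but load-bearing. A sketch of the invariant (`split_for_budget_and_grading` is an illustrative helper, not the env's function name):

```python
def split_for_budget_and_grading(response, close_tag="</think>"):
    """Env-side invariant: the budget counts the FULL response string,
    while grading uses only the visible tail after the think block
    (when one is present)."""
    budget_text = response                       # _count_tokens sees everything
    visible = response.split(close_tag, 1)[-1] if close_tag in response else response
    return budget_text, visible.strip()
```

This is why a model cannot hide spend inside `<think>` (the budget still charges it) while grading stays robust to arbitrarily long traces.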

NCCL padding. In vllm_mode=server with world_size > 1, every trainer.vllm_generation.generate() runs accelerate.gather_object → NCCL all_gather_object / broadcast_object_list. Our rollout is a while not session.done loop, so different ranks make different numbers of generate() calls per episode — permanent collective desync. Fix: a fixed DIST_SERVER_GENERATES_PER_EPISODE = 8 cap, with dummy 1-token generates padding each episode to exactly 8 calls. Dummies are discarded; env_reward, completion_ids, and logprobs are byte-identical to the unpadded case. This is active only in server mode with DDP; colocate + TP=1 does not enter the gather_object path.
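The padding fix in miniature — `padded_generates` and the `generate` signature are illustrative, not the PT repo's API:

```python
DIST_SERVER_GENERATES_PER_EPISODE = 8  # every rank makes exactly this many calls

def padded_generates(real_prompts, generate):
    """Issue real generates for active turns, then 1-token dummies so every
    rank performs the same number of NCCL-backed collectives per episode.
    Dummy outputs are dropped; real outputs are returned unchanged."""
    outputs = [generate(p, max_tokens=None) for p in real_prompts]
    for _ in range(DIST_SERVER_GENERATES_PER_EPISODE - len(real_prompts)):
        generate("", max_tokens=1)             # dummy call, result discarded
    return outputs                              # identical to the unpadded case
```

Because the dummies never enter the reward, mask, or logprob tensors, credit assignment is unaffected; only the collective call count is equalized.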

Training pathology & zero-advantage collapse

We observed three named, reproducible pathologies when shipping GRPO + OpenEnv + vLLM on hybrid-thinking models under a shared-budget MDP: NCCL desync under variable-length rollouts in server mode; truncation-induced zero-advantage collapse when every completion in a GRPO group hits the same clip boundary; and stack non-composition across TRL + vLLM + Unsloth + FSDP. Before the headline 14B run, the dominant learning-signal failure was truncation collapse — WebSocket, padding, and tokenizer alignment could all be green while the policy still received a structurally zero gradient.

Full telemetry tables, evidence links, a log-backed truncation episode, root-cause chains, and structural fixes are documented once in Engineering Lessons (Pathology 1, Pathology 2 including an expandable log excerpt, Pathology 3, then Takeaways). This section stays short so the blog does not narrate the same three failures twice.

Results: what we found

Runs are organized by what they contribute: (1) the headline Qwen3-14B 8×A100 completed run, our strongest positive-mean-reward evidence; (2) the Qwen2.5-3B tranche on 1×H100 that validates the pipeline end-to-end; (3) boundary / failure runs that delimit the tractable region of the TRL 1.0 / vLLM stack.

Headline — Qwen3-14B, 8×A100, ZeRO-3 + vLLM TP=2, true-4q

Run 14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe (see Engineering Pathology 3 for topology). 20/20 optimizer steps completed with artifacts saved. First completed multi-question shared-budget GRPO training run on 14B.

| Metric | Value |
|---|---|
| Episodes | 480 |
| Env turns | 1920 |
| Mean episode reward | +0.4692 ± 0.9758 |
| Min / Max episode reward | −0.40 / +4.5205 |
| Accuracy (per-question) | 17.76% |
| Cap-hit rate | 13.54% |
| env_step_error | 0 |

Why true-4q (four questions per episode), not the designed ten. The headline run sets num_questions=4, not 10, because Qwen3-14B + Unsloth LoRA + vLLM tensor-parallel inference + multi-turn thinking-mode rollouts saturated 40 GB A100s at four questions per episode with two generations per GRPO group. Pushing to ten questions per episode triggered OOM in the rollout cache on the same hardware. That reduction is a VRAM ceiling, not a claim that the full MDP is solved; the designed 10-question battery remains the target for future hardware or a smaller policy (see Pathology 3 for why LoRA + ZeRO-3 + CPU offload entered the recipe).

Three things this run shows. (1) The multiplicatively-coupled per-step + terminal reward designed in Scoring can produce positive mean under a shared budget — previously the headline 3B runs were all negative-mean. (2) The max episode reward of +4.52 means at least one episode hit both high accuracy and good utilization; the product is only large when both factors are. (3) env_step_error = 0 over 1920 turns means the OpenEnv WebSocket invariant, tokenizer alignment, and DeepSpeed ZeRO-3 / vLLM TP=2 split all held under a full training run, not just a smoke.

Contextualizing the 17.76% accuracy — baselines not yet run

We have not yet run baselines against the headline 14B checkpoint. The per-question accuracy (17.76%) is therefore uncalibrated. It should be read only against the designed performance targets for each baseline in Baselines (planned), reproduced here for the reader who doesn't want to scroll:

| Baseline (planned, not run) | Designed performance target | What a converged RL policy should beat |
|---|---|---|
| uniform-random-split | Mean reward ≈ 0, accuracy distributed by dataset-difficulty mean; utilization unshaped | Lower bound — any structured policy should clear this |
| greedy-first | Accuracy capped at ≤ 1/N (25% for 4q, 10% for 10q) because later questions are starved; utilization poor on the depleted tail | Our policy must show cross-prompt pacing, not just solve Q1 |
| always-same-budget | Accuracy approaches the dataset-difficulty mean at the given total_budget / N; zero utilization shaping because allocation is oblivious to difficulty | Our policy's marginal value comes from difficulty-aware allocation, not just from having a budget at all |
| zero-shot LLM (same Qwen3-14B, no RL) | Measurable ceiling achievable without RL; expected to be strong on per-question accuracy but poor on budget_utilization_score because the base model does not pace | RL contribution = product gain (accuracy × utilization) the base model cannot reach |

17.76% per-question accuracy over the true-4q battery is meaningful only once the zero-shot LLM row above is filled in. Until then it is a raw rate, not a claim about RL uplift. Baseline evaluation against the released checkpoint is the next result in Future Work.

Supporting — Qwen2.5-3B tranche (1×H100)

Validates the full pipeline end-to-end. Reward mean improves monotonically as max_completion_length and max_tokens_per_step grow — the interesting signal is the std and the positive tail, not the centered mean.

| Run | Model | Episodes | Key settings | Mean (max) |
|---|---|---|---|---|
| qwen25_3b_best_single_h100 | Qwen2.5-3B-Instruct | 87 | batch=2, gens=2, lr=7e-7, mcl=640, turns=4 | −0.4598 (+1.29) |
| qwen25_3b_better | Qwen2.5-3B-Instruct | 285 | 2 epochs, same shape | −0.3156 (+1.33) |
| qwen25_3b_more_context | Qwen2.5-3B-Instruct | 368 | 2 epochs, mcl=1024, tokens/step=512 | −0.2279 (+2.34) |
| qwen25_3b_best_v2 | Qwen2.5-3B-Instruct | 687 | 4 epochs, mcl=1024, tokens/step=1024, lr=5e-7 | −0.1476 (+2.29) |

Plumbing smokes on Qwen2.5-0.5B confirm the trainer path (loss 0.00889 / 0.006714; wall-clock 1246 s / 1279 s). These are infrastructure validation, not research signal.

Boundary / failure runs

Where today's stack breaks. Each row is a training-stack limit, not an env or reward-shape limit.

| Attempt | Setup | Outcome | Blocker |
|---|---|---|---|
| Unsloth 14B, 1×A100-40GB | vllm_mode=colocate, max_model_len=3500, gens=2 | 1/120 steps; grad_norm=NaN; selective_log_softmax RuntimeError | Unsloth truncation × thinking-mode completions (Pathology 2) |
| qwen25_7b_4q_ultralow | 1×H100 colocate, mcl=256 | Repeated CUDA OOM | 7B thinking context doesn't fit colocate headroom |
| Qwen3-8B sharded | 8×A100 FSDP + vLLM server | No stable reward summary | TRL/FSDP weight-sync (Pathology 3) |
| Qwen3-30B-A3B-Instruct-2507 | 8×A100 FSDP + server-mode vLLM | No completed summary | vLLM KV-cache / communicator startup |
| Qwen2.5-32B-Instruct | 8×A100 FSDP + server-mode vLLM | No completed summary | Prompt length + vLLM startup + NCCL init |
| Early Qwen2.5-14B smoke | 8×A100 FSDP + server | 2 episodes (reward −0.07, −0.30) | Superseded by the headline ZeRO-3 run |

Baselines (planned)

No baselines have been computed against the headline 14B checkpoint yet. We planned four, in the spirit of LotteryElicitationEnv's baseline set, each isolating one axis of the allocation decision.

| Baseline | Policy | What it isolates |
|---|---|---|
| always-same-budget | Allocate total_budget / N per question; greedy decode within each per-question cap | Difficulty-awareness gain (is it just the total budget that matters, or how it's split?) |
| greedy-first | Spend up to cap on Q1, Q2, … until budget runs out; truncate remainder | Pacing cost (what's lost by no-foresight allocation?) |
| uniform-random-split | Dirichlet-sampled per-question allocations at the same total budget | Lower bound — beating it proves the policy learned anything structured |
| zero-shot LLM | Same base model (Qwen3-14B) with no RL, greedy decode against the same battery at the same budget | RL contribution — measurable upper bound achievable without learning |

Baseline performance targets. By construction we expect uniform-random-split → mean reward ≈ 0 (difficult to ace utilization by accident); greedy-first → accuracy capped at ≤ 1/N because later questions are starved; always-same-budget → mean reward approaches the env-difficulty mean with zero utilization shaping; zero-shot LLM is the measurable ceiling without RL shaping. A converged RL policy should beat all four on the product (accuracy × utilization), not any single factor.
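When we do run it, uniform-random-split needs nothing beyond the stdlib. A sketch (function name and shape are ours; a Dirichlet(1,…,1) draw is built from normalized Gamma(1,1) samples):

```python
import random

def uniform_random_split(total_budget, n, rng=None):
    """Split a shared budget over n questions via a flat Dirichlet draw:
    sample n Gamma(1,1) variates and normalize to the total budget."""
    rng = rng or random.Random(0)
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    s = sum(draws)
    return [int(total_budget * d / s) for d in draws]
```

Integer truncation means the allocations sum to at most the budget, which matches the hard-cap contract.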

Status

We have: a completed 14B training run with positive mean reward (+0.4692), a validated 3B tranche end-to-end, a mapped set of stack boundaries, and three named pathologies (NCCL desync, truncation collapse, stack non-composition) with reproducible signatures and structural fixes. We do not yet have: baselines evaluated against the headline checkpoint, or cross-family model comparisons. Those are Future Work.

Engineering lessons

Shipping a real GRPO + OpenEnv + vLLM pipeline on a multi-turn verifiable-reward environment surfaced three major pathologies. Each one is a named learning-signal failure with a reproducible signature and a structural fix. We document them so the next OpenEnv submission can avoid the same dead ends.

Pathology 1 — NCCL desync under variable-length rollouts

Signature. all_gather_object / broadcast_object_list hang on the second optimizer step; heartbeat timeout at ~7200 s; all ranks report env_step_error=0 and ConnectionClosedError=0 on the rollout but block at the post-rollout barrier. NCCL flight recorder shows mismatched last enqueued / last completed sequence numbers across ranks — structural, not configurational.

Root cause. In vllm_mode=server with world_size > 1, every trainer.vllm_generation.generate() runs accelerate.gather_object → NCCL all_gather_object → broadcast_object_list. Our rollout is a while not session.done loop — different ranks make different numbers of generate() calls per episode. NCCL collectives are sequence-numbered; different call counts per rank → permanent desync → eventual _pickle.UnpicklingError or a BROADCAST timeout. A secondary trigger: stale __pycache__ on NFS kept loading the pre-fix .pyc, reintroducing the desync after git pull reported the repo clean.

Evidence. Full write-ups in impl-context/dist-train-desync-issue.md and impl-context/dist-train-issue-hung-gpu.md.

Fix stack.

  1. Clear bytecode caches in every launch script:
    find $REPT_ROOT -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null
    find $REPT_VENV -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null
    PYTHONDONTWRITEBYTECODE=1 torchrun --nproc_per_node=7 ...
  2. Fixed-count generate() padding per episode. Every rank performs exactly DIST_SERVER_GENERATES_PER_EPISODE = 8 calls — real generates for active turns, 1-token dummies (discarded) for the rest. Active only when vllm_mode == "server" and world_size > 1. Reward, logprobs, and credit assignment are byte-identical to the unpadded case.
  3. On the 8×A100 headline path we sidestep the server-mode collective entirely: DeepSpeed ZeRO-3 with vLLM on dedicated GPUs keeps the rollout collective off the hot path (see Pathology 3). This is what unblocked 14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe.
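The fixed-count padding in fix 2 can be sketched as follows. `session` and `generate` here are stand-ins, not the real OpenEnv session or TRL generation API — only the call-count invariant is the point: every rank issues exactly the same number of generate() calls, so NCCL sequence numbers stay aligned.

```python
GENERATES_PER_EPISODE = 8  # mirrors DIST_SERVER_GENERATES_PER_EPISODE

def rollout_with_padding(session, generate):
    """Run a variable-length episode while issuing a fixed number of
    generate() calls: real generates for active turns, 1-token dummies
    (discarded) for the rest, so every rank's collective count matches."""
    real_outputs = []
    calls = 0
    while not session.done and calls < GENERATES_PER_EPISODE:
        out = generate(session.prompt, max_new_tokens=session.cap)
        session.step(out)              # env advances; may set session.done
        real_outputs.append(out)
        calls += 1
    while calls < GENERATES_PER_EPISODE:
        generate("", max_new_tokens=1)  # dummy call; output discarded
        calls += 1
    return real_outputs                # reward/credit assignment see only real turns
```

Reward, logprobs, and credit assignment operate only on `real_outputs`, which is why the padded rollout is byte-identical to the unpadded case from the learner's point of view.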

Why this matters beyond our repo. Any TRL rollout_func user running variable-length rollouts in server mode has this bug latent. Heartbeat and batch tuning only hide it longer.

Pathology 2 — Truncation-induced zero-advantage collapse

Signature. From the most recent Lambda run (terminal 36, Qwen3-4B + Unsloth, step 1/120):

Unsloth: Input IDs of shape torch.Size([2, 12986]) with length 12986
         > the model's max sequence length of 3500.
         We shall truncate it ourselves.
RuntimeError: Size does not match at dimension 1
  expected index [2, 12760, 1] to be no larger than
  self [2, 3499, 151936] apart from dimension 2   # in selective_log_softmax

Upstream telemetry in the same step: completions/clipped_ratio = 1.0, mean_terminated_length = 0, frac_reward_zero_std = 0.375, importance_sampling_ratio/mean = 0.072, grad_norm = NaN. Every completion is clipped, zero episodes terminate with </think> + \boxed{}, and 37% of GRPO groups have reward_std = 0 → advantages collapse to a near-constant → gradient is noise → NaN.

Root cause. Unsloth silently patches the policy tokenizer to max_seq_length = 3500 even when vLLM is configured at 32768. GRPO computes completion logits against the truncated 3499-token sequence, then tries to gather at the untruncated 12760 completion indices — the shapes disagree and the forward breaks. More broadly: when truncation is uniform across the GRPO group, every completion lands at the same wrong answer → std(r) = 0 → A_i = 0 → zero gradient. The policy also learns to emit filler (repeated !) because that is what survives the hard cap with the least cost penalty.

Fix. Two changes together, not either alone. (1) Keep Unsloth, but only inside the DeepSpeed ZeRO-3 branch we document in Pathology 3 — in our runs, ZeRO-3's all-gather window empirically aligned with Unsloth's forward where FSDP's per-module summon_full_params did not, removing the selective_log_softmax shape mismatch we hit on FSDP. (2) Independently, enforce vllm_max_model_length = tokenizer.model_max_length = actual episode budget at startup, validated via an assertion, and raise max_completion_length to give the closing </think> and \boxed{} room. Start in soft-budget mode; warm into hard-cap. Increase num_generations to break group-level uniformity. LR tuning is the wrong instinct — the gradient is structurally zero, not noisy. Branch A (FSDP, full fine-tune) sidesteps the Unsloth composition entirely and does not hit this pathology.

Why this matters beyond our repo. Any GRPO run on a hybrid-thinking model under a budget-constrained MDP with a partially-right-but-cheap shortcut has this bug latent. The clipped_ratio = 1 + reward_std ≈ 0 pair is the fingerprint.
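The collapse mechanism is easy to see numerically. A minimal sketch of GRPO's group-relative advantage (one common normalized form; the trainer's exact epsilon handling may differ):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage A_i = (r_i - mean(r)) / (std(r) + eps).
    When truncation drives every reward in the group to the same value,
    std(r) = 0 and every advantage vanishes — the gradient carries no signal."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

collapsed = group_advantages([-0.1, -0.1, -0.1, -0.1])  # uniform truncation
healthy = group_advantages([1.0, 0.5, -0.1, -0.1])      # mixed outcomes
```

This is why raising num_generations helps: a larger group is more likely to contain at least one non-truncated completion, restoring a nonzero reward spread.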

Expand: telemetry table, log-backed truncation episode, and root-cause chain

The telemetry that reveals collapse

| Signal | Observed value | What it means |
| --- | --- | --- |
| reward | ≈ −0.1 to −0.47 (flat) | Uniform negative reward across GRPO groups |
| completions/clipped_ratio | 1.0 | Every completion hits max_new_tokens |
| completions/mean_terminated_length | 0 | Nothing terminates naturally — no </think>, no \boxed{} |
| frac_reward_zero_std | 0.375 – 1.0 | Partial to full GRPO group collapse |
| importance_sampling_ratio/mean | ≈ 0.07 | IS ratio collapsing under policy drift |
| entropy, grad_norm | low / NaN | Near-zero (or explicitly broken) gradient signal |

Numbers above are drawn from the most recent Lambda train.log (Qwen3-4B, single A100-40GB, Unsloth LoRA, step 1/120) and reconciled against earlier 2×H100 Qwen3-4B trainer metrics. The headline Qwen3-14B 8×A100 run in Results resolved this signature — it is what a stack that doesn't truncate looks like.

A truncation-collapsed episode, log-verified

Training emits a fixed reward_logs.jsonl schema on every step (step_index, question, remaining_budget_before, visible_response, raw_step_reward, scaled_step_reward, done_after_step, episode_reward). The sanitized excerpt below — from the single-GPU Unsloth + Qwen3-4B stack — shows episode-level telemetry when truncation collapse is in full effect.

Step 1 of 10 · remaining_budget_before = 1256
Question: The matrices … are inverses. Enter the ordered pair (a,b). The answer is 14. What is the value of unknown variable X?
visible_response: "!!!!!!!!!!!! … (truncated) !!!!!!!!!!!!"
reasoning_trace: "" (empty)
raw_step_reward: −0.11 · done_after_step: false
Per Scoring / compute_reward: correctness = −0.1 (wrong answer) plus a small cost_penalty from mild overspend past fair_share (≈ −0.01); efficiency_bonus = 0 because the step is incorrect.
Step 9 · remaining_budget_before = 451.0
visible_response: same repeated-! pattern; was_correct = false.
raw_step_reward: −0.1
Here correctness = −0.1 alone; cost_penalty ≈ 0 (no meaningful overspend past fair_share on this step).
Step 10 · episode terminal
final_observation.accuracy_so_far: 0.0
episode_reward: −1.0174
Arithmetic: ten wrong steps each carry correctness = −0.1 (≈ −1.0 summed), plus small per-step cost_penalty terms where spend_ratio > 1; the terminal term λep · episode_accuracy · budget_utilization_score is zero because episode_accuracy = 0. The logged −1.0174 is therefore dominated by wrong-answer penalties, not "ten pure cost penalties."
Trainer-side consequence: completions/clipped_ratio = 1.0, mean_terminated_length = 0, frac_reward_zero_std = 0.375, grad_norm = NaN. Step 2 never runs.
Ten degenerate completions, zero correctness, zero terminal bonus by construction. The multiplicative coupling refuses to reward a policy that didn't solve anything — doing exactly what it was designed to do. The failure is upstream: the policy can't produce a syntactically valid answer because Unsloth truncated the input out from under it.

Caption: truncation-collapsed Unsloth episode on 1×A100, episode_reward = −1.0174. Grader behaves correctly; the failure is upstream in the stack.
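Given the fixed reward_logs.jsonl schema, the collapse fingerprint can be detected mechanically. A hedged sketch — the field names follow the schema quoted above, but the filler heuristic and thresholds are illustrative, not part of the repo:

```python
import json

def is_filler(text, threshold=0.8):
    """Heuristic: a visible_response that is mostly '!' characters."""
    return bool(text) and text.count("!") / len(text) > threshold

def looks_collapsed(jsonl_lines, min_steps=4):
    """Flag an episode whose logged steps are all filler with uniformly
    negative raw_step_reward — the truncation-collapse signature."""
    steps = [json.loads(line) for line in jsonl_lines]
    if len(steps) < min_steps:
        return False
    return (all(s["raw_step_reward"] < 0 for s in steps)
            and all(is_filler(s["visible_response"]) for s in steps))
```

Running a check like this over reward_logs.jsonl after step 1 would have flagged the episode above before the trainer ever reached the NaN.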

Root cause chain

  1. max_completion_length = 4096 default + Qwen3 thinking → <think>…</think> span alone consumes 1500–4000 tokens.
  2. Completion gets truncated before the closing </think> → no \boxed{…} answer.
  3. With grading_use_visible_only = True, the empty visible tail grades as incorrect; in the latest Lambda run the visible tail is a degenerate string of ! characters (no \boxed{}, reasoning_trace empty).
  4. Uniform truncation across all completions in the GRPO group → identical rewards → std(r) = 0 → A_i = 0 → zero gradient.
  5. Policy divergence over reuse epochs pulls the IS ratio down (mean ≈ 0.07, min = 0) — noise on top of zero signal.
  6. Hard-cap multi-turn budget amplifies the effect: verbose truncated completions terminate episodes early, shrinking the learning signal further.

The most recent Lambda run (2026-04-14) adds a second failure mode on top: with Unsloth enabled and max_model_len = 3500, the accumulated prompt + completion reached 12 986 tokens. Unsloth warned "We shall truncate it ourselves" and the trainer then crashed inside _get_per_token_logps_and_entropies → selective_log_softmax with RuntimeError: Size does not match at dimension 1 expected index [2, 12760, 1] to be no larger than self [2, 3499, 151936] apart from dimension 2. Truncated indices and full-length logits disagreed, the forward broke, and grad_norm became NaN. The fix is the same as the root truncation chain above: raise max_completion_length (and the vLLM max_model_len) to match what thinking-mode actually emits, or disable thinking mode outright. LR tuning cannot fix this — the gradient is structurally zero (or undefined).
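The length-alignment half of the fix reduces to a fail-fast startup assertion. A minimal sketch — the parameter names mirror the config keys discussed above but are illustrative, not the repo's actual helper:

```python
def assert_length_alignment(vllm_max_model_len, tokenizer_max_len, episode_budget):
    """Fail fast at startup if any layer of the stack could silently
    truncate below what the rollout actually emits (the Pathology 2
    trigger: Unsloth patching the tokenizer under vLLM's window)."""
    assert vllm_max_model_len == tokenizer_max_len == episode_budget, (
        f"length mismatch: vLLM={vllm_max_model_len}, "
        f"tokenizer={tokenizer_max_len}, episode budget={episode_budget}"
    )
```

Had this assertion run before step 1 of the Lambda job, the 32768-vs-3500 mismatch would have been a one-line error instead of a NaN three layers deep in selective_log_softmax.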

This failure mode is general, not specific to ReasoningEconomicsEnv. Any GRPO run on a hybrid-thinking model under a budget-constrained multi-turn MDP with a partially-right-but-cheap shortcut has this bug latent. We believe every future sequential-budget RL run on reasoning models needs to start from this diagnosis.

Pathology 3 — Stack non-composition; two validated branches

Signature. Unsloth for GRPO is not yet implemented! Just ignore this function. (terminal 36); FSDP2 shard-gather blocking rollout_func; vLLM colocate stealing VRAM from the policy on 40 GB A100s; repeated CUDA OOM at 7B on single H100 (qwen25_7b_4q_ultralow); Qwen3-30B-A3B and Qwen2.5-32B failures at vLLM KV-cache / communicator startup.

Root cause. TRL 1.0.0 + vLLM (colocate) + Unsloth + FSDP do not compose as a four-way intersection. Each pairwise composition has a known sharp edge (TRL PR #3582 FSDP weight-sync, Unsloth's FastLanguageModel.get_peft_model × FSDP, FSDP1 _is_root assertion under TRL's summon_full_params per child module, GuidedDecodingParams movement across vLLM versions). Specifically, Unsloth + FSDP does not work in this stack — FSDP's parameter sharding and Unsloth's fused kernels disagree on tensor shapes during the GRPO log-prob forward pass.

Resolution — two branches, chosen by model scale and LoRA need.

| Branch | Sharding / optimizer | LoRA | Where it's used | What it unlocks |
| --- | --- | --- | --- | --- |
| Branch A | FSDP2 (or FSDP1) via model-sharding-fsdp2.yaml / model-sharding.yaml, no CPU optimizer offload | None — full fine-tune | Multi-GPU server-mode vLLM; Qwen2.5-3B / Qwen3-4B tranches | Cleanest weight-sync to trl vllm-serve (_sync_fsdp2_params_to_vllm); no Unsloth interaction risk. |
| Branch B | DeepSpeed ZeRO-3 (stage 3 parameter + optimizer sharding) with CPU optimizer offload | Unsloth-integrated LoRA (4-bit QLoRA, Unsloth fused kernels) | Headline Qwen3-14B 8×A100 run (14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe) | Our finding: this is the only configuration we found in which TRL 1.0 + GRPO + Unsloth completes end-to-end at 14B under 40 GB A100s with trainable adapters. ZeRO-3's all-gather window empirically aligned with Unsloth's forward in our runs, where FSDP2's per-module summon_full_params did not. Neither pairing has an upstream-blessed recipe; we report it as an engineering contribution, not a library guarantee. |

Shared infra across both branches.

Evidence. Branch B is what 14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe ran on — 20/20 optimizer steps, 480 episodes, 1920 env turns, mean reward +0.4692, env_step_error = 0. Branch A is the path the Qwen2.5-3B tranche (Supporting — 3B tranche) ran on, end-to-end without LoRA.

Per-WebSocket invariant (prerequisite for both branches). The pathologies above rest on one OpenEnv invariant: one Environment instance per WebSocket session. EpisodeSession is a context manager held for the full multi-turn episode. Violating it — e.g. opening client.sync() inside reset/step — silently collapses reward to zero (episodes come back with num_steps=1, done_after_step=true, empty final_observation). Tokenizer id alignment between env and policy, via resolve_env_tokenizer_name and REPT_MODEL_HUB_ID, is the second half of that invariant.
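The invariant is mechanical to express: the session object is entered once and every step inside the episode reuses it. A simplified stand-in — the real EpisodeSession wraps a live WebSocket; this sketch only models the lifecycle and the anti-pattern it forbids:

```python
class EpisodeSession:
    """Toy stand-in for the real per-WebSocket session: one connection,
    one Environment instance, held open for the entire multi-turn episode."""
    def __init__(self, base_url):
        self.base_url = base_url
        self.connected = False
        self.steps_taken = 0
    def __enter__(self):
        self.connected = True      # one WebSocket opened here, once
        return self
    def __exit__(self, *exc):
        self.connected = False     # torn down once, at episode end
        return False
    def step(self):
        # Opening a fresh connection inside reset/step is the violation
        # that silently collapses reward (num_steps=1, done_after_step=true).
        assert self.connected, "step() outside the session violates the invariant"
        self.steps_taken += 1

def run_episode(base_url, n_turns):
    with EpisodeSession(base_url) as session:  # held for the FULL episode
        for _ in range(n_turns):
            session.step()                     # every turn reuses the same session
        return session.steps_taken
```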

Takeaways (single synthesis)

The engineering story of this submission reduces to three lessons. Each maps to one of the three pathologies above; together they cover every material failure we encountered, and every subsequent design decision in Environment Design, Scoring, and Architecture follows from one of them.

  1. Variable-length rollouts break NCCL server-mode by default (Pathology 1). Fixed-count padding or a dedicated-inference topology is the only structural fix.
  2. Truncation is not a plumbing problem; it is a learning-signal problem (Pathology 2). clipped_ratio = 1 + reward_std ≈ 0 ⇒ zero gradient. No LR tune can fix that; fix it with max_completion_length, soft-budget warmup, and num_generations.
  3. The reasoning-RL stack is a non-composition, but it splits cleanly into two branches (Pathology 3). Branch A = FSDP + full fine-tune (no LoRA) for small/mid policies. Branch B = DeepSpeed ZeRO-3 + Unsloth-integrated LoRA for 14B+. In our experiments, Unsloth + GRPO completed only under ZeRO-3, not under FSDP — an empirical split, not an upstream guarantee.

These three lessons are the only things from this blog worth copying into the next OpenEnv + TRL 1.0 + multi-turn submission before writing a single line of env code.

Positioning: online, sequential, shared-budget

quadrantChart
    title LLMs + Reasoning Economics · Budget Scope vs Horizon
    x-axis "Per-query budget" --> "Shared, session-level budget"
    y-axis "Single-turn / inference-time" --> "Sequential multi-turn MDP"
    quadrant-1 "Our target"
    quadrant-2 "Multi-turn, no shared budget"
    quadrant-3 "Most prior work"
    quadrant-4 "Emerging"
    "Token-Budget / CCoT": [0.12, 0.12]
    "Chain-of-Draft (Xu 2024)": [0.18, 0.2]
    "L1 / LCPO / O1-Pruner": [0.25, 0.25]
    "Kimi K1.5 Long2Short": [0.3, 0.3]
    "SelfBudgeter": [0.4, 0.3]
    "CoT-Valve / TokenSkip": [0.15, 0.18]
    "Budget Forcing / s1": [0.22, 0.32]
    "Dynasor-CoT": [0.2, 0.15]
    "ReasoningEconomicsEnv": [0.85, 0.88]
      

Figure 2. Positioning relative to prior reasoning-economics work. Our contribution occupies the shared-budget, sequential multi-turn quadrant; every prior system compresses within a single query.

If the quadrant chart fails to render (Mermaid quadrantChart is marked experimental), the intent is: horizontal axis runs from per-query budget (left) to shared session-level budget (right); vertical axis runs from single-turn inference-time methods (bottom) to sequential multi-turn MDPs (top). Prior work clusters in the bottom-left; ReasoningEconomicsEnv is the top-right.

Why the y-axis is "Sequential multi-turn MDP" — the online-learning angle

The vertical axis is kept as a sequential multi-turn MDP deliberately: our problem is an online-decision problem, not an offline length-control problem. Every prior family in the bottom-left picks a reasoning length once per prompt, in isolation — prompt-guided caps, RL length rewards (including Kimi K1.5 Long2Short), SFT/distillation on shorter traces, dynamic early exit. That is an offline decision in the RL-theory sense: the policy never sees the consequence of its earlier spend affecting what's available for later prompts.

Under a shared session-level budget, the policy must revise pacing after every single observation. The remaining-budget state is non-stationary by construction: every answer shrinks the feasible set of future allocations, and a wrong call on Q1 can starve Q10 irreversibly. This is exactly the online-learning setting — a stream of observations with no resets between decisions, where the cost of a bad action compounds across the episode rather than being absorbed by an independent prompt.

Three concrete consequences of the online framing:

Everything else in the bottom-left quadrant can be served by a length-annotated fine-tune or a decoding-time heuristic. The top-right quadrant — our target — cannot; it needs online sequential learning under a shared budget, which is the setting ReasoningEconomicsEnv is built to expose.

Foundations & citations

| Foundation | Role in this project | Citation |
| --- | --- | --- |
| GRPO | Critic-free RL objective with group-relative advantages; ideal for terminal-only and sparse verifiable rewards | Shao et al., arXiv:2402.03300 (DeepSeekMath) |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment, per-session state, concurrent sessions | HF Blog: Introducing OpenEnv |
| TRL 1.0 + rollout_func | Explicit multi-turn stepping; env_reward / env_mask contract; avoids add_response_schema Qwen3/3.5 allowlist | TRL × OpenEnv docs |
| vLLM | High-throughput inference; colocate and server modes; trl vllm-serve weight sync | vllm-project/vllm |
| Kimi K1.5 / Long2Short | State-of-the-art per-query length RL; the strongest "compress within a single trace" baseline we compare against | Moonshot AI, arXiv:2501.12599 (2025) |
| DeepSpeed ZeRO-3 | Stage-3 parameter + optimizer sharding with CPU offload; the sharding backbone of Branch B (14B + Unsloth LoRA) | Rajbhandari et al., arXiv:1910.02054 |
| Unsloth | Fused kernels for QLoRA training; we integrated it only on Branch B (ZeRO-3). Composition with TRL GRPO is an empirical finding in our stack, not a documented upstream pairing. | unslothai/unsloth |
| FSDP2 | Per-parameter fully-sharded data parallel; the sharding backbone of Branch A (full fine-tune, no LoRA) | PyTorch docs |
| MetaMathQA (active) / NuminaMath-TIR (planned) | Source dataset for episode question sampling — public, verifiable, SymPy-gradable. Current runs draw from the first 5 000 rows of MetaMathQA; NuminaMath-TIR is wired but disabled (Future Work). | meta-math/MetaMathQA, AI-MO/NuminaMath-TIR on Hugging Face |
| LotteryElicitationEnv/PT | Sibling project — structural template for two-repo split, rollout_func, DDP padding | Same monorepo |

Quick start

Supported quick-start path: single-GPU colocate. Multi-GPU paths (FSDP full fine-tune, and DeepSpeed ZeRO-3 + Unsloth LoRA for 14B) are described in Engineering Pathology 3; this section sticks to the simplest working configuration.

# 1. Environment (HF Space or local Docker)
# Local Docker is the most robust; point ENV_BASE_URL at http://127.0.0.1:8000.
# If you prefer the HF Space, use its direct host (https://<owner>-<space>.hf.space), not the hf.co/spaces page.
export ENV_BASE_URL="http://127.0.0.1:8000"

# 2. Training client (ReasoningEconomicsPT) — single-GPU colocate only
export REPT_ROOT="$PWD"
export REPT_VENV="$PWD/.venv"
export REPT_MODEL="Qwen/Qwen3-4B"
export CUDA_VISIBLE_DEVICES=0
export REPT_NUM_GPUS=1
export REPT_VLLM_MODE=colocate           # vLLM and policy share one GPU
export REPT_VLLM_TP=1

bash scripts/bootstrap_lambda.sh
bash scripts/preflight_lambda.sh
bash scripts/run_grpo_lambda.sh --dry-run
bash scripts/run_grpo_lambda.sh

All episodes are seeded. Grading is deterministic (extract_boxed_answer with last-match semantics + SymPy equality). Budget resolution is fully specified by the four-priority table in Environment Design, with budget_source returned in observation metadata for audit.
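A minimal sketch of the last-match grading convention — simplified to assume no nested braces inside the box, with a numeric comparison standing in for the SymPy equality the real extract_boxed_answer pipeline uses:

```python
import re

def extract_boxed_answer(text):
    """Return the contents of the LAST \\boxed{...} in text, or None.
    Simplified sketch: the real grader handles nested braces and applies
    SymPy symbolic equality downstream."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def answers_equal(predicted, gold, tol=1e-9):
    """Numeric stand-in for the SymPy equality check."""
    if predicted is None:
        return False
    try:
        return abs(float(predicted) - float(gold)) < tol
    except ValueError:
        return predicted.strip() == str(gold).strip()
```

Last-match semantics matter for thinking models: intermediate \boxed{} candidates inside the chain are superseded by the final committed answer.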

Dependency pin: trl==1.0.0 + vllm==0.10.2 + transformers>=5.2,<5.4 + torch==2.8.* (see requirements.lambda.txt / requirements.carc-cu121.txt). Multi-GPU variants and the two engineering branches (FSDP / DeepSpeed ZeRO-3) are covered in Engineering Pathology 3.

Baseline execution (planned — not yet run for our checkpoint)

Mirrors the LotteryElicitationEnv baseline CLI pattern; evaluates the trained policy against the four planned baselines in Baselines.

python -m reasoning_economics_pt.eval.evaluate \
    --policy hf --model ./outputs/ckpt-last \
    --episodes 200 \
    --baselines always_same_budget,greedy_first,uniform_random_split,zero_shot_llm \
    --num_questions 4 --budget_ratio 0.8 --hard_cap_mode strict

The harness reports reward_mean, accuracy_mean, budget_utilization_clamped, overspend_tokens, average tokens per question, and questions completed, per baseline and per policy. Episodes are seeded identically across baselines so allocation deltas are directly comparable.

When every token costs, can the model learn when to think?
Shared-budget reasoning under a verifiable reward is the test. The pipeline is built. The 14B headline run lands at mean +0.47 across 480 episodes; convergence, baselines, and 10q training are next.

Future work

Conclusion

ReasoningEconomicsEnv reframes the reasoning-economics question from how short can this answer be? to how should I spend what I have left? A stateless grader, a tokenizer-native budget accountant, and a multiplicatively-coupled per-step-plus-terminal reward give us a sequential MDP where every component is auditable and every dollar of compute is accounted for in the unit system the policy actually sees.

The engineering contribution is summarized in one place: Takeaways. We do not re-enumerate it here.

The research question remains open: can a GRPO-trained LLM learn to pace its own reasoning across a shared-budget episode? The headline Qwen3-14B 8×A100 run — 480 episodes, mean reward +0.4692, max +4.52, env_step_error=0 — is the first evidence we have that the answer is yes, subject to baselines landing against the same checkpoint (designed targets in Baselines). Baseline runs against the released 14B checkpoint and the planned NuminaMath-TIR channel are the next results on the roadmap.