An OpenEnv Benchmark Where LLMs Learn to Budget Their Own Thinking Across a Shared-Budget Episode.
Frontier reasoning models — DeepSeek-R1, QwQ, Qwen3 thinking mode, the o-series — over-spend tokens on easy items and under-spend on hard ones. Chain-of-thought length is only weakly correlated with ground-truth difficulty: trivial arithmetic consumes thousands of thinking tokens, and genuinely hard items get truncated before a \boxed{}. The finding is repeated across Han et al.'s Token-Budget-Aware LLM (arXiv:2412.18547), Xu et al.'s Chain-of-Draft (arXiv:2502.18600), and Moonshot AI's Kimi K1.5 Long2Short ablations (arXiv:2501.12599).
Inference tokens are not a per-query resource. In any real deployment — an exam battery, an eval suite, a multi-turn tool loop, a long-horizon agent — they are a shared, capped resource across a sequence of prompts. Misallocation is not just slow; it is lost accuracy per dollar. A deployed reasoner has to infer difficulty from text alone, decide what a prompt is worth given what's left, conserve on easy items so it can invest on hard ones, and pace itself under irrecoverable depletion.
The four families in Prior Work each cover one axis of reasoning-length control, and each leaves the same axis empty.
| Family | What it optimizes | Axis still empty |
|---|---|---|
| Prompt-Guided Token-Budget, Chain-of-Draft, CCoT, Token Complexity | Shorten a single chain via prompting | No cross-prompt budget; no learning |
| RL with Length Reward L1/LCPO, O1-Pruner, Kimi K1.5 Long2Short, DAST, SelfBudgeter | RL-trained per-response length control. Long2Short distills a long-reasoning teacher into a shorter policy; the reward is conditioned on one chain's length. | The policy cannot trade tokens from Q1 to Q7 — it is never shown a shared budget |
| SFT / Distillation CoT-Valve, TokenSkip, Z1 | Bake shorter reasoning into the weights | Still per-prompt; no episode state |
| Dynamic Early Exit Dynasor-CoT, Budget Forcing / s1, DEER | Decoding-time termination within one prompt | The policy has no knowledge that another prompt downstream will also need tokens |
The delta in one sentence. Kimi K1.5 Long2Short asks "when should I stop this chain?"; ReasoningEconomicsEnv asks "how should I split my budget across these N chains?" — a different action space (portfolio over prompts) under a different reward surface (joint accuracy × utilization across a battery). Long2Short has no notion of a shared episode budget; it cannot express the trade "save 400 tokens on Q1 so I have them on Q4" because Q1 and Q4 never share state in its MDP.
Long2Short is the offline, single-chain limit of this MDP (N=1, no shared budget); a numerical comparison would require degenerating our env to N=1, at which point the two methods become equivalent by construction. We therefore frame Long2Short as a special case of our formulation rather than a competing baseline.
ReasoningEconomicsEnv is an OpenEnv-native RL environment where the LLM is both the budget allocator (by choosing how long to think) and the solver (by producing the answer). The environment is a stateless grader and budget accountant. Over a multi-turn episode, the agent learns meta-reasoning: when to think long, when to think cheaply, how to trade correctness against compute under a single shared cap — with no difficulty labels.
To our knowledge, this is the first sequential multi-turn MDP that jointly incentivizes reasoning-trace reduction and answer accuracy under a shared, session-level budget. Every prior family — prompt-guided caps, RL with length rewards including Kimi K1.5 Long2Short, SFT/distillation on compressed traces, dynamic early exit — optimizes compression within a single query. None learn pacing across a sequence of queries.
Per-query budgeting is a local optimization. Shared-budget reasoning is a sequential resource-allocation problem with partial observability over future question difficulty.
| Axis | Prior work (R1 / QwQ / Xu 2025 / Kimi K1.5) | ReasoningEconomicsEnv |
|---|---|---|
| Budget scope | Per-query, isolated | Shared across N questions |
| Difficulty signal | Explicit label or classifier | Inferred from text only |
| Horizon | Single step | Sequential (N steps / episode) |
| Pacing pressure | None | Irrecoverable depletion |
| Training cost | Live API per rollout | Grader-only env (CPU) + local vLLM |
| Decision learned | How short can this answer be? | How should I spend what I have left? |
The failure modes we want to surface are distinctly sequential:
| Failure | What goes wrong |
|---|---|
| Over-invest early | Budget gone before the last (possibly hard) question arrives |
| Over-conserve | Easy questions answered; hard questions starved, cap under-used |
| Fixed pacing | Uniform spend ignores difficulty variance across items |
| Thinking-mode blowup | <think>…</think> runs past max_completion_length; answer truncated, grading returns zero |
| Unit drift | Budget cap and spend tallied in different tokenizers — phantom budget |
Every row in that table is a real failure mode we hit and diagnosed end-to-end (see Engineering Lessons and Training Runs).
The reasoning-economics literature falls into four families. All four optimize per-prompt reasoning length; none expose a shared cross-prompt token budget. ReasoningEconomicsEnv is the missing fifth regime — portfolio allocation under a joint budget.
Inference-time prompting asks the model to self-regulate. No training signal, per-query scope.
| Method | Mechanism | Link |
|---|---|---|
| Token-Budget (Han et al., 2024) | LLM self-estimates a token budget per query and embeds it in the prompt to constrain CoT length; reports ~68% token reduction with minimal accuracy loss | arXiv:2412.18547 |
| Chain-of-Draft (Xu et al., 2025) | Prompts the model to write ≤5 words per reasoning step; matches CoT accuracy at ~7.6% of the tokens | arXiv:2502.18600 |
| CCoT (Renze & Guven, 2024) | Appends "be concise" to CoT prompts; reduces length with a minor accuracy penalty on weaker models | arXiv:2401.05618 |
| Token Complexity (Lee et al., 2025) | Benchmarks compression prompts (word limits, bullets, abbreviations); finds LLMs natively adjust length to difficulty even without sophisticated prompting | arXiv:2503.01141 |
Training-time methods that shape a reward around reasoning length on single-prompt responses.
| Method | Mechanism | Link |
|---|---|---|
| L1 / LCPO (Aggarwal & Welleck, 2025) | GRPO with a length-penalty reward; controls reasoning length via a "Think for N tokens" prompt prefix | arXiv:2503.04697 |
| O1-Pruner (Luo et al., 2025) | PPO with a reward that penalizes token usage relative to a target length; applied to Marco-o1 and QwQ | arXiv:2501.12570 |
| Kimi K1.5 / Long2Short (Moonshot AI, 2025) | Length-conditioned RL distillation of a long-reasoning teacher into a shorter policy. The paper ReasoningEconomicsEnv most directly contrasts with — Long2Short shortens one chain, we allocate across many. | arXiv:2501.12599 |
| DAST (Shu et al., 2025) | SimPO-based preference optimization on constructed short/long preference pairs | arXiv:2503.04472 |
| SelfBudgeter (Li et al., 2025) | Model prepends a self-predicted token budget before reasoning and is trained to respect it | arXiv:2505.11274 |
Supervised fine-tuning on shortened CoT traces. No RL, no budget state.
| Method | Mechanism | Link |
|---|---|---|
| CoT-Valve (Ma et al., 2025) | Single model trained on CoT of varying lengths; inference-time "valve" parameter controls reasoning depth | arXiv:2502.09601 |
| TokenSkip (Xia et al., 2025) | Compresses existing CoT by skipping non-essential tokens, then fine-tunes on the compressed traces | arXiv:2502.12067 |
| Z1 (Zhang et al., 2025) | SFT on compressed-thought data that shortens each reasoning step | arXiv:2504.00810 |
Decoding-time heuristics that terminate a single chain early. The policy has no knowledge of downstream prompts.
| Method | Mechanism | Link |
|---|---|---|
| Dynasor-CoT (Fu et al., 2025) | Probes intermediate answers at fixed intervals; terminates when consecutive answers agree | arXiv:2412.20993 |
| Budget Forcing / s1 (Muennighoff et al., 2025) | Forces end-of-thinking + "Final Answer:" at the max token budget; simple and strong baseline | arXiv:2501.19393 |
| DEER (Yang et al., 2025) | Detects reflection signals (e.g., "Wait,", "Let me check") in the output as dynamic exit points and terminates reasoning if deemed sufficient | arXiv:2504.15895 |
| Component | What we inherit | Link |
|---|---|---|
| OpenEnv | Gym-style reset/step over WebSocket; HF Space deployment; per-session state; concurrent sessions | HF Blog: Introducing OpenEnv |
To our knowledge, this is the first OpenEnv-native RL environment — and the first sequential MDP of any kind — where a single reward function jointly incentivizes reasoning-trace reduction and answer accuracy across a multi-turn, shared-budget episode.

Scope caveat. The novelty is the MDP, reward coupling, and budget accounting. The RL method (GRPO on verifiable math) is shared with the DeepSeekMath / Kimi K1.5 lineages; we reuse those techniques rather than propose new ones.
A stateless grader plus a budget accountant, served over OpenEnv's WebSocket protocol. The LLM is the policy. The reward is verifiable. The MDP is multi-turn. Nothing else is invented.
Each episode samples 10 math questions (configurable, num_questions) from meta-math/MetaMathQA — keyed by type (GSM_SV, MATH_FOBAR, …) and drawn from the first 5 000 rows of the dataset (subset_start_idx=0, subset_size=5000) so every run samples from the same fixed window. The agent receives one question at a time alongside its remaining budget, chooses how long to reason, and emits a single response containing its chain-of-thought and a \boxed{…} final answer. The environment grades the answer against ground truth and returns a reward and the next question — until the episode terminates (10 questions completed or budget exhausted, depending on mode).
Dataset scope today: MetaMathQA only, first 5 000 rows. AI-MO/NuminaMath-TIR is wired into the sampler (NUMINA_PROBLEM_TYPE = "NuminaMath_TIR", numina_subset_size) but kept out of the current training mix; enabling the Numina channel for an even MetaMath + Numina mix is tracked in Future Work.
The agent's action interface is deliberately minimal: raw text / JSON output, no tool-call protocol, no markdown parsing fragility. The LLM outputs a response string; the env parses it. Crucially, the training client (ReasoningEconomicsPT) never imports env Pydantic types — it speaks dict shapes over the wire, matching OpenEnv's client/server contract.
The core contract is two Pydantic types exchanged over the OpenEnv WebSocket:
# Observation (env → agent)
class ReasonBudgetObservation(Observation):
question: str # raw problem text
remaining_budget: int # tokens left in the episode
questions_remaining: int
budget_per_remaining_question: float # pacing signal
accuracy_so_far: float
episode_history: list[HistoryItem] # in-context Q/A memory
done: bool
reward: Optional[float]
metadata: dict # problem_type, total_budget, budget_source,
# budget_mode, min_tokens, max_tokens
# Action (agent → env)
class ReasonBudgetAction(Action):
response: str # thinking trace + \boxed{answer}
metadata: dict # optional tokenizer_name override,
# optional grading_response (visible tail)
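To make the wire contract concrete, here is a minimal sketch of an episode loop from the client's point of view, speaking only dict shapes. The session/reset/step helper names and the exact payload keys are illustrative assumptions that paraphrase the contract above — they are not the actual ReasoningEconomicsPT code; only the observation fields mirror ReasonBudgetObservation.

```python
# Minimal sketch (assumed helper names): a dict-typed episode loop over the
# OpenEnv WebSocket. Observation keys mirror ReasonBudgetObservation; the action
# is just {"response": ..., "metadata": ...}.
def run_episode(session, generate_response):
    obs = session.reset()                                   # plain dict, no Pydantic
    episode_reward = 0.0
    while not obs["done"]:
        prompt = (
            f"{obs['question']}\n"
            f"[remaining budget: {obs['remaining_budget']} tokens, "
            f"{obs['questions_remaining']} questions left]"
        )
        text = generate_response(prompt)                    # policy decides how long to think
        obs = session.step({"response": text, "metadata": {}})
        episode_reward += obs.get("reward") or 0.0
    return episode_reward
```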
The reward has two components, both grounded in the OpenEnv per-step plus terminal-bonus pattern. Both terms are alive at once — the per-step cost penalty rewards shorter traces, the correctness term rewards right answers, and the terminal bonus couples the two multiplicatively so neither can be sacrificed to optimize the other.
Per-step (accumulated every turn):
- correctness: +1.0 if extract_boxed_answer + SymPy equality matches ground truth, else −0.1 — incentivizes answer accuracy (wrong answers carry a small per-step penalty; see Two-repo cheat sheet / compute_reward in env/reward.py).
- efficiency_bonus: reward for cheap correct answers on easy items — incentivizes trace reduction when correctness is preserved.
- cost_penalty: linear in tokens spent this turn — direct trace-length pressure.
- overspend_penalty: active only in soft-budget mode; 0 under hard cap.

Terminal (added to the final step's reward):
r_terminal = λ_ep × episode_accuracy × budget_utilization_score, where budget_utilization_score = max(0, 1 − |spent/total_budget − target_utilization|) rewards finishing close to, but not over, the target utilization. fair_share = total_budget / num_questions is used in both efficiency_bonus and cost_penalty.
Why the product form. Compressing every trace to zero tokens (accuracy = 0) and answering correctly but wasting the budget (utilization bad) are both punished. The only way to maximize r_episode is to spend the budget well and be right.
All runs in this blog use the repo defaults from ReasoningEconomicsEnv/env/config.py and ReasoningEconomicsEnv/env/reward.py — nothing tuned per-run. They are:
| Symbol | Name | Default | Role |
|---|---|---|---|
| β | beta (cost-penalty weight) | 0.05 | Linear per-step token cost: β · max(0, tokens_used / fair_share − 1). Only fires when the step overspends its fair share. |
| γ | gamma (efficiency-bonus weight) | 0.1 | Reward for solving under fair share: γ · (1 − spend_ratio), correct steps only. |
| λ_ep | lambda_ep (terminal weight) | 0.5 | Scales terminal episode_accuracy × budget_utilization_score. Product form prevents unilateral optimization of either factor. |
| — | target_utilization | 0.9 | Utilization peak for budget_utilization_score; rewards finishing close to 90% of the total budget. |
| — | correctness reward | +1.0 / −0.1 | Per-step: +1 on SymPy match, −0.1 on wrong — a small negative signal so trivial "don't answer" policies lose reward. |
| — | soft_overspend_penalty | 0.25 | Active only in soft-budget mode: 0.25 · (overspend_tokens / fair_share). Hard-cap mode zeroes this term. |
| — | budget_ratio | 2.0 | Fallback total-budget multiplier when no total_budget and no tokenizer are passed (budget priority table). |
| — | num_questions / min_tokens / max_tokens / max_tokens_per_step | 10 / 10 / 800 / 2048 | Episode length and per-step token window; min_tokens also sets hard-cap early termination. |
Values are the EnvConfig dataclass defaults and the default kwargs on compute_reward / compute_episode_bonus. They were not swept in this submission; tuning β, γ, and λep jointly against baseline runs is part of Future Work.
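To tie the table to the reward math, here is a compact sketch of the per-step and terminal terms using the defaults above. It paraphrases the prose description of compute_reward / compute_episode_bonus — the actual env/reward.py may clamp or order terms differently, and the exact definition of overspend in soft-budget mode is our reading of the table.

```python
# Sketch of the Scoring math with the table's default weights (not env/reward.py itself).
def step_reward(correct: bool, tokens_used: int, fair_share: float,
                soft_budget: bool = False,
                beta: float = 0.05, gamma: float = 0.1,
                soft_overspend_penalty: float = 0.25) -> float:
    spend_ratio = tokens_used / fair_share
    r = 1.0 if correct else -0.1                       # correctness: +1 / −0.1
    if correct:
        r += gamma * (1.0 - spend_ratio)               # efficiency bonus, correct steps only
    r -= beta * max(0.0, spend_ratio - 1.0)            # cost penalty past fair share
    if soft_budget:
        r -= soft_overspend_penalty * max(0.0, spend_ratio - 1.0)  # soft-budget overspend
    return r

def episode_bonus(episode_accuracy: float, spent: int, total_budget: int,
                  lambda_ep: float = 0.5, target_utilization: float = 0.9) -> float:
    utilization = max(0.0, 1.0 - abs(spent / total_budget - target_utilization))
    return lambda_ep * episode_accuracy * utilization  # product coupling: both must be good
```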
| Mode | Behavior | Use |
|---|---|---|
| Hard-cap (default) | Per-step spend clipped to remaining_budget; episode terminates early when remaining_budget < min_tokens | Final evaluation, competition scoring |
| Soft-budget | No clipping, no early termination; overspend smoothly penalized | Training curriculum — lets the policy experience the whole episode before discipline is enforced |
Dual modes are not a convenience. Hard-cap's early termination produces zero-advantage groups in GRPO: uniform truncation across all generations → std(r)=0 → zero gradient. Soft-budget bridges that window until the policy learns to finish. This is the same pathology we diagnose in full under Engineering Lessons (Pathology 2).
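A minimal sketch of the two modes' accounting, assuming the behavior described above (per-step clipping and min_tokens early termination under hard cap; neither under soft budget). Function and variable names are ours, not the env's.

```python
# Sketch of hard-cap vs soft-budget accounting (names are illustrative).
def account_step(tokens_requested: int, remaining_budget: int,
                 min_tokens: int = 10, hard_cap: bool = True):
    if hard_cap:
        spend = min(tokens_requested, remaining_budget)    # clip to what's left
        remaining = remaining_budget - spend
        done_early = remaining < min_tokens                # episode terminates early
        return spend, remaining, done_early
    # Soft budget: no clipping, no early termination; overspend is handled in the reward.
    return tokens_requested, remaining_budget - tokens_requested, False
```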
A subtle bug we surfaced and fixed: per-step spend was counted with a live AutoTokenizer, but the episode cap (total_budget) was computed from an abstract config formula in an entirely different unit system. Caps and spends did not share units. The environment now resolves total_budget at reset() in priority order:
| Priority | Condition | Formula | budget_source |
|---|---|---|---|
| 1 | Client passes total_budget | Exact integer | "client" |
| 2 | Client passes tokenizer_name | budget_ratio × Σ tokenize(q_i) over all questions | "tokenizer_native" |
| 2b | Tokenizer load fails | Config formula + warning | "config" |
| 3 | Neither passed | budget_ratio × N × (min_tokens + max_tokens) / 2 | "config" |
Observation metadata returns total_budget and budget_source so the client can verify the path taken. Cap and spend now live in the same policy-token unit system. The fix is exactly the tokenizer-mismatch mitigation described in cross-chat handoff Issue 2b: aligning the env's AutoTokenizer id with the policy tokenizer via --env_tokenizer_name (or the Hub id rewritten into REPT_MODEL_HUB_ID when the checkpoint is a local path).
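The priority table translates roughly into the following resolution logic at reset(). This is a sketch of the behavior described above, with our own function name and the config defaults; it is not the env's source.

```python
from transformers import AutoTokenizer

def resolve_total_budget(questions, total_budget=None, tokenizer_name=None,
                         budget_ratio=2.0, min_tokens=10, max_tokens=800):
    # Priority 1: client-provided exact cap.
    if total_budget is not None:
        return int(total_budget), "client"
    # Priority 2: size the cap in the policy's own token units.
    if tokenizer_name is not None:
        try:
            tok = AutoTokenizer.from_pretrained(tokenizer_name)
            cap = budget_ratio * sum(len(tok.encode(q)) for q in questions)
            return int(cap), "tokenizer_native"
        except Exception:
            pass  # Priority 2b: tokenizer load failed — fall back to the config formula.
    # Priority 3: abstract config formula (the unit-drift risk discussed above).
    cap = budget_ratio * len(questions) * (min_tokens + max_tokens) / 2
    return int(cap), "config"
```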
- Standard OpenEnv surface: EnvClient, Environment, Pydantic Observation / Action. No invented abstractions.
- SUPPORTS_CONCURRENT_SESSIONS = True; max_concurrent_envs = 64 — built to be hammered by DDP ranks.
- All extras (total_budget, budget_source, budget_mode, problem_type, per-step tokenizer_name, grading_response) ride on Observation.metadata and Action.metadata. No new method signatures.
- Grading is extract_boxed_answer + SymPy. Per-step reward numerically bounded; episode bonus is a clamped product. No LLM judge, no circularity.

The reward has two layers. Per-step reward fires every turn and sums into an episode total. The terminal bonus is added only at the final step and couples accuracy with budget utilization through a product. An optional scalar --alpha multiplies the per-step reward before episode accumulation (raw_step_reward * alpha inside EpisodeSession); beta is reserved for future shaping and currently has no effect.
| Component | When | What it rewards |
|---|---|---|
| Correctness | per step | Boxed answer matches ground truth under SymPy equality |
| Efficiency bonus | per step | Right answer on an easy item with few tokens |
| Cost penalty | per step | Linear in tokens spent this turn (full decoded response, not just visible tail) |
| Overspend penalty | per step (soft-budget only) | Smooth penalty for going over target utilization |
| Terminal bonus | last step only | λ_ep × (episode_accuracy × budget_utilization_score) |
Why multiplicative coupling. Sum-of-components rewards reward hacking: the policy can drop accuracy to zero as long as it aces utilization, or vice versa. The product kills both shortcuts: if either factor is zero, the terminal bonus is zero, regardless of how well the other is optimized. The agent has to be right and pace itself — which is the entire learning problem.
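A two-line toy comparison makes the point: under an additive bonus a utilization-only policy still collects reward, while the product zeroes it out.

```python
# Toy numbers: a degenerate policy with zero accuracy but perfect utilization.
acc, util, lambda_ep = 0.0, 1.0, 0.5
additive_bonus = lambda_ep * (acc + util) / 2   # 0.25 — the shortcut still pays
product_bonus = lambda_ep * acc * util          # 0.0  — the shortcut is worthless
```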
The project is two strictly separated packages: ReasoningEconomicsEnv (the OpenEnv environment) and ReasoningEconomicsPT (the GRPO training client). They communicate exclusively over WebSocket — no in-process imports. The PT repo subclasses EnvClient as ReasonBudgetClient with plain dict actions and observations, so it never touches env Pydantic types.
Every code reference in this blog lives in exactly one of the two repos below. When a symbol is mentioned in later sections (Engineering Lessons, Stack Split, Quick Start), this table is the canonical resolution.
| Symbol / file | Repo | Path | Role |
|---|---|---|---|
| ReasonBudgetEnvironment | Env | env/reason_budget_env.py | FastAPI + OpenEnv environment served over WebSocket; one instance per session. |
| ReasonBudgetObservation / ReasonBudgetAction | Env | env/models.py | Pydantic wire types; PT never imports these, only dict shapes. |
| EnvConfig | Env | env/config.py | Episode + budget defaults; overridden by REASON_BUDGET_* env vars. |
| compute_reward / compute_episode_bonus | Env | env/reward.py | Per-step and terminal reward math (Scoring). |
| EpisodeSampler / dataset loaders | Env | env/episode_sampler.py, data/loaders.py | MetaMathQA window (subset_size=5000); Numina wired but disabled. |
| start_openenv_server.sh | Env | scripts/start_openenv_server.sh | Spawns the FastAPI WebSocket on 127.0.0.1:8000. |
| ReasonBudgetClient / EpisodeSession | PT | clients/reason_budget_client.py | Dict-typed EnvClient subclass; context manager holds one WebSocket for the full episode. |
| rollout_func | PT | training/rollout.py | TRL 1.0 multi-turn rollout driver; emits env_reward / env_mask. |
| resolve_env_tokenizer_name | PT | training/tokenizer_sync.py | Aligns env tokenizer with policy tokenizer via REPT_MODEL_HUB_ID. |
| _sync_fsdp2_params_to_vllm | PT | training/weight_sync.py | Weight-sync path to trl vllm-serve under Branch A (FSDP2). |
| REPT_* env vars | PT | scripts/run_grpo_lambda.sh | REPT_MODEL, REPT_NUM_GPUS, REPT_VLLM_MODE, REPT_VLLM_TP, REPT_VLLM_PORT, REPT_MODEL_HUB_ID. |
| REASON_BUDGET_* env vars | Env | env/config.py | REASON_BUDGET_NUM_QUESTIONS, REASON_BUDGET_HARD_CAP_MODE, REASON_BUDGET_BUDGET_RATIO, REASON_BUDGET_TOKENIZER_NAME. |
| reward_logs.jsonl | PT writes, mirrors Env schema | runs/<run_id>/reward_logs.jsonl | Per-step reward audit (step_index, remaining_budget_before, visible_response, raw_step_reward, scaled_step_reward, done_after_step, episode_reward). |
Shorthand used below: Env-side = sharma-yash01/ReasoningEconomicsEnv; PT-side = sharma-yash01/ReasoningEconomicsPT.
flowchart LR
Policy["Policy GPUs 0-5, ZeRO-3, CPU optimizer offload"]
VLLM["vLLM server GPUs 6-7, tensor parallel 2"]
Env["OpenEnv FastAPI WebSocket localhost 8000"]
RewardLog["reward_logs.jsonl"]
Policy -->|rollout_func| VLLM
VLLM -->|generations| Policy
Policy -->|reset, step| Env
Env -->|Observation, reward, done| Policy
Env --> RewardLog
Figure 1. 8×A100 production topology for the headline Qwen3-14B run (Branch B). GPUs 0–5 run the DeepSpeed ZeRO-3 trainer with CPU optimizer offload and Unsloth-integrated LoRA; GPUs 6–7 run trl vllm-serve with tensor_parallel_size=2; the OpenEnv server runs on a separate process listening on ws://127.0.0.1:8000. The Qwen2.5-3B tranche (3B results) runs on Branch A instead — FSDP2 sharding, full fine-tune, no LoRA. Configurable via env vars REASON_BUDGET_NUM_QUESTIONS, REASON_BUDGET_HARD_CAP_MODE, REASON_BUDGET_BUDGET_RATIO, REASON_BUDGET_TOKENIZER_NAME; reproducible launch via start_openenv_server.sh.
Training uses GRPO (Group Relative Policy Optimization) via TRL 1.0.0's rollout_func contract, which gives us explicit control over the generate → parse → step loop. We chose rollout_func over environment_factory specifically to avoid TRL 1.0.0's Qwen3-only add_response_schema allowlist — that path is biased toward Qwen3/Qwen3.5 chat-template parsing, while we want to run Qwen2.5 and other families too.
The critical invariant is one WebSocket per episode. _rollout_one_episode runs inside with EpisodeSession(...) as session:, so reset and every step share the same socket. Per turn: trainer.vllm_generation.generate() → decode → session.apply_response(text, …) → remote step({"response": …}). The function returns prompt_ids, completion_ids, logprobs, env_mask, and env_reward; the reward hook reward_from_env(…, **kwargs) simply reads kwargs["env_reward"]. The env's tokenizer id on reset comes from resolve_env_tokenizer_name: --env_tokenizer_name if set, otherwise the tokenizer's name_or_path, falling back to --model. When the checkpoint is a local NFS path, the launcher saves the Hub id into REPT_MODEL_HUB_ID so the remote env receives an HF-resolvable id.
Hybrid-thinking models need per-family wiring. training/model_profiles.json + training/model_profiles.py provide a ModelProfileRegistry keyed on model id (exact match first, then longest-prefix), supplying chat_template_kwargs (e.g. enable_thinking), output_parser (qwen3_think or null), think-tag delimiters, and grading_use_visible_only. Env-side invariant: budget always counts _count_tokens(action.response) on the full string, while grading uses metadata["grading_response"] (visible tail) when non-empty. Budget stays honest; grading stays robust to think traces.
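A sketch of the lookup order described above (exact id first, then longest prefix). The profile contents here are illustrative examples; the real training/model_profiles.json fields and entries may differ.

```python
# Sketch of ModelProfileRegistry resolution: exact match, then longest prefix.
PROFILES = {
    "Qwen/Qwen3": {"chat_template_kwargs": {"enable_thinking": True},
                   "output_parser": "qwen3_think", "grading_use_visible_only": True},
    "Qwen/Qwen2.5": {"chat_template_kwargs": {}, "output_parser": None,
                     "grading_use_visible_only": False},
}
DEFAULT = {"chat_template_kwargs": {}, "output_parser": None,
           "grading_use_visible_only": False}

def resolve_profile(model_id: str) -> dict:
    if model_id in PROFILES:                                  # exact id match first
        return PROFILES[model_id]
    matches = [p for p in PROFILES if model_id.startswith(p)]
    return PROFILES[max(matches, key=len)] if matches else DEFAULT  # longest prefix, else default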
NCCL padding. In vllm_mode=server with world_size > 1, every trainer.vllm_generation.generate() runs accelerate.gather_object → NCCL all_gather_object → broadcast_object_list. Our rollout is a while not session.done loop, so different ranks make different numbers of generate() calls per episode — permanent collective desync. Fix: a fixed DIST_SERVER_GENERATES_PER_EPISODE = 8 cap, with dummy 1-token generates padding each episode to exactly 8 calls. Dummies are discarded; env_reward, completion_ids, and logprobs are byte-identical to the unpadded case. This is active only in server mode with DDP; colocate + TP=1 does not enter the gather_object path.
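A sketch of the padding idea, with assumed helper names (session, generate, build_prompt): every rank issues exactly eight generate() calls per episode, so the gather/broadcast collectives stay in lockstep even when episodes end at different turns on different ranks.

```python
DIST_SERVER_GENERATES_PER_EPISODE = 8  # fixed per-episode call count

def rollout_with_padding(session, generate, build_prompt):
    calls = 0
    obs = session.reset()
    while not obs["done"] and calls < DIST_SERVER_GENERATES_PER_EPISODE:
        text = generate(build_prompt(obs))                    # real turn
        calls += 1
        obs = session.step({"response": text, "metadata": {}})
    while calls < DIST_SERVER_GENERATES_PER_EPISODE:
        generate("pad")                                       # 1-token dummy; output discarded
        calls += 1
    return obs
```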
We observed three named, reproducible pathologies when shipping GRPO + OpenEnv + vLLM on hybrid-thinking models under a shared-budget MDP: NCCL desync under variable-length rollouts in server mode; truncation-induced zero-advantage collapse when every completion in a GRPO group hits the same clip boundary; and stack non-composition across TRL + vLLM + Unsloth + FSDP. Before the headline 14B run, the dominant learning-signal failure was truncation collapse — WebSocket, padding, and tokenizer alignment could all be green while the policy still received a structurally zero gradient.
Full telemetry tables, evidence links, a log-backed truncation episode, root-cause chains, and structural fixes are documented once in Engineering Lessons (Pathology 1, Pathology 2 including an expandable log excerpt, Pathology 3, then Takeaways). This section stays short so the blog does not narrate the same three failures twice.
Runs are organized by what they contribute: (1) the headline Qwen3-14B 8×A100 completed run, our strongest positive-mean-reward evidence; (2) the Qwen2.5-3B tranche on 1×H100 that validates the pipeline end-to-end; (3) boundary / failure runs that delimit the tractable region of the TRL 1.0 / vLLM stack.
Run 14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe (see Engineering Pathology 3 for topology). 20/20 optimizer steps completed with artifacts saved. First completed multi-question shared-budget GRPO training run on 14B.
| Metric | Value |
|---|---|
| Episodes | 480 |
| Env turns | 1920 |
| Mean episode reward | +0.4692 ± 0.9758 |
| Min / Max episode reward | −0.40 / +4.5205 |
| Accuracy (per-question) | 17.76% |
| Cap-hit rate | 13.54% |
| env_step_error | 0 |
Why true-4q (four questions per episode), not the designed ten. The headline run sets num_questions=4, not 10, because Qwen3-14B + Unsloth LoRA + vLLM tensor-parallel inference + multi-turn thinking-mode rollouts saturated 40 GB A100s at four questions per episode with two generations per GRPO group. Pushing to ten questions per episode triggered OOM in the rollout cache on the same hardware. That reduction is a VRAM ceiling, not a claim that the full MDP is solved; the designed 10-question battery remains the target for future hardware or a smaller policy (see Pathology 3 for why LoRA + ZeRO-3 + CPU offload entered the recipe).
Three things this run shows. (1) The multiplicatively-coupled per-step + terminal reward designed in Scoring can produce positive mean under a shared budget — previously the headline 3B runs were all negative-mean. (2) The max episode reward of +4.52 means at least one episode hit both high accuracy and good utilization; the product is only large when both factors are. (3) env_step_error = 0 over 1920 turns means the OpenEnv WebSocket invariant, tokenizer alignment, and DeepSpeed ZeRO-3 / vLLM TP=2 split all held under a full training run, not just a smoke.
We have not yet run baselines against the headline 14B checkpoint. The per-question accuracy (17.76%) is therefore uncalibrated. It should be read only against the designed performance targets for each baseline in Baselines (planned), reproduced here for the reader who doesn't want to scroll:
| Baseline (planned, not run) | Designed performance target | What a converged RL policy should beat |
|---|---|---|
| uniform-random-split | Mean reward ≈ 0, accuracy distributed by dataset-difficulty mean; utilization unshaped | Lower bound — any structured policy should clear this |
| greedy-first | Accuracy capped at ≤ 1/N (25% for 4q, 10% for 10q) because later questions are starved; utilization poor on the depleted tail | Our policy must show cross-prompt pacing, not just solve Q1 |
| always-same-budget | Accuracy approaches the dataset-difficulty mean at the given total_budget / N; zero utilization shaping because allocation is oblivious to difficulty | Our policy's marginal value comes from difficulty-aware allocation, not just from having a budget at all |
| zero-shot LLM (same Qwen3-14B, no RL) | Measurable ceiling achievable without RL; expected to be strong on per-question accuracy but poor on budget_utilization_score because the base model does not pace | RL contribution = product gain (accuracy × utilization) the base model cannot reach |
17.76% per-question accuracy over the true-4q battery is meaningful only once the zero-shot LLM row above is filled in. Until then it is a raw rate, not a claim about RL uplift. Baseline evaluation against the released checkpoint is the next result in Future Work.
Validates the full pipeline end-to-end. Reward mean improves monotonically as max_completion_length and max_tokens_per_step grow — the interesting signal is the std and the positive tail, not the centered mean.
| Run | Model | Episodes | Key settings | Mean (max) |
|---|---|---|---|---|
| qwen25_3b_best_single_h100 | Qwen2.5-3B-Instruct | 87 | batch=2, gens=2, lr=7e-7, mcl=640, turns=4 | −0.4598 (+1.29) |
| qwen25_3b_better | Qwen2.5-3B-Instruct | 285 | 2 epochs, same shape | −0.3156 (+1.33) |
| qwen25_3b_more_context | Qwen2.5-3B-Instruct | 368 | 2 epochs, mcl=1024, tokens/step=512 | −0.2279 (+2.34) |
| qwen25_3b_best_v2 | Qwen2.5-3B-Instruct | 687 | 4 epochs, mcl=1024, tokens/step=1024, lr=5e-7 | −0.1476 (+2.29) |
Plumbing smokes on Qwen2.5-0.5B confirm the trainer path (loss 0.00889 / 0.006714; wall-clock 1246 s / 1279 s). These are infrastructure validation, not research signal.
Where today's stack breaks. Each row is a training-stack limit, not an env or reward-shape limit.
| Attempt | Setup | Outcome | Blocker |
|---|---|---|---|
| Unsloth 14B, 1×A100-40GB | vllm_mode=colocate, max_model_len=3500, gens=2 | 1/120 steps; grad_norm=NaN; selective_log_softmax RuntimeError | Unsloth truncation × thinking-mode completions (Pathology 2) |
| qwen25_7b_4q_ultralow | 1×H100 colocate, mcl=256 | Repeated CUDA OOM | 7B thinking context doesn't fit colocate headroom |
| Qwen3-8B sharded | 8×A100 FSDP + vLLM server | No stable reward summary | TRL/FSDP weight-sync (Pathology 3) |
| Qwen3-30B-A3B-Instruct-2507 | 8×A100 FSDP + server-mode vLLM | No completed summary | vLLM KV-cache / communicator startup |
| Qwen2.5-32B-Instruct | 8×A100 FSDP + server-mode vLLM | No completed summary | Prompt length + vLLM startup + NCCL init |
| Early Qwen2.5-14B smoke | 8×A100 FSDP + server | 2 episodes (reward −0.07, −0.30) | Superseded by the headline ZeRO-3 run |
No baselines have been computed against the headline 14B checkpoint yet. We planned four, in the spirit of LotteryElicitationEnv's baseline set, each isolating one axis of the allocation decision.
| Baseline | Policy | What it isolates |
|---|---|---|
| always-same-budget | Allocate total_budget / N per question; greedy decode within each per-question cap | Difficulty-awareness gain (is it just the total budget that matters, or how it's split?) |
| greedy-first | Spend up to cap on Q1, Q2, … until budget runs out; truncate remainder | Pacing cost (what's lost by no-foresight allocation?) |
| uniform-random-split | Dirichlet-sampled per-question allocations at the same total budget | Lower bound — beating it proves the policy learned anything structured |
| zero-shot LLM | Same base model (Qwen3-14B) with no RL, greedy decode against the same battery at the same budget | RL contribution — measurable upper bound achievable without learning |
Baseline performance targets. By construction we expect uniform-random-split → mean reward ≈ 0 (difficult to ace utilization by accident); greedy-first → accuracy capped at ≤ 1/N because later questions are starved; always-same-budget → mean reward approaches the env-difficulty mean with zero utilization shaping; zero-shot LLM is the measurable ceiling without RL shaping. A converged RL policy should beat all four on the product (accuracy × utilization), not any single factor.
We have: a completed 14B training run with positive mean reward (+0.4692), a validated 3B tranche end-to-end, a mapped set of stack boundaries, and three named pathologies (NCCL desync, truncation collapse, stack non-composition) with reproducible signatures and structural fixes. We do not yet have: baselines evaluated against the headline checkpoint, or cross-family model comparisons. Those are Future Work.
Shipping a real GRPO + OpenEnv + vLLM pipeline on a multi-turn verifiable-reward environment surfaced three major pathologies. Each one is a named learning-signal failure with a reproducible signature and a structural fix. We document them so the next OpenEnv submission can avoid the same dead ends.
Signature. all_gather_object / broadcast_object_list hang on the second optimizer step; heartbeat timeout at ~7200 s; all ranks report env_step_error=0 and ConnectionClosedError=0 on the rollout but block at the post-rollout barrier. NCCL flight recorder shows mismatched last enqueued / last completed sequence numbers across ranks — structural, not configurational.
Root cause. In vllm_mode=server with world_size > 1, every trainer.vllm_generation.generate() runs accelerate.gather_object → NCCL all_gather_object → broadcast_object_list. Our rollout is a while not session.done loop — different ranks make different numbers of generate() calls per episode. NCCL is sequence-numbered; different call counts per rank → permanent desync → eventual _pickle.UnpicklingError or a BROADCAST timeout. A secondary trigger: stale __pycache__ on NFS kept loading the pre-fix .pyc, reintroducing the desync after git pull reported the repo clean.
Evidence. Full write-ups in impl-context/dist-train-desync-issue.md and impl-context/dist-train-issue-hung-gpu.md.
Fix stack.
find $REPT_ROOT -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null
find $REPT_VENV -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null
PYTHONDONTWRITEBYTECODE=1 torchrun --nproc_per_node=7 ...
- generate() padding per episode. Every rank performs exactly DIST_SERVER_GENERATES_PER_EPISODE = 8 calls — real generates for active turns, 1-token dummies (discarded) for the rest. Active only when vllm_mode == "server" and world_size > 1. Reward, logprobs, and credit assignment are byte-identical to the unpadded case.
- Validated by the completed run 14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe.

Why this matters beyond our repo. Any TRL rollout_func user running variable-length rollouts in server mode has this bug latent. Heartbeat and batch tuning only hide it longer.
Signature. From the most recent Lambda run (terminal 36, Qwen3-4B + Unsloth, step 1/120):
Unsloth: Input IDs of shape torch.Size([2, 12986]) with length 12986
> the model's max sequence length of 3500.
We shall truncate it ourselves.
RuntimeError: Size does not match at dimension 1
expected index [2, 12760, 1] to be no larger than
self [2, 3499, 151936] apart from dimension 2 # in selective_log_softmax
Upstream telemetry in the same step: completions/clipped_ratio = 1.0, mean_terminated_length = 0, frac_reward_zero_std = 0.375, importance_sampling_ratio/mean = 0.072, grad_norm = NaN. Every completion is clipped, zero episodes terminate with </think> + \boxed{}, and 37% of GRPO groups have reward_std = 0 → advantages collapse to a near-constant → gradient is noise → NaN.
Root cause. Unsloth silently patches the policy tokenizer to max_seq_length = 3500 even when vLLM is configured at 32768. GRPO computes completion logits against the truncated 3499-token sequence, then tries to gather at the untruncated 12760 completion indices — the shapes disagree and the forward breaks. More broadly: when truncation is uniform across the GRPO group, every completion lands at the same wrong answer → std(r) = 0 → A_i = 0 → zero gradient. The policy also learns to emit filler (repeated !) because that is what survives the hard cap with the least cost penalty.
Fix. Two changes together, not either alone. (1) Keep Unsloth, but only inside the DeepSpeed ZeRO-3 branch we document in Pathology 3 — in our runs, ZeRO-3's all-gather window empirically aligned with Unsloth's forward where FSDP's per-module summon_full_params did not, removing the selective_log_softmax shape mismatch we hit on FSDP. (2) Independently, enforce vllm_max_model_length = tokenizer.model_max_length = actual episode budget at startup, validated via an assertion, and raise max_completion_length to give the closing </think> and \boxed{} room. Start in soft-budget mode; warm into hard-cap. Increase num_generations to break group-level uniformity. LR tuning is the wrong instinct — the gradient is structurally zero, not noisy. Branch A (FSDP, full fine-tune) sidesteps the Unsloth composition entirely and does not hit this pathology.
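Fix (2) can be captured as a startup assertion. The function below is our own sketch of the invariant, not code from the repo; it checks that neither the tokenizer nor vLLM caps sequences below the episode budget (the text above sets them exactly equal).

```python
def assert_length_alignment(tokenizer, vllm_max_model_len: int, episode_budget: int) -> None:
    # If these disagree, GRPO gathers completion logits at indices the truncated
    # forward pass never produced — the Pathology 2 shape mismatch.
    assert tokenizer.model_max_length >= episode_budget, (
        f"policy tokenizer caps sequences at {tokenizer.model_max_length} "
        f"< episode budget {episode_budget}")
    assert vllm_max_model_len >= episode_budget, (
        f"vLLM max_model_len {vllm_max_model_len} < episode budget {episode_budget}")
```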
Why this matters beyond our repo. Any GRPO run on a hybrid-thinking model under a budget-constrained MDP with a partially-right-but-cheap shortcut has this bug latent. The clipped_ratio = 1 + reward_std ≈ 0 pair is the fingerprint.
| Signal | Observed value | What it means |
|---|---|---|
| reward | ≈ −0.1 to −0.47 (flat) | Uniform negative reward across GRPO groups |
| completions/clipped_ratio | 1.0 | Every completion hits max_new_tokens |
| completions/mean_terminated_length | 0 | Nothing terminates naturally — no </think>, no \boxed{} |
| frac_reward_zero_std | 0.375 – 1.0 | Partial to full GRPO group collapse |
| importance_sampling_ratio/mean | ≈ 0.07 | IS ratio collapsing under policy drift |
| entropy, grad_norm | low / NaN | Near-zero (or explicitly broken) gradient signal |
Numbers above are drawn from the most recent Lambda train.log (Qwen3-4B, single A100-40GB, Unsloth LoRA, step 1/120) and reconciled against earlier 2×H100 Qwen3-4B trainer metrics. The headline Qwen3-14B 8×A100 run in Results resolved this signature — it is what a stack that doesn't truncate looks like.
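The fingerprint is cheap to check programmatically from logged trainer metrics. The helper below is a sketch using the metric keys from the table; the thresholds are the values we observed, not universal constants.

```python
def truncation_collapse_fingerprint(metrics: dict) -> bool:
    # Pathology 2 signature: everything clipped, nothing terminates, groups collapse.
    return (
        metrics.get("completions/clipped_ratio", 0.0) >= 0.99
        and metrics.get("completions/mean_terminated_length", 1.0) == 0
        and metrics.get("frac_reward_zero_std", 0.0) >= 0.3
    )
```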
Training emits a fixed reward_logs.jsonl schema on every step (step_index, question, remaining_budget_before, visible_response, raw_step_reward, scaled_step_reward, done_after_step, episode_reward). The sanitized excerpt below — from the single-GPU Unsloth + Qwen3-4B stack — shows episode-level telemetry when truncation collapse is in full effect.
- Step with remaining_budget_before = 1256; question (excerpt): "…(a,b). The answer is 14. What is the value of unknown variable X?"; visible_response: "!!!!!!!!!!!! … (truncated) !!!!!!!!!!!!"; reasoning_trace: "" (empty); raw_step_reward: −0.11; done_after_step: false. compute_reward breakdown: correctness = −0.1 (wrong answer) plus a small cost_penalty from mild overspend past fair_share (≈ −0.01); efficiency_bonus = 0 because the step is incorrect.
- Later step with remaining_budget_before = 451.0; visible_response: same repeated-! pattern; was_correct = false; raw_step_reward: −0.1. Breakdown: correctness = −0.1 alone; cost_penalty ≈ 0 (no meaningful overspend past fair_share on this step).
- Episode end: final_observation.accuracy_so_far: 0.0; episode_reward: −1.0174. Breakdown: correctness = −0.1 per step (≈ −1.0 summed), plus small per-step cost_penalty terms where spend_ratio > 1; the terminal term λ_ep · episode_accuracy · budget_utilization_score is zero because episode_accuracy = 0. The logged −1.0174 is therefore dominated by wrong-answer penalties, not "ten pure cost penalties." Trainer-side telemetry at the same step: completions/clipped_ratio = 1.0, mean_terminated_length = 0, frac_reward_zero_std = 0.375, grad_norm = NaN. Step 2 never runs.
Caption: truncation-collapsed Unsloth episode on 1×A100, episode_reward = −1.0174. Grader behaves correctly; the failure is upstream in the stack.
- max_completion_length = 4096 default + Qwen3 thinking → the <think>…</think> span alone consumes 1500–4000 tokens.
- The completion is cut off before the closing </think> → no \boxed{…} answer.
- With grading_use_visible_only = True, the empty visible tail grades as incorrect; in the latest Lambda run the visible tail is a degenerate string of ! characters (no \boxed{}, reasoning_trace empty).
- Uniform truncation across the GRPO group → std(r) = 0 → A_i = 0 → zero gradient.
- The importance-sampling ratio collapses (mean ≈ 0.07, min = 0) — noise on top of zero signal.

The most recent Lambda run (2026-04-14) adds a second failure mode on top: with Unsloth enabled and max_model_len = 3500, the accumulated prompt + completion reached 12 986 tokens. Unsloth warned "We shall truncate it ourselves" and the trainer then crashed inside _get_per_token_logps_and_entropies → selective_log_softmax with RuntimeError: Size does not match at dimension 1 expected index [2, 12760, 1] to be no larger than self [2, 3499, 151936] apart from dimension 2. Truncated indices and full-length logits disagreed, the forward broke, and grad_norm became NaN. The fix is the same as the root truncation chain above: raise max_completion_length (and the vLLM max_model_len) to match what thinking-mode actually emits, or disable thinking mode outright. LR tuning cannot fix this — the gradient is structurally zero (or undefined).
This failure mode is general, not specific to ReasoningEconomicsEnv. Any GRPO run on a hybrid-thinking model under a budget-constrained multi-turn MDP with a partially-right-but-cheap shortcut has this bug latent. We believe every future sequential-budget RL run on reasoning models needs to start from this diagnosis.
Signature. Unsloth for GRPO is not yet implemented! Just ignore this function. (terminal 36); FSDP2 shard-gather blocking rollout_func; vLLM colocate stealing VRAM from the policy on 40 GB A100s; repeated CUDA OOM at 7B on single H100 (qwen25_7b_4q_ultralow); Qwen3-30B-A3B and Qwen2.5-32B failures at vLLM KV-cache / communicator startup.
Root cause. TRL 1.0.0 + vLLM (colocate) + Unsloth + FSDP do not compose as a four-way intersection. Each pairwise composition has a known sharp edge (TRL PR #3582 FSDP weight-sync, Unsloth's FastLanguageModel.get_peft_model × FSDP, FSDP1 _is_root assertion under TRL's summon_full_params per child module, GuidedDecodingParams movement across vLLM versions). Specifically, Unsloth + FSDP does not work in this stack — FSDP's parameter sharding and Unsloth's fused kernels disagree on tensor shapes during the GRPO log-prob forward pass.
Resolution — two branches, chosen by model scale and LoRA need.
| Branch | Sharding / optimizer | LoRA | Where it's used | What it unlocks |
|---|---|---|---|---|
| Branch A | FSDP2 (or FSDP1) via model-sharding-fsdp2.yaml / model-sharding.yaml, no CPU optimizer offload | None — full fine-tune | Multi-GPU server-mode vLLM; Qwen2.5-3B / Qwen3-4B tranches | Cleanest weight-sync to trl vllm-serve (_sync_fsdp2_params_to_vllm); no Unsloth interaction risk. |
| Branch B | DeepSpeed ZeRO-3 (stage 3 parameter + optimizer sharding) with CPU optimizer offload | Unsloth-integrated LoRA (4-bit QLoRA, Unsloth fused kernels) | Headline Qwen3-14B 8×A100 run (14b_a100x8_true4q_cap128_answerfirst_eager_zero3cpu_steps20_retry_envsafe) | Our finding: this is the only configuration we found in which TRL 1.0 + GRPO + Unsloth completes end-to-end at 14B under 40 GB A100s with trainable adapters. ZeRO-3's all-gather window empirically aligned with Unsloth's forward in our runs, where FSDP2's per-module summon_full_params did not. Neither pairing has an upstream-blessed recipe; we report it as an engineering contribution, not a library guarantee. |
Shared infra across both branches.
- Server-mode vLLM: trl vllm-serve pinned to REPT_VLLM_TP GPUs (e.g. GPUs 6–7 with TP=2 on the 14B headline); the policy trains on the remaining ranks. No colocate stealing VRAM from the policy.
- OpenEnv server: launched by start_openenv_server.sh, WebSocket on 127.0.0.1:8000, configurable via REASON_BUDGET_NUM_QUESTIONS, REASON_BUDGET_HARD_CAP_MODE, REASON_BUDGET_BUDGET_RATIO, REASON_BUDGET_TOKENIZER_NAME. Same env binary across every run; only env vars change.
- Ports: trl vllm-serve on 8001 (REPT_VLLM_PORT), NCCL rendezvous on 51216, Accelerate on 29500.
- Pinned dependencies: trl==1.0.0, vllm==0.10.2, transformers>=5.2,<5.4, torch==2.8.* (requirements.lambda.txt / requirements.carc-cu121.txt). The branches differ only in sharding recipe and LoRA presence.
Per-WebSocket invariant (prerequisite for both branches). The pathologies above rest on one OpenEnv invariant: one Environment instance per WebSocket session. EpisodeSession is a context manager held for the full multi-turn episode. Violating it — e.g. opening client.sync() inside reset/step — silently collapses reward to zero (episodes come back with num_steps=1, done_after_step=true, empty final_observation). Tokenizer id alignment between env and policy, via resolve_env_tokenizer_name and REPT_MODEL_HUB_ID, is the second half of that invariant.
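The invariant in code form — a sketch assuming the EpisodeSession constructor and policy callable shown here; the import path is inferred from the two-repo cheat sheet and the exact arguments in clients/reason_budget_client.py may differ.

```python
from reasoning_economics_pt.clients.reason_budget_client import EpisodeSession  # path inferred

# Correct: one WebSocket held open for the entire multi-turn episode.
def run_one_episode(policy, env_base_url: str) -> float:
    total = 0.0
    with EpisodeSession(env_base_url) as session:        # context manager = one socket
        obs = session.reset()
        while not obs["done"]:
            obs = session.step({"response": policy(obs), "metadata": {}})
            total += obs.get("reward") or 0.0
    return total

# Wrong: constructing a fresh client inside reset/step — the two calls land on
# different server-side Environment instances and the episode silently collapses
# to num_steps=1 with an empty final_observation.
```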
The engineering story of this submission reduces to three lessons. Each maps to one of the three pathologies above; together they cover every material failure we encountered, and every subsequent design decision in Environment Design, Scoring, and Architecture follows from one of them.
- Pathology 1: in server-mode vLLM with DDP, variable-length rollouts desynchronize NCCL collectives. Pad every rank to a fixed number of generate() calls per episode and kill stale __pycache__ — heartbeat and batch tuning only hide the hang.
- Pathology 2: clipped_ratio = 1 + reward_std ≈ 0 ⇒ zero gradient. No LR tune can fix that; fix it with max_completion_length, soft-budget warmup, and num_generations.
- Pathology 3: TRL 1.0 + vLLM + Unsloth + FSDP do not compose as a four-way stack. Pick a branch up front — FSDP2 full fine-tune, or DeepSpeed ZeRO-3 + Unsloth LoRA — rather than debugging the intersection.

These three lessons are the only things from this blog worth copying into the next OpenEnv + TRL 1.0 + multi-turn submission before writing a single line of env code.
quadrantChart
title LLMs + Reasoning Economics · Budget Scope vs Horizon
x-axis "Per-query budget" --> "Shared, session-level budget"
y-axis "Single-turn / inference-time" --> "Sequential multi-turn MDP"
quadrant-1 "Our target"
quadrant-2 "Multi-turn, no shared budget"
quadrant-3 "Most prior work"
quadrant-4 "Emerging"
"Token-Budget / CCoT": [0.12, 0.12]
"Chain-of-Draft (Xu 2024)": [0.18, 0.2]
"L1 / LCPO / O1-Pruner": [0.25, 0.25]
"Kimi K1.5 Long2Short": [0.3, 0.3]
"SelfBudgeter": [0.4, 0.3]
"CoT-Valve / TokenSkip": [0.15, 0.18]
"Budget Forcing / s1": [0.22, 0.32]
"Dynasor-CoT": [0.2, 0.15]
"ReasoningEconomicsEnv": [0.85, 0.88]
Figure 2. Positioning relative to prior reasoning-economics work. Our contribution occupies the shared-budget, sequential multi-turn quadrant; every prior system compresses within a single query.
If the quadrant chart fails to render (Mermaid quadrantChart is marked experimental), the intent is: horizontal axis runs from per-query budget (left) to shared session-level budget (right); vertical axis runs from single-turn inference-time methods (bottom) to sequential multi-turn MDPs (top). Prior work clusters in the bottom-left; ReasoningEconomicsEnv is the top-right.
The vertical axis is kept as a sequential multi-turn MDP deliberately: our problem is an online-decision problem, not an offline length-control problem. Every prior family in the bottom-left picks a reasoning length once per prompt, in isolation — prompt-guided caps, RL length rewards (including Kimi K1.5 Long2Short), SFT/distillation on shorter traces, dynamic early exit. That is an offline decision in the RL-theory sense: the policy never sees the consequence of its earlier spend affecting what's available for later prompts.
Under a shared session-level budget, the policy must revise pacing after every single observation. The remaining-budget state is non-stationary by construction: every answer shrinks the feasible set of future allocations, and a wrong call on Q1 can starve Q10 irreversibly. This is exactly the online-learning setting — a stream of observations with no resets between decisions, where the cost of a bad action compounds across the episode rather than being absorbed by an independent prompt.
Three concrete consequences of the online framing:
accuracy × utilization ties every intermediate allocation to a single episode-level outcome; the agent cannot learn local policies in isolation (see Scoring).Everything else in the bottom-left quadrant can be served by a length-annotated fine-tune or a decoding-time heuristic. The top-right quadrant — our target — cannot; it needs online sequential learning under a shared budget, which is the setting ReasoningEconomicsEnv is built to expose.
| Foundation | Role in this project | Citation |
|---|---|---|
| GRPO | Critic-free RL objective with group-relative advantages; ideal for terminal-only and sparse verifiable rewards | Shao et al., arXiv:2402.03300 (DeepSeekMath) |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment, per-session state, concurrent sessions | HF Blog: Introducing OpenEnv |
| TRL 1.0 + rollout_func | Explicit multi-turn stepping; env_reward / env_mask contract; avoids add_response_schema Qwen3/3.5 allowlist | TRL × OpenEnv docs |
| vLLM | High-throughput inference; colocate and server modes; trl vllm-serve weight sync | vllm-project/vllm |
| Kimi K1.5 / Long2Short | State-of-the-art per-query length RL; the strongest "compress within a single trace" baseline we compare against | Moonshot AI, arXiv:2501.12599 (2025) |
| DeepSpeed ZeRO-3 | Stage-3 parameter + optimizer sharding with CPU offload; the sharding backbone of Branch B (14B + Unsloth LoRA) | Rajbhandari et al., arXiv:1910.02054 |
| Unsloth | Fused kernels for QLoRA training; we integrated it only on Branch B (ZeRO-3). Composition with TRL GRPO is an empirical finding in our stack, not a documented upstream pairing. | unslothai/unsloth |
| FSDP2 | Per-parameter fully-sharded data parallel; the sharding backbone of Branch A (full fine-tune, no LoRA) | PyTorch, docs |
| MetaMathQA (active) / NuminaMath-TIR (planned) | Source dataset for episode question sampling — public, verifiable, SymPy-gradable. Current runs draw from the first 5 000 rows of MetaMathQA; NuminaMath-TIR is wired but disabled (Future Work). | meta-math/MetaMathQA, AI-MO/NuminaMath-TIR on Hugging Face |
| LotteryElicitationEnv/PT | Sibling project — structural template for two-repo split, rollout_func, DDP padding | Same monorepo |
Supported quick-start path: single-GPU colocate. Multi-GPU paths (FSDP full fine-tune, and DeepSpeed ZeRO-3 + Unsloth LoRA for 14B) are described in Engineering Pathology 3; this section sticks to the simplest working configuration.
# 1. Environment (HF Space or local Docker)
# Local Docker is the most robust; point ENV_BASE_URL at http://127.0.0.1:8000.
# If you prefer the HF Space, use its direct host (https://<owner>-<space>.hf.space), not the hf.co/spaces page.
export ENV_BASE_URL="http://127.0.0.1:8000"
# 2. Training client (ReasoningEconomicsPT) — single-GPU colocate only
export REPT_ROOT="$PWD"
export REPT_VENV="$PWD/.venv"
export REPT_MODEL="Qwen/Qwen3-4B"
export CUDA_VISIBLE_DEVICES=0
export REPT_NUM_GPUS=1
export REPT_VLLM_MODE=colocate # vLLM and policy share one GPU
export REPT_VLLM_TP=1
bash scripts/bootstrap_lambda.sh
bash scripts/preflight_lambda.sh
bash scripts/run_grpo_lambda.sh --dry-run
bash scripts/run_grpo_lambda.sh
All episodes are seeded. Grading is deterministic (extract_boxed_answer with last-match semantics + SymPy equality). Budget resolution is fully specified by the four-priority table in Environment Design, with budget_source returned in observation metadata for audit.
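For concreteness, a sketch of the grading path as described (last \boxed{} wins, then SymPy equality). This is an illustrative re-implementation, not env/reward.py; in particular the regex does not handle nested braces and the SymPy parse of LaTeX-style answers is simplified.

```python
import re
import sympy

def extract_boxed_answer(text: str):
    # Last-match semantics: the final \boxed{...} in the response is the one graded.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)   # nested braces not handled here
    return matches[-1] if matches else None

def grade(response: str, ground_truth: str) -> bool:
    pred = extract_boxed_answer(response)
    if pred is None:
        return False
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(ground_truth)) == 0
    except (sympy.SympifyError, TypeError, SyntaxError):
        return pred.strip() == ground_truth.strip()       # fall back to exact string match
```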
Dependency pin: trl==1.0.0 + vllm==0.10.2 + transformers>=5.2,<5.4 + torch==2.8.* (see requirements.lambda.txt / requirements.carc-cu121.txt). Multi-GPU variants and the two engineering branches (FSDP / DeepSpeed ZeRO-3) are covered in Engineering Pathology 3.
Mirrors the LotteryElicitationEnv baseline CLI pattern; evaluates the trained policy against the four planned baselines in Baselines.
python -m reasoning_economics_pt.eval.evaluate \
--policy hf --model ./outputs/ckpt-last \
--episodes 200 \
--baselines always_same_budget,greedy_first,uniform_random_split,zero_shot_llm \
--num_questions 4 --budget_ratio 0.8 --hard_cap_mode strict
The harness reports reward_mean, accuracy_mean, budget_utilization_clamped, overspend_tokens, average tokens per question, and questions completed, per baseline and per policy. Episodes are seeded identically across baselines so allocation deltas are directly comparable.
- Scaling the training recipe: raise max_completion_length to 8192, cap thinking via max_thinking_tokens, start in soft-budget mode and warm into hard-cap, monitor the IS ratio, add a small partial-credit term for emitting </think>. The pathology in Training Pathology tells us exactly where to intervene.
- Grading today is math-only: \boxed{} + SymPy equality. The MDP generalizes to any verifiable-answer domain, but we have not validated that claim empirically.
- The NCCL desync fix generalizes: any TRL rollout_func user on variable-length rollouts + server mode has it latent.

ReasoningEconomicsEnv reframes the reasoning-economics question from how short can this answer be? to how should I spend what I have left? A stateless grader, a tokenizer-native budget accountant, and a multiplicatively-coupled per-step-plus-terminal reward give us a sequential MDP where every component is auditable and every dollar of compute is accounted for in the unit system the policy actually sees.
The engineering contribution is summarized in one place: Takeaways. We do not re-enumerate it here.
The research question remains open: can a GRPO-trained LLM learn to pace its own reasoning across a shared-budget episode? The headline Qwen3-14B 8×A100 run — 480 episodes, mean reward +0.4692, max +4.52, env_step_error=0 — is the first evidence we have that the answer is yes, subject to baselines landing against the same checkpoint (designed targets in Baselines). Baseline runs against the released 14B checkpoint and the planned NuminaMath-TIR channel are the next results on the roadmap.