Heavy Thinking as the Inner Skill in Agentic Harness
HeavySkill is a test-time scaling technique that decomposes complex reasoning into two stages:
- Parallel Reasoning — Generate K independent reasoning trajectories concurrently
- Sequential Deliberation — Synthesize trajectories through critical analysis into a superior final answer
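The two-stage decomposition can be sketched as a single async function. This is an illustrative stub, not the repository's actual API: `reason_once` and `deliberate` stand in for calls to the reasoning and deliberation models.

```python
import asyncio

# Stubs standing in for LLM calls; all names here are illustrative,
# not HeavySkill's real interfaces.
async def reason_once(query: str, seed: int) -> str:
    await asyncio.sleep(0)          # placeholder for network latency
    return f"trajectory-{seed}"     # fake reasoning trace

async def deliberate(query: str, trajectories: list[str]) -> str:
    # Stage 2 in the real pipeline is another LLM pass that cross-checks
    # the trajectories; here it just reports what it received.
    return f"synthesized from {len(trajectories)} trajectories"

async def heavy_think(query: str, k: int = 8) -> str:
    # Stage 1: K independent reasoning trajectories, generated concurrently.
    trajectories = await asyncio.gather(*(reason_once(query, i) for i in range(k)))
    # Stage 2: sequential deliberation over the collected trajectories.
    return await deliberate(query, trajectories)

final = asyncio.run(heavy_think("example query", k=8))
```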
This repository provides two modes of use:
| Mode | Description | Use Case |
|---|---|---|
| Workflow | Python async pipeline with CLI | Batch evaluation, research experiments, custom deployments |
| Skill | Pure prompt file for Claude Code / agentic harness | Interactive reasoning in AI-native IDEs |
- Heavy thinking consistently outperforms Best-of-N (majority voting) strategies
- Stronger LLMs can approach Pass@N performance through deliberation
- The depth (iterations) and width (K) of heavy thinking are scalable via RLVR
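For context, the Best-of-N (majority voting) baseline that heavy thinking is measured against can be sketched as:

```python
from collections import Counter

def majority_vote(final_answers):
    # Self-consistency baseline: keep only each sample's final answer and
    # return the most frequent one. The reasoning chains are discarded,
    # which is exactly what sequential deliberation improves on.
    return Counter(final_answers).most_common(1)[0][0]

winner = majority_vote(["42", "41", "42", "43", "42"])
```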
```bash
git clone https://github.com/wjn1996/HeavySkill.git
cd HeavySkill
pip install -e .
```

```bash
python scripts/run_heavyskill.py \
  --query "Find the number of paths of length 16 on an 8x8 grid that change direction exactly four times." \
  --model "deepseek-r1" \
  --api_base "http://localhost:8080" \
  --reason_k 8 \
  --summary_k 4 \
  --prompt_type "stem" \
  --output "outputs/result.json" \
  --verbose
```

Parameters:
- `--reason_k`: Number of parallel reasoning trajectories (default: 8)
- `--summary_k`: Number of deliberation samples (default: 4)
- `--iterations`: Iterative deliberation rounds (default: 1)
- `--prompt_type`: `"general"` or `"stem"`
- `--language`: `"en"` or `"cn"`
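Under the assumption (not verified against the code) that each deliberation round draws `summary_k` fresh samples, the per-query generation budget is roughly:

```python
def generation_budget(reason_k: int = 8, summary_k: int = 4, iterations: int = 1) -> int:
    # Assumed cost model: reason_k parallel trajectories, plus summary_k
    # deliberation samples per iterative round. Illustrative only.
    return reason_k + summary_k * iterations

defaults = generation_budget()            # 12 generations with the defaults
wider = generation_budget(16, 4, 2)       # 24 with reason_k=16, two rounds
```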
Using a separate deliberation model:
```bash
python scripts/run_heavyskill.py \
  --query "Your problem here" \
  --model "r1-distill-qwen-7b" \
  --api_base "http://localhost:8080" \
  --summary_model "qwen3-32b" \
  --summary_api_base "http://localhost:8081" \
  --reason_k 16 \
  --summary_k 4
```

Batch mode:
```bash
python scripts/run_heavyskill.py \
  --input_file "examples/example_math.json" \
  --model "deepseek-r1" \
  --api_base "http://localhost:8080" \
  --output "outputs/batch_result.json"
```

Copy the skill file into your Claude Code skills directory:

```bash
cp skill/heavyskill.md ~/.claude/skills/heavyskill.md
```

Then in Claude Code, the heavy thinking protocol will be available for complex reasoning tasks. The skill instructs the model to:
- Spawn multiple independent reasoning agents in parallel
- Collect diverse reasoning trajectories
- Perform critical meta-analysis and deliberation
- Output the synthesized final answer
```
HeavySkill/
├── workflow/                        # Mode 1: Python async pipeline
│   ├── config.py                    # Configuration dataclass
│   ├── parallel_reasoning.py        # Stage 1: Parallel trajectory generation
│   ├── sequential_deliberation.py   # Stage 2: Synthesis & deliberation
│   ├── memory_cache.py              # Trajectory storage & selection
│   ├── prompts.py                   # Prompt templates (general, STEM, CN/EN)
│   ├── pipeline.py                  # Full pipeline orchestration
│   ├── utils.py                     # Utilities (clipping, extraction, etc.)
│   └── agent/
│       ├── base.py                  # Abstract agent interface
│       └── openai_compatible.py     # OpenAI-compatible async API client
├── scripts/
│   ├── run_heavyskill.py            # CLI entry point
│   ├── run_heavyskill.sh            # Example shell script
│   └── evaluate.py                  # Simple accuracy evaluation
├── skill/
│   └── heavyskill.md                # Pure prompt skill for agentic harness
├── examples/
│   └── example_math.json            # Example input data
├── paper/
│   └── heavyskill.pdf               # Paper
├── requirements.txt
└── pyproject.toml
```
```
┌─────────────────────────────────────────────────────────┐
│                       User Query                        │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│              Stage 1: Parallel Reasoning                │
│                                                         │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐     ┌──────┐    │
│  │ Thinker 1│ │ Thinker 2│ │ Thinker 3│ ... │  K   │    │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘     └──┬───┘    │
│       │            │            │              │        │
└───────┼────────────┼────────────┼──────────────┼────────┘
        │            │            │              │
        ▼            ▼            ▼              ▼
┌─────────────────────────────────────────────────────────┐
│                     Memory Cache                        │
│           (Store & organize K trajectories)             │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│           Stage 2: Sequential Deliberation              │
│                                                         │
│  - Analyze answer distribution across trajectories      │
│  - Cross-validate reasoning chains                      │
│  - Identify logical errors & correct approaches         │
│  - Synthesize final answer with critical thinking       │
│                                                         │
│  ┌─── Iterative Update (optional) ◄──┐                  │
│  └───────────────────────────────────┘                  │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                     Final Answer                        │
└─────────────────────────────────────────────────────────┘
```
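The memory-cache step between the two stages can be pictured as grouping trajectories by their extracted final answer, so the deliberator sees the answer distribution alongside the reasoning behind each candidate. This is a sketch; the repository's actual storage and selection logic lives in `workflow/memory_cache.py`.

```python
from collections import defaultdict

def organize_trajectories(trajectories):
    """Group (reasoning, answer) pairs by answer, most-supported first."""
    by_answer = defaultdict(list)
    for reasoning, answer in trajectories:
        by_answer[answer].append(reasoning)
    # Rank candidate answers by how many trajectories support them.
    return sorted(by_answer.items(), key=lambda kv: -len(kv[1]))

ranked = organize_trajectories([
    ("chain A", "42"), ("chain B", "41"), ("chain C", "42"),
])
# ranked[0] is the best-supported candidate with its supporting chains
```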
The workflow supports any OpenAI-compatible API endpoint:
- vLLM serving (`--api_base http://localhost:8000`)
- DeepSeek API (`--api_base https://api.deepseek.com`)
- Together AI (`--api_base https://api.together.xyz`)
- OpenRouter (`--api_base https://openrouter.ai/api`)
- Local Ollama (`--api_base http://localhost:11434`)
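These backends all expose the same HTTP surface: a POST to the standard `/v1/chat/completions` route with a JSON body. The sketch below only builds the request without sending it; whether HeavySkill itself appends `/v1` to `--api_base` is an assumption here, and the model name is a placeholder.

```python
import json

def build_chat_request(api_base: str, model: str, prompt: str, n: int = 1):
    # OpenAI-compatible servers accept POST {base}/v1/chat/completions.
    url = f"{api_base.rstrip('/')}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "n": n,  # sample several trajectories from one request
    })
    return url, body

url, body = build_chat_request("http://localhost:8000", "deepseek-r1", "2 + 2 = ?", n=8)
```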
```bibtex
@article{wang2026heavyskill,
  title={HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness},
  author={Wang, Jianing and Guo, Linsen and Chen, Zhengyu and Guo, Qi and Zang, Hongyu and Shi, Wenjie and Ma, Haoxiang and Xi, Xiangyu and Li, Xiaoyu and Wang, Wei and Cai, Xunliang},
  journal={arXiv preprint arXiv:2605.02396},
  year={2026},
  url={https://arxiv.org/abs/2605.02396}
}
```

Apache-2.0