This repository provides an HGPO (Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks) recipe for verl-agent, used for multi-turn agentic RL.
Motivation: Figure (a) compares trajectory-wise and stepwise policy optimization frameworks; given two example group trajectories, Figure (b) illustrates trajectory-level and step-level grouping with their corresponding advantage estimations.

Overview of HGPO: the LLM-based agent interacts with a set of environments initialized from the same state $\bm{s}_{0}$, producing four group trajectories (states with the same color are identical). HGPO comprises two key components: context-aware hierarchical grouping and adaptive weighted advantage computation. For illustration, consider the state $\bm{s}_{2}$ (purple). First, HGPO assigns $\bm{s}_{2}$ to three hierarchical groups according to its historical contexts; then it computes the final advantage estimate by adaptively aggregating the weighted advantages from these groups.
All scripts live under `recipe/hgpo/`, organized by model size and environment:
- Training scripts: set `history_length`, `group_size`, `mode`, `weight_type`, `length_weight_alpha`, `base_group`, etc. Experiment names are auto-generated (e.g. `k2_hgpo_length_alpha1.0_baseGroup_False`; a sketch of the naming pattern follows this list).
- Eval scripts: fill in `eval_experiment_names` in the script (matching the training `experiment_name`). The script parses `history_length` from the name (e.g. `k2` → 2, `k4` → 4), runs evaluation for each of `seeds=(123 456 789)`, and writes logs to `logs/<checkpoint_dir>/output_seed{seed}.log`.
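For reference, the naming pattern can be reproduced by hand. A minimal sketch inferred from the example name above (the variable names inside the actual scripts may differ):

```bash
# Rebuild an experiment name from its components
# (pattern inferred from k2_hgpo_length_alpha1.0_baseGroup_False)
history_length=2
weight_type=length
length_weight_alpha=1.0
base_group=False
experiment_name="k${history_length}_hgpo_${weight_type}_alpha${length_weight_alpha}_baseGroup_${base_group}"
echo "$experiment_name"   # -> k2_hgpo_length_alpha1.0_baseGroup_False
```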
Scripts rely on the following environment variables (set as needed):
- `HF_HOME`: Hugging Face cache directory
- `WANDB_API_KEY`: WandB API key (optional)
- `WANDB_DIR`: WandB log directory (optional)
- `CUDA_VISIBLE_DEVICES`: visible GPUs
- `CHECKPOINTS_DIR`: checkpoint root directory, used by both training and evaluation
Example:
```bash
export HF_HOME=/path/to/hf
export WANDB_API_KEY=your_key
export WANDB_DIR=/path/to/wandb
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CHECKPOINTS_DIR=/path/to/checkpoints
```

| Parameter | Description | Typical values |
|---|---|---|
| `mode` | Within-group advantage normalization | `mean_norm` / `mean_std_norm` |
| `weight_type` | Within-group weight type | `length` (step-length weighting) |
| `length_weight_alpha` | Weight is $L^{\alpha}$ (step length $L$); $\alpha = 0$ gives uniform weights | `1.0` |
| `base_group` | Use the episode-level advantage as the initial group in aggregation | `true` / `false` |
Use these together with env options such as `env.history_length` and `env.rollout.n` (rollouts per group).
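For orientation, a hedged sketch of how these options could compose as verl-style `key=value` overrides. The `env.history_length` and `env.rollout.n` keys are the ones named above; the entry module and the `algorithm.*` prefixes for the HGPO-specific knobs are assumptions, and the scripts under `recipe/hgpo/` remain authoritative:

```bash
# Sketch only: verl-style Hydra overrides; the algorithm.* key paths for the
# HGPO parameters are assumptions, not verified against the recipe's config.
python3 -m verl.trainer.main_ppo \
    env.history_length=2 \
    env.rollout.n=8 \
    algorithm.mode=mean_norm \
    algorithm.weight_type=length \
    algorithm.length_weight_alpha=1.0 \
    algorithm.base_group=False
```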
- Install and configure AlfWorld / WebShop (see `agent_system/environments`).
- Data is used only to set the batch size and format. Prepare the text data and generate parquet files first:

```bash
python3 -m examples.data_preprocess.prepare --mode 'text' --train_data_size 16 --val_data_size 128
```

Paths are set in the scripts via `data.train_files` / `data.val_files`; the defaults are `$HOME/data/verl-agent/text/train.parquet` and `$HOME/data/verl-agent/text/test.parquet`.
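A quick sanity check that preprocessing wrote the parquet files to the default locations listed above:

```bash
# Verify the default train/val parquet files exist
ls -lh "$HOME/data/verl-agent/text/train.parquet" \
       "$HOME/data/verl-agent/text/test.parquet"
```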
Training (1.5B, 2 GPUs):

```bash
bash recipe/hgpo/run_qwen2.5_1.5b_alfworld_train.sh
```

Training (7B, 4 GPUs):

```bash
bash recipe/hgpo/run_qwen2.5_7b_alfworld_train.sh
```

Evaluation: edit `eval_experiment_names` in the corresponding eval script (e.g. add `k2_hgpo_length_alpha1.0_baseGroup_False`), then run:

```bash
# 1.5B
bash recipe/hgpo/run_qwen2.5_1.5b_alfworld_eval.sh
# 7B
bash recipe/hgpo/run_qwen2.5_7b_alfworld_eval.sh
```

In the AlfWorld eval scripts, `val_out` controls the validation set: `val_out=True` for in-domain, `val_out=False` for out-of-domain (some scripts use the variable name `eval_out` with the same meaning).
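A minimal illustration of these two edits inside an AlfWorld eval script (the identifiers are the ones named above; treating `eval_experiment_names` as a bash array is an assumption that mirrors the `seeds=(123 456 789)` convention):

```bash
# Illustrative eval-script settings (names from this README)
eval_experiment_names=("k2_hgpo_length_alpha1.0_baseGroup_False")
val_out=True   # True = in-domain validation, False = out-of-domain
               # (some scripts name this variable eval_out)
```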
Training:

```bash
# 1.5B, 2 GPUs
bash recipe/hgpo/run_qwen2.5_1.5b_webshop_train.sh
# 7B, 4 GPUs
bash recipe/hgpo/run_qwen2.5_7b_webshop_train.sh
```

Evaluation: as for AlfWorld, set `eval_experiment_names` in the eval script (e.g. `k2_hgpo_length_step30_alpha1.0`), then run:

```bash
bash recipe/hgpo/run_qwen2.5_1.5b_webshop_eval.sh
# or
bash recipe/hgpo/run_qwen2.5_7b_webshop_eval.sh
```

For upstreaming to verl-agent, the HGPO logic and trainer extensions are self-contained under `recipe/hgpo/`:
| File | Description |
|---|---|
| `hgpo/core_hgpo.py` | HGPO advantage computation (self-contained) |
| `hgpo/hgpo_ray_trainer.py` | PPO Ray trainer with HGPO support; `adjust_batch()` runs after `compute_advantage()` |
If not included upstream, you may also need:
- `agent_system/environments/env_manager.py`: the AlfWorld branch should use `config.trainer.val_out` to select in-domain vs. out-of-domain validation.
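To check whether a given verl-agent checkout already carries this change, a simple search works (file path and identifier as above):

```bash
# Look for the val_out branch in the AlfWorld environment manager
grep -n "val_out" agent_system/environments/env_manager.py
```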
```bibtex
@inproceedings{he2026hierarchyofgroups,
  title={Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks},
  author={Shuo He and Lang Feng and Qi Wei and Xin Cheng and Lei Feng and Bo An},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=T8Dev99qnz}
}
```