This repository provides an HGPO (Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks) recipe for verl-agent, used for multi-turn agentic RL.
Motivation: Figure (a) compares trajectory-wise and stepwise policy optimization frameworks; given two example group trajectories, Figure (b) illustrates trajectory-level and step-level grouping with their corresponding advantage estimations.

Overview of HGPO: the LLM-based agent interacts with a set of environments initialized from the same state $\bm{s}_{0}$, producing four group trajectories (states with the same color are identical). HGPO comprises two key components: context-aware hierarchical grouping and adaptive weighted advantage computation. For illustration, consider the state $\bm{s}_{2}$ (purple). First, HGPO assigns $\bm{s}_{2}$ to three hierarchical groups according to its historical contexts; then it computes the final advantage estimate by adaptively aggregating the weighted advantages from these groups.
All scripts live under `recipe/hgpo/`, organized by model size and environment:
- Training scripts: set `history_length`, `group_size`, `mode`, `weight_type`, `length_weight_alpha`, `base_group`, etc. Experiment names are auto-generated (e.g. `k2_hgpo_length_alpha1.0_baseGroup_False`; a sketch of the naming pattern follows this list).
- Eval scripts: fill in `eval_experiment_names` in the script (matching the training `experiment_name`). The script parses `history_length` from the name (e.g. `k2` → 2, `k4` → 4), runs evaluation for each of `seeds=(123 456 789)`, and writes logs to `logs/<checkpoint_dir>/output_seed{seed}.log`.
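For reference, the naming pattern can be reproduced by hand. A minimal sketch inferred from the example name above (the variable names inside the actual scripts may differ):

```bash
# Rebuild an experiment name from its components
# (pattern inferred from k2_hgpo_length_alpha1.0_baseGroup_False)
history_length=2
weight_type=length
length_weight_alpha=1.0
base_group=False
experiment_name="k${history_length}_hgpo_${weight_type}_alpha${length_weight_alpha}_baseGroup_${base_group}"
echo "$experiment_name"   # -> k2_hgpo_length_alpha1.0_baseGroup_False
```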
Scripts rely on the following environment variables (set as needed):
- `HF_HOME`: Hugging Face cache directory
- `WANDB_API_KEY`: WandB API key (optional)
- `WANDB_DIR`: WandB log directory (optional)
- `CUDA_VISIBLE_DEVICES`: visible GPUs
- `CHECKPOINTS_DIR`: checkpoint root directory, used by both training and evaluation
Example:
```bash
export HF_HOME=/path/to/hf
export WANDB_API_KEY=your_key
export WANDB_DIR=/path/to/wandb
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CHECKPOINTS_DIR=/path/to/checkpoints
```

| Parameter | Description | Typical values |
|---|---|---|
| `mode` | Within-group advantage normalization | `mean_norm` / `mean_std_norm` |
| `weight_type` | Within-group weight type | `length` (step-length weighting) |
| `length_weight_alpha` | Weight is $L^{\alpha}$ (step length $L$); $\alpha = 0$ gives uniform weights | `1.0` |
| `base_group` | Use the episode-level advantage as the initial group in aggregation | `true` / `false` |
Use these together with env options such as `env.history_length` and `env.rollout.n` (rollouts per group).
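For orientation, a hedged sketch of how these options could compose as verl-style `key=value` overrides. The `env.history_length` and `env.rollout.n` keys are the ones named above; the entry module and the `algorithm.*` prefixes for the HGPO-specific knobs are assumptions, and the scripts under `recipe/hgpo/` remain authoritative:

```bash
# Sketch only: verl-style Hydra overrides; the algorithm.* key paths for the
# HGPO parameters are assumptions, not verified against the recipe's config.
python3 -m verl.trainer.main_ppo \
    env.history_length=2 \
    env.rollout.n=8 \
    algorithm.mode=mean_norm \
    algorithm.weight_type=length \
    algorithm.length_weight_alpha=1.0 \
    algorithm.base_group=False
```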
- Install and configure AlfWorld / WebShop (see `agent_system/environments`).
- Data is used only to set the batch size and format. Prepare the text data and generate parquet files first:

```bash
python3 -m examples.data_preprocess.prepare --mode 'text' --train_data_size 16 --val_data_size 128
```

Paths are set in the scripts via `data.train_files` / `data.val_files`; the defaults are `$HOME/data/verl-agent/text/train.parquet` and `$HOME/data/verl-agent/text/test.parquet`.
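A quick sanity check that preprocessing wrote the parquet files to the default locations listed above:

```bash
# Verify the default train/val parquet files exist
ls -lh "$HOME/data/verl-agent/text/train.parquet" \
       "$HOME/data/verl-agent/text/test.parquet"
```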
Training (1.5B, 2 GPUs):

```bash
bash recipe/hgpo/run_qwen2.5_1.5b_alfworld_train.sh
```

Training (7B, 4 GPUs):

```bash
bash recipe/hgpo/run_qwen2.5_7b_alfworld_train.sh
```

Evaluation: edit `eval_experiment_names` in the corresponding eval script (e.g. add `k2_hgpo_length_alpha1.0_baseGroup_False`), then run:

```bash
# 1.5B
bash recipe/hgpo/run_qwen2.5_1.5b_alfworld_eval.sh
# 7B
bash recipe/hgpo/run_qwen2.5_7b_alfworld_eval.sh
```

In the AlfWorld eval scripts, `val_out` controls the validation set: `val_out=True` for in-domain, `val_out=False` for out-of-domain (some scripts use the variable name `eval_out` with the same meaning).
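A minimal illustration of these two edits inside an AlfWorld eval script (the identifiers are the ones named above; treating `eval_experiment_names` as a bash array is an assumption that mirrors the `seeds=(123 456 789)` convention):

```bash
# Illustrative eval-script settings (names from this README)
eval_experiment_names=("k2_hgpo_length_alpha1.0_baseGroup_False")
val_out=True   # True = in-domain validation, False = out-of-domain
               # (some scripts name this variable eval_out)
```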
Training:

```bash
# 1.5B, 2 GPUs
bash recipe/hgpo/run_qwen2.5_1.5b_webshop_train.sh
# 7B, 4 GPUs
bash recipe/hgpo/run_qwen2.5_7b_webshop_train.sh
```

Evaluation: as for AlfWorld, set `eval_experiment_names` in the eval script (e.g. `k2_hgpo_length_step30_alpha1.0`), then run:

```bash
bash recipe/hgpo/run_qwen2.5_1.5b_webshop_eval.sh
# or
bash recipe/hgpo/run_qwen2.5_7b_webshop_eval.sh
```

For upstreaming to verl-agent, the HGPO logic and trainer extensions are self-contained under `recipe/hgpo/`:
| File | Description |
|---|---|
| `hgpo/core_hgpo.py` | HGPO advantage computation (self-contained) |
| `hgpo/hgpo_ray_trainer.py` | PPO Ray trainer with HGPO support; `adjust_batch()` runs after `compute_advantage()` |
If not included upstream, you may also need:
- `agent_system/environments/env_manager.py`: the AlfWorld branch should use `config.trainer.val_out` to select in-domain vs. out-of-domain validation.
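To check whether a given verl-agent checkout already carries this change, a simple search works (file path and identifier as above):

```bash
# Look for the val_out branch in the AlfWorld environment manager
grep -n "val_out" agent_system/environments/env_manager.py
```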
```bibtex
@inproceedings{he2026hierarchyofgroups,
  title={Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks},
  author={Shuo He and Lang Feng and Qi Wei and Xin Cheng and Lei Feng and Bo An},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=T8Dev99qnz}
}
```