
Commit 1bdf4d2

[hardware, recipe, ci] feat: Support fsdp peft sft on npu (verl-project#2240)
### What does this PR do?

- Support FSDP PEFT SFT on NPU.
- Add CI actions to keep PEFT SFT and sequence parallelism working on NPU.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

Run examples/sft/gsm8k/run_qwen_05_peft_sp2_npu.sh on GPU and NPU:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.prompt_key=extra_info \
    data.response_key=extra_info \
    optim.lr=1e-4 \
    data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=64 \
    model.partial_pretrain=Qwen/Qwen2.5-0.5B-Instruct \
    trainer.default_local_dir=$save_path \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-qwen-2.5-0.5b-instruct \
    trainer.logger=['console'] \
    trainer.total_epochs=2 \
    trainer.default_hdfs_dir=null $@ \
    model.lora_rank=32 \
    model.lora_alpha=16 \
    model.target_modules=all-linear \
    model.strategy=fsdp \
    ulysses_sequence_parallel_size=2 \
    use_remove_padding=true
```

Mean absolute error of train loss:

![train_loss_mae](https://github.com/user-attachments/assets/f0c436ae-4d92-44c9-bca8-0b7cde1f4cfe)

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

Enable sequence parallelism (SP):

```bash
--ulysses_sequence_parallel_size=2
--use_remove_padding=true
```

NPU does not support fsdp2, so `model.strategy` must be set explicitly:

```bash
--model.strategy=fsdp
```

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
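The plot linked above reports the mean absolute error (MAE) between the GPU and NPU train-loss curves. The following is a minimal sketch of how such a comparison could be computed; it is not part of this PR, and the file names (`gpu_train_loss.csv`, `npu_train_loss.csv`) and the `step`/`loss` CSV layout are assumptions made for illustration.

```python
# Sketch: compare per-step train loss from a GPU run and an NPU run of the same script.
# File names and column names ("step", "loss") are illustrative assumptions.
import pandas as pd

gpu = pd.read_csv("gpu_train_loss.csv").set_index("step")["loss"]
npu = pd.read_csv("npu_train_loss.csv").set_index("step")["loss"]

# Align on common steps so runs of slightly different lengths can still be compared.
common = gpu.index.intersection(npu.index)
abs_diff = (gpu.loc[common] - npu.loc[common]).abs()

mae = abs_diff.mean()                          # mean absolute error of train loss
rel_err = mae / gpu.loc[common].abs().mean()   # relative error w.r.t. the GPU curve

print(f"MAE of train loss: {mae:.6f}")
print(f"Relative error:    {rel_err:.2%}")
```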
1 parent 82d1ef5 commit 1bdf4d2

4 files changed

Lines changed: 72 additions & 5 deletions

File tree

- .github/workflows/e2e_ascend.yml
- docs/ascend_tutorial/ascend_quick_start.rst
- examples/sft/gsm8k/run_qwen_05_peft_sp2_npu.sh
- tests/special_npu/run_qwen2_5_05b_sft_peft_sp2.sh

.github/workflows/e2e_ascend.yml

Lines changed: 2 additions & 2 deletions

```diff
@@ -120,10 +120,10 @@ jobs:
         run: |
           ray stop --force
           python3 examples/data_preprocess/geo3k.py
-      - name: Running gsm8k e2e training tests with LoRA on ASCEND NPU
+      - name: Running gsm8k e2e training tests with peft sft on ASCEND NPU
         run: |
           ray stop --force
-          bash tests/special_e2e/sft/run_sft.sh
+          bash tests/special_npu/run_qwen2_5_05b_sft_peft_sp2.sh
           rm -rf $HOME/ckpts
       - name: Running gsm8k e2e training tests with GRPO on ASCEND NPU
         run: |
```

docs/ascend_tutorial/ascend_quick_start.rst

Lines changed: 4 additions & 3 deletions

```diff
@@ -10,7 +10,7 @@ Last updated: 06/17/2025.
 
 Atlas 200T A2 Box16
 
-Atlas 800T A2
+Atlas 900 A2 PODc
 
 
 安装
@@ -47,7 +47,7 @@ vllm & vllm-ascend
     # for Atlas 200T A2 Box16
     VLLM_TARGET_DEVICE=empty pip install -e . --extra-index https://download.pytorch.org/whl/cpu/
 
-    # for Atlas 800T A2
+    # for Atlas 900 A2 PODc
     VLLM_TARGET_DEVICE=empty pip install -e .
 
 .. code-block:: bash
@@ -172,7 +172,8 @@ vllm & vllm-ascend
 +-----------+-------------------------+-------------+-------------------+----------------------+
 | DAPO      | Qwen2.5-7B-instruct     | 3.83%       | pending           | Atlas 200T A2 Box16  |
 +-----------+-------------------------+-------------+-------------------+----------------------+
-
+| SFT-PEFT  | Qwen2.5-0.5B-instruct   | 0.06%       | 0.305             | Atlas 900 A2 PODc    |
++-----------+-------------------------+-------------+-------------------+----------------------+
 
 精度对比说明
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
examples/sft/gsm8k/run_qwen_05_peft_sp2_npu.sh (new file, referenced in the PR description)

Lines changed: 36 additions & 0 deletions

```bash
set -x

if [ "$#" -lt 2 ]; then
    echo "Usage: run_qwen2_5_05b_sft_peft_sp2_npu.sh <nproc_per_node> <save_path> [other_configs...]"
    exit 1
fi

nproc_per_node=$1
save_path=$2

# Shift the arguments so $@ refers to the rest
shift 2

torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.prompt_key=extra_info \
    data.response_key=extra_info \
    optim.lr=1e-4 \
    data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=64 \
    model.partial_pretrain=Qwen/Qwen2.5-0.5B-Instruct \
    trainer.default_local_dir=$save_path \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-qwen-2.5-0.5b-instruct \
    trainer.logger=['console'] \
    trainer.total_epochs=2 \
    trainer.default_hdfs_dir=null $@ \
    model.lora_rank=32 \
    model.lora_alpha=16 \
    model.target_modules=all-linear \
    model.strategy=fsdp \
    ulysses_sequence_parallel_size=2 \
    use_remove_padding=true
```
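As in the existing GSM8K SFT example scripts, the first two positional arguments select the number of processes and the checkpoint directory, and any remaining arguments are forwarded to the trainer via `$@`, e.g. `bash run_qwen2_5_05b_sft_peft_sp2_npu.sh 8 ./ckpts` (the checkpoint path here is illustrative).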
tests/special_npu/run_qwen2_5_05b_sft_peft_sp2.sh (new file, invoked by the CI workflow change above)

Lines changed: 30 additions & 0 deletions

```bash
set -x

mkdir -p ./save_ckpts

torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.prompt_key=extra_info \
    data.response_key=extra_info \
    optim.lr=1e-4 \
    data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=32 \
    model.partial_pretrain=Qwen/Qwen2.5-0.5B-Instruct \
    trainer.default_local_dir=./save_ckpts \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-qwen-2.5-0.5b-instruct \
    trainer.logger=['console'] \
    trainer.total_epochs=1 \
    trainer.total_training_steps=1 \
    trainer.default_hdfs_dir=null $@ \
    model.lora_rank=32 \
    model.lora_alpha=16 \
    model.target_modules=all-linear \
    model.strategy=fsdp \
    ulysses_sequence_parallel_size=2 \
    use_remove_padding=true

rm -rf ./outputs ./save_ckpts
```
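For reference, the `model.lora_rank=32`, `model.lora_alpha=16`, and `model.target_modules=all-linear` options used in both scripts correspond to a standard PEFT LoRA setup. The sketch below shows an illustrative standalone equivalent using Hugging Face PEFT; it is not the code path `verl.trainer.fsdp_sft_trainer` actually takes, and `target_modules="all-linear"` assumes a reasonably recent `peft` release.

```python
# Illustrative standalone equivalent of the LoRA options in the scripts above.
# This uses Hugging Face PEFT directly and is not verl's internal wiring.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

lora_cfg = LoraConfig(
    r=32,                         # model.lora_rank=32
    lora_alpha=16,                # model.lora_alpha=16
    target_modules="all-linear",  # model.target_modules=all-linear
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```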
