
Commit 2a25e31

[doc] feat: FSDP forward prefetch and entropy memory optimizations (verl-project#2322)
### What does this PR do?

@eric-haibin-lin As this comment says verl-project#1927 (comment), add FSDP forward prefetch and entropy calculation memory optimization to the performance tuning guide.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
1 parent 1bdf4d2 · commit 2a25e31

1 file changed

Lines changed: 25 additions & 1 deletion

File tree

docs/perf/perf_tuning.rst

@@ -3,7 +3,7 @@ Performance Tuning Guide
 
 Last updated: 06/23/2025.
 
-Author: `Guangming Sheng <https://github.com/PeterSH6>`_
+Author: `Guangming Sheng <https://github.com/PeterSH6>`_, `Jiali Zheng <https://github.com/CurryRice233>`_
 
 In this section, we will discuss how to tune the performance of all the stages in verl, including:
@@ -19,6 +19,10 @@ In this section, we will discuss how to tune the performance of all the stages i
 
 6. LigerKernel for SFT performance optimization
 
+7. Forward prefetch in FSDP training backend
+
+8. Memory optimization for entropy calculation from logits
+
 Rollout Generation Tuning
 --------------------------
@@ -173,3 +177,23 @@ LigerKernel is a high-performance kernel for Supervised Fine-Tuning (SFT) that c

3. LigerKernel is particularly useful for improving training performance in SFT scenarios.

Forward prefetch in FSDP training backend
------------------------------------------

During the training phase, users can enable forward prefetching in FSDP by setting ``fsdp_config.forward_prefetch=True``; for example, ``actor_rollout_ref.actor.fsdp_config.forward_prefetch=True``. This configuration prefetches the next forward-pass all-gather operation before the current forward computation completes, overlapping communication with computation and improving efficiency. For further details, refer to the `FSDP forward_prefetch <https://docs.pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp>`_ documentation.
.. note::
   Backward prefetch is unsupported because the ``BACKWARD_POST`` policy may prefetch incorrectly in nested-module cases. For details, see the `FSDP notes in torchtitan <https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md?plain=1#L70>`_.
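As a concrete illustration of what this flag controls, here is a minimal sketch using PyTorch FSDP's Python API directly (the two-layer model is a placeholder and a process group is assumed to be initialized; this is not verl's actual wrapping code):

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # Assumes torch.distributed.init_process_group(...) has already run.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

    # forward_prefetch=True issues the next layer's parameter all-gather
    # before the current layer's forward finishes, so communication for
    # layer i+1 overlaps with computation of layer i.
    fsdp_model = FSDP(model, forward_prefetch=True)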
Memory optimization for entropy calculation from logits
--------------------------------------------------------

The ``logits`` tensor (typically of shape ``[bsz*seq_len, voc]``) can consume significant memory. When using ``compute_entropy_from_logits``, memory usage reaches approximately ``[bsz*seq_len, voc] × (4 bytes (float32) + 2 bytes (autocast for softmax+logsumexp) + 1 byte (softmax output))``.
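As a rough back-of-envelope check of this formula, with made-up example shapes (illustrative only, not measurements from verl):

.. code-block:: python

    # Hypothetical shapes: these values are examples, not defaults.
    bsz, seq_len, voc = 8, 2048, 151_936
    bytes_per_elem = 4 + 2 + 1  # float32 + autocast softmax/logsumexp + softmax output
    peak_gib = bsz * seq_len * voc * bytes_per_elem / 1024**3
    print(f"~{peak_gib:.1f} GiB")  # ~16.2 GiB of entropy intermediates alone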
To reduce this memory peak, enable chunked computation by setting
``actor_rollout_ref.ref.entropy_from_logits_with_chunking=True``.
This processes the tensor in chunks of shape ``[chunk_size, voc]`` (e.g., a chunk size of 2048) rather than over the full sequence length at once; the chunking applies only during the model's forward pass.
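The chunked computation amounts to something like the sketch below (a hand-written illustration of the technique, not verl's exact implementation; the function name is hypothetical):

.. code-block:: python

    import torch

    def entropy_from_logits_chunked(logits: torch.Tensor, chunk_size: int = 2048) -> torch.Tensor:
        # logits: [bsz*seq_len, voc] -> per-token entropy: [bsz*seq_len]
        entropies = []
        for chunk in logits.split(chunk_size, dim=0):
            # H = logsumexp(z) - sum(softmax(z) * z), derived from H = -sum(p * log p)
            pd = torch.softmax(chunk, dim=-1)
            entropies.append(torch.logsumexp(chunk, dim=-1) - (pd * chunk).sum(dim=-1))
        return torch.cat(entropies, dim=0)

Only one ``[chunk_size, voc]`` set of intermediates is live at a time, so the peak scales with ``chunk_size`` rather than ``bsz*seq_len``.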
Additionally, during training, standard gradient checkpointing (``enable_gradient_checkpointing=True``) does not cover the entropy calculation. To reduce memory peaks in this context as well, set
``actor_rollout_ref.actor.entropy_checkpointing=True``.
This enables recomputation specifically for the entropy calculation, lowering memory usage during training.
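One way to picture the recomputation is ``torch.utils.checkpoint`` wrapped around the entropy computation (a sketch reusing the hypothetical chunked function above, not verl's internal code; the shape is made up):

.. code-block:: python

    import torch
    from torch.utils.checkpoint import checkpoint

    logits = torch.randn(4096, 32_000, requires_grad=True)  # hypothetical [bsz*seq_len, voc]

    # Intermediates inside the entropy computation are dropped after the
    # forward pass and recomputed during backward, trading compute for memory.
    entropy = checkpoint(entropy_from_logits_chunked, logits, use_reentrant=False)
    entropy.sum().backward()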
