Commit a53fb30

[ckpt] fix: edit esi doc (verl-project#2354)
This PR clears up the confusion around the term "ESI" left by the previous PR (verl-project#2192). In `ppo_trainer.yaml`, it expands the `esi_redundant_time` comment to define ESI (Elastic Server Instance) and to compare it to a training plan. In `ray_trainer.py`, it documents the ESI-related conditions for saving a checkpoint. All changes are comments; they make the code easier to read and maintain.
1 parent 18c6ffc commit a53fb30

2 files changed

Lines changed: 12 additions & 1 deletion

verl/trainer/config/ppo_trainer.yaml

Lines changed: 5 additions & 1 deletion
@@ -998,7 +998,11 @@ trainer:
   # Save frequency (by iteration) for model checkpoints
   save_freq: -1

-  # ESI redundant time (in seconds) for model checkpoints
+  # ESI refers to the elastic server instance used during training, similar to the training plan. For example,
+  # if you purchase 10 hours of computing power, the ESI will automatically shut down after 10 hours of training.
+  # To ensure a checkpoint is saved before ESI shuts down, the system will start saving a checkpoint in advance.
+  # The advance time is calculated as: Advance Time = Longest historical step duration + Checkpoint save duration + esi_redundant_time.
+  # Here, esi_redundant_time is a user-defined value that further extends the advance time for added safety.
   esi_redundant_time: 0

   # Resume mode: "auto", "disable", or "resume_path"
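
The advance-time formula in the new comment can be illustrated with a small sketch. This is not the code of `should_save_ckpt_esi`; the function names, parameter names, and the `<=` comparison against the remaining time are assumptions made for illustration only.

```python
# Hypothetical helpers illustrating the formula from the config comment:
#   Advance Time = longest historical step duration
#                + checkpoint save duration
#                + esi_redundant_time
# None of these names come from verl; all durations are in seconds.


def advance_time(
    longest_step_s: float,
    ckpt_save_s: float,
    esi_redundant_time_s: float = 0.0,
) -> float:
    """Return how long before ESI shutdown a checkpoint save should begin."""
    return longest_step_s + ckpt_save_s + esi_redundant_time_s


def close_to_expiration(
    remaining_s: float,
    longest_step_s: float,
    ckpt_save_s: float,
    esi_redundant_time_s: float = 0.0,
) -> bool:
    """Assumed rule: start saving once the remaining time is within the advance time."""
    return remaining_s <= advance_time(longest_step_s, ckpt_save_s, esi_redundant_time_s)


if __name__ == "__main__":
    # Longest step 300 s, checkpoint save 120 s, extra safety margin 60 s.
    print(advance_time(300.0, 120.0, 60.0))  # 480.0
    print(close_to_expiration(400.0, 300.0, 120.0, 60.0))  # True
    print(close_to_expiration(3600.0, 300.0, 120.0, 60.0))  # False
```

Under these assumptions, raising `esi_redundant_time` only moves the save earlier; it never delays it.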

verl/trainer/ppo/ray_trainer.py

Lines changed: 7 additions & 0 deletions
@@ -1318,10 +1318,17 @@ def fit(self):
 last_val_metrics = val_metrics
 metrics.update(val_metrics)

+# Check if the ESI (Elastic Server Instance)/training plan is close to expiration.
 esi_close_to_expiration = should_save_ckpt_esi(
     max_steps_duration=self.max_steps_duration,
     redundant_time=self.config.trainer.esi_redundant_time,
 )
+# Check if the conditions for saving a checkpoint are met.
+# The conditions include a mandatory condition (1) and one of the following optional conditions (2, 3, or 4):
+# 1. The save frequency is set to a positive value.
+# 2. It's the last training step.
+# 3. The current step number is a multiple of the save frequency.
+# 4. The ESI(Elastic Server Instance)/training plan is close to expiration.
 if self.config.trainer.save_freq > 0 and (
     is_last_step
     or self.global_steps % self.config.trainer.save_freq == 0
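
The four conditions listed in the new comment can be read as a single boolean predicate. The sketch below is a standalone restatement, assuming the truncated `if` ends with the ESI flag as condition 4 describes; `should_save_checkpoint` and its parameters are hypothetical names, not part of `ray_trainer.py`.

```python
# Hypothetical standalone version of the save condition described in the diff.
# Condition 1 is mandatory; at least one of conditions 2, 3, or 4 must also hold.


def should_save_checkpoint(
    save_freq: int,
    global_steps: int,
    is_last_step: bool,
    esi_close_to_expiration: bool,
) -> bool:
    return save_freq > 0 and (  # 1. save frequency is positive (mandatory)
        is_last_step  # 2. last training step
        or global_steps % save_freq == 0  # 3. step is a multiple of save_freq
        or esi_close_to_expiration  # 4. ESI/training plan is about to expire
    )


if __name__ == "__main__":
    print(should_save_checkpoint(5, 10, False, False))  # True: step 10 is a multiple of 5
    print(should_save_checkpoint(5, 7, False, True))  # True: ESI close to expiration
    print(should_save_checkpoint(-1, 7, True, True))  # False: save_freq is not positive
```

Because `and` short-circuits, the modulo is never evaluated when `save_freq` is zero or negative. It also means that with the default `save_freq: -1`, no checkpoint is saved under this condition even when the ESI is about to expire.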
