Hidden Details#
- t3w.core._find_free_port()#
Find an unused port for DDP initialization.
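The standard trick for this (a minimal sketch, not necessarily t3w's exact implementation) is to bind a socket to port 0 and let the OS pick a free port:

    import socket

    def find_free_port() -> int:
        """Ask the OS for a currently unused TCP port (illustrative helper)."""
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("", 0))               # port 0 lets the OS pick any free port
            return s.getsockname()[1]     # the port number actually assigned

The returned port can then be exported, e.g. as MASTER_PORT, before initializing the distributed process group.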
- t3w.core._subprocess(rank, loop: EvalLoop | TrainLoop)#
Entrypoint function of the spawned subprocesses.
An EvalLoop or TrainLoop will spawn multiple processes if the TopLevelModule it is attached to is told to move its parameters to multiple devices by TopLevelModule.to(). This implies a distributed execution of the loop. The subprocesses will then init a communication group, actually place the model on the target devices, wrap it with torch's DistributedDataParallel wrapper, and call the loop's __call__ again like a single-process execution. Finally, each subprocess cleans up its context and exits back to the spawning point of the parent process.
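A generic sketch of such a worker entrypoint (illustrative only; the actual t3w function takes rank and the loop, and the details of group setup, model wiring, and cleanup may differ):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def ddp_worker(rank: int, world_size: int, loop, model: torch.nn.Module, port: int) -> None:
        """Illustrative DDP worker: init the group, place and wrap the model, run the loop, clean up."""
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = str(port)        # e.g. the free port found above
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

        torch.cuda.set_device(rank)
        model = model.to(f"cuda:{rank}")             # actually place parameters on the target device
        ddp_model = DDP(model, device_ids=[rank])    # wrap with DistributedDataParallel

        loop()                                       # re-enter the loop's __call__ as in a single-process run

        dist.destroy_process_group()                 # clean up before exiting back to the parent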
- class t3w.TopLevelModule(model: Module, optim: Optimizer | None = None, lr_scheduler: _LRScheduler | None = None, regularizer_reweight: float = 1.0)
A central API that manages the nn.Module related features.
This top level module helps its owner loop with the infrastructure code, and helps the user by providing useful low-level utilities, including:
- managing model checkpoint saving and loading,
- moving the user_model to other device(s) and triggering the DDP execution mode,
- computing the losses and metrics specified by the loop.
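A rough usage sketch (hypothetical user code; the model, optimizer, and device string are made-up examples, and the surrounding loop wiring is omitted):

    import torch
    from t3w import TopLevelModule

    user_model = torch.nn.Linear(10, 2)                        # any nn.Module
    optim = torch.optim.SGD(user_model.parameters(), lr=0.1)

    top = TopLevelModule(user_model, optim=optim)              # wrap the model-related features
    top.to("cuda:0,1")                                         # two devices -> the attached loop runs in DDP mode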
Note
We delegate the losses computation to TrainLoop (see TrainLoop.__init__()), while the loop delegates it further to TopLevelModule.forward(). This is clever because torch's DistributedDataParallel class wraps the top level module instead of the user's model, therefore it always has a loss tensor as output and is able to find unused parameters during training, while the user can stick to the suggested standard OOP paradigm in userspace by operating on the IMiniBatch instance.
- _fix_optim_states()
PyTorch's _LRScheduler applies a monkey patch to the optimizer to help detect the proper order of calling optim.step and lr_scheduler.step; this patch is lost, however, when spawning subprocesses. We reproduce this monkey patch in the subprocess to suppress the warning message.
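For illustration, the idea is roughly the following. This is a sketch of PyTorch's internal step-counting patch, not necessarily t3w's exact code; the _with_counter marker and _step_count counter are PyTorch internals and may change between versions:

    from functools import wraps
    from torch.optim import Optimizer

    def reapply_step_counter_patch(optim: Optimizer) -> None:
        """Re-wrap optim.step so the scheduler's call-order check stays satisfied."""
        if getattr(optim.step, "_with_counter", False):
            return                                    # the patch is already in place
        if not hasattr(optim, "_step_count"):
            optim._step_count = 0
        unpatched_step = optim.step

        @wraps(unpatched_step)
        def counted_step(*args, **kwargs):
            optim._step_count += 1                    # the counter lr_scheduler.step() inspects
            return unpatched_step(*args, **kwargs)

        counted_step._with_counter = True             # marker checked to detect an overridden step
        optim.step = counted_step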
- static _parse_multi_device(devices: str) → List[device]
- Parameters:
devices (str) – e.g. “cuda”, “cuda:0”, “cuda:0,1”, “cuda:0-2”, “cuda:0-1,3”.
- Returns:
List[torch.device]
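A minimal sketch of how such a device string could be expanded (illustrative; not necessarily the library's exact implementation):

    from typing import List
    import torch

    def parse_multi_device(devices: str) -> List[torch.device]:
        """Expand specs like "cuda", "cuda:0", "cuda:0,1", "cuda:0-2", "cuda:0-1,3"."""
        if ":" not in devices:
            return [torch.device(devices)]            # e.g. "cuda" or "cpu"
        kind, spec = devices.split(":")
        indices: List[int] = []
        for part in spec.split(","):                  # comma-separated groups
            if "-" in part:                           # inclusive index range, e.g. "0-2"
                start, stop = map(int, part.split("-"))
                indices.extend(range(start, stop + 1))
            else:
                indices.append(int(part))
        return [torch.device(f"{kind}:{i}") for i in indices]

    assert parse_multi_device("cuda:0-1,3") == [torch.device("cuda:0"),
                                                torch.device("cuda:1"),
                                                torch.device("cuda:3")]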
- class t3w.core._TrainingProgress(step: int = 0, epoch: int = 0)#
A class that counts the total number of training steps and epochs of the model.
This progress is part of the state_dict of the TopLevelModule; TrainLoop makes use of it to resume training, and SaveBestModelsSideEffect makes use of it to label the filename of saved checkpoints. So it is “the training progress” instead of “a training progress” of the model.
- step: int = 0#
Number of times the optim.step method has been called.
Note
It is not about how many iterations the for loop has actually run, but how many steps the optimizer has taken to update the model. Consider a gradient accumulation case: the optimizer steps only once after multiple “iteration steps”, and inc_step() will likewise be called only once (see the sketch below this class).
- epoch: int = 0#
Total number of training epochs.
- inc_step()#
Increase the training step by 1.
- inc_epoch()#
Increase the training epoch by 1.
- __init__(step: int = 0, epoch: int = 0) → None#
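The gradient accumulation case mentioned in the note above could look like the following sketch. This is hypothetical training code with a made-up model, data, and accumulation factor; in practice the loop and TopLevelModule maintain the progress internally, and the point here is only the counting semantics:

    import torch
    from t3w.core import _TrainingProgress

    progress = _TrainingProgress()                    # step=0, epoch=0
    model = torch.nn.Linear(4, 1)
    optim = torch.optim.SGD(model.parameters(), lr=0.1)
    batches = [torch.randn(8, 4) for _ in range(8)]   # dummy data
    accumulate_every = 4                              # assumed accumulation factor

    for _ in range(2):                                # two epochs
        for i, x in enumerate(batches):
            loss = model(x).mean() / accumulate_every
            loss.backward()                           # gradients accumulate across iterations
            if (i + 1) % accumulate_every == 0:
                optim.step()                          # the optimizer updates the model once ...
                optim.zero_grad()
                progress.inc_step()                   # ... and only then does the step counter advance
        progress.inc_epoch()

    assert progress.step == 4 and progress.epoch == 2  # 2 optimizer updates per epoch, not 8 iterations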