Hidden Details#
- t3w.core._find_free_port()#
Find an unused port for DDP.
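A common way to implement this is to bind a socket to port 0 and let the OS pick a free port. The sketch below illustrates that pattern; it is an assumption about the approach, not necessarily t3w's exact implementation.

import socket

def find_free_port_sketch() -> int:
    """Illustrative only: bind to port 0 so the OS assigns an unused port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]  # e.g. later exported as MASTER_PORT for DDP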
- t3w.core._subprocess(rank, loop: EvalLoop | TrainLoop)#
Entrypoint function of the spawned subprocesses.
An EvalLoop or TrainLoop will spawn multiple processes if the TopLevelModule it is attached to is told to move its parameters to multiple devices by TopLevelModule.to(). This implies a distributed execution of the loop. The subprocesses will then init a communication group, actually place the model on the target devices, wrap it with torch's DistributedDataParallel wrapper, and call the loop's __call__ again like a single-process execution. Finally, each subprocess cleans up its context and exits back to the spawning point of the parent process.
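The following is a minimal sketch of that spawning flow in plain PyTorch, not t3w's actual code; the loop.model attribute, the devices list, and the nccl backend are illustrative assumptions.

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def _subprocess_sketch(rank, loop, devices, port):
    # init a communication group for this rank
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("nccl", rank=rank, world_size=len(devices))
    try:
        device = devices[rank]
        loop.model.to(device)                          # actually place the model now
        loop.model = DDP(loop.model, device_ids=[device.index],
                         find_unused_parameters=True)  # torch's DDP wrapper
        loop()                                         # __call__ as in single-process execution
    finally:
        dist.destroy_process_group()                   # clean up, return to the parent's spawn point

# The parent process would launch it roughly like:
# torch.multiprocessing.spawn(_subprocess_sketch, args=(loop, devices, port), nprocs=len(devices))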
- class t3w.TopLevelModule(model: Module, optim: Optimizer | None = None, lr_scheduler: _LRScheduler | None = None, regularizer_reweight: float = 1.0)
A central API that manages the nn.Module related features.
This top level module helps its owner loop with the infrastructure code, and helps the user by providing useful low-level utilities, including:
- managing model checkpoint saving and loading,
- moving the user_model to other device(s) and triggering the DDP execution mode,
- computing the losses and metrics specified by the loop.
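A hedged construction sketch, based only on the constructor signature above; the toy model and the exact argument passed to .to() are illustrative assumptions.

import torch
from t3w import TopLevelModule

model = torch.nn.Linear(8, 2)                          # any user nn.Module
optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=100)

top = TopLevelModule(model, optim=optim, lr_scheduler=sched)
top.to("cuda:0,1")   # multiple devices: loops attached to `top` will run in DDP mode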
Note
We delegate losses computation to TrainLoop (see TrainLoop.__init__()), while the loop delegates it further to TopLevelModule.forward(). This is clever because torch's DistributedDataParallel class wraps the top level module instead of the user's model, so it always has a loss tensor as output and is able to find unused parameters during training, while the user can stick to the suggested standard OOP paradigm in userspace by operating on the IMiniBatch instance.
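A conceptual sketch of this delegation chain follows; the class and signatures below are illustrative assumptions, not t3w's actual API.

import torch

class _TopLevelSketch(torch.nn.Module):
    """Toy stand-in showing why DDP wraps the top level module."""
    def __init__(self, user_model, losses):
        super().__init__()
        self.user_model = user_model
        self.losses = losses                           # loss terms handed over by the loop

    def forward(self, mini_batch):
        self.user_model(mini_batch)                    # user code operates on the IMiniBatch
        return sum(loss(mini_batch) for loss in self.losses)  # DDP always sees a loss tensor output

# DDP then wraps the top level module, not the user's model:
# ddp = torch.nn.parallel.DistributedDataParallel(_TopLevelSketch(user_model, losses),
#                                                 find_unused_parameters=True)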
- _fix_optim_states()
PyTorch's _LRScheduler monkey-patches the optimizer to help detect the proper calling order of optim.step and lr_scheduler.step. This patch gets lost during multiprocessing spawning, however, so we reproduce the monkey patch in the subprocess to suppress the warning message.
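For reference, here is a simplified sketch of the kind of counter patch _LRScheduler installs (PyTorch's real version in torch.optim.lr_scheduler uses weakrefs and functools.wraps); this is an illustration of the idea, not t3w's code.

def _patch_step_counter_sketch(optimizer):
    if getattr(optimizer.step, "_with_counter", False):
        return                                         # already patched
    raw_step = optimizer.step

    def step(*args, **kwargs):
        optimizer._step_count += 1                     # lets the scheduler check the call order
        return raw_step(*args, **kwargs)

    step._with_counter = True
    optimizer.step = step
    optimizer._step_count = 0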
- static _parse_multi_device(devices: str) → List[device]
- Parameters:
devices (str) – e.g. “cuda”, “cuda:0”, “cuda:0,1”, “cuda:0-2”, “cuda:0-1,3”.
- Returns:
List[“torch.device”]
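The accepted grammar above can be illustrated with an independent parsing sketch (not t3w's actual implementation):

from typing import List
import torch

def parse_multi_device_sketch(devices: str) -> List[torch.device]:
    if ":" not in devices:
        return [torch.device(devices)]                 # "cuda" -> [device('cuda')]
    kind, spec = devices.split(":")
    indices: List[int] = []
    for part in spec.split(","):                       # "0-1,3" -> 0, 1, 3
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            indices.extend(range(lo, hi + 1))
        else:
            indices.append(int(part))
    return [torch.device(f"{kind}:{i}") for i in indices]

assert parse_multi_device_sketch("cuda:0-1,3") == [torch.device("cuda:0"),
                                                   torch.device("cuda:1"),
                                                   torch.device("cuda:3")]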
- class t3w.core._TrainingProgress(step: int = 0, epoch: int = 0)#
A class that counts the total number of training steps and epochs of the model.
This progress is part of the state_dict of the TopLevelModule; TrainLoop makes use of it to resume training, and SaveBestModelsSideEffect makes use of it to label the checkpoint filename. So it is “the training progress” rather than “a training progress” of the model.
- step: int = 0#
Number of total times the optim.step method has been called.
Note
It is not about how many iterations the for loop has run; it is the number of steps the optimizer has taken to update the model. In a gradient accumulation scenario, the optimizer steps only once after multiple “iteration steps”, and inc_step() will likewise be called only once (see the usage sketch after this class entry).
- epoch: int = 0#
Number of total training epochs.
- inc_step()#
Increase the training step by 1.
- inc_epoch()#
Increase the training epoch by 1.
- __init__(step: int = 0, epoch: int = 0) → None#
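The usage sketch below (plain PyTorch plus _TrainingProgress, not t3w's TrainLoop) illustrates the counting rules: with gradient accumulation, inc_step() is called once per optimizer update, and inc_epoch() once per epoch.

import torch
from t3w.core import _TrainingProgress

model = torch.nn.Linear(4, 1)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
progress = _TrainingProgress()
accum_steps = 4

for i in range(16):                                    # 16 iterations, but only 4 optimizer updates
    loss = model(torch.randn(2, 4)).mean()
    (loss / accum_steps).backward()
    if (i + 1) % accum_steps == 0:
        optim.step()
        optim.zero_grad()
        progress.inc_step()                            # counts optimizer updates, not iterations

progress.inc_epoch()
assert progress.step == 4 and progress.epoch == 1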