Hidden Details#

t3w.core._find_free_port()#

Find an unused port for DDP.

https://stackoverflow.com/a/45690594
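The linked answer finds a free port by binding a socket to port 0 and letting the OS pick one. A minimal sketch of that approach (the actual helper may differ in details):

    import socket

    def _find_free_port() -> int:
        # Binding to port 0 lets the OS assign an arbitrary unused port;
        # we then report that port number back to the caller.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("", 0))
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            return s.getsockname()[1]

The returned port is typically exported as MASTER_PORT so that the spawned DDP processes can rendezvous.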

t3w.core._subprocess(rank, loop: EvalLoop | TrainLoop)#

Entrypoint function of the spawned subprocesses.

An EvalLoop or TrainLoop will spawn multiple processes if the TopLevelModule it is attached to is told to move its parameters to multiple devices via TopLevelModule.to(). This implies a distributed execution of the loop. Each subprocess then initializes a communication group, actually places the model on its target device, wraps it with torch’s DistributedDataParallel wrapper, and calls the loop’s __call__ again as in a single-process execution. Finally, it cleans up the distributed context and exits back to the spawning point in the parent process.

Parameters:
  • rank (int) – the index of the subprocess, ranging from 0 to len(distributed_devices) - 1.

  • loop (EvalLoop | TrainLoop) – the entire loop context is passed to the subprocess.
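As an illustration of the flow described above (not the actual implementation; the loop attributes used here are hypothetical), a DDP subprocess entrypoint typically looks like this:

    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def _subprocess_sketch(rank: int, loop) -> None:
        # Assumes the parent process exported MASTER_ADDR/MASTER_PORT
        # (e.g. using _find_free_port()) before spawning.
        devices = loop.model.distributed_devices            # hypothetical attribute
        dist.init_process_group("nccl", rank=rank, world_size=len(devices))

        device = devices[rank]
        loop.model.to(device)                               # place parameters on this rank's device
        loop.model = DDP(loop.model, device_ids=[device.index])

        try:
            loop()                                          # run again, like a single-process execution
        finally:
            dist.destroy_process_group()                    # clean up before returning to the parent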

class t3w.TopLevelModule(model: Module, optim: Optimizer | None = None, lr_scheduler: _LRScheduler | None = None, regularizer_reweight: float = 1.0)

A central API that manages nn.Module-related features.

This top level module helps its owner loop with the infrastructure code, and helps the user by providing useful low-level utilities, including
  • managing model checkpoint saving and loading,

  • moving user_model to other device(s) and triggering DDP execution mode,

  • computing losses and metrics specified by the loop.

Note

We delegate loss computation to TrainLoop (see TrainLoop.__init__()), while the loop delegates it further to TopLevelModule.forward(). This is clever because torch’s DistributedDataParallel class wraps the top level module instead of the user’s model, and therefore always has a loss tensor as output and is able to find unused parameters during training, while the user can stick to the suggested standard OOP paradigm in userspace by operating on the IMiniBatch instance.
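A hypothetical usage sketch following the constructor signature above (the user model, optimizer, and scheduler are placeholders, and we assume .to() accepts the device strings understood by _parse_multi_device()):

    import torch
    from torch import nn
    from t3w import TopLevelModule

    user_model = nn.Linear(16, 4)                           # placeholder user model
    optim = torch.optim.SGD(user_model.parameters(), lr=0.1)
    sched = torch.optim.lr_scheduler.StepLR(optim, step_size=10)

    top = TopLevelModule(user_model, optim=optim, lr_scheduler=sched)

    # A multi-device string triggers DDP execution mode when a loop runs
    # (see _subprocess() above); a single device keeps plain single-process mode.
    top.to("cuda:0,1")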

_fix_optim_states()

PyTorch’s _LRScheduler monkey patches the optimizer to help detect the proper calling order of optim.step and lr_scheduler.step, but this patch is lost when subprocesses are spawned. We reproduce the monkey patch in the subprocess to suppress the resulting warning.
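For intuition only, the patch being restored is the step-counting wrapper that torch.optim.lr_scheduler installs on optimizer.step. The _step_count and _with_counter names below are PyTorch internals, and the code is a hedged sketch of the idea, not t3w’s actual implementation:

    from functools import wraps
    from torch.optim import Optimizer

    def _reapply_scheduler_patch(optim: Optimizer) -> None:
        # The scheduler wraps optimizer.step with a counter so it can warn when
        # lr_scheduler.step() runs before optimizer.step(); after spawning, the
        # wrapper is gone and the warning fires spuriously, so re-install it.
        if getattr(optim.step, "_with_counter", False):
            return  # the patch survived, nothing to do

        unpatched_step = optim.step

        @wraps(unpatched_step)
        def step(*args, **kwargs):
            optim._step_count += 1
            return unpatched_step(*args, **kwargs)

        step._with_counter = True
        optim.step = step
        optim._step_count = 0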

static _parse_multi_device(devices: str) → List[device]
Parameters:

devices (str) – e.g. “cuda”, “cuda:0”, “cuda:0,1”, “cuda:0-2”, “cuda:0-1,3”.

Returns:

List[torch.device]
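A sketch of how such strings could be parsed into torch.device objects, matching the examples above (not necessarily the exact implementation):

    from typing import List
    import torch

    def _parse_multi_device_sketch(devices: str) -> List[torch.device]:
        # "cuda" -> default cuda device; "cuda:0-1,3" -> [cuda:0, cuda:1, cuda:3]
        if ":" not in devices:
            return [torch.device(devices)]
        kind, spec = devices.split(":")
        indices: List[int] = []
        for part in spec.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                indices.extend(range(int(lo), int(hi) + 1))
            else:
                indices.append(int(part))
        return [torch.device(f"{kind}:{i}") for i in indices]

    assert [d.index for d in _parse_multi_device_sketch("cuda:0-1,3")] == [0, 1, 3]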

class t3w.core._TrainingProgress(step: int = 0, epoch: int = 0)#

A class that counts the model’s total number of training steps and epochs.

This progress is part of the state_dict of the TopLevelModule; TrainLoop makes use of it to resume training, and SaveBestModelsSideEffect makes use of it to label the checkpoint filenames. So it is “the training progress” of the model rather than merely “a training progress”.

step: int = 0#

total number of times the optim.step method has been called.

Note

It does not count how many iterations the for loop has actually run, but how many times the optimizer has updated the model. In a gradient accumulation setting, the optimizer steps only once after multiple “iteration steps”, and inc_step() is likewise called only once, as in the sketch below.
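To make the distinction concrete, here is a toy gradient accumulation loop (synthetic data, hypothetical accumulation factor) in which inc_step() is called once per optimizer update rather than once per iteration:

    import torch
    from torch import nn
    from t3w.core import _TrainingProgress

    model = nn.Linear(8, 1)
    optim = torch.optim.SGD(model.parameters(), lr=0.1)
    progress = _TrainingProgress()
    accum_steps = 4                        # hypothetical accumulation factor

    for i in range(16):                    # 16 "iteration steps" ...
        x, y = torch.randn(2, 8), torch.randn(2, 1)
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()                    # gradients keep accumulating in .grad
        if (i + 1) % accum_steps == 0:
            optim.step()                   # ... but only 4 optimizer updates,
            optim.zero_grad()
            progress.inc_step()            # so inc_step() runs only 4 times

    assert progress.step == 4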

epoch: int = 0#

total number of training epochs.

inc_step()#

increase the training step by 1.

inc_epoch()#

increase the training epoch by 1.

__init__(step: int = 0, epoch: int = 0) → None#
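Taken together, the fields and methods documented above behave like the following minimal dataclass (a restatement for illustration, not the actual source):

    from dataclasses import dataclass

    @dataclass
    class TrainingProgressSketch:
        step: int = 0    # total optimizer updates (see the gradient accumulation note)
        epoch: int = 0   # total training epochs

        def inc_step(self) -> None:
            self.step += 1

        def inc_epoch(self) -> None:
            self.epoch += 1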