Core APIs#

class t3w.TopLevelModule(model: Module, optim: Optimizer | None = None, lr_scheduler: _LRScheduler | None = None, regularizer_reweight: float = 1.0)#

A central API that manages nn.Module related features.

This top level module helps its owner loop with infrastructure code, and helps the user by providing useful low-level utilities, including:
  • managing model checkpoint saving and loading,

  • moving user_model to other device(s) and triggering DDP execution mode,

  • computing losses and metrics specified by the loop.

Note

We delegate loss computation to TrainLoop (see TrainLoop.__init__()), while the loop delegates it further to TopLevelModule.forward(). This is deliberate: torch’s DistributedDataParallel wraps the top level module instead of the user’s model, so it always has a loss tensor as its output and is able to find unused parameters during training, while the user can stick to the suggested standard OOP paradigm in userspace by operating on the IMiniBatch instance.

__init__(model: Module, optim: Optimizer | None = None, lr_scheduler: _LRScheduler | None = None, regularizer_reweight: float = 1.0) None#
Parameters:
  • model (nn.Module) – user defined model.

  • optim (Optimizer, optional) – user defined optimizer. Defaults to None.

  • lr_scheduler (_LRScheduler, optional) – user defined lr_scheduler. Defaults to None.

  • regularizer_reweight (float, optional) – the weight of the regularizer_loss if any. Defaults to 1.
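A minimal construction sketch, assuming a trivial nn.Linear stand-in for the user model (a real model would consume an IMiniBatch as described below); the hyperparameter values are illustrative only:

    import torch
    from torch import nn
    from t3w import TopLevelModule

    user_model = nn.Linear(16, 4)  # stand-in; a real user model operates on an IMiniBatch
    optim = torch.optim.AdamW(user_model.parameters(), lr=1e-3)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optim, step_size=10)

    model = TopLevelModule(
        model=user_model,
        optim=optim,
        lr_scheduler=lr_scheduler,
        regularizer_reweight=0.1,  # scales the model's optional regularizer loss
    )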

user_model: nn.Module#

The user’s model consumes and updates the input IMiniBatch data in its forward method, and optionally returns a model specific regularizer loss.

optim: Optimizer#

Stores the user defined optimizer; calling it is delegated to TrainLoop.

lr_scheduler: _LRScheduler#

Stores the user defined learning rate scheduler; calling it is delegated to TrainLoop.

training_progress: _TrainingProgress#

Stores the training progress of the current model. It is part of the checkpoint state_dict and is not intended to be modified in userspace.

regularizer_reweight: float#

Stores the weight of the regularizer loss if the model has one. It is NOT part of the checkpoint state_dict. Modification in userspace during training should be done through the loop argument of the on_train_step_started() event.

to(device)#

Move the model to the target device, or trigger DDP execution if multiple target devices are provided.

Warning

When multiple target devices are specified, the parameters will not be moved until the subprocesses are spawned by the TrainLoop or EvalLoop.

Parameters:

device (str) – “cuda:0”, “cuda:0,1”, “cuda:0-3” are all valid inputs.
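A short usage sketch of the documented device string forms (reading the range form as shorthand for an explicit list is an assumption):

    model.to("cuda:0")     # single GPU: parameters are moved immediately
    model.to("cuda:0,1")   # two GPUs: DDP mode; parameters move when the loop spawns workers
    model.to("cuda:0-3")   # GPU range, presumably equivalent to "cuda:0,1,2,3"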

forward(mb: IMiniBatch, losses: Mapping[str, ILoss] | None = None, metrics: Mapping[str, IMiniBatchMetric] | None = None, step_dict: StepReturnDict | None = None) MiniBatchFloats | IMiniBatch#

In training mode, this will call self.user_model(mb), compute the losses and metrics, collect the results into step_dict, and return the weighted sum of all losses (on which backward will be called).

In evaluation mode, this will call self.user_model(mb) and return mb.

Parameters:
  • mb (IMiniBatch) – the mini batch to feed into self.user_model.

  • losses (Mapping[str, ILoss], optional) – named loss terms to compute in training mode; their weighted sum is the returned training loss. Defaults to None.

  • metrics (Mapping[str, IMiniBatchMetric], optional) – named metrics to compute on the mini batch in training mode. Defaults to None.

  • step_dict (StepReturnDict, optional) – container that collects the computed loss and metric values for the loop. Defaults to None.

Returns:

the weighted sum of all losses in training mode, or the processed mb in evaluation mode.

Return type:

Union[MiniBatchFloats, IMiniBatch]
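A conceptual sketch of the two modes, assuming TopLevelModule follows the standard nn.Module train()/eval() switch; the loss and metric objects and the step_dict container are hypothetical stand-ins, and in practice TrainLoop and EvalLoop make these calls rather than user code:

    model.train()
    loss = model(mb, losses={"ce": ce_loss}, metrics={"acc": acc_metric}, step_dict=step_dict)
    loss.backward()   # training mode returns the weighted loss sum

    model.eval()
    mb = model(mb)    # evaluation mode runs user_model(mb) and returns the updated mb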

property device#

Current device (model parallelization is not supported).

save(path)#

Save the current training states to disk.

The states to be maintained include:
  • training progress

  • user model state_dict

  • user optimizer state_dict

  • user lr_scheduler state_dict

  • current random states of random, np.random and torch.random.

Warning

No parent directory will be created if it does not exist.

Parameters:

path (str) – save path.
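Because no parent directory is created for you, make it first; the path below is illustrative:

    import os

    ckpt_path = "checkpoints/epoch_010.pt"
    os.makedirs(os.path.dirname(ckpt_path), exist_ok=True)
    model.save(ckpt_path)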

load(path, strict=True)#

Load a previous top level model checkpoint.

Parameters:
  • path (str) – load file path

  • strict (bool, optional) – passed on to user_model’s load_state_dict(strict=?) option. Defaults to True.

Returns:

the incompatible keys returned by user_model’s load_state_dict() method.

Return type:

_IncompatibleKeys
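A sketch of resuming from a checkpoint and inspecting mismatched keys (the path is illustrative):

    incompatible = model.load("checkpoints/epoch_010.pt", strict=False)
    # With strict=False, loading proceeds even if some keys differ from the
    # current user_model, and the return value lists what did not match.
    print(incompatible)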

class t3w.EvalLoop(dataset: IDataset, model: TopLevelModule | None = None, batch_size: int | None = None, metrics: Mapping[str, IDatasetMetric] = {}, medias: Sequence[IMediaProducer] = [], side_effects: Sequence[ISideEffect] = [], breakpoint: bool = False)#
__init__(dataset: IDataset, model: TopLevelModule | None = None, batch_size: int | None = None, metrics: Mapping[str, IDatasetMetric] = {}, medias: Sequence[IMediaProducer] = [], side_effects: Sequence[ISideEffect] = [], breakpoint: bool = False) None#
dataset: IDataset#
model: TopLevelModule#
metrics: Mapping[str, IDatasetMetric]#
loader: DataLoader#
batch_size: int#
property metric_values: Dict[str, float]#
mark_padding(step, mb: IMiniBatch)#
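A minimal construction sketch; the validation dataset and the dataset-level metric class are hypothetical, and whether the owning TrainLoop attaches its model to an eval loop created with model=None is an assumption:

    from t3w import EvalLoop

    eval_loop = EvalLoop(
        dataset=val_dataset,                   # validation split implementing IDataset
        metrics={"acc": MyDatasetAccuracy()},  # hypothetical IDatasetMetric instances, keyed by name
        batch_size=64,                         # optional override
    )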
class t3w.TrainLoop(dataset: IDataset, model: TopLevelModule, losses: Mapping[str, ILoss], metrics: Mapping[str, IMiniBatchMetric] = {}, medias: Sequence[IMediaProducer] = [], batch_size: int | None = None, sampler: Sampler | None = None, num_acc_grad: int = 1, epochs: int = 100, iter_per_epoch: int | None = None, epoch_per_eval: int = 1, eval_loop: EvalLoop | None = None, side_effects: Sequence[ISideEffect] = [])#
__init__(dataset: IDataset, model: TopLevelModule, losses: Mapping[str, ILoss], metrics: Mapping[str, IMiniBatchMetric] = {}, medias: Sequence[IMediaProducer] = [], batch_size: int | None = None, sampler: Sampler | None = None, num_acc_grad: int = 1, epochs: int = 100, iter_per_epoch: int | None = None, epoch_per_eval: int = 1, eval_loop: EvalLoop | None = None, side_effects: Sequence[ISideEffect] = []) None#
Parameters:
  • model (TopLevelModule) – the top level model.

  • dataset (IDataset) – train split of the dataset.

  • losses (Mapping[str, ILoss]) – losses to be evaluated.

  • metrics (Mapping[str, IMiniBatchMetric]) – metrics to be evaluated.

  • batch_size (Optional[int], optional) – train batch size. Defaults to the train_batch_size static attribute of IDatum.

  • sampler (Sampler, optional) – a custom sampler instance for the dataloader. Defaults to RandomSampler or DistributedSampler.

  • num_acc_grad (int, optional) – step interval to apply gradient descent. Defaults to 1.

  • epochs (int, optional) – total training epochs. Defaults to 100.

  • iter_per_epoch (Optional[int], optional) – manually define the number of iterations per epoch. Defaults to len(dataloader).

  • epoch_per_eval (int, optional) – epoch interval to apply eval_loop. Defaults to 1.

  • eval_loop (Optional[EvalLoop], optional) – an EvalLoop instance. Defaults to None.

  • side_effects (Sequence[ISideEffect], optional) – side effect handlers attached to the loop. Defaults to [].
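An illustrative wiring of the pieces above; the loss and metric classes are hypothetical:

    from t3w import TrainLoop

    train_loop = TrainLoop(
        dataset=train_dataset,                   # train split implementing IDataset
        model=model,                             # the TopLevelModule constructed earlier
        losses={"ce": MyCrossEntropyLoss()},     # hypothetical ILoss instances
        metrics={"acc": MyMiniBatchAccuracy()},  # hypothetical IMiniBatchMetric instances
        num_acc_grad=2,                          # optimizer step every 2 mini batches
        epochs=50,
        epoch_per_eval=1,                        # run eval_loop after every epoch
        eval_loop=eval_loop,
    )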

dataset: IDataset#
model: TopLevelModule#
losses: Mapping[str, ILoss]#
metrics: Mapping[str, IDatasetMetric]#
batch_size: int#

The normal mini batch size during training.

num_acc_grad: int#
epochs: int#
eval_loop: EvalLoop#
loader: DataLoader#
step(mb: IMiniBatch)#

Implements a single train step, with gradient accumulation.
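The cadence of gradient accumulation, sketched conceptually rather than as t3w’s actual implementation (scaling the loss by num_acc_grad is a common convention, not something this section documents):

    for i, mb in enumerate(loader):
        loss = model(mb, losses=losses, metrics=metrics, step_dict=step_dict)
        (loss / num_acc_grad).backward()   # accumulate scaled gradients
        if (i + 1) % num_acc_grad == 0:
            optim.step()                   # apply the accumulated update
            optim.zero_grad()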