Core APIs#
- class t3w.TopLevelModule(model: Module, optim: Optimizer | None = None, lr_scheduler: _LRScheduler | None = None, regularizer_reweight: float = 1.0)#
A central API that manages the nn.Module related features. This top level module helps its owner loop with the infrastructure code, and helps the user by providing useful low-level utilities, including:
managing model checkpoint saving and loading,
moving the user_model to other device(s) and triggering DDP execution mode,
computing losses and metrics specified by the loop.
Note
We delegate the loss computation to TrainLoop (see TrainLoop.__init__()), while the loop delegates it further to TopLevelModule.forward(). This is clever because torch's DistributedDataParallel class wraps the top level module instead of the user's model, and therefore always has a loss tensor as output and is able to find unused parameters during training, while the user can stick to the suggested standard OOP paradigm in userspace by operating on the IMiniBatch instance.
- __init__(model: Module, optim: Optimizer | None = None, lr_scheduler: _LRScheduler | None = None, regularizer_reweight: float = 1.0) None #
- Parameters:
model (nn.Module) – user defined model.
optim (Optimizer, optional) – user defined optimizer. Defaults to None.
lr_scheduler (_LRScheduler, optional) – user defined lr_scheduler. Defaults to None.
regularizer_reweight (float, optional) – the weight of the regularizer loss, if any. Defaults to 1.0.
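A minimal construction sketch (the concrete model, optimizer, and scheduler below are placeholders for illustration; only the TopLevelModule signature itself comes from this page):

    import torch
    from torch import nn
    from t3w import TopLevelModule

    # Placeholder model; a real user model consumes and updates an IMiniBatch in forward().
    user_model = nn.Linear(16, 4)
    optimizer = torch.optim.AdamW(user_model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

    top = TopLevelModule(
        model=user_model,
        optim=optimizer,
        lr_scheduler=scheduler,
        regularizer_reweight=1.0,
    )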
- user_model: nn.Module#
User’s model consumes and updates the input IMiniBatch data in its forward method, and optionally returns a model-specific regularizer loss.
- lr_scheduler: _LRScheduler#
Stores the user defined learning rate scheduler; calling it is further delegated to TrainLoop.
- training_progress: _TrainingProgress#
Stores the training progress of the current model. It is part of the checkpoint state_dict. Modification in userspace is not intended.
- regularizer_reweight: float#
Stores the weight of the regularizer loss if the model has one. It is NOT part of the checkpoint state_dict. Modification in userspace during training should be done through the loop argument of the on_train_step_started() event.
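A hypothetical sketch of doing so from a side effect; the exact on_train_step_started() signature, the import path of ISideEffect, and the step counter attribute of _TrainingProgress are all assumptions, while loop.model.regularizer_reweight and loop.model.training_progress are the attributes documented on this page:

    from t3w import ISideEffect  # import path assumed

    class RegularizerWarmup(ISideEffect):
        def on_train_step_started(self, loop, *args, **kwargs):
            # Hypothetical: linearly ramp the regularizer weight over the first 1000 steps.
            step = getattr(loop.model.training_progress, "step", 0)  # attribute name assumed
            loop.model.regularizer_reweight = min(1.0, step / 1000)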
- to(device)#
Move the model to the target device, or trigger DDP execution if multiple target devices are provided.
Warning
When multiple target devices are specified, the parameters will not be moved until the subprocesses are spawned by the TrainLoop or EvalLoop.
- Parameters:
device (str) – “cuda:0”, “cuda:0,1”, “cuda:0-3” are all valid inputs.
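For example, continuing the construction sketch above (only the device-string formats are from this page):

    top.to("cuda:0")    # single device: parameters are moved immediately
    top.to("cuda:0,1")  # two devices: triggers DDP execution mode
    top.to("cuda:0-3")  # device range: the move is deferred until the
                        # TrainLoop/EvalLoop spawns its subprocesses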
- forward(mb: IMiniBatch, losses: Mapping[str, ILoss] | None = None, metrics: Mapping[str, IMiniBatchMetric] | None = None, step_dict: StepReturnDict | None = None) MiniBatchFloats | IMiniBatch #
In training mode, this will call self.user_model(mb), compute losses and metrics, collect the results into step_dict, and return the weighted sum of all losses (to be backpropagated). In evaluation mode, this will call self.user_model(mb) and return mb.
- Parameters:
mb (IMiniBatch) – the mini batch to feed to the user model.
losses (Mapping[str, ILoss], optional) – named losses to compute in training mode. Defaults to None.
metrics (Mapping[str, IMiniBatchMetric], optional) – named metrics to compute in training mode. Defaults to None.
step_dict (StepReturnDict, optional) – dictionary to be filled with the computed loss and metric values of the current step. Defaults to None.
- Returns:
the weighted sum of all losses in training mode, or the processed mb in evaluation mode.
- Return type:
Union[MiniBatchFloats, IMiniBatch]
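Roughly how the loops drive this method (a conceptual sketch, not the library's loop code; mb, losses, and metrics stand for the objects owned by the running loop, and the StepReturnDict layout and the scalar-tensor nature of the returned loss sum are assumptions):

    # Training mode: the TrainLoop passes its losses/metrics mappings and a step dict.
    top.train()
    step_dict = {"losses": {}, "metrics": {}}   # layout assumed
    total_loss = top(mb, losses=losses, metrics=metrics, step_dict=step_dict)
    total_loss.backward()                       # weighted sum of all losses

    # Evaluation mode: only the mini batch is passed; the updated batch comes back.
    top.eval()
    mb = top(mb)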
- property device#
The current device (model parallelization is not supported).
- save(path)#
Save the current training state to disk.
- The states to be saved include:
training progress
user model state_dict
user optimizer state_dict
user lr_scheduler state_dict
current random states of random, np.random and torch.random.
Warning
Parent directories will not be created if they do not exist.
- Parameters:
path (str) – save path.
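Because missing parent directories are not created, make them first; a short sketch with a hypothetical path:

    from pathlib import Path

    ckpt = Path("checkpoints/exp1/epoch_010.pt")
    ckpt.parent.mkdir(parents=True, exist_ok=True)  # save() will not create this for us
    top.save(str(ckpt))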
- load(path, strict=True)#
Load a previously saved top level module checkpoint.
- Parameters:
path (str) – load file path
strict (bool, optional) – passed on to user_model’s
load_state_dict(strict=?)
option. Defaults to True.
- Returns:
the incompatible keys returned by user_model’s
load_state_dict()
method.- Return type:
_IncompitableKeys
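Resuming is symmetric; strict=False is useful when the checkpoint only partially matches the current model (paths are hypothetical):

    incompatible = top.load("checkpoints/exp1/epoch_010.pt")            # strict load
    incompatible = top.load("checkpoints/pretrained.pt", strict=False)  # tolerate mismatched keys
    print(incompatible)  # the missing/unexpected keys reported by load_state_dict()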
- class t3w.EvalLoop(dataset: IDataset, model: TopLevelModule | None = None, batch_size: int | None = None, metrics: Mapping[str, IDatasetMetric] = {}, medias: Sequence[IMediaProducer] = [], side_effects: Sequence[ISideEffect] = [], breakpoint: bool = False)#
- __init__(dataset: IDataset, model: TopLevelModule | None = None, batch_size: int | None = None, metrics: Mapping[str, IDatasetMetric] = {}, medias: Sequence[IMediaProducer] = [], side_effects: Sequence[ISideEffect] = [], breakpoint: bool = False) None #
- model: TopLevelModule#
- metrics: Mapping[str, IDatasetMetric]#
- loader: DataLoader#
- batch_size: int#
- property metric_values: Dict[str, float]#
- mark_padding(step, mb: IMiniBatch)#
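A construction sketch using only the signature above; val_set and my_accuracy_metric are placeholders for an IDataset split and an IDatasetMetric instance:

    from t3w import EvalLoop

    eval_loop = EvalLoop(
        dataset=val_set,
        model=top,
        batch_size=64,
        metrics={"accuracy": my_accuracy_metric},
    )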
- class t3w.TrainLoop(dataset: IDataset, model: TopLevelModule, losses: Mapping[str, ILoss], metrics: Mapping[str, IMiniBatchMetric] = {}, medias: Sequence[IMediaProducer] = [], batch_size: int | None = None, sampler: Sampler | None = None, num_acc_grad: int = 1, epochs: int = 100, iter_per_epoch: int | None = None, epoch_per_eval: int = 1, eval_loop: EvalLoop | None = None, side_effects: Sequence[ISideEffect] = [])#
- __init__(dataset: IDataset, model: TopLevelModule, losses: Mapping[str, ILoss], metrics: Mapping[str, IMiniBatchMetric] = {}, medias: Sequence[IMediaProducer] = [], batch_size: int | None = None, sampler: Sampler | None = None, num_acc_grad: int = 1, epochs: int = 100, iter_per_epoch: int | None = None, epoch_per_eval: int = 1, eval_loop: EvalLoop | None = None, side_effects: Sequence[ISideEffect] = []) None #
- Parameters:
model (TopLevelModule) – the top level model.
dataset (IDataset) – train split of the dataset.
losses (Mapping[str, ILoss]) – losses to be evaluated.
metrics (Mapping[str, IMiniBatchMetric]) – metrics to be evaluated.
batch_size (Optional[int], optional) – train batch size. Defaults to the train_batch_size static attribute of IDatum.
sampler (Sampler, optional) – a custom sampler instance for the dataloader. Defaults to RandomSampler or DistributedSampler.
num_acc_grad (int, optional) – step interval to apply gradient descent. Defaults to 1.
epochs (int, optional) – total training epochs. Defaults to 100.
iter_per_epoch (Optional[int], optional) – manually define the number of iterations of an epoch. Defaults to len(dataloader).
epoch_per_eval (int, optional) – epoch interval at which to run eval_loop. Defaults to 1.
eval_loop (Optional[EvalLoop], optional) – an EvalLoop instance. Defaults to None.
side_effects (Sequence[ISideEffect], optional) – side effect handlers. Defaults to [].
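Putting the pieces together (a sketch reusing the placeholder names from the earlier examples; only the keyword arguments come from the signature above):

    from t3w import TrainLoop

    train_loop = TrainLoop(
        dataset=train_set,                     # train split implementing IDataset
        model=top,
        losses={"ce": my_cross_entropy_loss},  # ILoss instances, weighted into the total loss
        metrics={"acc": my_accuracy_metric},   # IMiniBatchMetric instances
        batch_size=32,
        num_acc_grad=4,                        # apply the optimizer every 4 steps
        epochs=50,
        epoch_per_eval=1,
        eval_loop=eval_loop,
    )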
- model: TopLevelModule#
- metrics: Mapping[str, IDatasetMetric]#
- batch_size: int#
The normal mini batch size during training.
- num_acc_grad: int#
- epochs: int#
- loader: DataLoader#
- step(mb: IMiniBatch)#
Implements a single training step, with gradient accumulation.
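Conceptually, with num_acc_grad greater than one, a step behaves like the standard accumulation pattern below (a sketch of the idea, not the library's implementation):

    def accumulate_step(loss, optimizer, step_index, num_acc_grad=4):
        """Effective batch size = batch_size * num_acc_grad."""
        (loss / num_acc_grad).backward()          # scale so accumulated gradients average out
        if (step_index + 1) % num_acc_grad == 0:  # apply gradient descent every num_acc_grad steps
            optimizer.step()
            optimizer.zero_grad()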