module core.hypergraph
Global Variables
- TYPE_CHECKING
- global_shared_events
function LoadCheckpointTask
Load checkpoint from a file.
Args:
-
resume_from(str): Path to the checkpoint file. -
strict(bool): If True, raise an exception if the checkpoint file does not exist. -
tags(str): Tags to load.
Returns:
Task: Task to load the checkpoint.
function SaveCheckpointTask
Save checkpoint to a file.
Args:
-
save_to(str): Path to the checkpoint file. -
tags(str): Tags to save.
Returns:
Task: Task to save the checkpoint.
class ResumeTaskFailed
raised when task structure does not match during resuming.
class Task
A task is a unit of computation.
It can be a single node, or a graph. A task can be executed by a worker.
Args:
-
node: a node or a graph. -
name: the name of the task. -
total_steps: the total number of steps to run. -
total_epochs: the total number of epochs to run. -
config: a dict of configs.
method __init__
property global_auto_epochs
property global_auto_steps
method load_state_dict
method state_dict
class Repeat
Repeat a task for a fixed number of times.
Attributes:
-
task(Task): Task to repeat. -
repeat(int): Number of times to repeat the task. -
epoch_size(int): Number of steps per epoch. -
total_steps(int): Total number of steps. -
total_epochs(int): Total number of epochs. -
launcher(Launcher): Launcher object. -
hypergraph(Hypergraph): Hypergraph object. -
events(Events): Events object.
method __init__
method load_state_dict
method state_dict
class Counter
Counter object.
Attributes:
-
epochs(int): Number of epochs. -
steps(int): Number of steps.
method __init__
property total
method __getitem__
Get the value of the counter.
Args:
key(str): Name of the counter.
Returns:
int: Value of the counter.
method __setitem__
Set the value of the counter.
Args:
-
key(str): Name of the counter. -
value(int): Value of the counter.
Raises:
KeyError: If the key is not valid.
class GlobalCounters
Global counters object.
Attributes:
-
epochs(int): Number of epochs. -
steps(int): Number of steps.
function __init__
__init__(
steps: 'Counter' = <core.hypergraph.Counter object at 0x7ff1ec689c40>,
epochs: 'Counter' = <core.hypergraph.Counter object at 0x7ff1ec6891c0>
) → None
class HyperGraph
HyperGraph is the container for all nodes.
Attributes:
-
nodes(dict): Nodes. -
edges(dict): Edges. -
tasks(dict): Tasks. -
launchers(dict): Launchers. -
global_counters(GlobalCounters): Global counters. -
resume_from(str): Path to the checkpoint file. -
resume_tags(str): Tags to load. -
save_to(str): Path to the checkpoint file. -
save_tags(str): Tags to save. -
strict(bool): If True, raise an exception if the checkpoint file does not exist. -
dry_run(bool): If True, do not save the checkpoint. -
verbose(bool): If True, print the progress. -
logger(Logger): Logger.
Raises:
ValueError: If the tags are not valid.
method __init__
__init__(
autocast_enabled=False,
autocast_dtype=None,
grad_scaler: 'Union[bool, GradScaler]' = None
) → None
property launcher
Get the launcher.
Returns:
ElasticLauncher: Launcher.
Raises:
ValueError: If the launcher is not valid.
method __getitem__
Get a node by uid.
Args:
uid(str): Uid.
Returns:
Node: Node.
Raises:
ValueError: If the uid is not valid.
method add
Add a node.
Args:
-
name(str): Name. -
node(Node): Node. -
tags(str): Tags.
Returns:
Node: Node.
Raises:
ValueError: If the name is not valid.
method backup_source_files
Backup source files.
Args:
entrypoint(str): Entrypoint.
Raises:
ValueError: If the entrypoint is not valid.
method exec_tasks
Execute the tasks.
Args:
-
tasks(List[Task]): Tasks to execute. -
launcher(ElasticLauncher): Launcher.
Returns:
List[Task]: Tasks executed.
Raises:
ValueError: If the tasks are not valid.
method init_autocast
init_autocast(
autocast_enabled=True,
autocast_dtype=None,
grad_scaler: 'Union[bool, GradScaler]' = None
)
Initialize autocast.
Args:
-
autocast_enabled(bool): If True, enable autocast. -
autocast_dtype(str): Data type to cast the gradients to. -
grad_scaler(GradScaler): Gradient scaler.
Raises:
ValueError: If the autocast_dtype is not valid.
method init_grad_scaler
init_grad_scaler(self, grad_scaler: Union[bool, GradScaler]=False, *, init_scale=2.0 ** 16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True)
Ellipsis
method is_autocast_enabled
Check if autocast is enabled.
Returns:
bool: If True, autocast is enabled.
method is_grad_scaler_enabled
Check if the gradient scaler is enabled.
Returns:
bool: If True, the gradient scaler is enabled.
Raises:
ValueError: If the grad_scaler is not valid.
method load_checkpoint
Load the checkpoint.
Args:
-
resume_from(str): Path to the checkpoint. -
strict(bool): Whether to check the keys. -
tags(str): Tags to load.
Raises:
ValueError: If the resume_from is not valid.
method print_forward_output
print_forward_output(
*nodenames,
every=1,
total=None,
tags: 'List[str]' = '*',
train_only=True,
localrank0_only=True
)
Print forward output.
Args:
-
nodenames(str): Node names. -
every(int): Print every. -
total(int): Total. -
tags(List[str]): Tags. -
train_only(bool): Train only. -
localrank0_only(bool): Local rank 0 only.
Raises:
ValueError: If the nodenames is not valid.
method remove
Remove a node.
Args:
query(str): Query.
Raises:
ValueError: If the query is not valid.
method run
run(self, tasks, devices='auto', run_id: str='none', out_dir: str=None, resume_from: str=None, seed=0)
Ellipsis
method run
run(self, tasks, launcher: ElasticLauncher=None, run_id: str='none', out_dir: str=None, resume_from: str=None, seed=0)
Ellipsis
method run
run(self, tasks, devices='auto', run_id='none', nnodes='1:1', dist_backend='auto', monitor_interval=5, node_rank=0, master_addr='127.0.0.1', master_port=None, redirects='2', tee='1', out_dir=None, resume_from=None, seed=0, role='default', max_restarts=0, omp_num_threads=1, start_method='spawn')
Ellipsis
method run
run(self, tasks, devices='auto', run_id='none', nnodes='1:1', dist_backend='auto', monitor_interval=5, rdzv_endpoint='', rdzv_backend='static', rdzv_configs='', standalone=False, redirects='2', tee='1', out_dir=None, resume_from=None, seed=0, role='default', max_restarts=0, omp_num_threads=1, start_method='spawn')
Ellipsis
method save_checkpoint
Save the checkpoint.
Args:
-
save_to(str): Path to save the checkpoint. -
tags(str): Tags to save.
Returns:
str: Path to the checkpoint.
Raises:
ValueError: If the save_to is not valid.
method select_egraph
Select an executable graph.
Args:
query(str): Query.
Returns:
ExecutableGraph: Executable graph.
Raises:
ValueError: If the query is not valid.
method select_nodes
Select nodes.
Args:
query(str): Query.
Returns:
list: Nodes.
Raises:
ValueError: If the query is not valid.
method set_gradient_accumulate
Set the gradient accumulate steps.
Args:
every(int): Gradient accumulate steps.
Raises:
ValueError: If the every is not valid.