module llutil.launcher.launch_agent
function launch_agent
launch_agent(
config: LaunchConfig,
entrypoint: Union[Callable, str, None],
args: List[Any],
ice_events: Events
) → Dict[int, Any]
class LaunchConfig
Creates a rendezvous config.
Args:
- min_nodes: Minimum number of nodes that the user function will be launched on. The elastic agent ensures that the user function starts only when at least min_nodes nodes have joined the rendezvous.
- max_nodes: Maximum number of nodes that the user function will be launched on.
- nproc_per_node: On each node, the elastic agent launches this number of workers, each of which executes the user-defined function.
- rdzv_backend: The rendezvous backend to use (zeus-adapter, etcd).
- rdzv_endpoint: The endpoint of the rendezvous synchronization storage.
- rdzv_configs: Key-value pairs that specify rendezvous-specific configuration.
- rdzv_timeout: Legacy argument that specifies the timeout for the rendezvous. It will be removed in a future version; see the note below. The default timeout is 900 seconds.
- rdzv_id: The unique run id of the job. If not passed, one is deduced from the run environment (for example, the flow workflow id in flow) or auto-generated. This parameter appears as run_id in the __init__ signature.
- role: User-defined role of the worker (defaults to "trainer").
- max_restarts: The maximum number of worker restarts that the elastic agent performs before failing.
- monitor_interval: The interval, in seconds, at which the elastic agent monitors the workers.
- start_method: The method that the elastic agent uses to start the workers (spawn, fork, forkserver).
- log_dir: Base log directory where log files are written. If not set, one is created in a tmp dir but NOT removed on exit.
- redirects: Configuration to redirect stdout/stderr to log files. Pass a single Std enum to redirect all workers, or a mapping keyed by local_rank to selectively redirect.
- tee: Configuration to "tee" stdout/stderr to console + log file.
- metrics_cfg: Configuration to initialize metrics.
..note:
rdzv_timeout is a legacy argument that will be removed in a future version. Set the timeout via rdzv_configs['timeout'] instead.
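The note above implies some precedence between the legacy rdzv_timeout argument and rdzv_configs['timeout'], but this page does not spell it out. The following is a minimal sketch of one plausible resolution, assuming that -1 (the default in the signature) means "not set", that an explicitly passed legacy value wins, and that 900 seconds is the fallback. The helper resolve_rdzv_timeout is hypothetical and is not part of llutil.

```python
from typing import Any, Dict

DEFAULT_RDZV_TIMEOUT = 900  # seconds; the default stated in the argument list


def resolve_rdzv_timeout(rdzv_timeout: int, rdzv_configs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of rdzv_configs with 'timeout' filled in."""
    resolved = dict(rdzv_configs)  # copy so the caller's dict is left untouched
    if rdzv_timeout != -1:
        # Assumption: an explicitly passed legacy value takes precedence.
        resolved["timeout"] = rdzv_timeout
    elif "timeout" not in resolved:
        # Neither source set a timeout, so fall back to the documented default.
        resolved["timeout"] = DEFAULT_RDZV_TIMEOUT
    return resolved


print(resolve_rdzv_timeout(-1, {}))
print(resolve_rdzv_timeout(-1, {"timeout": 300}))
print(resolve_rdzv_timeout(120, {"timeout": 300}))
```

Under these assumptions, new code would leave rdzv_timeout at -1 and pass the value through rdzv_configs only, which keeps working once the legacy argument is removed.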
function __init__
__init__(
min_nodes: int,
max_nodes: int,
nproc_per_node: int,
run_id: str = '',
role: str = 'default_role',
rdzv_endpoint: str = '',
rdzv_backend: str = 'etcd',
rdzv_configs: Dict[str, Any] = <factory>,
rdzv_timeout: int = -1,
max_restarts: int = 3,
monitor_interval: float = 30,
start_method: str = 'spawn',
log_dir: Optional[str] = None,
redirects: Union[Std, Dict[int, Std]] = <Std.NONE: 0>,
tee: Union[Std, Dict[int, Std]] = <Std.NONE: 0>,
metrics_cfg: Dict[str, str] = <factory>
) → None
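The `<factory>` defaults in the signature suggest that LaunchConfig is a dataclass whose dict fields are created per instance through default_factory. The sketch below illustrates that behaviour with a stand-in class, LaunchConfigSketch, which is hypothetical and mirrors only a subset of the documented fields, so it runs without llutil installed. The endpoint value is illustrative only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LaunchConfigSketch:
    """Stand-in mirroring a subset of the documented fields; not the real class."""
    min_nodes: int
    max_nodes: int
    nproc_per_node: int
    run_id: str = ""
    role: str = "default_role"
    rdzv_endpoint: str = ""
    rdzv_backend: str = "etcd"
    # Shown as <factory> in a rendered signature: a fresh dict per instance.
    rdzv_configs: Dict[str, Any] = field(default_factory=dict)
    max_restarts: int = 3
    monitor_interval: float = 30


# Required fields come first; everything else is passed by keyword.
cfg = LaunchConfigSketch(
    min_nodes=1,
    max_nodes=2,
    nproc_per_node=4,
    rdzv_endpoint="localhost:2379",  # illustrative value
    rdzv_configs={"timeout": 600},   # preferred over the legacy rdzv_timeout
)
other = LaunchConfigSketch(min_nodes=1, max_nodes=1, nproc_per_node=1)

print(cfg.rdzv_configs)
print(other.rdzv_configs)
```

Because each instance gets its own rdzv_configs dict, mutating one config does not leak into another.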