module llutil.launcher.launch_agent


function launch_agent

launch_agent(
    config: LaunchConfig,
    entrypoint: Union[Callable, str, None],
    args: List[Any],
    ice_events: Events
) -> Dict[int, Any]

class LaunchConfig

Creates a rendezvous config.

Args:

  • min_nodes: Minimum number of nodes that the user function will be launched on. The elastic agent ensures that the user function starts only once at least min_nodes nodes have joined the rendezvous.

  • max_nodes: Maximum number of nodes that the user function will be launched on.

  • nproc_per_node: Number of workers the elastic agent launches on each node to execute the user-defined function.

  • rdzv_backend: Rendezvous backend to use (zeus-adapter, etcd).

  • rdzv_endpoint: The endpoint of the rendezvous synchronization storage.

  • rdzv_configs: Key-value pairs that specify rendezvous-specific configuration.

  • rdzv_timeout: Legacy argument that specifies timeout for the rendezvous. It is going to be removed in future versions, see the note below. The default timeout is 900 seconds.

  • run_id: The unique run id of the job. If not passed, one is deduced from the run environment (for example, the flow workflow id in flow) or auto-generated.

  • role: User-defined role of the worker (defaults to "default_role").

  • max_restarts: The maximum number of times the elastic agent restarts workers before failing.

  • monitor_interval: The interval, in seconds, at which the elastic agent monitors workers.

  • start_method: The method used by the elastic agent to start the workers (spawn, fork, forkserver).

  • log_dir: Base log directory where log files are written. If not set, one is created in a temporary directory but NOT removed on exit.

  • redirects: Configuration to redirect stdout/stderr to log files. Pass a single Std enum to redirect all workers, or a mapping keyed by local_rank to redirect selectively.

  • tee: Configuration to "tee" stdout/stderr to both the console and a log file.

  • metrics_cfg: Configuration to initialize metrics.
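The two accepted forms of redirects (a single Std value, or a mapping keyed by local_rank) can be illustrated with a small, self-contained sketch. The Std class below is a stand-in: only NONE = 0 is confirmed by this page, and the OUT, ERR, and ALL members, the resolve_redirects helper, and the fallback to Std.NONE for unlisted ranks are assumptions made for illustration, not the library's actual behavior.

```python
from enum import IntFlag
from typing import Dict, Union


class Std(IntFlag):
    # Stand-in for the library's Std enum. Only NONE = 0 appears in the
    # documented signature; OUT, ERR and ALL are assumed members.
    NONE = 0
    OUT = 1
    ERR = 2
    ALL = 3


def resolve_redirects(
    redirects: Union[Std, Dict[int, Std]], nproc_per_node: int
) -> Dict[int, Std]:
    """Expand a redirects value into an explicit {local_rank: Std} mapping."""
    if isinstance(redirects, Std):
        # A single value applies to every worker on the node.
        return {rank: redirects for rank in range(nproc_per_node)}
    # A mapping redirects only the listed ranks; the rest are assumed to
    # fall back to Std.NONE (no redirection).
    return {rank: redirects.get(rank, Std.NONE) for rank in range(nproc_per_node)}


# Redirect stderr for all workers, or everything for local rank 1 only.
all_workers = resolve_redirects(Std.ERR, nproc_per_node=2)
only_rank_1 = resolve_redirects({1: Std.ALL}, nproc_per_node=3)
```

The same shape would apply to tee, which takes the same type.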

.. note:: rdzv_timeout is a legacy argument that will be removed in a future version. Set the timeout via rdzv_configs['timeout'] instead.
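One plausible way the legacy rdzv_timeout argument and rdzv_configs['timeout'] could be reconciled is sketched below. The helper resolve_rdzv_timeout is hypothetical, and the precedence it implements (an explicit rdzv_timeout wins, otherwise an existing 'timeout' entry is kept, otherwise the documented 900-second default applies) is an assumption based on the -1 default of rdzv_timeout; the library may behave differently.

```python
from typing import Any, Dict

DEFAULT_RDZV_TIMEOUT = 900  # seconds, the default stated in the Args list


def resolve_rdzv_timeout(
    rdzv_timeout: int, rdzv_configs: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of rdzv_configs with a 'timeout' entry filled in."""
    resolved = dict(rdzv_configs)  # do not mutate the caller's dict
    if rdzv_timeout != -1:
        # Assumed: -1 means "not set", any other value is an explicit timeout.
        resolved["timeout"] = rdzv_timeout
    elif "timeout" not in resolved:
        resolved["timeout"] = DEFAULT_RDZV_TIMEOUT
    return resolved


# Preferred style: leave rdzv_timeout at -1 and set the timeout in rdzv_configs.
preferred = resolve_rdzv_timeout(-1, {"timeout": 120})
fallback = resolve_rdzv_timeout(-1, {})
```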

function __init__

__init__(
    min_nodes: int,
    max_nodes: int,
    nproc_per_node: int,
    run_id: str = '',
    role: str = 'default_role',
    rdzv_endpoint: str = '',
    rdzv_backend: str = 'etcd',
    rdzv_configs: Dict[str, Any] = <factory>,
    rdzv_timeout: int = -1,
    max_restarts: int = 3,
    monitor_interval: float = 30,
    start_method: str = 'spawn',
    log_dir: Optional[str] = None,
    redirects: Union[Std, Dict[int, Std]] = <Std.NONE: 0>,
    tee: Union[Std, Dict[int, Std]] = <Std.NONE: 0>,
    metrics_cfg: Dict[str, str] = <factory>
) -> None
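As a rough usage sketch, the keyword arguments for a fixed-size job can be assembled separately and then passed to LaunchConfig and launch_agent. The helpers make_launch_kwargs and launch are hypothetical, as are the endpoint "localhost:2379" and the 600-second timeout. The import path is inferred from the module name at the top of this page, and how to obtain an Events object is not covered here, so it is taken as a parameter.

```python
from typing import Any, Callable, Dict, Sequence, Union


def make_launch_kwargs(
    nnodes: int, nproc_per_node: int, rdzv_endpoint: str
) -> Dict[str, Any]:
    """Build LaunchConfig keyword arguments for a job with a fixed node count."""
    if nnodes < 1 or nproc_per_node < 1:
        raise ValueError("nnodes and nproc_per_node must be positive")
    return {
        "min_nodes": nnodes,  # min == max means no elasticity
        "max_nodes": nnodes,
        "nproc_per_node": nproc_per_node,
        "rdzv_backend": "etcd",
        "rdzv_endpoint": rdzv_endpoint,
        # Set the timeout here rather than via the legacy rdzv_timeout.
        "rdzv_configs": {"timeout": 600},
        "max_restarts": 3,
    }


def launch(
    entrypoint: Union[Callable, str],
    args: Sequence[Any],
    events: Any,
    **launch_kwargs: Any,
) -> Dict[int, Any]:
    """Create a LaunchConfig and hand it to launch_agent."""
    # Imported lazily so the sketch can be read without the package installed.
    from llutil.launcher.launch_agent import LaunchConfig, launch_agent

    config = LaunchConfig(**launch_kwargs)
    return launch_agent(config, entrypoint, list(args), events)


# Two nodes, four workers each, with a placeholder etcd endpoint.
kwargs = make_launch_kwargs(nnodes=2, nproc_per_node=4, rdzv_endpoint="localhost:2379")
```

A call would then look like launch(train_fn, args, events, **kwargs), where train_fn, args, and events are supplied by the caller.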